Stage 7: Find What It Gets Wrong

Course progress: Stage 7 of 10
~90 min
Your workspace

Keep your Colab notebook tab open all session. Open it in a new tab, and don't use the buttons on this page to leave the course.

Build

prediction analysis cells that show wrong examples and class-level accuracy

Learn

how error analysis turns a score into an explanation

Ship

three wrong predictions plus a hardest-class finding

Teacher demo

Show one wrong prediction and ask: is the model being silly, or is the picture genuinely confusing?

The big idea

One score is not enough. A useful machine-learning engineer asks: Which examples failed? Which classes are hardest? Are the mistakes understandable?

Today you turn mistakes into evidence.

How the Python ML workflow connects
  1. Photos / CIFAR-10: labeled image examples (Stage 1)
  2. Notebook variables: x_train, y_train, class_names (Setup-2)
  3. Prepared data: normalized pixels and fair piles (Stage 3)
  4. Keras model: CNN layers and summary (Stage 4)
  5. Training history: epochs, loss, accuracy (Stage 5)
  6. Test evidence: sealed score and mistakes (Stages 6-7)
  7. Improved model: augmentation comparison (Stage 8)
  8. Inference: uploaded image to top-3 guesses (Stage 9)
  9. Demo evidence: table, confidence, limitation (Stage 10)

Stage 7 stays in test evidence. Wrong predictions and per-class accuracy explain the score instead of hiding behind it.

New words

error analysis: studying wrong predictions to understand model behavior
prediction: the class the model chooses
true label: the correct answer from the dataset
per-class accuracy: accuracy measured separately for each category

Before you start

You need the trained baseline model from Stage 5 and test data from Stage 3.

Build it

Step 1 — Predict all test labels

model.predict asks the trained model for class scores without changing it. argmax picks the slot with the largest score, which becomes the predicted label number. NumPy helps us compare all of those predicted label numbers to the true label numbers.

Before running the cell, predict which class from Stage 2 will have the most mistakes.

test_predictions = model.predict(x_test)
predicted_labels = test_predictions.argmax(axis=1)
true_labels = y_test.argmax(axis=1)

print("Predicted labels shape:", predicted_labels.shape)

Each test image now has a predicted class number.
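
To see what argmax is doing in Step 1, here is a small sketch with made-up scores. The toy_ names and the numbers are invented for illustration; they are not output from your model.

```python
import numpy as np

# Made-up scores for 3 images and 4 classes (not real model output).
toy_scores = np.array([
    [0.10, 0.70, 0.15, 0.05],
    [0.25, 0.25, 0.40, 0.10],
    [0.05, 0.05, 0.10, 0.80],
])

# axis=1 looks across each row, so every image gets one winning slot.
toy_predicted = toy_scores.argmax(axis=1)

# The largest score in each row is how strongly the model backed that guess.
toy_confidence = toy_scores.max(axis=1)

print("Predicted label numbers:", toy_predicted)
print("Score of each winning slot:", toy_confidence)
```

Each row is one image, and the winning slot number is that image's predicted label. Your real test_predictions array works the same way, just with one row per test image.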

Step 2 — Find wrong predictions

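import numpy as np

# Positions of the test images where the prediction differs from the true label.
wrong_indexes = np.where(predicted_labels != true_labels)[0]
print("Wrong predictions:", len(wrong_indexes))
print("First three wrong indexes:", wrong_indexes[:3])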

Wrong predictions are not failures to hide. They are the best learning material.
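
If np.where feels mysterious, this tiny sketch shows the same comparison on five made-up labels. The toy_ arrays are invented for illustration only.

```python
import numpy as np

# Five made-up predictions and their made-up true labels.
toy_predicted = np.array([3, 5, 5, 0, 2])
toy_true = np.array([3, 3, 5, 0, 1])

# != compares the arrays slot by slot and gives True where they disagree.
toy_mismatch = toy_predicted != toy_true

# np.where returns a tuple; [0] pulls out the array of positions that are True.
toy_wrong_indexes = np.where(toy_mismatch)[0]

print("Mismatch mask:", toy_mismatch)
print("Wrong positions:", toy_wrong_indexes)
```

The positions that come back are the same kind of numbers stored in wrong_indexes, so you can use them to look up the image, the prediction, and the true label.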

Step 3 — Display three wrong examples

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(10, 3))

for spot, image_index in enumerate(wrong_indexes[:3]):
    axes[spot].imshow(x_test[image_index])
    actual = class_names[true_labels[image_index]]
    predicted = class_names[predicted_labels[image_index]]
    # Score the model gave to the class it chose.
    confidence = test_predictions[image_index][predicted_labels[image_index]]
    axes[spot].set_title(f"Pred: {predicted}\nTrue: {actual}\n{confidence:.0%}")
    axes[spot].axis('off')

plt.tight_layout()
plt.show()

Look closely. Were the mistakes reasonable?

Step 4 — Calculate per-class accuracy

for class_index, class_name in enumerate(class_names):
    # Keep only the test images whose true label is this class.
    class_mask = true_labels == class_index
    class_accuracy = (predicted_labels[class_mask] == true_labels[class_mask]).mean()
    print(f"{class_name}: {class_accuracy:.3f}")

The hardest class is the one with the lowest accuracy.
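
Instead of scanning the printed list by eye, you could store each accuracy and let Python pick the lowest one. This is a minimal sketch on made-up labels; toy_class_names, toy_true, toy_predicted, and toy_accuracy are hypothetical names, not part of your notebook.

```python
import numpy as np

# Made-up data: 3 classes and 10 test images.
toy_class_names = ["cat", "dog", "ship"]
toy_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2])
toy_predicted = np.array([0, 1, 1, 0, 1, 1, 1, 0, 2, 2])

toy_accuracy = {}
for class_index, class_name in enumerate(toy_class_names):
    # Keep only the images whose true label is this class.
    class_mask = toy_true == class_index
    correct = toy_predicted[class_mask] == class_index
    toy_accuracy[class_name] = float(correct.mean())

# The hardest class is the one with the lowest accuracy.
hardest = min(toy_accuracy, key=toy_accuracy.get)

print(toy_accuracy)
print("Hardest class:", hardest)
```

In your notebook the same idea would use class_names, true_labels, and predicted_labels in place of the toy arrays.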

Step 5 — Write an error report

Error report:
- Three wrong predictions I inspected: __________.
- The hardest class was __________.
- One reason this class may be hard is __________.
- This matches / does not match my Stage 2 prediction because __________.

Understand it

Accuracy tells you how often the model is right. Error analysis tells you what kind of model you built.

If cats and dogs fail often, the model may struggle with soft shapes and similar textures. If trucks and automobiles fail, the model may need more detail or larger images. Explaining failure is part of understanding the system.
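
To check a hunch like "cats get mistaken for dogs", you could count how often each true class is predicted as each other class. This counting table is usually called a confusion matrix. It is an extra idea, not a required step in this stage, and the toy data and variable names below are invented for illustration.

```python
import numpy as np

# Made-up labels for 3 classes.
toy_class_names = ["cat", "dog", "truck"]
toy_true = np.array([0, 0, 0, 1, 1, 1, 2, 2])
toy_predicted = np.array([1, 1, 0, 1, 0, 1, 2, 2])

# Rows are true classes, columns are predicted classes.
num_classes = len(toy_class_names)
confusion = np.zeros((num_classes, num_classes), dtype=int)
np.add.at(confusion, (toy_true, toy_predicted), 1)

# Zero out the diagonal so only the mistakes remain.
mistakes = confusion.copy()
np.fill_diagonal(mistakes, 0)

# Find the (true, predicted) pair that was mixed up most often.
true_index, predicted_index = np.unravel_index(mistakes.argmax(), mistakes.shape)

print(confusion)
print(f"Most common mix-up: {toy_class_names[true_index]} "
      f"predicted as {toy_class_names[predicted_index]}")
```

A large number off the diagonal points to a pair of classes worth inspecting with the image display from Step 3.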

Try this

Three short experiments. Predict before you run, then test your guess.

Predict first

Before running per-class accuracy, predict the hardest class from your Stage 2 report. Was your prediction right?

Compare

Compare a confident wrong prediction with a low-confidence wrong prediction. Which is more dangerous in a real product?

Connect

How could data augmentation help with the mistakes you saw today?
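
For the Compare experiment, one possible way to find a confident wrong prediction and a low-confidence wrong prediction is to sort the wrong examples by the score of the chosen class. The sketch below uses made-up scores; every toy_ name is hypothetical.

```python
import numpy as np

# Made-up scores for 4 images and 3 classes, plus made-up true labels.
toy_scores = np.array([
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.10, 0.85, 0.05],
    [0.30, 0.30, 0.40],
])
toy_true = np.array([0, 1, 2, 2])

toy_predicted = toy_scores.argmax(axis=1)
toy_wrong = np.where(toy_predicted != toy_true)[0]

# Score the model gave to its (wrong) chosen class for each wrong image.
toy_wrong_confidence = toy_scores[toy_wrong, toy_predicted[toy_wrong]]

# argsort orders from lowest to highest confidence.
order = np.argsort(toy_wrong_confidence)
least_confident_wrong = toy_wrong[order[0]]
most_confident_wrong = toy_wrong[order[-1]]

print("Least confident wrong image:", least_confident_wrong)
print("Most confident wrong image:", most_confident_wrong)
```

With your own data, test_predictions, true_labels, and predicted_labels would take the place of the toy arrays, and the two indexes could be passed to imshow to view both images side by side.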

Test your stage

  • You displayed three wrong predictions.
  • You calculated per-class accuracy.
  • You named the hardest class.
  • Workflow check. Point to this stage on the workflow map and explain how mistakes improve the final demo.
  • Evidence check. Your report includes one correct or understandable prediction and one wrong prediction.
  • Design check. Your error report explains at least one mistake using visual evidence.

If it breaks

  • np is not defined. Run import numpy as np.
  • y_test.argmax fails. Make sure labels are one-hot from Stage 3.
  • Titles overlap. Use fewer images or increase figsize.
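
If you are not sure whether your labels are one-hot, checking the shape first can help. Labels shaped (N, 1) will not raise an error with argmax(axis=1); they quietly return all zeros, which makes every accuracy number misleading. The helper below is a hypothetical sketch, not part of the course notebook, and it assumes labels are either one-hot rows or plain class numbers.

```python
import numpy as np

def to_label_numbers(labels):
    """Return a flat array of class numbers (hypothetical helper)."""
    labels = np.asarray(labels)
    # One-hot labels have one column per class, so pick the winning column.
    if labels.ndim == 2 and labels.shape[1] > 1:
        return labels.argmax(axis=1)
    # Otherwise assume the values are already class numbers and flatten them.
    return labels.reshape(-1)

one_hot_labels = np.array([[0, 1, 0], [1, 0, 0]])
column_labels = np.array([[2], [0]])

print(to_label_numbers(one_hot_labels))
print(to_label_numbers(column_labels))
```

Printing y_test.shape before Step 1 tells you which case you are in.
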

Coach notes

This stage is a major course upgrade. It makes model quality concrete and gives students language for honest final demos.