Stage 7: Find What It Gets Wrong
Keep your Colab notebook tab open all session. Open links in a new tab; don’t use the buttons on this page to leave the course.
prediction analysis cells that show wrong examples and class-level accuracy
how error analysis turns a score into an explanation
three wrong predictions plus a hardest-class finding
Show one wrong prediction and ask: is the model being silly, or is the picture genuinely confusing?
The big idea
One score is not enough. A useful machine-learning engineer asks: Which examples failed? Which classes are hardest? Are the mistakes understandable?
Today you turn mistakes into evidence.
- 1. Photos / CIFAR-10: labeled image examples (Stage 1)
- 2. Notebook variables: x_train, y_train, class_names (Setup-2)
- 3. Prepared data: normalized pixels and fair piles (Stage 3)
- 4. Keras model: CNN layers and summary (Stage 4)
- 5. Training history: epochs, loss, accuracy (Stage 5)
- 6. Test evidence: sealed score and mistakes (Stages 6-7)
- 7. Improved model: augmentation comparison (Stage 8)
- 8. Inference: uploaded image to top-3 guesses (Stage 9)
- 9. Demo evidence: table, confidence, limitation (Stage 10)
Stage 7 stays in test evidence. Wrong predictions and per-class accuracy explain the score instead of hiding behind it.
- error analysis: studying wrong predictions to understand model behavior
- prediction: the class the model chooses
- true label: the correct answer from the dataset
- per-class accuracy: accuracy measured separately for each category
You need the trained baseline model from Stage 5 and test data from Stage 3.
Build it
Step 1 — Predict all test labels
model.predict asks the trained model for class scores without changing it. argmax picks the slot with the largest score, which becomes the predicted label number. NumPy helps us compare all of those predicted label numbers to the true label numbers.
Before running the cell, predict which class from Stage 2 will have the most mistakes.
test_predictions = model.predict(x_test)
predicted_labels = test_predictions.argmax(axis=1)
true_labels = y_test.argmax(axis=1)
print("Predicted labels shape:", predicted_labels.shape)
Each test image now has a predicted class number.
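To see what argmax does before trusting it on thousands of images, here is a minimal sketch on made-up numbers. The arrays toy_scores and toy_one_hot are invented stand-ins for the output of model.predict and for one-hot labels; they are not real model output.

```python
import numpy as np

# Made-up class scores for 3 images and 4 classes (stand-in for model.predict output).
toy_scores = np.array([
    [0.10, 0.70, 0.15, 0.05],
    [0.25, 0.25, 0.40, 0.10],
    [0.60, 0.20, 0.10, 0.10],
])

# Made-up one-hot true labels for the same 3 images.
toy_one_hot = np.array([
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
])

# axis=1 means "look along each row", so we get one label number per image.
toy_predicted = toy_scores.argmax(axis=1)  # slot with the largest score
toy_true = toy_one_hot.argmax(axis=1)      # slot that holds the 1

print("Predicted:", toy_predicted)  # [1 2 0]
print("True:     ", toy_true)       # [1 3 0]
```

The second image is the only disagreement: the model's largest score sits in slot 2, but the true label is slot 3.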
Step 2 — Find wrong predictions
import numpy as np  # needed for np.where; harmless if already imported earlier in the notebook
wrong_indexes = np.where(predicted_labels != true_labels)[0]
print("Wrong predictions:", len(wrong_indexes))
print("First three wrong indexes:", wrong_indexes[:3])
Wrong predictions are not failures to hide. They are the best learning material.
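If the [0] after np.where looks mysterious, this small sketch on made-up labels may help. The arrays toy_predicted and toy_true are invented for illustration.

```python
import numpy as np

# Made-up label numbers for 5 images.
toy_predicted = np.array([1, 2, 0, 3, 3])
toy_true = np.array([1, 3, 0, 3, 2])

# Comparing two arrays gives one True/False per image.
mismatch = toy_predicted != toy_true   # [False  True False False  True]

# np.where returns a tuple with one array per dimension; [0] takes the index array.
toy_wrong_indexes = np.where(mismatch)[0]

print("Wrong indexes:", toy_wrong_indexes)  # [1 4]
print("Error rate:", mismatch.mean())       # 0.4, because True counts as 1
```

Two of the five toy predictions are wrong, so the error rate is 0.4 and the accuracy is 0.6.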
Step 3 — Display three wrong examples
fig, axes = plt.subplots(1, 3, figsize=(10, 3))
for spot, image_index in enumerate(wrong_indexes[:3]):
    axes[spot].imshow(x_test[image_index])
    actual = class_names[true_labels[image_index]]
    predicted = class_names[predicted_labels[image_index]]
    confidence = test_predictions[image_index][predicted_labels[image_index]]
    axes[spot].set_title(f"Pred: {predicted}\nTrue: {actual}\n{confidence:.0%}")
    axes[spot].axis('off')
plt.tight_layout()
plt.show()
Look closely. Were the mistakes reasonable?
Step 4 — Calculate per-class accuracy
for class_index, class_name in enumerate(class_names):
    class_mask = true_labels == class_index
    class_accuracy = (predicted_labels[class_mask] == true_labels[class_mask]).mean()
    print(f"{class_name}: {class_accuracy:.3f}")
The hardest class is the one with the lowest accuracy.
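Instead of scanning the printout by eye, you could let NumPy pick the lowest accuracy. This is a minimal sketch on made-up data; toy_class_names, toy_true, and toy_predicted are invented, and your real notebook would use class_names, true_labels, and predicted_labels.

```python
import numpy as np

# Made-up data: 3 classes, 10 images.
toy_class_names = ["cat", "dog", "ship"]
toy_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2])
toy_predicted = np.array([0, 1, 1, 0, 1, 1, 0, 1, 2, 2])

per_class = []
for class_index in range(len(toy_class_names)):
    mask = toy_true == class_index                          # images that truly belong to this class
    per_class.append((toy_predicted[mask] == class_index).mean())
per_class = np.array(per_class)

hardest = int(per_class.argmin())  # position of the lowest accuracy

for name, accuracy in zip(toy_class_names, per_class):
    print(f"{name}: {accuracy:.2f}")
print("Hardest class:", toy_class_names[hardest])  # cat
```

In the toy data, cat scores 0.50, dog 0.75, and ship 1.00, so cat is the hardest class.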
Step 5 — Write an error report
Error report:
- Three wrong predictions I inspected: __________.
- The hardest class was __________.
- One reason this class may be hard is __________.
- This matches / does not match my Stage 2 prediction because __________.
Understand it
Accuracy tells you how often the model is right. Error analysis tells you what kind of model you built.
If cats and dogs fail often, the model may struggle with soft shapes and similar textures. If trucks and automobiles fail, the model may need more detail or larger images. Explaining failure is part of understanding the system.
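One way to check whether two classes get mixed up with each other is to count (true, predicted) pairs among the mistakes. The sketch below uses made-up labels and class names; it is one possible approach, not the course's required method.

```python
from collections import Counter

import numpy as np

# Made-up data: 4 classes, 8 images.
toy_class_names = ["cat", "dog", "truck", "automobile"]
toy_true = np.array([0, 0, 1, 1, 2, 3, 0, 2])
toy_predicted = np.array([1, 1, 1, 0, 3, 3, 1, 2])

# Keep only the mistakes, then count each (true, predicted) pair.
wrong = toy_true != toy_predicted
pairs = Counter(zip(toy_true[wrong].tolist(), toy_predicted[wrong].tolist()))

# most_common() lists the pairs from most frequent to least frequent.
for (true_index, predicted_index), count in pairs.most_common():
    print(f"true {toy_class_names[true_index]} -> predicted "
          f"{toy_class_names[predicted_index]}: {count}")
```

In the toy data, "true cat, predicted dog" happens three times, more than any other pair. A result like that in your own notebook would be evidence for the cats-and-dogs explanation above.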
Try this
Three short experiments. Predict before you run, then test your guess.
Before running per-class accuracy, predict the hardest class from your Stage 2 report. Was your prediction right?
Compare a confident wrong prediction with a low-confidence wrong prediction. Which is more dangerous in a real product?
How could data augmentation help with the mistakes you saw today?
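For the second experiment, you need a way to find a confident wrong prediction and a low-confidence one. Here is a minimal sketch on made-up scores; toy_scores and toy_true are invented, and in your notebook the matching names would be test_predictions and true_labels.

```python
import numpy as np

# Made-up class scores for 4 images and 3 classes.
toy_scores = np.array([
    [0.95, 0.03, 0.02],
    [0.40, 0.35, 0.25],
    [0.10, 0.85, 0.05],
    [0.34, 0.33, 0.33],
])
toy_true = np.array([1, 0, 1, 2])

toy_predicted = toy_scores.argmax(axis=1)   # chosen class per image
toy_confidence = toy_scores.max(axis=1)     # score of the chosen class

toy_wrong = np.where(toy_predicted != toy_true)[0]

# Sort the wrong images from least confident to most confident.
order = toy_wrong[np.argsort(toy_confidence[toy_wrong])]
least_confident_wrong = int(order[0])
most_confident_wrong = int(order[-1])

print("Least confident wrong image:", least_confident_wrong,
      f"({toy_confidence[least_confident_wrong]:.0%})")
print("Most confident wrong image:", most_confident_wrong,
      f"({toy_confidence[most_confident_wrong]:.0%})")
```

In the toy data, image 3 is wrong at 34% confidence and image 0 is wrong at 95% confidence. Display both kinds from your real test set and compare them.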
Test your stage
- You displayed three wrong predictions.
- You calculated per-class accuracy.
- You named the hardest class.
- Workflow check. Point to this stage on the workflow map and explain how mistakes improve the final demo.
- Evidence check. Your report includes one correct or understandable prediction and one wrong prediction.
- Design check. Your error report explains at least one mistake using visual evidence.
If it breaks
- np is not defined. Run import numpy as np.
- y_test.argmax fails. Make sure labels are one-hot from Stage 3.
- Titles overlap. Use fewer images or increase figsize.
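If you are not sure whether your labels are one-hot, a small check can help. The helper below, to_label_numbers, is a hypothetical name invented for this sketch and is not part of the course notebook. It assumes labels are either one-hot rows or plain integer label numbers.

```python
import numpy as np

def to_label_numbers(labels):
    """Hypothetical helper: return 1-D label numbers from one-hot or integer labels."""
    labels = np.asarray(labels)
    if labels.ndim == 2 and labels.shape[1] > 1:
        # One-hot rows: the slot holding the 1 is the label number.
        return labels.argmax(axis=1)
    # Already label numbers, shaped (N,) or (N, 1): just flatten them.
    return labels.reshape(-1).astype(int)

one_hot = np.array([[0, 1, 0], [1, 0, 0]])
column = np.array([[2], [0]])
flat = np.array([2, 0])

print(to_label_numbers(one_hot))  # [1 0]
print(to_label_numbers(column))   # [2 0]
print(to_label_numbers(flat))     # [2 0]
```

One thing to watch for: calling argmax(axis=1) on labels shaped (N, 1) does not raise an error. It quietly returns 0 for every image, which would make the error analysis misleading. Printing y_test.shape is a quick way to check.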
This stage is a major course upgrade. It makes model quality concrete and gives students language for honest final demos.