Stage 6: Grade It Honestly
Keep your Colab notebook tab open all session. Open it in a new tab; don't use the buttons on this page to leave the course.
a sealed test-set evaluation and a train-validation-test comparison
why the honest score is different from training accuracy
a baseline test score and an overfitting gap calculation
Put training accuracy, validation accuracy, and test accuracy on the board as three different claims. Ask which one belongs on a poster and why.
The big idea
The model practiced on training data and checked validation data during training. The test set is different: it stayed sealed. Today you open it once for the honest baseline score.
If training accuracy is much higher than test accuracy, the model learned some real patterns but also memorized some training details. That gap is the sign of overfitting.
- 1. Photos / CIFAR-10: labeled image examples (Stage 1)
- 2. Notebook variables: x_train, y_train, class_names (Setup-2)
- 3. Prepared data: normalized pixels and fair piles (Stage 3)
- 4. Keras model: CNN layers and summary (Stage 4)
- 5. Training history: epochs, loss, accuracy (Stage 5)
- 6. Test evidence: sealed score and mistakes (Stages 6-7)
- 7. Improved model: augmentation comparison (Stage 8)
- 8. Inference: uploaded image to top-3 guesses (Stage 9)
- 9. Demo evidence: table, confidence, limitation (Stage 10)
Stage 6 moves from training history to test evidence. The sealed test score is the honest grade you can report later.
- evaluate: grade a trained model without changing it
- test accuracy: the model's score on the sealed test set
- generalize: work on new examples, not only practiced ones
- overfitting: memorizing training examples instead of learning the general idea
You need the trained baseline model and baseline_history from Stage 5.
Build it
Step 1 — Predict the honest score
Test-score prediction:
- My final validation accuracy was _____.
- I predict test accuracy will be _____ because __________.
Step 2 — Evaluate the sealed test set
model.evaluate grades the model on data without changing the model. That is why this cell is different from model.fit: fit practices, evaluate grades.
# evaluate returns the loss first, then the accuracy metric the model was compiled with.
baseline_test_loss, baseline_test_accuracy = model.evaluate(x_test, y_test)
print(f"Baseline test loss: {baseline_test_loss:.3f}")
print(f"Baseline test accuracy: {baseline_test_accuracy:.3f}")
No learning happens here. This is a grade, not practice.
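To make "a grade, not practice" concrete, here is a minimal sketch of what an accuracy score boils down to: the fraction of examples where the predicted label matches the true label. The `accuracy` function and the tiny label lists are hypothetical, written by hand for illustration; they are not the Keras implementation and not results from your model.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of examples where the prediction matches the label."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


# Hypothetical class numbers for five images (made up for this sketch).
true_labels = [3, 5, 1, 0, 8]
predicted_labels = [3, 5, 7, 0, 2]

score = accuracy(true_labels, predicted_labels)
print(f"Accuracy: {score:.3f}")  # 3 of 5 correct -> 0.600
```

Notice that nothing in this calculation adjusts the model. It only counts right and wrong answers, which is why grading can be repeated without changing the score.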
Step 3 — Compare all three scores
# [-1] picks the value from the final training epoch.
baseline_train_accuracy = baseline_history.history['accuracy'][-1]
baseline_val_accuracy = baseline_history.history['val_accuracy'][-1]
print(f"Training accuracy: {baseline_train_accuracy:.3f}")
print(f"Validation accuracy: {baseline_val_accuracy:.3f}")
print(f"Test accuracy: {baseline_test_accuracy:.3f}")
# A large positive gap means the model scores better on practiced data than on sealed data.
print(f"Train-test gap: {baseline_train_accuracy - baseline_test_accuracy:.3f}")
The gap is your overfitting signal.
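As a worked example of the gap arithmetic, the sketch below subtracts test accuracy from training accuracy for one pair of scores. The numbers 0.82 and 0.68 are hypothetical, chosen only to show the calculation; your own values will differ. The helper name `train_test_gap` is also made up for this example.

```python
def train_test_gap(train_accuracy, test_accuracy):
    """Training accuracy minus test accuracy, as a fraction between -1 and 1."""
    return round(train_accuracy - test_accuracy, 3)


# Hypothetical scores, not results from this course's model.
example_gap = train_test_gap(0.82, 0.68)
print(f"Train-test gap: {example_gap:.3f}")  # 0.140
```

A gap near zero means the two scores agree. A negative gap is possible too; it means the model scored slightly better on the test set than on the training set.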
Step 4 — Write the honest claim
Honest model claim:
My baseline model gets about _____% test accuracy.
It does _____ points better on training data than test data.
That means __________.
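The claim uses percentages and points, while the notebook prints fractions such as 0.68. The sketch below shows one way to convert before filling in the blanks, again with hypothetical scores. The function name `honest_claim` is an assumption made for this example, and the last sentence of the claim is left for you to write.

```python
def honest_claim(train_accuracy, test_accuracy):
    """Fill the first two sentences of the claim from accuracy fractions."""
    test_percent = test_accuracy * 100                   # 0.68 -> 68
    gap_points = (train_accuracy - test_accuracy) * 100  # 0.14 -> 14 points
    return (
        f"My baseline model gets about {test_percent:.0f}% test accuracy.\n"
        f"It does {gap_points:.0f} points better on training data than test data."
    )


# Hypothetical scores, not results from this course's model.
claim = honest_claim(0.82, 0.68)
print(claim)
```

In your notebook you could pass baseline_train_accuracy and baseline_test_accuracy instead of the made-up numbers.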
Understand it
Training accuracy is not a lie, but it is not the final truth. It measures data the model practiced on. Test accuracy measures data that stayed untouched until grading.
Validation accuracy helps us while building. Test accuracy is the number we report when we want to be honest about performance.
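One way to see why the test set has to stay sealed is a toy simulation. In the sketch below, every "model" is only a random guesser, so none of them has learned anything. If we try 200 of them and keep whichever scores best on a small test set, that best score looks better than chance, but the same guesser falls back toward chance on fresh examples. Everything here (the `score` function, the seeds, the set sizes) is hypothetical and stands in for real Keras models.

```python
import random

NUM_CLASSES = 10


def score(model_seed, labels, offset):
    """Fraction correct for a 'model' that only guesses at random."""
    rng = random.Random(model_seed * 1_000_003 + offset)
    hits = sum(rng.randrange(NUM_CLASSES) == y for y in labels)
    return hits / len(labels)


# Made-up labels: a small "test set" and a large pile of fresh examples.
data_rng = random.Random(0)
test_labels = [data_rng.randrange(NUM_CLASSES) for _ in range(100)]
fresh_labels = [data_rng.randrange(NUM_CLASSES) for _ in range(10_000)]

# Peek at the test score for 200 guessers and keep the best one.
test_scores = {seed: score(seed, test_labels, offset=0) for seed in range(200)}
best_seed = max(test_scores, key=test_scores.get)
best_test_score = test_scores[best_seed]

# Grade the chosen guesser on examples it was not selected on.
fresh_score = score(best_seed, fresh_labels, offset=1)

print(f"Best score after peeking at the test set: {best_test_score:.3f}")
print(f"Same guesser on fresh examples: {fresh_score:.3f}")
```

The picked score is inflated because it was chosen for being lucky on that exact test set. Opening the test set once, after the model is finished, avoids this.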
Try this
Three short experiments. Predict before you run, then test your guess.
If a model had 99% training accuracy and 55% test accuracy, would you trust it more or less than yours? Explain before discussing.
Compare validation accuracy and test accuracy. Are they close? What would it mean if test was much lower?
Why is it cheating to keep changing your model until the test score goes up?
Test your stage
- You printed baseline test accuracy.
- You compared training, validation, and test accuracy.
- You calculated the train-test gap.
- Workflow check. Point to this stage on the workflow map and explain why test evidence comes after training.
- Evidence check. Your honest claim includes training, validation, and test accuracy.
- Design check. Explain from memory why changing the model after watching test accuracy would be cheating.
If it breaks
- Test accuracy is near 0.10. That is what random guessing scores across 10 classes, so the model probably has not been trained in this session.
- baseline_history is missing. Re-run Stage 5 training.
- The gap is tiny. That is good; it means the model generalized well.
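The first two checks above can be sketched as small helpers. This is a minimal sketch, not part of the course notebook: the names `near_chance` and `missing_names` and the 0.03 tolerance are assumptions chosen for illustration.

```python
def near_chance(test_accuracy, num_classes=10, tolerance=0.03):
    """True if accuracy is about what random guessing would score."""
    chance = 1 / num_classes  # 0.10 for the 10 CIFAR-10 classes
    return abs(test_accuracy - chance) <= tolerance


def missing_names(required, namespace):
    """Return the required variable names that are not defined yet."""
    return [name for name in required if name not in namespace]


print(near_chance(0.10))  # True: looks untrained
print(near_chance(0.68))  # False: clearly above guessing

# A stand-in namespace for the sketch; in a notebook you could pass globals().
print(missing_names(["model", "baseline_history"], {"model": object()}))
# ['baseline_history']
```

If `missing_names` reports anything, re-run the Stage 5 cells before evaluating.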
The code is short; the learning is the distinction between practiced, checked, and sealed data. Keep the conversation focused there.