Stage 6: Grade It Honestly

Course progress: Stage 6 of 10
~75 min
Your workspace

Keep your Colab notebook tab open for the whole session. Open links in a new tab; don't use the buttons on this page to leave the course.

Build

a sealed test-set evaluation and a train-validation-test comparison

Learn

why the honest score is different from training accuracy

Ship

a baseline test score and an overfitting gap calculation

Teacher demo

Put training accuracy, validation accuracy, and test accuracy on the board as three different claims. Ask which one belongs on a poster and why.

The big idea

The model practiced on the training data and was checked against the validation data during training. The test set is different: it stayed sealed. Today you open it once for the honest baseline score.

If training accuracy is much higher than test accuracy, the model learned some real patterns but also memorized details that only appear in the training data. That memorizing is called overfitting, and the gap between the two scores is how you measure it.
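
To make the idea of memorizing concrete, here is a toy sketch in plain Python. It is not the course CNN: the tiny "images", the labels, and the helper names are all made up for illustration. It shows a model that only memorizes, so it scores perfectly on practiced examples and no better than a fixed guess on new ones.

```python
# Made-up toy data: each "image" is a tuple of numbers with a label.
train_examples = [((0, 1), "cat"), ((1, 1), "dog"), ((2, 0), "cat"), ((3, 1), "dog")]
test_examples = [((4, 0), "cat"), ((5, 1), "dog"), ((6, 0), "cat"), ((7, 1), "dog")]

# "Training" a pure memorizer just means storing every answer it was shown.
memory = {features: label for features, label in train_examples}

def memorizer_predict(features):
    # Seen before: recall the stored answer. Never seen: always guess "cat".
    return memory.get(features, "cat")

def accuracy(predict, examples):
    # Fraction of examples where the prediction matches the true label.
    correct = sum(1 for features, label in examples if predict(features) == label)
    return correct / len(examples)

train_accuracy = accuracy(memorizer_predict, train_examples)
test_accuracy = accuracy(memorizer_predict, test_examples)

print(f"Training accuracy: {train_accuracy:.2f}")
print(f"Test accuracy: {test_accuracy:.2f}")
print(f"Train-test gap: {train_accuracy - test_accuracy:.2f}")
```

A real CNN sits somewhere between this memorizer and a model with no gap at all, which is why the size of the gap matters.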

How the Python ML workflow connects
  1. Photos / CIFAR-10: labeled image examples (Stage 1)
  2. Notebook variables: x_train, y_train, class_names (Setup-2)
  3. Prepared data: normalized pixels and fair piles (Stage 3)
  4. Keras model: CNN layers and summary (Stage 4)
  5. Training history: epochs, loss, accuracy (Stage 5)
  6. Test evidence: sealed score and mistakes (Stages 6-7)
  7. Improved model: augmentation comparison (Stage 8)
  8. Inference: uploaded image to top-3 guesses (Stage 9)
  9. Demo evidence: table, confidence, limitation (Stage 10)

Stage 6 moves from training history to test evidence. The sealed test score is the honest grade you can report later.

New words

  • evaluate: grade a trained model without changing it
  • test accuracy: the model's score on the sealed test set
  • generalize: work on new examples, not only practiced ones
  • overfitting: memorizing training examples instead of learning the general idea

Before you start

You need the trained baseline model and baseline_history from Stage 5.

Build it

Step 1 — Predict the honest score

Test-score prediction:
- My final validation accuracy was _____.
- I predict test accuracy will be _____ because __________.

Step 2 — Evaluate the sealed test set

model.evaluate runs the model on the data you give it and reports the loss and metrics without updating any weights. That is why this cell is different from model.fit: fit practices, evaluate grades.

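# evaluate returns [loss, accuracy] here, assuming the model was compiled with metrics=['accuracy'] in an earlier stage.
baseline_test_loss, baseline_test_accuracy = model.evaluate(x_test, y_test)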

print(f"Baseline test loss: {baseline_test_loss:.3f}")
print(f"Baseline test accuracy: {baseline_test_accuracy:.3f}")

No learning happens here. This is a grade, not practice.

Step 3 — Compare all three scores

baseline_train_accuracy = baseline_history.history['accuracy'][-1]
baseline_val_accuracy = baseline_history.history['val_accuracy'][-1]

print(f"Training accuracy: {baseline_train_accuracy:.3f}")
print(f"Validation accuracy: {baseline_val_accuracy:.3f}")
print(f"Test accuracy: {baseline_test_accuracy:.3f}")
print(f"Train-test gap: {baseline_train_accuracy - baseline_test_accuracy:.3f}")

The gap is your overfitting signal: the larger it is, the more the model leans on details that only appear in the training data.
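
If the `[-1]` in the cell above looks mysterious, this small sketch may help. It uses a made-up dictionary shaped like `baseline_history.history`, with one number per epoch. Every number here, and the names `example_history` and `example_test_accuracy`, are invented for illustration; your notebook will show different values.

```python
# Made-up numbers standing in for baseline_history.history after 3 epochs.
example_history = {
    "accuracy": [0.41, 0.58, 0.67],      # training accuracy, one entry per epoch
    "val_accuracy": [0.45, 0.56, 0.61],  # validation accuracy, one entry per epoch
}

# Index -1 means "the last item in the list", so these are the final epoch's scores.
final_train_accuracy = example_history["accuracy"][-1]
final_val_accuracy = example_history["val_accuracy"][-1]

# Hypothetical test score, standing in for the value model.evaluate returns.
example_test_accuracy = 0.60

train_test_gap = final_train_accuracy - example_test_accuracy

print(f"Final training accuracy: {final_train_accuracy:.3f}")
print(f"Final validation accuracy: {final_val_accuracy:.3f}")
print(f"Test accuracy: {example_test_accuracy:.3f}")
print(f"Train-test gap: {train_test_gap:.3f}")
```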

Step 4 — Write the honest claim

Honest model claim:
My baseline model gets about _____% test accuracy.
It does _____ points better on training data than test data.
That means __________.
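
The claim uses percents and points, while Keras reports fractions between 0 and 1. One possible way to convert them is sketched below. The function `honest_claim` and the example numbers are hypothetical, not part of the course notebook.

```python
def honest_claim(train_accuracy, test_accuracy):
    """Turn two accuracies (fractions from 0 to 1) into the first two claim sentences."""
    test_percent = test_accuracy * 100
    # "Points" means percentage points: 82% minus 64% is 18 points.
    gap_points = (train_accuracy - test_accuracy) * 100
    return (
        f"My baseline model gets about {test_percent:.0f}% test accuracy.\n"
        f"It does {gap_points:.0f} points better on training data than test data."
    )

# Made-up example values; replace them with your own printed scores.
claim = honest_claim(train_accuracy=0.82, test_accuracy=0.64)
print(claim)
```

In your notebook you could pass `baseline_train_accuracy` and `baseline_test_accuracy` instead of the made-up values. The last sentence of the claim, what the gap means, is still yours to write.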

Understand it

Training accuracy is not a lie, but it is not the final truth. It measures data the model practiced on. Test accuracy measures data that stayed untouched until grading.

Validation accuracy helps us while building. Test accuracy is the number we report when we want to be honest about performance.
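
One way to see why the test set has to stay sealed is a small simulation. The sketch below is an illustration under made-up assumptions, not the course model: it imagines 50 "model versions" that all guess at random among 10 classes, peeks at the test score of each, and keeps the best one. All names and sizes are hypothetical.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
NUM_CLASSES = 10
NUM_TEST_IMAGES = 200
NUM_MODEL_VERSIONS = 50

def random_guess_accuracy(num_images):
    """Accuracy of a model whose every guess is right with probability 1/NUM_CLASSES."""
    correct = sum(1 for _ in range(num_images) if rng.randrange(NUM_CLASSES) == 0)
    return correct / num_images

# Every version is equally useless, but we peek at the test score each time.
peeked_scores = [random_guess_accuracy(NUM_TEST_IMAGES) for _ in range(NUM_MODEL_VERSIONS)]
best_peeked_score = max(peeked_scores)
average_score = sum(peeked_scores) / len(peeked_scores)

# Grade the "winning" version on images nobody has looked at yet.
fresh_score = random_guess_accuracy(NUM_TEST_IMAGES)

print(f"Average score across versions: {average_score:.3f}")
print(f"Best score after peeking: {best_peeked_score:.3f}")
print(f"Same kind of model on fresh images: {fresh_score:.3f}")
```

The best peeked score will typically sit above the 0.10 chance level even though no version learned anything. Picking a model by its test score makes that score look better than the model really is, which is why the test set is opened once.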

Try this

Three short experiments. Predict before you run, then test your guess.

Predict first

If a model had 99% training accuracy and 55% test accuracy, would you trust it more or less than yours? Explain before discussing.

Compare

Compare validation accuracy and test accuracy. Are they close? What would it mean if test accuracy were much lower?

Connect

Why is it cheating to keep changing your model until the test score goes up?

Test your stage

  • You printed baseline test accuracy.
  • You compared training, validation, and test accuracy.
  • You calculated the train-test gap.
  • Workflow check. Point to this stage on the workflow map and explain why test evidence comes after training.
  • Evidence check. Your honest claim includes training, validation, and test accuracy.
  • Design check. Explain from memory why changing the model after watching test accuracy would be cheating.

If it breaks

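  • Test accuracy is near 0.10. With 10 classes, 0.10 is what random guessing scores, so the model probably was not trained in this session. Re-run Stage 5 training.
  • baseline_history is missing. Re-run Stage 5 training.
  • The gap is tiny. That is usually good; it means the model does about as well on new images as on practiced ones. Check that the accuracy itself is not low, too.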
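
The first check in the list above can be sketched as a tiny helper. The function name `looks_untrained` and the tolerance of 0.03 are assumptions chosen for illustration, not values from the course.

```python
def looks_untrained(test_accuracy, num_classes=10, tolerance=0.03):
    """Return True when accuracy is within `tolerance` of random guessing."""
    # With 10 equally likely classes, random guessing scores about 1/10.
    chance_level = 1 / num_classes
    return abs(test_accuracy - chance_level) <= tolerance

print(looks_untrained(0.11))  # True: about what random guessing scores
print(looks_untrained(0.62))  # False: clearly above chance
```

In the notebook you could call it with `baseline_test_accuracy` before writing your honest claim.
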
Coach notes

The code is short; the learning is the distinction between practiced, checked, and sealed data. Keep the conversation focused there.