Stage 2: Data Detective
Keep your Colab notebook tab open all session. Open in a new tab — don’t use the buttons in this page to leave the course.
dataset inspection cells that count classes, sample examples, and find tricky images
why machine-learning engineers inspect data before training
a short data report with class counts and one likely-hard example
Show two CIFAR-10 images that are easy and two that are blurry or ambiguous. Ask the room which ones they expect the model to miss later.
The big idea
Bad machine learning often starts with unchecked data. Before preparing or training, engineers ask: What classes do we have? Are the classes balanced? Are the images clear? Which examples look confusing even to humans?
Today you are not building the model. You are learning to read the dataset.
- 1. Photos / CIFAR-10: labeled image examples (Stage 1)
- 2. Notebook variables: x_train, y_train, class_names (Setup-2)
- 3. Prepared data: normalized pixels and fair piles (Stage 3)
- 4. Keras model: CNN layers and summary (Stage 4)
- 5. Training history: epochs, loss, accuracy (Stage 5)
- 6. Test evidence: sealed score and mistakes (Stages 6-7)
- 7. Improved model: augmentation comparison (Stage 8)
- 8. Inference: uploaded image to top-3 guesses (Stage 9)
- 9. Demo evidence: table, confidence, limitation (Stage 10)
Stage 2 stays on the notebook-variable step. You use `y_train`, `class_names`, and image samples to decide whether the data is trustworthy before changing it.
- class: one category the model can predict, like cat or truck
- class balance: whether each class has about the same number of examples
- sample: a small set of examples used to inspect a larger dataset
- ambiguous: hard to identify clearly, even for a person
Stage 1 must be run so x_train, y_train, and class_names exist.
Build it
Step 1 — Predict which classes will be hardest
In a text cell, rank three classes you think the model will confuse most. Give a reason based on the images you saw in Stage 1.
My hardest classes prediction:
1. __________ because __________
2. __________ because __________
3. __________ because __________
Now make a second prediction: do you think CIFAR-10 has about the same number of examples in each class, or do you think some classes have many more? Write the prediction before counting.
Step 2 — Count the training classes
NumPy is Python's array and counting helper. We use it here because labels are stored as arrays of numbers, and NumPy can flatten, count, and search those arrays quickly.
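import numpy as np
# y_train stores one label per row; flatten it into a 1-D array so it can be counted.
label_numbers = y_train.flatten()
# Count how often each label number 0-9 appears.
counts = np.bincount(label_numbers, minlength=10)
for index, count in enumerate(counts):
    print(f"{index}: {class_names[index]} - {count} training images")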
If the counts are close, the dataset is balanced. If one class had far fewer examples, the model might struggle to learn it fairly.
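"Close" can be turned into a single number. The sketch below is one possible check, not part of the course notebook: `balance_ratio` is a hypothetical helper, and `toy_labels` is a made-up stand-in shaped like `y_train` (one label per row) so the example runs without the dataset.

```python
import numpy as np

def balance_ratio(counts):
    """Return the largest class count divided by the smallest class count."""
    counts = np.asarray(counts)
    if counts.min() == 0:
        return float("inf")  # at least one class has no examples at all
    return counts.max() / counts.min()

# Made-up labels: class 0 appears once, class 1 twice, class 2 three times.
toy_labels = np.array([[0], [1], [1], [2], [2], [2]])
toy_counts = np.bincount(toy_labels.flatten(), minlength=3)

print(toy_counts)                 # one count per class
print(balance_ratio(toy_counts))  # 1.0 would mean perfectly balanced
```

In the notebook, `balance_ratio(counts)` would apply the same check to the real class counts. A ratio near 1 suggests a balanced dataset; a large ratio means one class has far fewer examples than another.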
Step 3 — Show one example from each class
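import matplotlib.pyplot as plt  # harmless to repeat if an earlier cell already imported it
# Requires label_numbers from Step 2.
fig, axes = plt.subplots(2, 5, figsize=(10, 5))
for class_index in range(10):
    # Position of the first training image labeled with this class.
    first_match = np.where(label_numbers == class_index)[0][0]
    # Place classes 0-4 on the top row and 5-9 on the bottom row.
    row = class_index // 5
    col = class_index % 5
    axes[row, col].imshow(x_train[first_match])
    axes[row, col].set_title(class_names[class_index])
    axes[row, col].axis('off')
plt.show()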
This confirms that every class has at least one real example and gives you a visual baseline.
Step 4 — Build a quick data report
In a text cell, write:
Data report:
- The most common class is __________ with _____ images.
- The least common class is __________ with _____ images.
- One class I expect to be hard is __________ because __________.
- One image looked ambiguous because __________.
Your report is evidence. It will make Stage 7 and Stage 10 explanations stronger.
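The first two report lines can be read straight from the counts. The sketch below is one way to do that, with a hypothetical helper `count_report_lines` and made-up class names and counts. It also handles the case where every class has the same count, because then "most common" and "least common" are a tie rather than two different classes.

```python
import numpy as np

def count_report_lines(counts, names):
    """Build the count lines of the data report from per-class counts."""
    counts = np.asarray(counts)
    if counts.max() == counts.min():
        # Every class has the same count, so no class is most or least common.
        return [f"- All {len(names)} classes are tied at {counts[0]} images each."]
    most = int(np.argmax(counts))
    least = int(np.argmin(counts))
    return [
        f"- The most common class is {names[most]} with {counts[most]} images.",
        f"- The least common class is {names[least]} with {counts[least]} images.",
    ]

# Made-up example data, not CIFAR-10.
toy_names = ["circle", "square", "triangle"]
uneven_lines = count_report_lines([40, 25, 35], toy_names)
tied_lines = count_report_lines([30, 30, 30], toy_names)

for line in uneven_lines + tied_lines:
    print(line)
```

In the notebook, `count_report_lines(counts, class_names)` would use the real counts from Step 2.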
Understand it
Machine learning is not only model code. A model can only learn from the examples it receives. If a dataset is unbalanced, mislabeled, blurry, or missing important cases, training can look successful while the model behaves badly.
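A little arithmetic shows how training can "look successful" on unbalanced data. The sketch below uses made-up counts and a hypothetical helper, `majority_baseline_accuracy`, which scores a pretend model that never looks at the image and always guesses the biggest class.

```python
def majority_baseline_accuracy(counts):
    """Accuracy of a 'model' that always guesses the largest class."""
    return max(counts) / sum(counts)

# Made-up unbalanced dataset: 9,500 images of one class, 500 of another.
unbalanced = [9500, 500]
# Ten equal classes of 5,000 images each.
balanced = [5000] * 10

print(f"{majority_baseline_accuracy(unbalanced):.2f}")  # 0.95
print(f"{majority_baseline_accuracy(balanced):.2f}")    # 0.10
```

On the unbalanced data, the do-nothing guesser scores 95% while getting every image of the small class wrong. On balanced data the same trick scores only 10%, so a high accuracy there is harder to fake.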
CIFAR-10 is designed to be balanced, but it is still hard. The images are tiny. Cats and dogs blur together. Trucks and automobiles can look similar. Reading those weaknesses now helps you predict failures later.
Try this
Three short experiments. Predict before you run, then test your guess.
Pick one class you think will be easiest. Display five examples from that class. Did they look more consistent than the hard classes?
Compare five cats and five dogs. What visual clues separate them? What clues overlap?
If a class had only 200 examples while another had 5,000, how might that change the model's predictions?
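The first two experiments need several images from one class instead of just the first match. The sketch below shows one way to collect their positions. `indices_for_class` is a hypothetical helper, and `toy_labels` is made-up data so the example runs on its own.

```python
import numpy as np

def indices_for_class(labels, class_index, how_many=5):
    """Return up to `how_many` positions where `labels` equals `class_index`."""
    flat = np.asarray(labels).flatten()
    matches = np.where(flat == class_index)[0]
    return matches[:how_many]

# Made-up labels containing six 3s and three 5s.
toy_labels = np.array([3, 5, 3, 3, 5, 3, 3, 3, 5])

print(indices_for_class(toy_labels, 3))  # positions of the first five 3s
print(indices_for_class(toy_labels, 5))  # only three 5s exist, so three positions
```

In the notebook, assuming `class_names` is a Python list, `indices_for_class(label_numbers, class_names.index('cat'))` would give five cat positions. Each position can then be shown with `imshow(x_train[position])`, the same call Step 3 uses.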
Test your stage
- You printed class counts for all 10 classes.
- You displayed one example from every class.
- You wrote a data report with at least one predicted hard class.
- Workflow check. Point to this stage on the workflow map and explain why data inspection comes before preparation.
- Evidence check. Your data report cites one count result and one visual example.
- Design check. Explain why inspecting data before training is part of machine learning, not extra decoration.
If it breaks
- NameError: `class_names` is not defined. Re-run Stage 1's class list cell.
- `np` is not defined. Run the `import numpy as np` line.
- The class counts all look similar. Good. That means CIFAR-10 is balanced; the hard part is visual confusion, not missing classes.
This stage fixes the old tutorial problem by slowing down before preprocessing. Students should make predictions about model behavior before they train.