Stage 2: Data Detective

Course progress: Stage 2 of 10 (~75 min)

Your workspace

Keep your Colab notebook tab open all session. Open it in a new tab, and don’t use the buttons on this page to leave the course.

Build

dataset inspection cells that count classes, sample examples, and find tricky images

Learn

why machine-learning engineers inspect data before training

Ship

a short data report with class counts and one likely-hard example

Teacher demo

Show two CIFAR-10 images that are easy and two that are blurry or ambiguous. Ask the room which ones they expect the model to miss later.

The big idea

Bad machine learning often starts with unchecked data. Before preparing or training, engineers ask: What classes do we have? Are the classes balanced? Are the images clear? Which examples look confusing even to humans?

Today you are not building the model. You are learning to read the dataset.

How the Python ML workflow connects
  1. Photos / CIFAR-10: labeled image examples (Stage 1)
  2. Notebook variables: x_train, y_train, class_names (Setup-2)
  3. Prepared data: normalized pixels and fair piles (Stage 3)
  4. Keras model: CNN layers and summary (Stage 4)
  5. Training history: epochs, loss, accuracy (Stage 5)
  6. Test evidence: sealed score and mistakes (Stages 6-7)
  7. Improved model: augmentation comparison (Stage 8)
  8. Inference: uploaded image to top-3 guesses (Stage 9)
  9. Demo evidence: table, confidence, limitation (Stage 10)

Stage 2 stays on the notebook-variable step. You use `y_train`, `class_names`, and image samples to decide whether the data is trustworthy before changing it.

New words

class: one category the model can predict, like cat or truck
class balance: whether each class has about the same number of examples
sample: a small set of examples used to inspect a larger dataset
ambiguous: hard to identify clearly, even for a person

Before you start

Run Stage 1 first so that `x_train`, `y_train`, and `class_names` exist in your notebook.

Build it

Step 1 — Predict which classes will be hardest

In a text cell, rank three classes you think the model will confuse most. Give a reason based on the images you saw in Stage 1.

My hardest classes prediction:
1. __________ because __________
2. __________ because __________
3. __________ because __________

Now make a second prediction: do you think CIFAR-10 has about the same number of examples in each class, or do you think some classes have many more? Write the prediction before counting.

Step 2 — Count the training classes

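NumPy is a Python library for working with arrays of numbers. We use it here because the labels are stored as an array of numbers, and NumPy can flatten, count, and search that array quickly. `flatten()` turns the label array into one long row of numbers, and `np.bincount` counts how many times each label number appears.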

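import numpy as np

# y_train may be stored as a column of labels; flatten it to one dimension.
label_numbers = y_train.flatten()
# Count how many times each label number 0-9 appears.
counts = np.bincount(label_numbers, minlength=10)

for index, count in enumerate(counts):
    print(f"{index}: {class_names[index]} - {count} training images")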

If the counts are close, the dataset is balanced. If one class had far fewer examples, the model might struggle to learn it fairly.
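
To turn "close" into a number, one option is to divide the smallest class count by the largest. The sketch below is only an illustration: `balance_ratio` is a hypothetical helper name, and the counts are made-up stand-ins, not real CIFAR-10 results. A ratio near 1.0 suggests balance; a ratio near 0 suggests one class is badly under-represented.

```python
import numpy as np

def balance_ratio(counts):
    """Return smallest count / largest count (1.0 means perfectly balanced)."""
    counts = np.asarray(counts)
    return counts.min() / counts.max()

# Made-up counts for illustration only (assumption, not CIFAR-10 data).
example_counts = np.array([5000, 5000, 4800, 5000, 200])
ratio = balance_ratio(example_counts)
print(f"smallest / largest = {ratio:.2f}")  # prints: smallest / largest = 0.04
```

In your notebook, you could call `balance_ratio(counts)` with the `counts` array from Step 2 and compare the result with your prediction.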

Step 3 — Show one example from each class

import matplotlib.pyplot as plt

# A 2 x 5 grid: one panel per class.
fig, axes = plt.subplots(2, 5, figsize=(10, 5))

for class_index in range(10):
    # Index of the first training image with this label.
    first_match = np.where(label_numbers == class_index)[0][0]
    row = class_index // 5
    col = class_index % 5
    axes[row, col].imshow(x_train[first_match])
    axes[row, col].set_title(class_names[class_index])
    axes[row, col].axis('off')

plt.show()

This confirms that every class has at least one real example and gives you a visual baseline.

Step 4 — Build a quick data report

In a text cell, write:

Data report:
- The most common class is __________ with _____ images.
- The least common class is __________ with _____ images.
- One class I expect to be hard is __________ because __________.
- One image looked ambiguous because __________.

Your report is evidence. It will make Stage 7 and Stage 10 explanations stronger.
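
If you would rather look up the first two report lines than read them off the printed list, a small helper can do it. This is a minimal sketch: `summarize_counts` is a hypothetical name, and the class names and counts are stand-ins. Note that when every class has the same count, `np.argmax` and `np.argmin` both return the first class, so the "most" and "least" common class will be the same name. That is a sign of balance, not a bug.

```python
import numpy as np

def summarize_counts(counts, class_names):
    """Return (name, count) for the most common and least common class."""
    counts = np.asarray(counts)
    most = int(np.argmax(counts))   # index of the largest count
    least = int(np.argmin(counts))  # index of the smallest count
    return (class_names[most], int(counts[most])), (class_names[least], int(counts[least]))

# Stand-in values for illustration only (assumption, not CIFAR-10 data).
demo_names = ["cat", "dog", "truck"]
demo_counts = np.array([30, 50, 20])
most_common, least_common = summarize_counts(demo_counts, demo_names)
print(f"most common: {most_common[0]} ({most_common[1]})")
print(f"least common: {least_common[0]} ({least_common[1]})")
```

In the notebook, pass the `counts` array from Step 2 and your own `class_names`.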

Understand it

Machine learning is not only model code. A model can only learn from the examples it receives. If a dataset is unbalanced, mislabeled, blurry, or missing important cases, training can look successful while the model behaves badly.

CIFAR-10 is designed to be balanced, but it is still hard. The images are tiny. Cats and dogs blur together. Trucks and automobiles can look similar. Reading those weaknesses now helps you predict failures later.
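
One rough way to hunt for tricky images is to look for low contrast: images whose pixels are all nearly the same value are often washed out or hard to read. The sketch below is a heuristic, not a definition of "ambiguous". `lowest_contrast_indices` is a hypothetical helper, and the images are random stand-ins so the cell runs on its own. You still need to look at the images it finds and judge them yourself.

```python
import numpy as np

def lowest_contrast_indices(images, k=5):
    """Return indices of the k images with the smallest pixel standard deviation."""
    images = np.asarray(images, dtype=np.float32)
    # One contrast score per image: spread of all its pixel values.
    contrast = images.reshape(len(images), -1).std(axis=1)
    return np.argsort(contrast)[:k]

# Stand-in data for illustration only: 20 random 32x32 color images.
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(20, 32, 32, 3), dtype=np.uint8)
fake_images[7] = 128  # make image 7 a flat gray square (zero contrast)

flat = lowest_contrast_indices(fake_images, k=3)
print(flat[0])  # prints: 7
```

In the notebook, you could pass `x_train` instead of the stand-in images and display the returned indices with `imshow`, the same way Step 3 does.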

Try this

Three short experiments. Predict before you run, then test your guess.

Predict first

Pick one class you think will be easiest. Display five examples from that class. Did they look more consistent than the hard classes?

Compare

Compare five cats and five dogs. What visual clues separate them? What clues overlap?
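
Both experiments above need several examples from one class, not just the first one. A minimal sketch of a helper for that follows. `first_n_of_class` is a hypothetical name, and the labels are stand-ins; the label numbers 3 and 5 are used here only as example values, so check your own `class_names` list to find the numbers for cat and dog.

```python
import numpy as np

def first_n_of_class(labels, class_index, n=5):
    """Return the indices of the first n labels equal to class_index."""
    labels = np.asarray(labels).flatten()
    return np.flatnonzero(labels == class_index)[:n]

# Stand-in labels for illustration only, stored as a column like y_train may be.
demo_labels = np.array([[3], [5], [3], [0], [5], [3], [3], [5], [3], [3]])
group_a = first_n_of_class(demo_labels, 3)
group_b = first_n_of_class(demo_labels, 5)
print(group_a)  # prints: [0 2 5 6 8]
print(group_b)  # prints: [1 4 7]
```

In the notebook, call it with `y_train` and a class number, then show each returned index with `imshow(x_train[i])` as in Step 3. If a class has fewer than `n` examples, the helper simply returns the ones it finds.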

Connect

If a class had only 200 examples while another had 5,000, how might that change the model's predictions?

Test your stage

  • You printed class counts for all 10 classes.
  • You displayed one example from every class.
  • You wrote a data report with at least one predicted hard class.
  • Workflow check. Point to this stage on the workflow map and explain why data inspection comes before preparation.
  • Evidence check. Your data report cites one count result and one visual example.
  • Design check. Explain why inspecting data before training is part of machine learning, not extra decoration.

If it breaks

  • NameError: name 'class_names' is not defined. Re-run Stage 1's class list cell.
  • NameError: name 'np' is not defined. Run the `import numpy as np` line from Step 2.
  • The class counts all look similar. Good. That means CIFAR-10 is balanced; the hard part is visual confusion, not missing classes.

Coach notes

This stage fixes the old tutorial problem by slowing down before preprocessing. Students should make predictions about model behavior before they train.