Stage 2: Data Detective

Course progress: Stage 2 of 10

~75 min

Your workspace

Keep your Colab notebook tab open all session. Open each workspace link in a new tab; don't use the buttons on this page to leave the course.

Train your AI: Google Colab (open in a new tab)
No-code warm-up: Machine Learning for Kids (open in a new tab)

Build

dataset inspection cells that count classes, sample examples, and find tricky images

Learn

why machine-learning engineers inspect data before training

Ship

a short data report with class counts and one likely-hard example

Teacher demo

Show two CIFAR-10 images that are easy and two that are blurry or ambiguous. Ask the room which ones they expect the model to miss later.

The big idea

Bad machine learning often starts with unchecked data. Before preparing or training, engineers ask: What classes do we have? Are the classes balanced? Are the images clear? Which examples look confusing even to humans?

Today you are not building the model. You are learning to read the dataset.

How the Python ML workflow connects

1. Photos / CIFAR-10: labeled image examples (Stage 1)
2. Notebook variables: x_train, y_train, class_names (Setup-2)
3. Prepared data: normalized pixels and fair piles (Stage 3)
4. Keras model: CNN layers and summary (Stage 4)
5. Training history: epochs, loss, accuracy (Stage 5)
6. Test evidence: sealed score and mistakes (Stages 6-7)
7. Improved model: augmentation comparison (Stage 8)
8. Inference: uploaded image to top-3 guesses (Stage 9)
9. Demo evidence: table, confidence, limitation (Stage 10)

Stage 2 stays on the notebook-variable step. You use `y_train`, `class_names`, and image samples to decide whether the data is trustworthy before changing it.

New words

class: one category the model can predict, like cat or truck
class balance: whether each class has about the same number of examples
sample: a small set of examples used to inspect a larger dataset
ambiguous: hard to identify clearly, even for a person

Before you start

Stage 1 must be run so x_train, y_train, and class_names exist.
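
If you are not sure whether Stage 1 has been run in this session, a quick check like the one below can tell you. This is only a sketch: the helper name `missing_names` is made up for this example, and it assumes the three variable names listed above are the only ones this stage needs.

```python
# Names that Stage 1 is expected to define, according to this stage.
required = ["x_train", "y_train", "class_names"]

def missing_names(namespace, names=required):
    """Return the names from `names` that are absent from `namespace`."""
    return [name for name in names if name not in namespace]

# In a notebook, globals() holds every variable defined so far.
missing = missing_names(globals())
if missing:
    print("Re-run Stage 1 first. Missing:", ", ".join(missing))
else:
    print("Ready: all Stage 1 variables exist.")
```

If the cell prints a "Missing" message, go back and run the Stage 1 cells before continuing.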

Build it

Step 1 — Predict which classes will be hardest

In a text cell, rank three classes you think the model will confuse most. Give a reason based on the images you saw in Stage 1.

My hardest classes prediction:
__________ because __________
__________ because __________
__________ because __________

Now make a second prediction: do you think CIFAR-10 has about the same number of examples in each class, or do you think some classes have many more? Write the prediction before counting.

Step 2 — Count the training classes

NumPy is Python's array and counting helper. We use it here because labels are stored as arrays of numbers, and NumPy can flatten, count, and search those arrays quickly.

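import numpy as np

# flatten() makes sure the labels are one simple 1-D array of numbers.
label_numbers = y_train.flatten()
# counts[i] is how many training images carry label number i.
counts = np.bincount(label_numbers, minlength=10)

# Print each label number next to its class name and its count.
for index, count in enumerate(counts):
    print(f"{index}: {class_names[index]} - {count} training images")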

If the counts are close, the dataset is balanced. If one class had far fewer examples, the model might struggle to learn it fairly.
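
"Close" can be turned into a single number. One possible way, sketched below, is to divide the smallest count by the largest: a result of 1.0 means perfectly balanced, and a result near 0 means one class is almost missing. The function name `balance_ratio` and the counts are made up for illustration; they are not CIFAR-10 results.

```python
import numpy as np

def balance_ratio(counts):
    """Return smallest count divided by largest count (1.0 = perfectly balanced)."""
    counts = np.asarray(counts)
    return counts.min() / counts.max()

# Made-up counts for illustration only; these are not CIFAR-10 results.
balanced_counts = [500, 500, 500]
lopsided_counts = [500, 450, 50]

print(balance_ratio(balanced_counts))  # 1.0
print(balance_ratio(lopsided_counts))  # 0.1
```

In your notebook you could try `balance_ratio(counts)` on the counts from the cell above and compare the result with your prediction from Step 1.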

Step 3 — Show one example from each class

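import matplotlib.pyplot as plt

# A 2-by-5 grid gives one panel for each of the 10 classes.
fig, axes = plt.subplots(2, 5, figsize=(10, 5))

for class_index in range(10):
    # Position of the first training image with this label (uses label_numbers from Step 2).
    first_match = np.where(label_numbers == class_index)[0][0]
    row = class_index // 5
    col = class_index % 5
    axes[row, col].imshow(x_train[first_match])
    axes[row, col].set_title(class_names[class_index])
    axes[row, col].axis('off')

plt.show()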

This confirms every class has at least one real image and gives you a visual baseline. One image per class is a first look, not proof that the whole class is clean.

Step 4 — Build a quick data report

In a text cell, write:

Data report:
- The most common class is __________ with _____ images.
- The least common class is __________ with _____ images.
- One class I expect to be hard is __________ because __________.
- One image looked ambiguous because __________.

If every class has the same count, say so in the first two lines instead of naming a single class. Your report is evidence. It will make Stage 7 and Stage 10 explanations stronger.
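
If you would like to check your count lines against the numbers, the sketch below shows one way to build them from an array of counts. The helper `report_lines`, the class names, and the counts are all invented for this example. Note the tie check: `np.argmax` and `np.argmin` return the first position when several classes share the same count, so without the check a tie would look like a winner.

```python
import numpy as np

def report_lines(counts, class_names):
    """Build the count lines of a data report, handling the all-equal case."""
    counts = np.asarray(counts)
    if counts.min() == counts.max():
        return [f"All {len(counts)} classes have {counts[0]} images each."]
    most = int(np.argmax(counts))
    least = int(np.argmin(counts))
    return [
        f"The most common class is {class_names[most]} with {counts[most]} images.",
        f"The least common class is {class_names[least]} with {counts[least]} images.",
    ]

# Made-up names and counts for illustration only.
demo_names = ["cat", "dog", "truck"]
for line in report_lines([40, 25, 60], demo_names):
    print(line)
for line in report_lines([30, 30, 30], demo_names):
    print(line)
```

The last two lines of the report still need your own eyes: no count can tell you which image looked ambiguous.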

Understand it

Machine learning is not only model code. A model can only learn from the examples it receives. If a dataset is unbalanced, mislabeled, blurry, or missing important cases, training can look successful while the model behaves badly.
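
Here is a small sketch of how "looks successful, behaves badly" can happen. It uses made-up labels (not CIFAR-10) and a pretend model that has learned nothing and always answers with the most common class. The numbers are chosen only to make the effect easy to see.

```python
import numpy as np

# Made-up, deliberately lopsided labels: 950 of class 0 and 50 of class 1.
labels = np.array([0] * 950 + [1] * 50)

# A pretend "model" that learned nothing and always answers with the majority class.
majority_class = np.bincount(labels).argmax()
predictions = np.full_like(labels, majority_class)

# Overall accuracy looks high because class 0 dominates the data.
accuracy = (predictions == labels).mean()
# Share of class-1 examples the pretend model got right.
minority_found = (predictions[labels == 1] == 1).mean()

print(f"accuracy: {accuracy:.2f}")                    # accuracy: 0.95
print(f"minority class found: {minority_found:.2f}")  # minority class found: 0.00
```

The score looks strong even though the pretend model never recognizes the smaller class once. Counting classes first is how you would spot this risk before training.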

CIFAR-10 is designed to be balanced, but it is still hard. The images are tiny (32×32 pixels). At that size, cats and dogs blur together, and trucks and automobiles can look similar. Reading those weaknesses now helps you predict failures later.
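
Looking through thousands of tiny images by eye is slow, so a rough shortcut can help you decide where to look first. The sketch below ranks images by how little their pixel values vary, on the assumption that washed-out images are more likely to be hard to read. This heuristic is not part of the course notebook, the function name `lowest_contrast_indices` is invented here, and it only finds one kind of tricky image. It cannot tell whether a sharp image is ambiguous.

```python
import numpy as np

def lowest_contrast_indices(images, k=3):
    """Return indices of the k images with the smallest pixel standard deviation."""
    images = np.asarray(images, dtype=np.float32)
    # One contrast score per image: flatten everything except the first axis.
    contrast = images.reshape(len(images), -1).std(axis=1)
    return np.argsort(contrast)[:k]

# Stand-in data so the sketch runs on its own: four random 32x32 RGB images,
# with image 2 deliberately washed out to a narrow range of pixel values.
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(4, 32, 32, 3))
fake_images[2] = rng.integers(120, 130, size=(32, 32, 3))

print(lowest_contrast_indices(fake_images, k=1))  # [2]
```

In your notebook you could try it on a slice such as `x_train[:1000]`, then display the returned positions with `imshow` and judge for yourself whether they really are hard to identify.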

Try this

Three short experiments. Predict before you run, then test your guess.

Predict first

Pick one class you think will be easiest. Display five examples from that class. Did they look more consistent than the hard classes?

Compare

Compare five cats and five dogs. What visual clues separate them? What clues overlap?
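
One way to pick the images for this comparison is to choose random positions for each class instead of always taking the first matches. The sketch below uses stand-in labels so it runs on its own. The helper `sample_class_indices` is a made-up name, and the two-class label list is invented for the demo.

```python
import numpy as np

def sample_class_indices(labels, class_index, n=5, seed=0):
    """Pick n random positions in `labels` whose value equals class_index."""
    matches = np.flatnonzero(np.asarray(labels).flatten() == class_index)
    rng = np.random.default_rng(seed)
    # Never ask for more images than the class actually has.
    return rng.choice(matches, size=min(n, len(matches)), replace=False)

# Stand-in labels so the sketch runs without the dataset loaded.
demo_names = ["cat", "dog"]
demo_labels = np.array([0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1])

cat_picks = sample_class_indices(demo_labels, demo_names.index("cat"))
dog_picks = sample_class_indices(demo_labels, demo_names.index("dog"))
print("cat positions:", cat_picks)
print("dog positions:", dog_picks)
```

In your notebook the call would look like `sample_class_indices(y_train, class_names.index("cat"))`, assuming your `class_names` list from Stage 1 spells the class as "cat". Pass each returned position to `imshow`, as in Step 3, to see the images side by side. Change `seed` to get a different set.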

Connect

If a class had only 200 examples while another had 5,000, how might that change the model's predictions?

Test your stage

You printed class counts for all 10 classes.
You displayed one example from every class.
You wrote a data report with at least one predicted hard class.
Workflow check. Point to this stage on the workflow map and explain why data inspection comes before preparation.
Evidence check. Your data report cites one count result and one visual example.
Design check. Explain why inspecting data before training is part of machine learning, not extra decoration.

If it breaks

NameError: name 'class_names' is not defined. Re-run Stage 1's class list cell.
NameError: name 'np' is not defined. Run the `import numpy as np` line.
The class counts all look similar. Good. That means CIFAR-10 is balanced; the hard part is visual confusion, not missing classes.

Coach notes

This stage fixes the old tutorial problem by slowing down before preprocessing. Students should make predictions about model behavior before they train.

Stage 2 complete

You finished Stage 2!

You inspected the data, counted the classes, and predicted where the model may struggle. Next, you prepare the data without losing track of what changed.

Next: Stage 3: Prepare Data Like an Engineer
