
Stage 3: Prepare Data Like an Engineer

Course progress: Stage 3 of 10
~75 min
Your workspace

Keep your Colab notebook tab open for the whole session. Open it in a new tab; don't use the buttons on this page to leave the course.

Build

normalized images, one-hot labels, and a validation split with before/after proof

Learn

why models need tidy inputs and fair evaluation piles

Ship

prepared train, validation, and test arrays with inspected examples

Teacher demo

Print one raw pixel and one normalized pixel side by side. Then print one label before and after one-hot encoding.

The big idea

Data preparation changes numbers so the model can learn from them. Good engineers never change data blindly. They inspect before, transform, then inspect after.

You will make three changes: normalize pixels, convert labels to one-hot rows, and split a validation set out of the training data.

How the Python ML workflow connects
  1. Photos / CIFAR-10: labeled image examples (Stage 1)
  2. Notebook variables: x_train, y_train, class_names (Setup-2)
  3. Prepared data: normalized pixels and fair piles (Stage 3)
  4. Keras model: CNN layers and summary (Stage 4)
  5. Training history: epochs, loss, accuracy (Stage 5)
  6. Test evidence: sealed score and mistakes (Stages 6-7)
  7. Improved model: augmentation comparison (Stage 8)
  8. Inference: uploaded image to top-3 guesses (Stage 9)
  9. Demo evidence: table, confidence, limitation (Stage 10)

Stage 3 turns raw notebook arrays into model-ready arrays. The prepared data is what the CNN will accept in Stage 4 and train on in Stage 5.

New words
normalize: shrink numbers into a friendlier range, usually 0 to 1
one-hot: a label row with one 1 and the rest 0s
validation set: a practice-test pile checked during training
test set: the sealed final-grade pile

Before you start

Stages 1 and 2 must be run. If Colab disconnected, use Runtime -> Run all through Stage 2 first.

Build it

Step 1 — Save a before snapshot

Before running code, predict what will change and what will stay the same:

Preparation prediction:
- Normalizing should change pixel numbers from _____ to _____.
- One-hot encoding should change one label number into _____ slots.
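- The test set should stay sealed because __________.

# Snapshot the top-left pixel of the first image (three values: red, green, blue).
raw_pixel = x_train[0][0][0].copy()
# Each label is stored as a one-item row, so [0][0] pulls out the plain class number.
raw_label = y_train[0][0]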

print("Raw pixel:", raw_pixel)
print("Raw label:", raw_label, class_names[raw_label])

This gives you proof of what the data looked like before preparation.

Step 2 — Normalize the images

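# Convert to float32 first: it takes half the memory of the default float64.
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

print("Raw pixel was:", raw_pixel)
print("Normalized pixel is:", x_train[0][0][0])

Run this cell once. Running it again divides by 255 a second time and shrinks every pixel to nearly zero. The image is the same picture, but the numbers are now between 0 and 1.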
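Here is one way to make normalization safe to re-run. This is only a sketch: `normalize_once` and `fake_images` are made-up names for illustration, and the check assumes a raw image batch always contains at least one pixel above 1.

```python
import numpy as np

# Hypothetical stand-in for CIFAR-10: 4 tiny "images" of raw uint8 pixels (0 to 255).
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(4, 32, 32, 3), dtype=np.uint8)
fake_images[0, 0, 0] = [255, 0, 128]  # force known values so the range is exactly 0 to 255


def normalize_once(images):
    """Scale raw 0-255 pixels to 0-1, but leave already-normalized arrays alone."""
    if images.max() <= 1.0:
        # Assumption: a max of 1 or less means the data was already normalized.
        return images
    return images.astype("float32") / 255.0


normalized = normalize_once(fake_images)
again = normalize_once(normalized)  # the second call changes nothing

print("Raw range:", fake_images.min(), "to", fake_images.max())
print("Normalized range:", normalized.min(), "to", normalized.max())
print("Safe to re-run:", np.array_equal(normalized, again))
```

Printing the minimum and maximum is a quick before/after proof for the whole array, not just one pixel.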

Step 3 — One-hot encode the labels

y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)

print("Raw label was:", raw_label)
print("One-hot label is:", y_train[0])

The 1 marks the correct class slot. Later, the model will output 10 scores that line up with these 10 slots.
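To see what one-hot encoding does to the numbers, here is a small NumPy-only sketch. It is not how Keras implements `to_categorical`; it just builds the same kind of rows by hand. The three label values are made up for illustration.

```python
import numpy as np

# CIFAR-10 style labels: shape (N, 1), integers 0 to 9. These values are made up.
labels = np.array([[6], [9], [2]])
num_classes = 10

# Row k of the identity matrix is the one-hot row for class k.
one_hot = np.eye(num_classes, dtype="float32")[labels.reshape(-1)]

print(one_hot.shape)  # (3, 10)
print(one_hot[0])     # the 1.0 sits in slot 6

# argmax finds the slot holding the 1, which recovers the original label.
recovered = one_hot.argmax(axis=1)
print(recovered)      # [6 9 2]
```

The `argmax` step is worth remembering: it is the usual way to turn a row of 10 values back into a single class number.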

Step 4 — Split off validation data

train_test_split is a scikit-learn helper for making a fair split without hand-picking examples. We use it to create validation data from the training pile while leaving the final test pile untouched.

from sklearn.model_selection import train_test_split

x_train, x_val, y_train, y_val = train_test_split(
    x_train,
    y_train,
    test_size=0.2,
    random_state=42
)

The test set stays sealed. Validation is the pile we are allowed to check while building.
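The idea behind a fair split can be sketched in a few lines of NumPy. This is not scikit-learn's implementation, and `shuffle_split` is a hypothetical helper written for this example. It shows the two things that matter: images and labels are shuffled in the same order, and a fixed seed gives the same split every time.

```python
import numpy as np


def shuffle_split(x, y, val_fraction=0.2, seed=42):
    """Shuffle x and y with the same order, then cut off a validation slice."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))  # one shuffled order shared by both arrays
    n_val = int(round(len(x) * val_fraction))
    val_idx, train_idx = order[:n_val], order[n_val:]
    return x[train_idx], x[val_idx], y[train_idx], y[val_idx]


# Toy data where each "image" is just its own index, so pairing is easy to check.
x = np.arange(50).reshape(50, 1)
y = np.arange(50)

x_tr, x_va, y_tr, y_va = shuffle_split(x, y)
x_tr2, x_va2, y_tr2, y_va2 = shuffle_split(x, y)  # same seed, same split

print("Train / validation sizes:", len(x_tr), len(x_va))
print("Pairs still match:", np.array_equal(x_tr[:, 0], y_tr))
print("Same split both times:", np.array_equal(x_va, x_va2))
```

No example lands in both piles, which is what keeps the validation score honest.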

Step 5 — Confirm the final shapes

print("Training images:", x_train.shape)
print("Validation images:", x_val.shape)
print("Test images:", x_test.shape)
print("Training labels:", y_train.shape)
print("Validation labels:", y_val.shape)
print("Test labels:", y_test.shape)

You should see 40,000 training images, 10,000 validation images, and 10,000 test images.
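Those numbers come from simple arithmetic, sketched below. The 50,000 starting count and the 32 x 32 x 3 image size are the standard CIFAR-10 values; the variable names are made up for this example.

```python
total_training_images = 50_000  # CIFAR-10 training pile before the split
test_size = 0.2                 # the fraction handed to validation

expected_val = int(total_training_images * test_size)
expected_train = total_training_images - expected_val

# Each CIFAR-10 image is 32 x 32 pixels with 3 color channels.
expected_train_image_shape = (expected_train, 32, 32, 3)
expected_train_label_shape = (expected_train, 10)  # 10 one-hot slots per label

print("Expected validation images:", expected_val)
print("Expected training images:", expected_train)
print("Expected training image shape:", expected_train_image_shape)
print("Expected training label shape:", expected_train_label_shape)

# In the notebook you could then check your real arrays, for example:
# assert x_train.shape == expected_train_image_shape
# assert y_train.shape == expected_train_label_shape
```

If a real shape does not match the expected one, stop and find out why before moving on to Stage 4.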

Understand it

Normalization helps because neural networks do thousands of multiplications and additions. Keeping every input in the same small range, 0 to 1, stops those results from growing huge, which makes training steadier.

One-hot encoding prevents a category mistake. Class 9 is not greater than class 2; it is just a different category. Ten slots make the answer shape match the model's final ten output scores.

Try this

Three short experiments. Predict before you run, then test your guess.

Predict first

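If test_size=0.5, how many images would training and validation get? Predict, then test it. Re-run from Stage 1's load cell first so you split the full training pile, not the one you already split. Then restore 0.2 and re-run again.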

Compare

Compare raw_pixel to x_train[0][0][0]. Which version is easier for a model to use, and why?

Connect

Stage 4's final layer will have 10 output slots. Why does it help that your labels also have 10 slots?

Test your stage

  • You printed a raw pixel and its normalized version.
  • You printed a raw label and its one-hot version.
  • Your final image counts are 40,000 training / 10,000 validation / 10,000 test.
  • Workflow check. Point to this stage on the workflow map and explain what prepared data gives the model.
  • Evidence check. Your notes cite one before/after pixel and one before/after label.
  • Design check. Explain why the test set stays sealed while validation is allowed.

If it breaks

  • Images look almost black. You probably normalized twice. Re-run from Stage 1's load cell.
  • train_test_split shape error or wrong sizes (such as 32,000 / 8,000). You may have split twice or run cells out of order. Re-run from Stage 1.
  • Labels are still single numbers. Re-run the one-hot cell.
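
The checks above can be turned into a small helper. This is a sketch only: `diagnose` is a hypothetical function, and the 0.01 cutoff for "normalized twice" is an assumption based on 1 / 255 being about 0.004.

```python
import numpy as np


def diagnose(images, labels):
    """Return a list of likely problems with prepared arrays (empty means none found)."""
    problems = []
    if images.max() > 1.0:
        problems.append("pixels not normalized yet")
    elif images.max() < 0.01:
        # Assumed cutoff: dividing by 255 twice leaves a max near 0.004.
        problems.append("pixels look normalized twice")
    if labels.ndim != 2 or labels.shape[1] != 10:
        problems.append("labels are not one-hot rows of 10")
    if len(images) != len(labels):
        problems.append("images and labels have different counts")
    return problems


# Made-up arrays to show the helper in action.
rng = np.random.default_rng(0)
good_images = rng.random((8, 32, 32, 3)).astype("float32")
good_images[0, 0, 0] = 1.0                    # make sure the max is a full-brightness pixel
good_labels = np.eye(10, dtype="float32")[:8]  # 8 valid one-hot rows
raw_labels = np.zeros((8, 1), dtype="uint8")   # labels before one-hot encoding

print(diagnose(good_images, good_labels))          # no problems
print(diagnose(good_images / 255.0, good_labels))  # normalized twice
print(diagnose(good_images, raw_labels))           # labels still single numbers
```

In the notebook you could call `diagnose(x_train, y_train)` after Step 5. An empty list means none of these three problems was detected.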

Coach notes

The required before/after prints are the main upgrade. Do not let students run preparation as a blind recipe.