Stage 3: Prepare Data Like an Engineer
Keep your Colab notebook tab open for the whole session. Open links in a new tab; don't use the buttons on this page to leave the course.
normalized images, one-hot labels, and a validation split with before/after proof
why models need tidy inputs and fair evaluation piles
prepared train, validation, and test arrays with inspected examples
Print one raw pixel and one normalized pixel side by side. Then print one label before and after one-hot encoding.
The big idea
Data preparation changes numbers so the model can learn from them. Good engineers never change data blindly. They inspect before, transform, then inspect after.
You will make three changes: normalize pixels, convert labels to one-hot rows, and split a validation set out of the training data.
- 1. Photos / CIFAR-10: labeled image examples (Stage 1)
- 2. Notebook variables: x_train, y_train, class_names (Setup-2)
- 3. Prepared data: normalized pixels and fair piles (Stage 3)
- 4. Keras model: CNN layers and summary (Stage 4)
- 5. Training history: epochs, loss, accuracy (Stage 5)
- 6. Test evidence: sealed score and mistakes (Stages 6-7)
- 7. Improved model: augmentation comparison (Stage 8)
- 8. Inference: uploaded image to top-3 guesses (Stage 9)
- 9. Demo evidence: table, confidence, limitation (Stage 10)
Stage 3 turns raw notebook arrays into model-ready arrays. The prepared data is what the CNN will accept in Stage 4 and train on in Stage 5.
- normalize: shrink numbers into a friendlier range, usually 0 to 1
- one-hot: a label row with one 1 and the rest 0s
- validation set: a practice-test pile checked during training
- test set: the sealed final-grade pile
Stages 1 and 2 must be run. If Colab disconnected, use Runtime -> Run all through Stage 2 first.
Build it
Step 1 — Save a before snapshot
Before running code, predict what will change and what will stay the same:
Preparation prediction:
- Normalizing should change pixel numbers from _____ to _____.
- One-hot encoding should change one label number into _____ slots.
- The test set should stay sealed because __________.
raw_pixel = x_train[0][0][0].copy()
raw_label = y_train[0][0]
print("Raw pixel:", raw_pixel)
print("Raw label:", raw_label, class_names[raw_label])
This gives you proof of what the data looked like before preparation.
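A single pixel is a small sample. One way to capture a fuller "before" picture is to record the dtype, shape, and value range of the whole array. The sketch below is only an illustration: `snapshot` is a hypothetical helper, and `fake_images` is a tiny stand-in for `x_train` so the cell runs without loading CIFAR-10.

```python
import numpy as np

def snapshot(name, arr):
    """Print and return a small summary of an array (hypothetical helper)."""
    summary = {
        "dtype": str(arr.dtype),
        "shape": arr.shape,
        "min": float(arr.min()),
        "max": float(arr.max()),
    }
    print(name, summary)
    return summary

# Stand-in for x_train: 4 fake images, 2x2 pixels, 3 color channels.
fake_images = np.array([0, 64, 128, 255] * 12, dtype=np.uint8).reshape(4, 2, 2, 3)

before = snapshot("before", fake_images)
```

Calling the same helper again after each preparation step gives a matching "after" record to compare against.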
Step 2 — Normalize the images
x_train = x_train / 255.0
x_test = x_test / 255.0
print("Raw pixel was:", raw_pixel)
print("Normalized pixel is:", x_train[0][0][0])
Run this cell once. The image is the same picture, but the numbers are now between 0 and 1.
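To see the effect on a whole array rather than one pixel, the sketch below normalizes a tiny stand-in array and checks the result. The `normalize_once` helper is hypothetical. It assumes raw pixels are stored as integers and treats any float array as already normalized, which guards against the run-twice mistake. The float32 cast is a choice made here to save memory; the course cell divides directly.

```python
import numpy as np

def normalize_once(images):
    """Scale integer pixels (0-255) to floats in 0-1; leave float arrays alone."""
    if np.issubdtype(images.dtype, np.integer):
        return images.astype("float32") / 255.0
    # Already float: assume it was normalized earlier and do nothing.
    return images

raw = np.array([[0, 51, 255]], dtype=np.uint8)  # stand-in for image data
once = normalize_once(raw)
twice = normalize_once(once)  # second call should change nothing

print("raw:  ", raw)
print("once: ", once)
print("same after second call:", np.array_equal(once, twice))
```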
Step 3 — One-hot encode the labels
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)
print("Raw label was:", raw_label)
print("One-hot label is:", y_train[0])
The 1 marks the correct class slot. Later, the model will output 10 scores that line up with these 10 slots.
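To see what the encoding does without a library call, here is a minimal NumPy sketch that builds the same kind of rows by hand. It is not how `to_categorical` is implemented, just an equivalent illustration, and the labels are made up.

```python
import numpy as np

num_classes = 10
labels = np.array([6, 9, 0])  # made-up class numbers, like raw labels

# Row i of the identity matrix has a single 1 in slot i,
# so indexing by label picks the matching one-hot row.
one_hot = np.eye(num_classes, dtype="float32")[labels]

print(one_hot[0])     # the 1 sits in slot 6
print(one_hot.shape)  # one row per label, one column per class
```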
Step 4 — Split off validation data
train_test_split is a scikit-learn helper for making a fair split without hand-picking examples. We use it to create validation data from the training pile while leaving the final test pile untouched.
from sklearn.model_selection import train_test_split
x_train, x_val, y_train, y_val = train_test_split(
    x_train,
    y_train,
    test_size=0.2,
    random_state=42
)
The test set stays sealed. Validation is the pile we are allowed to check while building.
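Conceptually, a shuffled split shuffles the row order once and cuts the shuffled list in two, keeping each image paired with its label. The NumPy sketch below illustrates that idea on toy data. It is not scikit-learn's implementation, and it will not pick the same rows as `train_test_split`, even with the same seed.

```python
import numpy as np

# Toy stand-ins: 10 "images" (one number each) and their labels.
x = np.arange(10).reshape(10, 1)
y = np.arange(10) * 100  # label i*100 belongs to image i

rng = np.random.default_rng(42)   # fixed seed so the split is repeatable
order = rng.permutation(len(x))   # shuffled row indices

n_val = int(len(x) * 0.2)         # 20% of rows go to validation
val_idx, train_idx = order[:n_val], order[n_val:]

x_tr, y_tr = x[train_idx], y[train_idx]
x_va, y_va = x[val_idx], y[val_idx]

print("train:", x_tr.shape, "val:", x_va.shape)
# Images and labels stay paired because both use the same indices.
print("pairs intact:", np.array_equal(x_tr[:, 0] * 100, y_tr))
```

No row lands in both piles, which is what makes the validation score a fair check.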
Step 5 — Confirm the final shapes
print("Training images:", x_train.shape)
print("Validation images:", x_val.shape)
print("Test images:", x_test.shape)
print("Training labels:", y_train.shape)
print("Validation labels:", y_val.shape)
print("Test labels:", y_test.shape)
You should see 40,000 training images, 10,000 validation images, and 10,000 test images.
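The expected counts follow from simple arithmetic on the 50,000 CIFAR-10 training images. The hypothetical helper below reproduces that arithmetic. It assumes the fraction divides the pile evenly; for uneven cases, scikit-learn's rounding may differ from this sketch.

```python
def split_sizes(total, val_fraction):
    """Return (train_count, val_count) for a given validation fraction."""
    val_count = round(total * val_fraction)
    return total - val_count, val_count

train_count, val_count = split_sizes(50_000, 0.2)
print("train:", train_count, "validation:", val_count)
```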
Understand it
Normalization helps because a neural network multiplies and adds its inputs many times over. Keeping the input numbers in a small, consistent range like 0 to 1 stops those results from growing too large, which makes training steadier.
One-hot encoding prevents a category mistake. Class 9 is not greater than class 2; it is just a different category. Ten slots make the answer shape match the model's final ten output scores.
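The link between label slots and output scores can be sketched in a few lines. The scores below are made up for illustration, not real model output. `argmax` picks the slot with the highest score, and the prediction counts as correct when that slot is the one holding the 1.

```python
import numpy as np

# One-hot label for class 3 (10 slots, a single 1).
label = np.zeros(10)
label[3] = 1.0

# Made-up model scores, one per slot (they sum to 1 like probabilities).
scores = np.array([0.02, 0.03, 0.05, 0.60, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05])

predicted_class = int(np.argmax(scores))
true_class = int(np.argmax(label))

print("predicted:", predicted_class, "true:", true_class)
print("correct:", predicted_class == true_class)
```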
Try this
Three short experiments. Predict before you run, then test your guess.
If test_size=0.5, how many images would training and validation get? Predict, test it, then restore 0.2.
Compare raw_pixel to x_train[0][0][0]. Which version is easier for a model to use, and why?
Stage 4's final layer will have 10 output slots. Why does it help that your labels also have 10 slots?
Test your stage
- You printed a raw pixel and its normalized version.
- You printed a raw label and its one-hot version.
- Your final image shapes are 40,000 / 10,000 / 10,000.
- Workflow check. Point to this stage on the workflow map and explain what prepared data gives the model.
- Evidence check. Your notes cite one before/after pixel and one before/after label.
- Design check. Explain why the test set stays sealed while validation is allowed.
If it breaks
- Images look almost black. You probably normalized twice. Re-run from Stage 1's load cell.
- train_test_split shape error. You may have split twice. Re-run from Stage 1.
- Labels are still single numbers. Re-run the one-hot cell.
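Some of these problems can be caught with quick checks before moving on. The sketch below uses a hypothetical `check_prepared` helper and small stand-in arrays. The thresholds are assumptions based on this stage's steps (pixels divided by 255 once, labels with 10 slots), so treat a warning as a hint to inspect, not as proof.

```python
import numpy as np

def check_prepared(images, labels, num_classes=10):
    """Return a list of likely problems; an empty list means checks passed."""
    problems = []
    top = float(images.max())
    if top > 1.0:
        problems.append("pixels not normalized (max above 1)")
    elif top <= 1.0 / 255.0:
        problems.append("pixels may be normalized twice (max is tiny)")
    if labels.ndim != 2 or labels.shape[1] != num_classes:
        problems.append("labels are not one-hot rows")
    if len(images) != len(labels):
        problems.append("image and label counts differ")
    return problems

good_x = np.array([[0.0, 0.5, 1.0]])
good_y = np.eye(10)[[4]]               # a single one-hot row
print(check_prepared(good_x, good_y))  # empty list: nothing flagged

twice_x = good_x / 255.0               # simulates running the cell twice
raw_y = np.array([[4]])                # label still a single number
print(check_prepared(twice_x, raw_y))
```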
The required before/after prints are the main upgrade. Do not let students run preparation as a blind recipe.