Splitting Data - Train, Validation, and Test Sets
Why we split
We split data to answer a single question:
“How will this model perform on new data?”
If you evaluate on the same data you trained on, you’re only measuring memorization.
The three splits
- Train: fit model parameters
- Validation: tune hyperparameters / choose model
- Test: final, unbiased estimate (touch only at the end)
flowchart LR
    D[Dataset] --> T[Train]
    D --> V[Validation]
    D --> E[Test]
    T --> M[Fit model]
    V --> H[Tune hyperparameters]
    E --> F[Final evaluation]
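The three-way split above can be produced with two chained calls to scikit-learn's train_test_split. This is a minimal sketch with synthetic data; the 70/15/15 proportions and the toy arrays are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 4 features, binary labels (hypothetical)
rng = np.random.default_rng(0)
X = rng.random((100, 4))
y = rng.integers(0, 2, size=100)

# First carve off the test set (15 samples), then split the
# remainder into train (70) and validation (15).
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=15, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=15, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 70 15 15
```

Passing an integer test_size requests an exact sample count, which avoids rounding surprises when chaining splits.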
Typical ratios
Common starting points (train / validation / test):
- 70/15/15
- 80/10/10
If data is small, prefer cross-validation.
Scikit-learn example
Train/test split

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

Notes:
- random_state makes results reproducible.
- stratify=y preserves class ratios (important for classification).
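A quick way to see what stratify=y buys you is to compare class ratios after the split. The imbalanced labels below (90 zeros, 10 ones) are a made-up example:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced labels: 90 zeros, 10 ones (hypothetical)
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Both splits keep the 10% minority ratio
print(y_tr.mean(), y_te.mean())  # 0.1 0.1
```

Without stratify, a small test set can easily end up with too few (or zero) minority-class samples.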
Time series special case
For time-dependent data, don't shuffle.
Split by time instead:
- train on past
- validate on near-future
- test on future
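scikit-learn's TimeSeriesSplit implements this pattern: every fold trains on the past and evaluates on the block that immediately follows. The 12-point array here is a stand-in for real timestamped data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 time-ordered observations; index order stands in for timestamps
X = np.arange(12).reshape(-1, 1)

splits = list(TimeSeriesSplit(n_splits=3).split(X))
for train_idx, test_idx in splits:
    # Training indices always precede test indices (no shuffling)
    print(train_idx, "->", test_idx)
```

Note that the training window grows with each fold, mirroring how a deployed model would be retrained as new data arrives.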
Leakage checklist
Before you trust your results:
- Did you fit imputers/scalers on train only?
- Did you compute aggregates using future data?
- Did you duplicate records across splits?
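The first checklist item is easiest to get right with a Pipeline, which guarantees the scaler is fit on the training data only and merely applied to the test data. The estimator and synthetic dataset below are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# fit() runs the scaler on X_train only; on X_test the pipeline
# only transforms with the already-fitted statistics, then predicts
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```

Fitting the scaler on the full dataset before splitting would leak test-set statistics (mean, variance) into training.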
Mini-checkpoint
Write down:
- what your test set represents in the real world
- when you’re allowed to look at it (answer: ideally once)
If this helped you, consider buying me a coffee ☕
