
Splitting Data - Train, Validation, and Test Sets

Why we split

We split data to answer a single question:

“How will this model perform on new data?”

If you evaluate on the same data you trained on, you’re only measuring memorization.

The three splits

  • Train: fit model parameters
  • Validation: tune hyperparameters / choose model
  • Test: final, unbiased estimate (touch only at the end)
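The three splits above can be carved out with two passes of scikit-learn's `train_test_split`. A minimal sketch, where `X` and `y` are toy placeholders for your own features and labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for real features/labels (100 samples, 2 classes).
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# Pass 1: hold out 15 samples (15% of 100) as the final test set.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=15, random_state=42, stratify=y
)

# Pass 2: carve 15 more samples out of the remainder for validation,
# leaving 70 for training (a 70/15/15 split overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=15, random_state=42, stratify=y_tmp
)
```

Passing an integer `test_size` requests an exact sample count, which avoids any ambiguity about how fractional sizes are rounded.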



  flowchart LR
  D[Dataset] --> T[Train]
  D --> V[Validation]
  D --> E[Test]
  T --> M[Fit model]
  V --> H[Tune hyperparameters]
  E --> F[Final evaluation]


Typical ratios

Common starting points:

  • 70/15/15
  • 80/10/10

If data is small, prefer cross-validation.
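A minimal cross-validation sketch, using a synthetic dataset in place of a small real one: each sample serves as validation data exactly once, so no single split decides the score.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for a small dataset (120 samples).
X, y = make_classification(n_samples=120, n_features=5, random_state=0)

# 5-fold stratified CV: train on 4 folds, validate on the 5th, rotate.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print(scores.mean())  # average validation accuracy across the 5 folds
```

Reporting the mean (and spread) of the fold scores gives a more stable estimate than one small validation set would.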

Scikit-learn example

Train/test split
from sklearn.model_selection import train_test_split
 
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

Notes:

  • random_state makes results reproducible.
  • stratify=y preserves class ratios (important for classification).
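A quick sketch of what stratification buys you on an imbalanced toy dataset: the minority-class rate in the test set matches the full dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 90% class 0, 10% class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

_, _, _, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# stratify=y keeps the 10% minority rate in the 20-sample test set.
print((y_test == 1).mean())  # 0.1
```

Without `stratify`, a small test set can easily end up with too few (or zero) minority-class samples, skewing the evaluation.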

Time series special case

For time-dependent data, don't shuffle: a random split lets the model train on the future.

Split chronologically instead:

  • train on past
  • validate on near-future
  • test on future
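A minimal chronological-split sketch, assuming the data is already sorted by time (the array below stands in for a real time-indexed series):

```python
import numpy as np

# Synthetic series ordered by time: index 0 is oldest, 99 is newest.
values = np.arange(100)
n = len(values)

# 70/15/15 split by position, with no shuffling: the model trains on
# the past, is tuned on the near-future, and is tested on the future.
train = values[: int(n * 0.70)]
val = values[int(n * 0.70) : int(n * 0.85)]
test = values[int(n * 0.85) :]
```

For rolling evaluation with multiple such splits, scikit-learn's `TimeSeriesSplit` generalizes this idea.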

Leakage checklist

Before you trust your results:

  • Did you fit imputers/scalers on train only?
  • Did you compute aggregates using future data?
  • Did you duplicate records across splits?
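The first checklist item (fitting preprocessors on train only) is easiest to guarantee with a Pipeline. A minimal sketch with a hypothetical scaler + classifier combination:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data standing in for your own features/labels.
X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# The Pipeline fits the scaler's mean/std on the training data only;
# score() then applies those same statistics to the test set instead
# of refitting them, so no test information leaks into preprocessing.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
```

Fitting the scaler on the full dataset before splitting would leak test-set statistics into training, quietly inflating the reported score.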

Mini-checkpoint

Write down:

  • what your test set represents in the real world
  • when you’re allowed to look at it (answer: ideally once)
