K-Fold Cross-Validation

Why cross-validation

A single train/validation split can be noisy.

Cross-validation averages performance across multiple splits.

K-fold CV

Steps:

  1. split data into K folds
  2. train on K-1 folds
  3. validate on the remaining fold
  4. repeat for all folds
  5. average the scores
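
The steps above can be sketched with an explicit loop. This is a minimal illustration, assuming `KFold` for the splits and a logistic regression on the iris dataset (both illustrative choices, not prescribed by the text):

```python
# Manual K-fold loop: split, train on K-1 folds, validate on the rest, average.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)  # example data so the snippet runs

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])               # train on K-1 folds
    scores.append(model.score(X[val_idx], y[val_idx]))  # validate on the held-out fold

print(sum(scores) / len(scores))  # average of the K fold scores
```

In practice `cross_val_score` wraps this loop for you, as shown later in this page.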

```mermaid
flowchart TD
  A[Data] --> B[Split into K folds]
  B --> C[Train on K-1]
  C --> D[Validate on 1]
  D --> E[Repeat K times]
  E --> F[Average score]
```

Scikit-learn example

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)  # example data so the snippet runs

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="accuracy")
print("mean:", scores.mean())
print("std:", scores.std())
```

Stratified CV

For classification, use stratified splits so each fold preserves class ratios.

Scikit-learn does this automatically: when you pass an integer `cv` and a classifier to `cross_val_score`, it uses stratified folds under the hood.
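
To make the stratification explicit, you can pass a `StratifiedKFold` splitter yourself. A minimal sketch, again assuming iris data and logistic regression for illustration:

```python
# Explicit stratified CV: each fold preserves the overall class ratios.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean())
```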

Time series warning

Don’t use random k-fold for time series: shuffled folds let the model train on future observations and validate on the past, which leaks information.

Use a time-series split instead, where each validation fold comes strictly after its training data.
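
Scikit-learn provides `TimeSeriesSplit` for this. A small sketch on synthetic data (the array here is illustrative):

```python
# TimeSeriesSplit: training indices always precede validation indices,
# so the model never trains on the future.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # stand-in for ordered time-series data
tscv = TimeSeriesSplit(n_splits=4)
for train_idx, val_idx in tscv.split(X):
    print(train_idx.max(), "<", val_idx.min())  # train always ends before validation starts
```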

Mini-checkpoint

When data is small:

  • prefer CV over a single train/validation split — the averaged score is less sensitive to which points happen to land in the validation set.

πŸ§ͺ Try It Yourself

Exercise 1 – Train-Test Split

Exercise 2 – Fit a Linear Model

Exercise 3 – Evaluate with MSE
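
If you want a starting point, here is one possible skeleton covering all three exercises on synthetic data (the data and variable names are illustrative, not part of the exercises):

```python
# Starter sketch: split (Ex. 1), fit a linear model (Ex. 2), score with MSE (Ex. 3).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)  # noisy linear target

# Exercise 1: hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Exercise 2: fit a linear model on the training split
model = LinearRegression().fit(X_train, y_train)

# Exercise 3: evaluate on the test split with mean squared error
mse = mean_squared_error(y_test, model.predict(X_test))
print(mse)
```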
