Train-Test Split Concepts
Why split data?
When you evaluate a model, you want to test on data it hasn’t seen.
- Train set: used to learn patterns
- Test set: used only for final evaluation
Scoring on held-out data approximates how the model will perform on new, real-world data.
The biggest danger: data leakage
Leakage happens when information from the test set influences training.
Examples:
- Scaling using mean/std computed on the full dataset
- Filling missing values using overall mean (including test)
- Feature engineering that uses future information
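As a minimal sketch of the first example, the snippet below contrasts a scaler fitted on the full dataset with one fitted on the training rows only. The data is synthetic and the variable names (leaky_scaler, scaler) are illustrative assumptions, not part of the original tutorial.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data, made up purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Leaky: the mean/std are computed from all rows, test rows included
leaky_scaler = StandardScaler().fit(X)

# Leak-free: the mean/std come from the training rows only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuses the training statistics

print("train-only mean:", scaler.mean_)
print("full-data mean: ", leaky_scaler.mean_)
```

The same fit-on-train, transform-on-test pattern should apply to imputers and other fitted preprocessing steps.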
Basic split with scikit-learn
train_test_split
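import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataset: two features and a binary target
X = pd.DataFrame({"age": [20, 21, 22, 23, 24], "score": [80, 85, 78, 90, 88]})
y = pd.Series([0, 0, 0, 1, 1])

# With stratify=y the test set needs at least one row per class,
# so 5 rows and 2 classes require test_size >= 0.4 (2 test rows).
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.4,
    random_state=42,  # fixed seed so the split is reproducible
    stratify=y,
)
print(X_train)
print(X_test)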
Stratification
If your target classes are imbalanced, use stratify=y so the train and test sets have a similar class distribution.
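A small sketch of the effect, assuming a made-up target with 10% positives; the placeholder feature X carries no meaning and exists only so the call has something to split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced target: 90 negatives, 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)  # placeholder feature

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Both shares should match the 10% positive rate of the full data
print("positive share in train:", y_train.mean())
print("positive share in test: ", y_test.mean())
```

Without stratify=y, a small test set drawn from this data could end up with very few positives, or none at all.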
Time-series splits
For time series, you usually do not shuffle. You train on the past and test on the future, so the model never sees data from after the period it is evaluated on.
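One way to sketch this with scikit-learn, assuming the rows are already sorted from oldest to newest: a single chronological holdout via shuffle=False, and several expanding-window folds via TimeSeriesSplit. The ten-row series is invented for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, train_test_split

# Made-up series: 10 observations already sorted from oldest to newest
X = np.arange(10).reshape(-1, 1)
y = np.arange(10)

# Single holdout: shuffle=False keeps the order, so the last 20% is the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)
print("train rows:", X_train.ravel(), "test rows:", X_test.ravel())

# Expanding-window folds: each fold tests on rows that come after its training rows
tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```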
Good practice
- Keep a final test set untouched.
- Use cross-validation on training data for tuning.
- Put preprocessing inside a pipeline.
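The three practices above can be combined roughly as follows. This is a sketch under stated assumptions: the data is synthetic, and the model (LogisticRegression), the step names "scale" and "clf", and the grid of C values are illustrative choices rather than recommendations from the original text.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, generated only for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 1. Set the final test set aside before any tuning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2. Preprocessing lives inside the pipeline, so each CV fold refits the
#    scaler on that fold's training portion only
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 3. Tune with cross-validation on the training data only
search = GridSearchCV(pipe, param_grid={"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# 4. Score once on the untouched test set
test_accuracy = search.score(X_test, y_test)
print("best C:", search.best_params_["clf__C"])
print("test accuracy:", test_accuracy)
```

Because the scaler is a pipeline step, it is never fitted on validation or test rows, which avoids the leakage cases listed earlier.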
