Handling Missing Values (Imputation)
Why missing values happen
Missing values aren’t just “bad data”. They can be meaningful:
- user didn’t answer a question
- sensor failed
- value never existed
- collection pipeline bug
First: understand why the data is missing.
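A quick first step is simply to measure missingness per column. A minimal sketch on a toy DataFrame (the column names here are hypothetical stand-ins for your own data):

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns
df = pd.DataFrame({
    "age": [25, np.nan, 31, np.nan],
    "city": ["Oslo", "Bergen", None, "Oslo"],
})

# Count and fraction of missing values per column
missing_count = df.isna().sum()
missing_frac = df.isna().mean()
```

Looking at these numbers per column (and per segment of your data) is often enough to reveal whether missingness is random or systematic.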
Common strategies
1) Drop rows/columns
- Drop rows if missingness is small and random.
- Drop columns if a feature is mostly missing.
Risk: you may remove signal or introduce bias.
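A minimal sketch of both drops, assuming a toy DataFrame and a hypothetical 50% column threshold (the threshold is a judgment call, not a rule):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],              # 25% missing: maybe keep and impute
    "notes": [np.nan, np.nan, np.nan, "ok"],  # 75% missing: candidate to drop
})

# Drop columns that are mostly missing
mostly_missing = df.columns[df.isna().mean() > 0.5]
df_cols_dropped = df.drop(columns=mostly_missing)

# Drop rows with any remaining missing values
df_rows_dropped = df_cols_dropped.dropna()
```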
2) Simple imputation
- numeric: mean/median
- categorical: most frequent
- custom: constant (e.g. "Unknown")
Median is often more robust than mean (less sensitive to outliers).
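A small sketch of that robustness point, using made-up income values with one extreme outlier:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# One extreme value skews the mean but barely moves the median
income = pd.Series([30_000, 35_000, 40_000, np.nan, 1_000_000])

mean_filled = income.fillna(income.mean())      # fills with 276,250
median_filled = income.fillna(income.median())  # fills with 37,500

# scikit-learn equivalent (expects a 2-D array)
imp = SimpleImputer(strategy="median")
filled = imp.fit_transform(income.to_frame())
```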
3) Advanced imputation
- KNN Imputer
- Iterative Imputer (model-based)
Both are useful, but they are slower than simple imputation and can leak information from validation/test data if fitted before the split.
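A minimal KNN imputation sketch on toy numeric data: each missing entry is filled using the nearest complete rows, measured on the features both rows have observed.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [np.nan, 6.0],
    [8.0, 8.0],
])

# Fill each missing entry with the mean of that feature
# over the n_neighbors closest rows
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```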
The scikit-learn way (safe)
Use SimpleImputer inside a Pipeline and fit it only on the training data.
Impute numeric and categorical columns

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

# Example columns
numeric_features = ["age", "income"]
categorical_features = ["city", "job"]

numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
    ]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

clf = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("model", LogisticRegression(max_iter=1000)),
    ]
)

Pitfalls and best practices
- Don’t impute using info from test/validation sets.
- Missingness can be predictive. Sometimes you should add a feature:
is_age_missing = age.isna()
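A minimal sketch of that indicator idea on toy data. SimpleImputer's add_indicator flag does the same thing automatically, appending one indicator column per feature that had missing values:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25.0, np.nan, 31.0, np.nan]})

# Plain pandas: keep the missingness as its own boolean feature
df["age_missing"] = df["age"].isna()

# scikit-learn: output is [imputed age, missing indicator]
imp = SimpleImputer(strategy="median", add_indicator=True)
out = imp.fit_transform(df[["age"]])
```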
Mini-checkpoint
For each feature with missing values:
- decide whether to drop, impute, or engineer a missing-indicator
- justify your decision (business + statistical reason)