Handling Missing Values (Imputation)

Why missing values happen

Missing values aren’t just “bad data”. They can be meaningful:

  • user didn’t answer a question
  • sensor failed
  • value never existed
  • collection pipeline bug

First: understand why the data is missing.

Common strategies

1) Drop rows/columns

  • Drop rows if missingness is small and random.
  • Drop columns if a feature is mostly missing.

Risk: you may remove signal or introduce bias.
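As a quick sketch of both drop strategies, here is a toy DataFrame (the columns and values are made up for illustration) with one fully observed row and one mostly-missing column:

```python
import numpy as np
import pandas as pd

# Hypothetical data: "notes" is 75% missing, most rows have some gap
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50000, 62000, np.nan, 58000],
    "notes": [np.nan, np.nan, np.nan, "ok"],
})

# Drop rows that have any missing value
rows_kept = df.dropna()

# Drop columns where more than half the values are missing
cols_kept = df.loc[:, df.isna().mean() <= 0.5]

print(rows_kept.shape[0])            # 1 — only the fully observed row survives
print(cols_kept.columns.tolist())    # ['age', 'income'] — "notes" is dropped
```

Note how aggressive row-wise dropping is here: a single missing cell anywhere discards the whole row, which is exactly how it removes signal and can bias the sample.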

2) Simple imputation

  • numeric: mean/median
  • categorical: most frequent
  • custom: constant (e.g. "Unknown")

Median is often more robust than mean (less sensitive to outliers).
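You can see the outlier sensitivity directly with SimpleImputer on a made-up column containing one extreme value:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical column: small values, one outlier, one missing entry
X = np.array([[1.0], [2.0], [3.0], [1000.0], [np.nan]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(X)
median_filled = SimpleImputer(strategy="median").fit_transform(X)

print(mean_filled[-1, 0])    # 251.5 — the fill value is dragged up by the outlier
print(median_filled[-1, 0])  # 2.5  — the median barely notices it
```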

3) Advanced imputation

  • KNN Imputer
  • Iterative Imputer (model-based)

Useful, but slower, and they can leak information if fitted on data that includes your test set.
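A minimal KNNImputer sketch on a tiny made-up matrix: the missing cell is filled with the average of that feature over the nearest rows, measured on the observed columns:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical feature matrix with one gap in column 0
X = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [np.nan, 6.0],
    [8.0, 8.0],
])

# Distances to row 2 use only column 1 (its observed feature),
# so its nearest 2 neighbours are rows 1 and 3
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled[2, 0])  # 5.5 — the mean of 3.0 and 8.0
```

Like any imputer, this must be fit on training data only; with KNN the fitted "state" is the training rows themselves.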

The scikit-learn way (safe)

Use SimpleImputer inside a Pipeline and fit it only on the training set.

Impute numeric and categorical columns
import numpy as np
import pandas as pd
 
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
 
# Example columns
numeric_features = ["age", "income"]
categorical_features = ["city", "job"]
 
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
    ]
)
 
categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)
 
preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)
 
clf = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("model", LogisticRegression(max_iter=1000)),
    ]
)

Pitfalls and best practices

  • Don’t impute using info from test/validation sets.
  • Missingness can be predictive. Sometimes you should add a feature:
    • is_age_missing = age.isna()
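A short sketch of both ways to keep the missingness signal, on a made-up age column: the pandas boolean feature, and SimpleImputer's add_indicator flag, which appends a 0/1 missingness column next to the imputed values:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

age = pd.Series([25.0, np.nan, 40.0, np.nan])

# pandas version: an explicit boolean feature
is_age_missing = age.isna()

# scikit-learn version: impute and append the indicator in one step
imp = SimpleImputer(strategy="median", add_indicator=True)
out = imp.fit_transform(age.to_frame())

print(out)
# column 0: imputed ages (median 32.5 fills the gaps)
# column 1: 1.0 where age was missing, 0.0 otherwise
```

The indicator column lets the model learn from the fact that a value was missing, independently of the fill value.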

Mini-checkpoint

For each feature with missing values:

  • decide whether to drop, impute, or engineer a missing-indicator
  • justify your decision (business + statistical reason)
