Handling Missing Values (Imputation)

Why missing values happen

Missing values aren’t just “bad data”. They can be meaningful:

  • user didn’t answer a question
  • sensor failed
  • value never existed
  • collection pipeline bug

First: understand why the data is missing.

Common strategies

1) Drop rows/columns

  • Drop rows if missingness is small and random.
  • Drop columns if a feature is mostly missing.

Risk: you may remove signal or introduce bias.
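As a quick sketch of both drop strategies, here is a toy DataFrame (the columns and values are made up for illustration) with one fully observed row and one mostly-missing column:

```python
import numpy as np
import pandas as pd

# Hypothetical data: "notes" is 75% missing, most rows have some gap
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50000, 62000, np.nan, 58000],
    "notes": [np.nan, np.nan, np.nan, "ok"],
})

# Drop rows that have any missing value
rows_kept = df.dropna()

# Drop columns where more than half the values are missing
cols_kept = df.loc[:, df.isna().mean() <= 0.5]

print(rows_kept.shape[0])            # 1 — only the fully observed row survives
print(cols_kept.columns.tolist())    # ['age', 'income'] — "notes" is dropped
```

Note how aggressive row-wise dropping is here: a single missing cell anywhere discards the whole row, which is exactly how it removes signal and can bias the sample.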

2) Simple imputation

  • numeric: mean/median
  • categorical: most frequent
  • custom: constant (e.g. "Unknown")

Median is often more robust than mean (less sensitive to outliers).
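You can see the outlier sensitivity directly with SimpleImputer on a made-up column containing one extreme value:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical column: small values, one outlier, one missing entry
X = np.array([[1.0], [2.0], [3.0], [1000.0], [np.nan]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(X)
median_filled = SimpleImputer(strategy="median").fit_transform(X)

print(mean_filled[-1, 0])    # 251.5 — the fill value is dragged up by the outlier
print(median_filled[-1, 0])  # 2.5  — the median barely notices it
```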

3) Advanced imputation

  • KNN Imputer
  • Iterative Imputer (model-based)

Useful, but slower, and they can leak information if fitted on data that includes your test set.
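A minimal KNNImputer sketch on a tiny made-up matrix: the missing cell is filled with the average of that feature over the nearest rows, measured on the observed columns:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical feature matrix with one gap in column 0
X = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [np.nan, 6.0],
    [8.0, 8.0],
])

# Distances to row 2 use only column 1 (its observed feature),
# so its nearest 2 neighbours are rows 1 and 3
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled[2, 0])  # 5.5 — the mean of 3.0 and 8.0
```

Like any imputer, this must be fit on training data only; with KNN the fitted "state" is the training rows themselves.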

The scikit-learn way (safe)

Use SimpleImputer inside a Pipeline and fit it only on the training set.

Impute numeric and categorical columns
import numpy as np
import pandas as pd
 
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
 
# Example columns
numeric_features = ["age", "income"]
categorical_features = ["city", "job"]
 
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
    ]
)
 
categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)
 
preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)
 
clf = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("model", LogisticRegression(max_iter=1000)),
    ]
)

Pitfalls and best practices

  • Don’t impute using info from test/validation sets.
  • Missingness can be predictive. Sometimes you should add a feature:
    • is_age_missing = age.isna()
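A short sketch of both ways to keep the missingness signal, on a made-up age column: the pandas boolean feature, and SimpleImputer's add_indicator flag, which appends a 0/1 missingness column next to the imputed values:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

age = pd.Series([25.0, np.nan, 40.0, np.nan])

# pandas version: an explicit boolean feature
is_age_missing = age.isna()

# scikit-learn version: impute and append the indicator in one step
imp = SimpleImputer(strategy="median", add_indicator=True)
out = imp.fit_transform(age.to_frame())

print(out)
# column 0: imputed ages (median 32.5 fills the gaps)
# column 1: 1.0 where age was missing, 0.0 otherwise
```

The indicator column lets the model learn from the fact that a value was missing, independently of the fill value.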

Mini-checkpoint

For each feature with missing values:

  • decide whether to drop, impute, or engineer a missing-indicator
  • justify your decision (business + statistical reason)
