Handling Outliers
Outliers are not always errors
Outliers could be:
- True rare events (high-value orders)
- Measurement/unit problems
- Recording errors
- Different segment (VIP customers)
So the first step is always: investigate.
Common strategies
1) Investigate and correct (best)
- Check the source system
- Confirm units
- Validate against business rules
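As a rough illustration of validating against business rules, the sketch below flags rows that break two assumed rules instead of dropping them. The data, the column names, and the rules themselves (positive amounts, a known currency code) are hypothetical; substitute whatever your source system actually guarantees.

```python
import pandas as pd

# Hypothetical sample data; column names and values are for illustration only.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "amount": [120.0, -15.0, 98000.0, 45.5],
    "currency": ["USD", "USD", "usd_cents", "USD"],
})

# Assumed business rules: amounts must be positive, currency must be a known code.
KNOWN_CURRENCIES = {"USD", "EUR"}
orders["rule_positive_amount"] = orders["amount"] > 0
orders["rule_known_currency"] = orders["currency"].isin(KNOWN_CURRENCIES)

# Rows failing any rule go to manual review instead of being silently dropped.
rule_cols = ["rule_positive_amount", "rule_known_currency"]
needs_review = orders[~orders[rule_cols].all(axis=1)]
print(needs_review[["order_id", "amount", "currency"]])
```

Keeping one boolean column per rule makes it easy to see which rule a row failed, which helps when you go back to the source system.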
2) Remove outliers (use carefully)
Remove via IQR bounds
clean = df[(df["amount"] >= lower) & (df["amount"] <= upper)].copy()
print(clean)
Removing is risky when you report business totals.
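The filter above relies on lower and upper already being defined. One common way to get them, assumed here, is Tukey's rule: 1.5 times the IQR beyond the first and third quartiles. The sketch below uses made-up data and also compares totals before and after removal, which is where the risk for reporting shows up.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"amount": [10, 12, 11, 13, 12, 14, 11, 500]})

# Tukey's rule (assumed convention): bounds sit 1.5 * IQR beyond Q1 and Q3.
q1 = df["amount"].quantile(0.25)
q3 = df["amount"].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

clean = df[(df["amount"] >= lower) & (df["amount"] <= upper)].copy()

# Compare totals before and after to see what removal does to a reported sum.
print("rows removed:", len(df) - len(clean))
print("total before:", df["amount"].sum())
print("total after: ", clean["amount"].sum())
```

In this toy data a single removed row accounts for most of the total, so a revenue report built on clean would differ sharply from one built on df.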
3) Cap / winsorize
Capping keeps all rows but limits extreme values.
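The cap limits can come from IQR bounds or from percentiles. A minimal sketch of the percentile variant follows, assuming the 5th and 95th percentiles as limits; the data and the names sales, p_low, and p_high are hypothetical.

```python
import pandas as pd

# Hypothetical sample data for illustration.
sales = pd.DataFrame({"amount": [5, 20, 22, 25, 21, 23, 24, 900]})

# Assumed cap limits: the 5th and 95th percentiles of the column.
p_low = sales["amount"].quantile(0.05)
p_high = sales["amount"].quantile(0.95)

# clip() replaces values outside [p_low, p_high] with the nearest limit.
sales["amount_capped"] = sales["amount"].clip(lower=p_low, upper=p_high)

print(sales)
print("rows kept:", len(sales))
```

On very small samples the upper percentile is itself pulled up by the extreme value, so percentile caps tend to work better on larger datasets.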
Cap values
df["amount_capped"] = df["amount"].clip(lower, upper)
4) Transform (e.g., log)
Useful when values span many orders of magnitude.
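To make the compression concrete, here is a small sketch with made-up amounts ranging from 0 to roughly a million. It also shows np.expm1, the inverse of np.log1p, for converting results back to original units.

```python
import numpy as np
import pandas as pd

# Hypothetical amounts spanning several orders of magnitude, including zero.
amounts = pd.Series([0, 9, 99, 9_999, 999_999], name="amount")

# log1p(x) = log(1 + x), so zero maps to zero instead of -inf.
amount_log = np.log1p(amounts)

# expm1 is the inverse of log1p; use it to return to original units.
restored = np.expm1(amount_log)

print(amount_log.round(2).tolist())
print(np.allclose(restored, amounts))
```

Note that log1p is only defined for values greater than -1, so negative amounts (refunds, for example) need separate handling before this transform.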
Log transform
import numpy as np
df["amount_log"] = np.log1p(df["amount"])  # log(1 + x) handles zeros
Which strategy should you pick?
- Reporting metrics → investigate, maybe cap
- ML features → cap or transform often helps
- Fraud/anomaly detection → keep outliers (they may be the signal)
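For the fraud/anomaly case, one option is to flag outliers rather than remove or cap them. The sketch below is an assumption-laden example: the data and the tx and is_outlier names are hypothetical, and the detection rule is an IQR rule with the conventional 1.5 multiplier.

```python
import pandas as pd

# Hypothetical transactions for illustration.
tx = pd.DataFrame({"amount": [30, 28, 35, 31, 29, 33, 32, 4000]})

# IQR-based detection rule (the 1.5 multiplier is an assumed convention).
q1, q3 = tx["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
tx["is_outlier"] = ~tx["amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# All rows are kept; the flag can feed a review queue or serve as a model feature.
print(tx[tx["is_outlier"]])
```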
Always document
- detection rule
- chosen handling method
- expected impact
