Handling Outliers

Outliers are not always errors

Outliers could be:

  • True rare events (high-value orders)
  • Measurement/unit problems
  • Recording errors
  • A different segment (e.g., VIP customers)

So the first step is always: investigate.

Common strategies

1) Investigate and correct (best)

  • Check the source system
  • Confirm units
  • Validate against business rules
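One way to start the investigation is to flag suspicious rows without changing them, so they can be traced back to the source system. The sketch below is only an illustration: the `orders` frame, its values, and the `MAX_PLAUSIBLE_AMOUNT` limit are all made up, and a real business rule would come from the people who own the data.

```python
import pandas as pd

# Hypothetical sample data; names and values are illustrative only.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "amount": [120.0, 95.0, -40.0, 250000.0, 110.0],
})

MAX_PLAUSIBLE_AMOUNT = 10_000  # assumed business rule, not a real threshold

# Flag rows that break the rule instead of deleting or editing them.
orders["violates_rule"] = (orders["amount"] <= 0) | (
    orders["amount"] > MAX_PLAUSIBLE_AMOUNT
)

# Rows to review against the source system (units, typos, refunds, ...).
to_review = orders[orders["violates_rule"]]
print(to_review)
```

Keeping the flag as a column means the original values stay available while the review is in progress.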

2) Remove outliers (use carefully)

Remove via IQR bounds
# Assumes lower and upper are the IQR fences (Q1 - 1.5*IQR, Q3 + 1.5*IQR) computed beforehand
clean = df[(df["amount"] >= lower) & (df["amount"] <= upper)].copy()
print(clean)

Removing rows is risky when you report business totals: the dropped rows no longer count toward sums such as revenue.
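To see how much a total can move, here is a small self-contained sketch. The data is invented, and the bounds use the common 1.5 × IQR rule, which is an assumption about how `lower` and `upper` were obtained.

```python
import pandas as pd

# Hypothetical data: one very large order among ordinary ones.
df = pd.DataFrame({"amount": [100.0, 120.0, 90.0, 110.0, 105.0, 5000.0]})

# IQR fences: 1.5 * IQR beyond the first and third quartiles.
q1 = df["amount"].quantile(0.25)
q3 = df["amount"].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

clean = df[(df["amount"] >= lower) & (df["amount"] <= upper)].copy()

# Compare totals before and after removal.
total_before = df["amount"].sum()
total_after = clean["amount"].sum()
print(f"rows removed: {len(df) - len(clean)}")
print(f"total before: {total_before}, total after: {total_after}")
```

With this sample, only the 5000 order is dropped, yet the total falls from 5525 to 525. If that order was real, the report now understates revenue.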

3) Cap / winsorize

Capping keeps every row but replaces values beyond the bounds with the bound itself.

Cap values
# Values below lower become lower; values above upper become upper
df["amount_capped"] = df["amount"].clip(lower=lower, upper=upper)
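The bounds passed to `clip` can come from any rule. A common variant caps at percentiles instead of IQR fences. The sketch below assumes the 1st and 99th percentiles as cutoffs and uses invented data; both are illustrative choices, not recommendations.

```python
import pandas as pd

# Hypothetical data: amounts 1..99 plus one extreme value.
df = pd.DataFrame({"amount": [float(x) for x in range(1, 100)] + [10_000.0]})

# Assumed cutoffs: 1st and 99th percentiles (choose what suits your data).
p_low = df["amount"].quantile(0.01)
p_high = df["amount"].quantile(0.99)

df["amount_capped"] = df["amount"].clip(lower=p_low, upper=p_high)

# Row count is unchanged; only the extremes moved.
print(len(df), df["amount"].max(), df["amount_capped"].max())
```

Unlike removal, the extreme order still contributes to counts and, in reduced form, to sums.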

4) Transform (e.g., log)

Useful when values span many orders of magnitude.

Log transform
import numpy as np

# log1p computes log(1 + x): defined at x = 0, but requires x > -1
df["amount_log"] = np.log1p(df["amount"])
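If results need to be reported in the original units, the transform can be undone with `np.expm1`, the inverse of `np.log1p`. A minimal sketch on invented, non-negative data:

```python
import numpy as np
import pandas as pd

# Hypothetical amounts spanning several orders of magnitude.
df = pd.DataFrame({"amount": [0.0, 9.0, 99.0, 9999.0]})

# Compress the scale: log(1 + x).
df["amount_log"] = np.log1p(df["amount"])

# Map back to the original units: exp(y) - 1.
df["amount_back"] = np.expm1(df["amount_log"])

print(df)
```

Negative amounts such as refunds need separate handling, because `log1p` returns NaN for values below -1.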

Which strategy should you pick?

  • Reporting metrics → investigate, maybe cap
  • ML features → cap or transform often helps
  • Fraud/anomaly detection → keep outliers (they may be the signal)

Always document

  • detection rule
  • chosen handling method
  • expected impact
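The three items above can be recorded next to the code that applies them. This is one possible shape, not a standard: the `outlier_log` dict and its keys are hypothetical, and the data is invented.

```python
import pandas as pd

# Hypothetical data: one very large order among ordinary ones.
df = pd.DataFrame({"amount": [100.0, 120.0, 90.0, 110.0, 105.0, 5000.0]})

# Detection rule assumed here: IQR fences at 1.5 * IQR.
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Handling method assumed here: capping.
capped = df["amount"].clip(lower=lower, upper=upper)

# Record what was done and what it changed.
outlier_log = {
    "detection_rule": "IQR fences (1.5 * IQR)",
    "handling_method": "cap",
    "rows_affected": int((df["amount"] != capped).sum()),
    "mean_before": float(df["amount"].mean()),
    "mean_after": float(capped.mean()),
}
print(outlier_log)
```

Storing the before and after statistics makes the impact easy to review later.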
