Handling Imbalanced Datasets (SMOTE)
What is an imbalanced dataset?
An imbalanced dataset has classes with very different frequencies.
Example: fraud detection
- 99.5% normal
- 0.5% fraud
If you always predict “normal”, you get 99.5% accuracy… but the model is useless.
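A quick sketch of the trap, using made-up toy labels (0 = normal, 1 = fraud):

```python
# Hypothetical data: 995 normal (0) and 5 fraud (1) labels.
y_true = [0] * 995 + [1] * 5
y_pred = [0] * 1000  # a "model" that always predicts normal

# Accuracy looks great...
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# ...but recall on the fraud class (fraction of frauds caught) is zero.
recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / 5

print(accuracy)  # 0.995
print(recall)    # 0.0
```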
Why accuracy becomes misleading
With imbalance, you should prioritize:
- precision
- recall
- F1-score
- PR-AUC
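The first three metrics fall straight out of the confusion-matrix counts. A minimal sketch, with made-up tp/fp/fn values for illustration:

```python
# Precision, recall, and F1 from raw confusion-matrix counts.
def prf1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: model flags 8 frauds, 6 correctly (tp=6, fp=2), misses 4 (fn=4).
p, r, f = prf1(tp=6, fp=2, fn=4)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.75 0.6 0.667
```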
Strategies to handle imbalance
1) Use better metrics
Always start here.
2) Use class weights
Many models support weighting the minority class more.
Example: Logistic Regression
class_weight="balanced"
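scikit-learn documents the "balanced" mode as weighting each class by n_samples / (n_classes * n_c). A sketch of that heuristic in plain Python (check your library's docs for its exact behavior):

```python
from collections import Counter

def balanced_weights(y):
    """Mimic the 'balanced' heuristic: w_c = n_samples / (n_classes * n_c).
    Rare classes get large weights, common classes get small ones."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {c: n / (k * n_c) for c, n_c in counts.items()}

# Toy 95/5 split: the minority class is weighted ~19x more heavily.
y = [0] * 95 + [1] * 5
print(balanced_weights(y))  # {0: 0.526..., 1: 10.0}
```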
3) Resampling
- Undersampling majority class (risk: lose information)
- Oversampling minority class (risk: overfit)
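Naive random oversampling is just duplicating minority rows until the classes balance — which is also why it overfits (the model sees exact copies). A sketch with hypothetical toy data:

```python
import random

def random_oversample(X, y, minority=1, seed=0):
    """Duplicate minority rows until classes are balanced (naive sketch)."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    majority_n = len(y) - len(minority_idx)
    # Sample (with replacement) enough minority indices to close the gap.
    extra = [rng.choice(minority_idx)
             for _ in range(majority_n - len(minority_idx))]
    return X + [X[i] for i in extra], y + [minority] * len(extra)

X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2
X2, y2 = random_oversample(X, y)
print(y2.count(0), y2.count(1))  # 8 8
```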
4) SMOTE (Synthetic Minority Over-sampling Technique)
SMOTE creates synthetic minority samples by interpolating between existing ones.
```mermaid
flowchart LR
    A[Minority sample i] --> C[Synthetic sample]
    B[Nearest minority neighbor] --> C
```
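The core idea fits in a few lines of NumPy: pick a minority point, pick one of its k nearest minority neighbors, and interpolate between them. This is a simplified sketch, not the full algorithm — in practice, use imbalanced-learn's SMOTE:

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating each chosen minority
    sample toward one of its k nearest minority neighbors (SMOTE-style sketch)."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        x = X_min[i]
        d = np.linalg.norm(X_min - x, axis=1)  # distances to all minority points
        d[i] = np.inf                          # exclude the point itself
        neighbors = np.argsort(d)[:k]          # k nearest minority neighbors
        nn = X_min[rng.choice(neighbors)]
        lam = rng.random()                     # interpolation factor in [0, 1)
        synthetic.append(x + lam * (nn - x))   # point on the segment x -> nn
    return np.array(synthetic)

# Toy minority class: four corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new = smote_like(X_min, n_new=5)
print(new.shape)  # (5, 2)
```

Every synthetic point lies on a segment between two real minority samples, so the new data stays inside the minority class's local geometry.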
Important warning: avoid leakage with SMOTE
SMOTE must be applied only to the training set.
Best practice:
- apply SMOTE inside a cross-validation pipeline, so resampling happens only on each training fold — never on validation or test data
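A minimal sketch of the right ordering, using naive duplication in place of SMOTE and made-up toy data:

```python
import random

# Hypothetical toy data: every 5th sample is the minority class (fraud = 1).
X = [[float(i)] for i in range(20)]
y = [1 if i % 5 == 0 else 0 for i in range(20)]

# Correct order: split FIRST (last 30% held out here for simplicity),
# THEN resample only the training portion.
cut = int(len(y) * 0.7)
X_tr, y_tr = X[:cut], y[:cut]
X_te, y_te = X[cut:], y[cut:]

rng = random.Random(0)
minority = [i for i, label in enumerate(y_tr) if label == 1]
while y_tr.count(1) < y_tr.count(0):   # naive oversampling, train set only
    j = rng.choice(minority)
    X_tr.append(X_tr[j])
    y_tr.append(1)

print(y_tr.count(0), y_tr.count(1))    # training set is now balanced: 11 11
print(y_te.count(1))                   # test distribution untouched: 1
```

In practice, imbalanced-learn's Pipeline does this bookkeeping for you: it refits the resampler on each training fold during cross-validation, so synthetic samples never leak into the evaluation folds.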
Practical note
SMOTE is commonly used via imblearn (the imbalanced-learn library). If you don’t want extra dependencies, start with class weights.
Mini-checkpoint
Given a dataset with 95/5 split:
- which metric would you report?
- what might be the business cost of false negatives vs false positives?
