Handling Categorical Data (Label & One-Hot Encoding)

Why categorical encoding exists

Most ML models expect numeric inputs.

But many real features are categories:

Encoding converts these into numbers without destroying meaning.

Label encoding maps categories to integers:

This is appropriate when:

For nominal categories (no order), label encoding introduces a fake numeric order.

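Models that read feature values as magnitudes, such as linear models or distance-based models like k-nearest neighbours, may incorrectly treat “Green (2)” as “bigger” than “Red (0)”.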
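To make the fake-order problem concrete, here is a minimal sketch in plain Python. The colour column is hypothetical, and "Blue" is an assumed third category that is not named in the text above. Codes are assigned in order of first appearance, which is only one of several possible conventions.

```python
# Hypothetical colour column; "Blue" is an assumed third category.
colours = ["Red", "Blue", "Green", "Red"]

# Assign integer codes in order of first appearance: Red=0, Blue=1, Green=2.
mapping = {}
for colour in colours:
    mapping.setdefault(colour, len(mapping))

encoded = [mapping[c] for c in colours]
print(mapping)   # {'Red': 0, 'Blue': 1, 'Green': 2}
print(encoded)   # [0, 1, 2, 0]

# The codes now support arithmetic that has no meaning for colours:
# the "average" of Red (0) and Green (2) equals the code for Blue (1).
midpoint = (mapping["Red"] + mapping["Green"]) / 2
print(midpoint == mapping["Blue"])  # True
```

A model that averages or compares these codes would be using relationships that exist only because of how the integers were assigned.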

One-hot creates one column per category:

For each row, exactly one of these columns is 1 and the others are 0.


  flowchart TD
  A[city = "Mumbai"] --> B[city_Delhi=0]
  A --> C[city_Mumbai=1]
  A --> D[city_Pune=0]

Use OneHotEncoder(handle_unknown="ignore") so that categories not seen during fitting are encoded as all zeros instead of raising an error in production.

OneHotEncoder with unknown handling

from sklearn.preprocessing import OneHotEncoder

# Categories not seen during fit() are encoded as all zeros instead of raising an error.
enc = OneHotEncoder(handle_unknown="ignore")
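A short sketch of what this setting changes in practice. The city names are illustrative, and "Chennai" stands in for any category that was absent from the training data. The second encoder uses the default behaviour for comparison.

```python
from sklearn.preprocessing import OneHotEncoder

train = [["Delhi"], ["Mumbai"], ["Pune"]]  # hypothetical training categories

# With handle_unknown="ignore", an unseen category becomes an all-zero row.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)
unseen = enc.transform([["Chennai"]]).toarray()
print(unseen)

# With the default setting, the same input raises a ValueError.
strict = OneHotEncoder()
strict.fit(train)
raised = False
try:
    strict.transform([["Chennai"]])
except ValueError as err:
    raised = True
    print("default encoder raised:", err)
```

An all-zero row keeps the pipeline running, but the model then sees no information about that category, so unknown values are still worth monitoring.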

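Some features have thousands of categories (e.g., user_id, product_id). One-hot encoding such a feature creates one column per category, so the feature matrix can grow to thousands of mostly-zero columns.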
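To get a feel for the size problem, the sketch below one-hot encodes a made-up user_id column. The 5,000 ids and their "user_N" format are assumptions chosen for illustration only.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical high-cardinality column: 5,000 distinct user ids, one row each.
user_ids = [[f"user_{i}"] for i in range(5000)]

encoded = OneHotEncoder(handle_unknown="ignore").fit_transform(user_ids)

n_rows, n_cols = encoded.shape
print(n_rows, n_cols)    # one column per distinct id
print(encoded.nnz)       # non-zero entries: one per row
print(n_rows * n_cols)   # cells a dense array of the same shape would hold
```

scikit-learn returns a sparse matrix by default, which stores only the non-zero entries. Converting it to a dense array would allocate every cell, almost all of them zero.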

Options:

Pick a categorical column and answer:
