
Handling Categorical Data (Label & One-Hot Encoding)

Why categorical encoding exists

Most ML models expect numeric inputs.

But many real features are categories:

  • city: Delhi, Mumbai, Pune
  • plan: Free, Pro, Enterprise
  • browser: Chrome, Firefox

Encoding converts these categories into numbers while preserving what they mean: which values are distinct, and whether they have an order.

Label encoding

When it’s okay

Label encoding maps categories to integers:

  • Red → 0
  • Blue → 1
  • Green → 2

This is appropriate when:

  • the category is ordinal (has a natural order)
    • e.g. small < medium < large
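A minimal sketch of encoding an ordinal feature, assuming scikit-learn's OrdinalEncoder and a hypothetical "size" column. The order is passed explicitly through the categories parameter so the integers follow small < medium < large rather than alphabetical order.

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical "size" column (one feature, four rows); values are illustrative.
sizes = [["medium"], ["small"], ["large"], ["small"]]

# State the order explicitly so that small < medium < large maps to 0 < 1 < 2.
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
encoded = encoder.fit_transform(sizes)

# fit_transform returns a 2-D float array; flatten it for display.
size_codes = encoded.ravel().tolist()
print(size_codes)  # [1.0, 0.0, 2.0, 0.0]
```

Without the explicit categories argument, the encoder would sort the values alphabetically, which would put "large" before "medium".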

When it’s dangerous

For nominal categories (no order), label encoding introduces a fake numeric order.

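Models that rely on feature magnitudes, such as linear models or distance-based methods like k-nearest neighbors, may incorrectly treat “Green (2)” as “bigger” than “Red (0)”, or treat Blue as lying halfway between them.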
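The fake order can be made concrete with a few lines of plain Python. This sketch reuses the Red/Blue/Green mapping from the label encoding example; the code_distance helper is hypothetical and only mimics how a distance-based model would compare two codes.

```python
# Mapping from the label encoding example; the integers are arbitrary labels.
color_codes = {"Red": 0, "Blue": 1, "Green": 2}


def code_distance(a: str, b: str) -> int:
    """Absolute difference between two label codes."""
    return abs(color_codes[a] - color_codes[b])


red_blue = code_distance("Red", "Blue")
red_green = code_distance("Red", "Green")

# Red looks twice as far from Green as from Blue, which says nothing
# about the colors themselves, only about the arbitrary numbering.
print(red_blue, red_green)  # 1 2
```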

One-hot encoding

One-hot creates one column per category:

  • city_Delhi
  • city_Mumbai
  • city_Pune

For each row, exactly one of these columns is 1 and the others are 0.



  flowchart TD
  A[city = "Mumbai"] --> B[city_Delhi=0]
  A --> C[city_Mumbai=1]
  A --> D[city_Pune=0]
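The same mapping can be sketched by hand in plain Python. The one_hot helper below is hypothetical and exists only to show the mechanism; in practice a library encoder would do this.

```python
# Known categories for the "city" feature, in a fixed order.
categories = ["Delhi", "Mumbai", "Pune"]


def one_hot(value: str, categories: list[str], prefix: str = "city") -> dict[str, int]:
    """Return one 0/1 column per category for a single value."""
    return {f"{prefix}_{c}": int(value == c) for c in categories}


row = one_hot("Mumbai", categories)
print(row)  # {'city_Delhi': 0, 'city_Mumbai': 1, 'city_Pune': 0}
```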


Scikit-learn: best-practice one-hot

Use OneHotEncoder(handle_unknown="ignore") so that categories not seen during fitting are encoded as all zeros instead of raising an error in production.

OneHotEncoder with unknown handling
from sklearn.preprocessing import OneHotEncoder
 
enc = OneHotEncoder(handle_unknown="ignore")
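A short usage sketch of what the unknown handling does in practice. The training cities come from the example above; "Chennai" is a hypothetical unseen value added here for illustration.

```python
from sklearn.preprocessing import OneHotEncoder

# Categories seen at training time (one feature, three rows).
train = [["Delhi"], ["Mumbai"], ["Pune"]]
# New data: "Chennai" (hypothetical) was never seen during fit.
new = [["Mumbai"], ["Chennai"]]

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)

# transform returns a sparse matrix by default; convert it for display.
encoded = enc.transform(new).toarray()
rows = encoded.tolist()

# Columns are ordered Delhi, Mumbai, Pune; the unknown city becomes all zeros.
print(rows)  # [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
```

With the default handle_unknown="error", the same transform call would raise an error on the unseen value.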

High-cardinality categories (important)

Some features have thousands of categories (e.g., user_id, product_id). One-hot encoding creates one column per category, so the feature matrix can become extremely wide and sparse.

Options:

  • frequency encoding or target encoding (careful: target encoding can leak the label if it is computed on the same rows it is applied to)
  • the hashing trick (feature hashing)
  • embeddings (deep learning)
  • group rare categories into “Other”
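Two of these options, grouping rare categories and frequency encoding, can be sketched with the standard library alone. The product_id values and the min_count threshold below are hypothetical; in a real pipeline the counts would typically be computed on training data only and reused on new data.

```python
from collections import Counter

# Hypothetical product_id column; values and threshold are illustrative.
product_ids = ["p1", "p2", "p1", "p3", "p1", "p2", "p4"]
min_count = 2

counts = Counter(product_ids)

# Group rare categories: anything seen fewer than min_count times becomes "Other".
grouped = [p if counts[p] >= min_count else "Other" for p in product_ids]

# Frequency encoding: replace each category with its share of the rows.
total = len(product_ids)
freq_encoded = [counts[p] / total for p in product_ids]

print(grouped)  # ['p1', 'p2', 'p1', 'Other', 'p1', 'p2', 'Other']
```

After grouping, the column has three distinct values instead of four, so a one-hot encoding of it needs fewer columns.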

Mini-checkpoint

Pick a categorical column and answer:

  • ordinal or nominal?
  • expected number of categories over time?
  • what happens if a new category appears tomorrow?
