Handling Categorical Data (Label & One-Hot Encoding)
Why categorical encoding exists
Most ML models expect numeric inputs.
But many real features are categories:
- city: Delhi, Mumbai, Pune
- plan: Free, Pro, Enterprise
- browser: Chrome, Firefox
Encoding converts these into numbers while preserving what the categories mean and without implying relationships, such as an order, that they don't have.
Label encoding
When it’s okay
Label encoding maps categories to integers:
- Red → 0
- Blue → 1
- Green → 2
This is appropriate when:
- the category is ordinal (has a natural order)
- e.g. small < medium < large
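For an ordinal feature, it helps to write the order down explicitly rather than rely on whatever order an encoder happens to assign (alphabetical sorting, for instance, would put large before small). A minimal sketch in plain Python, assuming a hypothetical size feature with the three levels from the example above:

```python
# Explicit mapping for an ordinal feature. The levels and their order
# are assumptions made for this illustration.
SIZE_ORDER = ["small", "medium", "large"]
size_to_int = {level: rank for rank, level in enumerate(SIZE_ORDER)}

sizes = ["medium", "small", "large", "small"]
encoded = [size_to_int[s] for s in sizes]

print(encoded)  # [1, 0, 2, 0]
```

In scikit-learn, OrdinalEncoder accepts a categories argument that serves the same purpose.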
When it’s dangerous
For nominal categories (no order), label encoding introduces a fake numeric order.
Models that use feature values as magnitudes, such as linear models and distance-based methods like k-NN, can incorrectly treat “Green (2)” as “bigger” than “Red (0)”.
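The fake order can be made concrete by comparing distances. The sketch below uses the Red/Blue/Green mapping from above and, for contrast, one indicator vector per color (the one-hot representation covered in the next section). The helper names are made up for this illustration.

```python
import math

# Label-encoded values for a nominal color feature (mapping from the text).
label = {"Red": 0, "Blue": 1, "Green": 2}

# One indicator vector per color, for comparison.
one_hot = {
    "Red": (1, 0, 0),
    "Blue": (0, 1, 0),
    "Green": (0, 0, 1),
}

def label_distance(a, b):
    """Absolute difference between two label-encoded categories."""
    return abs(label[a] - label[b])

def one_hot_distance(a, b):
    """Euclidean distance between two one-hot encoded categories."""
    return math.dist(one_hot[a], one_hot[b])

# Label encoding: Green looks twice as far from Red as Blue does.
print(label_distance("Red", "Blue"), label_distance("Red", "Green"))  # 1 2

# One-hot: every pair of distinct categories is equally far apart.
print(one_hot_distance("Red", "Blue") == one_hot_distance("Red", "Green"))  # True
```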
One-hot encoding
One-hot creates one column per category:
- city_Delhi
- city_Mumbai
- city_Pune
For each row, exactly one of these columns is 1 and the others are 0.
Example: city = "Mumbai" → city_Delhi=0, city_Mumbai=1, city_Pune=0
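To show the mechanics without any library, here is a minimal hand-rolled sketch. The function name one_hot_encode and the fixed category list are assumptions for illustration; in practice a library encoder would learn the categories from training data.

```python
def one_hot_encode(values, categories):
    """Return one row of 0/1 indicators per value, one column per category."""
    return [[1 if value == cat else 0 for cat in categories] for value in values]

# Column order: city_Delhi, city_Mumbai, city_Pune
categories = ["Delhi", "Mumbai", "Pune"]
rows = one_hot_encode(["Mumbai", "Pune"], categories)

print(rows)  # [[0, 1, 0], [0, 0, 1]]
```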
Scikit-learn: best-practice one-hot
Use OneHotEncoder(handle_unknown="ignore") so new categories won’t crash production: a category that was not seen during fit is encoded as all zeros instead of raising an error.
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown="ignore")
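A short fit-and-transform sketch of that behaviour, assuming scikit-learn is installed. The city "Chennai" is a hypothetical category that was not present when the encoder was fitted.

```python
from sklearn.preprocessing import OneHotEncoder

# Training data: one categorical column with three known cities.
train = [["Delhi"], ["Mumbai"], ["Pune"]]

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)

# "Chennai" was not seen during fit (hypothetical new category).
new = [["Mumbai"], ["Chennai"]]
encoded = enc.transform(new).toarray()  # transform returns a sparse matrix by default

print(enc.categories_[0])  # learned categories, in sorted order
print(encoded)             # the unseen city becomes an all-zero row
```

The all-zero row keeps the pipeline running, but the model receives no signal for that category, so it is still worth monitoring how often unknown values appear.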
High-cardinality categories (important)
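Some features have thousands of categories (e.g., user_id, product_id). One-hot encoding then creates one column per category, so the feature matrix can grow to thousands of mostly zero columns.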
Options:
- frequency encoding / target encoding (careful: target encoding risks leakage when computed on the same rows it is applied to)
- the hashing trick
- embeddings (deep learning)
- group rare categories into “Other”
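Three of these options are simple enough to sketch in plain Python. The function names, the min_count threshold, and the bucket count below are assumptions chosen for illustration, not recommended defaults.

```python
import hashlib
from collections import Counter

def group_rare(values, min_count):
    """Replace categories seen fewer than min_count times with "Other"."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else "Other" for v in values]

def frequency_encode(values):
    """Replace each category with its relative frequency in values."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]

def hash_bucket(value, n_buckets):
    """Map a category to one of n_buckets columns using a stable hash."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

products = ["p1", "p1", "p1", "p2", "p2", "p3", "p4"]

print(group_rare(products, min_count=2))
# ['p1', 'p1', 'p1', 'p2', 'p2', 'Other', 'Other']

print(frequency_encode(["a", "a", "b", "c"]))
# [0.5, 0.5, 0.25, 0.25]

# Any product_id, including one never seen before, lands in a fixed range.
print(0 <= hash_bucket("p999", n_buckets=8) < 8)  # True
```

Two caveats: counts and frequencies should be computed on training data only and then reused at prediction time, and with hashing, distinct categories can collide in the same bucket. The sketch uses hashlib rather than Python's built-in hash() because the built-in is randomized per process for strings.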
Mini-checkpoint
Pick a categorical column and answer:
- ordinal or nominal?
- expected number of categories over time?
- what happens if a new category appears tomorrow?
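The last two questions can be checked against data. A minimal sketch, assuming a hypothetical helper check_column that compares training values with newly arrived values (the ordinal-or-nominal question still needs domain knowledge):

```python
def check_column(train_values, new_values):
    """Summarise cardinality and list categories that appear only in new data."""
    known = set(train_values)
    unseen = sorted(set(new_values) - known)
    return {"n_categories": len(known), "unseen_in_new": unseen}

report = check_column(
    train_values=["Free", "Pro", "Free", "Enterprise"],
    new_values=["Pro", "Team"],  # "Team" is a hypothetical plan added later
)
print(report)  # {'n_categories': 3, 'unseen_in_new': ['Team']}
```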
