One-Hot Encoding

What is one-hot encoding?

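One-hot encoding converts a categorical column into a set of binary (0/1) columns, one per category. In each row, exactly one of those columns is 1: the one that matches the row's category.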

Example:

- city = {Pune, Delhi}

becomes:

  • city_Pune (0/1)
  • city_Delhi (0/1)
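
To make the mapping concrete, here is a minimal pure-Python sketch of the idea. The helper name `one_hot` is hypothetical (it is not part of any library), and sorting the categories is just one reasonable way to fix the column order.

```python
def one_hot(values):
    """Return (categories, rows) for a list of category labels."""
    # Sort the distinct labels so the column order is deterministic.
    categories = sorted(set(values))
    # Each row gets a 1 in the column of its own category and 0 elsewhere.
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


categories, rows = one_hot(["Pune", "Delhi", "Pune"])
print(categories)  # ['Delhi', 'Pune']
print(rows)        # [[0, 1], [1, 0], [0, 1]]
```

In practice you would use a library routine rather than a hand-written helper like this.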

One-hot encoding with Pandas

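get_dummies
import pandas as pd

# Small example frame: one categorical column and one numeric column.
df = pd.DataFrame({"city": ["Pune", "Delhi", "Pune"], "amount": [100, 200, 150]})

# Encode only "city"; "amount" passes through unchanged.
# drop_first=False (the default) keeps one column per category.
encoded = pd.get_dummies(df, columns=["city"], drop_first=False)
print(encoded)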
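
One detail worth checking in your own environment: since pandas 2.0, `get_dummies` returns boolean (True/False) columns by default, not 0/1 integers. If you want the 0/1 columns described earlier, a possible variant is to pass `dtype=int`, as sketched here with the same example data.

```python
import pandas as pd

df = pd.DataFrame({"city": ["Pune", "Delhi", "Pune"], "amount": [100, 200, 150]})

# dtype=int requests integer 0/1 columns regardless of the pandas default.
encoded = pd.get_dummies(df, columns=["city"], dtype=int)

# Non-encoded columns come first, followed by one column per category.
print(list(encoded.columns))
print(encoded["city_Pune"].tolist())
```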

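Avoid the dummy variable trap (optional)

The full set of dummy columns sums to 1 in every row, so it is perfectly collinear with a model's intercept. For linear models fitted without regularization (for example, ordinary least squares), you can drop one category to remove that redundancy; the dropped category becomes the baseline: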

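drop_first
# Reuses df from the get_dummies example above.
# Categories are sorted, so "Delhi" is dropped and only city_Pune remains.
encoded = pd.get_dummies(df, columns=["city"], drop_first=True)
print(encoded)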
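
The redundancy that `drop_first` removes can be checked directly. The following is a small sketch on the same toy data; `dtype=int` is used here only so the sums are easy to read.

```python
import pandas as pd

df = pd.DataFrame({"city": ["Pune", "Delhi", "Pune"]})

full = pd.get_dummies(df, columns=["city"], dtype=int)
reduced = pd.get_dummies(df, columns=["city"], drop_first=True, dtype=int)

# With all categories kept, every row sums to 1, which duplicates
# the information in an intercept (all-ones) column.
row_sums = full.sum(axis=1).tolist()
print(row_sums)

# With drop_first=True, the dropped category (Delhi) is represented
# by a row of zeros in the remaining columns.
print(list(reduced.columns))
print(reduced["city_Pune"].tolist())
```

No information is lost: a row with 0 in every remaining column can only belong to the dropped category.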

One-hot encoding with scikit-learn

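OneHotEncoder
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({"city": ["Pune", "Delhi", "Pune"]})

# sparse_output=False returns a dense NumPy array (scikit-learn 1.2+; older versions use sparse=False).
# handle_unknown="ignore" encodes categories not seen during fit as all zeros.
enc = OneHotEncoder(sparse_output=False, handle_unknown="ignore")

# Double brackets keep X[["city"]] two-dimensional, as the encoder expects.
arr = enc.fit_transform(X[["city"]])
print(arr)
print(enc.get_feature_names_out(["city"]))  # column names for arr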

Tips

  • High-cardinality categories (thousands of unique values) produce one column per value, which can explode the feature count.
  • Consider grouping rare categories into “Other”.
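
One possible way to group rare categories before encoding is sketched below. The threshold `min_count` and the example city list are assumptions chosen for illustration; a suitable cutoff depends on your data.

```python
import pandas as pd

cities = pd.Series(
    ["Pune", "Delhi", "Pune", "Agra", "Delhi", "Kochi"], name="city"
)

min_count = 2  # hypothetical threshold: values seen fewer times are "rare"

# Count occurrences and collect the labels that fall below the threshold.
counts = cities.value_counts()
rare = counts[counts < min_count].index

# Keep frequent labels as they are; replace rare ones with "Other".
grouped = cities.where(~cities.isin(rare), "Other")

encoded = pd.get_dummies(grouped, prefix="city", dtype=int)
print(grouped.tolist())
print(list(encoded.columns))
```

Here four distinct cities collapse into three columns; with thousands of rare values the reduction is much larger.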
