One-Hot Encoding
What is one-hot encoding?
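One-hot encoding converts a categorical column into multiple binary columns, one per category. In each row, the column matching that row's category is 1 and all the others are 0.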
Example:
- city = {Pune, Delhi}
becomes:
- city_Pune (0/1)
- city_Delhi (0/1)
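The mapping above can be sketched in plain Python, without any library, to make the mechanics concrete. The helper name `one_hot` is hypothetical and exists only for this illustration; the column naming (`city_Pune`) simply mirrors the example above.

```python
def one_hot(values, prefix):
    """Return a dict of column name -> list of 0/1 flags (illustrative helper)."""
    # Sort the distinct categories so the column order is deterministic.
    categories = sorted(set(values))
    return {
        f"{prefix}_{cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }


cities = ["Pune", "Delhi", "Pune"]
encoded = one_hot(cities, "city")
for name, column in encoded.items():
    print(name, column)
# city_Delhi [0, 1, 0]
# city_Pune [1, 0, 1]
```

Each row has exactly one 1 across the generated columns, which is where the name "one-hot" comes from.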
One-hot encoding with Pandas
get_dummies
import pandas as pd
# Small example frame: one categorical column and one numeric column.
df = pd.DataFrame({"city": ["Pune", "Delhi", "Pune"], "amount": [100, 200, 150]})
# Encode only "city"; "amount" is passed through unchanged.
# dtype=int gives 0/1 columns (pandas 2.0+ returns True/False by default).
encoded = pd.get_dummies(df, columns=["city"], drop_first=False, dtype=int)
print(encoded)
Avoid dummy variable trap (optional)
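With every category encoded, the dummy columns always sum to 1, which makes them perfectly collinear with the intercept of a linear model (the dummy variable trap). For such models, you can drop one category: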
drop_first
# drop_first=True drops the first category in sorted order (city_Delhi here),
# so city_Pune = 0 now means Delhi.
encoded = pd.get_dummies(df, columns=["city"], drop_first=True, dtype=int)
print(encoded)
One-hot encoding with scikit-learn
OneHotEncoder
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
X = pd.DataFrame({"city": ["Pune", "Delhi", "Pune"]})
# sparse_output=False returns a dense NumPy array (scikit-learn 1.2+; older versions call it sparse).
# handle_unknown="ignore" encodes categories not seen during fit as all zeros instead of raising.
enc = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
arr = enc.fit_transform(X[["city"]])
print(arr)
# Column names for the encoded array, e.g. city_Delhi and city_Pune.
print(enc.get_feature_names_out(["city"]))
Tips
- High-cardinality categories (thousands of unique values) can explode feature count.
- Consider grouping rare categories into "Other".
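A minimal sketch of the rare-category tip, assuming a frequency threshold is an acceptable way to define "rare". The data, the `MIN_COUNT` threshold, and the `city_grouped` column name are made up for illustration.

```python
import pandas as pd

# Made-up data: two frequent cities and three that appear only once.
df = pd.DataFrame(
    {"city": ["Pune", "Delhi", "Pune", "Delhi", "Pune", "Agra", "Kochi", "Surat"]}
)

MIN_COUNT = 2  # hypothetical threshold; tune it for your data

# Count how often each category occurs, then collect the rare ones.
counts = df["city"].value_counts()
rare = counts[counts < MIN_COUNT].index

# Keep frequent categories as-is and replace rare ones with "Other".
df["city_grouped"] = df["city"].where(~df["city"].isin(rare), "Other")

encoded = pd.get_dummies(df["city_grouped"], prefix="city", dtype=int)
print(encoded.columns.tolist())
# ['city_Delhi', 'city_Other', 'city_Pune']
```

Here five distinct cities collapse into three encoded columns. On real data, the same threshold should be derived from the training set only and then reused for new data, so that the set of columns stays stable.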
