# Decision Trees - Entropy and Gini Impurity
## What a decision tree is
A decision tree makes predictions by repeatedly asking "if/else"-style questions about the input features, following one branch per answer until it reaches a leaf.
```mermaid
flowchart TD
    R[Root: Feature <= threshold?] -->|yes| L[Left branch]
    R -->|no| RR[Right branch]
    L --> P1[Prediction]
    RR --> P2[Prediction]
```
## How trees choose splits
A split is chosen to make the child nodes "purer" than the parent, i.e. more dominated by a single class.
Two common impurity measures:
### Gini impurity

- the common default in scikit-learn
- slightly faster to compute (no logarithm involved)

### Entropy (information gain)

- based on information theory: the average number of bits needed to encode the class label at the node
- in practice it usually selects splits very similar to Gini's
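To make the two measures concrete, here is a minimal sketch of both, plus the weighted impurity decrease a tree maximizes when picking a split. The function names (`gini`, `entropy`, `impurity_decrease`) are illustrative, not scikit-learn's internals:

```python
import math

def gini(counts):
    """Gini impurity: 1 - sum(p_i^2) over class proportions."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy in bits: -sum(p_i * log2(p_i))."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

# A pure node has impurity 0; a 50/50 binary node is maximally impure.
print(gini([10, 0]))    # 0.0
print(gini([5, 5]))     # 0.5
print(entropy([5, 5]))  # 1.0

def impurity_decrease(parent, left, right, measure=gini):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = sum(parent)
    nl, nr = sum(left), sum(right)
    return measure(parent) - (nl / n) * measure(left) - (nr / n) * measure(right)

# Splitting a mixed node [6, 6] into mostly-pure children lowers impurity.
print(impurity_decrease([6, 6], [5, 1], [1, 5]))
```

The split with the largest impurity decrease wins; with `measure=entropy` this quantity is exactly the "information gain".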
## Overfitting risk
Trees can memorize training data.
Common controls:
- `max_depth` - cap how deep the tree can grow
- `min_samples_split` - require at least this many samples to split a node
- `min_samples_leaf` - require at least this many samples in each leaf
## Scikit-learn example
Decision tree classifier:

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    criterion="gini",  # or "entropy" / "log_loss" depending on sklearn version
    max_depth=5,
    random_state=42,
)
```

## Mini-checkpoint
Train two trees:
- deep tree (no max_depth)
- shallow tree (max_depth=3)
Compare train vs validation scores.
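The checkpoint above could be sketched like this; the synthetic dataset and its parameters are illustrative choices, not from the original:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic dataset; any classification data works here.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # no max_depth
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, model in [("deep", deep), ("shallow", shallow)]:
    print(name,
          "train:", round(model.score(X_train, y_train), 3),
          "val:", round(model.score(X_val, y_val), 3))
```

Typically the deep tree scores near 1.0 on the training set but drops on validation, while the shallow tree shows a much smaller train/validation gap: that gap is the overfitting signal.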
