Dimensionality Reduction (PCA - Principal Component Analysis)
When dimensionality becomes a problem
High-dimensional data can cause:
- slower training
- overfitting
- noisy distance calculations
- difficulty visualizing
PCA intuition
PCA finds new axes (components) that:
- are linear combinations of original features
- capture maximum variance
- are orthogonal (uncorrelated)
You keep the first k components.
```mermaid
flowchart TD
    X[Original Features] --> C1["Principal Component 1 (max variance)"]
    X --> C2["Principal Component 2"]
    X --> Ck["..."]
```
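The two defining properties above (orthogonal components, variance captured in decreasing order) can be checked directly on toy data. This is a minimal sketch using scikit-learn; the correlated 2-D dataset is made up for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# correlated 2-D data: mix two independent normals through a fixed matrix
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])

pca = PCA(n_components=2).fit(X)

# components are orthogonal unit vectors: their Gram matrix is the identity
print(pca.components_ @ pca.components_.T)

# variance captured per component, sorted in decreasing order
print(pca.explained_variance_ratio_)
```

Keeping the first k components means keeping the first k rows of `components_`, which together explain the largest possible share of the variance.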
Key requirements for PCA
- features should be scaled (very important)
- PCA is linear (it won't capture complex nonlinear structure)
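To see why scaling matters, consider two features carrying the same signal but measured in very different units. A minimal sketch (the synthetic features are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
f1 = rng.normal(size=300)                         # feature in small units
f2 = 1000.0 * (f1 + 0.1 * rng.normal(size=300))  # same signal, huge units
X = np.column_stack([f1, f2])

# Without scaling, PC1 points almost entirely along the large-scale feature
pc1_raw = PCA(n_components=1).fit(X).components_[0]

# After standardizing, both features load roughly equally on PC1
X_scaled = StandardScaler().fit_transform(X)
pc1_scaled = PCA(n_components=1).fit(X_scaled).components_[0]

print(np.abs(pc1_raw))     # ≈ [0.001, 1.0]
print(np.abs(pc1_scaled))  # ≈ [0.71, 0.71]
```

Unscaled, the large-unit feature dominates the variance, so PCA effectively ignores the other feature; standardizing first puts them on equal footing.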
Scikit-learn example
Standardize then PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
pca_pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=0.95)),  # keep 95% of the variance
    ]
)
Choosing number of components
Common approaches:
- keep a variance threshold (e.g., 95%)
- look at the explained variance curve ("elbow")
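Both approaches come down to inspecting the cumulative explained-variance curve. A sketch on the scikit-learn digits dataset (chosen here only as a convenient 64-feature example):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)   # 64 pixel features
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)             # fit all components first
cumulative = np.cumsum(pca.explained_variance_ratio_)

# smallest k whose components explain at least 95% of the variance
k = int(np.argmax(cumulative >= 0.95)) + 1
print(k, cumulative[k - 1])
```

Passing a float like `PCA(n_components=0.95)`, as in the pipeline above, performs this threshold search automatically; plotting `cumulative` against k also reveals the "elbow".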
PCA caveats
- components are harder to interpret
- if interpretability matters, prefer feature selection
Mini-checkpoint
- Do you need interpretable features? If so, prefer feature selection; PCA components are hard to explain.
- Do you have thousands of features? If so, PCA may be useful.