Dimensionality Reduction (PCA - Principal Component Analysis)
When dimensionality becomes a problem
High-dimensional data can cause:
- slower training
- overfitting
- noisy distance calculations
- difficulty visualizing
PCA intuition
PCA finds new axes (components) that:
- are linear combinations of original features
- capture maximum variance
- are orthogonal (uncorrelated)
You keep the first k components.
```mermaid
flowchart TD
    X[Original Features] --> C1["Principal Component 1 (max variance)"]
    X --> C2["Principal Component 2"]
    X --> Ck["..."]
```
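The two defining properties above (orthogonal components, variance captured in decreasing order) can be checked directly on toy data. This is a minimal sketch using scikit-learn; the correlated 2-D dataset is made up for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# correlated 2-D data: mix two independent normals through a fixed matrix
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])

pca = PCA(n_components=2).fit(X)

# components are orthogonal unit vectors: their Gram matrix is the identity
print(pca.components_ @ pca.components_.T)

# variance captured per component, sorted in decreasing order
print(pca.explained_variance_ratio_)
```

Keeping the first k components means keeping the first k rows of `components_`, which together explain the largest possible share of the variance.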
Key requirements for PCA
- features should be scaled (very important)
- PCA is linear (it won't capture complex nonlinear structure)
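To see why scaling matters, consider two features carrying the same signal but measured in very different units. A minimal sketch (the synthetic features are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
f1 = rng.normal(size=300)                         # feature in small units
f2 = 1000.0 * (f1 + 0.1 * rng.normal(size=300))  # same signal, huge units
X = np.column_stack([f1, f2])

# Without scaling, PC1 points almost entirely along the large-scale feature
pc1_raw = PCA(n_components=1).fit(X).components_[0]

# After standardizing, both features load roughly equally on PC1
X_scaled = StandardScaler().fit_transform(X)
pc1_scaled = PCA(n_components=1).fit(X_scaled).components_[0]

print(np.abs(pc1_raw))     # ≈ [0.001, 1.0]
print(np.abs(pc1_scaled))  # ≈ [0.71, 0.71]
```

Unscaled, the large-unit feature dominates the variance, so PCA effectively ignores the other feature; standardizing first puts them on equal footing.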
Scikit-learn example
Standardize then PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
pca_pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=0.95)),  # keep 95% of the variance
    ]
)
Choosing number of components
Common approaches:
- keep a variance threshold (e.g., 95%)
- look at the explained variance curve ("elbow")
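Both approaches come down to inspecting the cumulative explained-variance curve. A sketch on the scikit-learn digits dataset (chosen here only as a convenient 64-feature example):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)   # 64 pixel features
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)             # fit all components first
cumulative = np.cumsum(pca.explained_variance_ratio_)

# smallest k whose components explain at least 95% of the variance
k = int(np.argmax(cumulative >= 0.95)) + 1
print(k, cumulative[k - 1])
```

Passing a float like `PCA(n_components=0.95)`, as in the pipeline above, performs this threshold search automatically; plotting `cumulative` against k also reveals the "elbow".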
PCA caveats
- components are harder to interpret
- if interpretability matters, prefer feature selection
Mini-checkpoint
- Do you need interpretable features? If so, prefer feature selection; PCA components are hard to explain.
- Do you have thousands of features? If so, PCA may be useful.