
Dimensionality Reduction (PCA - Principal Component Analysis)

When dimensionality becomes a problem

High-dimensional data can cause:

  • slower training
  • overfitting
  • noisy distance calculations
  • difficulty visualizing

PCA intuition

PCA finds new axes (components) that:

  • are linear combinations of original features
  • capture maximum variance
  • are orthogonal (uncorrelated)

You keep the first k components.
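A minimal sketch of these properties on a small synthetic dataset (the data and seed are illustrative): the first component captures most of the variance, and the components are orthogonal.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# correlated 2-D data: the second feature is a noisy copy of the first
x = rng.normal(size=(200, 1))
X = np.hstack([x, x + 0.1 * rng.normal(size=(200, 1))])

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)  # first component captures most variance

# components are orthogonal: dot product is ~0
print(np.dot(pca.components_[0], pca.components_[1]))
```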



  flowchart TD
    X[Original Features] --> C1["Principal Component 1 (max variance)"]
    X --> C2[Principal Component 2]
    X --> Ck[...]


Key requirements for PCA

  • features should be scaled first (very important: otherwise high-variance features dominate the components)
  • PCA is linear, so it won’t capture complex nonlinear structure
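A quick sketch of why scaling matters (the scales and seed here are illustrative): without standardization, the feature measured on the larger scale dominates the first component.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.hstack([
    rng.normal(scale=1.0, size=(200, 1)),    # small-scale feature
    rng.normal(scale=100.0, size=(200, 1)),  # large-scale feature
])

raw = PCA(n_components=1).fit(X)
scaled = PCA(n_components=1).fit(StandardScaler().fit_transform(X))

print(np.abs(raw.components_[0]).round(3))     # dominated by the large-scale feature
print(np.abs(scaled.components_[0]).round(3))  # balanced after standardization
```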

Scikit-learn example

Standardize then PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
 
pca_pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=0.95)),  # keep 95% variance
    ]
)

Choosing number of components

Common approaches:

  • keep a variance threshold (e.g., 95%)
  • look at the explained variance curve (β€œelbow”)
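Both approaches can be read off the cumulative explained-variance curve. A sketch on synthetic low-rank data (shapes and seed are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# 10 observed features generated from 3 latent factors, plus noise
X = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(300, 10))

pca = PCA().fit(StandardScaler().fit_transform(X))
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(np.round(cumulative, 3))  # flattens sharply after the true rank (the "elbow")

# smallest k reaching the 95% threshold
k = int(np.searchsorted(cumulative, 0.95)) + 1
print(k)
```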

PCA caveats

  • components are harder to interpret
  • if interpretability matters, prefer feature selection
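Each component is a linear mix of all original features, which is why it rarely has a clean name. One way to probe interpretability is to inspect the loadings in `pca.components_`; a sketch on the Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

data = load_iris()
X = StandardScaler().fit_transform(data.data)
pca = PCA(n_components=2).fit(X)

# each row of components_ weights every original feature
for i, component in enumerate(pca.components_):
    weights = ", ".join(
        f"{name}: {w:+.2f}" for name, w in zip(data.feature_names, component)
    )
    print(f"PC{i + 1} -> {weights}")
```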

Mini-checkpoint

  • Do you have thousands of features? If yes, PCA may be useful.
  • Do you need interpretability? If yes, prefer feature selection instead.

