
K-Means Clustering Algorithm

K-means in one sentence

K-means partitions data into K clusters by minimizing within-cluster variance, that is, the sum of squared distances from each point to the centroid of its assigned cluster.
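
Written out, the quantity being minimized is usually called the within-cluster sum of squares. The notation below is assumed for illustration and does not come from the original page: C_k is the set of points assigned to cluster k, and mu_k is the mean (centroid) of those points.

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
```

A smaller J means points sit closer to their own centroid, so the clusters are tighter.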

The algorithm (intuition)

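  1. Choose K, the number of clusters.
  2. Initialize K centroids (for example, by picking K random data points).
  3. Assign each point to its nearest centroid.
  4. Recompute each centroid as the mean of the points assigned to it.
  5. Repeat steps 3 and 4 until the assignments stop changing.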
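
The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements it: the function name `kmeans`, the random-point initialization, the empty-cluster handling, and the toy data are all assumptions made for the example.

```python
import numpy as np


def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-means loop; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points.
        # An empty cluster keeps its previous centroid (a simplification).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


# Toy data (assumed): two well-separated groups of 2-D points.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2)),
])
centroids, labels = kmeans(X, k=2)
print(centroids)
```

Step 1 (choosing K) is the `k` argument; the loop body covers steps 3 to 5.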



  flowchart TD
  A[Choose K] --> B[Initialize centroids]
  B --> C[Assign points to nearest centroid]
  C --> D[Update centroid = mean of cluster]
  D --> E{Converged?}
  E -->|no| C
  E -->|yes| F[Final clusters]


What K-means assumes

K-means works best when clusters are:

  • roughly spherical (ball-shaped)
  • similar in size
  • separable by distance

It struggles when clusters:

  • are non-spherical (elongated or curved)
  • have different densities
  • contain many outliers
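
One way to see these assumptions in action is to compare K-means on compact blobs with K-means on two interleaving half-moons. The sketch below uses scikit-learn's synthetic data generators and the adjusted Rand index (ARI, where 1.0 means perfect agreement with the true grouping); the datasets, parameters, and the helper name `ari_for` are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_moons
from sklearn.metrics import adjusted_rand_score

# Friendly case: three compact, well-separated, similar-sized blobs.
X_blobs, y_blobs = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=1.0,
    random_state=0,
)
# Hard case: two interleaving half-moons (non-spherical clusters).
X_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=0)


def ari_for(X, y_true, k):
    """Cluster X with K-means and score agreement with the true labels."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    return adjusted_rand_score(y_true, labels)


ari_blobs = ari_for(X_blobs, y_blobs, k=3)
ari_moons = ari_for(X_moons, y_moons, k=2)
print(f"ARI on blobs: {ari_blobs:.2f}")
print(f"ARI on moons: {ari_moons:.2f}")
```

The blobs should score close to 1.0, while the moons score noticeably lower because a straight boundary between two centroids cannot follow the curved shapes.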

Scikit-learn example

KMeans
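from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Standardize features first: K-means is distance-based, so features on
# larger scales would otherwise dominate the clustering.
kmeans = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        # n_init=10 reruns K-means from 10 initializations and keeps the best run;
        # random_state fixes the seed for reproducibility.
        ("model", KMeans(n_clusters=3, n_init=10, random_state=42)),
    ]
)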
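
The pipeline above is only defined, not fitted. A minimal usage sketch follows, assuming synthetic data from `make_blobs` as a stand-in, since the page does not name a dataset. Note that the centroids stored on the model live in the scaled feature space, so they are mapped back with the scaler's `inverse_transform`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; substitute your own feature matrix).
X, _ = make_blobs(n_samples=150, centers=3, n_features=2, random_state=0)

kmeans = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("model", KMeans(n_clusters=3, n_init=10, random_state=42)),
    ]
)

# Scale the data, fit K-means, and return one cluster label per row.
labels = kmeans.fit_predict(X)

# Centroids are learned on the scaled features ...
centroids_scaled = kmeans.named_steps["model"].cluster_centers_
# ... so convert them back to the original units for interpretation.
centroids = kmeans.named_steps["scaler"].inverse_transform(centroids_scaled)

# Inertia is the within-cluster sum of squared distances (in scaled space).
inertia = kmeans.named_steps["model"].inertia_
print(labels[:10], centroids, inertia, sep="\n")
```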

Mini-checkpoint

K-means always assigns every point to a cluster.

  • If you have outliers, what happens?

(Outliers still get assigned to a cluster, and because each centroid is a mean, they can pull a centroid away from the bulk of its points.)
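
The effect is easy to reproduce on toy data. The sketch below is an illustration under assumed values: two tight groups near x=0 and x=5, plus a single extreme point at x=1000. The helper name `fit` and all numbers are made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two tight groups of 50 points each, centered at x=0 and x=5.
group_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
group_b = rng.normal(loc=[5.0, 0.0], scale=0.3, size=(50, 2))
clean = np.vstack([group_a, group_b])
# The same data plus one extreme point far from both groups.
outlier = np.array([[1000.0, 0.0]])
noisy = np.vstack([clean, outlier])


def fit(X):
    """Fit K-means with K=2 and a fixed seed."""
    return KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)


clean_model = fit(clean)
noisy_model = fit(noisy)

print("Centroids without outlier:\n", clean_model.cluster_centers_)
print("Centroids with outlier:\n", noisy_model.cluster_centers_)
print("Label of the outlier:", noisy_model.labels_[-1])
```

Without the outlier, each group gets its own centroid. With it, the outlier takes one centroid for itself and the two real groups are merged under the other, which is one concrete way an outlier can distort the result.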

🧪 Try It Yourself

Exercise 1 – Train-Test Split

Exercise 2 – Fit a Linear Model

Exercise 3 – Evaluate with MSE
