K-Means Clustering Algorithm
K-means in one sentence
K-means partitions data into K clusters by minimizing within-cluster variance.
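In symbols, the within-cluster variance that K-means minimizes can be written as (a standard formulation, stated here for reference, not taken from the original text):

```latex
J(C, \mu) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
```

where \(C_k\) is the set of points assigned to cluster \(k\) and \(\mu_k\) is that cluster's centroid (mean).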
The algorithm (intuition)
- choose K
- initialize K centroids
- assign each point to nearest centroid
- recompute centroids as the mean of assigned points
- repeat until convergence

flowchart TD
A[Choose K] --> B[Initialize centroids]
B --> C[Assign points to nearest centroid]
C --> D[Update centroid = mean of cluster]
D --> E{Converged?}
E -->|no| C
E -->|yes| F[Final clusters]
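The loop in the flowchart is known as Lloyd's algorithm. A minimal NumPy sketch (illustrative only, not scikit-learn's implementation):

```python
import numpy as np

def kmeans_lloyd(X, k, n_iters=100, seed=0):
    """Minimal sketch of Lloyd's algorithm for K-means (illustrative)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its old centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Convergence: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

In practice you would use `sklearn.cluster.KMeans`, which adds smarter initialization (k-means++) and multiple restarts.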
What K-means assumes
K-means works best when clusters are:
- roughly spherical (ball-shaped)
- similar size
- separable by distance
It struggles when:
- clusters are non-spherical
- clusters have different densities
- the data contains many outliers
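A quick way to see the spherical-cluster assumption in action: compare K-means on well-separated blobs versus the classic two-moons dataset (the sample sizes and noise level below are arbitrary choices for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_moons
from sklearn.metrics import adjusted_rand_score

# Spherical, well-separated blobs: K-means recovers the true grouping.
X_blobs, y_blobs = make_blobs(
    n_samples=400, centers=[[-5, -5], [5, 5]], cluster_std=1.0, random_state=42
)
blob_pred = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_blobs)

# Interleaved half-moons: non-spherical, so K-means cuts straight through them.
X_moons, y_moons = make_moons(n_samples=400, noise=0.05, random_state=42)
moon_pred = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_moons)

print(adjusted_rand_score(y_blobs, blob_pred))  # near 1.0 (perfect recovery)
print(adjusted_rand_score(y_moons, moon_pred))  # far below 1.0
```

Adjusted Rand score compares predicted clusters against the true labels; the gap between the two scores shows how much shape matters.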
Scikit-learn example
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
kmeans = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("model", KMeans(n_clusters=3, n_init=10, random_state=42)),
    ]
)
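A usage sketch, repeating the pipeline so the snippet runs on its own (the toy data below is made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

kmeans = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("model", KMeans(n_clusters=3, n_init=10, random_state=42)),
    ]
)

# Toy data: three well-separated 2-D blobs (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal((0, 0), 0.5, (50, 2)),
    rng.normal((5, 5), 0.5, (50, 2)),
    rng.normal((0, 5), 0.5, (50, 2)),
])

labels = kmeans.fit_predict(X)  # cluster index for each point
# Note: the fitted centroids live in *scaled* space, not the original units.
centers = kmeans.named_steps["model"].cluster_centers_
```

Scaling before K-means matters because the algorithm is distance-based: a feature with a large numeric range would otherwise dominate the centroid assignments.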
Mini-checkpoint
K-means always assigns every point to a cluster.
- If you have outliers, what happens?
(Outliers still get assigned and can distort centroids.)
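A sketch of that distortion on made-up 1-D numbers (values chosen purely for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two tight groups, around 0 and around 10.
X = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(sorted(c[0] for c in km.cluster_centers_))  # roughly [0.1, 10.1]

# Add one extreme outlier: it still gets assigned to a cluster,
# and here it captures a centroid all by itself, merging the two real groups.
X_out = np.vstack([X, [[100.0]]])
km_out = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_out)
print(sorted(c[0] for c in km_out.cluster_centers_))
```

Because every point must belong to some cluster, a single extreme value can change which structure K-means finds.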
🧪 Try It Yourself
Exercise 1 – Train-Test Split
Exercise 2 – Fit a Linear Model
Exercise 3 – Evaluate with MSE
