Introduction to Clustering
What clustering does
Clustering groups points such that:
- points within a cluster are similar
- points in different clusters are less similar
Data points → Clustering algorithm → Cluster labels (groups)
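The pipeline above (data points in, cluster labels out) can be sketched in a few lines. This is a minimal illustration, assuming scikit-learn's KMeans as the clustering algorithm; the toy data and the choice of two clusters are made up for the example and are not from the lesson.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (hypothetical): two well-separated groups of 2D points.
X = np.array([
    [1.0, 1.1], [0.9, 1.0], [1.2, 0.8],   # group near (1, 1)
    [8.0, 8.2], [7.9, 8.1], [8.3, 7.8],   # group near (8, 8)
])

# n_clusters=2 is an assumption; on real data the number of groups is unknown.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)

# One integer label per point. Which group is called 0 and which 1 is arbitrary.
print(labels)
```

The labels only say which points belong together; what each group means is left to you.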
Similarity and distance
Most clustering methods rely on a notion of similarity, usually expressed as a distance: the smaller the distance between two points, the more similar they are.
Common distances:
- Euclidean (straight-line distance)
- Manhattan (sum of absolute differences per feature)
- cosine distance (common for text/embeddings)
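To make the three distances concrete, here is a small sketch using only the Python standard library. The function names and the two example vectors are illustrative choices, not part of any library.

```python
import math

def euclidean(a, b):
    # Straight-line distance: square root of the summed squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute differences, one term per feature.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 minus cosine similarity. Depends on direction only, not on length.
    # Undefined for all-zero vectors (division by zero), which this sketch does not handle.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length

print(euclidean(a, b))        # 5.0
print(manhattan(a, b))        # 7.0
print(cosine_distance(a, b))  # 0.0
```

The two vectors point the same way, so cosine distance treats them as identical even though Euclidean and Manhattan distance do not. That is one reason cosine distance is popular for text and embeddings, where vector length often matters less than direction.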
Important: scaling impacts clustering
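If your features are on different scales, the features with the largest numeric ranges dominate the distance, and the clusters end up reflecting those features almost exclusively.
Scale the features first (for example with StandardScaler or MinMaxScaler) when they are measured in different units or ranges.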
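The effect of scale on distance can be checked directly. The sketch below assumes scikit-learn's StandardScaler; the three customers and their age and income values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: [age in years, annual income in dollars].
X = np.array([
    [25.0, 50_000.0],   # customer 0
    [60.0, 51_000.0],   # customer 1: very different age, similar income
    [26.0, 90_000.0],   # customer 2: similar age, very different income
])

def dist(u, v):
    # Euclidean distance between two rows.
    return float(np.linalg.norm(u - v))

# Raw features: income is numerically much larger, so it dominates.
raw_01 = dist(X[0], X[1])   # roughly 1,000
raw_02 = dist(X[0], X[2])   # roughly 40,000

# StandardScaler gives each column zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
scaled_01 = dist(X_scaled[0], X_scaled[1])
scaled_02 = dist(X_scaled[0], X_scaled[2])

print(raw_01, raw_02)
print(scaled_01, scaled_02)
```

On the raw data, the 35-year age gap barely registers next to the income gap. After scaling, the two distances are comparable, so both features get a say in the clustering.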
What makes clustering hard
There is usually no "ground truth".
You validate using:
- domain sense (do clusters mean something?)
- internal metrics (e.g. silhouette score)
- stability across runs
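Two of these checks can be automated. The sketch below computes a silhouette score and, as one possible way to measure stability, compares the labels from two runs with different random seeds using the adjusted Rand index. The use of KMeans, the synthetic data, and the adjusted Rand index are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic data (hypothetical): two Gaussian blobs in 2D.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

# Two runs that differ only in their random seed.
labels_a = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)
labels_b = KMeans(n_clusters=2, n_init=10, random_state=2).fit_predict(X)

# Silhouette score lies in [-1, 1]; higher means tighter, better-separated clusters.
sil = silhouette_score(X, labels_a)

# Adjusted Rand index compares two labelings and ignores label names;
# 1.0 means both runs grouped the points identically.
ari = adjusted_rand_score(labels_a, labels_b)

print(f"silhouette={sil:.2f}, agreement between runs={ari:.2f}")
```

A high silhouette score and stable labels do not prove the clusters are meaningful. The domain check in the list above still has to be done by a person.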
Mini-checkpoint
If you're clustering customers:
- what features would you use?
- what would a "useful" cluster look like in business terms?
🧪 Try It Yourself
Exercise 1: Train-Test Split
Exercise 2: Fit a Linear Model
Exercise 3: Evaluate with MSE
