Introduction to Clustering

What clustering does

Clustering groups points such that:

  • points within a cluster are similar
  • points across clusters are less similar

Data points → Clustering algorithm → Cluster labels (groups)

Similarity and distance

Most clustering methods rely on a notion of similarity, usually expressed as a distance: the smaller the distance between two points, the more similar they are.

Common distances:

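  • Euclidean (straight-line distance)
  • Manhattan (sum of absolute differences along each feature)
  • cosine distance (compares direction rather than magnitude; common for text/embeddings)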
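The three distances can be sketched in a few lines of plain Python. The function names and the two example vectors are assumptions made for illustration. Cosine distance is written here as 1 minus cosine similarity, which is one common convention.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences along each coordinate."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity; undefined if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical vectors: b points in the same direction as a but is twice as long.
a = (3.0, 4.0)
b = (6.0, 8.0)

print(euclidean(a, b))        # -> 5.0
print(manhattan(a, b))        # -> 7.0
print(cosine_distance(a, b))  # approximately 0.0, because the directions match
```

The same pair of points is far apart by Euclidean and Manhattan distance but essentially identical by cosine distance, which is why the choice of distance changes which points end up grouped together.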

Important: scaling impacts clustering

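If your features are on different scales, the feature with the largest numeric range dominates the distance, and distance-based clustering can produce misleading groups.

Scale the features (for example with scikit-learn's StandardScaler or MinMaxScaler) when they are measured in different units or ranges.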
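The effect of scale can be seen on a tiny example. The sketch below standardizes each column by hand (subtract the mean, divide by the population standard deviation), which is the same kind of transform StandardScaler applies. The customer data and the helper names `standardize` and `nearest_to_first` are hypothetical.

```python
import math
import statistics

def standardize(rows):
    """Z-score each column: subtract the column mean, divide by the population std."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]

def nearest_to_first(rows):
    """Index of the row with the smallest Euclidean distance to row 0."""
    return min(range(1, len(rows)), key=lambda i: math.dist(rows[0], rows[i]))

# Hypothetical customers: (age in years, annual income in dollars).
customers = [(25, 50_000), (60, 50_500), (27, 50_800)]

print(nearest_to_first(customers))               # -> 1 (income dominates the raw distance)
print(nearest_to_first(standardize(customers)))  # -> 2 (age now counts as well)
```

On the raw values, the 25-year-old looks closest to the 60-year-old because a few hundred dollars of income outweighs 35 years of age. After standardization both features contribute comparably, and the 27-year-old becomes the nearest neighbor.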

What makes clustering hard

There is usually no “ground truth”: no labels to compare the clusters against.

Instead, you validate using:

  • domain sense (do clusters mean something?)
  • internal metrics (e.g., silhouette score)
  • stability across runs (similar clusters with different random seeds or data subsamples)
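As an illustration of an internal metric, the sketch below computes the mean silhouette score by hand. For each point, a is the mean distance to the other points in its own cluster and b is the mean distance to the nearest other cluster, and the score is (b - a) / max(a, b). The function name, the toy data, and the choice to score singleton clusters as 0 are assumptions for this example. In practice, scikit-learn offers `sklearn.metrics.silhouette_score`.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette over all points; values near 1 mean tight, well-separated clusters."""
    scores = []
    for i, p in enumerate(points):
        # Collect distances from p to every other point, grouped by that point's cluster.
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # assumption: singleton or single-cluster cases score 0
            continue
        a = sum(own) / len(own)                                # mean intra-cluster distance
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy data: two visibly separate groups in 2-D.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]

good = silhouette_score(points, [0, 0, 0, 1, 1, 1])  # labels follow the two groups
bad = silhouette_score(points, [0, 1, 0, 1, 0, 1])   # labels mix the two groups
print(good > bad)  # -> True
```

A higher score only says the clusters are compact and separated under the chosen distance. It does not say they are meaningful, which is why the domain check and the stability check still matter.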

Mini-checkpoint

If you’re clustering customers:

  • what features would you use?
  • what would a “useful” cluster look like in business terms?

🧪 Try It Yourself

Exercise 1 – Train-Test Split

Exercise 2 – Fit a Linear Model

Exercise 3 – Evaluate with MSE
