Text Preprocessing (Tokenization, Stemming, Lemmatization)

Why preprocessing exists

Text is messy:

  • punctuation
  • capitalization
  • emojis
  • spelling variations

Preprocessing reduces noise and makes feature extraction more consistent.

Tokenization

Tokenization splits text into units (tokens):

  • words
  • subwords
  • characters

Example:

β€œI love NLP!” β†’ [β€œI”, β€œlove”, β€œNLP”]
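The example above can be reproduced with a minimal word tokenizer. This is a toy sketch using only Python's standard library; real projects typically rely on a library tokenizer (and note that simply keeping letter runs silently drops punctuation like "!"):

```python
import re

def tokenize(text):
    # Keep runs of letters as tokens; punctuation such as "!" is dropped.
    return re.findall(r"[A-Za-z]+", text)

print(tokenize("I love NLP!"))  # -> ['I', 'love', 'NLP']
```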

Stemming

Stemming cuts words down to a root form:

  • β€œrunning” β†’ β€œrun”
  • β€œstudies” β†’ β€œstudi” (can be rough)

Pros:

  • simple, fast

Cons:

  • can create non-words

Lemmatization

Lemmatization converts to the dictionary form (lemma):

  • β€œbetter” β†’ β€œgood”
  • β€œrunning” β†’ β€œrun”

Pros:

  • more accurate than stemming

Cons:

  • slower, needs language rules

Common steps

  • lowercasing
  • removing extra spaces
  • removing/keeping punctuation depending on task
  • stop word removal (sometimes)

flowchart LR
  A[Raw text] --> B[Normalize]
  B --> C[Tokenize]
  C --> D[Stem/Lemmatize]
  D --> E[Clean tokens]

Mini-checkpoint

Should you always remove stop words?

  • Not always. For sentiment tasks, words like β€œnot” are critical.

πŸ§ͺ Try It Yourself

Exercise 1 – Tokenize a Paragraph

Exercise 2 – Compare Stemming vs Lemmatization

Exercise 3 – Build a Cleaning Pipeline

If this helped you, consider buying me a coffee β˜•