Text Preprocessing (Tokenization, Stemming, Lemmatization)

Why preprocessing exists

Text is messy:

  • punctuation
  • capitalization
  • emojis
  • spelling variations

Preprocessing reduces noise and makes feature extraction more consistent.

Tokenization

Tokenization splits text into units (tokens):

  • words
  • subwords
  • characters

Example:

β€œI love NLP!” β†’ [β€œI”, β€œlove”, β€œNLP”]
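The example above can be reproduced with a minimal word tokenizer. This is a toy sketch using only Python's standard library; real projects typically rely on a library tokenizer (and note that simply keeping letter runs silently drops punctuation like "!"):

```python
import re

def tokenize(text):
    # Keep runs of letters as tokens; punctuation such as "!" is dropped.
    return re.findall(r"[A-Za-z]+", text)

print(tokenize("I love NLP!"))  # -> ['I', 'love', 'NLP']
```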

Stemming

Stemming cuts words down to a root form:

  • β€œrunning” β†’ β€œrun”
  • β€œstudies” β†’ β€œstudi” (can be rough)

Pros:

  • simple, fast

Cons:

  • can create non-words

Lemmatization

Lemmatization converts to the dictionary form (lemma):

  • β€œbetter” β†’ β€œgood”
  • β€œrunning” β†’ β€œrun”

Pros:

  • more accurate than stemming

Cons:

  • slower, needs language rules

Common steps

  • lowercasing
  • removing extra spaces
  • removing/keeping punctuation depending on task
  • stop word removal (sometimes)

flowchart LR
  A[Raw text] --> B[Normalize]
  B --> C[Tokenize]
  C --> D[Stem/Lemmatize]
  D --> E[Clean tokens]

Mini-checkpoint

Should you always remove stop words?

  • Not always. For sentiment tasks, words like β€œnot” are critical.

πŸ§ͺ Try It Yourself

Exercise 1 – Tokenize a Paragraph

Exercise 2 – Compare Stemming vs Lemmatization

Exercise 3 – Build a Cleaning Pipeline

If this helped you, consider buying me a coffee β˜•