Text Preprocessing (Tokenization, Stemming, Lemmatization)
Why preprocessing exists
Text is messy:
- punctuation
- capitalization
- emojis
- spelling variations
Preprocessing reduces noise and makes feature extraction more consistent.
Tokenization
Tokenization splits text into units (tokens):
- words
- subwords
- characters
Example:
"I love NLP!" → ["I", "love", "NLP"]
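The example above can be sketched with a minimal regex-based word tokenizer. This is an illustrative toy, not a production tokenizer: real tokenizers (e.g. in NLTK or spaCy) also handle contractions, URLs, emojis, and punctuation tokens.

```python
import re

def tokenize(text):
    # Toy word tokenizer: grab runs of letters, digits, and apostrophes.
    # Punctuation such as "!" is dropped, matching the example above.
    return re.findall(r"[A-Za-z0-9']+", text)

print(tokenize("I love NLP!"))  # ['I', 'love', 'NLP']
```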
Stemming
Stemming cuts words down to a root form:
- "running" → "run"
- "studies" → "studi" (can be rough)
Pros:
- simple, fast
Cons:
- can create non-words
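To make the idea concrete, here is a toy suffix-stripping stemmer in the spirit of the Porter algorithm, vastly simplified. The suffix rules here are illustrative assumptions, not the real Porter rule set.

```python
def stem(word):
    # Toy suffix-stripping stemmer (Porter-style, vastly simplified).
    w = word.lower()
    if w.endswith("ies"):
        return w[:-3] + "i"        # "studies" -> "studi" (a non-word)
    if w.endswith("ing") and len(w) > 5:
        w = w[:-3]                 # "running" -> "runn"
        if len(w) > 2 and w[-1] == w[-2]:
            w = w[:-1]             # undouble the consonant: "runn" -> "run"
        return w
    if w.endswith("s") and not w.endswith("ss"):
        return w[:-1]              # "cats" -> "cat"
    return w

print(stem("running"))  # run
print(stem("studies"))  # studi
```

Note how "studies" becomes "studi": fast, but the output need not be a dictionary word.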
Lemmatization
Lemmatization converts to the dictionary form (lemma):
- "better" → "good"
- "running" → "run"
Pros:
- more accurate than stemming
Cons:
- slower, needs language rules
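A lemmatizer can be sketched as a dictionary lookup. Real lemmatizers (e.g. NLTK's WordNet lemmatizer or spaCy) combine large dictionaries with part-of-speech information; the tiny table below is a hypothetical stand-in for illustration only.

```python
# Hypothetical lemma table; real systems use full dictionaries plus POS tags.
LEMMAS = {
    "better": "good",     # irregular comparative -> base adjective
    "running": "run",
    "ran": "run",
    "studies": "study",
    "mice": "mouse",
}

def lemmatize(word):
    # Fall back to the lowercased word when it is not in the table.
    return LEMMAS.get(word.lower(), word.lower())

print(lemmatize("better"))   # good
print(lemmatize("studies"))  # study (compare stemming, which gives "studi")
```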
Common steps
- lowercasing
- removing extra spaces
- removing/keeping punctuation depending on task
- stop word removal (sometimes)
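The steps above can be combined into one small pipeline. The stop word list below is a tiny illustrative assumption; real pipelines use curated lists (or skip this step entirely, per the checkpoint below).

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "are"}  # tiny illustrative list

def preprocess(text, remove_stop_words=True):
    # 1. lowercase
    text = text.lower()
    # 2. collapse extra whitespace
    text = re.sub(r"\s+", " ", text).strip()
    # 3. tokenize, dropping punctuation (a task-dependent choice)
    tokens = re.findall(r"[a-z0-9']+", text)
    # 4. optional stop word removal
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

print(preprocess("The  cat IS on   the mat!"))  # ['cat', 'on', 'mat']
```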
```mermaid
flowchart LR
  A[Raw text] --> B[Normalize]
  B --> C[Tokenize]
  C --> D[Stem/Lemmatize]
  D --> E[Clean tokens]
```
Mini-checkpoint
Should you always remove stop words?
- Not always. For sentiment tasks, words like "not" are critical.
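A quick illustration of why: if "not" sits on the stop word list, removing it flips the apparent sentiment of the remaining tokens.

```python
# Illustrative stop word list that (unwisely) includes the negation "not".
STOP_WORDS = {"this", "movie", "is", "not", "a", "the"}

def drop_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

print(drop_stop_words(["this", "movie", "is", "not", "good"]))
# ['good'] -- the negation is gone, and the sentence now looks positive
```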
