Bag of Words (BoW) & TF-IDF
Bag of Words (BoW)
BoW represents a document as a vector of word counts:
- each vocabulary word gets one dimension, whose value is how many times that word appears in the document
Key idea:
- word order is ignored
flowchart LR
    D[Document] --> V[Vector of word counts]
Pros:
- simple and fast
- strong baseline
Cons:
- ignores word order and context
- common words dominate
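The order-insensitivity above is easy to see in a few lines of plain Python (a minimal sketch; the `bow` helper is hypothetical):

```python
from collections import Counter

# Minimal bag-of-words: count word occurrences, ignoring order.
def bow(text):
    return Counter(text.lower().split())

# Two sentences with opposite meanings get the same representation,
# because only the counts survive:
print(bow("the cat chased the dog"))
print(bow("the cat chased the dog") == bow("the dog chased the cat"))  # True
```

This is exactly the "ignores word order and context" con: "cat chased dog" and "dog chased cat" become indistinguishable.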
TF-IDF
TF-IDF (term frequency times inverse document frequency) reduces the impact of very common words by weighting them lower.
Intuition:
- words common in this doc but rare across all docs are informative
flowchart LR
    TF[Term Frequency] --> W[Weight]
    IDF[Inverse Document Frequency] --> W
    W --> V[TF-IDF vector]
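The intuition can be computed by hand with the textbook formula tf(t, d) * log(N / df(t)). This is a sketch with a made-up three-document corpus; note that scikit-learn's TfidfVectorizer uses a smoothed idf and L2-normalizes each row, so its exact numbers differ, but the ranking intuition is the same:

```python
import math

# Toy corpus: "is" appears in every document, "python" in only one.
docs = [doc.split() for doc in [
    "python is great",
    "java is fine",
    "rust is fast",
]]
N = len(docs)

def df(term):
    # number of documents containing the term
    return sum(term in d for d in docs)

def tfidf(term, doc):
    # textbook weighting: term frequency * log(N / document frequency)
    return doc.count(term) * math.log(N / df(term))

# "is" occurs in all 3 docs  -> idf = log(3/3) = 0 -> weight 0
# "python" occurs in 1 doc   -> idf = log(3/1) > 0 -> weight > 0
print(tfidf("is", docs[0]))      # 0.0
print(tfidf("python", docs[0]))  # ~1.0986
```

A word common in this document but rare across the corpus gets a high weight; a word common everywhere is driven toward zero.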
Scikit-learn examples
BoW with CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
X = vec.fit_transform(["i love python", "python is great"])
TF-IDF with TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
X = vec.fit_transform(["i love python", "python is great"])
Mini-checkpoint
Why is TF-IDF often better than raw counts?
- It downweights words that aren't discriminative across the corpus.
