Bag of Words (BoW) & TF-IDF

Bag of Words (BoW)

BoW represents a document by:

  • counting how many times each word appears

Key idea:

  • word order is ignored

  flowchart LR
  D[Document] --> V[Vector of word counts]


Pros:

  • simple and fast
  • strong baseline

Cons:

  • ignores word order and context
  • common words dominate
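The counting step can be sketched in a few lines of plain Python (the `bow` helper name here is illustrative, not from any library). It also demonstrates the key idea above: two sentences with different word order produce the same vector.

```python
from collections import Counter

def bow(text):
    # Count each whitespace-separated token; word order is discarded.
    return Counter(text.lower().split())

# Different word order, identical bag-of-words representation.
a = bow("the cat sat on the mat")
b = bow("on the mat sat the cat")
assert a == b
```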

TF-IDF

TF-IDF weights each term by how often it appears in a document (term frequency) and how rare it is across the corpus (inverse document frequency), so very common words contribute less to the vector.

Intuition:

  • words common in this doc but rare across all docs are informative

  flowchart LR
  TF[Term Frequency] --> W[Weight]
  IDF[Inverse Document Frequency] --> W
  W --> V[TF-IDF vector]

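The weighting in the diagram can be computed by hand. This sketch uses the textbook formulation tf × log(N / df); scikit-learn's `TfidfVectorizer` uses a smoothed, normalized variant, so the exact numbers differ, but the intuition is the same.

```python
import math

def tf_idf(term, doc, docs):
    # tf: raw count of the term in this document
    tf = doc.count(term)
    # df: number of documents that contain the term
    df = sum(1 for d in docs if term in d)
    # idf: rarer terms across the corpus get a larger weight
    return tf * math.log(len(docs) / df)

docs = [["python", "is", "great"],
        ["python", "is", "fun"],
        ["cats", "are", "fun"]]

# "great" occurs in 1 of 3 docs, "python" in 2 of 3,
# so "great" gets the higher weight in the first document.
```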

Scikit-learn examples

BoW with CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
 
# Learn the vocabulary and count word occurrences per document
vec = CountVectorizer()
X = vec.fit_transform(["i love python", "python is great"])  # sparse count matrix
TF-IDF with TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
 
# Same interface as CountVectorizer, but entries are TF-IDF weights
vec = TfidfVectorizer()
X = vec.fit_transform(["i love python", "python is great"])  # sparse weighted matrix

Mini-checkpoint

Why is TF-IDF often better than raw counts?

  • it downweights words that aren’t discriminative across the corpus.
