Bag of Words (BoW) & TF-IDF

Bag of Words (BoW)

BoW represents a document by:

  • counting how many times each word appears

Key idea:

  • word order is ignored

  flowchart LR
  D[Document] --> V[Vector of word counts]


Pros:

  • simple and fast
  • strong baseline

Cons:

  • ignores word order and context
  • common words dominate
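The counting step can be sketched in a few lines of plain Python (the `bow` helper name here is illustrative, not from any library). It also demonstrates the key idea above: two sentences with different word order produce the same vector.

```python
from collections import Counter

def bow(text):
    # Count each whitespace-separated token; word order is discarded.
    return Counter(text.lower().split())

# Different word order, identical bag-of-words representation.
a = bow("the cat sat on the mat")
b = bow("on the mat sat the cat")
assert a == b
```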

TF-IDF

TF-IDF weights each term by how often it appears in a document (term frequency) and how rare it is across the corpus (inverse document frequency), so very common words contribute less to the vector.

Intuition:

  • words common in this doc but rare across all docs are informative

  flowchart LR
  TF[Term Frequency] --> W[Weight]
  IDF[Inverse Document Frequency] --> W
  W --> V[TF-IDF vector]

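The weighting in the diagram can be computed by hand. This sketch uses the textbook formulation tf × log(N / df); scikit-learn's `TfidfVectorizer` uses a smoothed, normalized variant, so the exact numbers differ, but the intuition is the same.

```python
import math

def tf_idf(term, doc, docs):
    # tf: raw count of the term in this document
    tf = doc.count(term)
    # df: number of documents that contain the term
    df = sum(1 for d in docs if term in d)
    # idf: rarer terms across the corpus get a larger weight
    return tf * math.log(len(docs) / df)

docs = [["python", "is", "great"],
        ["python", "is", "fun"],
        ["cats", "are", "fun"]]

# "great" occurs in 1 of 3 docs, "python" in 2 of 3,
# so "great" gets the higher weight in the first document.
```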

Scikit-learn examples

BoW with CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
 
# Learn the vocabulary and count word occurrences per document
vec = CountVectorizer()
X = vec.fit_transform(["i love python", "python is great"])  # sparse count matrix
TF-IDF with TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
 
# Same interface as CountVectorizer, but entries are TF-IDF weights
vec = TfidfVectorizer()
X = vec.fit_transform(["i love python", "python is great"])  # sparse weighted matrix

Mini-checkpoint

Why is TF-IDF often better than raw counts?

  • it downweights words that aren’t discriminative across the corpus.
