Extracting Text from PDFs

Text extraction is tricky

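Some PDFs contain a real text layer that a library can read directly.

Others are scans: each page is just an image, so you need OCR (optical character recognition) to get the text out.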

Extract with pypdf

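pdf_extract_text.py
from pypdf import PdfReader

# Open the PDF; pages are loaded lazily as you iterate.
reader = PdfReader("input.pdf")

# Collect the text of each page; fall back to "" if a page yields nothing.
text_parts = []
for page in reader.pages:
    text_parts.append(page.extract_text() or "")

# Join the pages and preview the first 1000 characters.
text = "\n".join(text_parts)
print(text[:1000])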

When you need OCR

If extract_text() returns an empty string or garbled text:

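  • run OCR on the page images (for example with Tesseract)
  • or try a different PDF extractor such as pdfplumber or PyMuPDF, which sometimes handle layouts that pypdf struggles with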
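
One way to turn "empty or garbage" into a concrete check is sketched below. The helper names (looks_unusable, ocr_page, extract_with_fallback) and the thresholds are illustrative assumptions, not part of pypdf. The OCR step assumes the third-party packages pdf2image and pytesseract are installed, along with the Poppler and Tesseract programs they wrap. Treat it as a starting point and tune the thresholds on your own documents.

```python
import string


def looks_unusable(text: str, min_chars: int = 20,
                   min_ok_ratio: float = 0.85) -> bool:
    """Heuristic (an assumption, not a pypdf feature): True if the
    extracted text is too short or mostly unexpected characters."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return True
    # Count characters that normally appear in readable text.
    ok = sum(
        1 for ch in stripped
        if ch.isalnum() or ch.isspace() or ch in string.punctuation
    )
    return ok / len(stripped) < min_ok_ratio


def ocr_page(pdf_path: str, page_number: int, dpi: int = 300) -> str:
    """OCR a single page (1-based). Needs pdf2image + pytesseract."""
    # Imported lazily so the heuristic above works without these packages.
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(
        pdf_path, dpi=dpi, first_page=page_number, last_page=page_number
    )
    return pytesseract.image_to_string(images[0])


def extract_with_fallback(pdf_path: str) -> str:
    """Use pypdf first; OCR only the pages whose text looks unusable."""
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    parts = []
    for number, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        if looks_unusable(text):
            text = ocr_page(pdf_path, number)
        parts.append(text)
    return "\n".join(parts)
```

Calling extract_with_fallback("input.pdf") would return the combined text. Because OCR is slow, the sketch only runs it on pages that fail the check. Note that a genuinely blank page also fails the check and gets sent to OCR, which costs time but is harmless.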

🧪 Try It Yourself

Exercise 1 – List Files with os.listdir

Exercise 2 – Join Paths with os.path.join

Exercise 3 – Write and Read a File
