Extracting Text from PDFs

Text extraction is tricky

Some PDFs contain a real text layer that a library can read directly.

Others are scans (images of pages), which require OCR before any text can be extracted.

Extract with pypdf

pdf_extract_text.py

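from pypdf import PdfReader

# Open the PDF; "input.pdf" is a placeholder path.
reader = PdfReader("input.pdf")

# Image-only pages yield no text; `or ""` keeps every entry a string.
text_parts = []
for page in reader.pages:
    text_parts.append(page.extract_text() or "")

# Combine the pages and preview the first 1000 characters.
text = "\n".join(text_parts)
print(text[:1000])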
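
A document can mix text pages and scanned pages, so it can help to keep the text per page and note which pages came back empty. The following is a minimal sketch of that idea, not part of the original script: the helper names extract_pages, pages_without_text, and report are hypothetical, and pypdf is imported inside report only so the two helpers can be tried without a PDF at hand.

```python
def extract_pages(pages):
    """Return a list of (page_number, text) tuples, numbering pages from 1.

    `pages` is any iterable of objects with an extract_text() method,
    for example PdfReader(...).pages from pypdf.
    """
    results = []
    for number, page in enumerate(pages, start=1):
        # Treat a missing result the same as an empty one, then trim whitespace.
        text = (page.extract_text() or "").strip()
        results.append((number, text))
    return results


def pages_without_text(page_texts):
    """Return the page numbers whose extracted text is empty."""
    return [number for number, text in page_texts if not text]


def report(path):
    """Extract every page of the PDF at `path` and print a short summary."""
    # Imported here so the helpers above stay usable without pypdf installed.
    from pypdf import PdfReader

    page_texts = extract_pages(PdfReader(path).pages)
    missing = pages_without_text(page_texts)
    print(f"{len(page_texts)} pages, {len(missing)} without extractable text: {missing}")
    return page_texts
```

Calling report("input.pdf") prints the summary for the same placeholder file used above. Pages listed as having no text are the likely candidates for OCR.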

When you need OCR

If extract_text() returns an empty string or garbage:

use OCR tools (like Tesseract)
or use specialized PDF extractors
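
One way to wire this decision into code is sketched below. It rests on several assumptions that are not from the original text: the pdf2image and pytesseract packages stand in for "OCR tools" (they in turn need the Poppler and Tesseract programs installed), the function names are hypothetical, and the thresholds in looks_unusable are arbitrary starting points rather than recommended values.

```python
import string


def looks_unusable(text, min_chars=20, min_ratio=0.7):
    """Heuristic check: True when text is nearly empty or mostly unreadable.

    A character counts as readable if it is a letter, digit, whitespace,
    or ASCII punctuation. Both thresholds are assumptions to tune.
    """
    stripped = text.strip()
    if len(stripped) < min_chars:
        return True
    readable = sum(
        1
        for ch in stripped
        if ch.isalnum() or ch.isspace() or ch in string.punctuation
    )
    return readable / len(stripped) < min_ratio


def ocr_pdf(path, dpi=300, lang="eng"):
    """Render each page to an image and run Tesseract OCR on it."""
    # Optional third-party dependencies, imported only when OCR is needed.
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(path, dpi=dpi)  # one PIL image per page
    return "\n".join(pytesseract.image_to_string(img, lang=lang) for img in images)


def extract_text_with_fallback(path):
    """Try pypdf first; fall back to OCR if the result looks unusable."""
    from pypdf import PdfReader

    reader = PdfReader(path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    return ocr_pdf(path) if looks_unusable(text) else text
```

Trying pypdf first keeps the common case fast, since OCR is much slower and can introduce recognition errors of its own. The check here runs on the whole document; for mixed PDFs the same heuristic could be applied page by page.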

🧪 Try It Yourself

Exercise 1 – List Files with os.listdir

Exercise 2 – Join Paths with os.path.join

Exercise 3 – Write and Read a File
