Introduction to Web Scraping Ethics & robots.txt

Scraping principles

  • Prefer official APIs when available
  • Respect Terms of Service
  • Rate-limit requests
  • Identify your scraper (User-Agent)
  • Don’t scrape private data
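The "identify your scraper" principle usually means setting a descriptive User-Agent that names the bot and gives site operators a way to contact you. A minimal sketch (the bot name and contact URL here are placeholders, not a required format):

```python
def build_headers(bot_name="example-bot", version="1.0",
                  contact="https://example.com/bot-info"):
    """Build request headers that identify the scraper.

    The "+URL" convention points operators at a page describing the bot.
    """
    return {"User-Agent": f"{bot_name}/{version} (+{contact})"}


# Pass these headers with every request your scraper makes,
# e.g. requests.get(url, headers=build_headers()).
```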

robots.txt basics

robots.txt is a convention that tells crawlers which paths are allowed or disallowed.

It is not a security feature (compliance is voluntary), but it is a strong signal of the site owner's wishes and should be respected.
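Python's standard library can read these rules for you via `urllib.robotparser`. A small sketch using an inline ruleset (in practice you would point `set_url` at the site's real `/robots.txt` and call `read()`):

```python
from urllib.robotparser import RobotFileParser

# Example rules, parsed from a string instead of fetched over the network.
rules = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

rp.can_fetch("example-bot", "https://example.com/private/page")  # False: disallowed
rp.can_fetch("example-bot", "https://example.com/public/page")   # True: allowed
rp.crawl_delay("example-bot")  # 5: the site asks for 5 s between requests
```

`crawl_delay` feeds naturally into the rate-limiting code below: use it as the base delay when the site declares one.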

Rate limiting

Use delays and backoff:

polite_delay.py
import time
import random
 
 
def polite_sleep(base=1.0):
    time.sleep(base + random.random())

Avoid getting blocked

  • keep concurrency low
  • cache responses
  • handle 429/503
  • rotate proxies only if permitted and ethical
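Handling 429/503 politely means honoring the server's `Retry-After` header when present and falling back to backoff otherwise. A minimal decision helper (the signature is a sketch; adapt it to whatever HTTP client you use):

```python
def retry_wait(status, headers, attempt, base=1.0, cap=60.0):
    """Return seconds to wait before retrying, or None if no retry is needed.

    429 (Too Many Requests) and 503 (Service Unavailable) are the usual
    throttling responses; Retry-After, when given in seconds, takes priority.
    """
    if status not in (429, 503):
        return None
    retry_after = headers.get("Retry-After", "")
    if retry_after.isdigit():
        return float(retry_after)
    # No usable Retry-After: fall back to capped exponential backoff.
    return min(cap, base * 2 ** attempt)
```

Note that `Retry-After` may also be an HTTP date rather than a number of seconds; this sketch ignores that case for brevity.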
