
Basic Web Crawler

Abstract

Create a web crawler that automatically navigates through websites, extracts content, and saves structured data. This project demonstrates advanced web scraping, URL management, and systematic data collection techniques.

Prerequisites

  • Solid understanding of Python syntax
  • Knowledge of web scraping with BeautifulSoup
  • Familiarity with HTTP requests and web protocols
  • Understanding of data structures (queues, sets)
  • Basic knowledge of CSV file operations

Getting Started

  1. Install Required Dependencies

    pip install requests beautifulsoup4
  2. Run the Web Crawler

    python basicwebcrawler.py
  3. Configure Crawling (a sketch of this setup appears after these steps)

    • Enter the starting URL
    • Set maximum pages to crawl
    • Choose whether to save results to CSV
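
The script's exact prompts are not shown in the excerpts below, but a minimal sketch of such an interactive entry point could look like this (the prompt wording, the default of 10 pages, and the yes/no save question are assumptions, not the script's exact code):

# Minimal sketch of an interactive entry point; prompt wording and
# defaults are assumptions, not the script's exact code.
def main():
    start_url = input("Enter the starting URL: ").strip()
    max_pages = int(input("Maximum pages to crawl [10]: ") or 10)

    crawler = WebCrawler(start_url, max_pages=max_pages)
    crawler.crawl()

    if input("Save results to CSV? (y/n): ").strip().lower().startswith("y"):
        crawler.save_to_csv()

if __name__ == "__main__":
    main()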

Code Explanation

Crawler Architecture

basicwebcrawler.py
from collections import deque

class WebCrawler:
    def __init__(self, start_url, max_pages=10, delay=1):
        self.start_url = start_url            # anchor for same-domain crawling
        self.max_pages = max_pages            # stop after this many pages
        self.delay = delay                    # seconds to wait between requests
        self.visited_urls = set()             # URLs already processed
        self.to_visit = deque([start_url])    # FIFO queue of URLs to crawl
        self.crawled_data = []                # one extracted record per page

Uses a queue-based architecture (a deque of pending URLs plus a set of visited ones) to process pages systematically while avoiding duplicates.
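
In practice that architecture boils down to a loop like the following sketch (a simplified outline, not the exact body of crawl(); fetch_page stands in for whatever request helper the script uses):

# Simplified outline of the queue-based crawl loop (illustrative sketch,
# not the exact body of WebCrawler.crawl).
while self.to_visit and len(self.visited_urls) < self.max_pages:
    url = self.to_visit.popleft()           # take the oldest queued URL (FIFO)
    if url in self.visited_urls:
        continue                            # skip duplicates
    self.visited_urls.add(url)              # mark as processed

    html = self.fetch_page(url)             # hypothetical fetch helper
    if html:
        self.crawled_data.append(self.extract_page_data(html, url))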

URL Validation and Management

basicwebcrawler.py
# module-level imports
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

def is_valid_url(self, url):
    # A crawlable URL needs both a scheme (http/https) and a network location
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)

def extract_links(self, html, base_url):
    # Resolve every <a href="..."> into an absolute URL
    soup = BeautifulSoup(html, 'html.parser')
    for link in soup.find_all('a', href=True):
        full_url = urljoin(base_url, link['href'])
        if self.is_valid_url(full_url):
            yield full_url

Implements robust URL handling with validation and proper absolute URL construction.
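
For example, urljoin() resolves relative hrefs against the page they were found on, and urlparse() lets is_valid_url() reject links with no scheme or host:

from urllib.parse import urljoin, urlparse

# Relative links are resolved against the page that contains them
print(urljoin("https://example.com/docs/page.html", "guide.html"))
# -> https://example.com/docs/guide.html
print(urljoin("https://example.com/docs/page.html", "../about"))
# -> https://example.com/about

# mailto links and fragments without a host fail the is_valid_url check
parsed = urlparse("mailto:someone@example.com")
print(bool(parsed.netloc) and bool(parsed.scheme))  # -> False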

Data Extraction Pipeline

basicwebcrawler.py
def extract_page_data(self, html, url):
    soup = BeautifulSoup(html, 'html.parser')

    # Guard against pages with no <title> or meta description
    title_tag = soup.find('title')
    title = title_tag.get_text().strip() if title_tag else ''

    meta_tag = soup.find('meta', attrs={'name': 'description'})
    meta_desc = meta_tag.get('content', '').strip() if meta_tag else ''

    headings = [h.get_text().strip() for h in soup.find_all(['h1', 'h2', 'h3'])]

Extracts structured data including titles, descriptions, headings, and content previews.
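
Those pieces are then combined into one flat record per page. The field names and the 200-character preview below are illustrative assumptions about the record's shape, not necessarily the script's exact output:

# Illustrative shape of one crawled record; field names and the
# 200-character preview length are assumptions.
page_data = {
    'url': url,
    'title': title,
    'meta_description': meta_desc,
    'headings': headings,
    'content_preview': soup.get_text(separator=' ', strip=True)[:200],
}
return page_data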

Respectful Crawling

basicwebcrawler.py
import time

def crawl(self):
    # ... inside the main loop, after fetching each page ...

    # Be respectful - add a delay between requests
    time.sleep(self.delay)

    # Only queue links that stay within the starting domain
    for link in self.extract_links(html, url):
        if urlparse(link).netloc == urlparse(self.start_url).netloc:
            self.to_visit.append(link)

Implements ethical crawling practices with delays and domain restrictions.

Data Export

basicwebcrawler.py
import csv

def save_to_csv(self, filename="crawl_results.csv"):
    # One CSV row per crawled page; column headers come from the record keys
    fieldnames = list(self.crawled_data[0].keys())
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for data in self.crawled_data:
            writer.writerow(data)

Exports crawled data to structured CSV format for analysis and reporting.
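
A typical end-to-end run, followed by reading the export back for analysis, might look like this (the starting URL is a placeholder, and the column names assume the record shape sketched earlier):

import csv

# Crawl a handful of pages and export them (URL is a placeholder)
crawler = WebCrawler("https://example.com", max_pages=5, delay=1)
crawler.crawl()
crawler.save_to_csv("crawl_results.csv")

# Read the exported data back for analysis
with open("crawl_results.csv", newline='', encoding='utf-8') as f:
    for row in csv.DictReader(f):
        print(row['url'], '-', row['title'])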

Features

  • Systematic Crawling: Queue-based URL processing with duplicate detection
  • Data Extraction: Captures titles, descriptions, headings, and content
  • Domain Restriction: Stays within the starting domain for focused crawling
  • Rate Limiting: Respectful delays between requests
  • Error Handling: Robust handling of network errors and invalid URLs (see the fetch sketch after this list)
  • CSV Export: Structured data export for further analysis
  • Progress Tracking: Real-time crawling progress and statistics
  • Configurable Limits: Set maximum pages and crawling parameters
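
The error handling called out above usually lives around the HTTP request itself. A minimal sketch of such a fetch helper, assuming a helper named fetch_page, a 10-second timeout, and treatment of non-2xx responses as failures:

import requests

def fetch_page(self, url):
    # Hypothetical fetch helper: returns the page HTML, or None on any
    # network error, timeout, or non-2xx response.
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.text
    except requests.RequestException as exc:
        print(f"Failed to fetch {url}: {exc}")
        return None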

Next Steps

Enhancements

  • Add support for robots.txt compliance (see the sketch after this list)
  • Implement depth-first vs breadth-first crawling options
  • Create advanced filtering based on content type
  • Add database storage for large-scale crawling
  • Implement parallel/concurrent crawling
  • Create web interface for crawler management
  • Add image and file download capabilities
  • Implement crawling analytics and reporting
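
For the robots.txt item, Python's standard library already ships a parser. A minimal sketch of how a compliance check could be added (the user-agent string is a placeholder, and a real crawler would cache the parsed file per domain instead of re-reading it for every URL):

from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

def is_allowed(url, user_agent="BasicWebCrawler"):
    # Fetch and parse robots.txt from the URL's own domain, then ask
    # whether this user agent may crawl the given path.
    root = f"{urlparse(url).scheme}://{urlparse(url).netloc}"
    parser = RobotFileParser()
    parser.set_url(urljoin(root, "/robots.txt"))
    parser.read()
    return parser.can_fetch(user_agent, url)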

Learning Extensions

  • Study advanced web scraping techniques and anti-bot measures
  • Explore distributed crawling systems
  • Learn about search engine indexing principles
  • Practice with database integration for large datasets
  • Understand legal and ethical considerations in web crawling
  • Explore machine learning for content classification

Educational Value

This project teaches:

  • Web Crawling Architecture: Designing systematic data collection systems
  • Queue Management: Using data structures for efficient URL processing
  • HTTP Programming: Advanced request handling and error management
  • Data Extraction: Parsing and structuring web content systematically
  • File Operations: Writing structured data to various file formats
  • Ethical Programming: Implementing respectful web interaction practices
  • URL Management: Handling relative/absolute URLs and link resolution
  • System Design: Building scalable and maintainable data collection tools

Perfect for understanding large-scale data collection, web technologies, and building tools for automated information gathering.
