Web Scraping in Python - The Complete Guide

Feb 20, 2024 · 19 min read

In this tutorial you'll build robust web scrapers using libraries like BeautifulSoup, learn techniques to overcome real-world scraping challenges, and pick up best practices for large scale scraping.

You'll gain the skills to scrape complex sites and handle issues like rate limits, blocks, and JavaScript-rendered pages.

Why Python for Web Scraping?

Python is a popular language for web scraping due to its advantages:

  • Simple Syntax: Python's intuitive syntax allows quick coding for scraping.
  • Built-in Modules: Python ships with modules like urllib and html.parser that aid in scraping.
  • Mature Scraping Libraries: Libraries like Beautiful Soup and Scrapy simplify scraping at any scale.
  • General Purpose: Python can be used to build complete data pipelines around scraping.
  • Interoperability: Python integrates well with other languages and tools, so performance-critical pieces can be offloaded when needed.

    In contrast, languages like C++ require more effort for basic scraping tasks, and JavaScript platforms like Node.js can be complex for beginners building scraping scripts.

    Python's simplicity, power, and interoperability make it ideal for scraping needs. Its high-quality libraries let you start scraping quickly and scale up as needed.

    Best Python Web Scraping Libraries

    Some of the most popular and robust Python libraries for web scraping are:

    BeautifulSoup

  • Features: Excellent HTML/XML parser, easy web scraping interface, flexible navigation and search. We will be using this library in our example scraper below.
  • Use Case: Small to medium scale web scraping.
  • Link to BeautifulSoup docs
    Scrapy

  • Features: Fast and scalable, middlewares, distributed crawling capability.
  • Use Case: Large scale advanced web scraping projects.
  • Link to Scrapy docs
    Selenium

  • Features: Full browser automation, handles javascript heavy sites.
  • Use Case: Sites with highly dynamic content loaded by JS.
  • Link to Selenium docs
    lxml

  • Features: Very fast XML and HTML parser.
  • Use Case: Lightning fast parsing of XML/HTML data.
  • Link to lxml docs
    pyquery

  • Features: jQuery-style syntax for accessing HTML elements.
  • Use Case: Makes scrape code look cleaner and more readable.
  • Link to pyquery docs
    Prerequisites

    To follow along with the code examples in this article, you will need:

    Virtual Environment (Recommended)

    While optional, we highly recommend creating a virtual environment for the project:

    python -m venv my_web_scraping_env
    

    The Libraries

    We will primarily be using the requests and BeautifulSoup libraries, along with Python's built-in os module:

    pip install requests beautifulsoup4
    

    This will fetch the libraries from PyPI and install them locally.
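
    A quick, optional sanity check to confirm the installs worked:

    import requests
    import bs4

    print(requests.__version__, bs4.__version__)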

    With the prerequisites installed, you are all set up! Let's start scraping.

    Let's pick a target website

    For demonstration purposes, we will be scraping the Wikipedia page List of dog breeds to extract information about various dog breeds.

    The rationale behind choosing this page is:

  • Well structured HTML layout that makes scraping easy
  • Nice table layout with one breed per row
  • Contains multiple data fields per breed including names, breed group, alternate names and images
  • Images let us showcase scraping binary files as well

    Other great pages to practice web scraping include:

  • Wikipedia category pages like Lists of films
  • Ecommerce product listings like Amazon books
  • Real estate listings like Zillow rentals

    The concepts covered will be applicable across any site.

    Write the scraping code

    Let's now closely examine the full code to understand how to systematically scrape data from the dog breeds webpage.

    
    # Full code
    
    import os
    import requests
    from bs4 import BeautifulSoup
    
    url = 'https://commons.wikimedia.org/wiki/List_of_dog_breeds'
    
    # Headers to masquerade as a browser
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    }
    
    # Download page HTML using requests
    response = requests.get(url, headers=headers)
    
    # Check valid response received
    if response.status_code == 200:
    
        # Parse HTML using Beautiful Soup
        soup = BeautifulSoup(response.text, 'html.parser')
    
        # Find the main breeds table
        table = soup.find('table', {'class': 'wikitable sortable'})
    
        # Initialize data lists to store scraped info
        names = []
        groups = []
        local_names = []
        photographs = []
    
        # Create directory to store images
        os.makedirs('dog_images', exist_ok=True)
    
        # Loop through rows omitting header
        for row in table.find_all('tr')[1:]:
    
            # Extract each column data using CSS selectors
            columns = row.find_all(['td', 'th'])
    
            name = columns[0].find('a').text.strip()
            group = columns[1].text.strip()
    
            # Extract local name if exists
            span_tag = columns[2].find('span')
            local_name = span_tag.text.strip() if span_tag else ''
    
            # Extract photo url if exists
            img_tag = columns[3].find('img')
            photograph = img_tag['src'] if img_tag else ''

            # Wikimedia image URLs are protocol-relative (//upload.wikimedia.org/...),
            # so prepend the https scheme before downloading
            if photograph.startswith('//'):
                photograph = 'https:' + photograph

            # Download + Save image if url exists
            if photograph:

                response = requests.get(photograph, headers=headers)
    
                if response.status_code == 200:
    
                    image_filename = os.path.join('dog_images', f'{name}.jpg')
    
                    with open(image_filename, 'wb') as img_file:
    
                        img_file.write(response.content)
    
            names.append(name)
            groups.append(group)
            local_names.append(local_name)
            photographs.append(photograph)
    
    print(names)
    print(groups)
    print(local_names)
    print(photographs)
    

    The imports bring in the libraries we will leverage: HTTP request functionality (requests), HTML parsing capability (BeautifulSoup, from the third-party beautifulsoup4 package), and file system access (os, from the standard library).

    The requests library allows us to make HTTP requests to the web page and check if the response is valid before parsing. BeautifulSoup then enables us to parse the full HTML content and isolate the main data table using CSS selectors. Finally, os provides file system access to save images locally.

    Together they form a very handy toolkit for scraping!

    Downloading the page

    We first construct the target URL and the request headers, then download the page HTML with requests.get():

    url = 'https://commons.wikimedia.org/wiki/List_of_dog_breeds'
    
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    }
    
    response = requests.get(url, headers=headers)
    

    We also set up a custom User-Agent HTTP header to masquerade as a Chrome browser. This helps avoid blocks from servers trying to prevent scraping.
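
    If you plan to make many requests to the same domain, a requests Session is a small optional refinement that reuses the underlying connection and keeps the headers in one place:

    session = requests.Session()
    session.headers.update(headers)

    response = session.get(url)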

    After getting the response, we can check the status code to ensure we received a proper HTML document:

    if response.status_code == 200:
       # Success!
       print(response.text)
    

    In case of errors (e.g. 404 or 500), we do not proceed with scraping and handle the failure.

    Parsing the HTML

    Since we received a valid HTML response, we can parse the text content using Beautiful Soup:

    soup = BeautifulSoup(response.text, 'html.parser')
    

    BeautifulSoup accepts the raw HTML text and an optional parser like lxml or the built-in html.parser, and provides simple methods and Pythonic idioms for navigating, searching, and modifying the parse tree.

    Beautiful Soup transforms the messy HTML into a parse tree that mirrors the DOM structure of tags, attributes and text. We can use CSS selectors and traversal methods to quickly isolate the data we need from this tree.
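
    For example, if the optional lxml package is installed (pip install lxml), you can swap it in for the built-in parser to speed up parsing of large pages:

    # html.parser ships with Python; lxml is an optional, faster third-party parser
    soup = BeautifulSoup(response.text, 'lxml')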

    The Magic of Selectors for Data Extraction

    One of the most magical parts of web scraping with Python's BeautifulSoup library is using CSS selectors to extract specific content from HTML pages.

    Selectors allow us to visually target the tags enclosing the data we want scraped. BeautifulSoup makes selecting elements a breeze.

    For example, consider extracting book titles from this snippet:

    <div class="book-listing">
      <img src="/covers/harry-potter.jpg">
      <span class="title">Harry Potter and the Goblet of Fire</span>
      <span class="rating">9.1</span>
    </div>
    
    <div class="book-listing">
      <img src="/covers/lord-of-the-rings.jpg">
      <span class="title">The Fellowship of the Ring</span>
      <span class="rating">9.3</span>
    </div>
    

    We can directly target the span with class title through the CSS selector:

    soup.select("div.book-listing > span.title")
    

    This says - find all span tags having class title which are direct children of any div tag having book-listing as the CSS class.

    And voila, we have exactly the titles isolated:

    [<span class="title">Harry Potter and the Goblet of Fire</span>,
     <span class="title">The Fellowship of the Ring</span>]
    

    We can use a list comprehension to pull .text from each matched tag, extracting just the readable titles:

    [tag.text for tag in soup.select("div.book-listing > span.title")]
    # ['Harry Potter and the Goblet of Fire', 'The Fellowship of the Ring']


    Selectors provide incredible precision during data extraction by leveraging the innate hierarchy of the HTML tags surrounding the data.

    Some other examples of selectors:

    # Select by id
    soup.select("#book-title")

    # Attribute equality match
    soup.select('a[href="/login"]')

    # Attribute prefix match (value starts with "title")
    soup.select('span[class^="title"]')

    # Select direct children
    soup.select("ul > li")
    

    As you can see, by mastering different selector types, and combining multiple selectors where needed, you gain immense power to zero in on and extract exactly the data you need from any HTML document, eliminating nearly all guesswork. Let's get back to the task at hand now…

    Finding the table

    Looking at the raw HTML, we notice that a table tag with the CSS classes wikitable sortable contains the main breed data.

    We can simply select this using:

    table = soup.find('table', {'class': 'wikitable sortable'})
    

    This searches the parse tree for a table tag whose class attribute matches wikitable sortable. Beautiful Soup makes this kind of selection super easy!
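
    If you prefer CSS selector syntax, an equivalent lookup would be:

    table = soup.select_one('table.wikitable.sortable')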

    Extracting all the fields

    With the table isolated, we loop through every tr row after the header row to extract the data from each breed:

    for row in table.find_all('tr')[1:]:
    
        columns = row.find_all(['td', 'th'])
    
        name = columns[0].find('a').text.strip()
        group = columns[1].text.strip()
    

    Here, .find_all() searches the row's child tags for any td or th elements, which represent table cells, and collects them into the list columns.

    Using positional indexes in this columns list, we can extract the data within each cell cleanly:

        name = columns[0].find('a').text.strip()
    

    This grabs the anchor a tag inside the first table cell, gets .text property to extract raw string content and chains .strip() to remove whitespace. Beautiful Soup chains such operations elegantly.

    Similarly for cells containing just text:

        group = columns[1].text.strip()
    

    We fetch .text property directly on table cell element.

    The power of CSS selectors in quickly isolating specific tags, ids, classes or attributes makes data extraction very precise and straightforward in Beautiful Soup.

    Downloading and saving the images

    After scraping textual data like names, groups etc in each row, we check the last cell for an image link:

        img_tag = columns[3].find('img')
        photograph = img_tag['src'] if img_tag else ''
    

    This detects an img tag in the cell, if one exists, and fetches its src attribute.

    We can then download and save images using this url if present:

        if photograph:
    
            response = requests.get(photograph)
    
            image_filename = os.path.join('dog_images', f'{name}.jpg')
    
            with open(image_filename, 'wb') as img_file:
               img_file.write(response.content)
    

    We reuse the requests library to make another GET request - this time to download the image binary content and save it locally using built-in file handling capability. Pretty nifty!
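
    For larger files, a common refinement (not strictly needed for these small thumbnails) is to stream the download in chunks instead of loading it all into memory:

    response = requests.get(photograph, stream=True)

    with open(image_filename, 'wb') as img_file:
        # Write the image in 8 KB chunks rather than holding it all in memory
        for chunk in response.iter_content(chunk_size=8192):
            img_file.write(chunk)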

    And that's it! By using requests and BeautifulSoup together with Python's intuitive standard library, we were able to build a complete web scraper to extract complex data!

    Alternative libraries and tools for web scraping

    While requests and BeautifulSoup form the most popular combination, here are some alternatives worth considering:

    Scrapy

    An open source modular scraping framework meant for large scale crawling. It handles throttling, cookies and retries automatically, with middleware hooks for things like proxy rotation. Recommended for complex needs.
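
    A minimal Scrapy spider sketch (run with the scrapy runspider command) targeting the same breeds page might look like this:

    import scrapy

    class BreedsSpider(scrapy.Spider):
        name = "breeds"
        start_urls = ["https://commons.wikimedia.org/wiki/List_of_dog_breeds"]

        def parse(self, response):
            # Scrapy exposes CSS selectors directly on the response
            for row in response.css("table.wikitable.sortable tr")[1:]:
                yield {"name": row.css("td a::text").get()}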

    Selenium

    Performs actual browser automation by controlling Chrome, Firefox etc. Enables scraping dynamic content that renders via JavaScript. More complex setup.

    pyppeteer

    Headless browser automation like Selenium driven through Python code. Good for javascript rendered websites.
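
    A rough pyppeteer sketch (it is asyncio-based, so every call is awaited):

    import asyncio
    from pyppeteer import launch

    async def fetch(url):
        browser = await launch()        # starts a headless Chromium
        page = await browser.newPage()
        await page.goto(url)
        html = await page.content()     # HTML after JavaScript has run
        await browser.close()
        return html

    html = asyncio.get_event_loop().run_until_complete(fetch("https://example.com"))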

    pyquery

    Offers jQuery style element selection. Scrape code looks very clean due to chaining syntax similar to jQuery.
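
    A small pyquery sketch, reusing the book-listing snippet from earlier (here html holds that HTML string):

    from pyquery import PyQuery as pq

    doc = pq(html)

    # jQuery-style selection and chaining
    titles = [pq(el).text() for el in doc("div.book-listing > span.title")]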

    lxml

    A very fast XML/HTML parser. Great when raw parsing performance is critical.
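
    A quick lxml sketch using its native XPath queries on the downloaded page:

    from lxml import html as lxml_html

    tree = lxml_html.fromstring(response.text)

    # Breed names from the first column of the wikitable
    names = tree.xpath('//table[contains(@class, "wikitable")]//tr/td[1]/a/text()')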

    Challenges of Web Scraping in the real world: Some tips & best practices

    While basic web scraping is easy, building robust production-grade scalable crawlers brings its own challenges:

    Handling Dynamic Content

    Many websites rely heavily on JavaScript to render content dynamically. Static scraping then fails. Solutions: Use browser automation tools like Selenium or scraper specific solutions like Scrapy's splash integration.

    Here is a simple Hello World example to handle dynamic content using Selenium browser automation:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    
    # Initialize chrome webdriver
    driver = webdriver.Chrome()
    
    # Load page
    driver.get("<https://example.com>")
    
    # Wait for title to load from dynamic JS execution
    driver.implicitly_wait(10)
    
    # Selenium can extract dynamically loaded elements
    print(driver.title)
    
    # Selenium allows clicking buttons triggering JS events
    driver.find_element(By.ID, "dynamicBtn").click()
    
    # Inputs can be handled as well
    search = driver.find_element(By.NAME, 'search')
    search.send_keys('Automate using Selenium')
    search.submit()
    
    # Teardown browser after done
    driver.quit()
    

    The key capabilities offered by Selenium here are:

    1. Launches a real Chrome browser to load JavaScript
    2. Finds elements only available after execution of JS
    3. Can interact with page by clicking, entering text etc thereby triggering JavaScript events
    4. Experience mimics an actual user browsing dynamically generated content

    Together this allows handling complex sites primarily driven by JavaScript for dynamic content. Selenium provides full programmatic control of the browser, so dynamically generated content can be scraped correctly.
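
    In practice, explicit waits are usually more reliable than a blanket implicit wait. A small sketch using WebDriverWait with the same dynamicBtn element:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    driver.get("https://example.com")

    # Wait up to 10 seconds for the JavaScript-rendered element to appear
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "dynamicBtn"))
    )
    print(element.text)

    driver.quit()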

    Getting Blocked

    Websites often block scrapers by banning IP ranges or by detecting characteristic bot activity through heuristics. Solutions: Slow down requests, properly mimic browsers, rotate user agents and proxies.

    Rate Limiting

    Servers fight overload by restricting the number of requests served per unit of time. Hitting these limits leads to temporary bans or denied requests. Solutions: Honor crawl delays, use proxies and ration requests appropriately.

    Many websites have protection mechanisms that temporarily block scrapers when they detect too many frequent requests coming from a single IP address.

    We can counter such rate limits by adding throttling, proxies and random delays to our code. Here is sample code to handle rate limiting while scraping:

    import time
    import random
    from urllib.request import ProxyHandler, build_opener

    # List of free public proxies (example values - substitute your own working proxies)
    PROXIES = ["104.236.141.243:8080", "104.131.178.157:8085"]

    # Pause 5-15 seconds between requests and route each one through a random proxy
    def get_request():
        time.sleep(random.randint(5, 15))
        proxy = random.choice(PROXIES)
        opener = build_opener(ProxyHandler({'https': proxy}))

        resp = opener.open("https://example.com")
        return resp

    for i in range(50):
        response = get_request()
        print("Request Success")


    Here each request first waits for a random interval before executing. This prevents continuous rapid requests.

    We also route each request through a randomly chosen proxy server, rotating the originating IP address.

    Together, throttling down overall crawl pace and distributing requests over different proxy IPs prevents hitting site-imposed rate limits.

    Additional improvements like automatically detecting rate limit warnings in responses and reacting accordingly can enhance the scraper's resilience further.
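
    For example, a small sketch that backs off when the server responds with HTTP 429 (Too Many Requests), honoring the Retry-After header when present:

    import time
    import requests

    def get_with_backoff(url, max_retries=5):
        for attempt in range(max_retries):
            response = requests.get(url)

            if response.status_code != 429:
                return response

            # Honor the server's Retry-After hint, else back off exponentially
            wait = int(response.headers.get("Retry-After", 2 ** attempt))
            time.sleep(wait)

        return response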

    Rotating User Agents

    Websites often try to detect and block scraping bots by tracking characteristic user agent strings.

    To prevent blocks, it is good practice to rotate multiple well-disguised user agents randomly to mimic a real browser flow.

    Here is sample code to pick a random desktop user agent from a predefined list using Python's random library before making each request:

    import requests
    import random
    
    # List of desktop user agents
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36 OPR/43.0.2442.991"
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/604.4.7 (KHTML, like Gecko) Version/11.0.2 Safari/604.4.7"
    ]
    
    # Pick a random user agent string
    user_agent = random.choice(user_agents)
    
    # Set request headers with user agent before making request
    headers = {"User-Agent": user_agent}
    
    response = requests.get(url, headers=headers)
    

    By varying the user agent across requests and runs, you make it harder for websites to profile the traffic as coming from an automated bot with a static user agent. This lets the scraper fly under the radar without getting blocked.

    Some additional enhancements include:

  • Having separate user agent lists for mobile, tablets, desktop browsers
  • Updating the lists with latest user agents periodically
  • Dynamically generating user agents to match genuine browser attributes

    With effective user agent rotation and an ever-expanding list of strings, scrapers enjoy better longevity before site administrators can profile and actively block them.

    Browser Fingerprinting

    Beyond simplistic user agent checks, websites have adopted advanced browser fingerprinting techniques to identify bots.

    This involves browser attribute profiling - collecting information regarding device screen size, installed fonts, browser plugins etc. together called browser fingerprints.

    These fingerprints tend to remain largely consistent, stable and unique for standard tool-based bots and automation software.

    Websites track the fingerprints of scrapers accessing them. By detecting known crawler fingerprints, they can block scrapers even if the user agents are rotated constantly.

    Minimizing detection risks

    Some ways to minimize exposing scraper fingerprints:

  • Use Selenium to automate a standard desktop browser like Chrome or Firefox instead of custom bot agents
  • Dynamically generate randomized attributes like viewport size, screen resolution, font lists within ranges of variety exhibited by human browsers
  • Utilize proxy rotation and residential IP proxies to prevent tracking of IP specific attributes
  • Limit the number of parallel requests sent to a site from a single proxy so the traffic volume looks human

    Essentially, by mimicking the natural randomness and variability of genuine user browsers, a scraper's fingerprint avoids being easily profiled by sites as anything other than another standard browser.

    Here is a code example to dynamically modify browser attributes to avoid fingerprinting:

    from selenium import webdriver
    import random
    
    # List of common screen resolutions
    screen_res = [(1366, 768), (1920, 1080), (1024, 768)]
    
    # List of common font families
    font_families = ["Arial", "Times New Roman", "Verdana"]
    
    #Pick random resolution
    width, height = random.choice(screen_res)
    
    #Create chrome options
    opts = webdriver.ChromeOptions()
    
    # Set random screen res
    opts.add_argument(f"--window-size={width},{height}")
    
    # Set random user agent
    opts.add_argument("--user-agent=Mozilla/5.0...")
    
    # Set random font list
    random_fonts = random.choices(font_families, k=2)
    opts.add_argument(f'--font-list="{random_fonts[0]};{random_fonts[1]}"')
    
    # Initialize driver with options
    driver = webdriver.Chrome(options=opts)
    
    # Access webpage (target_url is a placeholder for the page being scraped)
    target_url = "https://example.com"
    driver.get(target_url)

    # The webpage sees each scraper request originating
    # from a distinct, unpredictable browser profile

    Here we randomly configure our Selenium controlled Chrome instance with different screen sizes, user agents and font sets per request.

    And here is how you can approximate the same idea using plain Python requests:

    import requests
    import random
    
    # Device profiles
    desktop_config = {
        'user-agent': 'Mozilla/5.0...',
        'accept-language': ['en-US,en', 'en-GB,en'],
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'accept-encoding': 'gzip, deflate, br',
        'upgrade-insecure-requests': '1',
        'sec-fetch-site': 'none',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'cache-control': 'max-age=0'
    }
    
    mobile_config = {
        'user-agent': 'Mozilla/5.0... Mobile',
        'accept-language': ['en-US,en', 'en-GB,en'],
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'x-requested-with': 'mark.via.gp',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': '<https://www.example.com/>',
        'accept-encoding': 'gzip, deflate, br',
        'cache-control': 'max-age=0'
    }
    
    device_profiles = [desktop_config, mobile_config]
    
    def build_headers():

        # Pick a random device profile for this request
        profile = random.choice(device_profiles)

        # Copy the profile's headers, choosing randomly wherever a header
        # lists multiple plausible values
        headers = {
            key: random.choice(value) if isinstance(value, list) else value
            for key, value in profile.items()
        }

        return headers
    

    Now, instead of hard-coding a single set of headers, the scraper randomly selects from plausible configuration profiles covering several identifying request headers, providing the realistic, human-like variation needed to avoid fingerprint tracking.

    Parsing Complex HTML

    Scrape targets often have complex HTML structures, obfuscated tags and advanced client-side rendering logic that can break parsers. Solutions: Careful inspection of the rendered source, using robust parsers like lxml and enhancing selectors.

    Here are some common types of bad HTML scrape targets exhibit and techniques to handle them:

    Improper Nesting

    HTML can often have incorrectly nested tags:

    <b>Latest News <p>Impact of oil prices fall...</b></p>
    

    Solution: Use a parser like lxml that handles bad nesting and uneven tags more robustly.
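
    For instance, parsing the badly nested snippet above with the lxml parser (assuming lxml is installed) produces a repaired tree that can still be queried:

    from bs4 import BeautifulSoup

    bad_html = "<b>Latest News <p>Impact of oil prices fall...</b></p>"

    # lxml repairs the improper nesting while building the parse tree
    soup = BeautifulSoup(bad_html, "lxml")
    print(soup.p.text)   # the paragraph text is still recoverable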

    Broken Markup

    Tags could be unclosed:

    <div>
      <span class="title">Python Web Scraping <span>
      Lorem ipsum...
    </div>
    

    Solution: A lenient parser closes the unclosed tags for you, so the element can still be selected as usual:

    title = soup.find("span", class_="title").text
    

    Non-standard Elements

    Vendor specific unrecognized custom tags may exist:

    <album>
      <cisco:song>Believer</cisco:song>
    </album>
    

    Solution: Search for the custom tag by its full namespaced name:

    song = soup.find("cisco:song").text
    

    Non-text Content

    Tables, images embedded between text tags:

    <p>
      Trending Now
      <table>...</table>
    </p>
    

    Solution: Select only the direct text nodes of the tag:

    paras = soup.find("p").find_all(string=True, recursive=False)


    This picks only the text nodes that are direct children of the p tag, ignoring the other elements nested inside it.

    As you can see, liberal use of selectors along with robust parsers provides the tools to handle even badly designed HTML and extract the required data reliably.

    Other guidelines worth following:

  • Respect robots.txt rules (see the sketch after this list)
  • Check if API access is available before scraping sites without permission
  • Scrape data responsibly in moderate volumes

    Adopting these practices ensures reliable, resilient and responsible scraping operations.
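
    A minimal sketch of checking robots.txt with Python's built-in urllib.robotparser before fetching a URL (the user agent string here is just a placeholder):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://commons.wikimedia.org/robots.txt")
    rp.read()

    url = "https://commons.wikimedia.org/wiki/List_of_dog_breeds"
    if rp.can_fetch("MyScraperBot", url):
        print("Allowed to scrape:", url)
    else:
        print("Disallowed by robots.txt:", url)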

    Conclusion

    In this comprehensive guide, we took an in-depth look into web scraping using Python. We covered:

  • Why Python and libraries like BeautifulSoup are ideal for scraping most targets
  • Common scraping patterns like making requests, parsing responses, handling dynamic content using Selenium
  • Best practices around mimicry, circumventing blocks, respecting crawl delays and auto-throttling
  • How to build resilient, production-grade scalable scrapers

    By learning core scraping paradigms, structuring code properly and applying optimization techniques, extracting accurate web data in Python at scale becomes an achievable skill!

    While these examples are great for learning, scraping production-level sites can pose challenges like CAPTCHAs, IP blocks, and bot detection. Rotating proxies and automated CAPTCHA solving can help.

    Proxies API offers a simple API for rendering pages with built-in proxy rotation, CAPTCHA solving, and evasion of IP blocks. You can fetch rendered pages in any language without configuring browsers or proxies yourself.

    This allows scraping at scale without headaches of IP blocks. Proxies API has a free tier to get started. Check out the API and sign up for an API key to supercharge your web scraping.

    With the power of Proxies API combined with Python libraries like Beautiful Soup, you can scrape data at scale without getting blocked.
