Dealing with 403 Forbidden Errors in BeautifulSoup

Oct 6, 2023 · 2 min read

When scraping websites, you may occasionally encounter 403 Forbidden errors preventing access to certain pages or resources. Here are some ways to handle and bypass these errors in your BeautifulSoup web scraper.

Understanding 403 Forbidden

A 403 Forbidden HTTP status code means the server has denied access to the requested page or resource. Some common reasons include:

  • Trying to access pages restricted to authorized users only
  • Hitting usage limits or access rate thresholds
  • Being detected as a bot, or coming from a banned IP address
  • Missing API credentials or keys
  • Hotlinking forbidden from external sites

These restrictions are typically implemented intentionally by the site owner.

    Checking Error Codes

    When making requests in Python, check the status code to detect 403 errors:

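    import requests

    url = 'https://example.com/page'  # Placeholder URL

    response = requests.get(url, timeout=10)
    if response.status_code == 403:
        # Handle the error, e.g. report it and skip parsing this page
        print(f'Access forbidden: {url}')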
    

    This lets you react to 403s when they occur.
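A 403 response often carries clues about why access was denied. The sketch below is one way to collect them. `describe_forbidden` is a hypothetical helper name, and it assumes only that the object passed in has the `status_code`, `headers`, and `text` attributes that a `requests` response provides. Which of these headers a given site actually sends is not guaranteed.

```python
def describe_forbidden(response, body_chars=200):
    """Summarise a 403 response, or return None for any other status.

    `response` is expected to look like a requests.Response: it needs
    `status_code`, `headers` and `text` attributes.
    """
    if response.status_code != 403:
        return None
    headers = response.headers
    return {
        # Some servers say how long to wait before retrying
        "retry_after": headers.get("Retry-After"),
        # The Server header can hint at a CDN or firewall in front of the site
        "server": headers.get("Server"),
        # The start of the body often contains a human-readable reason
        "body_start": response.text[:body_chars].strip(),
    }
```

With `requests`, this could be called as `describe_forbidden(requests.get(url))` and the result logged before deciding whether to retry.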

    Using User Agents

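    By default, requests identifies itself with a User-Agent beginning with python-requests, which some servers reject outright. Sending a real browser's user agent string may get the request through: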

    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
    
    response = requests.get(url, headers=headers)
    

    Set headers to mimic a normal browser, not a bot.
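A lone User-Agent can still look unusual, because real browsers send several other headers with every request. The sketch below builds a slightly fuller header set. The `browser_headers` name and the specific header values are illustrative assumptions, not values that any particular site is known to require.

```python
# Same user agent string as in the example above
DEFAULT_USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def browser_headers(user_agent=DEFAULT_USER_AGENT, language="en-US,en;q=0.9"):
    """Return a dict of headers resembling what a desktop browser sends."""
    return {
        "User-Agent": user_agent,
        # Typical Accept value for a page navigation (an assumption, not a rule)
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": language,
    }
```

Calling `requests.get(url, headers=browser_headers())` sends all three headers. With a `requests.Session`, `session.headers.update(browser_headers())` applies them to every request made through that session.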

    Authenticating with Login Credentials

    For pages requiring a login, pass credentials to access authorized content:

    response = requests.get(url, auth=('username','password'))
    

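    This attaches an HTTP Basic Auth header to the request. It only helps when the server uses Basic authentication; a site with a login form needs a session that submits the form and keeps the resulting cookies.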
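Many sites use a login form rather than HTTP Basic Auth. A minimal sketch of that flow follows, assuming the form accepts `username` and `password` fields at a separate login URL. Those field names, the `login_and_fetch` helper, and both URLs are hypothetical and would have to be taken from the actual site's form. The `session` argument is expected to be a `requests.Session`, which stores the cookies set at login and sends them on later requests.

```python
def login_and_fetch(session, login_url, page_url, username, password):
    """Post a login form, then fetch a protected page with the same session.

    `session` should behave like requests.Session (it needs .post and .get).
    The form field names below are assumptions; inspect the real form.
    """
    login = session.post(
        login_url,
        data={"username": username, "password": password},
        timeout=10,
    )
    if login.status_code >= 400:
        # The login itself was rejected, so the protected page would fail too
        return login
    # Cookies from the login response are sent automatically by the session
    return session.get(page_url, timeout=10)
```

A call might look like `login_and_fetch(requests.Session(), login_url, page_url, 'username', 'password')`. Some forms also require a hidden token field, which this sketch does not handle.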

    Waiting and Retrying

    A 403 is often caused by a temporary access limit, so waiting and retrying the request after a delay may let it through. Cap the number of attempts so a permanent block does not loop forever:

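    import requests
    from time import sleep

    max_attempts = 5  # Give up eventually instead of looping forever

    for attempt in range(1, max_attempts + 1):
        response = requests.get(url, timeout=10)
        if response.status_code != 403:
            break  # Not blocked, stop retrying
        if attempt < max_attempts:
            sleep(60)  # Wait 1 minute before the next attempt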
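A fixed one-minute wait can be longer than needed or too short. One common refinement is exponential backoff that honours a `Retry-After` header when the server sends one. The sketch below assumes `get` is a callable like `requests.get`. `fetch_with_backoff` and its parameters are hypothetical names, and the `sleep` argument exists only so the delay can be swapped out in tests.

```python
import time


def fetch_with_backoff(get, url, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call get(url) until the status is not 403 or attempts run out.

    Returns the last response, which may still be a 403.
    """
    response = None
    for attempt in range(max_attempts):
        response = get(url)
        if response.status_code != 403:
            return response
        if attempt == max_attempts - 1:
            break  # No point sleeping after the final attempt
        # Prefer the server's own hint when it is a plain number of seconds
        retry_after = response.headers.get("Retry-After", "")
        if retry_after.isdigit():
            delay = float(retry_after)
        else:
            delay = base_delay * (2 ** attempt)  # base, 2x base, 4x base, ...
        sleep(delay)
    return response
```

With `requests`, the call would be `fetch_with_backoff(requests.get, url)`. A `Retry-After` value given as a date rather than a number of seconds falls through to the exponential delay in this sketch.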
    

    Using Proxies

    Retry with different proxies to distribute requests across IP addresses:

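    import requests
    from random import choice

    proxies = ['x.x.x.x:xxxx','x.x.x.x:xxxx']  # Placeholder proxy addresses

    for attempt in range(10):  # Bounded, so a fully blocked pool cannot loop forever
        proxy = choice(proxies)
        # Set both keys; with only 'http', HTTPS requests would bypass the proxy
        response = requests.get(url, proxies={'http': proxy, 'https': proxy})

        if response.status_code != 403:
            break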
    

    This picks a random proxy for each attempt, so repeated requests do not all come from a single blocked IP address.
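Random selection can pick the same blocked proxy several times in a row. A round-robin variant tries each proxy at most once per pass. The sketch below assumes `get` behaves like `requests.get` and accepts a `proxies` mapping; the `fetch_via_proxies` name is hypothetical, and the proxy addresses are supplied by the caller.

```python
def fetch_via_proxies(get, url, proxy_urls):
    """Try each proxy in order; return (response, proxy) for the first non-403.

    Returns (last_response, None) when every proxy was refused, and
    (None, None) when proxy_urls is empty.
    """
    response = None
    for proxy in proxy_urls:
        # Use the same proxy for both schemes so HTTPS pages are covered too
        response = get(url, proxies={"http": proxy, "https": proxy})
        if response.status_code != 403:
            return response, proxy
    return response, None
```

With `requests`, the call would be `fetch_via_proxies(requests.get, url, proxies)`. A dead proxy raises an exception in `requests` rather than returning a 403, so real code would also need to catch `requests.exceptions.RequestException` and move on to the next proxy.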

    The key is to have a strategy ready for when a 403 Forbidden error appears: retry after a delay, or change how the request looks and where it comes from. Adjusting headers, authenticating, rotating proxies, and adding delays can make scraper traffic resemble ordinary browser traffic. With careful handling, a scraper can keep working even when some requests return 403.
