Web Scraping Websites with Login Example Using Python

Oct 4, 2023 · 3 min read

Introduction

Scraping dynamic websites that require logging in can be tricky. Often you can log in initially, but are then treated as logged out when you request other pages, because the login cookies are not carried over from one request to the next. This article walks through how to keep a session alive when scraping sites behind a login using the Python requests library.

Overview

Here's a quick overview of what we'll cover:

  • Use browser tools to analyze login form
  • Create payload with credentials
  • Post login request with requests
  • Create session to stay logged in
  • Access restricted pages
  • Hide credentials in separate file

Inspecting the Login Form

    The first step is to analyze the login form and the POST request it sends. This can be done using the Network panel in the browser's developer tools:

    Key Steps

  • Log in to the site and monitor the network requests
  • Find the POST request for logging in
  • Check the URL endpoint that it posts to
  • Look at the form data/payload sent
  • Note any other headers or parameters needed

    This will give us the information needed to mimic the login request in Python.
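
    The endpoint and form fields noted in the steps above can often also be read straight from the login page's HTML, which helps catch hidden inputs such as CSRF tokens. The following is a minimal sketch using only the standard library; the class name `LoginFormParser`, the sample markup, and the `csrf_token` field are illustrative assumptions, not taken from any particular site.

```python
from html.parser import HTMLParser


class LoginFormParser(HTMLParser):
    """Collect the action URL, method, and named inputs of the first <form>."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = None
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action")
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self._in_form and attrs.get("name"):
            # Inputs without a value attribute are recorded as empty strings
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


# Hypothetical login page markup, for illustration only
sample_html = """
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="submit" value="Log in">
</form>
"""

parser = LoginFormParser()
parser.feed(sample_html)
print(parser.action, parser.method)
print(parser.fields)
```

    Any hidden field that shows up here with a value would normally need to be included in the payload alongside the username and password.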

    Sending Login Request

    We can now send a POST request to the login URL with the payload:

    import requests
    
    login_url = 'https://website.com/login'
    
    payload = {
        'username': 'myusername',
        'password': 'mypassword'
    }
    
    response = requests.post(login_url, data=payload)
    

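    This will log us in. However, a standalone requests.post call does not keep the cookies the server sets, so we are not yet maintaining the session and the next request would start out logged out again.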
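
    A successful status code does not by itself prove the login worked, since many sites simply show the login page again with an error message. Below is a minimal sketch of a sanity check. The function name `login_failed` and the marker text are illustrative assumptions; the right marker depends on the site. The stand-in objects only mimic the `ok` and `text` attributes of a requests response so that the example runs without a network call.

```python
from types import SimpleNamespace


def login_failed(response, failure_marker="Invalid username or password"):
    """Return True if the login response looks like a failed attempt.

    `response` is expected to expose `.ok` and `.text`, as requests.Response does.
    `failure_marker` is a hypothetical error message; check the real site's wording.
    """
    if not response.ok:  # status code 400 or above
        return True
    return failure_marker in response.text


# Stand-ins for real responses, used here only for demonstration
good = SimpleNamespace(ok=True, text="<h1>Welcome back, myusername</h1>")
bad = SimpleNamespace(ok=True, text="<p>Invalid username or password</p>")

print(login_failed(good))  # False
print(login_failed(bad))   # True
```

    With a real request this could be called as login_failed(requests.post(login_url, data=payload)).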

    Keeping the Session Alive

    To stay logged in across requests, we need to use a session object:

    with requests.Session() as session:
    
        session.post(login_url, data=payload)
    
        r = session.get('https://website.com/restricted')
        # successful as we are logged in!
    

    The session stores the cookies set by the login response and sends them with every later request, which allows us to access restricted pages successfully after logging in.
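
    To see what the session does for us, the sketch below prepares a request without sending it and inspects the headers. The cookie name `sessionid` and its value are made up for illustration; a real site sets its own cookies in the login response.

```python
import requests

url = "https://website.com/restricted"

with requests.Session() as session:
    # Simulate the cookie a server would set after a successful login
    # (hypothetical name and value)
    session.cookies.set("sessionid", "abc123", domain="website.com", path="/")

    # Prepare, but do not send, a request through the session
    prepared = session.prepare_request(requests.Request("GET", url))

# The same request prepared without a session carries no cookie
bare = requests.Request("GET", url).prepare()

print(prepared.headers.get("Cookie"))  # sessionid=abc123
print(bare.headers.get("Cookie"))      # None
```

    The request prepared through the session carries the Cookie header automatically, which is why pages behind the login stay accessible.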

    Hiding Credentials

    It's good practice to keep credentials in a separate file:

    # cred.py
    
    username = 'myusername'
    password = 'mypassword'
    
    # main.py
    
    import cred
    
    payload = {
        'username': cred.username,
        'password': cred.password
    }
    

    This avoids exposing sensitive information when you share your main code file. Keep cred.py itself out of version control and out of anything you share.
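
    A common variation, not used in this article, is to read the credentials from environment variables so that no password sits in any source file. Below is a minimal sketch under that assumption; the variable names `SCRAPER_USERNAME` and `SCRAPER_PASSWORD` and the helper `load_credentials` are hypothetical.

```python
import os


def load_credentials(env=os.environ):
    """Build the login payload from environment variables.

    SCRAPER_USERNAME and SCRAPER_PASSWORD are hypothetical names chosen for
    this example; use whatever names fit your setup.
    """
    try:
        return {
            "username": env["SCRAPER_USERNAME"],
            "password": env["SCRAPER_PASSWORD"],
        }
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None


# A plain dict stands in for os.environ here so the example runs as written
payload = load_credentials(
    {"SCRAPER_USERNAME": "myusername", "SCRAPER_PASSWORD": "mypassword"}
)
print(sorted(payload))
```

    In a real script, calling load_credentials() with no argument reads from the actual environment.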

    Full Code Example

    Below is the full code for scraping a site behind a login using this approach:

    import requests
    from bs4 import BeautifulSoup
    import cred
    
    login_url = 'https://website.com/login'
    restricted_page = 'https://website.com/restricted'
    
    payload = {
        'username': cred.username,
        'password': cred.password
    }
    
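    with requests.Session() as session:
    
        # Log in; the session keeps the cookies from this response
        login_response = session.post(login_url, data=payload)
        login_response.raise_for_status()
    
        # The stored cookies are sent automatically with later requests
        r = session.get(restricted_page)
        r.raise_for_status()
    
        soup = BeautifulSoup(r.text, 'html.parser')
    
        # Continue scraping/parsing data from soup here...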
    
    

    Summary

  • Analyze login form with browser developer tools
  • Craft payload with credentials
  • Post login request
  • Use session to stay logged in
  • Hide credentials in separate file
  • Scrape data from restricted pages!

    Using this approach, you can scrape data from websites that require a login with Python.

    While these tools are great for learning, scraping production-level sites can pose challenges like CAPTCHAs, IP blocks, and bot detection. Rotating proxies and automated CAPTCHA solving can help.

