Using Proxies with Pyppeteer for Web Scraping

Jan 9, 2024 · 7 min read

Pyppeteer starts a browser instance through its launch() function, which accepts configuration options that customize the browser before it starts.

We can define proxies by including the --proxy-server argument via the args parameter:

browser = await launch(args=['--proxy-server=SERVER:PORT'])

The --proxy-server flag tells the browser to route all traffic through the specified proxy server. Let's explore some common proxy configurations.

Static IP Proxies

A basic proxy server uses a single static IP address. To set this up:

PROXY = '123.45.6.78:8080'

browser = await launch(args=[f'--proxy-server={PROXY}'])

Pro Tip: I like to store proxies as environment variables so they are easy to change across scripts.

This routes all browser requests through your defined proxy IP and port.
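
The environment-variable tip can be sketched roughly as follows. This is a minimal example, not a Pyppeteer feature: the variable name SCRAPER_PROXY and the helper proxy_args are hypothetical names chosen for illustration.

```python
import os


def proxy_args(env=os.environ, var_name='SCRAPER_PROXY'):
    """Build Chromium launch args from an environment variable.

    SCRAPER_PROXY is a hypothetical variable name chosen for this sketch.
    Returns an empty list when the variable is unset or blank, so the
    browser would connect directly.
    """
    proxy = env.get(var_name, '').strip()
    return [f'--proxy-server={proxy}'] if proxy else []


# Usage inside an async function (requires pyppeteer):
#     browser = await launch(args=proxy_args())
```

Passing the environment in as a parameter keeps the helper easy to test and lets one script override the proxy without touching the shell.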

However, using the same IP persistently has downsides:

  • Websites can detect and block the static proxy IP
  • Proxy providers often limit concurrent connections per IP

Rotating proxies help overcome these issues.

    Rotating Proxies

    Rotating proxy setups cycle through multiple proxy servers to vary the IPs used. This better mimics natural user traffic patterns.

    There are two common approaches to enable proxy rotation in Pyppeteer:

    1. Proxy Lists

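    You can maintain a list of proxies and pick one at random each time you launch a browser (Chromium fixes the proxy for the lifetime of the browser process, so rotating means relaunching):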

    import random
    
    PROXY_LIST = [
        '123.45.1.1:8000',
        '98.76.2.2:8000',
         ...
    ]
    
    random_proxy = random.choice(PROXY_LIST)
    
    browser = await launch(args=[f'--proxy-server={random_proxy}'])
    

    Tip: I've found keeping backup proxies on hand helps when your main ones get blocked.
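
One way to combine random selection with backup proxies is a small pool object. The sketch below is an illustration under assumptions: ProxyPool, mark_blocked, and pick are hypothetical names, and deciding when a proxy counts as blocked is left to your own scraping code.

```python
import random


class ProxyPool:
    """Pick random proxies, falling back to backups when the main ones are blocked."""

    def __init__(self, primary, backups=()):
        self.primary = list(primary)
        self.backups = list(backups)
        self.blocked = set()

    def mark_blocked(self, proxy):
        # Call this when a site starts rejecting traffic from `proxy`.
        self.blocked.add(proxy)

    def pick(self):
        # Prefer any unblocked primary proxy; only then try the backups.
        for group in (self.primary, self.backups):
            usable = [p for p in group if p not in self.blocked]
            if usable:
                return random.choice(usable)
        raise RuntimeError('no usable proxies left')


# Usage inside an async function (requires pyppeteer):
#     pool = ProxyPool(PROXY_LIST, backups=BACKUP_LIST)
#     browser = await launch(args=[f'--proxy-server={pool.pick()}'])
```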

    2. Proxy API Endpoints

    Services like Proxies API provide a proxy API endpoint that handles rotating proxies automatically under the hood.

    You simply pass the endpoint URL to Pyppeteer:

    
    
    import asyncio
    from pyppeteer import launch

    # Proxy API endpoint wrapping the target URL (replace API_KEY with your key)
    url_to_browse = 'http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com'

    async def browse_url(url):
        # Launch a headless Chromium browser
        browser = await launch()

        try:
            # Open a new page/tab and navigate to the endpoint URL
            page = await browser.newPage()
            await page.goto(url)

            # Do something on the page (e.g., take a screenshot)
            await page.screenshot({'path': 'screenshot.png'})
        finally:
            # Close the browser even if navigation fails
            await browser.close()

    # Run the async function with the endpoint URL
    asyncio.run(browse_url(url_to_browse))

    Now instead of changing IPs yourself, each call will automatically use a different proxy from a large, managed pool!
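
When the target URL carries its own query string, pasting it into the endpoint by hand can corrupt the request, because its & and ? characters get mixed up with the endpoint's own parameters. A hedged sketch of building the endpoint with percent-encoding follows. The parameter names key, render, and url come from the example above; the helper name build_endpoint is hypothetical, and whether the service expects an encoded url value is an assumption to confirm against the provider's documentation.

```python
from urllib.parse import urlencode

API_BASE = 'http://api.proxiesapi.com/'


def build_endpoint(api_key, target_url, render=True):
    """Return the proxy API URL with the target URL percent-encoded."""
    params = [('key', api_key)]
    if render:
        params.append(('render', 'true'))
    # urlencode escapes '&', '?', '/' and ':' inside the target URL.
    params.append(('url', target_url))
    return f'{API_BASE}?{urlencode(params)}'


# Usage inside an async function (requires pyppeteer):
#     await page.goto(build_endpoint('API_KEY', 'https://example.com/search?q=a&page=2'))
```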

    Residential Proxies

    Residential proxies are the most advanced option. They use IPs assigned to home networks, making your traffic appear more natural.

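    They are configured like static proxies, with two differences: they are usually paid, shared services, and they typically require a username and password. Chromium does not accept credentials embedded in --proxy-server, so pass only the host and port there and supply the credentials with page.authenticate():

    RES_PROXY = 'residential-proxy:8080'

    browser = await launch(args=[f'--proxy-server={RES_PROXY}'])
    page = await browser.newPage()
    await page.authenticate({'username': 'user', 'password': 'pass'})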
    

    Warning: Don't use free residential proxies found online...they are usually slow or quickly banned!
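
Providers often hand out credentials as a single user:pass@host:port string, while Chromium's --proxy-server flag only takes the host and port. The sketch below splits such a string and passes the credentials through page.authenticate(). The helper names split_proxy and open_page are hypothetical, and launch is passed in as a parameter (expected to be pyppeteer.launch) purely so the example stays importable without Pyppeteer installed.

```python
from urllib.parse import urlsplit


def split_proxy(proxy):
    """Split 'user:pass@host:port' into ('host:port', credentials or None)."""
    parts = urlsplit(f'//{proxy}')
    server = parts.hostname if parts.port is None else f'{parts.hostname}:{parts.port}'
    credentials = None
    if parts.username is not None:
        credentials = {'username': parts.username, 'password': parts.password or ''}
    return server, credentials


async def open_page(launch, proxy):
    """Launch a browser through `proxy` and return (browser, page).

    `launch` is expected to be pyppeteer.launch.
    """
    server, credentials = split_proxy(proxy)
    browser = await launch(args=[f'--proxy-server={server}'])
    page = await browser.newPage()
    if credentials:
        # Answers the proxy's authentication challenge for this page.
        await page.authenticate(credentials)
    return browser, page
```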

    Alright, we've covered setting up Pyppeteer with various proxy types. Now let's talk more advanced management and usage...

    Managing Proxies

    When relying on proxies for large scraping projects, you need to carefully track their status and performance over time. Here are some best practices I've found useful:

    Refresh IP Pools Regularly

    For proxy lists and residential services, refresh or rotate your IP pools every couple of weeks. Sites can recognize commonly reused IPs.

    Have Backup Options

    Always have a secondary pool of proxy servers or residential services in case your main ones get flagged or rate limited. There's nothing worse than having all proxies go down mid-project!

    Check Statuses and Speeds

    Actively monitor metrics like success rates, latencies, timeouts, and errors for your proxies. This allows you to proactively shift traffic away from poor-performing IPs.

    For example, you could have a job that pings each proxy every hour, logging metrics to identify issues early.
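
A periodic check like this could look roughly like the sketch below. It is only an illustration: check_proxies and http_probe are hypothetical helper names, http://example.com/ is a placeholder test URL, and scheduling the job hourly (cron, a task queue, and so on) is left out.

```python
import time
import urllib.request


def http_probe(proxy, url='http://example.com/', timeout=10):
    """Fetch `url` through `proxy` ('host:port'); raises on any failure."""
    handler = urllib.request.ProxyHandler(
        {'http': f'http://{proxy}', 'https': f'http://{proxy}'}
    )
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        response.read()


def check_proxies(proxies, probe=http_probe, attempts=3):
    """Probe each proxy `attempts` times and summarize success and latency."""
    report = {}
    for proxy in proxies:
        latencies, failures = [], 0
        for _ in range(attempts):
            start = time.monotonic()
            try:
                probe(proxy)
                latencies.append(time.monotonic() - start)
            except Exception:
                failures += 1
        report[proxy] = {
            'success_rate': len(latencies) / attempts,
            'avg_latency': sum(latencies) / len(latencies) if latencies else None,
            'failures': failures,
        }
    return report
```

Logging the returned report on each run gives you a history from which slow or failing proxies stand out before they stall a scraping job.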

    Leverage Managed Proxies

    Instead of handling all these proxy management complexities yourself in Pyppeteer, services like Proxies API do it for you automatically!

    Their infrastructure constantly cycles through millions of proxies around the world, detecting banned IPs instantly. This maximizes uptime and minimizes your scraping headaches.

    Advanced Pyppeteer Proxy Usage

    In addition to configuring proxies during browser initialization, there are some more advanced proxy-related functions available in Pyppeteer:

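    Limit Which Hosts Use the Proxy

    Pyppeteer has no page.route() method (that API belongs to Playwright), and its request interception cannot assign a proxy to individual requests: Chromium applies --proxy-server to the whole browser. What you can do is exempt specific hosts from the proxy with Chromium's --proxy-bypass-list flag:

    browser = await launch(args=[
        f'--proxy-server={PROXY}',
        '--proxy-bypass-list=static.target.com',
    ])

    Now requests to static.target.com (a placeholder host) go out directly, while all other traffic, including API calls to target.com, uses your proxy. Keeping bulky static assets off the proxy like this prevents overuse of IPs on high-traffic sites, but remember that the bypassed requests expose your real IP.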

    Create a Proxy Launch Helper

    For complex scraping workflows, you may want proxy handling logic that is reusable across scripts.

    Pyppeteer's launch() has no middleware hook, but a small wrapper function gives you one place to configure proxies for every browser you start:

    async def launch_with_proxy(proxy, **options):
        # Proxy logic like rotating IPs can live here
        args = list(options.pop('args', [])) + [f'--proxy-server={proxy}']
        return await launch(args=args, **options)

    # Every script launches its browser through the same helper
    browser = await launch_with_proxy(PROXY)

    This keeps your proxy configuration consolidated rather than scattered throughout scripts.

    Proxy Best Practices

    Beyond setup and management, adhering to smart practices when using proxies for scraping can help avoid bot protections:

    Mix Different Proxy Types

    Use a blend of residential, datacenter, and rotating proxies to better simulate natural browsing patterns from multiple environments. Don't rely solely on a single proxy pool.

    Add Random Page Delays

    Humans don't click and scroll instantly between pages. Adding small randomized delays to your Pyppeteer scraping scripts helps avoid bot detection. Use asyncio.sleep() rather than time.sleep(), which would block the event loop that Pyppeteer runs on.
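
A randomized pause can be sketched as below. The helper name human_pause and the 1 to 3 second default range are arbitrary choices for illustration. It awaits asyncio.sleep() so the event loop that drives the browser keeps running during the wait.

```python
import asyncio
import random


async def human_pause(min_seconds=1.0, max_seconds=3.0):
    """Sleep for a random duration without blocking the event loop."""
    delay = random.uniform(min_seconds, max_seconds)
    await asyncio.sleep(delay)
    return delay


# Usage inside an async scraping function:
#     await page.goto(first_url)
#     await human_pause()
#     await page.goto(second_url)
```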

    Disable Caching When Rotating IPs

    Browsers often cache resources like JS files and images to speed up subsequent loads. This can cause scrapers issues when rotating IPs constantly.

    You can disable caching per page with page.setCacheEnabled(False) to prevent resources being tied to single IPs during proxy rotation.
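
In Pyppeteer, caching is controlled per page through page.setCacheEnabled(). A minimal sketch, assuming a browser object returned by launch() and using the hypothetical helper name open_uncached_page:

```python
async def open_uncached_page(browser):
    """Open a new page with the browser cache turned off for that page."""
    page = await browser.newPage()
    # Every request from this page now bypasses the cache.
    await page.setCacheEnabled(False)
    return page


# Usage inside an async function (requires pyppeteer):
#     browser = await launch(args=[f'--proxy-server={PROXY}'])
#     page = await open_uncached_page(browser)
```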

    Following these proxy best practices helps decrease the chances of sites flagging your web scrapers over time.

    Alright, that wraps up my insights on using proxies with Pyppeteer for web scraping!

    I aimed to provide a thorough walkthrough of concepts, setup guidance, configuration details, advanced usage, best practices, and management advice based on my own experiences.

    Hopefully this gives you a strong foundation for integrating proxies smoothly into your own Pyppeteer projects. Scraping results can vary drastically based on how well proxies are leveraged.

    If configuring and handling intricate proxy rotation logic still seems daunting, services like Proxies API provide a managed proxy solution requiring zero DevOps upkeep.

    Proxies API abstracts away all underlying proxy complexities through a simple API, allowing you to focus efforts on other scraping tasks instead.

    I suggest giving them a try with 1000 free requests. Their infrastructure can support heavy scraping needs that would require large dedicated proxy pools otherwise.

    With that, happy proxying and happy scraping! Let me know if you have any other questions.

    Frequently Asked Questions

    Q: What is the difference between Pyppeteer and Puppeteer?

    A: Puppeteer is a Node.js library while Pyppeteer ports the same browser automation capabilities to Python. Under the hood, both leverage the Chrome DevTools Protocol to control Chromium.

    Q: Which is better - Pyppeteer or Selenium for browser testing?

    A: Pyppeteer drives Chromium directly over the DevTools Protocol for faster, more lightweight browser control. Selenium supports more browsers but needs a matching browser driver executable, plus a Selenium server for remote or grid setups.

    Q: Why do we need proxies for web scraping?

    A: Proxies hide your real IP address, and with rotation each request can appear to come from a different location. This makes your traffic look more like that of ordinary visitors and helps avoid IP-based anti-bot protections.

    Q: What are the different types of proxies?

    A: Common proxy types are static IPs, rotating proxies, residential proxies, and datacenter proxies. Each has tradeoffs between cost, bot detection avoidance, and ease of use.

    Q: How can I check if a proxy is working correctly?

    A: Try visiting a site like https://www.whatismyproxy.com/ through your configured proxy. It will display the outward-facing IP address seen by that website, which should match your proxy.
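
The same check can be automated from a script. The sketch below is an assumption-laden illustration: proxy_ip_visible is a hypothetical helper, and it simply looks for the expected proxy IP anywhere in the page text, so adapt it to whatever IP-echo page you actually use.

```python
async def proxy_ip_visible(page, expected_ip, check_url='https://www.whatismyproxy.com/'):
    """Return True if `expected_ip` appears in the text of the IP-check page."""
    await page.goto(check_url)
    text = await page.evaluate('() => document.body.innerText')
    return expected_ip in text


# Usage inside an async function (requires pyppeteer):
#     browser = await launch(args=['--proxy-server=123.45.6.78:8080'])
#     page = await browser.newPage()
#     ok = await proxy_ip_visible(page, '123.45.6.78')
```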

