How to Use Proxy in Playwright in 2024

Jan 9, 2024 · 9 min read

As a web scraper, getting blocked by target sites is one of the most frustrating issues you can run into. No matter how carefully crafted your scrapers are, aggressive bot detection and IP blocks can bring your data collection to a grinding halt.

This is where proxies come into play. By routing your requests through an intermediary server, proxies allow you to mask your real IP address and avoid blocks for longer.

In this comprehensive guide, we'll cover everything you need to know about integrating proxies into your Playwright web scraping projects, including:

  • Why proxies help avoid blocks
  • How to set up proxies in Playwright scripts
  • Proxy authentication, protocols, and other advanced features
  • Intercepting network traffic for debugging
  • Best practices for improved scraping with proxies
  • Leveraging a proxy API service for easier management

We'll support each concept with detailed code examples in Python and NodeJS so you can quickly apply what you learn.

    By the end, you'll understand all aspects of Playwright proxies to supercharge your web scraping endeavors. Let's get started!

    Why Use Proxies for Web Scraping?

    Before jumping into the implementation, it's worth understanding why proxies help circumvent blocks in web scraping.

    When you send requests directly from your machine to a target website, every connection arrives from your real IP address. Many sites track these IP addresses to detect repeat visits and block suspicious traffic.

    Proxies add an intermediary layer that forwards your requests through a remote server. So instead of your IP, the target site sees the IP of the proxy as the source of each request.

    With your real IP masked, the site cannot tie the traffic back to you as easily, which makes IP-based blocking less effective. This allows you to scrape uninterrupted for longer periods.

    Some key benefits include:

  • Avoid IP blocks: Masks real IP, making blocking harder
  • Improve success rate: Requests spread across many proxy IPs are less likely to hit a per-IP rate limit or ban
  • Scrape from different locations: Proxies give you additional IP addresses from desired geo-locations
  • Debug traffic easily: Inspect outgoing requests and incoming responses

    Now let's see this in action by setting up our first Playwright proxy.

    Setting Up Proxies in Playwright

    Adding proxies to Playwright scripts involves:

    1. Choosing a suitable proxy provider
    2. Configuring proxy settings in code
    3. Making requests through proxy IPs

    I recommend using premium residential proxies for web scraping to get reliable uptime and avoid blocks.

    Once you obtain proxies, supplying credentials in Playwright is straightforward.

    Launching Browser with Proxy Parameters

    Here's an example in Python:

    import asyncio

    from playwright.async_api import async_playwright

    async def main():
        async with async_playwright() as p:
            browser = await p.chromium.launch(
                proxy={
                    "server": "proxy_ip:port",
                    "username": "proxy_user",
                    "password": "proxy_pass"
                }
            )
            context = await browser.new_context()

            # Begin scraping...

            await browser.close()

    asyncio.run(main())
    

    We launch the Chromium browser and pass the proxy server and credentials in the proxy parameter. This routes all of the browser's traffic through the chosen proxy.

    Similarly in NodeJS:

    const { chromium } = require('playwright');
    
    (async () => {
      const browser = await chromium.launch({
        proxy: {
          server: 'proxy_ip:port',
          username: 'proxy_user',
          password: 'proxy_pass'
        }
      });
      const context = await browser.newContext();
    
      // Start scraping...
    
    })();
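
Providers often hand out proxies as a single URL rather than as separate fields. The sketch below assumes a URL of the form `scheme://user:pass@host:port` and splits it into the `server`, `username`, and `password` keys used in the examples above. The helper name `proxy_from_url` is hypothetical, and the address is a documentation placeholder.

```python
from urllib.parse import unquote, urlsplit


def proxy_from_url(proxy_url: str) -> dict:
    """Split a proxy URL into the dict shape passed to launch(proxy=...)."""
    parts = urlsplit(proxy_url)
    if not parts.scheme or not parts.hostname or not parts.port:
        raise ValueError(f"expected scheme://host:port, got {proxy_url!r}")

    settings = {"server": f"{parts.scheme}://{parts.hostname}:{parts.port}"}
    # Credentials are optional. Percent-decoding lets passwords contain
    # reserved characters such as '@' or ':'.
    if parts.username:
        settings["username"] = unquote(parts.username)
    if parts.password:
        settings["password"] = unquote(parts.password)
    return settings


# Placeholder values; substitute the URL from your provider.
example = proxy_from_url("http://proxy_user:proxy_pass@203.0.113.10:8080")
```

The resulting dict can be passed directly as `proxy=example` when launching the browser.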
    

    That covers the basics of getting started with Playwright proxies!

    Next, we'll explore some advanced proxy configurations.

    Advanced Proxy Usage

    Setting up a single proxy is great, but often you need:

  • Proxy authentication
  • Support for different protocols
  • Rotating through multiple proxies

    Let's tackle each to level up your proxy game.

    Playwright Proxy Authentication

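    For authenticated proxy access, pass the username and password fields along with the proxy server, as shown earlier.

    If the proxy uses IP-based authentication (your machine's IP is allowlisted with the provider), no credentials are needed: omit username and password and pass only the server.

    Some providers expect a custom authorization header instead. Playwright's proxy option accepts only server, bypass, username, and password, so there is no headers field to carry such a value. One workaround is to run a small local forwarding proxy that adds the header and point Playwright at it:

    browser = await p.chromium.launch(
       proxy={
         # Hypothetical local forwarder that injects the provider's
         # Proxy-Authorization header before relaying upstream
         "server": "http://127.0.0.1:8899"
       }
    )
    

    This keeps provider-specific authentication outside your scraping code.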
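
Whichever authentication scheme you use, it helps to keep credentials out of the script itself. A minimal sketch, assuming the credentials live in environment variables named `PROXY_SERVER`, `PROXY_USERNAME`, and `PROXY_PASSWORD` (these names and the helper `proxy_from_env` are illustrative, not a Playwright convention):

```python
import os


def proxy_from_env(environ=os.environ) -> dict:
    """Build Playwright proxy settings from environment variables."""
    server = environ.get("PROXY_SERVER")
    if not server:
        raise RuntimeError("PROXY_SERVER is not set")

    settings = {"server": server}
    username = environ.get("PROXY_USERNAME")
    if username:
        # Only attach credentials when a username is present, so that
        # IP-allowlisted proxies work with the server value alone.
        settings["username"] = username
        settings["password"] = environ.get("PROXY_PASSWORD", "")
    return settings
```

You would then launch with `p.chromium.launch(proxy=proxy_from_env())` and set the variables in your shell or deployment environment.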

    Configuring Proxy Protocols

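    Playwright works with the common proxy types. The scheme in the server value describes how the browser talks to the proxy, not which sites you can reach through it:

  • HTTP - The most common type. HTTPS sites are still reachable because the browser tunnels them through the proxy.
  • HTTPS - Same as HTTP, but the connection between the browser and the proxy is itself TLS-encrypted.
  • SOCKS - A lower-level protocol that relays arbitrary TCP traffic.

    Set the scheme in the server value to match the type of proxy your provider offers: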

    # HTTP
    browser = await p.chromium.launch(
       proxy={
         "server": "http://proxy_ip:port"
       }
    )
    
    # SOCKS
    browser = await p.chromium.launch(
       proxy={
         "server": "socks5://proxy_ip:port"
       }
    )
    

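    If unsure, check your provider's documentation. Most providers expose HTTP proxy endpoints, and Playwright treats the short host:port form as an HTTP proxy.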
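
To catch typos in the scheme before a browser is launched, you can normalize the server string up front. This is a small sketch; the helper name `normalize_server` is hypothetical, and the set of accepted schemes is an assumption you should adjust to what your browser and provider actually support.

```python
# Assumed set of schemes; adjust to what your browser and provider accept.
SUPPORTED_SCHEMES = {"http", "https", "socks4", "socks5"}


def normalize_server(server: str) -> str:
    """Return the server string with an explicit, validated scheme."""
    if "://" not in server:
        # The short host:port form is treated as an HTTP proxy.
        return f"http://{server}"
    scheme = server.split("://", 1)[0].lower()
    if scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported proxy scheme: {scheme!r}")
    return server
```

For example, `normalize_server("proxy_ip:8080")` yields an explicit `http://` server value, while a mistyped scheme such as `sock5://` raises an error immediately.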

    Next, let's spice things up further with rotating, random proxies.

    Implementing a Rotating Proxy Pool

    Using the same proxy repeatedly risks getting it blocked. The solution? Rotate between multiple proxies for each request.

    First, generate a pool of proxies from your provider's dashboard. For example:

    proxy_pool = [
      {"server": "proxy1_ip:port"},
      {"server": "proxy2_ip:port"},
      # ... more proxies
    ]
    

    Now select a random proxy for each browser launch:

    import random
    
    # Choose random proxy
    proxy = random.choice(proxy_pool)
    
    browser = await p.chromium.launch(proxy=proxy)
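
Random selection can pick the same proxy several times in a row. If you would rather spread launches evenly, a round-robin rotator is one alternative. The sketch below uses `itertools.cycle`; the name `make_rotator` is hypothetical.

```python
from itertools import cycle


def make_rotator(proxy_pool):
    """Return a zero-argument function that walks the pool in order, forever."""
    if not proxy_pool:
        raise ValueError("proxy_pool must contain at least one proxy")
    pool_cycle = cycle(list(proxy_pool))
    return lambda: next(pool_cycle)
```

Create the rotator once with `next_proxy = make_rotator(proxy_pool)` and call `next_proxy()` each time you launch a browser.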
    

    To go further, pick a new proxy for every page you visit:

    import asyncio
    import random

    from playwright.async_api import async_playwright

    async def scrape_page(page):
      # Scrape current page...
      pass

    async def main():
      # Start Playwright once and reuse it for every launch
      async with async_playwright() as p:
        for _ in range(10): # Loop through pages
          # Pick a fresh proxy for this page
          proxy = random.choice(proxy_pool)

          browser = await p.chromium.launch(proxy=proxy)
          context = await browser.new_context()
          page = await context.new_page()

          await scrape_page(page)

          await context.close()
          await browser.close()

    asyncio.run(main())
    
    

    This picks a proxy at random for each page, so consecutive requests rarely share an IP and per-IP blocking becomes much less effective.
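
Relaunching the browser for every page is slow. Playwright also accepts a `proxy` argument on `browser.new_context()`, so one browser can serve many contexts, each with its own proxy. The sketch below assumes that per-context proxies work in your Playwright version and browser (older releases had extra requirements here, so check the documentation for your version); the function name and the placeholder pool are illustrative.

```python
import asyncio
import random


async def titles_via_context_proxies(browser, urls, proxy_pool):
    """Visit each URL in a fresh context that has its own proxy."""
    titles = {}
    for url in urls:
        proxy = random.choice(proxy_pool)
        # The proxy applies only to this context, not the whole browser.
        context = await browser.new_context(proxy=proxy)
        try:
            page = await context.new_page()
            await page.goto(url)
            titles[url] = await page.title()
        finally:
            # Always release the context, even if navigation fails.
            await context.close()
    return titles


async def main():
    # Imported here so the helper above stays importable without Playwright.
    from playwright.async_api import async_playwright

    proxy_pool = [{"server": "http://203.0.113.10:8080"}]  # placeholder proxy
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        print(await titles_via_context_proxies(browser, ["https://example.com"], proxy_pool))
        await browser.close()
```

Run it with `asyncio.run(main())`. Because contexts are much cheaper to create than browser processes, this pattern tends to suit larger crawls better.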

    Now let's learn how proxies assist in intercepting and debugging network requests.

    Intercepting Network Traffic

    An invaluable aspect of Playwright is the ability to intercept requests and responses. When using proxies, this becomes even more beneficial.

    Common use cases for network interception with proxies include:

  • Logging requests and responses
  • Modifying headers and parameters
  • Mocking API responses
  • Debugging block errors

    Logging Proxy Traffic

    To start, you can simply log Playwright network events via page.on():

    page.on('request', request => {
      console.log('Request URL:', request.url());
    
      // Log other proxy request details like method, headers
    });
    
    page.on('response', response => {
      console.log('Response status:', response.status());
    
      // Log other proxy response details like headers
    });
    

    This provides tremendous visibility into all proxied traffic.
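
The same logging can be done from Python, where `page.on()` takes the event name and a callback. In the Python API, `request.method`, `request.url`, `response.status`, and `response.url` are properties rather than methods. The helper names and the `sink` parameter below are assumptions made for this sketch.

```python
def describe_request(request) -> str:
    """Format one outgoing request as a single log line."""
    return f"-> {request.method} {request.url}"


def describe_response(response) -> str:
    """Format one incoming response as a single log line."""
    return f"<- {response.status} {response.url}"


def attach_traffic_logging(page, sink=print):
    """Send a line to `sink` for every request and response on the page."""
    page.on("request", lambda request: sink(describe_request(request)))
    page.on("response", lambda response: sink(describe_response(response)))
```

Call `attach_traffic_logging(page)` right after creating the page. Passing a different `sink`, such as a logger method or `list.append`, makes the output easy to capture.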

    Alternatively, you can intercept requests for further analysis:

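    await page.route('**/*', async route => {
    
      // Fetch the original response through the proxy
      const response = await route.fetch();
    
      console.log(response.headers()); // Log proxied headers
    
      // Hand the fetched response to the page instead of
      // sending the request a second time
      await route.fulfill({ response });
    });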
    

    The route handler gives you complete control over requests before they are sent. This brings us to our next section...

    Modifying Network Requests

    Beyond logging, the same route handler lets you tweak request details on the fly before they leave through the proxy:

    await page.route('https://target.site/*', async route => {
    
      // Playwright reports header names in lower case
      const headers = { ...route.request().headers() };
    
      // Remove or modify headers
      delete headers['user-agent'];
    
      await route.continue({ headers });
    });
    

    You can transform parameters, headers, cookies, and more. This helps your requests look closer to regular browser traffic and avoid obvious bot patterns.
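
In Python the equivalent call is `route.continue_(headers=...)`. A minimal sketch, with the header logic kept in a plain function so it can be checked without a browser; `rewrite_headers`, `handle_route`, and the `x-debug-token` header are hypothetical names chosen for illustration.

```python
def rewrite_headers(headers, overrides=None, drop=()):
    """Return a lower-cased copy of headers with some removed or replaced."""
    drop_set = {name.lower() for name in drop}
    result = {
        name.lower(): value
        for name, value in headers.items()
        if name.lower() not in drop_set
    }
    for name, value in (overrides or {}).items():
        result[name.lower()] = value
    return result


async def handle_route(route):
    """Route handler that adjusts headers before the request is sent."""
    headers = rewrite_headers(
        route.request.headers,
        overrides={"accept-language": "en-US,en;q=0.9"},
        drop=["x-debug-token"],  # hypothetical header to strip
    )
    await route.continue_(headers=headers)
```

Register it with `await page.route("https://target.site/*", handle_route)`.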

    Mocking API Responses

    Take it up another notch by mocking API responses completely. This avoids hitting sites unnecessarily:

    import json
    
    mock_data = json.dumps({'key': 'value'})
    
    # Answer matching requests locally instead of sending them out
    await page.route(
      "https://api.example.com/data",
      lambda route: route.fulfill(
        status=200,
        content_type="application/json",
        body=mock_data
      )
    )
    
    await page.click("button.fetch") # Clicked but data is mocked
    

    We cut out the API dependency for reliable testing. Because mocked routes are answered locally, they also consume no proxy bandwidth.

    While I've only covered basics here, you can build an entire mocking framework on top leveraging proxies and Playwright capabilities!

    Now that we've explored all proxy features in depth, let's shift gears to troubleshooting and best practices.

    Troubleshooting and Best Practices

    Proxies add complexity so issues inevitably crop up. Following some guidelines goes a long way in avoiding headaches:

    Use Multiple Providers

    Depending on one proxy provider is risky if they have mass IP failures. Use a blended portfolio:

    proxy_pool = [
       # Luminati proxies
       {"server": "lum_proxy1:port"},
    
       # Smartproxy proxies
       {"server": "smart_proxy2:port"},
    
       # GeoSurf proxies
       {"server": "geo_proxy3:port"},
    ]
    

    This insulates you against provider-wide blocks.
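
A blended pool is most useful when failing proxies are taken out of rotation. The following is a minimal sketch of that idea; the `ProxyPool` class, its method names, and the `max_failures` threshold are assumptions for illustration, not part of Playwright.

```python
import random


class ProxyPool:
    """Pick proxies at random, skipping ones that keep failing."""

    def __init__(self, proxies, max_failures=3):
        self._proxies = list(proxies)
        self._failures = {proxy["server"]: 0 for proxy in self._proxies}
        self._max_failures = max_failures

    def healthy(self):
        """Return the proxies still below the failure threshold."""
        return [
            proxy for proxy in self._proxies
            if self._failures[proxy["server"]] < self._max_failures
        ]

    def pick(self):
        candidates = self.healthy()
        if not candidates:
            raise RuntimeError("no healthy proxies left in the pool")
        return random.choice(candidates)

    def report_failure(self, proxy):
        self._failures[proxy["server"]] += 1

    def report_success(self, proxy):
        # A success resets the counter, so transient errors are forgiven.
        self._failures[proxy["server"]] = 0
```

Call `report_failure()` when a navigation times out or returns a block page, and `report_success()` otherwise, so that `pick()` gradually steers traffic toward proxies that work.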

    Enable Debug Logs

    Playwright offers good debugging capabilities. Enable its debug logs through the DEBUG environment variable, then confirm which IP the target actually sees:

    // 1. Run the script with API logging enabled:
    //    DEBUG=pw:api node script.js
    const browser = await chromium.launch({
      proxy: /*...*/ ,
      headless: false
    });
    
    // 2. Load an IP echo service (httpbin.org/ip is one example)
    const page = await browser.newPage();
    await page.goto('https://httpbin.org/ip');
    console.log(await page.textContent('body'));
    

    The page prints the IP address the site received the request from. It should be your proxy's IP.

    If your actual IP shows, proxies are not configured correctly.

    Use a Proxy Manager

    Manually handling IPs, protocols, credentials across providers becomes chaotic quickly.

    Proxy manager tools like ProxyCannon aim to abstract away this headache behind a simple API. The snippet below is illustrative only; the exact module and method names depend on the tool you choose, so check its documentation:

    import ProxyCannon
    
    # Illustrative interface, not a verified API
    provider = ProxyCannon.create_provider({
      "luminati": {
         "customer_id": "lum_cust_id",
         "zone": "static"
        }
    })
    
    proxy = provider.get_proxy()
    
    browser = await p.chromium.launch(proxy=proxy)
    

    A proxy manager can handle authentication, geo-targeting, rotation and more across providers. It is worth using one if you deal with many proxies.

    While troubleshooting bad responses, don't forget to leverage request interception covered earlier. It enables inspecting proxied traffic for diagnosing problems quickly.

    This leads nicely to our final section - a game changing proxy API service to eliminate these issues completely!

    Leveraging a Proxy API Service

    At this point, the immense value of proxies is apparent. However, managing them involves:

  • Vetting multiple providers
  • Handling IP blocks gracefully
  • Updating credentials periodically
  • Checking proxy failures

    This overhead subtracts precious time from actual scraping.

    Wouldn't it be great if a service existed that simplified proxy headaches into a single API call?

    Introducing Proxies API.

    Proxies API handles all proxy complexities through a developer-friendly API tailored for web scraping.

    Here are some key benefits:

  • Rotates highly optimized residential proxies automatically to avoid blocks
  • Built-in support for handling CAPTCHAs, cookies, headers
  • Easy debugging with request and response interception
  • Global locations to simulate geo-distributed traffic
  • Fast speeds of up to 1 Gbps
  • Powerful browser rendering with Playwright, Puppeteer and Selenium
  • Generous free tier to get started

    Enough talk, let's see it in action:

    import requests
    import json
    
    api_key = "BUndDSmRhRz_N1w"
    
    api_url = f"http://api.proxiesapi.com/?api_key={api_key}&url=https://target.com"
    
    headers = {
       "Accept": "application/json"
    }
    
    response = requests.get(api_url, headers=headers)
    html = json.loads(response.text)["html"]
    
    print(html[:100]) # First 100 chars of HTML
    

    With just a few lines of code, Proxies API returns the target site's rendered HTML while handling proxies and bot mitigation for you.
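
One detail to watch is that the target URL is itself passed as a query parameter, so any `?` or `&` inside it must be percent-encoded or the API will see a truncated URL. A small sketch, assuming the endpoint and the `api_key` and `url` parameter names shown above; the helper name `build_api_url` is hypothetical.

```python
from urllib.parse import urlencode

API_ENDPOINT = "http://api.proxiesapi.com/"  # endpoint as used in the example above


def build_api_url(api_key: str, target_url: str) -> str:
    """Build the request URL with the target URL safely percent-encoded."""
    query = urlencode({"api_key": api_key, "url": target_url})
    return f"{API_ENDPOINT}?{query}"
```

With this helper, a target such as `https://target.com/search?q=shoes&page=2` keeps its own query string intact when it reaches the API.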

    The free tier includes 1000 requests to get started. Sign up and simplify your Playwright scraping today.
