The Ultimate Goutte Cheat Sheet for PHP

Oct 31, 2023 · 5 min read

Goutte is a battle-tested PHP web scraping library. This reference covers its day-to-day API. Note that the project was archived in 2023 in favor of Symfony's own HttpBrowser (which Goutte 4 thinly wraps), so most snippets here carry over to symfony/browser-kit unchanged.



Installation

Install via Composer:

composer require fabpot/goutte

Client Configuration

Set user agent:

$client = new Goutte\Client();
$client->setServerParameter('HTTP_USER_AGENT', 'Firefox');

Set timeouts (Goutte has no timeout setters of its own; configure the underlying HTTP client, e.g. Guzzle under Goutte 3):

$client->setClient(new \GuzzleHttp\Client([
    'connect_timeout' => 30, // seconds to establish the connection
    'timeout' => 90,         // seconds for the entire request
]));

Handle cookies:

$client->getCookieJar()->set(new \Symfony\Component\BrowserKit\Cookie('session', 'foo'));

Custom client:

$stack = \GuzzleHttp\HandlerStack::create();
$guzzle = new \GuzzleHttp\Client(['handler' => $stack]);
$goutteClient = new Goutte\Client();
$goutteClient->setClient($guzzle); // Goutte 3; in Goutte 4 pass a Symfony HttpClient to the constructor instead

Making Requests

GET request:

$crawler = $client->request('GET', 'https://example.com/products'); // use an absolute URL for the first request

POST request:

$crawler = $client->request('POST', '/login', ['username' => '', 'password' => '']);

Upload files (the fourth argument takes a $_FILES-style array):

$crawler = $client->request('POST', '/upload', [], [
    'photo' => ['tmp_name' => $path, 'name' => basename($path)],
]);

Attach session:

$client->getCookieJar()->set($sessionCookie); // a \Symfony\Component\BrowserKit\Cookie instance

Follow redirects:

$client->followRedirects(true); // on by default
$crawler = $client->request('GET', $url);

Selecting Elements

CSS selector:

$els = $crawler->filter('div > span.title');

XPath expression:

$els = $crawler->filterXPath('//h1[@class="headline"]');

Combining CSS and XPath (filter calls chain):

$els = $crawler->filter('div.content')->filterXPath('.//a[@href]');
Matching text:

$crawler->filterXPath('//p[contains(text(), "Hello")]');

Pagination links:

$next = $crawler->selectLink('Next Page')->link();
$crawler = $client->click($next);

Extracting Data

Get text:

$text = $el->text();

Get inner HTML:

$html = $el->html();

Get outer HTML:

$html = $el->outerHtml();

Get attribute:

$url = $el->attr('href');

Get raw response:

$response = $client->getResponse();

Interacting with Pages

Click link:

$link = $crawler->selectLink('Next')->link();
$crawler = $client->click($link);

Submit form:

$form = $crawler->selectButton('Submit')->form();
$crawler = $client->submit($form);

Upload file:

$form = $crawler->selectButton('Upload')->form();
$form['file']->upload('/path/to/file');
$crawler = $client->submit($form);

Scroll page:

Goutte does not execute JavaScript, so scrolling (and anything lazy-loaded on scroll) is out of reach; a real browser driver such as Symfony Panther is needed:

$pantherClient->executeScript('window.scrollTo(0, document.body.scrollHeight)');

Handling Responses

Check status code:

$statusCode = $client->getResponse()->getStatusCode();

if ($statusCode === 200) {
  // Success
}
Get response headers:

$headers = $client->getResponse()->getHeaders();

Get response body:

$html = $client->getResponse()->getContent();

Debugging and Logging

Log requests via Guzzle middleware (Goutte 3, where the transport is Guzzle; Middleware::log requires a message formatter):

$logger = new \Monolog\Logger('goutte');
$logger->pushHandler(new \Monolog\Handler\StreamHandler('php://stderr'));

$stack = \GuzzleHttp\HandlerStack::create();
$stack->push(\GuzzleHttp\Middleware::log($logger, new \GuzzleHttp\MessageFormatter()));

$client = new \Goutte\Client();
$client->setClient(new \GuzzleHttp\Client(['handler' => $stack]));

Mocking Responses

Mock response:

use GuzzleHttp\Handler\MockHandler;

$mock = new MockHandler([
  new \GuzzleHttp\Psr7\Response(200, ['Content-Type' => 'text/html'], '<html>...</html>'),
]);

$handler = \GuzzleHttp\HandlerStack::create($mock);
$client = new Goutte\Client();
$client->setClient(new \GuzzleHttp\Client(['handler' => $handler]));

Rate Limiting

Limit requests per second:

Guzzle has no built-in throttle middleware; the simplest approach is to pause between requests (third-party middleware such as spatie/guzzle-rate-limiter-middleware can do this more precisely):

foreach ($urls as $url) {
  $crawler = $client->request('GET', $url);
  // ... scrape ...
  usleep(100000); // ~10 requests per second
}

Dynamic throttling (SomeProvider\DynamicThrottleMiddleware is a placeholder for whatever middleware you use):

$stack = \GuzzleHttp\HandlerStack::create();
$stack->push(new \SomeProvider\DynamicThrottleMiddleware());
$client = new Goutte\Client();
$client->setClient(new \GuzzleHttp\Client(['handler' => $stack]));

Asynchronous Requests

Concurrent requests (Goutte itself is synchronous; issue async requests through a Guzzle client directly):

use GuzzleHttp\Promise\Utils;

$guzzle = new \GuzzleHttp\Client();
$promises = [
  'page1' => $guzzle->requestAsync('GET', $url1),
  'page2' => $guzzle->requestAsync('GET', $url2),
];

$results = Utils::unwrap($promises);

Real World Use Cases

  • Large scale web archiving
  • Dynamic scraping against modern JS sites
  • Cloud based web automation
  • Distributed scraping with multiple clients
  • Scrapers for research papers
  • Automated financial reports
  • Creating training datasets for ML
  • Regression testing UIs with visual diffs
  • Scraping data from web API responses
  • Migrating content between CMSs
  • Price monitoring and alerting
  • Tracking website changes
  • Comparing product prices
  • Monitoring domains for brand abuse
  • Building news aggregators
  • Public data mining and analysis
  • Processing HTML datasets
  • Scraping geospatial data for mapping
Using with Other Libraries

Goutte's request() already returns a Symfony DomCrawler instance, so DomCrawler's full filtering API is available directly:

$crawler = $client->request('GET', $url);
$filtered = $crawler->filter('div.content');

Batching and Concurrency

Goutte ships no batch client; for large scrapes, send the requests concurrently with Guzzle and settle the promises (settle, unlike unwrap, resolves even if some requests fail):

$guzzle = new \GuzzleHttp\Client();
$promises = [
  'page1' => $guzzle->requestAsync('GET', $url1),
  'page2' => $guzzle->requestAsync('GET', $url2),
];
$results = \GuzzleHttp\Promise\Utils::settle($promises)->wait();

Best Practices

Respect robots.txt: fetch it once before crawling and skip any path the site disallows for your crawler.
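A minimal sketch of such a check, assuming the robots.txt body has already been fetched as a string. isPathAllowed is a helper invented here; it honors only "User-agent: *" Disallow lines and ignores Allow rules, wildcards, and per-bot groups:

```php
<?php

// Decide whether $path may be crawled according to a robots.txt body.
// Simplified: only "User-agent: *" groups and plain Disallow prefixes.
function isPathAllowed(string $robotsTxt, string $path): bool
{
    $appliesToUs = false;
    foreach (preg_split('/\r?\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // strip comments
        if ($line === '') {
            continue;
        }
        if (stripos($line, 'User-agent:') === 0) {
            $appliesToUs = trim(substr($line, 11)) === '*';
        } elseif ($appliesToUs && stripos($line, 'Disallow:') === 0) {
            $rule = trim(substr($line, 9));
            if ($rule !== '' && strpos($path, $rule) === 0) {
                return false; // path matches a disallowed prefix
            }
        }
    }
    return true;
}
```

Cache the parsed rules and gate every request() call on the check rather than re-fetching robots.txt per page.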
Implement rate limiting:

usleep(100000); // pause ~100ms between requests (roughly 10 rps)

Avoid overloading servers by capping concurrency, e.g. with Guzzle's Pool:

$pool = new \GuzzleHttp\Pool($guzzle, $requests, ['concurrency' => 10]); // only 10 concurrent requests
$pool->promise()->wait();

Scraping JavaScript Sites

Goutte cannot render JavaScript. Use a real browser via Symfony Panther (or headless Chrome driven by Puppeteer from Node) instead:

$client = \Symfony\Component\Panther\Client::createChromeClient();
$crawler = $client->request('GET', $url);
$html = $crawler->html();

Persisting Scraped Data

Save to a JSON file:

$data = $crawler->filter('.listing')->each(function ($node) {
  return $node->text();
});

file_put_contents('listings.json', json_encode($data));

Debugging Tips

Enable Guzzle debug logging (Goutte 3; Middleware::log needs a formatter):

$stack->push(\GuzzleHttp\Middleware::log($logger, new \GuzzleHttp\MessageFormatter(), \Psr\Log\LogLevel::DEBUG));

Inspect headers and response codes:

$response = $client->getResponse();
$headers = $response->getHeaders();
$statusCode = $response->getStatusCode();

Proxy and User Agent Rotation

Rotate user agents to avoid blocks:

$agents = ['Firefox', 'Chrome', ...];
$client->setServerParameter('HTTP_USER_AGENT', $agents[array_rand($agents)]);

Use proxies for IP rotation (Goutte 4, via the underlying Symfony HttpClient):

$client = new \Goutte\Client(
  \Symfony\Component\HttpClient\HttpClient::create(['proxy' => 'http://proxy-host:8080']) // illustrative address
);

Useful Goutte Libraries

  • goutte-scraper - Scraper with batteries included
  • laravel-goutte - Laravel integration
  • guzzle-crawler - Powerful crawling framework

Real World Examples

Scrape pricing data:

$prices = $crawler->filter('.price')->each(function ($node) {
  return $node->text();
});

Extract contact info:

$contacts = $crawler->filter('.contact-list')->each(function ($node) {
  return $node->filter('a')->each(function ($link) {
    return $link->text();
  });
});

Handling captchas:

// Option 1: Use a solving service like AntiCaptcha
// Option 2: Rotate proxies and retry on detection
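Option 2 can be sketched as a retry loop with exponential backoff; fetchWithRetry and its "captcha" substring heuristic are illustrative helpers, not part of Goutte:

```php
<?php

// Retry a fetch when the response looks like a CAPTCHA wall. The caller's
// closure receives the attempt number so it can rotate proxies/user agents.
function fetchWithRetry(callable $fetch, int $maxAttempts = 3, int $baseDelayMs = 500): ?string
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $html = $fetch($attempt);
        $blocked = $html === null || stripos($html, 'captcha') !== false;
        if (!$blocked) {
            return $html;
        }
        if ($attempt < $maxAttempts) {
            // exponential backoff: 500ms, 1s, 2s, ...
            usleep($baseDelayMs * (2 ** ($attempt - 1)) * 1000);
        }
    }
    return null; // still blocked after all attempts
}
```

In practice the closure would swap the proxy and user agent before each retry; a fixed heuristic string is rarely enough to detect every challenge page.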

Scraping paginated content:

while ($crawler->filter('.next-page')->count() > 0) {
  $nextPage = $crawler->selectLink('Next')->link();
  $crawler = $client->click($nextPage);
  // Scrape page
}
