Web Scraping Google Scholar in PHP

Jan 21, 2024 · 7 min read

In this beginner-friendly PHP tutorial, we will walk through a full web scraping script to extract search results data from Google Scholar. We won't get into the high-level details or ethics of web scraping - instead we'll jump straight into the code and I'll explain each part in detail.

This is the Google Scholar result page we are talking about…

Pre-requisites

Before running the web scraping script, you need to have:

PHP Engine

PHP 7.0+ with the cURL extension enabled is required. Most shared hosts provide this. For local testing, install XAMPP or a similar package that bundles the PHP engine.

Simple HTML DOM Parser

We use this excellent PHP library to parse and interact with HTML and XML documents.

Download simple_html_dom.php from: https://simplehtmldom.sourceforge.io/

And add it to your PHP project directory.

With those set up, let's get into the scraper!

Overview

Here is a high-level overview of what the script does:

  1. Imports Simple HTML DOM library
  2. Defines URL to scrape and headers
  3. Initializes cURL and sets options
  4. Makes request and gets HTML response
  5. Checks response is valid
  6. Parses HTML into a DOM document
  7. Uses DOM selectors to extract data
  8. Outputs extracted data
  9. Cleans up

Now let's walk through it section-by-section.

Import Simple HTML DOM

include('simple_html_dom.php');

We include the Simple HTML DOM parser library so we can easily interact with DOM elements later.

Define Target URL and Headers

// Define the URL of the Google Scholar search page
$url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=";

// Define a User-Agent header
$headers = [
  'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36',
];

We define the target URL to scrape - a Google Scholar search for "transformers".

We also add a User-Agent header so the request looks like it comes from a real browser, which makes it less likely to be flagged as a bot.
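
The hard-coded URL can be generalized to any search term. Here is a minimal sketch, assuming the query-string parameters seen in the URL above (hl, as_sdt, q, btnG) are all that is needed. The function name build_scholar_url is hypothetical and not part of the original script.

```php
<?php
// Hypothetical helper: builds a Google Scholar search URL for any query.
// The parameter names are copied from the URL used in this tutorial.
function build_scholar_url(string $query, string $lang = 'en'): string
{
    $params = [
        'hl'     => $lang,
        'as_sdt' => '0,5',   // encoded as 0%2C5, matching the tutorial URL
        'q'      => $query,
        'btnG'   => '',
    ];

    // http_build_query() URL-encodes each value (spaces become "+")
    return 'https://scholar.google.com/scholar?' . http_build_query($params, '', '&');
}

echo build_scholar_url('attention is all you need'), "\n";
```

Building the query string this way keeps special characters in the search term from breaking the URL.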

Initialize cURL Session

// Initialize cURL session
$ch = curl_init();

// Set cURL options
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

cURL allows making HTTP requests in PHP. We initialize a new cURL session, then configure our options:

  • CURLOPT_URL: The URL to request
  • CURLOPT_HTTPHEADER: Our defined headers
  • CURLOPT_RETURNTRANSFER: Return response directly rather than outputting
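
The three options listed above are the minimum. As a rough sketch, the same setup can be done in one call with curl_setopt_array(), along with a few optional extras such as timeouts and redirect handling. The helper name make_scraper_handle and the timeout values are assumptions for illustration, not settings from the original script.

```php
<?php
// Hypothetical helper: creates a cURL handle with the tutorial's three options
// plus a few optional extras. The extra values are illustrative, not tuned.
function make_scraper_handle(string $url, array $headers)
{
    $ch = curl_init();

    $ok = curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_HTTPHEADER     => $headers,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true, // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => 10,   // seconds allowed for connecting
        CURLOPT_TIMEOUT        => 30,   // seconds allowed for the whole request
        CURLOPT_ENCODING       => '',   // accept any compression cURL supports
    ]);

    if (!$ok) {
        throw new RuntimeException('Could not apply cURL options');
    }

    return $ch;
}

// No request is sent here; the handle is only configured
$ch = make_scraper_handle('https://scholar.google.com/scholar?q=transformers', [
    'User-Agent: Mozilla/5.0',
]);
```

Timeouts are worth considering in a scraper because, without them, a stalled connection can hang the script for a long time.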

Send Request and Check Response

    // Execute cURL session and get the response
    $response = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    
    // Check if the request was successful (status code 200)
    if ($response !== false && $status === 200) {
       // Scrape page
    } else {
       // Request failed
    }
    

We execute the cURL request and store the resulting HTML content.

It's good practice to check that the request succeeded before trying to parse the response.
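
To make that check more informative, the failure cases can be separated. The sketch below uses a hypothetical helper, describe_fetch_problem, that takes the return value of curl_exec(), the HTTP status code, and the cURL error text. 429 is the standard "Too Many Requests" status. Whether Google Scholar actually responds with it when it throttles a client is an assumption, not something this tutorial establishes.

```php
<?php
// Hypothetical helper: returns an empty string when the fetch looks usable,
// otherwise a short description of the problem.
function describe_fetch_problem($response, int $status, string $curlError = ''): string
{
    if ($response === false) {
        // curl_exec() returns false on transport errors (DNS, timeout, TLS, ...)
        return 'Transfer failed: ' . $curlError;
    }
    if ($status === 429) {
        return 'Rate limited (HTTP 429)';
    }
    if ($status !== 200) {
        return 'Unexpected HTTP status ' . $status;
    }
    if (trim($response) === '') {
        return 'Empty response body';
    }
    return '';
}

// In the scraper this could be called as:
// $problem = describe_fetch_problem($response, curl_getinfo($ch, CURLINFO_HTTP_CODE), curl_error($ch));
echo describe_fetch_problem(false, 0, 'Operation timed out'), "\n";
```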

Parse Response and Extract Data

Inspecting the code

You can see that the items are enclosed in a <div> element with the class gs_ri.

This is the key part - using Simple HTML DOM to interact with DOM elements and extract information.

    // Create a new instance of the Simple HTML DOM Parser
    $html = str_get_html($response);
    
    // Find all the search result blocks with class "gs_ri"
    $search_results = $html->find('div.gs_ri');
    

First we pass the HTML response into Simple HTML DOM, which parses it into a DOM document we can query.

We find all <div> elements with the class gs_ri - these contain the individual search results.

    // Loop through each search result block and extract information
    foreach ($search_results as $result) {
    
      // Extract the title and URL (guarding against a title without a link)
      $title_elem = $result->find('h3.gs_rt', 0);
      $title = $title_elem ? $title_elem->plaintext : "N/A";
      $link_elem = $title_elem ? $title_elem->find('a', 0) : null;
      $link = $link_elem ? $link_elem->href : "N/A";
    
      // Extract the authors
      $authors_elem = $result->find('div.gs_a', 0);
      $authors = $authors_elem ? $authors_elem->plaintext : "N/A";
    
      // Extract the abstract
      $abstract_elem = $result->find('div.gs_rs', 0);
      $abstract = $abstract_elem ? $abstract_elem->plaintext : "N/A";
    
      // Output extracted info
      echo "Title: " . $title . "\n";
      echo "URL: " . $link . "\n";
      echo "Authors: " . $authors . "\n";
      echo "Abstract: " . $abstract . "\n";
    
    }
    

We loop through the search result divs. For each one, we use find() to select elements and extract data:

  • Get the title heading from h3.gs_rt
  • Get its inner text with plaintext for the title
  • Find the nested link to get the URL
  • Get authors text from div.gs_a
  • Get abstract text from div.gs_rs

The data is then printed, giving the title, URL, authors and abstract for each search result.
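
The same extraction steps can also be sketched without a third-party library, using the DOM extension bundled with PHP (DOMDocument and DOMXPath). This is a swapped-in technique shown for comparison, not what this tutorial's script uses. The function names are hypothetical, and the sample markup is a hand-written stand-in built from the class names in this article rather than a real Google Scholar response.

```php
<?php
// Builds an XPath test for "element has this CSS class".
function has_class(string $class): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $class ')";
}

// Returns the trimmed text of the first node matching $query under $context, or "N/A".
function first_text(DOMXPath $xpath, string $query, DOMNode $context): string
{
    $node = $xpath->query($query, $context)->item(0);
    return $node ? trim($node->textContent) : 'N/A';
}

// Collects title, URL, authors and abstract for every div.gs_ri block.
function extract_results(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // real-world HTML is rarely valid; silence warnings
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $rows = [];
    foreach ($xpath->query('//div[' . has_class('gs_ri') . ']') as $result) {
        $link = $xpath->query('.//h3[' . has_class('gs_rt') . ']//a', $result)->item(0);
        $rows[] = [
            'title'    => first_text($xpath, './/h3[' . has_class('gs_rt') . ']', $result),
            'url'      => $link ? $link->getAttribute('href') : 'N/A',
            'authors'  => first_text($xpath, './/div[' . has_class('gs_a') . ']', $result),
            'abstract' => first_text($xpath, './/div[' . has_class('gs_rs') . ']', $result),
        ];
    }
    return $rows;
}

// Hand-written stand-in markup, not a real Google Scholar response
$sample = '<div class="gs_ri"><h3 class="gs_rt"><a href="https://example.com/paper">Sample title</a></h3>'
        . '<div class="gs_a">A. Author - 2020</div><div class="gs_rs">Sample abstract.</div></div>'
        . '<div class="gs_ri"><h3 class="gs_rt">Title without link</h3></div>';

print_r(extract_results($sample));
```

The second sample block has no link, author line, or abstract, so it exercises the "N/A" fallbacks.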

Cleanup

    // Clean up resources
    $html->clear();
    unset($html);
    
    // Close cURL session
    curl_close($ch);
    

Finally we release the memory held by the parsed document and close the cURL handle.

And that wraps up the key parts of our Google Scholar scraper! Let's see the full code.

Full PHP Web Scraping Script

Here is the complete script to scrape Google Scholar search results in PHP:

    <?php
    // Include the PHP Simple HTML DOM Parser library
    include('simple_html_dom.php');
    
    // Define the URL of the Google Scholar search page
    $url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=";
    
    // Define a User-Agent header
    $headers = [
        'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36', // Replace with your User-Agent string
    ];
    
    // Initialize cURL session
    $ch = curl_init();
    
    // Set cURL options
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    
    // Execute cURL session and get the response
    $response = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    
    // Check if the request was successful (status code 200)
    if ($response !== false && $status === 200) {
        // Create a new instance of the Simple HTML DOM Parser
        $html = str_get_html($response);
    
        // Find all the search result blocks with class "gs_ri"
        $search_results = $html->find('div.gs_ri');
    
        // Loop through each search result block and extract information
        foreach ($search_results as $result) {
            // Extract the title and URL (guarding against a title without a link)
            $title_elem = $result->find('h3.gs_rt', 0);
            $title = $title_elem ? $title_elem->plaintext : "N/A";
            $link_elem = $title_elem ? $title_elem->find('a', 0) : null;
            $link = $link_elem ? $link_elem->href : "N/A";
    
            // Extract the authors and publication details
            $authors_elem = $result->find('div.gs_a', 0);
            $authors = $authors_elem ? $authors_elem->plaintext : "N/A";
    
            // Extract the abstract or description
            $abstract_elem = $result->find('div.gs_rs', 0);
            $abstract = $abstract_elem ? $abstract_elem->plaintext : "N/A";
    
            // Print the extracted information
            echo "Title: " . $title . "\n";
            echo "URL: " . $link . "\n";
            echo "Authors: " . $authors . "\n";
            echo "Abstract: " . $abstract . "\n";
            echo str_repeat("-", 50) . "\n"; // Separating search results
        }
    
        // Clean up resources
        $html->clear();
        unset($html);
    } else {
        echo "Failed to retrieve the page. HTTP status: " . $status . ". Error: " . curl_error($ch) . "\n";
    }
    
    // Close cURL session (done on both the success and failure paths)
    curl_close($ch);
    ?>
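
The script prints each result as text. If the results need to be stored or passed to another program, one option is to collect them in an array and serialize it. Below is a minimal sketch using json_encode(). The helper name results_to_json and the sample row are made up for illustration. In the script, the rows would be filled inside the foreach loop.

```php
<?php
// Hypothetical helper: serializes scraped rows as pretty-printed JSON.
function results_to_json(array $rows): string
{
    $json = json_encode($rows, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE);
    if ($json === false) {
        throw new RuntimeException('JSON encoding failed: ' . json_last_error_msg());
    }
    return $json;
}

// Made-up row for illustration; real values would come from the scraping loop
$rows = [
    [
        'title'    => 'Sample title',
        'url'      => 'https://example.com/paper',
        'authors'  => 'A. Author - 2020',
        'abstract' => 'Sample abstract.',
    ],
];

echo results_to_json($rows), "\n";
```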

This is great as a learning exercise, but it is easy to see that the script itself is prone to getting blocked because it sends every request from a single IP. If you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

Our rotating proxy server Proxies API provides a simple API that can solve IP blocking problems:

  • Millions of high speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent string rotation (which simulates requests from different, valid web browsers and web browser versions)
  • Automatic CAPTCHA solving technology

Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed by a simple API like below in any programming language.

In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes. You can just get the data and parse it in any language, such as Node or PHP, or with any framework, such as Scrapy or Nutch. In all these cases you can just call the URL with render support like so:

    curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"
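
From PHP, the same call can be sketched by building the request URL in code. The endpoint and the key, render and url parameters are taken from the curl example above. The helper name build_proxiesapi_url is hypothetical, and percent-encoding the target URL is an assumption based on standard query-string practice, since the example above passes it unencoded.

```php
<?php
// Hypothetical helper: wraps a target URL in the API call shown in the curl example.
function build_proxiesapi_url(string $apiKey, string $targetUrl, bool $render = true): string
{
    $params = ['key' => $apiKey];
    if ($render) {
        $params['render'] = 'true';
    }
    $params['url'] = $targetUrl;   // http_build_query() percent-encodes it (assumed acceptable)

    return 'http://api.proxiesapi.com/?' . http_build_query($params, '', '&');
}

echo build_proxiesapi_url('API_KEY', 'https://example.com'), "\n";
```

The resulting string could then be passed to CURLOPT_URL in place of the direct Google Scholar URL.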
    
    

We have a running offer of 1000 API calls completely free. Register and get your free API Key.
