Scraping New York Times News Headlines with PHP

Dec 6, 2023 · 6 min read

Web scraping refers to programmatically extracting data from websites. We may want to scrape data to analyze it, monitor changes over time, aggregate information across sites, and more.

In this article, we'll walk through PHP code to scrape the titles and links of articles from the New York Times homepage.

Prerequisites

To follow along, you'll want a basic knowledge of:

  • PHP - a popular general-purpose scripting language well-suited for web scraping
  • cURL - a PHP extension (built on libcurl) for transferring data with URLs
  • DOMDocument - a PHP class that represents HTML/XML documents for easy parsing

We'll also use language features such as constants, arrays, loops, and object-oriented syntax.

    Walkthrough

    Let's go through each section of the code:

    Define URLs and Constants

    We start by defining the base New York Times URL and a user agent string constant that identifies us to the site:

    // URLs and constants
    define('URL', 'https://www.nytimes.com/');
    define('USER_AGENT', 'Mozilla/5.0...');
    

    Defining reusable values up top keeps things clean.

    Pro tip: Sending a common browser user agent makes your requests look like ordinary browser traffic, so sites are less likely to flag them as coming from a bot.

    Initialize Arrays to Store Data

    We'll store the scraped headlines and links in PHP arrays, which we initialize empty:

    // Initialize arrays
    $titles = [];
    $links = [];
    

    Unlike some languages, PHP arrays don't need a pre-set capacity. They grow dynamically as we append data.
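
    To illustrate the dynamic growth described above, here is a minimal sketch. The headline strings are made-up sample values, not scraped data.

```php
<?php
// Start with an empty array; no size or capacity is declared.
$titles = [];

// The [] operator appends a value and assigns the next integer key.
$titles[] = 'First headline';
$titles[] = 'Second headline';

echo count($titles), "\n"; // 2
echo $titles[1], "\n";     // Second headline
```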

    cURL Request

    Here we use cURL to request the NYTimes homepage, setting key options:

    // Curl request
    $ch = curl_init();
    curl_setopt_array($ch, [
      CURLOPT_URL => URL,
      CURLOPT_HTTPHEADER => ['User-Agent: ' . USER_AGENT],
      CURLOPT_RETURNTRANSFER => true
    ]);
    

    cURL transfers data to and from a URL. We configure it with the base URL and the user agent from the constants defined earlier, and tell it to return the response data rather than print it.

    Analogy: It's like an old-timey phone handset making a call to the NYTimes website and listening for the response.
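
    As a variation on the options above, the sketch below uses cURL's dedicated CURLOPT_USERAGENT option and adds redirect and timeout settings. The build_curl_options helper, the 15-second timeout, and the decision to follow redirects are assumptions for illustration and are not part of the original script.

```php
<?php
/**
 * Hypothetical helper: builds the option array for curl_setopt_array().
 */
function build_curl_options(string $url, string $userAgent): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_USERAGENT      => $userAgent, // dedicated option for the User-Agent header
        CURLOPT_RETURNTRANSFER => true,       // curl_exec() returns the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,       // follow HTTP redirects (assumed to be desirable here)
        CURLOPT_TIMEOUT        => 15,         // give up after 15 seconds (arbitrary value)
    ];
}

// Configure a handle; no request is sent until curl_exec() is called.
$options = build_curl_options('https://www.nytimes.com/', 'Mozilla/5.0...');
$ch = curl_init();
curl_setopt_array($ch, $options);
```

    Keeping the options in one function makes it easier to reuse the same settings if the script later fetches more than one page.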

    Check Response

    It's good practice to verify we got a proper response before trying to parse it:

    // Send request
    $response = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    
    if ($code !== 200) {
      die("Error: Failed to access " . URL . " - Status {$code}");
    }
    

    Here we execute the request, get the response code, and if it's not 200 OK, stop execution.

    Pro tip: Always handle errors gracefully!
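
    A status check alone does not say why a request failed. curl_exec() returns false on a transport-level failure (DNS, timeout, TLS), and curl_error() describes it. The sketch below separates the two cases. The fetch_error helper is hypothetical, and it takes plain values so it can be tried without a network connection.

```php
<?php
/**
 * Hypothetical helper: returns null when the fetch looks usable,
 * or a human-readable error message otherwise.
 */
function fetch_error($response, int $code, string $curlError): ?string
{
    if ($response === false) {
        // curl_exec() failed before any HTTP response arrived
        return "Transport error: {$curlError}";
    }
    if ($code !== 200) {
        return "Unexpected HTTP status {$code}";
    }
    return null;
}

// With a live handle, the arguments would come from:
//   $response = curl_exec($ch);
//   $code     = curl_getinfo($ch, CURLINFO_HTTP_CODE);
//   $error    = curl_error($ch);
var_dump(fetch_error(false, 0, 'Could not resolve host'));
var_dump(fetch_error('<html></html>', 404, ''));
var_dump(fetch_error('<html></html>', 200, ''));
```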

    Parse HTML

    Now we can parse the HTML response using DOMDocument:

    // Load HTML
    libxml_use_internal_errors(true);
    
    $doc = new DOMDocument();
    $doc->loadHTML($response);
    

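    Calling libxml_use_internal_errors(true) tells libxml to collect parse errors internally instead of emitting a PHP warning for every piece of imperfect HTML. We then load the HTML into a DOMDocument, which gives us easy access to the page's elements.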

    Fun fact: DOM stands for Document Object Model and represents the hierarchical structure of HTML.
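
    The parsing step can be tried in isolation on a small inline document. In this sketch the HTML string is a made-up sample with a deliberately unclosed tag. How many problems libxml records depends on the installed libxml version, so the count is printed rather than assumed.

```php
<?php
// Deliberately imperfect sample HTML: the <p> element is never closed.
$html = '<html><body><div><h3>Hello</h3><p>Unclosed paragraph</div></body></html>';

libxml_use_internal_errors(true);   // collect parse problems instead of emitting warnings

$doc = new DOMDocument();
$doc->loadHTML($html);

$problems = libxml_get_errors();    // array of LibXMLError objects (may be empty)
libxml_clear_errors();              // free the buffered errors

$heading = $doc->getElementsByTagName('h3')->item(0)->textContent;
echo $heading, "\n";                // Hello
echo count($problems), " parse problem(s) recorded\n";
```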

    Inspecting the page

    Next, we inspect the page with Chrome's developer tools (right-click an article and choose Inspect) to see how the markup is structured.

    You can see that the articles are contained inside section tags with the class story-wrapper.

    Iterate Sections

    We can now iterate document sections and extract data:

    // Iterate article sections
    foreach ($doc->getElementsByTagName('section') as $section) {
    
      // Check for story-wrapper
      if ($section->getAttribute('class') === 'css-147kb3k story-wrapper') {
    
        // Get title and link
        $title = $section->getElementsByTagName('h3')->item(0);
        $link = $section->getElementsByTagName('a')->item(0);
    
        // Append extracted data
        if ($title && $link) {
          $titles[] = trim($title->textContent);
          $links[] = $link->getAttribute('href');
        }
    
      }
    
    }
    

    Here we loop over the section elements, checking for story-wrappers. We grab the first h3 and a elements in each one to get the title and link, trim whitespace from the title, and append the results to our arrays from earlier.

    Key ideas: target elements by tag name and class, then read their child elements, attributes such as href, and text content.
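
    Comparing the whole class attribute to 'css-147kb3k story-wrapper' stops matching as soon as the generated css- prefix changes. One alternative, sketched here as an assumption rather than the article's method, is to use DOMXPath and match only the story-wrapper token. The sample HTML and its class names are made up for illustration.

```php
<?php
// Small stand-in document; the css-* class names here are invented.
$html = '<html><body>'
    . '<section class="css-abc123 story-wrapper"><h3> First </h3><a href="/a">Read</a></section>'
    . '<section class="story-wrapper css-zzz999"><h3>Second</h3><a href="/b">Read</a></section>'
    . '<section class="promo"><h3>Ad</h3><a href="/c">Read</a></section>'
    . '</body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Match sections whose class list contains the token "story-wrapper",
// regardless of which other classes appear alongside it.
$xpath = new DOMXPath($doc);
$query = '//section[contains(concat(" ", normalize-space(@class), " "), " story-wrapper ")]';

$titles = [];
$links = [];
foreach ($xpath->query($query) as $section) {
    $title = $xpath->query('.//h3', $section)->item(0);
    $link = $xpath->query('.//a', $section)->item(0);
    if ($title && $link) {
        $titles[] = trim($title->textContent);
        $links[] = $link->getAttribute('href');
    }
}

print_r($titles);
print_r($links);
```

    The concat and normalize-space wrapping keeps the match from firing on class names that merely contain the text, such as story-wrapper-large.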

    Output Data

    Finally, we can output or use the data:

    // Output data
    foreach ($titles as $i => $title) {
      echo "Title: {$title}<br>";
      echo "Link: {$links[$i]}<br><br>";
    }
    

    This prints each title and corresponding link. The $i key lets us access the links array in parallel.
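
    Because the output is HTML, scraped text that contains characters such as < or & can break the page. The sketch below pairs each title with its link in one array and escapes both before printing. The sample titles and example.com links are invented, and grouping them into a $stories array is an assumed design choice, not something the original script does.

```php
<?php
// Made-up sample data standing in for the scraped arrays.
$titles = ['Markets & Economy', 'Science <Update>'];
$links  = ['https://example.com/a?x=1&y=2', 'https://example.com/b'];

// array_map with two input arrays walks them in parallel.
$stories = array_map(
    fn (string $title, string $link): array => ['title' => $title, 'link' => $link],
    $titles,
    $links
);

foreach ($stories as $story) {
    // Escape special characters so scraped text cannot inject markup.
    $title = htmlspecialchars($story['title'], ENT_QUOTES, 'UTF-8');
    $link = htmlspecialchars($story['link'], ENT_QUOTES, 'UTF-8');
    echo "Title: {$title}<br>\n";
    echo "Link: {$link}<br><br>\n";
}
```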

    And we've now scraped NYTimes headlines! The full code is listed again below:

    <?php
    
    // URLs and constants 
    define('URL', 'https://www.nytimes.com/');
    define('USER_AGENT', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36');
    
    // Initialize arrays to store data
    $titles = [];
    $links = [];
    
    // Curl request
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL => URL,
        CURLOPT_HTTPHEADER => ['User-Agent: ' . USER_AGENT],
        CURLOPT_RETURNTRANSFER => true
    ]);
    
    // Send request
    $response = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    
    if ($code !== 200) {
        die("Error: Failed to access " . URL . " - Status {$code}");
    }
    
    //echo $response;
    // Load HTML  
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($response);
    
    // Iterate article sections directly
    foreach ($doc->getElementsByTagName('section') as $section) {
        // Check for story-wrapper section  
        //echo $section->getAttribute('class')."<br>";
        if ($section->getAttribute('class') === 'css-147kb3k story-wrapper') {
    
    
            // Get title and link
            $title = $section->getElementsByTagName('h3')->item(0);
            $link = $section->getElementsByTagName('a')->item(0);
    
            // Append extracted data 
            if ($title && $link) {
                $titles[] = trim($title->textContent);
                $links[] = $link->getAttribute('href');  
            }
        }
    }
    
    // Output data
    foreach ($titles as $i => $title) {
        echo "Title: {$title}<br>";
        echo "Link: {$links[$i]}<br><br>"; 
    }
    
    ?>

    Recap and Next Steps

    Key steps we covered:

  • Define URLs and constants
  • Initialize storage arrays
  • Make request with cURL
  • Parse response HTML
  • Extract data by looping tags
  • Output results

    Main takeaways:

  • Web scraping follows a common workflow
  • Leverage libraries like cURL and DOMDocument
  • Target elements, text, and attributes for data extraction
  • Handle errors and verify steps

    To practice, try customizing the script:

  • Scrape different tags/attributes
  • Output data to file types like CSV
  • Expand to additional pages

    In more advanced implementations, you will even need to rotate the User-Agent string so the website can't tell it's the same browser!
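
    For the CSV idea in the practice list above, a minimal sketch using PHP's built-in fputcsv might look like this. The stories_to_csv helper and the sample rows are hypothetical. The sketch writes to an in-memory stream so it can be tried without touching the disk, and passes an empty escape character explicitly, which assumes PHP 7.4 or later.

```php
<?php
/**
 * Hypothetical helper: serialises parallel title/link arrays as CSV text.
 */
function stories_to_csv(array $titles, array $links): string
{
    // In-memory stream; replace with fopen('headlines.csv', 'w') to write a file.
    $handle = fopen('php://memory', 'w+');

    fputcsv($handle, ['title', 'link'], ',', '"', '');
    foreach ($titles as $i => $title) {
        fputcsv($handle, [$title, $links[$i] ?? ''], ',', '"', '');
    }

    rewind($handle);
    $csv = stream_get_contents($handle);
    fclose($handle);

    return $csv;
}

$csv = stories_to_csv(
    ['Headline one', 'Headline, with a comma'],
    ['https://example.com/1', 'https://example.com/2']
);
echo $csv;
```

    fputcsv quotes fields that contain the delimiter, so a comma inside a headline does not split the row.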

    At a slightly more advanced level, you will find that the server can simply block your IP address, regardless of your other tricks. This is a bummer, and it is where most web crawling projects fail.

    Overcoming IP Blocks

    Investing in a private rotating proxy service like Proxies API can, most of the time, make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.

    Plus, with our running offer of 1000 free API calls, you have almost nothing to lose by using our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • With millions of high-speed rotating proxies located all over the world,
  • with our automatic IP rotation,
  • with our automatic User-Agent string rotation (which simulates requests from different, valid web browsers and browser versions),
  • with our automatic CAPTCHA-solving technology,

    hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed through a simple API call, like the one below, from any programming language.

    curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
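
    To call the same endpoint from the PHP script, the request URL can be assembled with http_build_query, as in this sketch. The proxied_url helper is hypothetical, API_KEY is the placeholder from the command above, and the sketch assumes the service accepts a percent-encoded url parameter, as standard query-string decoding would imply.

```php
<?php
/**
 * Hypothetical helper: builds the proxy request URL for a target page.
 */
function proxied_url(string $apiKey, string $targetUrl): string
{
    // http_build_query percent-encodes the target URL so that its own
    // query string cannot be confused with the outer request's parameters.
    return 'http://api.proxiesapi.com/?' . http_build_query([
        'key' => $apiKey,
        'url' => $targetUrl,
    ]);
}

echo proxied_url('API_KEY', 'https://www.nytimes.com/'), "\n";
// http://api.proxiesapi.com/?key=API_KEY&url=https%3A%2F%2Fwww.nytimes.com%2F
```

    The resulting string could then be passed as the CURLOPT_URL value in place of the direct site URL.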

    We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
