Scraping New York Times News Headlines with PHP

Web scraping refers to programmatically extracting data from websites. We may want to scrape data for analysis, monitoring changes over time, aggregating information across sites, and more.

In this article, we'll walk through PHP code to scrape the titles and links of articles from the New York Times homepage.

Prerequisites

To follow along, you'll want a basic knowledge of:

PHP - a popular general-purpose scripting language well-suited for web scraping

cURL - a library for transferring data with URLs, available in PHP through the cURL extension

DOMDocument - a PHP class that represents HTML/XML documents for easy parsing

We'll also use features like constants, arrays, loops, and object-oriented syntax.

Walkthrough

Let's go through each section of the code:

Define URLs and Constants

We start by defining the base New York Times URL and a user agent string constant that identifies us to the site:

// URLs and constants
define('URL', 'https://www.nytimes.com/');
define('USER_AGENT', 'Mozilla/5.0...');

Defining reusable values up top keeps things clean.

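Pro tip: Sending a common browser user agent makes the request look like ordinary browser traffic rather than a script. It helps with sites that reject unknown clients, but it does not guarantee you won't be detected as a bot.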

Initialize Arrays to Store Data

We'll store the scraped headlines and links in PHP arrays, which we initialize empty:

// Initialize arrays
$titles = [];
$links = [];

Unlike some languages, PHP arrays don't need a pre-set capacity. They grow dynamically as we append data.

cURL Request

Here we use cURL to request the NYTimes homepage, setting key options:

// Curl request
$ch = curl_init();
curl_setopt_array($ch, [
  CURLOPT_URL => URL,
  CURLOPT_HTTPHEADER => ['User-Agent: ' . USER_AGENT],
  CURLOPT_RETURNTRANSFER => true
]);

cURL transfers data to and from a URL. We configure it with the target URL, send the user agent from our earlier constant as a request header, and set CURLOPT_RETURNTRANSFER so that curl_exec() returns the response as a string rather than printing it.

Analogy: It's like an old-timey phone handset making a call to the NYTimes website and listening for the response.
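As a variation on the request setup above, the sketch below builds the option array in a small helper. The helper name buildCurlOptions and the timeout values are illustrative assumptions, not part of the original script. CURLOPT_USERAGENT is cURL's dedicated option for the User-Agent header, and the redirect and timeout options are optional hardening.

```php
<?php
// Hypothetical helper (not in the original script): returns the options
// array that is later passed to curl_setopt_array().
function buildCurlOptions(string $url, string $userAgent): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_USERAGENT      => $userAgent, // sets the User-Agent header
        CURLOPT_RETURNTRANSFER => true,       // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,       // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => 10,         // seconds to wait for a connection (assumed value)
        CURLOPT_TIMEOUT        => 30,         // seconds allowed for the whole transfer (assumed value)
    ];
}

$options = buildCurlOptions('https://www.nytimes.com/', 'Mozilla/5.0...');

// Configure the handle; no request is sent until curl_exec() is called.
$ch = curl_init();
curl_setopt_array($ch, $options);
```

Keeping the options in one function makes it easier to reuse the same settings when the script later fetches more than one page.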

Check Response

It's good practice to verify we got a proper response before trying to parse it:

// Send request
$response = curl_exec($ch);
$code = curl_getinfo($ch, CURLINFO_HTTP_CODE);

if ($code !== 200) {
  die("Error: Failed to access " . URL . " - Status {$code}");
}

Here we execute the request, get the response code, and if it's not 200 OK, stop execution.

Pro tip: Always handle errors gracefully!
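The status check can be extended to cover the case where the transfer itself fails, since curl_exec() returns false on failure and the status code is then not meaningful. The following is a minimal sketch; the helper name responseProblem and its message wording are assumptions made for illustration.

```php
<?php
// Hypothetical helper: turns the results of curl_exec(), curl_getinfo(),
// and curl_error() into an error message, or null when the response is usable.
function responseProblem($body, int $status, string $curlError): ?string
{
    if ($body === false) {
        // curl_exec() returns false when the transfer itself failed
        return "Request failed: {$curlError}";
    }
    if ($status !== 200) {
        return "Unexpected HTTP status {$status}";
    }
    return null;
}

// With the variables from the walkthrough, usage would look like:
// $problem = responseProblem($response, $code, curl_error($ch));
// if ($problem !== null) { die($problem); }

// Made-up inputs to show the two failure messages.
echo responseProblem(false, 0, 'Could not resolve host'), PHP_EOL;
echo responseProblem('<html></html>', 404, ''), PHP_EOL;
```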

Parse HTML

Now we can parse the HTML response using DOMDocument:

// Load HTML
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($response);

Calling libxml_use_internal_errors(true) stops libxml from raising a PHP warning for every piece of imperfect HTML it encounters. We then load the HTML into a DOMDocument, which lets us access elements easily.

Fun fact: DOM stands for Document Object Model and represents the hierarchical structure of HTML.
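To see what the parser would otherwise have warned about, the collected errors can be read back and then cleared. The sketch below uses a small stand-in HTML string rather than the real response; whether any errors are reported depends on the markup and on the libxml version installed.

```php
<?php
// Stand-in markup; the real script would pass the cURL response instead.
$html = '<html><body><section><h3>Example headline</h3></section></body></html>';

// Collect parse problems internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadHTML($html);

// Print whatever the parser complained about, then free the error buffer.
foreach (libxml_get_errors() as $error) {
    echo 'libxml: ', trim($error->message), ' (line ', $error->line, ')', PHP_EOL;
}
libxml_clear_errors();

// Restore the previous error-handling setting.
libxml_use_internal_errors($previous);

$heading = $doc->getElementsByTagName('h3')->item(0);
echo $heading->textContent, PHP_EOL;
```

Clearing the buffer matters in long-running scripts, because the errors otherwise accumulate across every document parsed.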

Inspecting the page

Next, we use Inspect Element in Chrome to see how the page's markup is structured.

The articles are contained in section tags with the class story-wrapper.

Iterate Sections

We can now iterate document sections and extract data:

// Iterate article sections
foreach ($doc->getElementsByTagName('section') as $section) {

  // Check for story-wrapper
  if ($section->getAttribute('class') === 'css-147kb3k story-wrapper') {

    // Get title and link
    $title = $section->getElementsByTagName('h3')->item(0);
    $link = $section->getElementsByTagName('a')->item(0);

    // Append extracted data
    if ($title && $link) {
      $titles[] = trim($title->textContent);
      $links[] = $link->getAttribute('href');
    }

  }

}

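Here we loop over the section elements and keep those whose class attribute is exactly 'css-147kb3k story-wrapper'. Within each one, we grab the first h3 and a elements to get the title and link, trim whitespace from the title, and append the results to our arrays from earlier. Because the check is an exact string comparison, it stops matching if the site changes the css-147kb3k part of the class.

Key ideas: target elements by tag name and class, then read their child elements, attributes such as href, and text content.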
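One way to make the class check less brittle is to match story-wrapper as a single class token, regardless of what else the attribute contains. The sketch below does this with DOMXPath. The helper name extractStories, the sample markup, and its css-abc123 class are made up for illustration, and the real page structure may differ.

```php
<?php
// Hypothetical helper: returns [['title' => ..., 'link' => ...], ...]
function extractStories(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match "story-wrapper" as a whole class token, whatever else is in the attribute.
    $query = "//section[contains(concat(' ', normalize-space(@class), ' '), ' story-wrapper ')]";

    $stories = [];
    foreach ($xpath->query($query) as $section) {
        $title = $section->getElementsByTagName('h3')->item(0);
        $link  = $section->getElementsByTagName('a')->item(0);
        if ($title && $link) {
            $stories[] = [
                'title' => trim($title->textContent),
                'link'  => $link->getAttribute('href'),
            ];
        }
    }
    return $stories;
}

// Made-up sample markup standing in for the real page.
$sample = <<<'HTML'
<html><body>
<section class="css-abc123 story-wrapper"><h3><a href="/example-story.html"> Example headline </a></h3></section>
<section class="css-abc123 other-block"><h3><a href="/ignored.html">Not a story</a></h3></section>
</body></html>
HTML;

$stories = extractStories($sample);
echo count($stories), ' story found: ', $stories[0]['title'], PHP_EOL;
```

Padding the class attribute with spaces before the contains() test prevents a partial match against a longer class name such as story-wrapper-large.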

Output Data

Finally, we can output or use the data:

// Output data
foreach ($titles as $i => $title) {
  echo "Title: {$title}<br>";
  echo "Link: {$links[$i]}<br><br>";
}

This prints each title and corresponding link. The $i key lets us access the links array in parallel.
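Since the output is HTML, it may be worth escaping the scraped text before echoing it, so that characters such as < and & in a headline display as text. The following is a sketch under that assumption; the helper name renderStories and the sample values are illustrative.

```php
<?php
// Hypothetical helper: builds the same output as the loop above, but escapes
// titles and links so special characters are shown literally.
function renderStories(array $titles, array $links): string
{
    $out = '';
    foreach ($titles as $i => $title) {
        $safeTitle = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');
        $safeLink  = htmlspecialchars($links[$i] ?? '', ENT_QUOTES, 'UTF-8');
        $out .= "Title: {$safeTitle}<br>";
        $out .= "Link: {$safeLink}<br><br>";
    }
    return $out;
}

// Made-up values for illustration.
echo renderStories(['Stocks & Bonds <Live>'], ['/example.html?a=1&b=2']);
```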

And we've now scraped NYTimes headlines! The full code is listed again below:

<?php

// URLs and constants 
define('URL', 'https://www.nytimes.com/');
define('USER_AGENT', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36');

// Initialize arrays to store data
$titles = [];
$links = [];

// Curl request
$ch = curl_init();
curl_setopt_array($ch, [
    CURLOPT_URL => URL,
    CURLOPT_HTTPHEADER => ['User-Agent: ' . USER_AGENT],
    CURLOPT_RETURNTRANSFER => true
]);

// Send request
$response = curl_exec($ch);
$code = curl_getinfo($ch, CURLINFO_HTTP_CODE);

if ($code !== 200) {
    die("Error: Failed to access " . URL . " - Status {$code}");
}

//echo $response;
// Load HTML  
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($response);

// Iterate article sections directly
foreach ($doc->getElementsByTagName('section') as $section) {
    // Check for story-wrapper section  
    //echo $section->getAttribute('class')."<br>";
    if ($section->getAttribute('class') === 'css-147kb3k story-wrapper') {


        // Get title and link
        $title = $section->getElementsByTagName('h3')->item(0);
        $link = $section->getElementsByTagName('a')->item(0);

        // Append extracted data 
        if ($title && $link) {
            $titles[] = trim($title->textContent);
            $links[] = $link->getAttribute('href');  
        }
    }
}

// Output data
foreach ($titles as $i => $title) {
    echo "Title: {$title}<br>";
    echo "Link: {$links[$i]}<br><br>"; 
}

?>

Recap and Next Steps

Key steps we covered:

Define URLs and constants

Initialize storage arrays

Make request with cURL

Parse response HTML

Extract data by looping tags

Output results

Main takeaways:

Web scraping follows a common workflow

Leverage libraries like cURL and DOMDocument

Target elements, text, and attributes for data extraction

Handle errors and verify steps

To practice, try customizing the script:

Scrape different tags/attributes

Output data to file types like CSV

Expand to additional pages
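For the CSV exercise, one possible starting point is PHP's built-in fputcsv(). The sketch below writes to an in-memory stream so it can be tried without touching the filesystem. The helper name storiesToCsv, the column names, and the headlines.csv file name are assumptions.

```php
<?php
// Hypothetical helper: serializes parallel title/link arrays to CSV text.
function storiesToCsv(array $titles, array $links): string
{
    $stream = fopen('php://temp', 'r+');

    // Header row, then one row per headline.
    fputcsv($stream, ['title', 'link'], ',', '"', '\\');
    foreach ($titles as $i => $title) {
        fputcsv($stream, [$title, $links[$i] ?? ''], ',', '"', '\\');
    }

    // Read the whole stream back as a string.
    rewind($stream);
    $csv = stream_get_contents($stream);
    fclose($stream);
    return $csv;
}

// Made-up row for illustration.
$csv = storiesToCsv(['Hello, world'], ['/a.html']);
echo $csv;

// To write a file instead: file_put_contents('headlines.csv', $csv);
```

fputcsv() takes care of quoting fields that contain commas or quotes, which is why it is preferable to joining values by hand.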

In more advanced implementations, you may even need to rotate the User-Agent string so the website can't tell that the requests come from the same browser.
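A minimal sketch of User-Agent rotation is to pick one string at random per request. The helper name pickUserAgent is hypothetical, and the second entry in the list is a made-up placeholder rather than a real browser string.

```php
<?php
// Hypothetical helper: picks one user agent at random for each request.
function pickUserAgent(array $agents): string
{
    return $agents[array_rand($agents)];
}

// Placeholder list: the first entry is the string used in this article,
// the second is a made-up value; substitute real browser strings you maintain.
$agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'ExampleBrowser/1.0 (placeholder)',
];

$userAgent = pickUserAgent($agents);

// Then pass it to cURL, e.g. CURLOPT_HTTPHEADER => ['User-Agent: ' . $userAgent]
echo $userAgent, PHP_EOL;
```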

As your scraping gets more advanced, you will find that the server can simply block your IP address, regardless of your other tricks. This is where many web crawling projects fail.

Overcoming IP Blocks

Investing in a private rotating proxy service like Proxies API can often make the difference between a headache-free web scraping project that gets the job done consistently and one that never really works.

Plus, with our running offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It takes only one line of integration, so it's hardly disruptive.

Our rotating proxy server, Proxies API, provides a simple API designed to solve IP blocking problems. It offers:

Millions of high-speed rotating proxies located all over the world

Automatic IP rotation

Automatic User-Agent string rotation (which simulates requests from different, valid web browsers and browser versions)

Automatic CAPTCHA solving technology

Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed through a simple API call, like the one below, from any programming language.

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
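In PHP, the same call could be built as shown below. This is a sketch only: the helper name buildProxyUrl is hypothetical, and it assumes the service accepts a URL-encoded url parameter, which the command above does not confirm.

```php
<?php
// Hypothetical helper: builds the request URL for the API call shown above.
// Assumption: the service accepts a URL-encoded "url" parameter.
function buildProxyUrl(string $apiKey, string $targetUrl): string
{
    $query = http_build_query(
        ['key' => $apiKey, 'url' => $targetUrl],
        '',
        '&'
    );
    return 'http://api.proxiesapi.com/?' . $query;
}

// API_KEY is a placeholder, as in the command above.
$requestUrl = buildProxyUrl('API_KEY', 'https://www.nytimes.com/');
echo $requestUrl, PHP_EOL;

// In the scraper, this URL would replace the direct one:
// CURLOPT_URL => $requestUrl
```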

We have a running offer of 1000 free API calls. Register to get your free API key.
