Scraping Hacker News with PHP

Jan 21, 2024 · 8 min read

Let's take a practical look at how to scrape articles from the Hacker News homepage using PHP and the Simple HTML DOM library. We'll walk through the code step-by-step to understand how it works under the hood.

Overview

The goal here is to extract key data from Hacker News article listings, like the title, URL, points, author, etc. To achieve this, we:

  1. Import SimpleHTMLDom for parsing
  2. Define the Hacker News homepage URL
  3. Send a GET request and check if it succeeded
  4. Find all table rows on the page
  5. Iterate through rows, identifying article rows
  6. For each article row, use selectors to extract data into variables
  7. Print the scraped data

Now let's dive into the details...

Importing the HTML Parser

We'll use SimpleHTMLDom to parse and traverse the HN homepage:

require_once('simple_html_dom.php');

This imports the library so we can instantiate HTML DOM objects.

Defining the URL

We'll scrape the default Hacker News homepage at:

$url = "https://news.ycombinator.com/";

You can point this URL at any other page you want to scrape, although the selectors used later are specific to the Hacker News layout.

Sending the GET Request

To download the page content, we send a GET request and store the parsed result in $response:

$response = file_get_html($url);

The file_get_html() function handles connecting to the URL, retrieving the HTML, and parsing it into a DOM object. It returns false if the page could not be loaded.

Checking for Success

We verify that the request succeeded before trying to parse the page content:

if ($response) {
  // scraping code here
} else {
  echo "Failed to retrieve the page.";
}

This is good practice to avoid errors when network issues or other problems occur.

Finding All Table Rows

Hacker News lays out its listings as a table. Inspecting the page, you'll notice that each item is housed inside a <tr> tag with the class athing, so we first find all <tr> elements:

$rows = $response->find("tr");

This locates table rows so we can loop through them to isolate article rows.
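
To see what find("tr") returns without hitting the network, here is a minimal sketch that parses a small hand-written fragment with str_get_html(). The markup is a simplified, made-up stand-in for the Hacker News table, not a copy of the live page.

```php
<?php
require_once('simple_html_dom.php');

// Made-up sample markup: one article row, one details row, one spacer row.
$sample = '<table>'
    . '<tr class="athing"><td><span class="titleline"><a href="https://example.com/">Example story</a></span></td></tr>'
    . '<tr><td class="subtext"><span class="score">128 points</span></td></tr>'
    . '<tr class="spacer" style="height:5px"></tr>'
    . '</table>';

$dom = str_get_html($sample);

// find() with only a selector returns an array of every matching element.
$rows = $dom->find("tr");

echo count($rows) . " rows found\n";
foreach ($rows as $row) {
    // $row->class holds the class attribute, or a falsy value when the row has none.
    echo ($row->class ? $row->class : "(no class)") . "\n";
}
```

Because find("tr") matches every row, the article rows still have to be told apart from the details and spacer rows, which is what the loop in the following sections does.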

Tracking State

As we iterate over rows, we need to track some state:

$current_article = null;

$current_row_type = null;

  • $current_article will store the current article row
  • $current_row_type indicates if we're on an article or details row

This state is used in the scraping logic inside the loop.

Looping Over Rows

We loop through rows to identify article rows and accompanying details:

foreach ($rows as $row) {
  // scraping logic here
}

The key parts happen inside this loop...

Identifying Article Rows

We first check if a row is an article using its class:

if ($row->class == "athing") {
  $current_article = $row;
  $current_row_type = "article";
}

Article rows have the CSS class athing. When we find one, we store the row's DOM object in $current_article and set $current_row_type to "article".
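
The == comparison above only matches when athing is the row's entire class attribute. As a hedged sketch, assuming a row might carry additional classes alongside athing (an assumption, not something stated in this article), a token-based check is more tolerant. The has_class() helper below is hypothetical and not part of SimpleHTMLDom.

```php
<?php
// Hypothetical helper (not part of SimpleHTMLDom): returns true when the
// space-separated class attribute contains $name as a whole token.
function has_class($class_attr, string $name): bool
{
    // A missing attribute may come back as a non-string value; treat it as no classes.
    if (!is_string($class_attr) || trim($class_attr) === '') {
        return false;
    }
    $tokens = preg_split('/\s+/', trim($class_attr));
    return in_array($name, $tokens, true);
}

var_dump(has_class("athing", "athing"));            // bool(true)
var_dump(has_class("athing submission", "athing")); // bool(true)
var_dump(has_class("nothing", "athing"));           // bool(false)
```

In the loop, the condition would then read if (has_class($row->class, "athing")) instead of the exact string comparison.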

Identifying Details Rows

If the previous row was an article, we know the next row contains article details:

} elseif ($current_row_type == "article") {
  // extract details here
  $current_article = null;
  $current_row_type = null;
}

We reset the state afterward so subsequent rows are processed correctly.
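
The pairing logic can be tried in isolation. The sketch below replaces the DOM rows with plain arrays (made-up stand-ins that carry only a class and some text) so the state handling is visible without the parser.

```php
<?php
// Made-up stand-in rows: each array mimics only the class attribute of a <tr>.
$rows = [
    ['class' => 'athing', 'text' => 'Article A'],
    ['class' => '',       'text' => 'details for A'],
    ['class' => 'spacer', 'text' => ''],
    ['class' => 'athing', 'text' => 'Article B'],
    ['class' => '',       'text' => 'details for B'],
];

$current_article = null;
$current_row_type = null;
$pairs = [];

foreach ($rows as $row) {
    if ($row['class'] == 'athing') {
        // Remember the article row until its details row arrives.
        $current_article = $row;
        $current_row_type = 'article';
    } elseif ($current_row_type == 'article') {
        // The row right after an article row is its details row.
        $pairs[] = [$current_article['text'], $row['text']];
        // Reset so the spacer row that follows is ignored.
        $current_article = null;
        $current_row_type = null;
    }
}

print_r($pairs);
```

The spacer row falls through both branches because the state was reset after the details row, which is why only two pairs are collected.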

Extracting Article Data

Once we reach the details row, we use selectors to extract information, starting with the title and link from the stored article row:

$title_elem = $current_article->find("span.titleline", 0);
$article_title = $title_elem->find("a", 0)->plaintext;
$article_url = $title_elem->find("a", 0)->href;

Breaking this down:

  • span.titleline locates the title container
  • We grab the first <a> inside it
  • plaintext gets the text content (= title)
  • href gets the link URL

Every data field is extracted by:

  1. Using a selector to pinpoint the element
  2. Reading SimpleHTMLDom properties like plaintext and href on it

Preserving the exact selectors used is essential for the code to function.
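
Here is a minimal sketch of those two steps, run against a made-up article row rather than the live page. It also adds a null check, since find() with an index returns null when nothing matches.

```php
<?php
require_once('simple_html_dom.php');

// Made-up article row, simplified from the structure described above.
$row_html = '<table><tr class="athing"><td class="title">'
    . '<span class="titleline"><a href="https://example.com/post">Example story</a></span>'
    . '</td></tr></table>';

$article = str_get_html($row_html)->find("tr.athing", 0);

// Step 1: a selector pinpoints the container; index 0 asks for the first match.
$title_elem = $article->find("span.titleline", 0);

// find() returns null when nothing matches, so check before reading properties.
$link = $title_elem ? $title_elem->find("a", 0) : null;

if ($link) {
    // Step 2: read the properties we need from the matched element.
    echo "Title: " . $link->plaintext . "\n";
    echo "URL: " . $link->href . "\n";
} else {
    echo "No title link found.\n";
}
```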

The key things that might confuse beginners are...

Why Such Specific Selectors?

You may wonder why selectors like span.titleline target elements so specifically when simpler ones like a bare span or a could work.

The reasons are:

  • Uniqueness: Generic selectors often match too many elements. Overly broad matching leads to the wrong data.
  • Precise targeting: Tailored selectors grab the exact elements we want.
  • Brittleness avoidance: Selectors tied to a meaningful class name tend to survive layout changes better than ones that depend on an element's position, though they still break if the class is renamed.

So detailed selectors make the scraper more robust and accurate.

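Selector Order and Indexes

Selectors often stack together, with one find() call running on the result of another, as when we look up td.subtext inside $row and then an element such as span.score inside that.

The order and indexes matter. Here's what's happening:

  • Find the first (0) td.subtext element within $row
  • Then inside that, find the first (0) matching element, such as span.score

This lets us drill down gradually in a nested structure to zone in on the target element.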
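
As an illustration of stacking, the sketch below chains two find() calls on a made-up details row. The markup and the values in it are assumptions for the example; only the selectors come from the full script.

```php
<?php
require_once('simple_html_dom.php');

// Made-up details row with a score and an author.
$html = '<table><tr><td class="subtext">'
    . '<span class="score">128 points</span> by <a class="hnuser" href="user?id=alice">alice</a>'
    . '</td></tr></table>';

$row = str_get_html($html)->find("tr", 0);

// Outer find(): first td.subtext inside $row.
// Inner find(): first span.score inside that cell.
$points = $row->find("td.subtext", 0)->find("span.score", 0)->plaintext;
echo $points . "\n";

// The same lookup in two steps, which leaves room for a null check between levels.
$subtext = $row->find("td.subtext", 0);
$score = $subtext ? $subtext->find("span.score", 0) : null;
echo ($score ? trim($score->plaintext) : "no score") . "\n";
```

The one-line chain is compact but fails with an error if either level finds nothing, so the two-step form is the safer choice on real pages.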

Printing Extracted Data

Finally, back in the loop after extraction, we print each field with echo so we can see the successfully scraped data.

Separating Articles

To visually separate articles, we print a divider line of 50 dashes after each one.

The scraper will continue looping through and extracting all articles on the page.
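
The printing and divider steps can be sketched on their own. The values below are placeholders standing in for the fields extracted in the loop; the labels and the 50-dash divider mirror the echo statements in the full script.

```php
<?php
// Placeholder values standing in for the fields extracted in the loop.
$article = [
    "Title"     => "Example story",
    "URL"       => "https://example.com/post",
    "Points"    => "128 points",
    "Author"    => "alice",
    "Timestamp" => "2024-01-21T10:00:00",
    "Comments"  => "42 comments",
];

// Build the block as a string first so it is easy to inspect or log.
$output = "";
foreach ($article as $label => $value) {
    $output .= $label . ": " . $value . "\n";
}
$output .= str_repeat("-", 50) . "\n";  // 50-dash divider between articles

echo $output;
```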

And that's the overview of how this Hacker News scraping script works! Let's look at the full code now...

Full Code

Here is the complete code to scrape the Hacker News homepage:

    <?php
    // Include the SimpleHTMLDom library
    require_once('simple_html_dom.php');
    
    // Define the URL of the Hacker News homepage
    $url = "https://news.ycombinator.com/";
    
    // Send a GET request to the URL
    $response = file_get_html($url);
    
    // Check if the request was successful
    if ($response) {
        // Find all rows in the table
        $rows = $response->find("tr");
    
        // Initialize variables to keep track of the current article and row type
        $current_article = null;
        $current_row_type = null;
    
        // Iterate through the rows to scrape articles
        foreach ($rows as $row) {
            if ($row->class == "athing") {
                // This is an article row
                $current_article = $row;
                $current_row_type = "article";
            } elseif ($current_row_type == "article") {
                // This is the details row
                if ($current_article) {
                    $title_elem = $current_article->find("span.titleline", 0);
                    if ($title_elem) {
                        $article_title = $title_elem->find("a", 0)->plaintext;
                        $article_url = $title_elem->find("a", 0)->href;
    
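                        $subtext = $row->find("td.subtext", 0);

                        // Some rows (for example job postings) may lack a score or author,
                        // so guard each lookup before reading properties from it
                        $score_elem = $subtext ? $subtext->find("span.score", 0) : null;
                        $user_elem = $subtext ? $subtext->find("a.hnuser", 0) : null;
                        $age_elem = $subtext ? $subtext->find("span.age", 0) : null;

                        $points = $score_elem ? trim($score_elem->plaintext) : "0";
                        $author = $user_elem ? trim($user_elem->plaintext) : "";
                        $timestamp = $age_elem ? $age_elem->title : "";
                        $comments = "0";

                        // find() without an index returns an array of all matching links
                        $links = $subtext ? $subtext->find("a") : array();
                        foreach ($links as $element) {
                            if (strpos($element->plaintext, 'comment') !== false) {
                                $comments = trim($element->plaintext);
                                break;
                            }
                        }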
    
                        // Print the extracted information
                        echo "Title: " . $article_title . "\n";
                        echo "URL: " . $article_url . "\n";
                        echo "Points: " . $points . "\n";
                        echo "Author: " . $author . "\n";
                        echo "Timestamp: " . $timestamp . "\n";
                        echo "Comments: " . $comments . "\n";
                        echo str_repeat("-", 50) . "\n";  // Separating articles
                    }
                }
    
                // Reset the current article and row type
                $current_article = null;
                $current_row_type = null;
            } elseif ($row->style == "height:5px") {
                // This is the spacer row, skip it
                continue;
            }
        }
    } else {
        echo "Failed to retrieve the page.";
    }

This is great as a learning exercise, but a scraper like this is prone to getting blocked because it sends every request from a single IP. If you need to handle thousands of fetches every day, a professional rotating proxy service that rotates IPs is almost a must.

Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

Our rotating proxy server, Proxies API, provides a simple API that can solve IP blocking problems instantly, with:

  • Millions of high-speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent string rotation (which simulates requests from different, valid web browsers and web browser versions)
  • Automatic CAPTCHA solving technology

Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed through a simple API from any programming language.

In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes. You can just get the data and parse it in any language like Node or PHP, or with any framework like Scrapy or Nutch. In all these cases, you just call the URL with render support enabled.

We have a running offer of 1000 API calls completely free. Register and get your free API Key.
