Scraping Hacker News Articles with Perl

In this beginner-friendly guide, we'll walk through a Perl script that scrapes articles from the popular Hacker News site.


Prerequisites

To follow along, you'll need:

Perl installed on your system

The following modules installed: LWP::Simple and HTML::TreeBuilder::XPath

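You can install these from CPAN using the cpan command like so:

cpan LWP::Simple HTML::TreeBuilder::XPath

Fetching an https:// URL with LWP also requires the LWP::Protocol::https module. If it is not already present on your system, install it the same way.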

Step-by-step walkthrough

Importing modules

We start by importing the Perl modules we need:

use strict;
use warnings;

use LWP::Simple;
use HTML::TreeBuilder::XPath;

strict and warnings - Catch common mistakes (such as undeclared variables) and emit useful warnings

LWP::Simple - Sends HTTP requests

HTML::TreeBuilder::XPath - Parses HTML and allows DOM traversal/extraction

Defining the target URL

Next, we store the Hacker News homepage URL in a variable:

my $url = "https://news.ycombinator.com/";

This is the page we will scrape.

Sending HTTP request

We use LWP::Simple to send a GET request and store the result:

my $content = get($url);

This downloads the raw HTML content of the Hacker News homepage.
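LWP::Simple's get returns undef when the request fails, without saying why. If you want to see the HTTP status, one option is LWP::UserAgent, which ships in the same libwww-perl distribution. The sketch below is only an illustration: the helper name fetch_page, the timeout, and the agent string are assumptions and are not part of the original script.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: returns the page body, or undef after warning with the status line.
sub fetch_page {
    my ($url) = @_;

    # Timeout and agent string are illustrative values, not requirements.
    my $ua = LWP::UserAgent->new(timeout => 10, agent => 'hn-tutorial/0.1');

    my $response = $ua->get($url);
    return $response->decoded_content if $response->is_success;

    warn "Request for $url failed: ", $response->status_line, "\n";
    return undef;
}
```

You could then write my $content = fetch_page($url); in place of the get call and keep the rest of the script unchanged.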

Parsing the HTML

Next, we parse the HTML content using HTML::TreeBuilder::XPath:

my $tree = HTML::TreeBuilder::XPath->new_from_content($content);

This creates a "tree" representation of elements, on which we can perform DOM operations.
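To get a feel for the tree before pointing it at the live site, you can try it on a small hand-written snippet. The HTML below is made up for illustration and is much simpler than the real Hacker News markup. It shows the two calls the rest of this guide relies on: findvalue, which returns text, and findnodes, which returns element objects.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# A tiny hand-written document standing in for a downloaded page.
my $html = '<html><body><table>'
         . '<tr class="athing"><td><a href="https://example.com/">Example story</a></td></tr>'
         . '<tr><td class="subtext">10 points by alice</td></tr>'
         . '</table></body></html>';

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# findvalue returns the text of the matched nodes as a plain string.
my $link_text = $tree->findvalue('//tr[@class="athing"]//a');

# findnodes returns element objects, which expose attr() and as_text().
my ($link) = $tree->findnodes('//tr[@class="athing"]//a');
my $href = $link ? $link->attr('href') : undef;

print "Text: $link_text\n";
print "Href: ", (defined $href ? $href : '(none)'), "\n";

# Release the tree when finished with it.
$tree->delete;
```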

Finding all table rows

Inspecting the page

If you inspect the page source, you will notice that each item is housed inside a tr tag with the class athing.

Hacker News consists of a table with rows for each article and its details. We extract all rows with:

my @rows = $tree->findnodes('//tr');

This finds every table row on the page.
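If you would rather select only the article rows up front, XPath can do the filtering. The sketch below matches athing as one token of the class attribute, so it keeps working if a row carries additional classes. The sample HTML, including the extra class name, is made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hand-written sample rows; the "extra" class is a hypothetical variation.
my $html = '<table>'
         . '<tr class="athing"><td>first story</td></tr>'
         . '<tr><td class="subtext">details</td></tr>'
         . '<tr class="athing extra"><td>second story</td></tr>'
         . '<tr class="spacer"><td></td></tr>'
         . '</table>';

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Match rows whose class list contains the token "athing",
# even when other classes are present alongside it.
my @article_rows = $tree->findnodes(
    '//tr[contains(concat(" ", normalize-space(@class), " "), " athing ")]'
);

print scalar(@article_rows), " article rows found\n";
print $_->as_text, "\n" for @article_rows;
```

The walkthrough below sticks with //tr because it also needs the detail row that follows each article row.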

Processing the rows

Next, we loop through the rows to identify article rows vs detail rows:

foreach my $row (@rows) {

  # Identify article vs detail rows
  # Extract data from article and detail rows

}

There are a few key steps:

Identify article rows with class athing. Store in $current_article.
Identify detail rows based on proximity to article rows.
Extract fields using XPath and regular expressions: title and URL from the article row; points, author, timestamp and comments from the detail row.
Print out the extracted article data.
Reset tracked variables for next article.

Let's go through each section.

Identifying article rows

An article row has CSS class athing. We check for this:

my $class = $row->attr('class');

if ($class && $class eq "athing") {

  # This is an article row

  $current_article = $row;
  $current_row_type = "article";

}

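We store the row in $current_article so that it is still available when we reach the detail row that follows. Both $current_article and $current_row_type are declared before the loop so that they persist between iterations.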

Identifying detail rows

The detail row comes immediately after the article row. We check for proximity:

elsif ($current_row_type && $current_row_type eq "article") {

  # This is the details row

  # Process this row

}

Now we can extract data from $current_article and the detail row.
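The pairing logic is easier to see without any HTML involved. The sketch below runs the same idea over plain hashes standing in for rows. The has_class helper and the sample data are assumptions made for illustration; the helper compares class tokens rather than the whole attribute, which is slightly more tolerant than the eq test used above.

```perl
use strict;
use warnings;

# Return 1 when $wanted is one of the space-separated class tokens, else 0.
sub has_class {
    my ($class_attr, $wanted) = @_;
    return 0 unless defined $class_attr;
    my $count = grep { $_ eq $wanted } split(' ', $class_attr);
    return $count > 0 ? 1 : 0;
}

# Stand-in rows: only the class attribute and a label matter here.
my @rows = (
    { class => 'athing', label => 'story 1' },
    { class => undef,    label => 'details 1' },
    { class => 'spacer', label => 'spacer' },
    { class => 'athing', label => 'story 2' },
    { class => undef,    label => 'details 2' },
);

my @pairs;
my $current_article;
for my $row (@rows) {
    if (has_class($row->{class}, 'athing')) {
        # Remember the article row until its detail row arrives.
        $current_article = $row;
    }
    elsif ($current_article) {
        # The row right after an article row is its detail row.
        push @pairs, [ $current_article->{label}, $row->{label} ];
        $current_article = undef;
    }
}

print "$_->[0] -> $_->[1]\n" for @pairs;
```

This sketch tracks state with $current_article alone. The script in this guide also keeps $current_row_type, which does the same job more explicitly.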

Extracting article data

When we reach the detail row, we use XPath and regular expressions to extract fields:

# findvalue returns the matched text as a plain string, so no as_text call is needed
my $article_title = $current_article->findvalue('.//span[@class="title"]');

# findnodes returns element objects; take the first match, if any
my ($article_url_elem) = $current_article->findnodes('.//a[@class="storylink"]');
my $article_url = $article_url_elem ? $article_url_elem->attr('href') : undef;

my $subtext = $row->findvalue('.//td[@class="subtext"]');
my ($points, $author, $timestamp, $comments);

if ($subtext) {
  ($points)     = $subtext =~ /(\d+)\s+points/;
  ($author)     = $subtext =~ /by\s+(\S+)/;
  ($timestamp)  = $subtext =~ /(\d+\s+\S+\s+ago)/;
  ($comments)   = $subtext =~ /(\d+\s+comments?)/;
}

Let's break this down:

$article_title - Text of the title span in the article row

$article_url_elem - The URL link element in the article row; $article_url is its href attribute

$subtext - Text of the subtext cell in the detail row

The regular expressions extract data like points, author etc. from the subtext

Hacker News has changed its markup over time, so check the class names used here (title, storylink, subtext) against the live page source before relying on them.
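The regular expressions can be tried on their own, without fetching anything. The sketch below wraps them in a hypothetical parse_subtext helper and runs it on a made-up subtext string. It makes two small changes to the patterns above, both assumptions on my part: points? also accepts the singular "1 point", and the comments pattern captures only the number.

```perl
use strict;
use warnings;

# Hypothetical helper: pull fields out of a subtext string.
# A field that does not match is left undefined.
sub parse_subtext {
    my ($subtext) = @_;
    my %fields;
    ($fields{points})    = $subtext =~ /(\d+)\s+points?/;
    ($fields{author})    = $subtext =~ /by\s+(\S+)/;
    ($fields{timestamp}) = $subtext =~ /(\d+\s+\S+\s+ago)/;
    ($fields{comments})  = $subtext =~ /(\d+)\s+comments?/;
    return \%fields;
}

# Made-up sample text in the general shape of a subtext line.
my $sample = '123 points by alice 2 hours ago | hide | 45 comments';
my $fields = parse_subtext($sample);

for my $name (qw(points author timestamp comments)) {
    my $value = defined $fields->{$name} ? $fields->{$name} : '(not found)';
    print "$name: $value\n";
}
```

Because each pattern is applied independently, a missing piece (for example, a row with no comment count yet) only leaves that one field undefined.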

Finally, we print the extracted data:

print("Title: $article_title\n");
print("URL: $article_url\n");
# etc...

And so on for each article!
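Some rows may be missing a field, and interpolating an undefined variable triggers a "Use of uninitialized value" warning under use warnings. One way to handle that is to build the output through a small formatting routine that substitutes a placeholder. The format_article name and the N/A placeholder below are illustrative choices, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: format one article, using "N/A" for missing fields.
sub format_article {
    my (%article) = @_;
    my @lines;
    for my $field (qw(Title URL Points Author Timestamp Comments)) {
        # Keys are the lowercase form of the label, e.g. "title", "url".
        my $value = $article{ lc($field) };
        push @lines, sprintf('%s: %s', $field, defined $value ? $value : 'N/A');
    }
    return join("\n", @lines) . "\n";
}

# Example with made-up values and two fields left out.
print format_article(
    title     => 'Example story',
    url       => 'https://example.com/',
    author    => 'alice',
    timestamp => '2 hours ago',
);
print "-" x 50, "\n";
```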

Full code

Here is the complete code once more for reference:

use strict;
use warnings;
use LWP::Simple;
use HTML::TreeBuilder::XPath;

# Define the URL of the Hacker News homepage
my $url = "https://news.ycombinator.com/";

# Send a GET request to the URL
my $content = get($url);

# Check if the request was successful
if ($content) {
    my $tree = HTML::TreeBuilder::XPath->new_from_content($content);

    # Find all rows in the table
    my @rows = $tree->findnodes('//tr');

    # Initialize variables to keep track of the current article and row type
    my ($current_article, $current_row_type);

    # Iterate through the rows to scrape articles
    foreach my $row (@rows) {
        my $class = $row->attr('class');
        if ($class && $class eq "athing") {
            # This is an article row
            $current_article = $row;
            $current_row_type = "article";
        } elsif ($current_row_type && $current_row_type eq "article") {
            # This is the details row
            if ($current_article) {
                # findvalue returns the matched text as a plain string
                my $article_title = $current_article->findvalue('.//span[@class="title"]');

                # Take the first matching link, if any, and read its href
                my ($article_url_elem) = $current_article->findnodes('.//a[@class="storylink"]');
                my $article_url = $article_url_elem ? $article_url_elem->attr('href') : undef;

                my $subtext = $row->findvalue('.//td[@class="subtext"]');
                my ($points, $author, $timestamp, $comments);

                if ($subtext) {
                    ($points) = $subtext =~ /(\d+)\s+points/;
                    ($author) = $subtext =~ /by\s+(\S+)/;
                    ($timestamp) = $subtext =~ /(\d+\s+\S+\s+ago)/;
                    ($comments) = $subtext =~ /(\d+\s+comments?)/;
                }

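                # Substitute "N/A" for any field that could not be extracted
                $_ //= "N/A" for ($article_title, $article_url, $points, $author, $timestamp, $comments);

                # Print the extracted information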
                print("Title: $article_title\n");
                print("URL: $article_url\n");
                print("Points: $points\n");
                print("Author: $author\n");
                print("Timestamp: $timestamp\n");
                print("Comments: $comments\n");
                print("-" x 50 . "\n");  # Separating articles
            }

            # Reset the current article and row type
            $current_article = undef;
            $current_row_type = undef;
        } elsif ($row->attr('style') && $row->attr('style') eq "height:5px") {
            # This is the spacer row, skip it
            next;
        }
    }
} else {
    print("Failed to retrieve the page.\n");
}

This is great as a learning exercise, but a scraper like this runs from a single IP address and is therefore prone to getting blocked. If you need to handle thousands of fetches every day, a professional rotating proxy service that rotates IPs is almost a must.

Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

With millions of high-speed rotating proxies located all over the world, automatic IP rotation, automatic User-Agent string rotation (which simulates requests from different, valid web browsers and browser versions), and automatic CAPTCHA solving technology, hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed by a simple API like below in any programming language.

In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes. You can just get the data and parse it in any language such as Node or PHP, or with any framework such as Scrapy or Nutch. In all these cases you can just call the URL with render support like so:

curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"

We have a running offer of 1000 API calls completely free. Register and get your free API Key.
