Scraping Hacker News Articles with Perl

Jan 21, 2024 · 7 min read

In this beginner-friendly guide, we'll walk through a Perl script that scrapes articles from the popular Hacker News site.

This is the page we are talking about: the Hacker News homepage at https://news.ycombinator.com/.

Prerequisites

To follow along, you'll need:

  • Perl installed on your system
  • The following modules installed: LWP::Simple and HTML::TreeBuilder::XPath

You can install these from CPAN using the cpan command like so:

    cpan LWP::Simple HTML::TreeBuilder::XPath
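
To confirm that both modules are visible to your Perl before going further, a quick check along these lines may help. It is only a sketch: module_available is a helper name made up for this example and is not part of either module.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the named module can be loaded, 0 otherwise.
sub module_available {
    my ($module) = @_;
    return eval("require $module; 1") ? 1 : 0;
}

# Report on the two modules this guide depends on
for my $module (qw(LWP::Simple HTML::TreeBuilder::XPath)) {
    printf "%-28s %s\n", $module,
        module_available($module) ? 'installed' : 'MISSING';
}
```

If either line reports MISSING, rerun the cpan command above before continuing.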
    

    Step-by-step walkthrough

    Importing modules

    We start by importing the Perl modules we need:

    use strict;
    use warnings;
    
    use LWP::Simple;
    use HTML::TreeBuilder::XPath;
    
  • strict and warnings catch common mistakes, such as undeclared variables, and emit useful warnings
  • LWP::Simple sends HTTP requests
  • HTML::TreeBuilder::XPath parses HTML and allows DOM traversal and extraction

    Defining the target URL

    Next, we store the Hacker News homepage URL in a variable:

    my $url = "https://news.ycombinator.com/";
    

    This is the page we will scrape.

    Sending HTTP request

    We use LWP::Simple to send a GET request and store the result:

    my $content = get($url);
    

    This downloads the raw HTML content of the Hacker News homepage.
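
Note that get from LWP::Simple returns undef on failure and gives no reason. If you want the HTTP status when something goes wrong, one option is LWP::UserAgent, which ships in the same distribution. The following is a sketch under assumptions: fetch_html is a hypothetical helper name, and the timeout and User-Agent string are arbitrary example values.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: returns the page body, or undef (after a warning) on failure.
sub fetch_html {
    my ($url) = @_;
    my $ua = LWP::UserAgent->new(
        timeout => 10,                         # seconds; example value
        agent   => 'hn-scraper-example/0.1',   # made-up User-Agent string
    );
    my $response = $ua->get($url);
    unless ($response->is_success) {
        warn "Request for $url failed: ", $response->status_line, "\n";
        return;
    }
    return $response->decoded_content;
}

# Fetch only when a URL is passed on the command line
if (@ARGV) {
    my $html = fetch_html($ARGV[0]);
    print length($html), " characters received\n" if defined $html;
}
```

Saved as, say, fetch.pl, it could be run as: perl fetch.pl https://news.ycombinator.com/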

    Parsing the HTML

    Next, we parse the HTML content using HTML::TreeBuilder::XPath:

    my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
    

    This creates a "tree" representation of elements, on which we can perform DOM operations.
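
To get a feel for the tree without touching the network, you can parse a small hand-written HTML string and query it. This sketch uses made-up markup; the point is that findvalue returns a plain string, while findnodes returns element objects you can call methods on.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Made-up HTML, used only to illustrate the two query methods
my $html = '<html><head><title>Demo page</title></head>'
         . '<body><p class="note">Hello, tree</p></body></html>';

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# findvalue returns the matched text as a plain string
my $title = $tree->findvalue('//title');

# findnodes returns element objects; take the first match
my ($note) = $tree->findnodes('//p[@class="note"]');
my $note_text = $note ? $note->as_text : '';

print "Title: $title\n";
print "Note:  $note_text\n";
```

Because findvalue already returns text, calling element methods such as as_text on its result does not work; use findnodes when you need the element itself.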

    Finding all table rows

    Inspecting the page

    Inspecting the page, you can see that each item is housed inside a <tr> tag with the class athing.

    The Hacker News homepage is a table in which each article occupies one row, followed by a row with its details. We extract all rows with:

    my @rows = $tree->findnodes('//tr');
    

    This finds every table row on the page.
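
The difference between selecting every row and selecting only article rows can be seen on a small sample. The HTML below is hand-written to imitate the layout described above and is not copied from the live site.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hand-written sample imitating the article-row / detail-row layout
my $html = <<'HTML';
<table>
  <tr class="athing"><td>Story one</td></tr>
  <tr><td class="subtext">details one</td></tr>
  <tr class="athing"><td>Story two</td></tr>
  <tr><td class="subtext">details two</td></tr>
</table>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

my @rows         = $tree->findnodes('//tr');                   # every table row
my @article_rows = $tree->findnodes('//tr[@class="athing"]');  # exact class match only

printf "%d rows, %d article rows\n", scalar @rows, scalar @article_rows;
```

The walkthrough keeps the broader //tr query because the detail rows, which carry no athing class, are needed as well.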

    Processing the rows

    Next, we loop through the rows to identify article rows vs detail rows:

    foreach my $row (@rows) {
    
      # Identify article vs detail rows
      # Extract data from detail rows
    
    }
    

    There are a few key steps:

    1. Identify article rows with class athing. Store in $current_article.
    2. Identify detail rows based on proximity to article rows.
    3. Extract the title and URL from the article row with XPath, and the points, author, etc. from the detail row with regular expressions.
    4. Print out the extracted article data.
    5. Reset tracked variables for next article.

    Let's go through each section.

    Identifying article rows

    An article row has CSS class athing. We check for this:

    my $class = $row->attr('class');
    
    if ($class && $class eq "athing") {
    
      # This is an article row
    
      $current_article = $row;
      $current_row_type = "article";
    
    }
    

    We store the row in $current_article to process the next detail row.
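
The eq comparison matches only when the class attribute is exactly athing. If a row ever carries additional class names (an assumption about how the markup might change, not something shown above), a token-based check is more forgiving. In this sketch, has_class is a hypothetical helper, not part of HTML::TreeBuilder::XPath.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $class_attr contains $wanted as a whole class name.
sub has_class {
    my ($class_attr, $wanted) = @_;
    return 0 unless defined $class_attr;
    my @tokens = split ' ', $class_attr;   # class names are whitespace-separated
    return (grep { $_ eq $wanted } @tokens) ? 1 : 0;
}

# Try a few made-up class attribute values, including a missing one
for my $class_attr ('athing', 'athing submission', 'nothing', undef) {
    printf "%-20s => %s\n", $class_attr // '(no class)',
        has_class($class_attr, 'athing') ? 'article row' : 'other row';
}
```

Inside the loop it could be used as: if (has_class($row->attr('class'), 'athing')) { ... }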

    Identifying detail rows

    The detail row comes immediately after the article row. We check for proximity:

    elsif ($current_row_type && $current_row_type eq "article") {
    
      # This is the details row
    
      # process this row
    
    }
    

    Now we can extract data from $current_article and the detail row.
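
The pairing logic can be exercised on its own before any fields are extracted. The sketch below runs the same remember-then-pair idea over hand-written sample HTML (not the live page) and collects each article row with the row that follows it. It tracks only $current_article, which is a simplification of the two-variable approach above.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hand-written sample: article row, detail row, spacer row, repeated
my $html = <<'HTML';
<table>
  <tr class="athing"><td class="title"><a href="https://example.com/a">First story</a></td></tr>
  <tr><td class="subtext">10 points by alice 1 hour ago | 3 comments</td></tr>
  <tr class="spacer" style="height:5px"></tr>
  <tr class="athing"><td class="title"><a href="https://example.com/b">Second story</a></td></tr>
  <tr><td class="subtext">25 points by bob 2 hours ago | 7 comments</td></tr>
  <tr class="spacer" style="height:5px"></tr>
</table>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

my @pairs;             # each entry: [article row, detail row]
my $current_article;   # set while we wait for the matching detail row

for my $row ($tree->findnodes('//tr')) {
    my $class = $row->attr('class') // '';
    if ($class eq 'athing') {
        $current_article = $row;                 # remember the article row
    }
    elsif ($current_article) {
        push @pairs, [$current_article, $row];   # the next row is its detail row
        undef $current_article;                  # reset for the next article
    }
    # any other row (such as a spacer) is ignored
}

printf "%s => %s\n", $_->[0]->as_text, $_->[1]->as_text for @pairs;
```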

    Extracting article data

    Once we reach the detail row, we use XPath on the article row and regular expressions on the detail row to extract fields:

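    # findnodes returns the element itself; findvalue would return only a string.
    # Assumes HN's current markup, where the story link sits inside <span class="titleline">.
    my ($title_elem) = $current_article->findnodes('.//span[@class="titleline"]/a');
    my $article_title = $title_elem ? $title_elem->as_text : 'N/A';
    
    # The same <a> element carries the article URL
    my $article_url_elem = $title_elem;
    my $article_url = $article_url_elem ? $article_url_elem->attr('href') : 'N/A';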
    
    my $subtext = $row->findvalue('.//td[@class="subtext"]');
    my ($points, $author, $timestamp, $comments);
    
    if ($subtext) {
      ($points)     = $subtext =~ /(\d+)\s+points/;
      ($author)     = $subtext =~ /by\s+(\S+)/;
      ($timestamp)  = $subtext =~ /(\d+\s+\S+\s+ago)/;
      ($comments)   = $subtext =~ /(\d+\s+comments?)/;
    }
    

    Let's break this down:

  • $title_elem - the title element in the article row
  • $article_url_elem - the element in the article row that holds the URL
  • $subtext - the text of the subtext cell in the detail row
  • The regular expressions extract the points, author, timestamp, and comment count from the subtext

    Finally, we print the extracted data:

    print("Title: $article_title\n");
    print("URL: $article_url\n");
    # etc...
    

    And so on for each article!
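
The subtext regular expressions and the printing step can be tried offline on sample strings. The sketch below rests on a few assumptions: parse_subtext is a hypothetical helper, the sample strings are made up, and the [\s\xA0]+ pattern is there in case the live page separates the number from the word with a non-breaking space. It also shows one way to print a default when a field is missing, so that use warnings stays quiet.

```perl
use strict;
use warnings;

# Hypothetical helper: extract fields from one subtext string.
sub parse_subtext {
    my ($subtext) = @_;
    my %f;
    # [\s\xA0]+ also accepts a non-breaking space between the number and the word
    ($f{points})    = $subtext =~ /(\d+)[\s\xA0]+points?/;
    ($f{author})    = $subtext =~ /by\s+(\S+)/;
    ($f{timestamp}) = $subtext =~ /(\d+\s+\S+\s+ago)/;
    ($f{comments})  = $subtext =~ /(\d+)[\s\xA0]+comments?/;
    return \%f;
}

# Made-up samples: a regular story, and a listing with no points or author
my @samples = (
    "142 points by alice 3 hours ago | hide | 57\x{A0}comments",
    "2 hours ago | hide",
);

for my $sample (@samples) {
    my $fields = parse_subtext($sample);
    for my $key (qw(points author timestamp comments)) {
        # // supplies a default so undefined fields do not trigger warnings
        printf "%-11s %s\n", ucfirst($key) . ':', $fields->{$key} // 'N/A';
    }
    print '-' x 30, "\n";
}
```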

    Full code

    Here is the complete code once more for reference:

    use strict;
    use warnings;
    use LWP::Simple;
    use HTML::TreeBuilder::XPath;
    
    # Define the URL of the Hacker News homepage
    my $url = "https://news.ycombinator.com/";
    
    # Send a GET request to the URL
    my $content = get($url);
    
    # Check if the request was successful
    if ($content) {
        my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
    
        # Find all rows in the table
        my @rows = $tree->findnodes('//tr');
    
        # Initialize variables to keep track of the current article and row type
        my ($current_article, $current_row_type);
    
        # Iterate through the rows to scrape articles
        foreach my $row (@rows) {
            my $class = $row->attr('class');
            if ($class && $class eq "athing") {
                # This is an article row
                $current_article = $row;
                $current_row_type = "article";
            } elsif ($current_row_type && $current_row_type eq "article") {
                # This is the details row
                if ($current_article) {
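                    # findnodes returns the element itself; findvalue would return only a string.
                    # Assumes HN's current markup, where the story link sits inside <span class="titleline">.
                    my ($title_elem) = $current_article->findnodes('.//span[@class="titleline"]/a');
                    my $article_title = $title_elem ? $title_elem->as_text : 'N/A';
    
                    # The same <a> element carries the article URL
                    my $article_url_elem = $title_elem;
                    my $article_url = $article_url_elem ? $article_url_elem->attr('href') : 'N/A';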
    
                    my $subtext = $row->findvalue('.//td[@class="subtext"]');
                    my ($points, $author, $timestamp, $comments);
    
                    if ($subtext) {
                        ($points) = $subtext =~ /(\d+)\s+points/;
                        ($author) = $subtext =~ /by\s+(\S+)/;
                        ($timestamp) = $subtext =~ /(\d+\s+\S+\s+ago)/;
                        ($comments) = $subtext =~ /(\d+\s+comments?)/;
                    }
    
                    # Print the extracted information
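                    # The // 'N/A' defaults avoid warnings when a field is missing
                    print("Title: ", $article_title // 'N/A', "\n");
                    print("URL: ", $article_url // 'N/A', "\n");
                    print("Points: ", $points // 'N/A', "\n");
                    print("Author: ", $author // 'N/A', "\n");
                    print("Timestamp: ", $timestamp // 'N/A', "\n");
                    print("Comments: ", $comments // 'N/A', "\n");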
                    print("-" x 50 . "\n");  # Separating articles
                }
    
                # Reset the current article and row type
                $current_article = undef;
                $current_row_type = undef;
            } elsif ($row->attr('style') && $row->attr('style') eq "height:5px") {
                # This is the spacer row, skip it
                next;
            }
        }
    } else {
        print("Failed to retrieve the page.\n");
    }
    

    This is great as a learning exercise, but the script is prone to getting blocked because it sends every request from a single IP. If you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

    Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

    Our rotating proxy server, Proxies API, provides a simple API that can solve all IP blocking problems instantly. It offers:

  • Millions of high-speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent string rotation (which simulates requests from different, valid web browsers and web browser versions)
  • Automatic CAPTCHA solving technology

    Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed through a simple API, as shown below, from any programming language.

    In fact, you don't even have to take the pain of loading Puppeteer, because we render JavaScript behind the scenes. You just get the data and parse it in any language such as Node or PHP, or with any framework such as Scrapy or Nutch. In all these cases you can call the URL with render support like so:

    curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"
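
Since the rest of this guide is in Perl, the same request URL could be built there as well. This is a sketch: it assumes only the endpoint and the key, render, and url parameters shown in the curl command, API_KEY remains a placeholder, and proxied_url is a hypothetical helper name. It uses the URI module, which is installed as a dependency of LWP, and percent-encodes the target URL, which is standard for query strings even though the curl example passes it unencoded.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: wrap a target URL in the API request shown in the curl command.
sub proxied_url {
    my ($api_key, $target) = @_;
    my $uri = URI->new('http://api.proxiesapi.com/');
    # query_form percent-encodes each value, including the nested target URL
    $uri->query_form(
        key    => $api_key,
        render => 'true',
        url    => $target,
    );
    return $uri->as_string;
}

# API_KEY is a placeholder; substitute your own key
my $request_url = proxied_url('API_KEY', 'https://news.ycombinator.com/');
print "$request_url\n";
```

The resulting string can then be passed to get($request_url) from LWP::Simple in place of the direct Hacker News URL.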
    
    

    We have a running offer of 1000 API calls completely free. Register and get your free API Key.
