Scraping Reddit Posts in Perl

Jan 9, 2024 · 5 min read

In this article, we will be scraping Reddit to extract information from posts using Perl. We will get the HTML content of a Reddit page, parse it to find post blocks, and extract the permalink, title, author, comments count and other fields from each block.

here is the page we are talking about

Modules Used

Here are the modules we use:

LWP::UserAgent - Makes HTTP requests and handles cookies, authorization, proxies, etc. Needed to download the Reddit page.

HTML::TreeBuilder - Parses HTML and allows traversal/extraction of data. We use it to parse the Reddit HTML and find posts.

Strict - Requires variables be declared with my before use

Warnings - Catches errors, undefined variables, etc.

No special installation needed for these core Perl modules.

Code Walkthrough

Define Reddit URL

my $reddit_url = "<https://www.reddit.com>";

We define the Reddit URL we want to scrape. I use the front page but we could use any subreddit or thread URL.

Create a User Agent

my $user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36";

We make a browser User Agent string to identify as a real browser. Sites may block automated bot requests.

Make a UserAgent Object

my $ua = LWP::UserAgent->new;
$ua->agent($user_agent);

We create a UserAgent object from LWP::UserAgent and set the User Agent string we defined.

Send Request with UA Header

my $response = $ua->get($reddit_url);

We make a GET request to the Reddit URL using the UserAgent object with our browser user_agent set. This will download the Reddit page with the proper headers.

Check if Successful

if ($response->is_success) {
  # save HTML
} else {
   print "Request failed";
}

We check if the response status code is 200 OK for success. If so we continue, otherwise we print an error.

Save HTML to File

open my $file, '>:encoding(utf-8)', $filename or die "Could not open $filename: $!";
print $file $html_content;
close $file;

We save the HTML content to a file named reddit_page.html. This allows us to parse the file multiple times without needing to re-download.

Parse HTML

my $tree = HTML::TreeBuilder->new;
$tree->parse_content($response->content);

We create a new HTML::TreeBuilder object and use the parse_content method to parse the entire HTML document into a traversable tree structure.

Find Post Blocks

Inspecting the elements

Upon inspecting the HTML in Chrome, you will see that each of the posts have a particular element shreddit-post and class descriptors specific to them…

my @blocks = $tree->look_down(
  '_tag' => 'shreddit-post',
  'class' => qr/block relative cursor-pointer bg-neutral-background focus-within:bg-neutral-background-hover hover:bg-neutral-background-hover xs:rounded-\\\\\\[16px\\\\\\] p-md my-2xs nd:visible/
);

This is where we locate the posts within the large HTML document. We use the look_down method to find all elements with tag shreddit-post and the exact class shown. This selects the post blocks.

This is the most complex part for beginners. We are extracting the posts by matching the literal HTML attributes and class name.

The qr/regex/ syntax creates a regular expression to match the class. Do not change these literal strings otherwise it will break.

Extract Post Data

foreach my $block (@blocks) {

  my $permalink  = $block->attr('permalink');

  my $title = $block->look_down('_tag' => 'div', 'slot' => 'title')->as_text;

  print "Permalink: $permalink\\n";
  print "Title: $title\\n";
}

We loop through each block and use attr() and look_down() to extract data into variables like $permalink and $title.

We get the permalink from the block's attribute. We look deeper into the block to find the title element and get its text.

We print out the extracted data. Do this for all fields needed.

Clean Up

$tree->delete;

Delete the tree when done to free memory.

Full Code

use LWP::UserAgent;
use HTML::TreeBuilder;
use strict;
use warnings;

# Define the Reddit URL you want to download
my $reddit_url = "https://www.reddit.com";

# Define a User-Agent header
my $user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"; # Replace with your User-Agent string

# Create a UserAgent object
my $ua = LWP::UserAgent->new;
$ua->agent($user_agent);

# Send a GET request to the URL with the User-Agent header
my $response = $ua->get($reddit_url);

# Check if the request was successful (status code 200)
if ($response->is_success) {
    # Get the HTML content of the page
    my $html_content = $response->content;

    # Specify the filename to save the HTML content
    my $filename = "reddit_page.html";

    # Save the HTML content to a file
    open my $file, '>:encoding(utf-8)', $filename or die "Could not open $filename: $!";
    print $file $html_content;
    close $file;

    print "Reddit page saved to $filename\n";
} else {
    print "Failed to download Reddit page (status code " . $response->status_line . ")\n";
}

# Parse the entire HTML content
my $tree = HTML::TreeBuilder->new;
$tree->parse_content($response->content);

# Find all blocks with the specified tag and class
my @blocks = $tree->look_down(
    '_tag' => 'shreddit-post',
    'class' => qr/block relative cursor-pointer bg-neutral-background focus-within:bg-neutral-background-hover hover:bg-neutral-background-hover xs:rounded-\[16px\] p-md my-2xs nd:visible/
);

# Iterate through the blocks and extract information from each one
foreach my $block (@blocks) {
    my $permalink = $block->attr('permalink');
    my $content_href = $block->attr('content-href');
    my $comment_count = $block->attr('comment-count');
    my $post_title = $block->look_down('_tag' => 'div', 'slot' => 'title')->as_text;
    my $author = $block->attr('author');
    my $score = $block->attr('score');

    # Print the extracted information for each block
    print "Permalink: $permalink\n";
    print "Content Href: $content_href\n";
    print "Comment Count: $comment_count\n";
    print "Post Title: $post_title\n";
    print "Author: $author\n";
    print "Score: $score\n\n";
}

# Clean up
$tree->delete;

Browse by tags:

Browse by language:

Tired of getting blocked while scraping the web?

ProxiesAPI handles headless browsers and rotates proxies for you.
Get access to 1,000 free API credits, no credit card required!