Scraping Reddit Posts with Ruby

Jan 9, 2024 · 6 min read

In this article, we'll walk through a Ruby script that scrapes various data from Reddit posts.

Some common use cases for web scraping Reddit include:

  • Collecting public data for research purposes
  • Analyzing posting trends and activity
  • Gathering data to train machine learning models
  • Building Reddit bots or apps

While scraping does have some ethical considerations (which we won't get into here), it can be a useful skill for programmers to acquire.

    The page we'll be scraping is the Reddit homepage, https://www.reddit.com.

    So let's jump right into the code!

    Setting Up

    We'll be using open-uri and nokogiri. open-uri ships with Ruby's standard library, so only the nokogiri gem needs to be installed:

    gem install nokogiri
    

    The open-uri module gives us a convenient API for opening URLs, while nokogiri is used for parsing and searching the HTML content.

    Okay, with the dependencies handled, let's start going through the script:

    Defining Constants

    First we define two values that are reused throughout the script:

    reddit_url = "https://www.reddit.com"
    
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
    
  • reddit_url contains the base URL of the Reddit homepage
  • user_agent spoofs a desktop Chrome browser's user agent string when sending requests

    Browser user agent strings identify what type of browser is making the request. Spoofing a common desktop user agent can help avoid basic bot detection.
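
As a small illustration of how these values reach the request, open-uri treats any option whose key is a string as an HTTP header. The sketch below gathers the headers in a hypothetical request_headers helper. The Accept-Language header is an assumption added for illustration; the original script sends only User-Agent.

```ruby
# Hypothetical helper (not in the original script): builds the header hash
# that would be passed to URI.open as its second argument.
def request_headers(user_agent)
  {
    "User-Agent" => user_agent,
    # Assumption: ask for English content. The original sends only User-Agent.
    "Accept-Language" => "en-US,en;q=0.9"
  }
end

user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"

# Print the headers instead of sending a request, so nothing touches the network.
request_headers(user_agent).each { |name, value| puts "#{name}: #{value}" }
```

The resulting hash could then be passed as URI.open(reddit_url, request_headers(user_agent)).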

    Making the Initial Request

    Next, we use the handy URI.open method that open-uri gives us to fetch the contents of the Reddit homepage:

    begin
      html_content = URI.open(reddit_url, "User-Agent" => user_agent).read
    rescue OpenURI::HTTPError => e
      # error handling
    end
    

    Breaking this down:

  • URI.open opens the reddit_url
  • We pass the user_agent string in a header to spoof a browser
  • .read reads the contents of the open URI into a string
  • We wrap it in begin/rescue to handle any HTTP errors

    At this point, html_content contains one large string holding the raw HTML of the Reddit homepage.
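
The "# error handling" placeholder can be filled in several ways. One option, sketched here under assumptions, is to retry a failed request a few times before giving up. The with_retries helper, its parameter names, and the choice of which errors to retry are all hypothetical and not part of the original script.

```ruby
require 'open-uri'
require 'timeout'

# Hypothetical helper (not in the original script): runs the block and
# retries it when the request fails, waiting a little longer each time.
def with_retries(max_attempts: 3, base_delay: 1.0)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue OpenURI::HTTPError, Timeout::Error => e
    raise if attempts >= max_attempts  # out of attempts: re-raise the error
    warn "Attempt #{attempts} failed (#{e.message}); retrying..."
    sleep(base_delay * attempts)
    retry
  end
end

# Usage sketch (makes a real network request, so it is left commented out):
# html_content = with_retries do
#   URI.open(reddit_url, "User-Agent" => user_agent).read
# end
```

Retrying only on OpenURI::HTTPError and Timeout::Error is a deliberate simplification; a real script might also want to stop immediately on 4xx responses.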

    Saving the HTML

    Next we write the HTML content to a local file:

    filename = "reddit_page.html"
    
    File.open(filename, "w:UTF-8") do |file|
      file.write(html_content)
    end
    
    puts "Reddit page saved to #{filename}"
    

    Here we:

  • Define a filename for the output
  • File.open that file for writing
  • file.write the html_content string to it
  • Confirm the save with a puts

    Saving the HTML locally allows us to inspect or re-parse the content multiple times without needing to re-download it.
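
To actually benefit from the saved copy, a later run can read the file back instead of downloading again. The following is a minimal sketch of that idea; cached_html is a hypothetical helper, not something the original script defines.

```ruby
# Hypothetical helper (not in the original script): returns the saved HTML
# if the file already exists; otherwise calls the block to fetch the page,
# saves the result, and returns it.
def cached_html(filename)
  return File.read(filename, encoding: "UTF-8") if File.exist?(filename)

  html = yield
  File.write(filename, html, encoding: "UTF-8")
  html
end

# Usage sketch (the block would make a real request, so it is commented out):
# html_content = cached_html("reddit_page.html") do
#   URI.open(reddit_url, "User-Agent" => user_agent).read
# end
```

Deleting reddit_page.html forces a fresh download on the next run.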

    Parsing the HTML with Nokogiri

    Now we have the Reddit homepage HTML saved locally, and can parse it using the Nokogiri gem:

    doc = Nokogiri::HTML(html_content)
    
  • Nokogiri::HTML() parses HTML from either a string or file
  • It returns a doc object we can now search and iterate through

    Nokogiri gives us CSS and XPath selectors to extract the exact pieces of information we want.

    Extracting Reddit Posts

    Inspecting the elements

    Upon inspecting the HTML in Chrome, you will see that each post is a shreddit-post element with its own set of classes and attributes.

    In the longest and perhaps trickiest part of the script, we use a complex CSS selector to extract Reddit post blocks from the parsed HTML document:

    blocks = doc.css('shreddit-post.block.relative.cursor-pointer.bg-neutral-background.focus-within\:bg-neutral-background-hover.hover\:bg-neutral-background-hover.xs\:rounded-\[16px\].p-md.my-2xs.nd\:visible')
    

    Let's break down what this selector is doing to understand it:

  • doc.css() runs our CSS selection on the parsed doc
  • It looks for shreddit-post tags that carry many classes, such as block, relative, cursor-pointer, and bg-neutral-background
  • \[ and \] match literal brackets, as in xs:rounded-[16px]
  • \: matches a literal colon, as in hover:bg-neutral-background-hover; without the backslash, the colon would start a pseudo-class

    So in plain English, we're finding all Reddit post blocks on the page that have these very specific CSS classes styling and positioning them. This took some trial and error to pinpoint.

    The key things to remember are:

  • Use very explicit, precise selectors to extract just what you want
  • You may need to experiment to get the specificity right
  • Special characters MUST be escaped properly
  • Don't change the literal text like shreddit-post

    This returns a blocks object containing all the matched post blocks we want to scrape.
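
Escaping a long selector by hand is error-prone. As a sketch, the escaping can be done programmatically from the raw class names as they appear in the browser inspector. Both helpers below are hypothetical and only cover colons and square brackets, the special characters that appear in this selector.

```ruby
# Hypothetical helpers (not in the original script).

# Backslash-escapes the characters in a class name that CSS would otherwise
# treat as syntax: colons and square brackets.
def escape_css_class(name)
  name.gsub(/[:\[\]]/) { |char| "\\#{char}" }
end

# Joins a tag name and a list of raw class names into one CSS selector.
def build_selector(tag, classes)
  tag + classes.map { |css_class| ".#{escape_css_class(css_class)}" }.join
end

selector = build_selector(
  'shreddit-post',
  ['block', 'hover:bg-neutral-background-hover', 'xs:rounded-[16px]']
)
puts selector
# prints: shreddit-post.block.hover\:bg-neutral-background-hover.xs\:rounded-\[16px\]
```

The resulting string can be passed straight to doc.css.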

    Looping Through the Posts

    With the posts now stored in blocks, we can loop through them:

    blocks.each do |block|
      # extract data from each block
    end
    
  • The .each method loops through the post block elements
  • We can now access each one individually to scrape data

    Let's look at what information we're extracting from every block next.

    Scraping Post Data

    Inside the loop, we use the post block and handy Nokogiri methods to scrape key data points:

    permalink = block['permalink']
    content_href = block['content-href']
    comment_count = block['comment-count']
    
    post_title = block.css('div[slot="title"]').text.strip
    author = block['author']
    score = block['score']
    

    Here's what we're grabbing and how:

  • permalink, content_href, and comment_count use the [] syntax to read HTML attributes from the shreddit-post element
  • post_title uses a more specific CSS selector and .text
  • .strip removes leading and trailing whitespace from that text
  • author and score are again read directly from attributes

    Together, these data points describe the core aspects of a post: title, author, link, comment count, and score.

    We print this data inside the loop, once per post, to check what we scraped.

    The key things to emphasize again:

  • Attribute vs text data extraction
  • Using precise, unique CSS selectors
  • Methods like .text and .strip
  • And most importantly, not changing literal strings such as permalink, which must match the attribute names in Reddit's HTML exactly
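
Because attribute extraction always yields strings, numeric fields such as score and comment_count need converting before they can be sorted or summed. A minimal sketch, assuming Ruby 2.6 or later for the exception: keyword; to_int_or_nil is a hypothetical helper.

```ruby
# Hypothetical helper (not in the original script): converts an attribute
# value such as "345" to an Integer, returning nil when the attribute is
# missing or is not a whole number.
def to_int_or_nil(value)
  Integer(value.to_s.strip, 10, exception: false)
end

puts to_int_or_nil("345").inspect  # a numeric string becomes an Integer
puts to_int_or_nil(nil).inspect    # a missing attribute becomes nil
```

Returning nil instead of raising keeps one malformed post from stopping the whole loop.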

    Putting It All Together

    Walkthrough complete! Here is the full code:

    require 'open-uri'
    require 'nokogiri'
    
    # Define the Reddit URL you want to download
    reddit_url = "https://www.reddit.com"
    
    # Define a User-Agent header
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
    
    # Send a GET request to the URL with the User-Agent header
    begin
      html_content = URI.open(reddit_url, "User-Agent" => user_agent).read
    
      # Specify the filename to save the HTML content
      filename = "reddit_page.html"
    
      # Save the HTML content to a file
      File.open(filename, "w:UTF-8") do |file|
        file.write(html_content)
      end
    
      puts "Reddit page saved to #{filename}"
    rescue OpenURI::HTTPError => e
      puts "Failed to download Reddit page (status code #{e.io.status[0]})"
      exit 1 # Stop here: without the HTML there is nothing to parse
    end
    
    # Parse the entire HTML content
    doc = Nokogiri::HTML(html_content)
    
    # Find all blocks with the specified tag and class
    blocks = doc.css('shreddit-post.block.relative.cursor-pointer.bg-neutral-background.focus-within\:bg-neutral-background-hover.hover\:bg-neutral-background-hover.xs\:rounded-\[16px\].p-md.my-2xs.nd\:visible')
    
    # Iterate through the blocks and extract information from each one
    blocks.each do |block|
      permalink = block['permalink']
      content_href = block['content-href']
      comment_count = block['comment-count']
      post_title = block.css('div[slot="title"]').text.strip
      author = block['author']
      score = block['score']
    
      # Print the extracted information for each block
      puts "Permalink: #{permalink}"
      puts "Content Href: #{content_href}"
      puts "Comment Count: #{comment_count}"
      puts "Post Title: #{post_title}"
      puts "Author: #{author}"
      puts "Score: #{score}"
      puts "\n"
    end
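
A natural next step is to store the results instead of only printing them. The sketch below writes an array of post hashes to a JSON file using Ruby's bundled json library. save_posts, the output filename, and the sample data are all assumptions for illustration.

```ruby
require 'json'

# Hypothetical helper (not in the original script): writes an array of post
# hashes to a JSON file and returns the filename.
def save_posts(posts, filename)
  File.write(filename, JSON.pretty_generate(posts), encoding: "UTF-8")
  filename
end

# Made-up data standing in for the values scraped inside the loop.
posts = [
  { permalink: "/r/ruby/comments/abc/example/", title: "An example post",
    author: "example_user", score: "345" }
]

# Preview the JSON on screen; call save_posts(posts, "reddit_posts.json") to write it.
puts JSON.pretty_generate(posts)
```

In the full script, the loop would append one hash per block to posts before the final save.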

    And that's it! We walked through the script from start to finish, explaining how Reddit is scraped at each step of the way.

    Let me know if any part was confusing or needs more clarification! I'm happy to explain web scraping concepts in further detail.
