Scraping Hacker News with Ruby

Jan 21, 2024 · 6 min read

Web scraping is the process of programmatically extracting data from websites. This is often done by sending HTTP requests to a target site, then parsing the HTML response to identify and extract relevant information.

In this article, we'll walk through Ruby code that scrapes titles, URLs, vote counts, authors, timestamps, and comment counts from the popular Hacker News site. The code utilizes the Nokogiri library for HTML parsing and OpenURI for sending HTTP requests.

Prerequisites

Before running the web scraper, you'll need Ruby and the Nokogiri gem installed. OpenURI ships with Ruby's standard library, so it normally needs no separate installation. Nokogiri can be installed by running:

gem install nokogiri

Walkthrough

Now let's dive into how the web scraper code works:

require 'open-uri'
require 'nokogiri'

First we require the OpenURI and Nokogiri modules that we'll need for making requests and parsing.

url = "https://news.ycombinator.com/"

Next, we define the URL of the Hacker News homepage that we want to scrape.

page_content = URI.open(url).read

We use URI.open from OpenURI to send a GET request to the Hacker News URL. This returns an object that we call read on to access the full HTML content of the page. This raw HTML is what we'll parse next.
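
A bare URI.open call raises an exception when the server returns an HTTP error or the network is unreachable. As a minimal sketch, the fetch could be wrapped as shown here. The method name fetch_page, the User-Agent string, and the timeout values are illustrative assumptions and are not part of the original script.

```ruby
require 'open-uri'

# Hypothetical helper (not in the original script): fetches a page and
# returns its HTML as a String, or nil if the request fails.
def fetch_page(url, user_agent: 'hn-scraper-example/0.1')
  URI.open(url,
           'User-Agent' => user_agent, # string keys are sent as HTTP headers
           open_timeout: 5,            # seconds to wait for the connection
           read_timeout: 10).read      # seconds to wait for the response
rescue OpenURI::HTTPError, SocketError, SystemCallError, Timeout::Error => e
  warn "Could not fetch #{url}: #{e.class}: #{e.message}"
  nil
end

# Example usage (performs a real request, so it is left commented out):
#   page_content = fetch_page('https://news.ycombinator.com/')
```

Returning nil on failure lets the caller decide whether to retry, skip, or abort, instead of crashing in the middle of a run.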

doc = Nokogiri::HTML(page_content)

Here we pass the HTML content to Nokogiri's HTML method, which parses it and returns a document object that we can now query using CSS selectors.

Inspecting the page

If you inspect the page with your browser's developer tools, you'll notice that each story is housed inside a tr tag with the class athing.

rows = doc.css('tr')

The headlines on Hacker News are contained in table rows (tr). Here we grab all tr elements from the document using the css method and CSS selector syntax. This returns a collection of Nokogiri XML node objects representing each row.

current_article = nil
current_row_type = nil

As we loop through the rows, we'll use these variables to keep track of whether we're currently processing an article row or detail row.

rows.each do |row|

  if row['class'] == 'athing'
    # This is an article row
    current_article = row
    current_row_type = 'article'

  elsif current_row_type == 'article'
    # This is the details row

    if current_article
      title_elem = current_article.css('span.title a')

      # css always returns a NodeSet (truthy even when empty), so check any?
      if title_elem.any?
        article_title = title_elem.text
        article_url = title_elem[0]['href']

        ...

      end

    end

    # Reset the flags before moving on to the next row
    current_article = nil
    current_row_type = nil

  end

end

We iterate through the rows using .each. For any row that has class athing, we know it's an article headline, so we set current_article and current_row_type accordingly.

The next row after a headline contains additional details like score, author, etc. So if current_row_type equals article we extract those details, reset the flags, and move on.

Focusing now on the key data extraction using CSS selectors:

title_elem = current_article.css('span.title a')

Here .css() is called on the headline row XML node. We use the selector span.title a to match the anchor tag inside the title span, containing the article title text.

article_title = title_elem.text
article_url = title_elem[0]['href']

From the matched element, .text extracts the title itself, while [0]['href'] grabs the URL from the first anchor's href attribute.
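
An href is not always an absolute URL. Links that point back into the site itself may be relative paths. A small sketch using the standard library's URI.join can normalize both cases. The helper name absolute_url and the sample hrefs (including the id 123) are hypothetical.

```ruby
require 'uri'

BASE_URL = 'https://news.ycombinator.com/'

# Hypothetical helper: resolves an href against the site's base URL.
# Absolute hrefs pass through unchanged; relative ones gain the base.
def absolute_url(href, base = BASE_URL)
  URI.join(base, href).to_s
rescue URI::InvalidURIError
  href # leave malformed links untouched
end

puts absolute_url('item?id=123')               # relative link (made-up id)
puts absolute_url('https://example.com/post')  # already absolute
```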

The code continues on using additional selectors to extract score, author, comments, etc. from the detail rows:

subtext = row.css('td.subtext')

points = subtext.css('span.score').text

author = subtext.css('a.hnuser').text

timestamp = subtext.css('span.age')[0]['title']

comments_elem = subtext.css('a:contains("comments")')
comments = comments_elem.any? ? comments_elem.text : '0'

In each case:

  • The specific elements are targeted using tag, class, and text-based selectors
  • The relevant data is extracted using .text or attributes like ['href']

The final output prints each field scraped for every headline:

    Title: ...
    URL: ...
    Points: ...
    Author: ...
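
The points and comments values come back as text taken from the page rather than as numbers. If you want to sort or filter on them, a sketch like the following could convert them. The helper name first_number and the sample strings are assumptions for illustration.

```ruby
# Hypothetical helper: pulls the first run of digits out of a scraped string
# and returns it as an Integer, or 0 when the string contains no digits.
def first_number(text)
  text.to_s[/\d+/].to_i
end

puts first_number('123 points')
puts first_number('45 comments')
puts first_number('discuss')   # no digits, so this falls back to 0
puts first_number(nil)         # nil is treated like an empty string
```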
    

Full code:

    require 'open-uri'
    require 'nokogiri'
    
    # Define the URL of the Hacker News homepage
    url = "https://news.ycombinator.com/"
    
    # Send a GET request to the URL and read the content
    page_content = URI.open(url).read
    
    # Parse the HTML content of the page using Nokogiri
    doc = Nokogiri::HTML(page_content)
    
    # Find all rows in the table
    rows = doc.css('tr')
    
    # Initialize variables to keep track of the current article and row type
    current_article = nil
    current_row_type = nil
    
    # Iterate through the rows to scrape articles
    rows.each do |row|
      if row['class'] == 'athing'
        # This is an article row
        current_article = row
        current_row_type = 'article'
      elsif current_row_type == 'article'
        # This is the details row
        if current_article
          title_elem = current_article.css('span.title a')
              if title_elem.any?  # css returns a NodeSet, which is truthy even when empty
            article_title = title_elem.text  # Get the text of the anchor element
            article_url = title_elem[0]['href']  # Get the href attribute of the anchor element
    
            subtext = row.css('td.subtext')
            points = subtext.css('span.score').text
            author = subtext.css('a.hnuser').text
            timestamp = subtext.css('span.age')[0]['title']
            comments_elem = subtext.css('a:contains("comments")')
                comments = comments_elem.any? ? comments_elem.text : '0'
    
            # Print the extracted information
            puts "Title: #{article_title}"
            puts "URL: #{article_url}"
            puts "Points: #{points}"
            puts "Author: #{author}"
            puts "Timestamp: #{timestamp}"
            puts "Comments: #{comments}"
            puts "-" * 50  # Separating articles
          end
        end
    
        # Reset the current article and row type
        current_article = nil
        current_row_type = nil
      elsif row['style'] == 'height:5px'
        # This is the spacer row, skip it
        next
      end
    end
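
Printing is fine for a first run, but you will usually want to keep the results. One possible sketch collects each article into a hash and writes the collection out as JSON using Ruby's standard library. The helper name save_articles, the sample record, and the file name are hypothetical.

```ruby
require 'json'

# Hypothetical helper: writes an array of article hashes to a JSON file.
def save_articles(articles, path)
  File.write(path, JSON.pretty_generate(articles))
end

# Made-up record standing in for one scraped row. Inside the loop you would
# append a hash like this instead of calling puts for each field.
articles = [
  { title: 'Example story', url: 'https://example.com/a',
    points: '10 points', author: 'alice', comments: '3 comments' }
]

puts JSON.pretty_generate(articles)

# Example usage (writes a file, so it is left commented out):
#   save_articles(articles, 'hn_articles.json')
```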

This is great as a learning exercise, but a scraper like this is prone to getting blocked because it sends every request from a single IP. If you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

Our rotating proxy server, Proxies API, provides a simple API that can solve all IP blocking problems instantly.

It comes with:

  • Millions of high speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
  • Automatic CAPTCHA solving technology

Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed through a simple API, like the one below, from any programming language.

In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes. You can just get the data and parse it in any language like Node or PHP, or using any framework like Scrapy or Nutch. In all these cases you can just call the URL with render support like so:

    curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"
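
From Ruby, the same request URL could be assembled with the standard library, as in this sketch. The endpoint and the key, render, and url parameters are taken from the curl example. The helper name proxies_api_url is hypothetical, and API_KEY remains a placeholder for your own key.

```ruby
require 'uri'

# Hypothetical helper: builds the request URL used in the curl example.
# The target URL is percent-encoded so that any query string of its own
# is not confused with the API's parameters.
def proxies_api_url(target_url, api_key:, render: true)
  query = URI.encode_www_form(key: api_key, render: render, url: target_url)
  "http://api.proxiesapi.com/?#{query}"
end

puts proxies_api_url('https://example.com', api_key: 'API_KEY')
```

The resulting string could then be passed to URI.open in place of the direct Hacker News URL.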
    
    

We have a running offer of 1000 API calls completely free. Register and get your free API Key.
