Scraping Hacker News with Ruby

Web scraping is the process of programmatically extracting data from websites. This is often done by sending HTTP requests to a target site, then parsing the HTML response to identify and extract relevant information.

In this article, we'll walk through Ruby code that scrapes titles, URLs, vote counts, authors, timestamps, and comment counts from the popular Hacker News site. The code utilizes the Nokogiri library for HTML parsing and OpenURI for sending HTTP requests.


Prerequisites

Before running the web scraper, you'll need Ruby and the Nokogiri library. OpenURI ships with Ruby's standard library, so it needs no separate installation. Nokogiri can be installed by running:

gem install nokogiri

Walkthrough

Now let's dive into how the web scraper code works:

require 'open-uri'
require 'nokogiri'

First we require the OpenURI and Nokogiri modules that we'll need for making requests and parsing.

url = "https://news.ycombinator.com/"

Next, we define the URL of the Hacker News homepage that we want to scrape.

page_content = URI.open(url).read

We use URI.open from OpenURI to send a GET request to the Hacker News URL. This returns an object that we call read on to access the full HTML content of the page. This raw HTML is what we'll parse next.
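A request can fail for reasons that have nothing to do with the parsing code, so it may be worth wrapping the fetch. The sketch below is one possible shape and is not part of the original script: fetch_html, the User-Agent string, and the timeout values are all assumptions, and the opener parameter exists only so the network call can be swapped out.

```ruby
require 'open-uri'
require 'net/http' # makes sure Net::OpenTimeout and Net::ReadTimeout are defined

# Hypothetical helper (not in the original script): returns the page body,
# or nil when the request fails. The default opener performs the real request.
def fetch_html(url, opener: ->(u, opts) { URI.open(u, opts).read })
  options = {
    'User-Agent' => 'hn-scraper-example/0.1', # assumed value; string keys are sent as headers
    open_timeout: 5,                          # assumed values, in seconds
    read_timeout: 10
  }
  opener.call(url, options)
rescue OpenURI::HTTPError => e
  # Raised by OpenURI for HTTP error responses such as 404 or 503.
  warn "HTTP error for #{url}: #{e.message}"
  nil
rescue SocketError, Net::OpenTimeout, Net::ReadTimeout => e
  # DNS failures and timeouts end up here.
  warn "Network error for #{url}: #{e.class}"
  nil
end
```

With a helper like this, page_content = fetch_html(url) would return nil on failure, so the caller can stop before handing anything to the parser.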

doc = Nokogiri::HTML(page_content)

Here we pass the HTML content to Nokogiri's HTML method, which parses it and returns a document object that we can now query using CSS selectors.

Inspecting the page

If you inspect the page, you'll notice that each item is housed inside a tr tag with the class athing.

rows = doc.css('tr')

The headlines on Hacker News are contained in table rows (tr). Here we grab all tr elements from the document using the css method and CSS selector syntax. This returns a collection of Nokogiri XML node objects representing each row.

current_article = nil
current_row_type = nil

As we loop through the rows, we'll use these variables to keep track of whether we're currently processing an article row or detail row.

rows.each do |row|

  if row['class'] == 'athing'
    # This is an article row
    current_article = row
    current_row_type = 'article'

  elsif current_row_type == 'article'
    # This is the details row

    if current_article
      title_elem = current_article.css('span.title a')

      # css returns a NodeSet, which is truthy even when empty, so check any?
      if title_elem.any?
        article_title = title_elem.text
        article_url = title_elem[0]['href']

        ...

      end

    end

    # Reset the flags once the details row has been handled
    current_article = nil
    current_row_type = nil

  end

end

We iterate through the rows using .each. For any row that has class athing, we know it's an article headline, so we set current_article and current_row_type accordingly.

The next row after a headline contains additional details like score, author, etc. So if current_row_type equals article we extract those details, reset the flags, and move on.

Focusing now on the key data extraction using CSS selectors:

title_elem = current_article.css('span.title a')

Here .css() is called on the headline row XML node. We use the selector span.title a to match the anchor tag inside the title span, containing the article title text.

article_title = title_elem.text
article_url = title_elem[0]['href']

From the matched element, .text extracts the title itself, while [0]['href'] grabs the URL from the first anchor's href attribute.

The code continues on using additional selectors to extract score, author, comments, etc. from the detail rows:

subtext = row.css('td.subtext')

points = subtext.css('span.score').text

author = subtext.css('a.hnuser').text

comments_elem = subtext.css('a:contains("comments")')
comments = comments_elem.any? ? comments_elem.text : '0'

In each case:

The specific elements are targeted using tag, class, and text-based selectors

The relevant data is extracted using .text or attributes like ['href']
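The values extracted this way are still strings such as "123 points". If numbers are needed for sorting or filtering, the leading digits can be pulled out. A minimal sketch, where leading_number is a hypothetical helper and the sample strings are invented:

```ruby
# Hypothetical helper: returns the first run of digits as an Integer,
# or 0 when the text contains no digits (for example an empty score).
def leading_number(text)
  text.to_s[/\d+/].to_i
end

puts leading_number('123 points')   # → 123
puts leading_number('45 comments')  # → 45
puts leading_number('')             # → 0
```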

And the final output prints each field scraped for every headline:

Title: ...
URL: ...
Points: ...
Author: ...
Timestamp: ...
Comments: ...
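Printed text is handy for a quick check, but structured output is easier to reuse. One option, sketched here with invented sample values, is to collect each story into a hash and serialize the list with Ruby's json library. The stories array and its field names are assumptions, not part of the original script.

```ruby
require 'json'

# Invented sample data standing in for the scraped fields.
stories = [
  {
    title: 'Example story',
    url: 'https://example.com/story',
    points: '10 points',
    author: 'someone'
  }
]

# pretty_generate produces indented, human-readable JSON.
json_output = JSON.pretty_generate(stories)
puts json_output
```

In the scraper itself, the hash would be built inside the loop and appended to the array in place of the puts calls.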

Full code:

require 'open-uri'
require 'nokogiri'

# Define the URL of the Hacker News homepage
url = "https://news.ycombinator.com/"

# Send a GET request to the URL and read the content
page_content = URI.open(url).read

# Parse the HTML content of the page using Nokogiri
doc = Nokogiri::HTML(page_content)

# Find all rows in the table
rows = doc.css('tr')

# Initialize variables to keep track of the current article and row type
current_article = nil
current_row_type = nil

# Iterate through the rows to scrape articles
rows.each do |row|
  if row['class'] == 'athing'
    # This is an article row
    current_article = row
    current_row_type = 'article'
  elsif current_row_type == 'article'
    # This is the details row
    if current_article
      title_elem = current_article.css('span.title a')
      if title_elem.any?  # A NodeSet is truthy even when empty, so check any?
        article_title = title_elem.text  # Get the text of the anchor element
        article_url = title_elem[0]['href']  # Get the href attribute of the anchor element

        subtext = row.css('td.subtext')
        points = subtext.css('span.score').text
        author = subtext.css('a.hnuser').text
        timestamp = subtext.css('span.age')[0]['title']
        comments_elem = subtext.css('a:contains("comments")')
        comments = comments_elem.any? ? comments_elem.text : '0'

        # Print the extracted information
        puts "Title: #{article_title}"
        puts "URL: #{article_url}"
        puts "Points: #{points}"
        puts "Author: #{author}"
        puts "Timestamp: #{timestamp}"
        puts "Comments: #{comments}"
        puts "-" * 50  # Separating articles
      end
    end

    # Reset the current article and row type
    current_article = nil
    current_row_type = nil
  elsif row['style'] == 'height:5px'
    # This is the spacer row, skip it
    next
  end
end

This is great as a learning exercise, but it is easy to see that the scraper is prone to getting blocked because it sends every request from a single IP. If you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

Otherwise, automatic location, usage, and bot detection algorithms will tend to block your IP.

Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

With millions of high speed rotating proxies located all over the world, automatic IP rotation, automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions), and automatic CAPTCHA solving technology, hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed by a simple API like below in any programming language.

In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes. You can just get the data and parse it in any language, such as Node or PHP, or with any framework, such as Scrapy or Nutch. In all these cases you can just call the URL with render support like so:

curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"
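In Ruby, the same request URL can be assembled with the standard library so that the target URL is escaped properly. This is a sketch only: proxied_url is a hypothetical helper, API_KEY is the placeholder from the curl example, and the endpoint and parameter names are copied from that command rather than from the service's documentation.

```ruby
require 'uri'

# Hypothetical helper: builds the request URL used in the curl example above.
def proxied_url(target_url, api_key:, render: true)
  params = { 'key' => api_key, 'render' => render.to_s, 'url' => target_url }
  # encode_www_form percent-encodes the values, including the nested URL.
  "http://api.proxiesapi.com/?#{URI.encode_www_form(params)}"
end

puts proxied_url('https://example.com', api_key: 'API_KEY')
```

The resulting string could then be passed to URI.open in place of the direct Hacker News URL.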

We have a running offer of 1000 API calls completely free. Register and get your free API Key.
