Web scraping is the process of programmatically extracting data from websites. This is often done by sending HTTP requests to a target site, then parsing the HTML response to identify and extract relevant information.
In this article, we'll walk through Ruby code that scrapes titles, URLs, vote counts, authors, timestamps, and comment counts from the popular Hacker News site. The code utilizes the Nokogiri library for HTML parsing and OpenURI for sending HTTP requests.

Prerequisites
Before running the web scraper, you'll need Ruby installed along with the Nokogiri gem. OpenURI is part of Ruby's standard library, so it normally needs no separate installation. Nokogiri can be installed by running:
gem install nokogiri
Walkthrough
Now let's dive into how the web scraper code works:
require 'open-uri'
require 'nokogiri'
First we require the open-uri and nokogiri libraries, which handle the HTTP request and the HTML parsing.
url = "https://news.ycombinator.com/"
Next, we define the URL of the Hacker News homepage that we want to scrape.
page_content = URI.open(url).read
We use URI.open to send a GET request to the URL, then call read to get the response body as a string.
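A bare URI.open call raises an exception when the request fails. As a minimal sketch, the fetch could be wrapped in a helper that sets a User-Agent header and timeouts and returns nil on failure. The helper name fetch_page, the User-Agent string, and the 10-second timeouts are assumptions for illustration, not part of the original code.

```ruby
require 'open-uri'

# Hypothetical helper: return the page body as a string, or nil if the request fails.
def fetch_page(url, user_agent: 'example-scraper/0.1')
  URI.open(url, 'User-Agent' => user_agent, open_timeout: 10, read_timeout: 10).read
rescue OpenURI::HTTPError => e
  # e.io.status is a pair such as ["404", "Not Found"]
  warn "HTTP error for #{url}: #{e.io.status.join(' ')}"
  nil
rescue SocketError, SystemCallError, Timeout::Error => e
  warn "Network error for #{url}: #{e.class}: #{e.message}"
  nil
end

# Usage (performs a real request, so it is left commented out):
# page_content = fetch_page('https://news.ycombinator.com/')
```

Returning nil keeps the calling code simple; a caller can skip parsing when the fetch did not succeed.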
doc = Nokogiri::HTML(page_content)
Here we pass the HTML content to Nokogiri's HTML parser, which returns a document object that we can query with CSS selectors.
Inspecting the page
If you inspect the page, you'll notice that each item is housed inside a tr tag with the class athing.

rows = doc.css('tr')
The headlines on Hacker News are contained in table rows (tr elements), so we start by selecting every row in the document.
current_article = nil
current_row_type = nil
As we loop through the rows, we'll use these variables to keep track of whether we're currently processing an article row or detail row.
rows.each do |row|
  if row['class'] == 'athing'
    # This is an article row
    current_article = row
    current_row_type = 'article'
  elsif current_row_type == 'article'
    # This is the details row
    if current_article
      title_elem = current_article.css('span.title a')
      # css returns a NodeSet, which is truthy even when empty, so test with any?
      if title_elem.any?
        article_title = title_elem.text
        article_url = title_elem[0]['href']
        ...
      end
    end
    current_article = nil
    current_row_type = nil
  end
end
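We iterate through the rows using each. When a row has the class athing, it is a headline row, so we store it in current_article and set current_row_type to 'article'.
The next row after a headline contains additional details like score, author, etc. So if current_row_type is 'article', the row we are looking at is the details row for the stored headline.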
Focusing now on the key data extraction using CSS selectors:
title_elem = current_article.css('span.title a')
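Here the css method selects the anchor elements nested inside a span with the class title in the article row.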
article_title = title_elem.text
article_url = title_elem[0]['href']
From the matched element, we read the link text as the title and the href attribute as the URL.
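An href is not always an absolute URL. Links that point back into Hacker News itself (for example, discussion pages) are typically relative, such as item?id=1. A minimal sketch for normalizing them with the standard library's URI.join follows; the helper name absolute_url is an assumption for illustration.

```ruby
require 'uri'

# Hypothetical helper: resolve an href against the site's base URL.
def absolute_url(href, base = 'https://news.ycombinator.com/')
  URI.join(base, href).to_s
end

puts absolute_url('item?id=1')                 # https://news.ycombinator.com/item?id=1
puts absolute_url('https://example.com/post')  # https://example.com/post
```

URI.join leaves URLs that are already absolute unchanged, so the helper can be applied to every extracted href.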
The code continues on using additional selectors to extract score, author, comments, etc. from the detail rows:
points = subtext.css('span.score').text
author = subtext.css('a.hnuser').text
comments_elem = subtext.css('a:contains("comments")')
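comments = comments_elem.any? ? comments_elem.text : '0'
In each case, a CSS selector picks out the element inside the details row and text returns its visible text. If no comments link is found, the comment count falls back to '0'.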
And the final output prints each field scraped for every headline:
Title: ...
URL: ...
Points: ...
Author: ...
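The points and comments fields come back as display strings rather than numbers. If numeric values are needed, a small helper can pull the first integer out of each label. This is a sketch under assumptions: the helper name first_integer and the sample labels are made up for illustration.

```ruby
# Hypothetical helper: return the first integer found in a label, or 0 if there is none.
def first_integer(label)
  label.to_s[/\d+/].to_i
end

puts first_integer('123 points')        # 123
puts first_integer("45\u00a0comments")  # 45 (the separator is a non-breaking space)
puts first_integer('discuss')           # 0
puts first_integer(nil)                 # 0
```

Matching on digits rather than splitting on a space keeps the helper working when the label uses a non-breaking space.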
Full code:
require 'open-uri'
require 'nokogiri'
# Define the URL of the Hacker News homepage
url = "https://news.ycombinator.com/"
# Send a GET request to the URL and read the content
page_content = URI.open(url).read
# Parse the HTML content of the page using Nokogiri
doc = Nokogiri::HTML(page_content)
# Find all rows in the table
rows = doc.css('tr')
# Initialize variables to keep track of the current article and row type
current_article = nil
current_row_type = nil
# Iterate through the rows to scrape articles
rows.each do |row|
  if row['class'] == 'athing'
    # This is an article row
    current_article = row
    current_row_type = 'article'
  elsif current_row_type == 'article'
    # This is the details row
    if current_article
      title_elem = current_article.css('span.title a')
      # css returns a NodeSet, which is truthy even when empty, so test with any?
      if title_elem.any?
        article_title = title_elem.text # Get the text of the anchor element
        article_url = title_elem[0]['href'] # Get the href attribute of the anchor element
        subtext = row.css('td.subtext')
        points = subtext.css('span.score').text
        author = subtext.css('a.hnuser').text
        timestamp = subtext.css('span.age')[0]['title']
        comments_elem = subtext.css('a:contains("comments")')
        # Fall back to '0' when the row has no comments link
        comments = comments_elem.any? ? comments_elem.text : '0'
        # Print the extracted information
        puts "Title: #{article_title}"
        puts "URL: #{article_url}"
        puts "Points: #{points}"
        puts "Author: #{author}"
        puts "Timestamp: #{timestamp}"
        puts "Comments: #{comments}"
        puts "-" * 50 # Separating articles
      end
    end
    # Reset the current article and row type
    current_article = nil
    current_row_type = nil
  elsif row['style'] == 'height:5px'
    # This is the spacer row, skip it
    next
  end
end
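Printing with puts is fine for a quick check, but collected records are easier to reuse. As a sketch, each story could be stored as a hash and the whole list serialized with the standard json library. The field values below are made-up placeholders, not real scraped data.

```ruby
require 'json'

stories = []

# Inside the loop, appending a hash per story would replace the puts calls.
# These values are placeholders for illustration only.
stories << {
  title: 'Example story',
  url: 'https://example.com/post',
  points: '10 points',
  author: 'someuser',
  timestamp: '2024-01-01T00:00:00',
  comments: '3 comments'
}

json = JSON.pretty_generate(stories)
puts json

# Round trip: the JSON parses back into the same fields.
parsed = JSON.parse(json, symbolize_names: true)
puts parsed.first[:title]
```

The resulting string can be written to a file with File.write if the data needs to be kept between runs.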
This is great as a learning exercise, but a scraper like this sends every request from a single IP address, so it is prone to getting blocked. If you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.
Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.
Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.
Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.
The whole thing can be accessed by a simple API like below in any programming language.
In fact, you don't even have to take the pain of loading Puppeteer, since we render JavaScript behind the scenes. You can just fetch the data and parse it in any language, such as Node or PHP, or with any framework, such as Scrapy or Nutch. In all these cases you can just call the URL with render support like so:
curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"
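For the Ruby scraper in this article, the same request URL can be assembled with the standard library. This is a sketch under assumptions: the helper name proxied_url is made up, the parameter names key, render, and url are taken from the curl example, and it assumes the service accepts a percent-encoded url parameter.

```ruby
require 'uri'

# Hypothetical helper: build the request URL for the endpoint shown in the curl example.
def proxied_url(target_url, api_key:, render: true)
  # encode_www_form percent-encodes the values, including the target URL
  query = URI.encode_www_form(key: api_key, render: render, url: target_url)
  "http://api.proxiesapi.com/?#{query}"
end

puts proxied_url('https://example.com', api_key: 'API_KEY')
# http://api.proxiesapi.com/?key=API_KEY&render=true&url=https%3A%2F%2Fexample.com
```

The returned string can be passed to URI.open in place of the original Hacker News URL.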
We have a running offer of 1000 API calls completely free. Register and get your free API Key.