Scraping Hacker News Articles with R

Jan 21, 2024 · 8 min read

In this beginner web scraping tutorial, we'll walk through code that scrapes news articles from the popular Hacker News site using the rvest package in R.

Specifically, this code will extract the title, URL, points, author, timestamp, and comment count for each article on Hacker News' front page.


Installation

Before running the web scraping script, you need to install rvest:

install.packages("rvest")

rvest depends on the xml2 package, so install that first if needed:

install.packages("xml2")

You may also need to install other dependencies like httr, curl, and stringr.
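
If you'd rather grab everything in one go, install.packages() also accepts a vector of package names:

install.packages(c("xml2", "httr", "curl", "stringr", "rvest"))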

Once installed, let's load rvest:

library(rvest)

Okay, now we're ready to scrape!

Walkthrough

Let's break down what exactly this Hacker News web scraper is doing:

1. Define URL and Initialize Session

url <- "<https://news.ycombinator.com/>"

response <- read_html(url)

First, we store the HackerNews homepage URL in a variable called url.

Then, we use rvest's read_html() function to fetch that page and parse its HTML, storing the parsed document in response.
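
If you're curious about what read_html() actually gives you back, you can poke at the parsed document directly. This is just a sanity check, not part of the scraper:

# response is a parsed HTML document, not a raw HTTP response
class(response)
#> [1] "xml_document" "xml_node"

# Pull out the page's <title> text as a quick check
html_text(html_node(response, "title"))
#> [1] "Hacker News"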

2. Check that the Request Succeeded

if (length(response) > 0) {
   ...
} else {
   cat("Failed to retrieve the page.\\n")
}

It's good practice to verify that the request succeeded before trying to parse the HTML.

Here we simply check if response contains data. If not, we print an error.
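
Note that read_html() actually throws an error if the request fails outright (for example, a network problem or an HTTP error), so a length check like this will rarely fire. If you want to handle failures gracefully, a minimal sketch using tryCatch() looks like this:

# Wrap the request so a failed fetch doesn't stop the script with an error
response <- tryCatch(
  read_html(url),
  error = function(e) {
    cat("Failed to retrieve the page:", conditionMessage(e), "\n")
    NULL
  }
)

if (!is.null(response)) {
  # ... continue with the scraping logic ...
}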

3. Find All Table Rows

Inspecting the page, you'll notice that each article is housed in a <tr> tag with the class "athing".

rows <- html_nodes(response, "tr")

Hacker News displays the articles in a table, so to extract each article's data we need to loop through the table rows.

html_nodes() lets us find all table row elements using the CSS selector "tr". This stores all rows in the rows variable.
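
As a quick sanity check (purely illustrative), you can count how many rows came back and peek at the first one. Note that "tr" matches every table row on the page, including Hacker News' header bar, not just article rows, which is exactly why the loop below has to work out what kind of row it is looking at:

length(rows)
html_attr(rows[[1]], "class")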

4. Set Up Tracker Variables

current_article <- NULL
current_row_type <- NULL

As we loop through the rows, we need to keep track of the current article row we're processing and what type of row it is.

For example, the first row contains the article title and URL. The next row contains additional details like points and author. By tracking state with these variables, we can pair this data together for each article.
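
To make that pairing concrete, here is roughly what the relevant markup looks like (a simplified sketch based on inspecting the page; the real rows carry more attributes):

<tr class="athing" id="...">                <!-- title row -->
  <td class="title">
    <span class="titleline"><a href="...">Article title</a></span>
  </td>
</tr>
<tr>                                        <!-- details row -->
  <td class="subtext">
    <span class="score">123 points</span> by <a class="hnuser">author</a>
    <span class="age" title="...">2 hours ago</span> | <a href="...">45 comments</a>
  </td>
</tr>
<tr class="spacer" style="height:5px"></tr> <!-- spacer row -->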

5. Loop Through Each Row

Now let's walk through the for loop to understand how the data extraction works:

for (row in rows) {

   ...

}

We loop through each row returned by html_nodes() to process it.

6. Identify Article Row or Details Row

if ("athing" %in% html_attr(row, "class")) {

   # This is an article row
   current_article <- row
   current_row_type <- "article"

} else if (identical(current_row_type, "article")) {

   # This is the details row

}

Here is the key logic that identifies whether the current row is an article title or the additional details.

Specifically, the CSS class "athing" only occurs on article title rows. So we check for that class name using html_attr() to get a row's CSS classes.

If found, we save the row to current_article and set the type.

Else, if we previously found an article row, we know the next row must contain the additional details for that article. (Using identical() here is a small safety measure: it simply returns FALSE while current_row_type is still NULL, whereas == would leave if() with nothing to test and throw an error.)

This sets us up nicely to extract the two connected rows of data.
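
One thing worth knowing: html_attr(row, "class") returns the row's entire class attribute as a single string, so the %in% check only matches when that attribute is exactly "athing". If the markup ever carries extra classes on those rows, a substring match is more forgiving. A slightly more defensive version of the same check (an alternative, not what the original script does):

row_class <- html_attr(row, "class")
# Guard against rows with no class attribute (html_attr() returns NA there)
if (!is.na(row_class) && grepl("athing", row_class, fixed = TRUE)) {
  current_article <- row
  current_row_type <- "article"
}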

7. Extract Article Details

Now that we've identified an article row and details row pair, let's extract the data:

title_elem <- html_node(current_article, "span.titleline")

if (!is.null(title_elem)) {

  article_title <- html_text(html_node(title_elem, "a"))

  article_url <- html_attr(html_node(title_elem, "a"), "href")

  ...

}

Using html_node(), we dig into the article row to find the title element, which is a span with the class "titleline".

Specifically, html_node() with a CSS selector lets us search child nodes recursively like jQuery!

From there we use html_text() and html_attr() to extract the text content and href attribute from the anchor tag wrapped around the title.

This gives us the article's actual text title and URL!
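
As an aside, the same extraction reads nicely with the %>% pipe that rvest re-exports; this is equivalent to the nested calls above:

article_title <- title_elem %>% html_node("a") %>% html_text()
article_url   <- title_elem %>% html_node("a") %>% html_attr("href")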

I won't walk through every line, but the code continues in the same way, extracting each remaining field (points, author, timestamp, and comment count) by targeting specific elements in the HTML.

The key is using CSS selectors and XPath to target elements, plus rvest functions to pull out the text and attribute data!
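
For example, the full code below queries the details row both ways: CSS selectors for elements with a stable class name, and an XPath expression for the comments link, because "the link whose text contains 'comment'" is much easier to say in XPath:

subtext <- html_node(row, "td.subtext")

# CSS selectors work well when the element has a distinctive class
points_elem <- html_node(subtext, "span.score")
author_elem <- html_node(subtext, "a.hnuser")

# XPath handles the "link whose text mentions comments" case
comments_elem <- html_node(subtext, xpath = ".//a[contains(text(), 'comment')]")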

Finally, we print the extracted article data so we can see the scraper working:

cat("Title: ", article_title, "\\n")
cat("URL: ", article_url, "\\n")
...

And that's the core logic! We continue this process row by row to extract articles from the whole HackerNews table.

Full Code

Now that we understand each part, here is the full web scraper code:

# Load the necessary libraries
library(rvest)

# Define the URL of the Hacker News homepage
url <- "https://news.ycombinator.com/"

# Send a GET request to the URL
response <- read_html(url)

# Check if the request was successful
if (length(response) > 0) {
  # Find all rows in the table
  rows <- html_nodes(response, "tr")

  # Initialize variables to keep track of the current article and row type
  current_article <- NULL
  current_row_type <- NULL

  # Iterate through the rows to scrape articles
  for (row in rows) {
    if ("athing" %in% html_attr(row, "class")) {
      # This is an article row
      current_article <- row
      current_row_type <- "article"
    } else if (identical(current_row_type, "article")) {
      # This is the details row
      if (!is.null(current_article)) {
        # Extract information from the current article and details row
        title_elem <- html_node(current_article, "span.titleline")
        if (!is.null(title_elem)) {
          article_title <- html_text(html_node(title_elem, "a"))
          article_url <- html_attr(html_node(title_elem, "a"), "href")

          subtext <- html_node(row, "td.subtext")
          points <- gsub("\\D", "", html_text(html_node(subtext, "span.score")))
          author <- html_text(html_node(subtext, "a.hnuser"))
          timestamp <- html_attr(html_node(subtext, "span.age"), "title")
          comments_elem <- html_node(subtext, xpath = ".//a[contains(text(), 'comment')]")
          # html_node() returns a "missing node" (not NULL) when nothing matches,
          # so check for NA text rather than relying on is.null()
          comments_text <- html_text(comments_elem)
          comments <- ifelse(is.na(comments_text), "0", comments_text)

          # Print the extracted information
          cat("Title: ", article_title, "\n")
          cat("URL: ", article_url, "\n")
          cat("Points: ", points, "\n")
          cat("Author: ", author, "\n")
          cat("Timestamp: ", timestamp, "\n")
          cat("Comments: ", comments, "\n")
          cat("-" * 50, "\n")  # Separating articles
        }
      }
      # Reset the current article and row type
      current_article <- NULL
      current_row_type <- NULL
    } else if ("height:5px" == html_attr(row, "style")) {
      # This is the spacer row, skip it
      next
    }
  }
} else {
  cat("Failed to retrieve the page.\n")
}

This is great as a learning exercise, but in the real world a scraper like this is prone to getting blocked because every request comes from a single IP. If you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • With millions of high speed rotating proxies located all over the world,
  • With our automatic IP rotation
  • With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
  • With our automatic CAPTCHA solving technology,
  • Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed by a simple API like below in any programming language.

In fact, you don't even have to take the pain of loading Puppeteer as we render JavaScript behind the scenes, and you can just get the data and parse it in any language like Node, Puppeteer, or PHP, or with any framework like Scrapy or Nutch. In all these cases you can just call the URL with render support like so:

curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"

We have a running offer of 1000 API calls completely free. Register and get your free API Key.

