Scraping Reddit Posts with R

Jan 9, 2024 · 6 min read

In this article, we'll go through R code to scrape various data from Reddit posts. We'll look at how to send requests, handle responses, extract information, and iterate through multiple posts.


Setup

We'll utilize two useful R packages for scraping web pages:

library(httr)
library(rvest)

The httr package allows us to easily send HTTP requests and check responses. The rvest package helps parse and extract information from HTML and XML pages through CSS selectors.

Define Target URL

We first store the Reddit homepage URL that we want to scrape:

reddit_url <- "https://www.reddit.com"

Set User-Agent Header

Many sites check request headers to identify automated traffic. So we define a browser User-Agent to seem like a normal visitor:

# add_headers(.headers = ...) expects a named character vector
headers <- c(
  "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
)

Send GET Request

We use the GET() function to send a request to the URL, attaching our User-Agent header:

response <- httr::GET(url = reddit_url, add_headers(.headers=headers))
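
To see what add_headers() actually attaches, the header configuration can be built and inspected without sending anything. The following is a minimal sketch: the fetch_page helper and the 10-second timeout are assumptions for illustration, not part of the original script.

```r
library(httr)

ua_string <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"

# Named character vector of headers, passed through .headers
header_vec <- c("User-Agent" = ua_string)

# Build the header configuration; no request is sent at this point
header_config <- add_headers(.headers = header_vec)

# Hypothetical helper: GET a URL with these headers and an assumed 10-second timeout
fetch_page <- function(url) {
  GET(url, header_config, timeout(10))
}

# Inspect the configured header without touching the network
print(header_config$headers[["User-Agent"]])
```

Calling fetch_page("https://www.reddit.com") would then send the request with the same header.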

Check Status Code

It's important to verify that the request succeeded by checking the status code:

if (httr::status_code(response) == 200) {
  # Request succeeded logic
} else {
  # Request failed logic
}

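Status code 200 means our request succeeded. Codes in the 3xx range indicate redirects, and 4xx and 5xx codes signal client and server errors (for example 403 Forbidden, 429 Too Many Requests, or 503 Service Unavailable).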
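
The status classes can be sketched as a small lookup. The describe_status helper below is hypothetical and only illustrates the grouping by the first digit; httr itself also provides http_error() and stop_for_status() for checking a response.

```r
# Hypothetical helper: map an HTTP status code to its broad class
describe_status <- function(code) {
  switch(as.character(code %/% 100),
    "2" = "success",
    "3" = "redirect",
    "4" = "client error",
    "5" = "server error",
    "other"  # fallback for anything outside 2xx-5xx
  )
}

print(describe_status(200))
print(describe_status(429))
```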

Save Raw HTML

Since our request succeeded, we can save the raw HTML to a file for later parsing:

html_content <- httr::content(response, "text")

filename <- "reddit_page.html"

# Write the HTML text to disk as-is
cat(html_content, file = filename, sep = "")

We use content() to extract page text and save it with cat().
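
A quick way to confirm the file was written correctly is to read it back and compare. This sketch uses a made-up HTML string and a temporary file instead of a live response, and it swaps in writeLines() with useBytes = TRUE as an alternative to cat() so the bytes are written unchanged.

```r
# Made-up HTML standing in for the downloaded page
html_content <- "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# Temporary file so the example leaves nothing behind in the working directory
filename <- tempfile(fileext = ".html")

# Write the text without re-encoding it
writeLines(html_content, filename, useBytes = TRUE)

# Read the file back and rebuild a single string
restored <- paste(readLines(filename, encoding = "UTF-8", warn = FALSE), collapse = "\n")

print(identical(restored, html_content))
```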

Read HTML Content

To start extracting information, we need to read the HTML content into R using the rvest package:

parsed_html <- read_html(html_content)
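
read_html() accepts a literal HTML string as well as a file path or URL, which makes it easy to try out on a small document first. A minimal sketch with made-up markup:

```r
library(rvest)

# Made-up HTML; in the article this string comes from the Reddit response
html_content <- "<html><head><title>Demo page</title></head><body><h1> Hello </h1></body></html>"

parsed_html <- read_html(html_content)

# Pull single nodes out of the parsed document and read their text
page_title <- parsed_html %>% html_node("title") %>% html_text()
heading <- parsed_html %>% html_node("h1") %>% html_text(trim = TRUE)

print(page_title)
print(heading)
```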

Understand CSS Selectors

Inspecting the elements

Upon inspecting the HTML in Chrome, you will see that each post is a shreddit-post element with its own set of class names…

CSS selectors allow us to target specific elements in HTML/XML documents. They are extremely powerful but can be confusing for beginners.

We will go through this complex selector step-by-step:

.block.relative.cursor-pointer.bg-neutral-background.focus-within:bg-neutral-background-hover.hover:bg-neutral-background-hover.xs:rounded-[16px].p-md.my-2xs.nd:visible

Breaking this down:

.block - Target elements with class="block"

.relative - Also match class="relative"

.cursor-pointer - And class="cursor-pointer"

And so on for each subsequent class name. We are matching elements that have ALL these classes.

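Names such as focus-within:bg-neutral-background-hover and hover:bg-neutral-background-hover are single utility class names that happen to contain a colon. They style the element's hover and focus states, but they are not CSS pseudo-classes.

Likewise, nd:visible is a class name, not a :visible filter. A selector engine reads an unescaped colon as the start of a pseudo-class and an unescaped [ as the start of an attribute test, so these names must be escaped (for example hover\:bg-neutral-background-hover) before they can appear in a selector, or simply be left out.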

Some key things for beginners:

  • Syntax is .class-name
  • Multiple classes use .class1.class2.class3
  • Chaining classes targets elements matching all classes

There is still more we could unpack about selectors but this covers the key ideas. Let's see it in action.
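
The chaining rule can be checked on a toy document before pointing it at Reddit. The markup below is made up for illustration; only the elements that carry both classes match the chained selector.

```r
library(rvest)

# Made-up markup: one element with both classes, two with only one each
toy_html <- paste0(
  '<div class="block relative">both</div>',
  '<div class="block">only block</div>',
  '<div class="relative">only relative</div>'
)

doc <- read_html(toy_html)

# .block.relative requires BOTH classes on the same element
both <- doc %>% html_nodes(".block.relative") %>% html_text()

# .block alone matches every element that has the block class
block_only <- doc %>% html_nodes(".block") %>% html_text()

print(both)
print(block_only)
```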

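Extract Post Data

We apply a selector to the parsed HTML, saving the post blocks into a variable:

# Match <shreddit-post> elements that carry the first three classes.
# Class names containing ":" or "[" are left out because they would need escaping.
blocks <- parsed_html %>%
  html_nodes("shreddit-post.block.relative.cursor-pointer")

This gives us a collection of post block elements from the page.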
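
Before looping, it helps to confirm that the selector matched anything at all, since an empty result usually means the page markup has changed. A minimal sketch on made-up markup that imitates the shreddit-post structure (the attribute values are invented):

```r
library(rvest)

# Made-up markup imitating two Reddit post elements
toy_html <- paste0(
  '<shreddit-post class="block relative cursor-pointer" author="alice"></shreddit-post>',
  '<shreddit-post class="block relative cursor-pointer" author="bob"></shreddit-post>',
  '<div class="block">not a post</div>'
)

blocks <- read_html(toy_html) %>%
  html_nodes("shreddit-post.block.relative.cursor-pointer")

# Number of matched post blocks and their tag names
n_blocks <- length(blocks)
tag_names <- html_name(blocks)

if (n_blocks == 0) {
  message("No post blocks matched; the selector may be out of date.")
}

print(n_blocks)
print(tag_names)
```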

Loop Through Posts

Next we iterate through each block, extracting various data points:

for (block in blocks) {
  permalink <- block %>% html_attr("permalink")
  content_href <- block %>% html_attr("content-href")
  comment_count <- block %>% html_attr("comment-count")
  post_title <- block %>% html_node("[slot='title']") %>% html_text(trim = TRUE)
  author <- block %>% html_attr("author")
  score <- block %>% html_attr("score")

  # Print extracted data
  cat(permalink, "\n")
  cat(content_href, "\n")
  # ... print the remaining fields the same way
}

Breaking this down:

  • We loop through each block element
  • Use html_attr() to extract attributes like "permalink"
  • For text, we target nodes like html_node("[slot='title']")
  • html_text() gets the text inside nodes
  • Print out data after each loop

This gives us the ability to extract many posts systematically.
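
As an alternative to printing inside a loop, html_attr() and html_node() also work on a whole node set at once, so the same fields can be collected into a data frame. This is a sketch on made-up markup; the permalinks, authors, and scores are invented for illustration.

```r
library(rvest)

# Made-up markup imitating two Reddit post elements
toy_html <- '
<shreddit-post permalink="/r/rstats/comments/abc/first/" author="alice" score="10" comment-count="3">
  <a slot="title"> First post </a>
</shreddit-post>
<shreddit-post permalink="/r/rstats/comments/def/second/" author="bob" score="25" comment-count="7">
  <a slot="title">Second post</a>
</shreddit-post>'

blocks <- read_html(toy_html) %>% html_nodes("shreddit-post")

# Each call returns one value per block, so the columns line up row by row
posts <- data.frame(
  permalink = blocks %>% html_attr("permalink"),
  title = blocks %>% html_node("[slot='title']") %>% html_text(trim = TRUE),
  author = blocks %>% html_attr("author"),
  score = as.integer(blocks %>% html_attr("score")),
  comment_count = as.integer(blocks %>% html_attr("comment-count")),
  stringsAsFactors = FALSE
)

print(posts)
```

A data frame like this is easier to filter, sort, or write out than values printed with cat().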

The key ideas are:

  • Use html_attr() for attributes like "href"
  • Target nodes for text content with additional selectors
  • Loop through elements to repeat extraction

Full Code

Here is the full Reddit scraping script:

library(httr)
library(rvest)

# Define the Reddit URL you want to download
reddit_url <- "https://www.reddit.com"

# Define a User-Agent header (add_headers() expects a named character vector)
headers <- c(
  "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
)

# Send a GET request to the URL with the User-Agent header
response <- httr::GET(url = reddit_url, add_headers(.headers = headers))

# Stop early if the request was not successful (status code 200),
# so the parsing steps below never run on a missing page
if (httr::status_code(response) != 200) {
  stop("Failed to download Reddit page (status code ", httr::status_code(response), ")")
}

# Get the HTML content of the page
html_content <- httr::content(response, "text", encoding = "UTF-8")

# Specify the filename to save the HTML content
filename <- "reddit_page.html"

# Save the HTML content to a file
cat(html_content, file = filename, sep = "")
cat("Reddit page saved to ", filename, "\n", sep = "")

# Parse the entire HTML content
parsed_html <- read_html(html_content)

# Find all post blocks: <shreddit-post> elements with the first three classes.
# Class names containing ":" or "[" are left out because they would need escaping.
blocks <- parsed_html %>%
  html_nodes("shreddit-post.block.relative.cursor-pointer")

# Iterate through the blocks and extract information from each one
for (block in blocks) {
  permalink <- block %>% html_attr("permalink")
  content_href <- block %>% html_attr("content-href")
  comment_count <- block %>% html_attr("comment-count")
  post_title <- block %>% html_node("[slot='title']") %>% html_text(trim = TRUE)
  author <- block %>% html_attr("author")
  score <- block %>% html_attr("score")

  # Print the extracted information for each block
  cat("Permalink: ", permalink, "\n")
  cat("Content Href: ", content_href, "\n")
  cat("Comment Count: ", comment_count, "\n")
  cat("Post Title: ", post_title, "\n")
  cat("Author: ", author, "\n")
  cat("Score: ", score, "\n\n")
}
