Scraping Reddit Posts with Rust

Jan 9, 2024 · 5 min read

In this article, we'll be walking through code that scrapes Reddit to extract information about posts. We'll go step-by-step to understand what's happening behind the scenes.

here is the page we are talking about

Prerequisites

To follow along, you'll need:

  • Rust installed on your system
  • The reqwest and select crates added to your Cargo.toml file
  • To install Rust, follow the instructions here.

    The reqwest and select crates handle making web requests and HTML parsing respectively. To add them, run:

    $ cargo add reqwest
    $ cargo add select
    

    With those done, you're all set! Let's get scraping.

    Walkthrough

    We first import the crates and types we need:

    extern crate reqwest;
    extern crate select;
    
    use reqwest::header;
    use select::document::Document;
    use select::node::Node;
    

    Then we define the target URL and User-Agent string:

    let reddit_url = "<https://www.reddit.com>";
    
    let user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36";
    

    The user agent helps make requests look like they're coming from a real browser.

    Next, we create a reqwest client with that user agent:

    let client = reqwest::blocking::Client::builder()
        .user_agent(user_agent)
        .build()?;
    

    We use a blocking client for simplicity here.

    Then we send a GET request to the Reddit URL:

    let response = client.get(reddit_url).send()?;
    

    We check if it was successful (200 status code):

    if response.status().is_success() {
      // ...
    }
    

    Inside the if statement is the scraping logic.

    First we get the HTML content as a string:

    let html_content = response.text()?;
    

    We save it to a file:

    let filename = "reddit_page.html";
    std::fs::write(filename, &html_content)?;
    

    Then we parse the HTML into a document using Select:

    let document = Document::from_read(html_content.as_bytes()).unwrap();
    

    Inspecting the elements

    Upon inspecting the HTML in Chrome, you will see that each of the posts have a particular element shreddit-post and class descriptors specific to them…

    This is where things get interesting! We find all posts with:

    for node in document.find(Name("shreddit-post").and(Class("block relative cursor-pointer bg-neutral-background focus-within:bg-neutral-background-hover hover:bg-neutral-background-hover xs:rounded-[16px] p-md my-2xs nd:visible"))) {
    
    }
    

    This selector searches for elements with:

  • shreddit-post tag
  • Classes for styling, hover states etc.
  • The key thing is that it finds Reddit post blocks.

    Inside the loop, we extract post data by chaining more Select calls:

    Permalink

    let permalink = node.attr("permalink").unwrap_or("");
    

    Gets the permalink attribute.

    Content URL

    let content_href = node.attr("content-href").unwrap_or("");
    

    Gets content-href attribute.

    Comment Count

    let comment_count = node.attr("comment-count").unwrap_or("");
    

    Gets comment-count attribute.

    Post Title

    let post_title = node.find(Name("div").and(Attr("slot", "title"))).next().unwrap().text();
    

    Finds the title div, gets text within.

    Author

    let author = node.attr("author").unwrap_or("");
    

    Gets author attribute.

    Score

    let score = node.attr("score").unwrap_or("");
    

    Gets score attribute.

    Finally, we print out all extracted fields!

    The key things to note are:

  • We use Select's expressive finder methods to extract data
  • We rely on HTML attributes and classes that Reddit adds
  • And that's it! Full code:

    extern crate reqwest;
    extern crate select;
    
    use reqwest::header;
    use select::document::Document;
    use select::node::Node;
    
    fn main() -> Result<(), reqwest::Error> {
        // Define the Reddit URL you want to download
        let reddit_url = "https://www.reddit.com";
    
        // Define a User-Agent header
        let user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36";
    
        // Create a reqwest client with the User-Agent header
        let client = reqwest::blocking::Client::builder()
            .user_agent(user_agent)
            .build()?;
    
        // Send a GET request to the URL
        let response = client.get(reddit_url).send()?;
    
        // Check if the request was successful (status code 200)
        if response.status().is_success() {
            // Get the HTML content of the page as a string
            let html_content = response.text()?;
    
            // Specify the filename to save the HTML content
            let filename = "reddit_page.html";
    
            // Save the HTML content to a file
            std::fs::write(filename, &html_content)?;
    
            println!("Reddit page saved to {}", filename);
    
            // Parse the HTML content
            let document = Document::from_read(html_content.as_bytes()).unwrap();
    
            // Find all blocks with the specified tag and class
            for node in document.find(Name("shreddit-post").and(Class("block relative cursor-pointer bg-neutral-background focus-within:bg-neutral-background-hover hover:bg-neutral-background-hover xs:rounded-[16px] p-md my-2xs nd:visible"))) {
                let permalink = node.attr("permalink").unwrap_or("");
                let content_href = node.attr("content-href").unwrap_or("");
                let comment_count = node.attr("comment-count").unwrap_or("");
                let post_title = node.find(Name("div").and(Attr("slot", "title"))).next().unwrap().text();
                let author = node.attr("author").unwrap_or("");
                let score = node.attr("score").unwrap_or("");
    
                // Print the extracted information for each block
                println!("Permalink: {}", permalink);
                println!("Content Href: {}", content_href);
                println!("Comment Count: {}", comment_count);
                println!("Post Title: {}", post_title);
                println!("Author: {}", author);
                println!("Score: {}", score);
                println!("\n");
            }
        } else {
            println!("Failed to download Reddit page (status code {})", response.status());
        }
    
        Ok(())
    }
    

    Browse by tags:

    Browse by language:

    Tired of getting blocked while scraping the web?

    ProxiesAPI handles headless browsers and rotates proxies for you.
    Get access to 1,000 free API credits, no credit card required!