Scraping New York Times News Headlines with Rust

Dec 6, 2023 · 6 min read

Have you ever wanted to automatically collect or analyze data from a website? Web scraping allows you to programmatically extract information from web pages - say, to grab article headlines from the New York Times homepage. It can be extremely useful for data science, journalistic research, market research, and more.

In this post, we'll walk through a full Rust program that scrapes titles and links from the NYTimes homepage. Along the way, we'll learn key concepts like:

  • Making structured requests with the reqwest crate
  • Parsing HTML using the scraper crate
  • Using CSS selectors to target elements
  • Structuring scraped data

Even if you're new to Rust, you'll see just how much you can accomplish. Let's get started!

    Our Use Case

    Why scrape the New York Times home page? We could imagine several scenarios:

  • Analyzing what topics the Times is covering right now
  • Tracking long-term trends in headlines over time
  • Building a tool that alerts you whenever a headline matches given keywords
  • Populating a prototype news aggregator app

    The Times homepage actually changes quite frequently, with new stories cycling into the top headlines slot. Scraping allows us to capture snapshots programmatically.

    There are certainly APIs and feeds that would enable some of this too - but rolling your own scraper opens up more possibilities!

    Making a Request

    We'll use the popular reqwest crate for making web requests.

    use reqwest::header::{HeaderMap, HeaderValue, USER_AGENT};
    use reqwest::StatusCode;

    This brings in the types we need to build a header map, plus StatusCode for checking the response later.

    Next we construct a "User-Agent" header to identify our scraper:

    let headers = {
      let mut custom_headers = HeaderMap::new();
      // from_static avoids the `?` here: our function returns reqwest::Error,
      // and a header-parsing error does not convert into that type
      custom_headers.insert(USER_AGENT, HeaderValue::from_static("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"));
      custom_headers
    };
    

    This mimics a Chrome browser on Windows. Sites like the Times can block suspicious requests, so posing as a real browser helps ensure we get a proper response.

    Now we can make the GET request:

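    let client = reqwest::blocking::Client::new();
    let resp = client.get("https://www.nytimes.com/").headers(headers).send()?;

    We use the blocking API (enabled through reqwest's "blocking" feature) since this is a basic script. The request goes through a Client because the reqwest::blocking::get shortcut sends the request immediately and leaves no chance to attach headers. The .headers(headers) part attaches our user agent, and .send() actually sends off the request.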

    Checking the Response

    It's good practice to verify the request was successful before trying to parse the response:

    if resp.status() == StatusCode::OK {
      // parsing logic here
    } else {
      println!("Failed with status: {}", resp.status());
    }

    If the request fails, this prints the status code, avoiding confusion later if our parser code runs but finds no data.

    Parsing HTML

    To extract information out of the HTML response, we'll use the very handy scraper crate.

    We first import the two scraper types we need, then get the response text and parse it into a structure scraper understands:

    use scraper::{Html, Selector};

    let body = resp.text()?;
    let document = Html::parse_document(&body);
    

    Inspecting the page

    Next, we use Inspect Element in Chrome to see how the page's markup is structured.

    You can see that the articles are contained inside section tags with the class story-wrapper.

    The scraper crate lets us select elements by CSS selector. We start by compiling a selector for those wrappers:

    let story_wrappers = Selector::parse(".story-wrapper").unwrap();
    

    This targets the .story-wrapper elements that contain individual headlines. Next we iterate over these, extracting what we need:

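    // select() takes a compiled Selector, not a string, so parse these once up front
    let title_selector = Selector::parse(".indicate-hover").unwrap();
    let link_selector = Selector::parse(".css-9mylee").unwrap();

    let mut article_titles = Vec::new();
    let mut article_links = Vec::new();

    for element in document.select(&story_wrappers) {
      if let Some(title_element) = element.select(&title_selector).next() {
        let article_title = title_element.text().collect::<String>();
        article_titles.push(article_title);
      }
      if let Some(link_element) = element.select(&link_selector).next() {
        // Skip links without an href instead of panicking on unwrap()
        if let Some(href) = link_element.value().attr("href") {
          article_links.push(href.to_string());
        }
      }
    }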
    

    There's a bit going on here:

  • For each story wrapper, we select the key elements using additional CSS selectors
  • We handle missing elements using Option (None if an element doesn't exist)
  • We build two Vecs storing headlines/links from matched elements
  • Helpfully, scraper has methods like text() and attr() to extract data

    Finally, we can print the results!
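
    One caveat with two separate Vecs is that they can drift out of step when a story has a title but no link (or the reverse). A minimal sketch of an alternative, using only the standard library, keeps each title and link together in one struct. The Article type and the make_article helper are hypothetical names introduced here for illustration; the two Option arguments stand in for the results of the title and link lookups.

```rust
/// One scraped story: the headline text and the URL it points to.
/// (Hypothetical type, not part of the original program.)
#[derive(Debug, Clone, PartialEq)]
struct Article {
    title: String,
    link: String,
}

/// Builds an Article only when both pieces are present and non-empty.
/// `title` and `href` stand in for the results of the two select() lookups.
fn make_article(title: Option<&str>, href: Option<&str>) -> Option<Article> {
    // Collapse the runs of whitespace that often surround scraped text
    let title = title?.split_whitespace().collect::<Vec<_>>().join(" ");
    let link = href?.trim().to_string();
    if title.is_empty() || link.is_empty() {
        return None;
    }
    Some(Article { title, link })
}
```

    Inside the loop, the result could be pushed into a single Vec of Article values, so a story missing either piece is skipped as a whole rather than leaving the two lists misaligned.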

    Putting It All Together

    Here is the full code:

    use reqwest::header::{HeaderMap, HeaderValue, USER_AGENT};
    use reqwest::StatusCode;
    use scraper::{Html, Selector};

    fn main() -> Result<(), reqwest::Error> {

      // Build a custom header map carrying a browser-like User-Agent
      let headers = {
        let mut custom_headers = HeaderMap::new();
        custom_headers.insert(USER_AGENT, HeaderValue::from_static("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"));
        custom_headers
      };

      // Send the request through a Client so the headers can be attached
      let client = reqwest::blocking::Client::new();
      let resp = client.get("https://www.nytimes.com/").headers(headers).send()?;

      // Verify the response before parsing
      if resp.status() == StatusCode::OK {

        let body = resp.text()?;
        let document = Html::parse_document(&body);

        // Compile each CSS selector once, outside the loop
        let story_wrappers = Selector::parse(".story-wrapper").unwrap();
        let title_selector = Selector::parse(".indicate-hover").unwrap();
        let link_selector = Selector::parse(".css-9mylee").unwrap();

        let mut article_titles = Vec::new();
        let mut article_links = Vec::new();

        for element in document.select(&story_wrappers) {

          if let Some(title_element) = element.select(&title_selector).next() {
            let article_title = title_element.text().collect::<String>();
            article_titles.push(article_title);
          }

          if let Some(link_element) = element.select(&link_selector).next() {
            // Skip links without an href instead of panicking on unwrap()
            if let Some(href) = link_element.value().attr("href") {
              article_links.push(href.to_string());
            }
          }

        }

        // Print results; zip stops at the shorter Vec, so a story that is
        // missing a title or a link cannot cause an out-of-bounds panic
        for (title, link) in article_titles.iter().zip(article_links.iter()) {
          println!("Title: {}", title);
          println!("Link: {}", link);
          println!();
        }

      } else {
        println!("Request failed with status: {}", resp.status());
      }

      Ok(())
    }
    

    And we're done! Running this prints out nice title and link pairs for the top stories.

    There are lots more directions we could take this project - hopefully this gives you a solid starting point for your own Rust web scrapers!
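
    As one example of such a direction, the keyword alert idea from the use cases could start from a small filter over the scraped titles. This is only a sketch under simple assumptions: matching is a case-insensitive substring test, and matching_headlines is a hypothetical helper name, not something from the program above.

```rust
/// Returns the headlines that contain at least one keyword, ignoring case.
/// (Hypothetical helper; substring matching is an assumption made for brevity.)
fn matching_headlines<'a>(headlines: &'a [String], keywords: &[&str]) -> Vec<&'a String> {
    // Lowercase the keywords once instead of on every comparison
    let keywords: Vec<String> = keywords.iter().map(|k| k.to_lowercase()).collect();
    headlines
        .iter()
        .filter(|headline| {
            let lower = headline.to_lowercase();
            keywords.iter().any(|k| lower.contains(k.as_str()))
        })
        .collect()
}
```

    Calling matching_headlines(&article_titles, &["election", "rates"]) after the scraping loop would return only the titles worth alerting on.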

    Key Takeaways

  • Web scraping involves making requests and parsing the HTML programmatically
  • The reqwest and scraper crates make this very straightforward in Rust
  • Mimicking a real browser helps avoid blocks from sites like NYTimes
  • CSS selectors allow us to target and extract specific elements
  • Always handle missing data and verify the request succeeded
  • In more advanced implementations, you may even need to rotate the User-Agent string so the website can't tell it's the same browser!
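
    User-Agent rotation can be sketched with nothing more than a list and a counter. The pool below is illustrative only (the strings other than the Chrome one used earlier are assumptions, and a real project would keep a longer, up-to-date list), and user_agent_for is a hypothetical helper name.

```rust
/// Illustrative pool of User-Agent strings (assumed values, not a vetted list).
const USER_AGENTS: [&str; 3] = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
];

/// Picks the User-Agent for the n-th request by cycling through the pool.
fn user_agent_for(request_index: usize) -> &'static str {
    USER_AGENTS[request_index % USER_AGENTS.len()]
}
```

    The returned string could then be passed to HeaderValue::from_static when building the headers for each request.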

    As you get a little more advanced, you will realize that the server can simply block your IP address, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.

    Overcoming IP Blocks

    Investing in a private rotating proxy service like Proxies API can, most of the time, make the difference between a headache-free web scraping project that gets the job done consistently and one that never really works.

    Plus, with our running offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • With millions of high speed rotating proxies located all over the world,
  • With our automatic IP rotation,
  • With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions),
  • With our automatic CAPTCHA solving technology,

    hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed through a simple API call, like the one below, from any programming language.

    curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
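
    In a Rust scraper, the same call amounts to building that URL and requesting it instead of the target page. The sketch below only mirrors the shape of the curl command; proxied_url is a hypothetical helper, and passing the target URL through unencoded is an assumption taken from the curl example rather than from any API documentation.

```rust
/// Builds the request URL in the same shape as the curl example:
/// http://api.proxiesapi.com/?key=API_KEY&url=TARGET
/// Assumption: the target URL is passed through as-is, exactly as in the curl command.
fn proxied_url(api_key: &str, target_url: &str) -> String {
    format!("http://api.proxiesapi.com/?key={}&url={}", api_key, target_url)
}
```

    The resulting string would take the place of the Times URL passed to the client's get call. Whether target URLs that carry their own query strings need percent-encoding is not covered here.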

    We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
