Scraping New York Times News Headlines with Rust

Have you ever wanted to automatically collect or analyze data from a website? Web scraping allows you to programmatically extract information from web pages - say, to grab article headlines from the New York Times homepage. It can be extremely useful for data science, journalistic research, market research, and more.

In this post, we'll walk through a full Rust program that scrapes titles and links from the NYTimes homepage. Along the way, we'll learn key concepts like:

Making structured requests with the reqwest crate

Parsing HTML using the scraper crate

Using CSS selectors to target elements

Structuring scraped data

Even if you're new to Rust, you'll see just how much you can accomplish. Let's get started!

Our Use Case

Why scrape the New York Times home page? We could imagine several scenarios:

Analyzing what topics the Times is covering right now

Tracking long-term trends in headlines over time

Building a tool that alerts you whenever a headline matches given keywords

Populating a prototype news aggregator app

The Times homepage actually changes quite frequently, with new stories cycling into the top headlines slot. Scraping allows us to capture snapshots programmatically.

There are certainly APIs and feeds that would enable some of this too - but rolling your own scraper opens up more possibilities!

Making a Request

We'll use the popular reqwest crate for making web requests.

use reqwest::{header::HeaderMap, StatusCode};

This imports HeaderMap, which we'll use to build request headers, and StatusCode, which we'll use to check the response status.
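For reference, a Cargo.toml along these lines should pull in both crates used in this post. The version numbers are assumptions, so check crates.io for the current releases. The blocking feature is listed because reqwest's blocking API is not enabled by default.

```toml
[dependencies]
# Versions below are assumptions; check crates.io for the current releases.
reqwest = { version = "0.11", features = ["blocking"] }
scraper = "0.18"
```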

Next we construct a "user agent" header to identify our scraper:

let headers = {
  let mut custom_headers = HeaderMap::new();
  custom_headers.insert("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36".parse()?);
  custom_headers
};

This mimics a Chrome browser on Windows. Sites like the Times can block suspicious requests, so posing as a real browser helps ensure we get a proper response.

Now we can make the GET request:

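let resp = reqwest::blocking::Client::new().get("https://www.nytimes.com/").headers(headers).send()?;

We use the blocking API since this is a basic script (it requires reqwest's "blocking" feature). Note that the reqwest::blocking::get shortcut sends the request immediately and leaves no room for custom headers, so we build the request through a Client instead. The .headers(headers) part attaches our user agent, and .send() actually sends off the request.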

Checking the Response

It's good practice to verify the request was successful before trying to parse the response:

if resp.status() == StatusCode::OK {
  // parsing logic here
} else {
  println!("Failed with status: {}", resp.status());
}

This prints the status code of a failed request, avoiding confusion if our parser code runs but finds no data.

Parsing HTML

To extract information out of the HTML response, we'll use the very handy scraper crate.

We first get the response text and parse it into a structure scraper understands:

let body = resp.text()?;

let document = Html::parse_document(&body);

Inspecting the page

Next, we use Inspect Element in Chrome to see how the page's markup is structured.

You can see that the articles are contained inside section tags with the class story-wrapper.

The parsed document can be queried with CSS selectors, which scraper compiles into Selector values:

let story_wrappers = Selector::parse(".story-wrapper").unwrap();

This targets the .story-wrapper elements that contain individual headlines. Next we iterate over these, extracting what we need:

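// select() takes a compiled Selector, not a string
let title_selector = Selector::parse(".indicate-hover").unwrap();
let link_selector = Selector::parse(".css-9mylee").unwrap();

let mut article_titles = Vec::new();
let mut article_links = Vec::new();

for element in document.select(&story_wrappers) {
  let title = element.select(&title_selector).next();
  let link = element.select(&link_selector).next().and_then(|a| a.value().attr("href"));

  // Keep a story only when both parts exist, so the two Vecs stay aligned
  if let (Some(title), Some(link)) = (title, link) {
    article_titles.push(title.text().collect::<String>());
    article_links.push(link.to_string());
  }
}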

There's a bit going on here:

For each story wrapper, we select the key elements using additional CSS selectors

We handle missing elements using Option (None if an element doesn't exist)

We build two Vecs storing headlines/links from matched elements

Helpfully, scraper has methods like text() and attr() to extract data
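The intro mentions structuring scraped data, and one way to do that is to keep each title and link together in a single value instead of two parallel Vecs. The sketch below is a minimal illustration using only the standard library: the Article struct and the to_article helper are hypothetical names, and the sample values stand in for what the selectors might return.

```rust
/// One scraped headline together with its link (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct Article {
    title: String,
    link: String,
}

/// Builds an Article only when both pieces were found and are non-empty
/// after trimming surrounding whitespace.
fn to_article(title: Option<&str>, link: Option<&str>) -> Option<Article> {
    let title = title?.trim();
    let link = link?.trim();
    if title.is_empty() || link.is_empty() {
        return None;
    }
    Some(Article {
        title: title.to_string(),
        link: link.to_string(),
    })
}

fn main() {
    // Placeholder values standing in for three story wrappers.
    let raw = [
        (Some("  Headline one "), Some("https://www.nytimes.com/a")),
        (Some("Headline two"), None), // link missing
        (None, Some("https://www.nytimes.com/c")), // title missing
    ];

    // filter_map drops every wrapper that did not yield a complete Article.
    let articles: Vec<Article> = raw
        .iter()
        .filter_map(|(title, link)| to_article(*title, *link))
        .collect();

    for article in &articles {
        println!("{} -> {}", article.title, article.link);
    }
}
```

With this shape, a title can never be paired with the wrong link, because incomplete stories are dropped before they reach the list.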

Finally, we can print the results!

Putting It All Together

Here is the full code:

use reqwest::{header::HeaderMap, StatusCode};
use scraper::{Html, Selector};

// Box<dyn Error> covers both reqwest errors and the header parse error
fn main() -> Result<(), Box<dyn std::error::Error>> {

  // Building custom header
  let headers = {
    let mut custom_headers = HeaderMap::new();
    custom_headers.insert("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36".parse()?);
    custom_headers
  };

  // Making request (a Client lets us attach headers before sending)
  let resp = reqwest::blocking::Client::new()
    .get("https://www.nytimes.com/")
    .headers(headers)
    .send()?;

  // Verifying response
  if resp.status() == StatusCode::OK {

    // Parsing the HTML body
    let body = resp.text()?;
    let document = Html::parse_document(&body);

    // select() takes compiled selectors, so we parse them up front
    let story_wrappers = Selector::parse(".story-wrapper").unwrap();
    let title_selector = Selector::parse(".indicate-hover").unwrap();
    let link_selector = Selector::parse(".css-9mylee").unwrap();

    let mut article_titles = Vec::new();
    let mut article_links = Vec::new();

    for element in document.select(&story_wrappers) {
      let title = element.select(&title_selector).next();
      let link = element
        .select(&link_selector)
        .next()
        .and_then(|a| a.value().attr("href"));

      // Keep a story only when both parts exist, so the two Vecs stay aligned
      if let (Some(title), Some(link)) = (title, link) {
        article_titles.push(title.text().collect::<String>());
        article_links.push(link.to_string());
      }
    }

    // Printing results
    for (title, link) in article_titles.iter().zip(&article_links) {
      println!("Title: {}", title);
      println!("Link: {}", link);
      println!();
    }

  } else {
    println!("Request failed with status: {}", resp.status());
  }

  Ok(())
}

And we're done! Running this prints out nice title and link pairs for the top stories.

There are lots more directions we could take this project - hopefully this gives you a solid starting point for your own Rust web scrapers!
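One such direction is the keyword alert mentioned under Our Use Case. The sketch below shows only the matching step, using the standard library: matching_headlines is a hypothetical helper, the headlines are made-up placeholders, and a real tool would feed in the scraped titles and decide how to notify you.

```rust
/// Returns the headlines that contain any of the keywords, ignoring case.
/// Empty keywords are skipped so they do not match every headline.
fn matching_headlines<'a>(headlines: &'a [String], keywords: &[&str]) -> Vec<&'a str> {
    let keywords: Vec<String> = keywords
        .iter()
        .filter(|k| !k.is_empty())
        .map(|k| k.to_lowercase())
        .collect();

    headlines
        .iter()
        .filter(|headline| {
            let lower = headline.to_lowercase();
            keywords.iter().any(|k| lower.contains(k.as_str()))
        })
        .map(|headline| headline.as_str())
        .collect()
}

fn main() {
    // Placeholder headlines; in the scraper these would be the scraped titles.
    let headlines = vec![
        "Senate Passes Budget Bill".to_string(),
        "Local Team Wins Title".to_string(),
    ];

    for hit in matching_headlines(&headlines, &["budget", "election"]) {
        println!("Alert: {}", hit);
    }
}
```

Substring matching is the simplest choice here; it will also match keywords inside longer words, which may or may not be what you want.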

Key Takeaways

Web scraping involves making requests and parsing the HTML programmatically

The reqwest and scraper crates make this very straightforward in Rust

Mimicking a real browser helps avoid blocks from sites like NYTimes

CSS selectors allow us to target and extract specific elements

Always handle missing data and verify the request succeeded

In more advanced implementations, you may even need to rotate the User-Agent string so the website can't tell that the requests come from the same browser.

Go a little further and you will realize that the server can simply block your IP address, ignoring all your other tricks. This is a bummer, and it is where many web crawling projects fail.
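As a rough illustration of User-Agent rotation, the sketch below cycles through a fixed list in round-robin order using only the standard library. UserAgentRotator is a hypothetical helper and the strings are placeholders rather than real browser identifiers; in the scraper, each returned value would go into the "User-Agent" header of the next request.

```rust
/// Cycles through a fixed list of User-Agent strings in round-robin order.
struct UserAgentRotator {
    agents: Vec<&'static str>,
    next: usize,
}

impl UserAgentRotator {
    /// Panics if the list is empty, since there would be nothing to return.
    fn new(agents: Vec<&'static str>) -> Self {
        assert!(!agents.is_empty(), "need at least one User-Agent");
        Self { agents, next: 0 }
    }

    /// Returns the current entry and advances to the next one, wrapping around.
    fn next_agent(&mut self) -> &'static str {
        let agent = self.agents[self.next];
        self.next = (self.next + 1) % self.agents.len();
        agent
    }
}

fn main() {
    // Placeholder strings; a real list would hold full browser User-Agents.
    let mut rotator = UserAgentRotator::new(vec!["ExampleBrowser/1.0", "ExampleBrowser/2.0"]);

    for _ in 0..3 {
        println!("{}", rotator.next_agent());
    }
}
```

Round-robin is only one possible policy; picking an entry at random per request is another common choice.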

Overcoming IP Blocks

Investing in a private rotating proxy service like Proxies API can often make the difference between a web scraping project that gets the job done consistently and one that never really works.

Plus, with our running offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.

Our rotating proxy server, Proxies API, provides a simple API designed to take care of IP blocking, with:

Millions of high-speed rotating proxies located all over the world

Automatic IP rotation

Automatic User-Agent string rotation (which simulates requests from different, valid web browsers and browser versions)

Automatic CAPTCHA solving

Hundreds of our customers have solved the headache of IP blocks this way.

The whole service is accessed through a simple HTTP call like the one below, from any programming language.

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
