Scraping New York Times News Headlines in R

Dec 6, 2023 · 6 min read

Web scraping is the process of extracting data from websites automatically through code. It allows gathering information published openly on the web to analyze or use programmatically.

A common use case is scraping article headlines and links from news sites like The New York Times to perform text analysis or feed into machine learning models. Instead of laboriously copying content by hand, web scraping makes this fast and easy.

In this beginner R tutorial, we'll walk through a simple example of scraping the main New York Times page to extract article titles and links into R for further processing.

Prerequisites

Before running the code, some packages need to be installed:

install.packages(c("rvest", "httr"))
  • rvest - for HTML parsing and extraction
  • httr - provides useful HTTP client functionality

    Load the libraries:

    library(rvest)
    library(httr)
    

    Making HTTP Requests with R

    We first need to download the New York Times HTML page content into R to search through it. This requires sending an HTTP GET request from R:

    url <- 'https://www.nytimes.com/'
    headers <- add_headers("User-Agent" = "Mozilla/5.0")
    
    response <- GET(url, headers)
    

    Here we:

  • Define the NYT homepage URL
  • Set a browser-like User-Agent header (some sites block non-browser requests)
  • Make the GET request and store the response

    Next, we check the status code to confirm success:

    if(status_code(response) == 200) {
      # Continue scraping
    } else {
      print("Request failed")
    }
    

    HTTP status 200 means the server returned the page successfully. Any other code, such as 403 (forbidden) or 429 (too many requests), signals a problem that we need to handle rather than parse as if it were the homepage.
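Different failure codes call for different reactions, so it can help to sort status codes into broad groups before deciding what to do. The following is a minimal sketch; `describe_status` is a hypothetical helper name used for illustration and is not part of httr.

```r
# Hypothetical helper (not part of httr): map an HTTP status code
# to a broad category so the caller can decide how to react.
describe_status <- function(code) {
  if (code >= 200 && code < 300) {
    "success"
  } else if (code == 429) {
    "rate limited"   # too many requests: slow down and retry later
  } else if (code >= 400 && code < 500) {
    "client error"   # e.g. 403 forbidden, 404 not found
  } else if (code >= 500 && code < 600) {
    "server error"   # the site itself failed: retrying later may help
  } else {
    "other"          # 1xx and 3xx codes
  }
}

describe_status(200)
describe_status(403)
```

In the scraper, this could be called as `describe_status(status_code(response))`. httr also ships `http_error()` and `stop_for_status()` for cases where a simple pass/fail check is enough.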

    Parsing the Page Content in R

    With the HTML content now stored in the response object, we leverage rvest to parse and search through it.

    page_content <- content(response, "text", encoding = "UTF-8")
    page <- read_html(page_content)
    

    Next we need to find which elements on the page contain the article titles and links. Viewing the page source, we can see that article content sits within section tags.

    Inspecting the page

    Inspecting an element in Chrome's developer tools shows how the markup is structured.

    The articles are contained inside section tags with the class story-wrapper.

    We grab all such sections:

    article_sections <- html_nodes(page, "section.story-wrapper")
    

    Extracting Article Data

    To extract the title and link from each section, we loop through the results:

    for (section in article_sections) {
    
      title <- html_node(section, "h3.indicate-hover")
      link <- html_node(section, "a.css-9mylee")
    
      if (!is.na(title) && !is.na(link)) {
        article_title <- html_text(title)
        article_url <- html_attr(link, "href")
    
        print(article_title)
        print(article_url)
      }
    }
    

    Here we first find the specific nodes for title and link using CSS selectors, then extract the text and attribute values if they exist.

    Finally, we print the results. In a real system we would store and process them further.
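The same extraction can also be written without an explicit loop, collecting the results in a data frame instead of printing them. The sketch below runs on a small hand-written HTML fragment rather than the live site. The fragment, its example URLs, and the assumption that the live page still uses these class names are all illustrative, since the real markup changes over time.

```r
library(rvest)

# A hand-written fragment standing in for the real page (assumption:
# the live markup and class names may differ and change over time).
sample_html <- '
<html><body>
  <section class="story-wrapper">
    <a class="css-9mylee" href="https://www.nytimes.com/example-one.html">
      <h3 class="indicate-hover">First example headline</h3>
    </a>
  </section>
  <section class="story-wrapper">
    <p>No headline or link in this one</p>
  </section>
  <section class="story-wrapper">
    <a class="css-9mylee" href="https://www.nytimes.com/example-two.html">
      <h3 class="indicate-hover">Second example headline</h3>
    </a>
  </section>
</body></html>'

page <- read_html(sample_html)
sections <- html_nodes(page, "section.story-wrapper")

# html_node() returns one result per section, with NA where nothing matches
titles <- html_text(html_node(sections, "h3.indicate-hover"), trim = TRUE)
links <- html_attr(html_node(sections, "a.css-9mylee"), "href")

# Keep only the sections where both a title and a link were found
articles <- data.frame(title = titles, url = links, stringsAsFactors = FALSE)
articles <- articles[!is.na(articles$title) & !is.na(articles$url), ]
print(articles)
```

A data frame like `articles` can then be written to disk with `write.csv()` or passed on to later analysis steps.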

    Potential Challenges

    Some potential issues to be aware of:

  • Site layout changes may break CSS selectors
  • News sites often load content dynamically, so extra requests or pagination may be needed to get every article
  • Requests sent without throttling may hit rate limits
  • Some pages require authentication or cookies

    R packages like RSelenium can simulate a real browser environment if these issues arise.
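Throttling is the easiest of these issues to address in plain R. The sketch below pauses between requests with `Sys.sleep()`. The helper name `fetch_politely`, the two-second default delay, and the stand-in `fake_fetch` function are assumptions made for illustration. A real script would pass a function that calls `GET()` instead.

```r
# Hypothetical helper: call `fetch` on each URL in turn, pausing between
# requests so the target site is not hit in rapid succession.
fetch_politely <- function(urls, fetch, delay_seconds = 2) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    if (i > 1) Sys.sleep(delay_seconds)  # wait before every request but the first
    results[[i]] <- fetch(urls[i])
  }
  names(results) <- urls
  results
}

# Stand-in fetcher so the sketch runs without network access; in a real
# script this could be function(u) httr::GET(u, headers).
fake_fetch <- function(u) paste("fetched", u)

pages <- fetch_politely(
  c("https://example.com/a", "https://example.com/b"),
  fake_fetch,
  delay_seconds = 0.1
)
```

Passing the fetcher in as an argument keeps the delay logic separate from the request itself, which also makes it easy to try out offline.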

    Next Steps

    The rvest package provides a wide toolkit for scraping many types of sites. From here you could look to:

  • Automatically scrape articles daily
  • Perform sentiment analysis on headlines
  • Feed data into predictive models
  • Expand types of fields extracted beyond title/link

    Hopefully this gives a glimpse of what becomes possible once you can easily pull website data into R at scale!
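As a small taste of that kind of follow-up analysis, the sketch below counts word frequencies across a set of headlines using only base R. The two headlines are made-up placeholders, not scraped data, and a simple word count stands in for real sentiment analysis.

```r
# Placeholder headlines (invented for illustration, not scraped data)
headlines <- c(
  "Example headline about the economy",
  "Another example headline about the weather"
)

# Lowercase, split on anything that is not a letter, and drop empty strings
words <- unlist(strsplit(tolower(headlines), "[^a-z]+"))
words <- words[nchar(words) > 0]

# Count occurrences and show the most frequent words first
word_counts <- sort(table(words), decreasing = TRUE)
head(word_counts, 5)
```

With real data, `headlines` would be replaced by the character vector of scraped titles.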

    Full R Code

    Here is the complete runnable script:

    # Load the required libraries
    library(httr)
    library(rvest)
    
    # URL of The New York Times website
    url <- 'https://www.nytimes.com/'
    
    # Define a user-agent header to simulate a browser request
    headers <- add_headers(
      "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    )
    
    # Send an HTTP GET request to the URL
    response <- GET(url, headers)
    
    # Check if the request was successful (status code 200)
    if (status_code(response) == 200) {
      # Parse the HTML content of the page
      page_content <- content(response, "text", encoding = "UTF-8")
      page <- read_html(page_content)
    
      # Find all article sections with class 'story-wrapper'
      article_sections <- html_nodes(page, "section.story-wrapper")
    
      # Initialize character vectors to store the article titles and links
      article_titles <- character(0)
      article_links <- character(0)
    
      # Iterate through the article sections
      for (article_section in article_sections) {
        # Look up the title element within this section
        title_element <- html_node(article_section, "h3.indicate-hover")
        # Look up the link element within this section
        link_element <- html_node(article_section, "a.css-9mylee")
    
        # If both title and link are found, extract and append
        if (!is.na(title_element) && !is.na(link_element)) {
          article_title <- html_text(title_element)
          article_link <- html_attr(link_element, "href")
    
          article_titles <- c(article_titles, article_title)
          article_links <- c(article_links, article_link)
        }
      }
    
      # Print or process the extracted article titles and links
      for (i in seq_along(article_titles)) {
        cat("Title:", article_titles[i], "\n")
        cat("Link:", article_links[i], "\n\n")
      }
    } else {
      cat("Failed to retrieve the web page. Status code:", status_code(response), "\n")
    }

    In more advanced implementations you may even need to rotate the User-Agent string so the website can't tell that the requests come from the same browser.
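One simple way to rotate the User-Agent is to pick a random entry from a small pool before each request. This is a minimal sketch: the strings in the pool are illustrative values and `pick_user_agent` is a hypothetical helper name.

```r
# A small pool of browser-like User-Agent strings (illustrative values only)
user_agents <- c(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15",
  "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
)

# Hypothetical helper: pick one User-Agent at random for the next request
pick_user_agent <- function(pool = user_agents) {
  sample(pool, 1)
}

ua <- pick_user_agent()
# With httr loaded, the request could then be sent as:
# response <- GET(url, add_headers("User-Agent" = ua))
```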

    As you go further, you will find that the server can simply block your IP address, regardless of your other tricks. This is frustrating, and it is where many web crawling projects fail.

    Overcoming IP Blocks

    Investing in a private rotating proxy service like Proxies API can often make the difference between a headache-free web scraping project that gets the job done consistently and one that never really works.

    Plus, with the running offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.

    Our rotating proxy server, Proxies API, provides a simple API for dealing with IP blocking problems. It offers:

  • Millions of high-speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent string rotation (which simulates requests from different, valid web browsers and browser versions)
  • Automatic CAPTCHA solving technology

    Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed from any programming language through a simple API call like the one below.

    curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
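To stay in R, the same request URL can be assembled with base functions. In this sketch the endpoint and the `key` and `url` parameters are taken from the curl example above, `proxied_url` is a hypothetical helper name, and percent-encoding the target URL is an assumption about what the service accepts (the curl example passes it unencoded).

```r
# Hypothetical helper: build the Proxies API request URL for a target page.
# "API_KEY" is a placeholder for a real key.
proxied_url <- function(target, api_key = "API_KEY") {
  paste0(
    "http://api.proxiesapi.com/?key=", api_key,
    # Assumption: the service accepts a percent-encoded target URL
    "&url=", utils::URLencode(target, reserved = TRUE)
  )
}

request_url <- proxied_url("https://www.nytimes.com/")
# With httr loaded, the page could then be fetched with:
# response <- GET(request_url)
```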

    We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
