Scraping New York Times News Headlines in Scala

Dec 6, 2023 · 5 min read

Web scraping is a technique for extracting data from websites automatically. It can be useful for collecting articles, building datasets, and automating workflows. In this beginner-friendly guide, we'll walk through scraping article titles and links from The New York Times homepage using Scala and the Jsoup library.

Use Case

Why would you want to scrape The New York Times site? Here are a few examples:

  • Aggregating the daily headlines to share in a news digest
  • Analyzing article topics over time to detect trends
  • Archiving interesting articles to read later
  • Building a dataset for natural language processing

While The New York Times provides API access, scraping can complement it by extracting data directly from the rendered web pages.

    Setup

    We'll use Jsoup, a Java library for parsing HTML. To follow along, you'll need:

  • JDK 8+
  • SBT build tool
  • An editor like IntelliJ IDEA

    Add Jsoup to your SBT build:

    libraryDependencies += "org.jsoup" % "jsoup" % "1.15.3"
    

    Start with a minimal object containing the imports and the entry point:

    import org.jsoup.Jsoup
    import org.jsoup.nodes.Document
    
    object TimesScraper {
      def main(args: Array[String]): Unit = {
        // Scraping logic will go here
      }
    }
    

    Making a Request

    To scrape a web page, we need to first download its HTML content. Jsoup provides a clean API for this by handling much of the HTTP complexity under the hood.

    We'll use Jsoup.connect to send a GET request to the NYT homepage URL:

    val url = "https://www.nytimes.com/"
    val doc: Document = Jsoup.connect(url).get()
    

    The get() method returns a Document object that contains the parsed HTML content.
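
    Before pointing it at a live site, it can help to see what a Document exposes by parsing a small inline HTML snippet offline with Jsoup.parse. The snippet below is a made-up stand-in for a real page:

```scala
import org.jsoup.Jsoup
import org.jsoup.nodes.Document

object DocumentDemo {
  def main(args: Array[String]): Unit = {
    // A tiny made-up page so we can explore Document without a network call
    val html =
      """<html><head><title>Demo Page</title></head>
        |<body><h1>Hello</h1><p>World</p></body></html>""".stripMargin

    val doc: Document = Jsoup.parse(html)

    println(doc.title())                  // prints "Demo Page"
    println(doc.selectFirst("h1").text()) // prints "Hello"
  }
}
```

    The same Document methods used here work identically on the page returned by connect(url).get().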

    Note: Websites often check the User-Agent header to prevent scraping. Let's spoof a real browser's user agent to avoid issues:

    val userAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
    
    val doc: Document = Jsoup
      .connect(url)
      .userAgent(userAgent)
      .get()
    

    We can now parse this doc object to extract data.

    Parsing the Page

    Inspecting the page

    Open the homepage in Chrome and use Inspect Element to see how the markup is structured.

    You can see that each article preview is contained in a section tag with the class story-wrapper.

    Next, we'll use Jsoup's DOM traversal methods and CSS selectors to grab those sections:

    val articleSections = doc.select("section.story-wrapper")
    

    We can iterate through these sections and use more specific selectors to extract the title and link from each one.

    Jsoup accepts CSS selector strings, much like jQuery. Here's how to get the title and link elements:

    // Get article title
    val titleElement = articleSection.selectFirst("h3.indicate-hover")
    
    // Get article link
    val linkElement = articleSection.selectFirst("a.css-9mylee")
    

    Then we can use Jsoup's DOM methods to extract the text and attribute values:

    // Extract title text
    val articleTitle = titleElement.text()
    
    // Extract href value
    val articleLink = linkElement.attr("href")
    

    And voila! We now have each article title and link scraped from the homepage.
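
    One caveat: href values scraped this way may be relative paths. Jsoup can resolve them against a base URL via the abs: attribute prefix, as long as the Document knows its base URI (which connect(url).get() sets automatically). A small offline sketch, using a made-up anchor tag:

```scala
import org.jsoup.Jsoup

object AbsHrefDemo {
  def main(args: Array[String]): Unit = {
    // A made-up anchor; the explicit base URI lets Jsoup resolve relative links
    val html = """<a class="css-9mylee" href="/2023/12/06/example.html">Story</a>"""
    val doc = Jsoup.parse(html, "https://www.nytimes.com/")

    val link = doc.selectFirst("a.css-9mylee")
    println(link.attr("href"))     // prints "/2023/12/06/example.html"
    println(link.attr("abs:href")) // prints "https://www.nytimes.com/2023/12/06/example.html"
  }
}
```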

    Putting It All Together

    Here is the full scraper code. Note the extra import of scala.jdk.CollectionConverters._, which provides the asScala conversion needed to iterate over Jsoup's Java collections (on Scala 2.12 and earlier, use scala.collection.JavaConverters instead):

    import org.jsoup.Jsoup
    import org.jsoup.nodes.Document
    import scala.jdk.CollectionConverters._
    
    object TimesScraper {
    
      def main(args: Array[String]): Unit = {
    
        val url = "https://www.nytimes.com/"
    
        val userAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
    
        // Fetch and parse the homepage
        val doc: Document = Jsoup
          .connect(url)
          .userAgent(userAgent)
          .get()
    
        // Each article preview lives in a section.story-wrapper
        val articleSections = doc.select("section.story-wrapper")
    
        var articleTitles = List[String]()
        var articleLinks = List[String]()
    
        for (articleSection <- articleSections.asScala) {
          val titleElement = articleSection.selectFirst("h3.indicate-hover")
          val linkElement = articleSection.selectFirst("a.css-9mylee")
    
          // Some sections (ads, media blocks) lack a title or link, so guard against nulls
          if (titleElement != null && linkElement != null) {
            val articleTitle = titleElement.text()
            val articleLink = linkElement.attr("href")
    
            articleTitles = articleTitle :: articleTitles
            articleLinks = articleLink :: articleLinks
          }
        }
    
        // :: prepends, so reverse to restore page order
        println(articleTitles.reverse)
        println(articleLinks.reverse)
    
      }
    
    }
    

    And we're done! Run the code and you'll see the latest articles printed out.

    You can now store these in a database, send them to a web API, or process them further.
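
    As a sketch of the "store them" step, here's one way to write the scraped pairs to a CSV file using only the standard library. The titles and links below are hypothetical placeholders standing in for the scraper's output:

```scala
import java.io.PrintWriter

object SaveHeadlines {
  // Quote a CSV field and escape any embedded double quotes
  def csvField(s: String): String = "\"" + s.replace("\"", "\"\"") + "\""

  def main(args: Array[String]): Unit = {
    // Placeholder data standing in for the scraper's output
    val articleTitles = List("Example Headline A", "Example Headline B")
    val articleLinks  = List("https://www.nytimes.com/a", "https://www.nytimes.com/b")

    val out = new PrintWriter("headlines.csv")
    try {
      out.println("title,link")
      for ((title, link) <- articleTitles.zip(articleLinks))
        out.println(s"${csvField(title)},${csvField(link)}")
    } finally out.close()

    println("Wrote headlines.csv")
  }
}
```

    In the real scraper you'd pass in the articleTitles and articleLinks lists built in the loop above.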

    Next Steps

    This covers the basics of using Jsoup to scrape data from an HTML page. Some ideas for next steps:

  • Scrape additional data like subtitles, author names, or article bodies
  • Process the articles using natural language techniques
  • Schedule and deploy this as a daily cron job
  • Generalize the scraper to work on other news sites
  • Use a headless browser tool like Playwright for JavaScript-heavy sites, or a crawling framework like Scrapy for larger jobs

    Web scraping opens up many possibilities for building useful data pipelines. Hopefully this tutorial provided a solid foundation for applying these techniques in your own projects.

    In more advanced implementations you may even need to rotate the User-Agent string so the website can't tell it's the same browser making every request.
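
    A minimal sketch of User-Agent rotation: keep a pool of real browser strings (the ones below are examples) and pick one at random per request:

```scala
import scala.util.Random

object UserAgentPool {
  // A small pool of example browser User-Agent strings
  val userAgents: Vector[String] = Vector(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0"
  )

  // Pick a random User-Agent for each request
  def random(): String = userAgents(Random.nextInt(userAgents.length))

  def main(args: Array[String]): Unit = {
    // Each call may return a different string from the pool
    println(random())
    // In the scraper: Jsoup.connect(url).userAgent(UserAgentPool.random()).get()
  }
}
```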

    If you get a little more advanced, you'll find that the server can simply block your IP address, ignoring all your other tricks. This is a bummer, and it's where most web crawling projects fail.

    Overcoming IP Blocks

    Investing in a private rotating proxy service like Proxies API can, most of the time, make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.

    Plus, with our current offer of 1,000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It takes only one line of integration, so it's hardly disruptive.

    Our rotating proxy service, Proxies API, provides a simple API that can solve these IP blocking problems instantly:

  • Millions of high-speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent string rotation (simulating requests from different, valid web browsers and browser versions)
  • Automatic CAPTCHA solving technology

    Hundreds of our customers have solved the headache of IP blocks with this simple API.

    The whole thing can be accessed by a simple API like below in any programming language.

    curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

    We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
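
    In Scala, calling the API is just a matter of URL-encoding the target page and prepending the endpoint from the curl example above; the snippet below sketches that, with API_KEY as a placeholder for your real key:

```scala
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

object ProxiesApiUrl {
  // Build a Proxies API request URL for a target page,
  // matching the query parameters shown in the curl example
  def proxyUrl(apiKey: String, target: String): String = {
    val encoded = URLEncoder.encode(target, StandardCharsets.UTF_8.name())
    s"http://api.proxiesapi.com/?key=$apiKey&url=$encoded"
  }

  def main(args: Array[String]): Unit = {
    // API_KEY is a placeholder; substitute your own key
    println(proxyUrl("API_KEY", "https://www.nytimes.com/"))
    // The resulting URL can then be fetched with Jsoup.connect(...).get() as before
  }
}
```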
