Scraping New York Times News Headlines using Kotlin

Dec 6, 2023 · 6 min read

The New York Times homepage contains dozens of article links that get updated throughout the day. If you want to grab those article titles and links programmatically for further analysis or processing, web scraping is a handy approach.

In this post, we’ll walk through Kotlin code that:

  1. Sends an HTTP request to retrieve the NYTimes homepage HTML
  2. Parses the HTML content using JSoup
  3. Extracts all article titles and links into lists
  4. Prints out the results

Follow along and you’ll end up with a working web scraper for this specific site. Then you can adapt the concepts for your own projects.

Sending the Initial Request

We kick things off by importing the libraries we need and defining our target URL:

import io.ktor.client.*
import io.ktor.client.engine.okhttp.*
import io.ktor.client.request.*
import org.jsoup.Jsoup

val url = "https://www.nytimes.com/"

To actually request this URL, we use the Ktor HTTP client. This handles all the low-level network communication for us.

We create the client using the OkHttp engine, a fast and efficient choice:

val client = HttpClient(OkHttp)

Before sending the request, we add a custom User-Agent header to make the request look like it comes from a real web browser. Without it, some sites block automated scraper bots.

val headers = mapOf("User-Agent" to "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36")

Finally, we make the GET request and store the HTML content:

// Ktor 1.x style: get<String> returns the response body as a String
val responseText = client.get<String>(url) {
    headers.forEach { (name, value) ->
        header(name, value)
    }
}
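If you'd rather avoid a Ktor dependency, the same request can be sketched with the JDK's built-in java.net.http.HttpClient (Java 11+). This is an alternative, not the article's approach, and the send call is commented out to keep the sketch offline:

```kotlin
import java.net.URI
import java.net.http.HttpRequest

// Build a GET request carrying the browser-like User-Agent header
fun buildRequest(url: String, userAgent: String): HttpRequest {
    return HttpRequest.newBuilder()
        .uri(URI.create(url))
        .header("User-Agent", userAgent)
        .GET()
        .build()
}

fun main() {
    val request = buildRequest(
        "https://www.nytimes.com/",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
    )
    // Sending is one extra call:
    // val body = java.net.http.HttpClient.newHttpClient()
    //     .send(request, java.net.http.HttpResponse.BodyHandlers.ofString()).body()
    println(request.headers().firstValue("User-Agent").get())
}
```

The request object can be built and inspected without ever touching the network, which also makes this easy to unit-test.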

So with just a few lines of code, we've retrieved the latest homepage HTML from The Times!

Parsing the Content with JSoup

Now that we have the raw HTML content, we need to parse it to extract the parts we want: the article titles and links.

For this we use JSoup, a popular Java library for working with HTML and XML.

We pass the HTML string into JSoup's parse() method, which gives us a nested Document Object Model (DOM) representing the content:

val doc = Jsoup.parse(responseText)

This DOM allows us to traverse the HTML elements by CSS selector queries to pinpoint what we need.

Extracting the Articles

Inspecting the page

Using Chrome's Inspect Element tool, we can see how the page markup is structured.

Each article is contained in a section tag with the class story-wrapper.

We can grab all of them through this selector:

val articleSections = doc.select("section.story-wrapper")

Then we iterate through each section and find the title and link elements inside:

for (articleSection in articleSections) {
    val titleElement = articleSection.selectFirst("h3.indicate-hover")
    val linkElement = articleSection.selectFirst("a.css-9mylee")
    // extract title and link...
}

We check that both elements exist before extracting the title text and the href URL into two lists (articleTitles and articleLinks).
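To make the null-check step concrete, here is a self-contained sketch that runs the same extraction against a made-up HTML snippet; the class names mirror the article's selectors, not live NYTimes markup:

```kotlin
import org.jsoup.Jsoup

// Extract (titles, links) pairs from HTML using the article's selectors
fun extractArticles(html: String): Pair<List<String>, List<String>> {
    val titles = mutableListOf<String>()
    val links = mutableListOf<String>()
    val doc = Jsoup.parse(html)
    for (section in doc.select("section.story-wrapper")) {
        val titleEl = section.selectFirst("h3.indicate-hover")
        val linkEl = section.selectFirst("a.css-9mylee")
        // Only keep entries where both title and link were found
        if (titleEl != null && linkEl != null) {
            titles.add(titleEl.text().trim())
            links.add(linkEl.attr("href"))
        }
    }
    return titles to links
}

fun main() {
    // Made-up markup mirroring the selectors, not real NYT HTML
    val html = """
        <section class="story-wrapper">
          <a class="css-9mylee" href="https://www.nytimes.com/a1">
            <h3 class="indicate-hover">First headline</h3>
          </a>
        </section>
        <section class="story-wrapper">
          <h3 class="indicate-hover">No link here, so skipped</h3>
        </section>
    """.trimIndent()

    val (titles, links) = extractArticles(html)
    println(titles)  // [First headline]
    println(links)   // [https://www.nytimes.com/a1]
}
```

The second section has no link element, so the null check silently skips it, which is exactly the behavior we want on a homepage where not every story block is fully populated.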

Finally, we print out the results:

for (i in articleTitles.indices) {
    println("Title: ${articleTitles[i]}")
    println("Link: ${articleLinks[i]}")
    println()
}

And we've successfully scraped the latest articles from the homepage!

The full code is included below to use as a reference.

Key Takeaways

  • Use Ktor HTTP client to request a webpage
  • Pass HTML content to JSoup to parse
  • Traverse DOM elements with CSS selectors
  • Check for null conditions before extracting data
  • Store scraped content in lists or other data structures
Full Code

Here is the complete code for this New York Times scraper:

import io.ktor.client.*
import io.ktor.client.engine.okhttp.*
import io.ktor.client.request.*
import org.jsoup.Jsoup

suspend fun main() {
    // URL of The New York Times website
    val url = "https://www.nytimes.com/"

    // Define a user-agent header to simulate a browser request
    val headers = mapOf(
        "User-Agent" to "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    )

    // Create an HTTP client using the OkHttp engine
    val client = HttpClient(OkHttp)

    try {
        // Send an HTTP GET request to the URL with headers
        val responseText = client.get<String>(url) {
            headers.forEach { (name, value) ->
                header(name, value)
            }
        }

        // Parse the HTML content of the page using JSoup
        val doc = Jsoup.parse(responseText)

        // Find all article sections with class 'story-wrapper'
        val articleSections = doc.select("section.story-wrapper")

        // Initialize lists to store the article titles and links
        val articleTitles = mutableListOf<String>()
        val articleLinks = mutableListOf<String>()

        // Iterate through the article sections
        for (articleSection in articleSections) {
            val titleElement = articleSection.selectFirst("h3.indicate-hover")
            val linkElement = articleSection.selectFirst("a.css-9mylee")

            // If both title and link are found, extract and append
            if (titleElement != null && linkElement != null) {
                val articleTitle = titleElement.text().trim()
                val articleLink = linkElement.attr("href")

                articleTitles.add(articleTitle)
                articleLinks.add(articleLink)
            }
        }

        // Print or process the extracted article titles and links
        for (i in articleTitles.indices) {
            println("Title: ${articleTitles[i]}")
            println("Link: ${articleLinks[i]}")
            println()
        }
    } catch (e: Exception) {
        println("Failed to retrieve the web page. Exception: $e")
    } finally {
        client.close()
    }
}

In more advanced implementations, you may also need to rotate the User-Agent string so the website can't tell the requests all come from the same browser.
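A minimal sketch of what rotation could look like; the User-Agent strings below are placeholders, not a vetted list:

```kotlin
import kotlin.random.Random

// Placeholder pool of User-Agent strings to rotate through
val userAgents = listOf(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/58.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0"
)

// Pick a User-Agent at random for each outgoing request
fun nextUserAgent(random: Random = Random.Default): String =
    userAgents[random.nextInt(userAgents.size)]

fun main() {
    repeat(3) {
        // Each request would set this value on its User-Agent header
        println("User-Agent: ${nextUserAgent()}")
    }
}
```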

Go a little further, though, and you will find that the server can simply block your IP address, ignoring all your other tricks. This is where most web crawling projects fail.

Overcoming IP Blocks

Investing in a private rotating proxy service like Proxies API can often make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.

Plus, with the current offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. Integration takes only one line of code, so it's hardly disruptive.

Our rotating proxy service, Proxies API, provides a simple API that can solve all IP-blocking problems instantly:

  • Millions of high-speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent string rotation (simulating requests from different, valid web browsers and browser versions)
  • Automatic CAPTCHA-solving technology

With these features, hundreds of our customers have solved the headache of IP blocks with a simple API.

The whole thing can be accessed through a simple API from any programming language:

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
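In Kotlin, the same call could be sketched as follows; the key is a placeholder, and note that the target URL should be URL-encoded before being appended as a query parameter:

```kotlin
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

// Build the proxied request URL; apiKey is a placeholder credential
fun proxiedUrl(apiKey: String, targetUrl: String): String {
    val encoded = URLEncoder.encode(targetUrl, StandardCharsets.UTF_8)
    return "http://api.proxiesapi.com/?key=$apiKey&url=$encoded"
}

fun main() {
    println(proxiedUrl("API_KEY", "https://example.com"))
    // http://api.proxiesapi.com/?key=API_KEY&url=https%3A%2F%2Fexample.com
}
```

The resulting URL can then be fetched with any HTTP client, exactly like the direct request earlier in the article.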

We have a standing offer of 1000 API calls completely free. Register and get your free API key here.
