Scraping New York Times News Headlines in VB

Dec 6, 2023 · 8 min read

Web scraping is the process of extracting data from websites automatically through code. It allows you to harvest and use public data in all kinds of beneficial ways.

In this article, we'll walk through a full example of scraping article titles and links from the home page of The New York Times (NYT). Their site doesn't have an API to access this article data directly, so web scraping provides a method to get it.

Why Scrape The New York Times?

The NYT publishes high-quality, timely articles across many topics. Scraping them allows you to tap into this great content for projects like:

  • Aggregating topic-specific articles over time
  • Performing meta-analysis or natural language processing on coverage
  • Building an alternative interface to their content
  • Populating a database for public analysis

Many more ideas become possible once the data has been extracted!

    Step 1: Set Up Imports and Modules

    We first need to import the .NET namespaces and define a module for our scraper code:

    Imports System.Net
    Imports System.IO
    Imports HtmlAgilityPack
    
    Module Program
    
  • System.Net contains classes for network communications like web requests
  • System.IO has helpers for managing streams and files
  • HtmlAgilityPack allows HTML parsing to extract data
  • Wrapping our code in a module allows it to be called from other parts of the application.

    Step 2: Create Request to NYT Website

    To make a request to any web page, we need to specify the URL. For The New York Times home page:

    Dim url As String = "https://www.nytimes.com/"
    

    We also should define a user agent header that identifies our program as a browser. This gets around blocks some sites have against scraping bots:

    Dim userAgent As String = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    

    Then we can construct an HttpWebRequest with our URL and user agent:

    Dim request As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)
    request.UserAgent = userAgent
    

    This request will simulate a browser visiting the URL.

    Step 3: Send Request and Get Response

    To actually call the URL, we use the GetResponse() method on our request:

    Dim response As HttpWebResponse = DirectCast(request.GetResponse(), HttpWebResponse)
    

    This will connect to the URL and return an HttpWebResponse object with the page contents.

    We should check that the request was successful by looking at the status code:

    If response.StatusCode = HttpStatusCode.OK Then
       ' Request succeeded
    End If
    

    Status code 200 (OK) means everything went smoothly. Other codes indicate that an error occurred.
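
    Note that in practice, HttpWebRequest.GetResponse() throws a WebException for error status codes (4xx/5xx) rather than returning them, so a more defensive sketch wraps the call in a Try/Catch. This error handling is an optional addition, not part of the final listing:

    Try
        Dim response As HttpWebResponse = DirectCast(request.GetResponse(), HttpWebResponse)
        ' ... parse the response as shown in the next steps ...
    Catch ex As WebException
        ' GetResponse() raises WebException for 4xx/5xx responses
        Dim errorResponse As HttpWebResponse = TryCast(ex.Response, HttpWebResponse)
        If errorResponse IsNot Nothing Then
            Console.WriteLine("Request failed with status: " & errorResponse.StatusCode)
        Else
            Console.WriteLine("Request failed: " & ex.Message)
        End If
    End Try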

    Step 4: Load HTML into Parser

    Now that we have the raw HTML content from the page, we need to parse it to extract the articles. The Html Agility Pack (HAP) library allows easy parsing and querying of HTML in .NET.

    We load the response stream into an HtmlDocument:

    Dim htmlDoc As New HtmlDocument()
    htmlDoc.Load(response.GetResponseStream())
    

    This document represents a structured tree of elements that we can now explore using XPath queries.

    Step 5: Use XPath to Extract Articles

    Inspecting the page

    We start by using Chrome's "Inspect Element" tool to see how the page markup is structured.

    You can see that the articles are contained inside <section> tags with the class story-wrapper.

    HAP has many options for targeting elements - we'll use XPath queries here since they work well for scraping structured data.

    First we get all <section> tags with the story-wrapper CSS class:

    Dim articleSections As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//section[contains(@class, 'story-wrapper')]")
    

    This finds the key sections containing articles on NYT's home page.

    We iterate over these sections and use more XPath queries to extract the title and link inside each:

    For Each articleSection As HtmlNode In articleSections
    
        Dim titleElement As HtmlNode = articleSection.SelectSingleNode(".//h3[contains(@class, 'indicate-hover')]")
    
        Dim linkElement As HtmlNode = articleSection.SelectSingleNode(".//a[contains(@class, 'css-9mylee')]")
    
    Next
    

    These queries target the specific elements in each section that contain the data we want.

    Step 6: Store Results in Lists

    As we extract each title and link, we can store them in lists:

    Dim articleTitles As New List(Of String)
    Dim articleLinks As New List(Of String)
    
    '... inside the For Each loop, after checking both nodes are not Nothing:
    
    articleTitles.Add(titleElement.InnerText.Trim())
    articleLinks.Add(linkElement.GetAttributeValue("href", ""))
    

    These lists give us easy access to the scraped data for any processing or output we want.

    Step 7: Print/Process Results

    Finally, we can print the article titles and links:

    For i As Integer = 0 To articleTitles.Count - 1
    
        Console.WriteLine("Title: " & articleTitles(i))
        Console.WriteLine("Link: " & articleLinks(i))
        Console.WriteLine()
    
    Next
    

    This will output each article scraped from the homepage.

    The full code can be found at the end of this article.

    Key Takeaways

    The key steps to scrape structured article data are:

    1. Identify target site and elements
    2. Craft web request with user agent
    3. Parse page HTML
    4. Extract data with XPath
    5. Store in object model
    6. Output/process results

    From here you could expand to scrape additional fields, save to a database, or integrate with other systems. Web scraping opens up many possibilities!

    Next Steps

    To extend this simple scraper:

  • Scrape additional metadata like author, date, topics
  • Store the structured data in databases or services
  • Expand to scrape additional pages/sections
  • Automate on a schedule with cron jobs
  • Detect changes and new articles automatically

    Web scraping brings the vast content of the web to your fingertips! Let us know if you have any other questions.

    Full code:

    Imports System.Net
    Imports System.IO
    Imports HtmlAgilityPack
    
    Module Program
        Sub Main()
            ' URL of The New York Times website
            Dim url As String = "https://www.nytimes.com/"
    
            ' Define a user-agent header to simulate a browser request
            Dim userAgent As String = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    
            ' Create an HTTP request with the user-agent header
            Dim request As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)
            request.UserAgent = userAgent
    
            ' Send an HTTP GET request to the URL
            Dim response As HttpWebResponse = DirectCast(request.GetResponse(), HttpWebResponse)
    
            ' Check if the request was successful (status code 200)
            If response.StatusCode = HttpStatusCode.OK Then
                ' Create an HtmlDocument to parse the HTML content
                Dim htmlDoc As New HtmlDocument()
                htmlDoc.Load(response.GetResponseStream())
    
                ' Find all article sections with class 'story-wrapper'
                Dim articleSections As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//section[contains(@class, 'story-wrapper')]")
    
                ' Initialize lists to store the article titles and links
                Dim articleTitles As New List(Of String)()
                Dim articleLinks As New List(Of String)()
    
                ' Iterate through the article sections
                If articleSections IsNot Nothing Then
                    For Each articleSection As HtmlNode In articleSections
                        ' Check if the article title element exists
                        Dim titleElement As HtmlNode = articleSection.SelectSingleNode(".//h3[contains(@class, 'indicate-hover')]")
                        ' Check if the article link element exists
                        Dim linkElement As HtmlNode = articleSection.SelectSingleNode(".//a[contains(@class, 'css-9mylee')]")
    
                        ' If both title and link are found, extract and append
                        If titleElement IsNot Nothing AndAlso linkElement IsNot Nothing Then
                            Dim articleTitle As String = titleElement.InnerText.Trim()
                            Dim articleLink As String = linkElement.GetAttributeValue("href", "")
    
                            articleTitles.Add(articleTitle)
                            articleLinks.Add(articleLink)
                        End If
                    Next
                End If
    
                ' Print or process the extracted article titles and links
                For i As Integer = 0 To articleTitles.Count - 1
                    Console.WriteLine("Title: " & articleTitles(i))
                    Console.WriteLine("Link: " & articleLinks(i))
                    Console.WriteLine()
                Next
            Else
                Console.WriteLine("Failed to retrieve the web page. Status code: " & response.StatusCode)
            End If
        End Sub
    End Module

    In more advanced implementations you may even need to rotate the User-Agent string so the website can't tell it's the same browser!
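
    A minimal sketch of User-Agent rotation might look like this (the strings below are illustrative browser identifiers, not a canonical list - swap in whatever you want to cycle through):

    ' Pool of browser identifiers to rotate through (illustrative values)
    Dim userAgents As String() = {
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15",
        "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
    }

    ' Pick one at random for each request
    Dim rng As New Random()
    request.UserAgent = userAgents(rng.Next(userAgents.Length))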

    As you get a little more advanced, you will realize that the server can simply block your IP address, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.

    Overcoming IP Blocks

    Investing in a private rotating proxy service like Proxies API can, most of the time, make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.

    Plus, with the current offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • Millions of high-speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent string rotation (which simulates requests from different, valid web browsers and web browser versions)
  • Automatic CAPTCHA-solving technology

    With these features, hundreds of our customers have solved the headache of IP blocks with a simple API.

    The whole thing can be accessed by a simple API like below in any programming language.

    curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

    We have a running offer of 1000 API calls completely free. Register and get your free API Key here.

