Scraping New York Times News Headlines in Elixir

Dec 6, 2023 · 6 min read

Have you ever wanted to systematically extract data from a website? That process is called web scraping, and it's a very useful skill for building Elixir applications. In this beginner-friendly walkthrough, we'll cover how to use Elixir libraries like HTTPoison and Floki to scrape article titles and links from the New York Times homepage.

Let's get started!

What is Web Scraping?

Web scraping refers to programmatically fetching web page content and extracting the data you need.

For example, say we want to get a list of article headlines from the New York Times to display in our Elixir app. We could manually copy and paste each headline, but that's tedious for us humans. Web scraping automates that grunt work.

The scraper will:

  1. Download the New York Times homepage HTML
  2. Parse through that HTML content
  3. Pick out the article titles and links
  4. Output those titles and links for us to display

Pretty handy! Scraping saves us humans a lot of repetitive work.

Scraping the New York Times Homepage

Let's walk through this scraper code step-by-step:

defmodule Scraper do

  require HTTPoison
  require Floki

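First, we define a Scraper module and require the HTTPoison and Floki libraries. We'll use HTTPoison to download the web page HTML, and Floki to query and traverse that HTML to extract the data we want. Note that require only makes a module's macros available, so these two lines are optional here; what actually matters is that both packages are listed as dependencies in your project's mix.exs.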
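For reference, here is a minimal sketch of a mix.exs that pulls in both libraries. The project name, the Elixir requirement, and the version requirements are assumptions for illustration, so adjust them to whatever your project uses.

```elixir
# mix.exs (sketch; project name and version requirements are assumptions)
defmodule Scraper.MixProject do
  use Mix.Project

  def project do
    [
      app: :scraper,
      version: "0.1.0",
      elixir: "~> 1.14",
      deps: deps()
    ]
  end

  def application do
    [extra_applications: [:logger]]
  end

  defp deps do
    [
      # HTTP client used to download the page
      {:httpoison, "~> 2.0"},
      # HTML parser used to query the page
      {:floki, "~> 0.35"}
    ]
  end
end
```

After saving the file, running mix deps.get downloads the dependencies.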

Next, we configure the @user_agent header:

  @user_agent [{"User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0"}]

This sets a browser-like User-Agent string, so the New York Times server sees a request that looks like it comes from Firefox rather than from a scraper bot. Useful insider trick!

Now let's define the main fetch logic:

  def fetch_articles do
    case HTTPoison.get("https://www.nytimes.com/", @user_agent) do
      # Success: hand the HTML body to the parser
      {:ok, %HTTPoison.Response{status_code: 200, body: body}} ->
        parse_articles(body)

      # The server answered, but not with 200 OK
      {:ok, %HTTPoison.Response{status_code: status_code}} ->
        IO.puts "Failed to retrieve web page, status code: #{status_code}"

      # The request itself failed (DNS, timeout, connection refused, ...)
      {:error, %HTTPoison.Error{}} ->
        IO.puts "Failed to retrieve web page"
    end
  end

We make an HTTP GET request to fetch https://www.nytimes.com, passing that user agent header we configured earlier.

This returns an HTTP response which we pattern match on:

  • A 200 status code means success, so we pass the HTML body to parse_articles/1
  • Other status codes mean failure, so we print an error
  • Network errors also print an error

You'll often see this status code matching pattern when making web requests from Elixir.
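If you would rather return a value than print inside the case expression, the same matching idea can be sketched as a small helper. The module and function names below are hypothetical, and the hand-built maps only stand in for real HTTPoison results (a plain map pattern such as %{status_code: 200} also matches a struct with that field).

```elixir
defmodule ResponseHandling do
  # Hypothetical helper: turns an HTTPoison-style result into a tagged tuple.
  # Plain map patterns also match structs, so these clauses work for
  # %HTTPoison.Response{} and %HTTPoison.Error{} values as well.
  def classify({:ok, %{status_code: 200, body: body}}), do: {:ok, body}
  def classify({:ok, %{status_code: code}}), do: {:error, {:http_status, code}}
  def classify({:error, %{reason: reason}}), do: {:error, {:network, reason}}
end

# Usage with hand-built values standing in for real responses:
IO.inspect(ResponseHandling.classify({:ok, %{status_code: 200, body: "<html></html>"}}))
IO.inspect(ResponseHandling.classify({:ok, %{status_code: 404, body: ""}}))
IO.inspect(ResponseHandling.classify({:error, %{reason: :timeout}}))
```

Returning tagged tuples leaves it to the caller to decide whether to log, retry, or give up.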

Inspecting the page

Open the page in Chrome and use Inspect Element to see how the HTML is structured.

You can see that each article is contained in a section tag with the class story-wrapper.

Next up, parsing the articles:

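  defp parse_articles(html) do
    # Parse the raw HTML string into a tree before querying it
    html
    |> Floki.parse_document!()
    |> Floki.find("section.story-wrapper")
    |> Enum.map(fn element ->
      title =
        element
        |> Floki.find("h3.indicate-hover")
        |> Floki.text()
        |> String.trim()

      # Floki.attribute/2 returns a list, so take the first href (or nil)
      link =
        element
        |> Floki.find("a.css-9mylee")
        |> Floki.attribute("href")
        |> List.first()

      {title, link}
    end)
    |> print_articles()
  end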

The parse_articles function receives the raw HTML of the web page. We use Floki to find all the story elements, then extract the title text and the link's href attribute from each one. The last line passes the resulting list of {title, link} tuples to print_articles/1 for output.
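To try the selectors without hitting the live site, here is a rough sketch that runs the same kind of queries against a made-up HTML fragment. The sample markup and the ParseSketch module are assumptions for illustration only; the real page is more complex, and generated class names like css-9mylee can change at any time. The sketch parses the string explicitly with Floki.parse_document!/1 and keeps only the first href. It assumes Elixir 1.12 or later and network access, so that Mix.install/1 can fetch Floki.

```elixir
# Assumes Elixir 1.12+ and network access so Mix.install/1 can fetch Floki.
Mix.install([{:floki, "~> 0.35"}])

defmodule ParseSketch do
  # Returns a list of {title, link} tuples instead of printing them.
  def parse(html) do
    html
    |> Floki.parse_document!()
    |> Floki.find("section.story-wrapper")
    |> Enum.map(fn element ->
      title =
        element
        |> Floki.find("h3.indicate-hover")
        |> Floki.text()
        |> String.trim()

      link =
        element
        |> Floki.find("a.css-9mylee")
        |> Floki.attribute("href")
        |> List.first()

      {title, link}
    end)
  end
end

# Made-up markup that only imitates the structure described above.
sample = """
<section class="story-wrapper">
  <a class="css-9mylee" href="https://www.nytimes.com/example-story.html">
    <h3 class="indicate-hover">  Example headline  </h3>
  </a>
</section>
"""

IO.inspect(ParseSketch.parse(sample))
```

Returning the tuples rather than printing them makes the parsing step easy to check on its own.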

Finally, printing the articles:

  defp print_articles(articles) do
    Enum.each(articles, fn {title, link} ->
      IO.puts "Title: #{title}"
      IO.puts "Link: #{link}"
      IO.puts ""
    end)
  end

This iterates through our list of {title, link} tuples, printing out each title and link.

And that's it! When we call Scraper.fetch_articles(), this will scrape and print the headlines currently on the New York Times homepage.

We could extend this scraper to extract article authors, summaries, images, etc. by finding additional elements with Floki. We could also persist these articles to a database to power an Elixir news app. Lots of possibilities!
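Before extending or persisting the results, it can help to tidy them up first. The sketch below is one possible approach, not part of the original scraper: it assumes each entry is a {title, link} tuple where link is a string or nil, and the ArticleCleanup module name is hypothetical.

```elixir
defmodule ArticleCleanup do
  # Hypothetical helper: drops entries without a title or link and removes
  # duplicate links, keeping the first occurrence of each.
  def clean(articles) do
    articles
    |> Enum.reject(fn {title, link} -> title == "" or is_nil(link) end)
    |> Enum.uniq_by(fn {_title, link} -> link end)
  end
end

# Made-up sample data for illustration.
articles = [
  {"First story", "https://www.nytimes.com/a.html"},
  {"", "https://www.nytimes.com/b.html"},
  {"No link", nil},
  {"First story again", "https://www.nytimes.com/a.html"}
]

IO.inspect(ArticleCleanup.clean(articles))
```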

Key takeaways:

  • Web scraping automates data extraction from websites
  • HTTPoison downloads HTML content
  • Floki parses and queries HTML in Elixir
  • Pattern matching on HTTP responses handles errors
  • Scrapers let you power all sorts of useful Elixir apps!

Give this scraper code a run yourself:

    defmodule Scraper do
      require HTTPoison
      require Floki

      # Browser-like User-Agent header sent with the request
      @user_agent [{"User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0"}]

      def fetch_articles do
        case HTTPoison.get("https://www.nytimes.com/", @user_agent) do
          {:ok, %HTTPoison.Response{status_code: 200, body: body}} ->
            parse_articles(body)

          {:ok, %HTTPoison.Response{status_code: status_code}} ->
            IO.puts "Failed to retrieve web page, status code: #{status_code}"

          {:error, %HTTPoison.Error{}} ->
            IO.puts "Failed to retrieve web page"
        end
      end

      defp parse_articles(html) do
        # Parse the raw HTML string into a tree before querying it
        html
        |> Floki.parse_document!()
        |> Floki.find("section.story-wrapper")
        |> Enum.map(fn element ->
          title =
            element
            |> Floki.find("h3.indicate-hover")
            |> Floki.text()
            |> String.trim()

          # Floki.attribute/2 returns a list, so take the first href (or nil)
          link =
            element
            |> Floki.find("a.css-9mylee")
            |> Floki.attribute("href")
            |> List.first()

          {title, link}
        end)
        |> print_articles()
      end

      defp print_articles(articles) do
        Enum.each(articles, fn {title, link} ->
          IO.puts "Title: #{title}"
          IO.puts "Link: #{link}"
          IO.puts ""
        end)
      end
    end

In more advanced implementations, you may even need to rotate the User-Agent string so the website can't tell that every request comes from the same browser.
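A minimal sketch of that idea is to keep a small pool of User-Agent strings and pick one at random for each request. The module name and the second and third strings are illustrative assumptions, not values from the scraper above.

```elixir
defmodule UserAgentRotation do
  # Example User-Agent strings; the second and third are illustrative
  # additions, not values from the original scraper.
  @user_agents [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/119.0"
  ]

  def user_agents, do: @user_agents

  # Builds a header list with one randomly chosen User-Agent.
  def random_headers do
    [{"User-Agent", Enum.random(@user_agents)}]
  end
end

IO.inspect(UserAgentRotation.random_headers())
```

The result has the same shape as @user_agent, so it could be passed as the headers argument to HTTPoison.get/2.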

Go a little further and you'll find that the server can simply block your IP address, no matter which other tricks you use. This is a bummer, and it's where most web crawling projects fail.

Overcoming IP Blocks

Investing in a private rotating proxy service like Proxies API can often make the difference between a headache-free web scraping project that gets the job done consistently and one that never really works.

Plus, with our current offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.

Our rotating proxy server, Proxies API, provides a simple API that can solve IP blocking problems instantly. It offers:

  • Millions of high-speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent string rotation (which simulates requests from different, valid web browsers and web browser versions)
  • Automatic CAPTCHA solving technology

Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed through a simple API call like the one below, from any programming language.

    curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
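In Elixir, the same request URL can be built with the standard library. This is only a sketch: the ProxyUrl module is hypothetical, and it assumes the service decodes a percent-encoded url parameter in the usual way.

```elixir
defmodule ProxyUrl do
  @endpoint "http://api.proxiesapi.com/"

  # Builds the request URL from the curl example, percent-encoding the target
  # so that its own ":" and "/" characters stay inside the url parameter.
  def build(api_key, target_url) do
    @endpoint <> "?" <> URI.encode_query([{"key", api_key}, {"url", target_url}])
  end
end

# "API_KEY" is the same placeholder used in the curl command.
IO.puts(ProxyUrl.build("API_KEY", "https://example.com"))
```

The resulting string could then be passed to HTTPoison.get/2 in place of the direct New York Times URL.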

We have a running offer of 1000 API calls completely free. Register to get your free API key.
