Scraping Hacker News with Elixir

Jan 21, 2024 · 8 min read

Hacker News is a popular tech news aggregator where users post links to interesting articles, blog posts, projects, and more. As an avid reader of Hacker News, you may have wondered if there was a way to programmatically scrape articles from the site. In this tutorial, we'll walk through Elixir code to scrape the top headlines from the Hacker News homepage.

This is the page we are talking about…

Overview

We'll use the Elixir programming language along with a few helper libraries to:

  1. Make an HTTP request to retrieve the Hacker News homepage HTML
  2. Parse the HTML content using a parser called Floki
  3. Extract specific elements from the parsed content to get article titles, points, authors etc.
  4. Print out the extracted information

The full code is provided at the end of this article so you can use it yourself.

Let's get started!

Install Elixir

If you don't already have Elixir installed, you can install it by following the instructions on the official Elixir site. Make sure you have a recent version installed (at least Elixir 1.11).

Create an Elixir Project

Start a new project using the mix tool that comes with Elixir:

mix new hacker_news

This will generate a simple project structure for us to work in.
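mix prints each file as it creates it; you should end up with a layout roughly like this:

hacker_news/
├── README.md
├── .formatter.exs
├── .gitignore
├── mix.exs
├── lib/
│   └── hacker_news.ex
└── test/
    ├── hacker_news_test.exs
    └── test_helper.exs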

Install Dependencies

Our code imports a few helper libraries:

  • HTTPoison - for making HTTP requests
  • Floki - for HTML parsing
Let's grab these by adding them to mix.exs. Open the generated file and, inside defp deps, add:

    [
      {:httpoison, "~> 1.8"},
      {:floki, "~> 0.31.0"}
    ]
    

    Now run mix deps.get to fetch them locally.
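For reference, the deps function in mix.exs should end up looking like this (everything else stays as mix new generated it):

defp deps do
  [
    {:httpoison, "~> 1.8"},
    {:floki, "~> 0.31.0"}
  ]
end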

    Making HTTP Requests

    The first thing our scraper does is make an HTTP GET request to retrieve the homepage HTML content:

    HTTPoison.get(@url)
    

    The @url module attribute contains the Hacker News homepage URL.
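That attribute is just a constant defined near the top of the module, exactly as in the full code at the end of this article:

defmodule HackerNewsScraper do
  # Base URL of the page we scrape
  @url "https://news.ycombinator.com/"

  # ...
end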

HTTPoison.get/1 sends the GET request and returns a tuple:

    {:ok, %HTTPoison.Response{...}}
    

    This response contains the status code and body of the page if the request succeeded. We pattern match on the tuple to handle different cases:

case HTTPoison.get(@url) do
  {:ok, %HTTPoison.Response{status_code: 200, body: body}} ->
    # Request succeeded; body holds the HTML
    body

  {:ok, %HTTPoison.Response{status_code: code}} ->
    # Request completed, but with a non-200 status
    {:error, code}

  {:error, reason} ->
    # The request itself failed (network error, timeout, ...)
    {:error, reason}
end
    

    For a status code of 200, the body contains the raw HTML we want to parse.

    Parsing HTML with Floki

    Floki provides a nice API for searching HTML documents using CSS selectors. We can parse the entire body using:

{:ok, doc} = Floki.parse_document(body)
    

This builds a tree structure representing elements like <table>, <tr> and <td> that we can now query.
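Each node in that tree is a {tag, attributes, children} tuple. A title row, for example, comes back looking roughly like this (the id, link and title shown here are made up):

{"tr", [{"class", "athing"}, {"id", "39059123"}],
 [
   {"td", [{"class", "title"}],
    [
      {"span", [{"class", "titleline"}],
       [{"a", [{"href", "https://example.com/post"}], ["Example post title"]}]}
    ]}
 ]}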

Scraping Rows from the Table

Inspecting the page, you can notice that the articles are housed inside a <table>, with each article title sitting in a <tr> tag with the class athing. We want to iterate through each of these rows and extract data.

    First find all rows:

    rows = Floki.find(doc, "tr")
    

    Then we can loop through them:

Enum.each(rows, fn row ->
  # extract data from each row (we'll flesh this out below)
  :ok
end)
    

There are a few types of rows we care about:

  • class="athing" - the article title row
  • Details row below that
  • Spacer rows in between articles

We check Floki.attribute(row, "class") to identify row types.

    Extracting Article Data

    When we hit an article row, we save a reference to that element:

    if Floki.attribute(row, "class") == ["athing"] do
      current_article = row
    end
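One Elixir-specific wrinkle: rebinding current_article inside an Enum.each callback does not carry over to the next iteration, so the complete program at the end of the article threads it through an Enum.reduce accumulator instead. A minimal sketch of that shape (print_article/2 is a helper defined in the full code below):

# Carry the most recently seen "athing" row through the accumulator
Enum.reduce(rows, nil, fn row, current_article ->
  cond do
    Floki.attribute(row, "class") == ["athing"] ->
      # Remember the title row; its details row comes next
      row

    current_article != nil ->
      # This row holds the points/author/comments for the remembered article
      print_article(current_article, row)
      nil

    true ->
      # Header and spacer rows: nothing to remember
      nil
  end
end)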
    

On the next iteration, when we reach the details row, we can connect it back to the saved article row and extract additional metadata like points, author and comment count:

# This runs in the branch that handles the details row;
# current_article is the "athing" row saved on the previous iteration
# and row is the details row that follows it.

# Extract article title
title_elem = Floki.find(current_article, "span.titleline a")
article_title = Floki.text(title_elem)

# Extract article URL (Floki.attribute/2 returns a list, so take the first value)
article_url = title_elem |> Floki.attribute("href") |> List.first()

# Relative links (e.g. "item?id=...") need the base URL prepended
full_url = if String.starts_with?(article_url, "http"), do: article_url, else: @url <> article_url

# Get the subtext details cell
subtext = Floki.find(row, "td.subtext")

# Extract points
points = Floki.text(Floki.find(subtext, "span.score"))

# Extract author
author = Floki.text(Floki.find(subtext, "a.hnuser"))

# Extract timestamp
timestamp = subtext |> Floki.find("span.age") |> Floki.attribute("title") |> List.first()

# Extract comment count (Floki's text-matching pseudo-class is fl-contains)
comments_elem = Floki.find(subtext, "a:fl-contains('comment')")
comments = if comments_elem == [], do: "0", else: Floki.text(comments_elem)

# Print out extracted data
IO.puts("Title: #{article_title}")
IO.puts("URL: #{full_url}")
IO.puts("Points: #{points}")
IO.puts("Author: #{author}")
IO.puts("Timestamp: #{timestamp}")
IO.puts("Comments: #{comments}")
    

The key things to note here are the CSS selectors. For example:

    span.titleline a
    

This finds the anchor tag inside a <span> with the class titleline, pinpointing the exact element containing the article title text we want.

    Floki.text/1 gets the inner text content of that element.

    Floki.attribute/2 grabs a specific attribute like href.

We continue this process, using precise selectors targeted at small elements to cleanly extract each data field we need.
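One detail worth calling out: Floki.attribute/2 returns a list of values (one per matched element), which is why the code takes List.first/1 before using the href or title value. A quick illustration (the HTML fragment here is made up):

# Parse a small fragment and query it the same way the scraper does
{:ok, frag} = Floki.parse_fragment(~s(<span class="titleline"><a href="https://example.com/post">Example post</a></span>))
link = Floki.find(frag, "span.titleline a")

Floki.text(link)
#=> "Example post"

Floki.attribute(link, "href")
#=> ["https://example.com/post"]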

    Putting It All Together

The full code ties together:

  • Making the HTTP request
  • Parsing HTML
  • Iterating rows
  • Extracting article data
  • Printing output

with some helpful error handling added as well.

    Here is the complete scraper:

defmodule HackerNewsScraper do
  @url "https://news.ycombinator.com/"

  def scrape do
    case HTTPoison.get(@url) do
      {:ok, %HTTPoison.Response{status_code: 200, body: body}} ->
        # Parse the HTML content of the page using Floki
        {:ok, doc} = Floki.parse_document(body)

        # Find all rows in the table
        rows = Floki.find(doc, "tr")

        # Iterate through the rows, remembering the current article row in the
        # accumulator so the details row that follows can be paired with it
        Enum.reduce(rows, nil, fn row, current_article ->
          cond do
            Floki.attribute(row, "class") == ["athing"] ->
              # This is an article title row; remember it for the next iteration
              row

            current_article != nil ->
              # This is the details row for the remembered article
              print_article(current_article, row)
              nil

            true ->
              # Spacer and other rows; skip them
              nil
          end
        end)

      {:ok, %HTTPoison.Response{status_code: code}} ->
        IO.puts("Failed to retrieve the page. Status code: #{code}")

      {:error, reason} ->
        IO.puts("Failed to make the HTTP request. Reason: #{inspect(reason)}")
    end
  end

  defp print_article(article_row, details_row) do
    # Extract the article title and URL from the title row
    title_elem = Floki.find(article_row, "span.titleline a")
    article_title = Floki.text(title_elem)
    article_url = title_elem |> Floki.attribute("href") |> List.first()

    # Relative links (e.g. "item?id=...") need the base URL prepended
    full_url =
      if String.starts_with?(article_url, "http"),
        do: article_url,
        else: @url <> article_url

    # Extract points, author, timestamp and comment count from the details row
    subtext = Floki.find(details_row, "td.subtext")
    points = Floki.text(Floki.find(subtext, "span.score"))
    author = Floki.text(Floki.find(subtext, "a.hnuser"))
    timestamp = subtext |> Floki.find("span.age") |> Floki.attribute("title") |> List.first()
    comments_elem = Floki.find(subtext, "a:fl-contains('comment')")
    comments = if comments_elem == [], do: "0", else: Floki.text(comments_elem)

    # Print the extracted information
    IO.puts("Title: #{article_title}")
    IO.puts("URL: #{full_url}")
    IO.puts("Points: #{points}")
    IO.puts("Author: #{author}")
    IO.puts("Timestamp: #{timestamp}")
    IO.puts("Comments: #{comments}")
    IO.puts(String.duplicate("-", 50))
  end
end

# Run the scraper
HackerNewsScraper.scrape()

Save this code into a file like lib/scraper.ex and run it with:

mix run lib/scraper.ex

    You should see scraped headlines print out!
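The output for each article looks something like this (titles, numbers and timestamps will of course differ):

Title: Example article title
URL: https://example.com/some-post
Points: 123 points
Author: someuser
Timestamp: 2024-01-21T10:15:30
Comments: 45 comments
--------------------------------------------------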

This is great as a learning exercise, but it is easy to see that a scraper running from a single IP (or even through a single proxy server) is prone to getting blocked. If you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • With millions of high speed rotating proxies located all over the world,
  • With our automatic IP rotation,
  • With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions), and
  • With our automatic CAPTCHA solving technology,

hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed by a simple API like below in any programming language.

In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes: you can just get the data and parse it in any language like Node, Puppeteer, or PHP, or with any framework like Scrapy or Nutch. In all these cases, you can simply call the API URL with rendering support enabled.

    We have a running offer of 1000 API calls completely free. Register and get your free API Key.


    The easiest way to do Web Scraping

Get HTML from any page with a simple API call. We handle proxy rotation, browser identities, automatic retries, CAPTCHAs, JavaScript rendering, etc. automatically for you.


    Try ProxiesAPI for free

    curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

    <!doctype html>
    <html>
    <head>
        <title>Example Domain</title>
        <meta charset="utf-8" />
        <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
        <meta name="viewport" content="width=device-width, initial-scale=1" />
    ...
