Scraping Hacker News with Elixir

Hacker News is a popular tech news aggregator where users post links to interesting articles, blog posts, projects, and more. As an avid reader of Hacker News, you may have wondered if there was a way to programmatically scrape articles from the site. In this tutorial, we'll walk through Elixir code to scrape the top headlines from the Hacker News homepage.

This is the page we are talking about…

Overview

We'll use the Elixir programming language along with a few helper libraries to:

Make an HTTP request to retrieve the Hacker News homepage HTML
Parse the HTML content using a parser called Floki
Extract specific elements from the parsed content to get article titles, points, authors etc.
Print out the extracted information

The full code is provided at the end of this article so you can use it yourself.

Let's get started!

Install Elixir

If you don't already have Elixir installed, you can install it by following the instructions on the official Elixir site. Make sure you have a recent version installed (at least Elixir 1.11).

Create an Elixir Project

Start a new project using the mix tool that comes with Elixir:

mix new hacker_news

This will generate a simple project structure for us to work in.

Install Dependencies

Our code imports a few helper libraries:

HTTPoison - for making HTTP requests

Floki - for HTML parsing

Let's grab these by adding them to mix.exs. Open up the new file and under defp deps do:

[
  {:httpoison, "~> 1.8"},
  {:floki, "~> 0.31.0"}
]

Now run mix deps.get to fetch them locally.

Making HTTP Requests

The first thing our scraper does is make an HTTP GET request to retrieve the homepage HTML content:

HTTPoison.get(@url)

The @url module attribute contains the Hacker News homepage URL.

HTTPoison.get/1 sends the GET request asynchronously and returns a tuple:

{:ok, %HTTPoison.Response{...}}

This response contains the status code and body of the page if the request succeeded. We pattern match on the tuple to handle different cases:

case HTTPoison.get(@url) do
  {:ok, %HTTPoison.Response{status_code: 200, body: body}} ->
    # Request succeeded

  {:ok, %HTTPoison.Response{status_code: code}} ->
    # Request failed

  {:error, reason} ->
    # Error making request
end

For a status code of 200, the body contains the raw HTML we want to parse.

Parsing HTML with Floki

Floki provides a nice API for searching HTML documents using CSS selectors. We can parse the entire body using:

doc = Floki.parse_document(body)

This builds a tree structure representing elements like

, , tag with the class athing

The Hacker News homepage displays articles in a table. We want to iterate through each

row and extract data.

First find all rows:

rows = Floki.find(doc, "tr")

Then we can loop through them:

Enum.each(rows, fn row ->
  # extract data from each row
end)

There are a few types of rows we care about:

class="athing" - the article title row

Details row below that

Spacer rows in between articles

We check Floki.attribute(row, "class") to identify row types.

Extracting Article Data

When we hit an article row, we save a reference to that element:

if Floki.attribute(row, "class") == ["athing"] do
  current_article = row
end

On the next iteration when we reach the details row, we can now connect it to the article title and extract additional metadata like points, comments etc.

elsif current_row_type == "article" do

  if current_article do

    # Extract article title
    title_elem = Floki.find(current_article, "span.titleline a")
    article_title = Floki.text(title_elem)

    # Extract article URL
    article_url = Floki.attribute(title_elem, "href")

    # Get subtext details row
    subtext = Floki.find(row, "td.subtext")

    # Extract points
    points = Floki.text(Floki.find(subtext, "span.score"))

    # Extract author
    author = Floki.text(Floki.find(subtext, "a.hnuser"))

    # Extract timestamp
    timestamp = Floki.attribute(Floki.find(subtext, "span.age"), "title")

    # Extract comment count
    comments_elem = Floki.find(subtext, "a:contains('comments')")
    comments = if comments_elem, do: Floki.text(comments_elem), else: "0"

    # Print out extracted data
    IO.puts("Title: #{article_title}")
    IO.puts("URL: #{@url <> article_url}")
    IO.puts("Points: #{points}")
    IO.puts("Author: #{author}")
    IO.puts("Timestamp: #{timestamp}")
    IO.puts("Comments: #{comments}")

  end

end

The key thing to note here are the CSS selectors. For example:

span.titleline a

This finds the anchor tag inside a with class titleline. This pinpoints the exact element with the article title text we want.

Floki.text/1 gets the inner text content of that element.

Floki.attribute/2 grabs a specific attribute like href.

We continue this process, using very precise selectors targeted small elements to cleanly extract each data field we need.

Putting It All Together

The full code ties together:

Making the HTTP request

Parsing HTML

Iterating rows

Extacting article data

Printing output

With some helpful error handling added as well.

Here is the complete scraper:

defmodule HackerNewsScraper do
  @url "https://news.ycombinator.com/"

  def scrape do
    case HTTPoison.get(@url) do
      {:ok, %HTTPoison.Response{status_code: 200, body: body}} ->
        # Parse the HTML content of the page using Floki
        doc = Floki.parse_document(body)

        # Find all rows in the table
        rows = Floki.find(doc, "tr")

        # Initialize variables to keep track of the current article and row type
        current_article = nil
        current_row_type = nil

        # Iterate through the rows to scrape articles
        Enum.each(rows, fn row ->
          if Floki.attribute(row, "class") == ["athing"] do
            # This is an article row
            current_article = row
            current_row_type = "article"
          elsif current_row_type == "article" do
            # This is the details row
            if current_article do
              # Extract information from the current article and details row
              title_elem = Floki.find(current_article, "span.titleline a")
              article_title = Floki.text(title_elem)
              article_url = Floki.attribute(title_elem, "href")

              subtext = Floki.find(row, "td.subtext")
              points = Floki.text(Floki.find(subtext, "span.score"))
              author = Floki.text(Floki.find(subtext, "a.hnuser"))
              timestamp = Floki.attribute(Floki.find(subtext, "span.age"), "title")
              comments_elem = Floki.find(subtext, "a:contains('comments')")
              comments = if comments_elem, do: Floki.text(comments_elem), else: "0"

              # Print the extracted information
              IO.puts("Title: #{article_title}")
              IO.puts("URL: #{@url <> article_url}")
              IO.puts("Points: #{points}")
              IO.puts("Author: #{author}")
              IO.puts("Timestamp: #{timestamp}")
              IO.puts("Comments: #{comments}")
              IO.puts(String.duplicate("-", 50))
            end

            # Reset the current article and row type
            current_article = nil
            current_row_type = nil
          elsif Floki.attribute(row, "style") == "height:5px" do
            # This is the spacer row, skip it
            :ok
          else
            # Other types of rows, skip them
            :ok
          end
        end)

      {:ok, %HTTPoison.Response{status_code: code}} ->
        IO.puts("Failed to retrieve the page. Status code: #{code}")

      {:error, reason} ->
        IO.puts("Failed to make the HTTP request. Reason: #{reason}")
    end
  end
end

# Run the scraper
HackerNewsScraper.scrape()

Save this code into a file like lib/scraper.ex and run with:

$ elixir lib/scraper.ex

You should see scraped headlines print out!

This is great as a learning exercise but it is easy to see that even the proxy server itself is prone to get blocked as it uses a single IP. In this scenario where you may want a proxy that handles thousands of fetches every day using a professional rotating proxy service to rotate IPs is almost a must.

Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

With millions of high speed rotating proxies located all over the world,

With our automatic IP rotation

With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)

With our automatic CAPTCHA solving technology,

Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed by a simple API like below in any programming language.

In fact, you don't even have to take the pain of loading Puppeteer as we render Javascript behind the scenes and you can just get the data and parse it any language like Node, Puppeteer or PHP or using any framework like Scrapy or Nutch. In all these cases you can just call the URL with render support like so:

curl "<http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com>"

We have a running offer of 1000 API calls completely free. Register and get your free API Key.

Browse by language:

The easiest way to do Web Scraping

Get HTML from any page with a simple API call. We handle proxy rotation, browser identities, automatic retries, CAPTCHAs, JavaScript rendering, etc automatically for you

Try ProxiesAPI for free

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

<!doctype html>
<html>
<head>
    <title>Example Domain</title>
    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
...

etc. that we can now query.

Scraping Rows from the Table

Inspecting the page

You can notice that the items are housed inside a