This Java code scrapes article titles, URLs, points, authors, timestamps, and comment counts from the Hacker News homepage. By the end, you'll understand how to use Jsoup and CSS selectors to extract information from an HTML page.

The page in question is the Hacker News homepage at https://news.ycombinator.com/.

Let's get started!

Prerequisites

To run this web scraper, you need:

JDK 11+ (the full code below uses String.repeat, which was added in Java 11)

Jsoup library

Walkthrough of Code

We first import Jsoup and some helpful classes:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Jsoup - Main Jsoup class with connection methods

Document - Represents an HTML document parsed by Jsoup

Element - Single element node in HTML doc

Elements - Collection of elements

In the main method, we define the Hacker News homepage URL:

String url = "https://news.ycombinator.com/";

We send a GET request with Jsoup to retrieve and parse the page HTML:

Document doc = Jsoup.connect(url).get();

The doc variable now contains a Jsoup Document representing the parsed Hacker News homepage.

Inspecting the page

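If you inspect the page with your browser's developer tools, you will notice that each story is housed in a tr tag with the class athing. The row directly after it holds that story's details: points, author, timestamp, and comments.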

Next, we select all table rows on the page:

Elements rows = doc.select("tr");

We use two variables to track state as we iterate through rows:

Element currentArticle = null;
String currentRowType = null;

currentArticle - Saves current article row Element

currentRowType - Tracks if we're on "article" or "details" row

We loop through the rows:

for (Element row : rows) {

// row processing code

}

Inside the loop, we first check if row has "athing" class - indicating it's an article row:

if (row.hasClass("athing")) {

    currentArticle = row;
    currentRowType = "article";

}

If so, we save the row to currentArticle and set currentRowType accordingly.

Next, we check whether the previous row was an article row, which means the current row holds that article's details:

} else if ("article".equals(currentRowType)) {

    // Extract article data from currentArticle

}

Inside this conditional, we extract article data if currentArticle is not null:

if (currentArticle != null) {

    // Extract title
    Element titleElem = currentArticle.selectFirst("span.titleline");
    String articleTitle = titleElem.selectFirst("a").text();

    // Extract URL
    String articleUrl = titleElem.selectFirst("a").attr("href");

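    // Extract points, author, timestamp, comments
    // (job postings have no score or author, so guard each lookup against null)
    Element subtext = row.selectFirst("td.subtext");
    Element scoreElem = subtext.selectFirst("span.score");
    Element authorElem = subtext.selectFirst("a.hnuser");
    Element ageElem = subtext.selectFirst("span.age");
    String points = scoreElem != null ? scoreElem.text() : "0 points";
    String author = authorElem != null ? authorElem.text() : "";
    String timestamp = ageElem != null ? ageElem.attr("title") : "";
    // Match "comment" rather than "comments" so that "1 comment" is found too
    Element commentsElem = subtext.selectFirst("a:contains(comment)");
    String comments = commentsElem != null ? commentsElem.text() : "0";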

    // Print extracted data
    System.out.println("Title: " + articleTitle);
    // ...
}

Let's break this down:

titleElem - Get element with title text

articleTitle - Extract text from anchor tag

articleUrl - Get href attribute of anchor tag

subtext - Element with additional details

points/author/timestamp - Extract the score text, the username from the a.hnuser link, and the title attribute of the age span

commentsElem - Get comments anchor element

comments - Extract text, 0 if no element
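All of the values extracted above are strings, such as "123 points" or "45 comments". If numbers or dates are needed later, a minimal sketch of converting them with only the JDK might look like this. HnFields, leadingInt, and parseTimestamp are hypothetical names, and the timestamp format (an ISO-8601 local date-time, possibly followed by other tokens) is an assumption about what the title attribute contains, so check it against the live page.

```java
import java.time.LocalDateTime;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; not part of the scraper in this article.
class HnFields {

    // Matches a run of digits at the start of the text, ignoring leading whitespace.
    private static final Pattern LEADING_INT = Pattern.compile("^\\s*(\\d+)");

    // Returns the number that starts texts like "123 points" or "45 comments".
    // Falls back to the given value when the text is null or has no leading number
    // (for example the "discuss" link shown on stories without comments).
    static int leadingInt(String text, int fallback) {
        if (text == null) {
            return fallback;
        }
        Matcher m = LEADING_INT.matcher(text);
        return m.find() ? Integer.parseInt(m.group(1)) : fallback;
    }

    // ASSUMPTION: the title attribute begins with an ISO-8601 local date-time,
    // e.g. "2023-09-01T12:34:56", optionally followed by whitespace and more tokens.
    // Throws java.time.format.DateTimeParseException if that does not hold,
    // including for an empty string.
    static LocalDateTime parseTimestamp(String title) {
        String firstToken = title.trim().split("\\s+")[0];
        return LocalDateTime.parse(firstToken);
    }

    public static void main(String[] args) {
        System.out.println(leadingInt("123 points", 0));
        System.out.println(leadingInt("discuss", 0));
        System.out.println(parseTimestamp("2023-09-01T12:34:56"));
    }
}
```

Taking only the leading digits sidesteps any question about which whitespace character separates the number from the word that follows it.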

Finally, we reset the state variables so that the next article row starts fresh. Any other row, such as the spacer row between stories, is simply skipped.
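To see the row-pairing logic in isolation, here is a simplified model that runs without Jsoup or a network connection. Row and RowPairing are hypothetical stand-ins: each Row carries only a CSS class and some text. The sketch also collapses the two state variables into one, on the observation that currentRowType is "article" exactly when currentArticle is not null; that is a simplification for illustration, not a change to the scraper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a table row: just a CSS class and some text.
class Row {
    final String cssClass;
    final String text;

    Row(String cssClass, String text) {
        this.cssClass = cssClass;
        this.text = text;
    }
}

class RowPairing {

    // Pairs each "athing" row with the row that immediately follows it.
    static List<String> pair(List<Row> rows) {
        List<String> pairs = new ArrayList<>();
        Row currentArticle = null; // non-null only right after an article row

        for (Row row : rows) {
            if ("athing".equals(row.cssClass)) {
                // Article row: remember it and wait for its details row.
                currentArticle = row;
            } else if (currentArticle != null) {
                // Details row: combine it with the remembered article, then reset.
                pairs.add(currentArticle.text + " | " + row.text);
                currentArticle = null;
            }
            // Any other row (such as a spacer) is ignored.
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("athing", "Story A"),
                new Row("", "120 points by alice"),
                new Row("spacer", ""),
                new Row("athing", "Story B"),
                new Row("", "45 points by bob"));

        for (String p : pair(rows)) {
            System.out.println(p);
        }
    }
}
```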

Full code:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;

public class HackerNewsScraper {

    public static void main(String[] args) {
        // Define the URL of the Hacker News homepage
        String url = "https://news.ycombinator.com/";

        try {
            // Send a GET request to the URL and parse the HTML content
            Document doc = Jsoup.connect(url).get();

            // Find all rows in the table
            Elements rows = doc.select("tr");

            // Initialize variables to keep track of the current article and row type
            Element currentArticle = null;
            String currentRowType = null;

            // Iterate through the rows to scrape articles
            for (Element row : rows) {
                if (row.hasClass("athing")) {
                    // This is an article row
                    currentArticle = row;
                    currentRowType = "article";
                } else if ("article".equals(currentRowType)) {
                    // This is the details row
                    if (currentArticle != null) {
                        // Extract information from the current article and details row
                        Element titleElem = currentArticle.selectFirst("span.titleline");
                        if (titleElem != null) {
                            String articleTitle = titleElem.selectFirst("a").text();  // Get the text of the anchor element
                            String articleUrl = titleElem.selectFirst("a").attr("href");  // Get the href attribute of the anchor element

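                            // Job postings have no score or author, so guard each lookup against null
                            Element subtext = row.selectFirst("td.subtext");
                            Element scoreElem = subtext.selectFirst("span.score");
                            Element authorElem = subtext.selectFirst("a.hnuser");
                            Element ageElem = subtext.selectFirst("span.age");
                            String points = scoreElem != null ? scoreElem.text() : "0 points";
                            String author = authorElem != null ? authorElem.text() : "";
                            String timestamp = ageElem != null ? ageElem.attr("title") : "";
                            // Match "comment" rather than "comments" so that "1 comment" is found too
                            Element commentsElem = subtext.selectFirst("a:contains(comment)");
                            String comments = commentsElem != null ? commentsElem.text() : "0";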

                            // Print the extracted information
                            System.out.println("Title: " + articleTitle);
                            System.out.println("URL: " + articleUrl);
                            System.out.println("Points: " + points);
                            System.out.println("Author: " + author);
                            System.out.println("Timestamp: " + timestamp);
                            System.out.println("Comments: " + comments);
                            System.out.println("-".repeat(50));  // Separating articles
                        }
                    }

                    // Reset the current article and row type
                    currentArticle = null;
                    currentRowType = null;
                } else if ("height:5px".equals(row.attr("style"))) {
                    // This is the spacer row, skip it
                    continue;
                }
            }
        } catch (IOException e) {
            System.err.println("Failed to retrieve the page. Error: " + e.getMessage());
        }
    }
}
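One thing to watch for in the output: some href values on the page are relative. Self posts such as Ask HN threads, for example, link to a path like item?id=123 (an illustrative id), so the printed URL is not always absolute. A minimal sketch of resolving such links against the homepage URL with java.net.URI follows. HnLinks and absolute are hypothetical names, not part of the scraper above.

```java
import java.net.URI;

// Hypothetical helper class; not part of the scraper in this article.
class HnLinks {

    // Base URL taken from the article; relative hrefs are resolved against it.
    static final URI BASE = URI.create("https://news.ycombinator.com/");

    // Returns an absolute URL for the given href.
    // Absolute hrefs come back unchanged; relative ones are joined to BASE.
    // Throws IllegalArgumentException if href is not a syntactically valid URI.
    static String absolute(String href) {
        return BASE.resolve(href).toString();
    }

    public static void main(String[] args) {
        System.out.println(absolute("item?id=123"));
        System.out.println(absolute("https://example.com/post"));
    }
}
```

Jsoup offers the same behavior through absUrl("href") on an element, which resolves against the URL the document was fetched from, so the JDK version is mainly useful when the href has already been extracted as a plain string.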

This scraper works well as a learning exercise, but it is easy to see that it is prone to getting blocked because it sends every request from a single IP. If you need a setup that handles thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

Our rotating proxy server, Proxies API, provides a simple API that can solve all IP blocking problems instantly.

With millions of high-speed rotating proxies located all over the world,

With our automatic IP rotation,

With our automatic User-Agent string rotation (which simulates requests from different, valid web browsers and web browser versions),

With our automatic CAPTCHA-solving technology,

hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed by a simple API like below in any programming language.

In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes. You can just get the data and parse it in any language, such as Node or PHP, or with any framework, such as Scrapy or Nutch. In all these cases you can just call the URL with render support like so:

curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"
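The same request URL can be assembled from Java. The sketch below only builds the string, reusing the key, render, and url parameters from the curl example. Percent-encoding the target URL is an assumption: it is standard practice for query strings, but the service's exact requirements are not stated here. ProxyRequestUrl and build are hypothetical names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper class; only the endpoint and parameter names come from the curl example.
class ProxyRequestUrl {

    // Builds the request URL for the given API key and target page.
    // ASSUMPTION: the service accepts a percent-encoded "url" parameter.
    static String build(String apiKey, String targetUrl, boolean render) {
        try {
            return "http://api.proxiesapi.com/?key=" + URLEncoder.encode(apiKey, "UTF-8")
                    + "&render=" + render
                    + "&url=" + URLEncoder.encode(targetUrl, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // API_KEY is a placeholder, as in the curl example.
        System.out.println(build("API_KEY", "https://news.ycombinator.com/", true));
    }
}
```

Encoding matters most when the target URL carries its own ? or & characters, which would otherwise be read as part of the outer query string. The resulting string could then be passed to Jsoup.connect in place of the Hacker News URL.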

We have a running offer of 1000 API calls completely free. Register and get your free API Key.
