Scraping Hacker News Articles with Java

Jan 21, 2024 · 6 min read

This Java code scrapes article titles, URLs, points, authors, timestamps, and comment counts from the Hacker News homepage. By the end, you'll understand how to use Jsoup and CSS selectors to extract information from an HTML page.

Let's get started!

Prerequisites

To run this web scraper, you need:

  • JDK 11+ (the full code calls String.repeat, which was added in Java 11)
  • Jsoup library

Walkthrough of Code

    We first import Jsoup and some helpful classes:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    
  • Jsoup - Main Jsoup class with connection methods
  • Document - Represents an HTML document parsed by Jsoup
  • Element - Single element node in HTML doc
  • Elements - Collection of elements

    In the main method, we define the Hacker News homepage URL:

    String url = "https://news.ycombinator.com/";
    

    We send a GET request with Jsoup to retrieve and parse the page HTML:

    Document doc = Jsoup.connect(url).get();
    

    The doc variable now contains a Jsoup Document representing the parsed Hacker News homepage.
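In practice, it can help to set a timeout and an explicit User-Agent on the connection before calling get(). The following is a minimal sketch rather than part of the original scraper. The class name HnFetcher, the 10-second timeout, and the User-Agent string are all illustrative assumptions.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;

final class HnFetcher {
    // Assumed values for illustration; tune them for your own use.
    static final int TIMEOUT_MILLIS = 10_000;
    static final String USER_AGENT = "Mozilla/5.0 (compatible; HnScraperExample/1.0)";

    // Builds a configured connection without sending any request yet.
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent(USER_AGENT)
                .timeout(TIMEOUT_MILLIS);
    }

    // Sends the GET request and parses the response body into a Document.
    static Document fetch(String url) throws IOException {
        return prepare(url).get();
    }
}
```

Calling HnFetcher.fetch("https://news.ycombinator.com/") would then stand in for the plain Jsoup.connect(url).get() call.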

    Inspecting the page

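    Inspecting the page's HTML shows that each story sits in a <tr> element with the class athing, and that its details (points, author, timestamp, comments) sit in the <tr> that follows it.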

    Next, we select all table rows on the page:

    Elements rows = doc.select("tr");
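
Because article rows carry the athing class, a CSS selector can also target them directly, and Jsoup's nextElementSibling() reaches the row that follows. The sketch below runs against a small hand-written HTML fragment to compare the two selectors. The fragment is an assumption that imitates the row layout and is not a copy of the live page. The class and method names are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.ArrayList;
import java.util.List;

final class SelectorDemo {
    // Hand-written fragment imitating the homepage's row layout (assumed).
    static final String SAMPLE_HTML =
        "<table>"
        + "<tr class=\"athing\"><td><span class=\"titleline\">"
        + "<a href=\"https://example.com/a\">First story</a></span></td></tr>"
        + "<tr><td class=\"subtext\"><span class=\"score\">10 points</span></td></tr>"
        + "<tr class=\"spacer\" style=\"height:5px\"></tr>"
        + "<tr class=\"athing\"><td><span class=\"titleline\">"
        + "<a href=\"https://example.com/b\">Second story</a></span></td></tr>"
        + "<tr><td class=\"subtext\"><span class=\"score\">5 points</span></td></tr>"
        + "</table>";

    // Counts the elements matched by a CSS query.
    static int countRows(String html, String cssQuery) {
        return Jsoup.parse(html).select(cssQuery).size();
    }

    // Pairs each article title with the score found in the following row.
    static List<String> titlesWithScores(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element article : doc.select("tr.athing")) {
            Element link = article.selectFirst("span.titleline a");
            Element details = article.nextElementSibling();
            Element score = details != null ? details.selectFirst("span.score") : null;
            if (link != null) {
                result.add(link.text() + " | " + (score != null ? score.text() : "no score"));
            }
        }
        return result;
    }
}
```

The "tr" selector matches every row in the fragment, including the spacer, while "tr.athing" matches only the two article rows.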
    

    We use two variables to track state as we iterate through rows:

    Element currentArticle = null;
    String currentRowType = null;
    
  • currentArticle - Saves current article row Element
  • currentRowType - Tracks if we're on "article" or "details" row

    We loop through the rows:

    for (Element row : rows) {
    
    // row processing code
    
    }
    

    Inside the loop, we first check whether the row has the "athing" class, which marks it as an article row:

    if (row.hasClass("athing")) {
    
        currentArticle = row;
        currentRowType = "article";
    
    }
    

    If so, we save the row to currentArticle and set currentRowType accordingly.

    Next, we check whether the previous row was an article row, which means the current row holds that article's details:

    } else if ("article".equals(currentRowType)) {
    
        // Extract article data from currentArticle
    
    }
    

    Inside this conditional, we extract article data if currentArticle is not null:

    if (currentArticle != null) {
    
        // Extract title
        Element titleElem = currentArticle.selectFirst("span.titleline");
        String articleTitle = titleElem.selectFirst("a").text();
    
        // Extract URL
        String articleUrl = titleElem.selectFirst("a").attr("href");
    
        // Extract points, author, timestamp, comments
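        Element subtext = row.selectFirst("td.subtext");
        // Some rows (for example, job postings) can lack a score or author, so guard every lookup
        Element scoreElem = subtext != null ? subtext.selectFirst("span.score") : null;
        Element authorElem = subtext != null ? subtext.selectFirst("a.hnuser") : null;
        Element ageElem = subtext != null ? subtext.selectFirst("span.age") : null;
        // "comment" also matches the singular "1 comment"
        Element commentsElem = subtext != null ? subtext.selectFirst("a:contains(comment)") : null;
        String points = scoreElem != null ? scoreElem.text() : "0 points";
        String author = authorElem != null ? authorElem.text() : "";
        String timestamp = ageElem != null ? ageElem.attr("title") : "";
        String comments = commentsElem != null ? commentsElem.text() : "0";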
    
        // Print extracted data
        System.out.println("Title: " + articleTitle);
        // ...
    }
    

    Let's break this down:

  • titleElem - Get element with title text
  • articleTitle - Extract text from anchor tag
  • articleUrl - Get href attribute of anchor tag
  • subtext - Element with additional details
  • points - Text of the span.score element
  • author - Text of the a.hnuser anchor
  • timestamp - Value of the title attribute on the span.age element
  • commentsElem - Get comments anchor element
  • comments - Extract text, 0 if no element

    Finally, we reset the state variables so the next article starts fresh, and we skip spacer rows, which carry no data.
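
The extracted values are all strings, for example "123 points" or an href that may be relative. A minimal sketch of turning them into typed values follows. The class name HnFields and its methods are hypothetical. The sketch assumes the title attribute of span.age begins with an ISO-style local date-time such as 2024-01-21T12:34:56, and that relative links should be resolved against the homepage URL.

```java
import java.net.URI;
import java.time.LocalDateTime;
import java.time.format.DateTimeParseException;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HnFields {
    private static final Pattern LEADING_INT = Pattern.compile("^\\s*(\\d{1,9})");

    // Reads the number at the start of texts like "123 points".
    // Returns the fallback when the text is null or has no leading number.
    static int leadingInt(String text, int fallback) {
        if (text == null) {
            return fallback;
        }
        Matcher m = LEADING_INT.matcher(text);
        return m.find() ? Integer.parseInt(m.group(1)) : fallback;
    }

    // Parses the first whitespace-separated token as an ISO local date-time.
    // Assumed format; returns empty when the token does not parse.
    static Optional<LocalDateTime> parseAge(String title) {
        if (title == null) {
            return Optional.empty();
        }
        String firstToken = title.trim().split("\\s+")[0];
        try {
            return Optional.of(LocalDateTime.parse(firstToken));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    // Resolves a possibly relative href against the page URL.
    // Throws IllegalArgumentException if either string is not a valid URI.
    static String absoluteUrl(String baseUrl, String href) {
        return URI.create(baseUrl).resolve(href).toString();
    }
}
```

For example, HnFields.leadingInt(points, 0) would convert the points string to an int, and HnFields.absoluteUrl(url, articleUrl) would expand a relative link such as item?id=1. Jsoup's own absUrl("href") is an alternative for the URL case.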

    Full code:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    import java.io.IOException;
    
    public class HackerNewsScraper {
    
        public static void main(String[] args) {
            // Define the URL of the Hacker News homepage
            String url = "https://news.ycombinator.com/";
    
            try {
                // Send a GET request to the URL and parse the HTML content
                Document doc = Jsoup.connect(url).get();
    
                // Find all rows in the table
                Elements rows = doc.select("tr");
    
                // Initialize variables to keep track of the current article and row type
                Element currentArticle = null;
                String currentRowType = null;
    
                // Iterate through the rows to scrape articles
                for (Element row : rows) {
                    if (row.hasClass("athing")) {
                        // This is an article row
                        currentArticle = row;
                        currentRowType = "article";
                    } else if ("article".equals(currentRowType)) {
                        // This is the details row
                        if (currentArticle != null) {
                            // Extract information from the current article and details row
                            Element titleElem = currentArticle.selectFirst("span.titleline");
                            if (titleElem != null) {
                                String articleTitle = titleElem.selectFirst("a").text();  // Get the text of the anchor element
                                String articleUrl = titleElem.selectFirst("a").attr("href");  // Get the href attribute of the anchor element
    
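                                Element subtext = row.selectFirst("td.subtext");
                                // Some rows (for example, job postings) can lack a score or author, so guard every lookup
                                Element scoreElem = subtext != null ? subtext.selectFirst("span.score") : null;
                                Element authorElem = subtext != null ? subtext.selectFirst("a.hnuser") : null;
                                Element ageElem = subtext != null ? subtext.selectFirst("span.age") : null;
                                // "comment" also matches the singular "1 comment"
                                Element commentsElem = subtext != null ? subtext.selectFirst("a:contains(comment)") : null;
                                String points = scoreElem != null ? scoreElem.text() : "0 points";
                                String author = authorElem != null ? authorElem.text() : "";
                                String timestamp = ageElem != null ? ageElem.attr("title") : "";
                                String comments = commentsElem != null ? commentsElem.text() : "0";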
    
                                // Print the extracted information
                                System.out.println("Title: " + articleTitle);
                                System.out.println("URL: " + articleUrl);
                                System.out.println("Points: " + points);
                                System.out.println("Author: " + author);
                                System.out.println("Timestamp: " + timestamp);
                                System.out.println("Comments: " + comments);
                                System.out.println("-".repeat(50));  // Separating articles
                            }
                        }
    
                        // Reset the current article and row type
                        currentArticle = null;
                        currentRowType = null;
                    } else if ("height:5px".equals(row.attr("style"))) {
                        // This is the spacer row, skip it
                        continue;
                    }
                }
            } catch (IOException e) {
                System.err.println("Failed to retrieve the page. Error: " + e.getMessage());
            }
        }
    }
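
The listing above prints each field as it goes. If the results need to be kept or exported, one option is to collect them into small value objects first. The sketch below is an illustrative assumption, not part of the original scraper: the Article class and its CSV helper are hypothetical names, and it uses a plain class rather than a record so that it compiles on JDK 11.

```java
final class Article {
    final String title;
    final String url;
    final String points;
    final String author;

    Article(String title, String url, String points, String author) {
        this.title = title;
        this.url = url;
        this.points = points;
        this.author = author;
    }

    // Quotes a field when it contains a comma, a double quote, or a line break,
    // and doubles any embedded double quotes.
    static String csvField(String value) {
        String v = value == null ? "" : value;
        if (v.contains(",") || v.contains("\"") || v.contains("\n") || v.contains("\r")) {
            return "\"" + v.replace("\"", "\"\"") + "\"";
        }
        return v;
    }

    // Joins the fields into one CSV line in a fixed column order.
    String toCsvLine() {
        return String.join(",",
                csvField(title), csvField(url), csvField(points), csvField(author));
    }
}
```

Inside the loop, the println calls could then be replaced by adding new Article(articleTitle, articleUrl, points, author) to a List<Article> and writing each toCsvLine() to a file afterwards.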

    This is great as a learning exercise, but a scraper like this sends every request from a single IP address, so it is prone to getting blocked. If you need to handle thousands of fetches every day, a professional rotating proxy service that rotates IPs for you is almost a must.

    Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • With millions of high-speed rotating proxies located all over the world,
  • With our automatic IP rotation,
  • With our automatic User-Agent string rotation (which simulates requests from different, valid web browsers and web browser versions),
  • With our automatic CAPTCHA solving technology,

    hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed by a simple API like below in any programming language.

    In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes. You can just get the data and parse it in any language, such as Node or PHP, or with any framework, such as Scrapy or Nutch. In all these cases you can just call the URL with render support like so:

    curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"
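
From Java, the same request URL can be assembled before it is handed to Jsoup. The following is a minimal sketch. The endpoint and the key, render, and url parameters are taken from the curl example, while the class name ProxiesApiUrl is hypothetical. Percent-encoding the target URL is standard practice for query parameters, but whether the service requires it is an assumption here.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class ProxiesApiUrl {
    // Endpoint as shown in the curl example.
    static final String ENDPOINT = "http://api.proxiesapi.com/";

    // Builds the request URL, percent-encoding the key and the target URL
    // so that characters such as '&' or '?' in the target do not break the query.
    static String build(String apiKey, String targetUrl, boolean render) {
        return ENDPOINT
                + "?key=" + URLEncoder.encode(apiKey, StandardCharsets.UTF_8)
                + "&render=" + render
                + "&url=" + URLEncoder.encode(targetUrl, StandardCharsets.UTF_8);
    }
}
```

The resulting string could then be passed to Jsoup.connect(...) in place of the direct Hacker News URL.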
    
    

    We have a running offer of 1000 API calls completely free. Register and get your free API Key.
