Scraping Hacker News with Objective-C

Jan 21, 2024 · 8 min read

Web scraping can be an intimidating topic for beginners, but it doesn't have to be! In this comprehensive guide, we'll walk through how to scrape article data from Hacker News using Objective-C and XML parsing.

Whether you're just getting started with web scraping or are new to Objective-C, I'll break things down step-by-step to help you better understand how everything works. By the end, you'll have the knowledge to start scraping Hacker News as well as the confidence to apply these learnings to your own web scraping projects.

The page we'll be scraping is the Hacker News homepage (https://news.ycombinator.com/).

Let's get started!

Installation & Setup

Before we dive into the code, let's quickly get set up with the Apple frameworks we'll need for this scraping script:

Foundation Framework

The Foundation framework provides core data types we'll rely on like NSURL, NSURLRequest, NSData, etc. It comes included with Xcode, so no separate installation is needed.

XML Parsing

We'll use an XML parser to process the HTML content returned by the Hacker News website. On macOS, Foundation ships with NSXMLDocument (it is not available on iOS), which we can initialize like so:

NSError *error = nil;
NSXMLDocument *document = [[NSXMLDocument alloc] initWithData:responseData options:NSXMLDocumentTidyHTML error:&error];

And that's it for setup! Just import Foundation and we're ready to start scraping.

Scraping Code Walkthrough

With the basics covered, let's dive into the code...

Define URL and Create Request

First we construct the URL pointing to the Hacker News homepage we want to scrape:

NSURL *url = [NSURL URLWithString:@"https://news.ycombinator.com/"];

Next we create the actual NSURLRequest that will be used to retrieve and download the webpage content:

NSURLRequest *request = [NSURLRequest requestWithURL:url];

This defines the destination URL. The request will return an HTML document.

Send Request and Receive Response

To send the request and download the Hacker News HTML content, we use:

NSData *responseData = [NSURLConnection sendSynchronousRequest:request returningResponse:nil error:nil];

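This kicks off the request, blocks until it finishes, and saves the returned data into the responseData variable. Note that sendSynchronousRequest: has been deprecated since OS X 10.11 in favor of NSURLSession; it still works for a simple command-line script like this one.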

We also do some quick validation to make sure the request succeeded and data was returned:

if ([responseData length] > 0) {
  // Data retrieved successfully!
} else {
  // Request failed
}
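The length check above cannot tell a network failure from an empty page, and it ignores the HTTP status code. As one possible tightening, here is a minimal sketch built on NSURLSession. It assumes the tool is compiled with ARC; the helper names HNIsSuccessfulResponse and HNFetchPage and the "HNScraper" error domain are made up for illustration and are not part of any framework.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: YES when the response is an HTTP 2xx and carries a body.
static BOOL HNIsSuccessfulResponse(NSURLResponse *response, NSData *data) {
    if (![response isKindOfClass:[NSHTTPURLResponse class]]) {
        return NO;
    }
    NSInteger status = [(NSHTTPURLResponse *)response statusCode];
    return status >= 200 && status < 300 && data.length > 0;
}

// Hypothetical helper: fetches a URL and blocks until the request finishes.
// Acceptable in a command-line tool; do not block a UI thread like this.
static NSData *HNFetchPage(NSURL *url, NSError **outError) {
    __block NSData *body = nil;
    __block NSError *failure = nil;
    dispatch_semaphore_t done = dispatch_semaphore_create(0);

    NSURLSessionDataTask *task = [[NSURLSession sharedSession]
          dataTaskWithURL:url
        completionHandler:^(NSData *data, NSURLResponse *response, NSError *error) {
            if (error) {
                failure = error;  // transport-level failure (DNS, timeout, ...)
            } else if (HNIsSuccessfulResponse(response, data)) {
                body = data;
            } else {
                failure = [NSError errorWithDomain:@"HNScraper"
                                              code:1
                                          userInfo:@{NSLocalizedDescriptionKey: @"Unexpected HTTP response"}];
            }
            dispatch_semaphore_signal(done);
        }];
    [task resume];
    dispatch_semaphore_wait(done, DISPATCH_TIME_FOREVER);

    if (outError) {
        *outError = failure;
    }
    return body;
}
```

Calling HNFetchPage(url, &error) returns the page body, or nil with error describing what went wrong, so the caller can log the actual cause instead of a generic "request failed".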

So far, so good! We've requested and downloaded the raw HTML data from the Hacker News site. Now the real work begins...

Parsing the HTML Content

With the HTML stored in responseData, we can start processing and extracting the data we want.

Hacker News uses table rows to display articles, so we'll rely on XML parsing to loop through the rows and identify article data.

Initialize XML Parser

Let's initialize a parser which we can use to traverse and evaluate the HTML content:

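NSError *error = nil;
NSXMLDocument *document = [[NSXMLDocument alloc] initWithData:responseData options:NSXMLDocumentTidyHTML error:&error];

Passing NSXMLDocumentTidyHTML asks the parser to tidy the HTML into well-formed XHTML before building the document tree, which is what lets it cope with real-world web pages.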

Find Table Rows with XPath

Inspecting the page

If you inspect the page, you'll notice that each item is housed inside a <tr> tag with the class athing.

With a configured parser, we can now start extracting data!

Hacker News displays articles in table rows, so we grab those elements by using an XPath query:

NSArray *rows = [document nodesForXPath:@"//tr" error:nil];

This gives us all <tr> nodes to iterate through.

XPath is a powerful querying language that allows us to extract elements by attributes, position, nesting and more. I'll cover some common techniques below.
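To make those techniques concrete, here is a small sketch that collects the string values of whatever an XPath expression matches. The helper name HNStringsForXPath is hypothetical, and the queries in the comments assume the Hacker News markup described in this article.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the string value of every node matching an XPath,
// evaluated with `context` as the context node.
static NSArray<NSString *> *HNStringsForXPath(NSXMLNode *context, NSString *xpath) {
    NSError *error = nil;
    NSArray<NSXMLNode *> *nodes = [context nodesForXPath:xpath error:&error];
    NSMutableArray<NSString *> *values = [NSMutableArray array];
    for (NSXMLNode *node in nodes) {
        [values addObject:node.stringValue ?: @""];
    }
    return values;
}

// Example queries:
//   By attribute:     //tr[@class='athing']
//   By nesting:       //span[@class='titleline']/a
//   Attribute values: //span[@class='titleline']/a/@href
//   By position:      //table/tr[2]
//   Relative search:  .//a    (only descendants of the context node)
```

For example, HNStringsForXPath(document, @"//span[@class='titleline']/a") would return every article title on the page. One detail worth remembering: a query that starts with // always searches from the document root, even when it is sent to a single row, so use .// when you only want matches inside that row.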

Loop Through Rows to Identify Articles

With all table rows selected, we loop through them to identify the ones containing article content:

for (NSXMLElement *row in rows) {

  // Check if row marks an article
  // attributeForName: returns an NSXMLNode, so compare its stringValue
  if ([[[row attributeForName:@"class"] stringValue] isEqualToString:@"athing"]) {

    // This row represents an article

  }

}

Leveraging the class attribute, we can pick out article rows specifically.

From there, we can pair the article row with the next row (containing metadata like votes, date, etc) to extract a complete article record.
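The pairing step can be sketched as a small function that walks the rows in document order and remembers the most recent article row. The helper names HNIsArticleRow and HNPairedRows are hypothetical. Matching on the class token rather than the exact attribute value is a defensive assumption on my part, in case a row carries more than one class.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: YES when the row's class attribute contains the token "athing".
static BOOL HNIsArticleRow(NSXMLElement *row) {
    NSString *classes = [[row attributeForName:@"class"] stringValue];
    return [[classes componentsSeparatedByString:@" "] containsObject:@"athing"];
}

// Hypothetical helper: returns @[articleRow, detailRow] pairs in page order.
static NSArray<NSArray<NSXMLElement *> *> *HNPairedRows(NSXMLDocument *document) {
    NSMutableArray<NSArray<NSXMLElement *> *> *pairs = [NSMutableArray array];
    NSXMLElement *pendingArticle = nil;

    for (NSXMLElement *row in [document nodesForXPath:@"//tr" error:NULL]) {
        if (HNIsArticleRow(row)) {
            // Remember the article row until we see the row that follows it
            pendingArticle = row;
        } else if (pendingArticle) {
            // The first non-article row after an article is its detail row
            [pairs addObject:@[pendingArticle, row]];
            pendingArticle = nil;
        }
    }
    return pairs;
}
```

Each pair can then be handed to the extraction code: pair[0] is the article row and pair[1] is the detail row with the votes, author, and so on.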

Extract Article Data

Now that we can identify article rows, let's look at how data is actually extracted.

Say we have an article row stored in currentArticle and corresponding detail row saved in row. Here's how we would grab some common fields:

Title

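// The leading "." keeps the search inside currentArticle; a bare "//" would search the whole document
NSXMLElement *titleElem = [[currentArticle nodesForXPath:@".//span[@class='titleline']/a" error:nil] firstObject];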

NSString *articleTitle = [titleElem stringValue];

The title is nested in an <a> tag inside a <span class="titleline">, so we:

  1. Use XPath to find that <a> element within the article row
  2. Get the first matching node
  3. Extract inner text as string value

URL

The article URL is stored in the href attribute of that same element, so we pull it directly.
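As a sketch of that step, the function below reads the href attribute and also resolves it against the site's base URL, since self posts such as Ask HN link to relative paths like item?id=.... The function name HNArticleURL and the baseURL parameter are hypothetical additions for illustration.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the article's absolute URL string, or nil if the
// row has no title link.
static NSString *HNArticleURL(NSXMLElement *articleRow, NSURL *baseURL) {
    // ".//" limits the search to this row
    NSXMLElement *link = (NSXMLElement *)[[articleRow nodesForXPath:@".//span[@class='titleline']/a"
                                                              error:NULL] firstObject];
    NSString *href = [[link attributeForName:@"href"] stringValue];
    if (href.length == 0) {
        return nil;
    }
    // Absolute hrefs pass through unchanged; relative ones are resolved against baseURL
    return [[NSURL URLWithString:href relativeToURL:baseURL] absoluteString];
}
```

With a base of https://news.ycombinator.com/, an href of item?id=2 would come back as https://news.ycombinator.com/item?id=2, while an external link is returned as-is.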

Points

Here we:

  1. Find the <td class="subtext"> in the row
  2. Then drill down to the nested <span class="score">
  3. Extract the score's inner text
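Those three steps can be sketched as follows. The function name HNPoints is hypothetical, and converting the text to a number is an extra convenience that the walkthrough itself does not require.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the score from a detail row as a number,
// or 0 when the row has no score (job postings, for instance, show none).
static NSInteger HNPoints(NSXMLElement *detailRow) {
    // Steps 1 and 2: find the subtext cell, then the score span nested inside it
    NSXMLNode *score = [[detailRow nodesForXPath:@".//td[@class='subtext']//span[@class='score']"
                                           error:NULL] firstObject];
    // Step 3: the text looks like "123 points"; integerValue reads the leading digits.
    // Messaging nil yields 0, which covers the missing-score case.
    return [[score stringValue] integerValue];
}
```

Keeping the score as an NSInteger rather than a string makes it easier to sort or filter articles later.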

And so on for other fields like author, comments, etc!

As you can see, XPath queries paired with stringValue and attribute access allow us to systematically extract data from the parsed HTML.

I won't walk through every single field, but hopefully this gives you a framework for how scraping can be approached!

Putting It All Together

Let's take one more high-level view of how everything connects before we conclude:

  1. Craft request - Define URL and create NSURLRequest
  2. Send request - Dispatch request and receive raw HTML response
  3. Initialize XML parser - Convert response into structured NSXMLDocument
  4. Use XPath queries - Traverse HTML nodes and extract data into native objects like NSString
  5. Transform data - Clean and structure content as needed

And at the end you have programmatic access to scrape and manipulate web content!

While it may seem daunting at first, breaking things into smaller steps makes web scraping much more approachable.

Full Code Sample

For easy reference, here is the full scraping script covered in this guide:

#import <Foundation/Foundation.h>

int main(int argc, const char * argv[]) {
    @autoreleasepool {
        // Define the URL of the Hacker News homepage
        NSURL *url = [NSURL URLWithString:@"https://news.ycombinator.com/"];
        
        // Create a URL request
        NSURLRequest *request = [NSURLRequest requestWithURL:url];
        
        // Send the request and receive the response
        NSData *responseData = [NSURLConnection sendSynchronousRequest:request returningResponse:nil error:nil];
        
        // Check that some data came back (this does not inspect the HTTP status code)
        if ([responseData length] > 0) {
            // Parse the HTML content using NSXMLDocument
            NSError *error = nil;
            NSXMLDocument *document = [[NSXMLDocument alloc] initWithData:responseData options:NSXMLDocumentTidyHTML error:&error];
            
            if (document) {
                // Find all rows in the table
                NSArray *rows = [document nodesForXPath:@"//tr" error:nil];
                
                // Initialize variables to keep track of the current article and row type
                NSXMLElement *currentArticle = nil;
                NSString *currentRowType = nil;
                
                // Iterate through the rows to scrape articles
                for (NSXMLElement *row in rows) {
                    // Check if this is an article row
                    if ([[[row attributeForName:@"class"] stringValue] isEqualToString:@"athing"]) {
                        currentArticle = row;
                        currentRowType = @"article";
                    } else if ([currentRowType isEqualToString:@"article"]) {
                        // This is the details row
                        if (currentArticle) {
                            // Extract information from the current article and details row
                            // Use ".//" so each query stays inside the current row;
                            // a bare "//" would always search from the document root
                            NSXMLElement *titleElem = [[currentArticle nodesForXPath:@".//span[@class='titleline']/a" error:nil] firstObject];
                            NSString *articleTitle = [titleElem stringValue];
                            NSString *articleURL = [[titleElem attributeForName:@"href"] stringValue];
                            
                            NSXMLElement *subtext = [[row nodesForXPath:@".//td[@class='subtext']" error:nil] firstObject];
                            NSString *points = [[[subtext nodesForXPath:@".//span[@class='score']" error:nil] firstObject] stringValue];
                            NSString *author = [[[subtext nodesForXPath:@".//a[@class='hnuser']" error:nil] firstObject] stringValue];
                            NSString *timestamp = [[[subtext nodesForXPath:@".//span[@class='age']/@title" error:nil] firstObject] stringValue];
                            NSXMLElement *commentsElem = [[subtext nodesForXPath:@".//a[contains(text(),'comments')]" error:nil] firstObject];
                            NSString *comments = [commentsElem stringValue] ?: @"0";
                            
                            // Print the extracted information
                            NSLog(@"Title: %@", articleTitle);
                            NSLog(@"URL: %@", articleURL);
                            NSLog(@"Points: %@", points);
                            NSLog(@"Author: %@", author);
                            NSLog(@"Timestamp: %@", timestamp);
                            NSLog(@"Comments: %@", comments);
                            NSLog(@"--------------------------------------------------");
                        }
                        
                        // Reset the current article and row type
                        currentArticle = nil;
                        currentRowType = nil;
                    }
                }
            } else {
                NSLog(@"Failed to parse HTML document. Error: %@", [error localizedDescription]);
            }
        } else {
            NSLog(@"Failed to retrieve the page.");
        }
    }
    return 0;
}

This is great as a learning exercise, but it is easy to see that the script itself is prone to getting blocked because it uses a single IP. If you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • With millions of high speed rotating proxies located all over the world,
  • With our automatic IP rotation
  • With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
  • With our automatic CAPTCHA solving technology,
  • Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed through a simple API from any programming language.

In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes. You can just get the data and parse it in any language such as Node or PHP, or with any framework such as Scrapy or Nutch. In all these cases, you can just call the URL with render support.

We have a running offer of 1000 API calls completely free. Register and get your free API Key.