Scraping Wikipedia Tables in Objective-C for Beginners

Gathering data by scraping websites sounds advanced, but it is easy to pick up. We'll extract a table from Wikipedia in Objective-C in under 100 lines of code! No need to feel overwhelmed - I'll walk you through it step-by-step.

Here's what you'll learn:

Making HTTP requests to access web pages

Parsing the HTML content using TFHpple

Extracting data from a table with XPath

Cleaning and processing the scraped content

Let's get scraping!

Our Goal

We want to get data on all the Presidents of the United States from this Wikipedia page.

It has a nicely formatted HTML table that serves our purpose. We'll use Objective-C and a nifty library called TFHpple to scrape it.

This is the table we are talking about

Key Concepts

Before we jump into the code, let's understand at a high-level what's going on behind the scenes:

We use NSURLSession to send a GET request to the Wikipedia URL

The server returns the page content as raw HTML

We parse this HTML using TFHpple's XPath queries

XPath helps us target the exact table element to extract data from

We traverse rows and columns to store the table data into an array

We're essentially asking Wikipedia for data and processing the response - pretty cool!

The Code

Time to scrape. I'll break the full code down section-by-section. Don't worry if something is unclear at first; I'll explain everything.

Imports & Setup

#import <Foundation/Foundation.h>
#import "TFHpple.h"

int main(int argc, const char * argv[]) {

    @autoreleasepool {

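We import Foundation for Apple's core classes (NSString, NSURL, NSURLSession) and TFHpple.h for the HTML parser. TFHpple is a third-party library built on libxml2, so it has to be added to your project before this compiles. The rest sets up the main function.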

Pro Tip: Importing useful libraries upfront saves headaches later!

Define URL

NSString *urlString = @"https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States";

NSURL *url = [NSURL URLWithString:urlString];

We initialize an NSString with the URL of the Wikipedia page we want to scrape.

Then we create an NSURL object from that string to represent the target URL. All web pages have a unique address - this is ours.
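URLWithString: can hand back nil (or a URL with no scheme) when the string is malformed, and the request then fails in confusing ways later on. Here is a minimal sketch of a sanity check you could run first. ValidatedWebURL is a hypothetical helper name of mine, not part of the tutorial's code or of Foundation.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of the tutorial's code): returns an NSURL only
// when the string parses and has an http/https scheme plus a host; otherwise nil.
static NSURL *ValidatedWebURL(NSString *string) {
    if (string.length == 0) {
        return nil; // nothing to parse
    }
    NSURL *url = [NSURL URLWithString:string];
    if (url == nil) {
        return nil; // the string could not be parsed as a URL at all
    }
    NSString *scheme = url.scheme.lowercaseString;
    BOOL isWeb = [scheme isEqualToString:@"http"] || [scheme isEqualToString:@"https"];
    return (isWeb && url.host.length > 0) ? url : nil;
}
```

If this returns nil for your string, check it for stray characters (angle brackets, spaces, a missing https:// prefix) before going any further.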

Set User Agent Header

NSDictionary *headers = @{
    @"User-Agent": @"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
};

We set a custom User-Agent header that mimics a Chrome browser request.

Pro Tip: Websites treat programs differently from browsers. Faking the header helps avoid blocks.
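Mimicking a browser is one option. As far as I know, Wikimedia's User-Agent policy actually asks scripts to identify themselves with a descriptive string and contact details, so for Wikipedia specifically you may prefer something like the sketch below. MakeScraperRequest, the tool name, and the contact address are all placeholders I made up for illustration.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: builds a GET request that identifies the script.
// The tool name and contact address are placeholders; substitute your own.
static NSMutableURLRequest *MakeScraperRequest(NSURL *url) {
    NSMutableURLRequest *request = [NSMutableURLRequest requestWithURL:url];
    request.HTTPMethod = @"GET";
    request.timeoutInterval = 30.0; // fail after 30 seconds without a response
    [request setValue:@"PresidentsTableScraper/1.0 (contact: you@example.com)"
   forHTTPHeaderField:@"User-Agent"];
    return request;
}
```

The returned request can be passed to dataTaskWithRequest:completionHandler: exactly like the one built in the next step.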

Create and Send Request

NSMutableURLRequest *request = [NSMutableURLRequest requestWithURL:url];

[request setAllHTTPHeaderFields:headers];

NSURLSession *session = [NSURLSession sharedSession];

NSURLSessionDataTask *dataTask = [session dataTaskWithRequest:request completionHandler:^(NSData *data, NSURLResponse *response, NSError *error) {

First we initialize a mutable request object with our target NSURL.

We attach the headers dictionary to it containing the User-Agent.

Next we get a shared NSURLSession instance and call dataTaskWithRequest on it. This sends an asynchronous GET request to the URL when we resume the task.
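Because the request is asynchronous, a command-line program has to stay alive until the completion handler has run. One common pattern is to block on a semaphore with a timeout. The sketch below shows the idea; RunAndWait is a hypothetical helper of mine, and it assumes the completion handler runs off the calling thread (which is the case for the shared session's handlers when you call it from main).

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the tutorial): runs `start`, which must call
// `done` once its asynchronous work finishes, and blocks the calling thread
// until then or until `timeout` seconds pass.
// Returns YES if `done` was called in time, NO on timeout.
static BOOL RunAndWait(void (^start)(dispatch_block_t done), NSTimeInterval timeout) {
    dispatch_semaphore_t finished = dispatch_semaphore_create(0);
    start(^{
        dispatch_semaphore_signal(finished); // wake up the waiting thread
    });
    dispatch_time_t deadline =
        dispatch_time(DISPATCH_TIME_NOW, (int64_t)(timeout * NSEC_PER_SEC));
    return dispatch_semaphore_wait(finished, deadline) == 0;
}
```

In the scraper you would create and resume the data task inside the start block and call done() as the last line of the completion handler, so main can return once the work is finished.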

The completion handler is the meat of our scraper...

Handle the Response

if (error) {

    NSLog(@"Failed to retrieve the web page. Error: %@", error.localizedDescription);
    return;

}

NSHTTPURLResponse *httpResponse = (NSHTTPURLResponse*) response;

if (httpResponse.statusCode == 200) {

We check if any errors occurred during the request. If so, we print it and return.

Otherwise, we access the HTTP status code to check it was successful. Status 200 means OK!
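The cast to NSHTTPURLResponse assumes the response really is an HTTP response, and checking for exactly 200 rejects other success codes. If you want something a little more defensive, here is a small sketch. IsSuccessfulHTTPResponse is a hypothetical helper name, not an Apple API.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: YES only for an HTTP response with a 2xx status code.
static BOOL IsSuccessfulHTTPResponse(NSURLResponse *response) {
    // Verify the class before casting; non-HTTP responses have no status code.
    if (![response isKindOfClass:[NSHTTPURLResponse class]]) {
        return NO;
    }
    NSInteger status = ((NSHTTPURLResponse *)response).statusCode;
    return status >= 200 && status < 300;
}
```

Inside the completion handler, if (IsSuccessfulHTTPResponse(response)) { ... } could then stand in for the statusCode == 200 check.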

Inspecting the page

When we inspect the page, we can see that the table carries two classes: wikitable and sortable.

TFHpple *doc = [[TFHpple alloc] initWithHTMLData:data];

NSArray *tables = [doc searchWithXPathQuery:@"//table[@class='wikitable sortable']"];

if (tables.count > 0) {

    TFHppleElement *table = tables[0];

    NSMutableArray *data = [NSMutableArray array];

We initialize TFHpple by passing the raw HTML webpage data.

searchWithXPathQuery lets us find elements by their XPath. We target the presidential data table using its class name.

If found, we get the first matching table element, then initialize an NSMutableArray to store our scraped data.
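The query @class='wikitable sortable' only matches when the class attribute is exactly that string, so it breaks if Wikipedia adds another class or reorders them. A standard XPath 1.0 idiom matches a single class as a whole word instead. Here is a sketch that builds such a query; XPathForTagWithClass is a hypothetical helper name of mine.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: XPath that matches <tag> elements whose class attribute
// contains `className` as a whole token, whatever other classes are present.
static NSString *XPathForTagWithClass(NSString *tag, NSString *className) {
    return [NSString stringWithFormat:
            @"//%@[contains(concat(' ', normalize-space(@class), ' '), ' %@ ')]",
            tag, className];
}
```

You would pass the result to searchWithXPathQuery: just like the literal query above, for example [doc searchWithXPathQuery:XPathForTagWithClass(@"table", @"wikitable")]. If the served HTML wraps the rows in a tbody element, asking the table for its direct tr children will come back empty; in that case a query on the document such as (//table[...])[1]//tr, with the same predicate filled in, reaches the rows at any depth.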

Extract Rows & Columns

NSArray *rows = [table childrenWithTagName:@"tr"];

for (int i = 1; i < rows.count; i++) {

    TFHppleElement *row = rows[i];

    NSArray *columns = [row children];

    NSMutableArray *rowData = [NSMutableArray array];

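We get all the row elements from our table, then loop through them, starting at index 1 to skip the header row.

For each row, we grab its child nodes with children and initialize a rowData array. Note that children returns every child node, which may include whitespace text nodes between cells as well as the cells themselves.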

    for (TFHppleElement *col in columns) {

        // content can be nil; adding nil to an NSMutableArray raises an exception
        NSString *text = [col content] ?: @"";

        [rowData addObject:[text stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]]];

    }

    [data addObject:rowData];

}

Inside this nested loop, we extract the text content of each cell and strip the surrounding whitespace.

We add this cleaned cell data into our rowData array and the full row into the main data array.
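Trimming only removes whitespace at the two ends. Wikipedia cells often also carry footnote markers like [a] and line breaks in the middle of the text. Here is a sketch of a slightly more thorough cleaner using only Foundation. CleanCellText is a hypothetical helper of mine, the bracket pattern is an assumption about what you want to strip, and the code assumes ARC.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: strips bracketed footnote markers, collapses runs of
// whitespace into single spaces, and trims both ends. Returns @"" for nil.
static NSString *CleanCellText(NSString *raw) {
    if (raw.length == 0) {
        return @"";
    }
    NSMutableString *text = [raw mutableCopy];

    // Remove markers such as [a] or [17].
    NSRegularExpression *footnotes =
        [NSRegularExpression regularExpressionWithPattern:@"\\[[^\\]]*\\]" options:0 error:NULL];
    [footnotes replaceMatchesInString:text options:0
                                range:NSMakeRange(0, text.length) withTemplate:@""];

    // Collapse newlines, tabs and repeated spaces into one space.
    NSRegularExpression *spaces =
        [NSRegularExpression regularExpressionWithPattern:@"\\s+" options:0 error:NULL];
    [spaces replaceMatchesInString:text options:0
                             range:NSMakeRange(0, text.length) withTemplate:@" "];

    return [text stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
}
```

In the loop above, [rowData addObject:CleanCellText([col content])] could stand in for the trimming call.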

Print Scraped Data

for (NSArray *presidentData in data) {

    // Skip rows with too few cells so the index lookups below cannot go out of bounds
    if (presidentData.count < 8) {
        continue;
    }

    NSLog(@"President Data:");
    NSLog(@"Number: %@", presidentData[0]);
    NSLog(@"Name: %@", presidentData[2]);
    NSLog(@"Term: %@", presidentData[3]);
    NSLog(@"Party: %@", presidentData[5]);
    NSLog(@"Election: %@", presidentData[6]);
    NSLog(@"Vice President: %@", presidentData[7]);
    NSLog(@"\n");

}

Finally, we iterate through our 2D array and print each piece of data we want!

The column indexes used here (0, 2, 3, 5, 6, 7) were figured out through trial and error, so recheck them if the table's layout changes.

And we're done - not so bad for a beginner, right?

Key Takeaways

Let's recap what we learned:

Used NSURLSession to request webpages

Learned how to handle request errors

Parsed HTML responses with TFHpple + XPath

Extracted table data row-by-row and column-by-column

Printed structured presidential data scraped from Wikipedia

All in under 100 lines of code!

What Next?

You could scrape bigger tables or multiple ones from the same site. Try removing boilerplate code with libraries like AFNetworking. Or queue requests asynchronously for efficiency.

Hope this tutorial helped you grasp the basics. Web scraping isn't so daunting when taken one step at a time. Feel free to tweak the code and practice on other websites.

Happy scraping!

Here is the full code for reference:

#import <Foundation/Foundation.h>
#import "TFHpple.h"

int main(int argc, const char * argv[]) {
    @autoreleasepool {
        // Define the URL of the Wikipedia page
        NSString *urlString = @"https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States";
        NSURL *url = [NSURL URLWithString:urlString];
        
        // Define a user-agent header to simulate a browser request
        NSDictionary *headers = @{
            @"User-Agent": @"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
        };
        
        // Create a request with headers
        NSMutableURLRequest *request = [NSMutableURLRequest requestWithURL:url];
        [request setAllHTTPHeaderFields:headers];
        
        // Send an HTTP GET request to the URL
        NSURLSession *session = [NSURLSession sharedSession];
        NSURLSessionDataTask *dataTask = [session dataTaskWithRequest:request completionHandler:^(NSData *data, NSURLResponse *response, NSError *error) {
            if (error) {
                NSLog(@"Failed to retrieve the web page. Error: %@", error.localizedDescription);
                return;
            }
            
            NSHTTPURLResponse *httpResponse = (NSHTTPURLResponse *)response;
            if (httpResponse.statusCode == 200) {
                // Parse the HTML content of the page using TFHpple
                TFHpple *doc = [[TFHpple alloc] initWithHTMLData:data];
                
                // Find the table with the specified class name
                NSArray *tables = [doc searchWithXPathQuery:@"//table[@class='wikitable sortable']"];
                if (tables.count > 0) {
                    TFHppleElement *table = tables[0];
                    
                    // Initialize an empty array to store the table data
                    NSMutableArray *data = [NSMutableArray array];
                    
                    // Iterate through the rows of the table
                    NSArray *rows = [table childrenWithTagName:@"tr"];
                    for (int i = 1; i < rows.count; i++) { // Skip the header row
                        TFHppleElement *row = rows[i];
                        
                        // Extract data from each column and append it to the data array
                        NSArray *columns = [row children];
                        NSMutableArray *rowData = [NSMutableArray array];
                        
                        for (TFHppleElement *col in columns) {
                            // content can be nil; adding nil to an NSMutableArray raises an exception
                            NSString *text = [col content] ?: @"";
                            [rowData addObject:[text stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]]];
                        }
                        
                        [data addObject:rowData];
                    }
                    
                    // Print the scraped data for all presidents
                    for (NSArray *presidentData in data) {
                        // Skip rows with too few cells so the index lookups cannot go out of bounds
                        if (presidentData.count < 8) {
                            continue;
                        }
                        NSLog(@"President Data:");
                        NSLog(@"Number: %@", presidentData[0]);
                        NSLog(@"Name: %@", presidentData[2]);
                        NSLog(@"Term: %@", presidentData[3]);
                        NSLog(@"Party: %@", presidentData[5]);
                        NSLog(@"Election: %@", presidentData[6]);
                        NSLog(@"Vice President: %@", presidentData[7]);
                        NSLog(@"\n");
                    }
                } else {
                    NSLog(@"Table not found on the web page.");
                }
            } else {
                NSLog(@"Failed to retrieve the web page. Status code: %ld", (long)httpResponse.statusCode);
            }
        }];
        
        [dataTask resume];
        
        // Keep the process alive so the asynchronous completion handler can run
        [[NSRunLoop currentRunLoop] run];
    }
    return 0;
}

In more advanced implementations, you will even need to rotate the User-Agent string so the website can't tell it's the same browser!

As you get a little more advanced, you will realize that the server can simply block your IP address, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.

Overcoming IP Blocks

Investing in a private rotating proxy service like Proxies API can often make the difference between a headache-free web scraping project that gets the job done consistently and one that never really works.

Plus, with our running offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.

Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

With millions of high speed rotating proxies located all over the world,

With our automatic IP rotation

With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)

With our automatic CAPTCHA solving technology,

Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed by a simple API like below in any programming language.

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
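
To make the same call from the Objective-C scraper, you could build the URL with NSURLComponents and let Foundation handle percent-encoding of the query values. This is a minimal sketch: the endpoint and the key and url parameter names come from the curl example above, while ProxiedURL is a hypothetical helper name of mine.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: wraps a target page URL in the proxy endpoint shown in
// the curl example, passing the API key and target as query parameters.
static NSURL *ProxiedURL(NSString *apiKey, NSString *targetURLString) {
    NSURLComponents *components =
        [NSURLComponents componentsWithString:@"http://api.proxiesapi.com/"];
    components.queryItems = @[
        [NSURLQueryItem queryItemWithName:@"key" value:apiKey],
        [NSURLQueryItem queryItemWithName:@"url" value:targetURLString],
    ];
    return components.URL;
}
```

The resulting NSURL can then be used in place of the Wikipedia URL when creating the request.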

We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
