Guide to Scraping Reddit Posts in Objective-C

Jan 9, 2024 · 7 min read

Parsing through an unfamiliar code base can be intimidating for beginner programmers. When facing lines of code, it's hard to know where to start or what exactly is happening.

In this article, we'll walk step-by-step through a sample program that scrapes posts from Reddit. We'll learn how it:

  • Downloads the Reddit homepage
  • Extracts the HTML
  • Uses XPath selectors to pull out post information
  • Stores data in easy-to-access objects

    By the end, you'll understand the key concepts so you can adapt the code for your own scraping projects!

    Downloading the Page

    First, we create an NSString variable containing the Reddit URL we want to scrape:

    NSString *redditUrl = @"https://www.reddit.com";
    

    Next, we convert that string into an NSURL object:

    NSURL *url = [NSURL URLWithString:redditUrl];
    

    Think of the NSURL as packaging up the string into a format we can work with programmatically.

    We then create an NSMutableURLRequest using this URL, which lets us configure a request:

    NSMutableURLRequest *request = [NSMutableURLRequest requestWithURL:url];
    

    Requests allow us to specify things like headers. Here, we add a User-Agent header to mimic a desktop browser:

    [request setValue:@"Mozilla/5.0..." forHTTPHeaderField:@"User-Agent"];
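
The request setup above can be wrapped in a small helper. This is a minimal sketch rather than part of the original program: the function name `MakeScrapeRequest`, its parameters, and the 30-second timeout are all assumptions made for illustration.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: builds a GET request for urlString with a custom User-Agent.
// Returns nil when urlString cannot be turned into an NSURL.
static NSMutableURLRequest *MakeScrapeRequest(NSString *urlString, NSString *userAgent) {
    NSURL *url = [NSURL URLWithString:urlString];
    if (url == nil) {
        return nil; // URLWithString: can return nil when the string is not a valid URL
    }

    NSMutableURLRequest *request = [NSMutableURLRequest requestWithURL:url];
    request.HTTPMethod = @"GET";     // GET is already the default; set here for clarity
    request.timeoutInterval = 30.0;  // assumption: 30 seconds is long enough for one page
    [request setValue:userAgent forHTTPHeaderField:@"User-Agent"];
    return request;
}
```

Calling `MakeScrapeRequest(@"https://www.reddit.com", userAgent)` gives a request whose header can be read back with `[request valueForHTTPHeaderField:@"User-Agent"]`, which is a quick way to confirm the header was set before sending anything.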
    

    Finally, we use NSURLSession to send off the GET request and handle the response asynchronously:

    NSURLSessionDataTask *task = [[NSURLSession sharedSession] dataTaskWithRequest:request completionHandler:^(NSData *data, NSURLResponse *response, NSError *error) {
       // Handle response
    }];
    
    [task resume];
    

    This downloads the contents of the Reddit homepage for us to start scraping!
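
One caveat for command-line tools: `[task resume]` returns immediately, so `main` can reach its end before the completion handler has run. A common way to wait is a dispatch semaphore. The sketch below is one possible approach, not the only one; `FetchDataBlocking` and its timeout parameter are hypothetical names, and a blocking wait like this is best kept off the main thread of a GUI app.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: runs a data task and blocks the calling thread until the
// completion handler has finished or timeoutSeconds have elapsed.
// Returns the downloaded bytes, or nil on error or timeout.
static NSData *FetchDataBlocking(NSURLRequest *request, NSTimeInterval timeoutSeconds) {
    dispatch_semaphore_t done = dispatch_semaphore_create(0);
    __block NSData *result = nil;

    NSURLSessionDataTask *task = [[NSURLSession sharedSession]
        dataTaskWithRequest:request
          completionHandler:^(NSData *data, NSURLResponse *response, NSError *error) {
              if (error == nil) {
                  result = data;
              }
              dispatch_semaphore_signal(done); // wake up the waiting thread
          }];
    [task resume];

    dispatch_time_t deadline =
        dispatch_time(DISPATCH_TIME_NOW, (int64_t)(timeoutSeconds * NSEC_PER_SEC));
    if (dispatch_semaphore_wait(done, deadline) != 0) {
        [task cancel]; // timed out: stop the transfer and report failure
        return nil;
    }
    return result;
}
```

The shared session calls its completion handlers on a background queue, so the waiting thread is never the one that has to deliver the signal.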

    Parsing the HTML

    Inside the completion handler, we first check if any errors occurred. If not, we:

    1. Cast the NSURLResponse to an NSHTTPURLResponse so we can read the status code
    2. Check for a 200 OK status
    3. Convert the NSData *data into a string containing HTML

    if (error) {
       // Handle error
    } else {
    
        NSHTTPURLResponse *httpResponse = (NSHTTPURLResponse *)response;
    
        if (httpResponse.statusCode == 200) {
    
          NSString *htmlContent = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
    
        }
    
    }
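
The checks in the completion handler can also be collected into one function. The following is a sketch under assumptions: `HTMLStringFromResponse` is a hypothetical name, and the Latin-1 fallback is an added safety net for pages that are not valid UTF-8, which the original code does not include.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the response body as a string when the response
// is an HTTP 200 with a non-empty body; otherwise returns nil.
static NSString *HTMLStringFromResponse(NSData *data, NSURLResponse *response) {
    // Only HTTP(S) responses carry a status code, so verify the class before casting.
    if (![response isKindOfClass:[NSHTTPURLResponse class]]) {
        return nil;
    }
    NSHTTPURLResponse *httpResponse = (NSHTTPURLResponse *)response;
    if (httpResponse.statusCode != 200 || data.length == 0) {
        return nil;
    }

    // initWithData:encoding: returns nil if the bytes are not valid UTF-8.
    NSString *html = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
    if (html == nil) {
        // Assumption: ISO Latin-1 is an acceptable fallback because it decodes any byte sequence.
        html = [[NSString alloc] initWithData:data encoding:NSISOLatin1StringEncoding];
    }
    return html;
}
```

Inside the completion handler this would be called as `HTMLStringFromResponse(data, response)`, with a nil result treated the same way as a failed download.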
    

    With the HTML stored as a string, we can parse it. We use the htmlReadDoc function from libxml2 to parse the content into an easy-to-query structure.

    Think of this like converting the HTML string into an object we can programmatically analyze.

    htmlDocPtr doc = htmlReadDoc((const xmlChar *)[htmlContent UTF8String], NULL, NULL, HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    

    The parsing options passed in handle small errors without stopping. This helps account for "real world" HTML that may not be perfectly formatted.
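
To see the recovery options at work, here is a small sketch that parses a deliberately malformed fragment. `RootElementNameOfHTML` is a hypothetical helper, and the build is assumed to link libxml2 (for example with `-lxml2` and the libxml2 headers on the include path).

```objc
#import <Foundation/Foundation.h>
#import <libxml/HTMLparser.h>
#import <libxml/tree.h>

// Hypothetical helper: parses html leniently and returns the name of the root
// element, or nil if libxml2 could not build a document with a root.
static NSString *RootElementNameOfHTML(NSString *html) {
    const char *utf8 = [html UTF8String];
    if (utf8 == NULL) {
        return nil;
    }

    htmlDocPtr doc = htmlReadDoc((const xmlChar *)utf8, NULL, "UTF-8",
                                 HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    if (doc == NULL) {
        return nil;
    }

    NSString *name = nil;
    xmlNodePtr root = xmlDocGetRootElement(doc);
    if (root != NULL && root->name != NULL) {
        name = [NSString stringWithUTF8String:(const char *)root->name];
    }

    xmlFreeDoc(doc); // libxml2 memory is not managed by ARC
    return name;
}
```

Passing a fragment with unclosed tags, such as `@"<p>unclosed <b>bold"`, still produces a document: the HTML parser supplies the missing html and body wrappers, so the root element comes back as `html`.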

    Extracting Post Data

    Inspecting the elements

    Upon inspecting the HTML in Chrome, you will see that each post is wrapped in a shreddit-post element with class descriptors specific to it…

    With the document parsed, we can now use XPath selectors to pull out pieces of information.

    xmlXPathContextPtr xpathCtx = xmlXPathNewContext(doc);
    
    NSString *xpathExpression = @"//div[contains(@class,'shreddit-post')]";
    
    xmlXPathObjectPtr xpathObj = xmlXPathEvalExpression((const xmlChar *)[xpathExpression UTF8String], xpathCtx);
    

    Walking through this:

  • xmlXPathNewContext creates a new context for evaluating XPath expressions against our parsed doc
  • We define the XPath selector we want to use
  • xmlXPathEvalExpression then applies this selector to extract matching nodes

    The key thing to understand here is xpathExpression: it contains the actual XPath selector defining what data to match.

    Some examples of XPaths:

  • //div - Match all div elements at any level of nesting
  • //div[@class='my-class'] - Match divs with a certain class attribute
  • //a/text() - Match the text content of all a tags

    In our Reddit scraping code, the XPath expressions that extract specific fields go in this position. Keep the string literals exactly as written, because even a small change to a selector changes what it matches.
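
The example selectors above can be tried against a small HTML string before pointing them at a real page. This is a rough sketch: `CountXPathMatches` is a hypothetical helper, and libxml2 is assumed to be linked into the build.

```objc
#import <Foundation/Foundation.h>
#import <libxml/HTMLparser.h>
#import <libxml/xpath.h>

// Hypothetical helper: parses html and returns how many nodes xpath matches,
// or -1 if the document or the expression could not be processed.
static int CountXPathMatches(NSString *html, NSString *xpath) {
    htmlDocPtr doc = htmlReadDoc((const xmlChar *)[html UTF8String], NULL, "UTF-8",
                                 HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    if (doc == NULL) {
        return -1;
    }

    int count = -1;
    xmlXPathContextPtr ctx = xmlXPathNewContext(doc);
    if (ctx != NULL) {
        xmlXPathObjectPtr result =
            xmlXPathEvalExpression((const xmlChar *)[xpath UTF8String], ctx);
        if (result != NULL) {
            // nodesetval can be NULL when nothing matched, so use the NULL-safe macro.
            count = xmlXPathNodeSetGetLength(result->nodesetval);
            xmlXPathFreeObject(result);
        }
        xmlXPathFreeContext(ctx);
    }

    xmlFreeDoc(doc);
    return count;
}
```

With the sample markup `<div class='my-class'><a href='/x'>one</a></div><div><a href='/y'>two</a></div>`, the selector `//div` matches both divs, `//div[@class='my-class']` matches only the first, and `//a/text()` matches the two text nodes inside the links.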

    With the matched nodes selected, we can now loop through and extract info:

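    // nodesetval can be NULL when nothing matched, so guard before reading nodeNr
    int count = xpathObj->nodesetval ? xpathObj->nodesetval->nodeNr : 0;

    for (int i = 0; i < count; i++) {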
    
      RedditPost *post = [[RedditPost alloc] init];
    
      // Extract data
      post.permalink = @"";
      post.contentHref = @"";
    
      // ...
    
    }
    

    We create RedditPost objects to store each post's data, populate the properties, and add them to an array.

    Later on, we can access these objects to work with the scraped content!
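
The placeholder assignments can be filled in by reading attributes off each matched node. The sketch below is an assumption-heavy illustration, not the article's confirmed method: it selects `shreddit-post` elements directly with `//shreddit-post`, it assumes the attribute names `permalink`, `post-title`, `author`, and `score`, and it uses a reduced `RedditPost` class plus the hypothetical helpers `AttributeValue` and `ExtractPosts`. Check the attribute names against the page you are actually scraping.

```objc
#import <Foundation/Foundation.h>
#import <libxml/HTMLparser.h>
#import <libxml/xpath.h>

// Reduced version of the article's RedditPost, kept small for the example.
@interface RedditPost : NSObject
@property (nonatomic, copy) NSString *permalink;
@property (nonatomic, copy) NSString *postTitle;
@property (nonatomic, copy) NSString *author;
@property (nonatomic, copy) NSString *score;
@end

@implementation RedditPost
@end

// Hypothetical helper: reads one attribute as an NSString, or @"" when it is missing.
static NSString *AttributeValue(xmlNodePtr node, const char *name) {
    xmlChar *value = xmlGetProp(node, (const xmlChar *)name);
    if (value == NULL) {
        return @"";
    }
    NSString *string = [NSString stringWithUTF8String:(const char *)value] ?: @"";
    xmlFree(value); // xmlGetProp returns a copy that the caller must free
    return string;
}

// Hypothetical helper: returns one RedditPost per <shreddit-post> element in html.
static NSArray<RedditPost *> *ExtractPosts(NSString *html) {
    NSMutableArray<RedditPost *> *posts = [NSMutableArray array];
    htmlDocPtr doc = htmlReadDoc((const xmlChar *)[html UTF8String], NULL, "UTF-8",
                                 HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    if (doc == NULL) {
        return posts;
    }

    xmlXPathContextPtr ctx = xmlXPathNewContext(doc);
    if (ctx != NULL) {
        xmlXPathObjectPtr result =
            xmlXPathEvalExpression((const xmlChar *)"//shreddit-post", ctx);
        if (result != NULL) {
            int count = xmlXPathNodeSetGetLength(result->nodesetval);
            for (int i = 0; i < count; i++) {
                xmlNodePtr node = xmlXPathNodeSetItem(result->nodesetval, i);
                RedditPost *post = [[RedditPost alloc] init];
                post.permalink = AttributeValue(node, "permalink");
                post.postTitle = AttributeValue(node, "post-title"); // assumed attribute name
                post.author = AttributeValue(node, "author");
                post.score = AttributeValue(node, "score");
                [posts addObject:post];
            }
            xmlXPathFreeObject(result);
        }
        xmlXPathFreeContext(ctx);
    }

    xmlFreeDoc(doc);
    return posts;
}
```

Returning an empty string for a missing attribute keeps every property non-nil, so later logging or comparison code does not need separate nil checks.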

    Putting It All Together

    Concepts like requests, XPath, and HTML parsing may be unfamiliar at first, but relating them to tangible analogies and walking through the flow step by step shows what this scraping script is doing under the hood.

    The key points to remember are:

  • We make a request to download raw HTML
  • Parsing converts this into an object we can analyze
  • XPath selectors then extract specific data points
  • We store extracted info into objects for later use

    And voila, we've scraped Reddit!

    Here is the full code sample again to tinker with:

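    #import <Foundation/Foundation.h>
    // libxml2 headers: add the libxml2 include directory to the header search path and link with -lxml2
    #import <libxml/HTMLparser.h>
    #import <libxml/xpath.h> // declares the xmlXPath* functions used below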
    
    // Define a data structure to store extracted information
    @interface RedditPost : NSObject
    @property (nonatomic, strong) NSString *permalink;
    @property (nonatomic, strong) NSString *contentHref;
    @property (nonatomic, strong) NSString *commentCount;
    @property (nonatomic, strong) NSString *postTitle;
    @property (nonatomic, strong) NSString *author;
    @property (nonatomic, strong) NSString *score;
    @end
    
    @implementation RedditPost
    @end
    
    int main(int argc, const char * argv[]) {
        @autoreleasepool {
            // Define the Reddit URL you want to download
            NSString *redditUrl = @"https://www.reddit.com";
            
            // Create a NSURL object from the URL string
            NSURL *url = [NSURL URLWithString:redditUrl];
            
            if (url) {
                // Create a NSMutableURLRequest with the URL
                NSMutableURLRequest *request = [NSMutableURLRequest requestWithURL:url];
                
                // Set the User-Agent header
                [request setValue:@"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36" forHTTPHeaderField:@"User-Agent"];
                
                // Semaphore that keeps main() alive until the completion handler has run
                dispatch_semaphore_t finished = dispatch_semaphore_create(0);

                // Send a GET request
                NSURLSessionDataTask *task = [[NSURLSession sharedSession] dataTaskWithRequest:request completionHandler:^(NSData *data, NSURLResponse *response, NSError *error) {
                    if (error) {
                        NSLog(@"Failed to download Reddit page: %@", [error localizedDescription]);
                    } else {
                        // Check if the response status code is 200 (OK)
                        NSHTTPURLResponse *httpResponse = (NSHTTPURLResponse *)response;
                        if (httpResponse.statusCode == 200) {
                            // Get the HTML content as a string
                            NSString *htmlContent = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];

                            // Parse the HTML content using libxml2
                            htmlDocPtr doc = htmlReadDoc((const xmlChar *)[htmlContent UTF8String], NULL, NULL, HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
                            if (doc) {
                                // Create an array to store RedditPost objects
                                NSMutableArray *redditPosts = [NSMutableArray array];

                                // Example of how to extract information from the parsed HTML document
                                xmlXPathContextPtr xpathCtx = xmlXPathNewContext(doc);
                                if (xpathCtx) {
                                    // Use XPath expressions to extract data
                                    NSString *xpathExpression = @"//div[contains(@class,'shreddit-post')]";
                                    xmlXPathObjectPtr xpathObj = xmlXPathEvalExpression((const xmlChar *)[xpathExpression UTF8String], xpathCtx);

                                    if (xpathObj) {
                                        // nodesetval can be NULL when nothing matched, so guard before reading nodeNr
                                        xmlNodeSetPtr nodes = xpathObj->nodesetval;
                                        int count = nodes ? nodes->nodeNr : 0;

                                        for (int i = 0; i < count; i++) {
                                            // Create a RedditPost object to store data
                                            RedditPost *post = [[RedditPost alloc] init];

                                            // Extract data for each post
                                            // You will need to modify these XPath expressions to match Reddit's actual structure
                                            post.permalink = @"";
                                            post.contentHref = @"";
                                            post.commentCount = @"";
                                            post.postTitle = @"";
                                            post.author = @"";
                                            post.score = @"";

                                            [redditPosts addObject:post];
                                        }

                                        xmlXPathFreeObject(xpathObj);
                                    }

                                    xmlXPathFreeContext(xpathCtx);
                                }

                                // Now you have an array of RedditPost objects with extracted data
                                for (RedditPost *post in redditPosts) {
                                    NSLog(@"Permalink: %@", post.permalink);
                                    NSLog(@"Content Href: %@", post.contentHref);
                                    NSLog(@"Comment Count: %@", post.commentCount);
                                    NSLog(@"Post Title: %@", post.postTitle);
                                    NSLog(@"Author: %@", post.author);
                                    NSLog(@"Score: %@", post.score);
                                    NSLog(@"\n");
                                }

                                // Example of how to free the parsed document when done
                                xmlFreeDoc(doc);
                            } else {
                                NSLog(@"Failed to parse HTML content");
                            }
                        } else {
                            NSLog(@"Failed to download Reddit page (status code %ld)", (long)httpResponse.statusCode);
                        }
                    }

                    // Let main() know the handler is done
                    dispatch_semaphore_signal(finished);
                }];

                // Start the task
                [task resume];

                // Wait for the handler; without this the program exits before the download completes
                dispatch_semaphore_wait(finished, DISPATCH_TIME_FOREVER);
            } else {
                NSLog(@"Invalid Reddit URL");
            }
        }
        return 0;
    }

    The concepts here appear in all kinds of web scraping tasks, so grasping these core ideas empowers you to start scraping data across the internet!
