Web Scraping Google Scholar in Objective-C

Jan 21, 2024 · 9 min read

Google Scholar is an invaluable resource for researching academic publications across disciplines. However, sometimes you need to extract information from Google Scholar programmatically for analysis or further processing. That's where web scraping comes in!

Web scraping uses code to automatically extract data from websites. In this article, we'll explore a complete iOS example for scraping key fields from Google Scholar search results.

Even with no prior web scraping experience, you'll learn:

  • How to construct a request to retrieve data from Google Scholar
  • Techniques for parsing and processing the HTML response
  • Extracting specific pieces of information using selectors
  • Storing scraped results in a customized model object

So let's get scraping!

    Setting Up the Scraper

    We'll utilize native iOS frameworks to handle fetching and parsing. First we need a few imports:

    #import <Foundation/Foundation.h>
    #import <libxml/HTMLParser.h>
    

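    The Foundation framework provides networking, data structures, and NSXMLParser, the event-driven parser we use below. The libxml/HTMLParser.h header comes from libxml2, a separate C library with a more forgiving HTML parser; the code in this article never calls it, and importing it requires adding libxml2 to the header search paths and linking against it.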

    Next we define a custom class ScholarResult to model search result data:

    @interface ScholarResult : NSObject
    @property (nonatomic, strong) NSString *title;
    @property (nonatomic, strong) NSString *url;
    @property (nonatomic, strong) NSString *authors;
    @property (nonatomic, strong) NSString *abstract;
    @end
    

    Each search result will be represented by an instance of ScholarResult containing the title, URL, authors, and abstract.

    We also need a handler class to act as the parser delegate:

    @interface ScholarParser : NSObject <NSXMLParserDelegate>
    @property (nonatomic, strong) NSMutableArray<ScholarResult *> *results;
    @property (nonatomic, strong) ScholarResult *currentResult;
    @property (nonatomic, strong) NSMutableString *currentElementValue;
    @property (nonatomic, strong) NSMutableArray<NSString *> *classStack;
    @end
    

    Key properties:

  • results: Stores extracted ScholarResult objects
  • currentResult: Reference to the ScholarResult we're currently populating
  • currentElementValue: Temporary string for building up text content as we parse
  • classStack: The class attribute of every element that is currently open, innermost last; needed because the end-element callback does not receive attributes

    The parser delegate methods will do the actual data extraction work.

    Constructing the Request

    To retrieve the Google Scholar HTML, we need to send a properly formatted request:

    // Define URL
    NSString *urlString = @"https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=";
    
    // User-Agent header
    NSString *userAgent = @"Mozilla/5.0...";
    
    // Create NSURL
    NSURL *url = [NSURL URLWithString:urlString];
    
    // Create request
    NSMutableURLRequest *request = [NSMutableURLRequest requestWithURL:url];
    
    // Set User-Agent header
    [request setValue:userAgent forHTTPHeaderField:@"User-Agent"];
    

    Key aspects:

  • Hardcoded Google Scholar URL with a sample query
  • User agent header simulating a desktop browser
  • NSMutableURLRequest allows customization before sending

    We could parameterize the search query, but hardcoding serves our example well.

    Important: Google expects a proper user agent header. Passing a web browser's user agent helps avoid bot detection.
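
    If you do want to parameterize the query, one option is to let NSURLComponents do the percent-encoding. The sketch below is only an illustration: ScholarSearchURL is a hypothetical helper that does not appear in the listing in this article, and it assumes the same hl and as_sdt parameters as the hardcoded URL.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of the original listing): builds a
// Google Scholar search URL for an arbitrary query string.
static NSURL *ScholarSearchURL(NSString *query) {
    NSURLComponents *components =
        [NSURLComponents componentsWithString:@"https://scholar.google.com/scholar"];
    // NSURLComponents percent-encodes query item values where needed,
    // so spaces in the query do not have to be escaped by hand.
    components.queryItems = @[
        [NSURLQueryItem queryItemWithName:@"hl" value:@"en"],
        [NSURLQueryItem queryItemWithName:@"as_sdt" value:@"0,5"],
        [NSURLQueryItem queryItemWithName:@"q" value:query],
    ];
    return components.URL;
}
```

    The returned NSURL can be passed straight to requestWithURL:, for example with ScholarSearchURL(@"graph neural networks").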

    Now that preparation is complete, we can actually fetch the search results page by sending the request!

    Fetching and Parsing the Response

    To retrieve and parse Google's response, we leverage NSURLSession:

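    // Send request & get response
    NSURLSession *session = [NSURLSession sharedSession];
    dispatch_semaphore_t done = dispatch_semaphore_create(0);
    NSURLSessionDataTask *dataTask = [session dataTaskWithRequest:request completionHandler:^(NSData * _Nullable data, NSURLResponse * _Nullable response, NSError * _Nullable error) {
       // Handle data

       // Wake the main thread so the program can exit
       dispatch_semaphore_signal(done);
    }];
    
    [dataTask resume];
    
    // Wait for the completion handler; running the run loop here would never return,
    // and finishTasksAndInvalidate has no effect on the shared session
    dispatch_semaphore_wait(done, DISPATCH_TIME_FOREVER);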
    

    This asynchronously sends our request and executes the completion handler when the full response NSData is available, or if an error occurs.
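
    One detail worth handling inside the completion handler: NSURLSession only sets error for transport failures. A rejected or rate-limited request still arrives with error set to nil and an HTTP error status. A minimal sketch of a check, assuming a hypothetical helper named ScholarBodyIfOK that is not part of the listing in this article:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the response body only when the request
// succeeded at both the transport and the HTTP level, otherwise nil.
static NSData *ScholarBodyIfOK(NSData *data, NSURLResponse *response, NSError *error) {
    if (error != nil) {
        NSLog(@"Request failed: %@", error.localizedDescription);
        return nil;
    }
    if ([response isKindOfClass:[NSHTTPURLResponse class]]) {
        NSInteger status = ((NSHTTPURLResponse *)response).statusCode;
        // Anything outside 2xx is treated as a failure here.
        if (status < 200 || status > 299) {
            NSLog(@"Unexpected HTTP status: %ld", (long)status);
            return nil;
        }
    }
    return data.length > 0 ? data : nil;
}
```

    Calling this first in the completion handler and parsing only when it returns non-nil keeps error pages out of the parser.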

    With the raw HTML NSData, we can parse using NSXMLParser:

    // Parse HTML
    ScholarParser *parser = [[ScholarParser alloc] init];
    NSXMLParser *xmlParser = [[NSXMLParser alloc] initWithData:data];
    xmlParser.delegate = parser;
    [xmlParser parse];
    

    Feeding the response data into NSXMLParser triggers the ScholarParser delegate methods as elements are encountered.
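
    Keep in mind that NSXMLParser is a strict XML parser, so markup that is not well-formed makes parse return NO and sets parserError. The sketch below shows one way to check this before relying on the delegate output; ScholarParsesAsXML is a hypothetical helper name, not something from the listing in this article.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: runs NSXMLParser over the data without a delegate
// and reports whether the markup was well-formed XML.
static BOOL ScholarParsesAsXML(NSData *data, NSError **outError) {
    NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
    BOOL ok = [parser parse];
    if (!ok && outError != NULL) {
        // parserError describes the first fatal problem the parser hit.
        *outError = parser.parserError;
    }
    return ok;
}
```

    An unclosed tag such as a bare <br> is enough to make the check fail, which is common in real pages. When that happens, a lenient HTML parser such as the one in libxml2 is the usual alternative.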

    This is where the magic happens! The delegate methods will find and extract fields of interest.

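    Extracting Paper Details by Matching Elements

    Inspecting the code

    You can see that the items are enclosed in a <div> element with the class gs_ri.

    NSXMLParser is an event-driven parser. As it processes the markup, delegate methods notify us when elements open and close, handing us an element's attributes when it opens and its text content as it is read. It does not evaluate XPath or CSS selectors, and because it is a strict XML parser it stops with an error on markup that is not well-formed, which real-world HTML often is not.

    The key is matching element names and class attributes in the delegate callbacks to pinpoint the elements containing the data we want.

    For example, the title lives inside <h3> tags: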

    <h3>Attention Is All You Need</h3>
    

    Let's break down selector logic to extract titles:

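    - (void)parser:(NSXMLParser *)parser didEndElement:(NSString *)elementName namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName {
    
      if ([elementName isEqualToString:@"h3"]) {
          // Set title on current result; copy so later appends to the buffer don't change it
          self.currentResult.title = [self.currentElementValue copy];
      }
    
    }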
    

    Whenever an <h3> close tag is encountered:

  • Check if element name equals "h3"
  • If so, we've reached the end of a title!
  • Set the accumulated text as title on current result

    Similar logic applies for other fields like URL, authors, and abstract.
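
    To see the callbacks working end to end, here is a small self-contained sketch that collects the text and the links of every <h3> element. TitleCollector is a hypothetical demo class, separate from ScholarParser, and the markup it is meant for is a hand-written, well-formed fragment rather than a real Scholar response. Note that the link target comes from the href attribute in didStartElement, not from the element's text.

```objc
#import <Foundation/Foundation.h>

// Hypothetical demo delegate: collects the text of every <h3> and the
// href of every <a> found inside an <h3>.
@interface TitleCollector : NSObject <NSXMLParserDelegate>
@property (nonatomic, strong) NSMutableArray<NSString *> *titles;
@property (nonatomic, strong) NSMutableArray<NSString *> *links;
@property (nonatomic, strong) NSMutableString *buffer; // non-nil only inside <h3>
@end

@implementation TitleCollector

- (instancetype)init {
    if ((self = [super init])) {
        _titles = [NSMutableArray array];
        _links = [NSMutableArray array];
    }
    return self;
}

- (void)parser:(NSXMLParser *)parser didStartElement:(NSString *)elementName
  namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName
    attributes:(NSDictionary<NSString *, NSString *> *)attributeDict {
    if ([elementName isEqualToString:@"h3"]) {
        // A title starts: begin collecting text.
        self.buffer = [NSMutableString string];
    } else if (self.buffer != nil && [elementName isEqualToString:@"a"]
               && attributeDict[@"href"] != nil) {
        // Attributes are only available here, when the element opens.
        [self.links addObject:attributeDict[@"href"]];
    }
}

- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string {
    // Messaging nil is a no-op, so text outside <h3> is ignored.
    [self.buffer appendString:string];
}

- (void)parser:(NSXMLParser *)parser didEndElement:(NSString *)elementName
  namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName {
    if ([elementName isEqualToString:@"h3"] && self.buffer != nil) {
        [self.titles addObject:[self.buffer copy]];
        self.buffer = nil;
    }
}

@end
```

    Because the buffer is created when <h3> opens and is not reset for nested tags, text inside inline elements such as <b> still ends up in the title.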

    Key delegate methods:

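    - (void)parser:(NSXMLParser *)parser didStartElement:(NSString *)elementName namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName attributes:(NSDictionary<NSString *, NSString *> *)attributeDict {
       NSString *elementClass = attributeDict[@"class"] ?: @"";
    
       // Remember the class of every open element (classStack is an
       // NSMutableArray<NSString *> property); didEndElement gets no attributes
       [self.classStack addObject:elementClass];
    
       // Initialize current result when a new search result starts
       if ([elementName isEqualToString:@"div"] && [elementClass isEqualToString:@"gs_ri"]) {
          self.currentResult = [[ScholarResult alloc] init];
       }
    
       // Reset current element value only where a field starts, so text
       // inside nested tags such as <b> keeps accumulating
       if ([elementName isEqualToString:@"h3"] || [elementClass isEqualToString:@"gs_a"] || [elementClass isEqualToString:@"gs_rs"]) {
          self.currentElementValue = [NSMutableString string];
       }
    }
    
    - (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string {
      [self.currentElementValue appendString:string];
    }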
    

    This handles:

  • Detecting when a new search result starts
  • Resetting current text value
  • Accumulating text content from nested elements

    By matching element names and class attributes this way, we extract all the data we need!

    Storing Results

    With each result extracted, we add it to the overall results array:

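    - (void)parser:(NSXMLParser *)parser didEndElement:(NSString *)elementName namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName {
    
      // didEndElement receives no attributes, so read the class that
      // didStartElement pushed onto classStack for this element
      NSString *elementClass = self.classStack.lastObject;
      [self.classStack removeLastObject];
    
      // After all fields populated:
      if (self.currentResult && [elementName isEqualToString:@"div"] && [elementClass isEqualToString:@"gs_ri"]) {
         [self.results addObject:self.currentResult];
         self.currentResult = nil; // Reset
      }
    
    }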
    

    By the end, self.results contains all search results, neatly packaged in ScholarResult objects!

    We could analyze these programmatically, display them in an app, save to a database, etc. The possibilities are endless!
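
    As one example of further processing, the results can be serialized to JSON. This is a minimal sketch under a few assumptions: ScholarJSONData is a hypothetical helper, and it reads the title, url, authors, and abstract keys through key-value coding, so it works with ScholarResult objects as well as with plain dictionaries.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: converts result objects into pretty-printed JSON.
// Any object that exposes the four keys through key-value coding works.
static NSData *ScholarJSONData(NSArray *results, NSError **error) {
    NSArray<NSString *> *keys = @[@"title", @"url", @"authors", @"abstract"];
    NSMutableArray *rows = [NSMutableArray arrayWithCapacity:results.count];
    for (id result in results) {
        // dictionaryWithValuesForKeys: substitutes NSNull for nil values,
        // which NSJSONSerialization writes as JSON null.
        [rows addObject:[result dictionaryWithValuesForKeys:keys]];
    }
    return [NSJSONSerialization dataWithJSONObject:rows
                                           options:NSJSONWritingPrettyPrinted
                                             error:error];
}
```

    The returned data can be written to disk with writeToFile:atomically: or sent on to another service.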

    Putting It All Together

    #import <Foundation/Foundation.h>
    #import <libxml/HTMLParser.h>
    
    // Create custom class to store result data
    @interface ScholarResult : NSObject
    @property (nonatomic, strong) NSString *title;
    @property (nonatomic, strong) NSString *url;
    @property (nonatomic, strong) NSString *authors;
    @property (nonatomic, strong) NSString *abstract;
    @end
    
    @implementation ScholarResult
    @end
    
    @interface ScholarParser : NSObject <NSXMLParserDelegate>
    @property (nonatomic, strong) NSMutableArray<ScholarResult *> *results;
    @property (nonatomic, strong) ScholarResult *currentResult;
    @property (nonatomic, strong) NSMutableString *currentElementValue;
    // Class attribute of every open element, innermost last. Needed because
    // didEndElement does not receive the element's attributes.
    @property (nonatomic, strong) NSMutableArray<NSString *> *classStack;
    @end
    
    @implementation ScholarParser
    
    - (instancetype)init {
        self = [super init];
        if (self) {
            self.results = [NSMutableArray array];
            self.classStack = [NSMutableArray array];
        }
        return self;
    }
    
    - (void)parserDidStartDocument:(NSXMLParser *)parser {
        NSLog(@"Parsing started.");
    }
    
    - (void)parserDidEndDocument:(NSXMLParser *)parser {
        NSLog(@"Parsing finished.");
        
        for (ScholarResult *result in self.results) {
            NSLog(@"Title: %@", result.title);
            NSLog(@"URL: %@", result.url);
            NSLog(@"Authors: %@", result.authors);
            NSLog(@"Abstract: %@", result.abstract);
            NSLog(@"-------------------------------------------------");
        }
    }
    
    - (void)parser:(NSXMLParser *)parser didStartElement:(NSString *)elementName namespaceURI:(nullable NSString *)namespaceURI qualifiedName:(nullable NSString *)qName attributes:(NSDictionary<NSString *, NSString *> *)attributeDict {
        NSString *elementClass = attributeDict[@"class"] ?: @"";
        [self.classStack addObject:elementClass];
    
        BOOL isDiv = [elementName isEqualToString:@"div"];
        if (isDiv && [elementClass isEqualToString:@"gs_ri"]) {
            self.currentResult = [[ScholarResult alloc] init];
        }
    
        // Start a fresh text buffer only where a field begins, so text inside
        // nested tags such as <b> keeps accumulating
        if ([elementName isEqualToString:@"h3"] || (isDiv && ([elementClass isEqualToString:@"gs_a"] || [elementClass isEqualToString:@"gs_rs"]))) {
            self.currentElementValue = [NSMutableString string];
        } else if ([elementName isEqualToString:@"a"] && self.currentResult && self.currentResult.url == nil) {
            // Assumption: the first link inside a result block is the title link.
            // The URL is the href attribute, not the link text.
            self.currentResult.url = attributeDict[@"href"];
        }
    }
    
    - (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string {
        [self.currentElementValue appendString:string];
    }
    
    - (void)parser:(NSXMLParser *)parser didEndElement:(NSString *)elementName namespaceURI:(nullable NSString *)namespaceURI qualifiedName:(nullable NSString *)qName {
        // Pop the class that didStartElement recorded for this element
        NSString *elementClass = self.classStack.lastObject ?: @"";
        if (self.classStack.count > 0) {
            [self.classStack removeLastObject];
        }
        if (self.currentResult == nil) {
            return;
        }
    
        NSString *text = [self.currentElementValue stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
        BOOL isDiv = [elementName isEqualToString:@"div"];
        if ([elementName isEqualToString:@"h3"] && text.length > 0) {
            self.currentResult.title = text;
        } else if (isDiv && [elementClass isEqualToString:@"gs_a"] && text.length > 0) {
            self.currentResult.authors = text;
        } else if (isDiv && [elementClass isEqualToString:@"gs_rs"] && text.length > 0) {
            self.currentResult.abstract = text;
        } else if (isDiv && [elementClass isEqualToString:@"gs_ri"]) {
            // The result block is closed: store it and reset
            [self.results addObject:self.currentResult];
            self.currentResult = nil;
        }
    }
    
    @end
    
    int main(int argc, const char * argv[]) {
        @autoreleasepool {
            // Define the URL of the Google Scholar search page
            NSString *urlString = @"https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=";
    
            // Define a User-Agent header
            NSString *userAgent = @"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"; // Replace with your User-Agent string
    
            // Create an NSURL object from the URL string
            NSURL *url = [NSURL URLWithString:urlString];
    
            // Create an NSMutableURLRequest with the URL
            NSMutableURLRequest *request = [NSMutableURLRequest requestWithURL:url];
    
            // Set the User-Agent header
            [request setValue:userAgent forHTTPHeaderField:@"User-Agent"];
    
            // Send a GET request to the URL with the User-Agent header
            NSURLSession *session = [NSURLSession sharedSession];
            dispatch_semaphore_t done = dispatch_semaphore_create(0);
            NSURLSessionDataTask *dataTask = [session dataTaskWithRequest:request completionHandler:^(NSData * _Nullable data, NSURLResponse * _Nullable response, NSError * _Nullable error) {
                if (error) {
                    NSLog(@"Failed to retrieve the page. Error: %@", error.localizedDescription);
                } else {
                    // Parse the HTML content of the page using NSXMLParser
                    ScholarParser *parser = [[ScholarParser alloc] init];
                    NSXMLParser *xmlParser = [[NSXMLParser alloc] initWithData:data];
                    xmlParser.delegate = parser;
                    if (![xmlParser parse]) {
                        // NSXMLParser is strict; markup that is not well-formed ends up here
                        NSLog(@"Parsing failed. Error: %@", xmlParser.parserError);
                    }
                }
                // Wake the main thread so the program can exit
                dispatch_semaphore_signal(done);
            }];
    
            [dataTask resume];
    
            // Wait for the completion handler instead of running the run loop forever
            dispatch_semaphore_wait(done, DISPATCH_TIME_FOREVER);
        }
        return 0;
    }

    This is great as a learning exercise, but it is easy to see that the scraper itself is prone to getting blocked because it uses a single IP. If you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

    Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • Millions of high speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
  • Automatic CAPTCHA solving technology

    Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed by a simple API like below in any programming language.

    In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes. You can just get the data and parse it in any language such as Node or PHP, or with any framework such as Scrapy or Nutch. In all these cases you can just call the URL with render support like so:

    curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"
    
    

    We have a running offer of 1000 API calls completely free. Register and get your free API Key.
