Scrape Any Website with OpenAI Function Calling in Ruby

Sep 25, 2023 · 7 min read

Web scraping lets you extract data from websites programmatically. This is useful for gathering information such as prices, inventory, and reviews.

OpenAI's function calling feature offers an innovative way to build robust web scrapers using natural language processing.

In this post, we will walk through a complete Ruby code example that uses OpenAI function calling to scrape product data from a sample ecommerce website.

Leveraging OpenAI Function Calling

OpenAI function calling provides a way to define schemas for the data you want extracted from a given input. When making an API request, you can specify a function name and parameters representing the expected output format.

OpenAI's natural language model will then analyze the provided input, extract relevant data from it, and return the extracted information structured according to the defined schema.

This pattern separates the raw data extraction capabilities of the AI model from your downstream data processing logic. Your code simply expects the data in a clean, structured format based on the function specification.

By leveraging OpenAI's natural language processing strengths for data extraction, you can create web scrapers that are resilient to changes in the underlying page structure and content. The business logic remains high-level and focused on data usage, while OpenAI handles the messy details of parsing and extracting information from complex HTML.

Why Use Function Calling

One key advantage of this web scraping technique is that the core scraper logic is immune to changes in the HTML structure of the target site. Since OpenAI is responsible for analyzing the raw HTML and extracting the desired data, the Ruby code does not make any assumptions about HTML structure. The scraper will adapt as long as the sample HTML provided to OpenAI reflects the current page structure. This makes the scraper much more robust against site redesigns compared to scraping code that depends on specific HTML elements.

Overview

Here is an overview of the web scraping process we will implement:

  1. Send HTML representing the target page to OpenAI
  2. OpenAI analyzes the HTML and extracts the data we want
  3. OpenAI returns the extracted data structured as defined in our Ruby function
  4. Process the extracted data in Ruby as needed

This allows creating a scraper that adapts to changes in page layouts. The core logic stays high-level while OpenAI handles analyzing the raw HTML.

Installing the OpenAI Ruby Gem

To call the OpenAI API from Ruby, we need to install the ruby-openai gem:

gem install ruby-openai

This will install the widely used community Ruby client for the OpenAI API.

If using Bundler, we can add it to our Gemfile:

gem 'ruby-openai'

And run bundle install.

Then in our code we can require the gem and initialize a client instance:

require 'openai'

openai = OpenAI::Client.new(access_token: 'sk-...')

The openai client exposes methods like chat for sending API requests.

So the key steps are:

  • Install the ruby-openai gem
  • Require openai in code
  • Initialize Client with your API key
  • Call methods like chat

This allows us to leverage OpenAI's API from Ruby code to implement web scraping using function calling.
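Putting these steps together, here is a minimal sketch. It builds the request payload in the shape the ruby-openai gem's chat method expects; the network call itself is commented out so the sketch runs without an API key, and the function schema shown is a placeholder:

```ruby
require "json"

# Build the parameters hash that would be passed to OpenAI::Client#chat.
def build_chat_parameters(html, functions)
  {
    model: "gpt-3.5-turbo",                       # a chat model that supports function calling
    messages: [{ role: "user", content: html }],  # the raw HTML goes in as a user message
    functions: functions
  }
end

# Placeholder schema; the real schema is defined later in the post.
functions = [{ name: "extracted_data", parameters: { type: "object", properties: {} } }]

params = build_chat_parameters("<div class='product'>...</div>", functions)

# With a real key, the request would be sent like this:
# require "openai"
# openai = OpenAI::Client.new(access_token: ENV["OPENAI_API_KEY"])
# response = openai.chat(parameters: params)

puts params[:messages].first[:role]
```

Keeping payload construction in its own method makes the request easy to inspect and test before any API call is made.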

Sample HTML

First, we need some sample HTML representing the page content we want to scrape.

Here is sample HTML for a page listing 3 products:

<div class="products">

  <div class="product">
    <h3>Blue T-Shirt</h3>
    <p>A comfortable blue t-shirt made from 100% cotton.</p>
    <p>Price: $14.99</p>
  </div>

  <div class="product">
    <h3>Noise Cancelling Headphones</h3>
    <p>These wireless over-ear headphones provide active noise cancellation.</p>
    <p>Price: $199.99</p>
  </div>

  <div class="product">
    <h3>Leather Laptop Bag</h3>
    <p>Room enough for up to a 15" laptop. Made from genuine leather.</p>
    <p>Price: $49.99</p>
  </div>

</div>

This contains 3 product listings, each with a title, description and price.

Sending HTML to OpenAI

Next, we need to send this sample HTML to the OpenAI API. The HTML is passed in the content parameter:

messages = [
  {
    role: "user",
    content: html
  }
]

This will allow OpenAI to analyze the HTML structure.

Defining Output Schema

We need to define the expected output schema so OpenAI knows what data to extract.

We'll define an extracted_data function whose parameters are an object containing a products array:

functions = [
  {
    name: "extracted_data",
    description: "Extract product data from HTML",

    parameters: {
      type: "object",

      properties: {
        products: {
          type: "array",

          items: {
            type: "object",

            properties: {
              title: {
                type: "string"
              },
              description: {
                type: "string"
              },
              price: {
                type: "string"
              }
            }
          }
        }
      },

      required: ["products"]
    }
  }
]

This specifies we want a products array of objects, each with a title, description and price. Note that the top-level parameters value must be a JSON Schema object, so the array is wrapped in a products property.

Calling OpenAI API

Now we can call the OpenAI API, passing the HTML and function definition:

response = openai.chat(
  parameters: {
    model: "gpt-3.5-turbo",
    messages: messages,
    functions: functions
  }
)

This will analyze the HTML and return extracted data matching the schema we defined. Note that function calling requires a chat model such as gpt-3.5-turbo; older completion models do not support it.
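The model does not invoke our Ruby function itself; it returns a function_call message whose arguments field is a JSON string matching our schema. Here is a minimal sketch of parsing it, where the response hash is a hand-written stand-in for a real API response:

```ruby
require "json"

# A hand-written stand-in for the structure a chat response takes
# when the model decides to call our function.
response = {
  "choices" => [
    {
      "message" => {
        "function_call" => {
          "name" => "extracted_data",
          "arguments" => '{"products":[{"title":"Blue T-Shirt","description":"A comfortable blue t-shirt.","price":"$14.99"}]}'
        }
      }
    }
  ]
}

# The arguments come back as a JSON string, so we parse them into a hash.
raw_args = response.dig("choices", 0, "message", "function_call", "arguments")
args = JSON.parse(raw_args)

puts args["products"].first["title"]
# → Blue T-Shirt
```

Parsing the arguments with JSON.parse is the bridge between OpenAI's output and our downstream Ruby code.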

Processing Extracted Data

Finally, we can process the extracted data in our Ruby function:

def extracted_data(products)

  # Output product data
  products.each do |product|
    puts product["title"]
    puts product["description"]
    puts product["price"]
  end

end

This simply loops through and prints each product's details. We could also save the data to a database, etc.
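Since the model only names the function to call, our code is responsible for dispatching to it with the parsed arguments. A minimal sketch, using an allowlist so only known function names are ever invoked (the hardcoded name and arguments stand in for a real function_call message):

```ruby
require "json"

# Our handler; here it just collects the product titles.
def extracted_data(products)
  products.map { |p| p["title"] }
end

# Only dispatch to functions we explicitly allow.
ALLOWED_FUNCTIONS = ["extracted_data"]

# Stand-ins for the name and arguments a function_call message would carry.
name = "extracted_data"
arguments = '{"products":[{"title":"Blue T-Shirt"},{"title":"Leather Laptop Bag"}]}'

result =
  if ALLOWED_FUNCTIONS.include?(name)
    args = JSON.parse(arguments)
    send(name, args["products"])
  end

puts result.inspect
```

The allowlist check matters because the function name arrives from the API response; dispatching on it blindly with send would let unexpected input call arbitrary methods.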


Full Code Example

Here is the full Ruby code to scrape product data using OpenAI function calling:

require 'openai'
require 'json'

# Extracted data function
def extracted_data(products)

  puts "Extracted Product Data"

  products.each do |product|
    puts product["title"]
    puts product["description"]
    puts product["price"]
    puts "---"
  end

  { status: "saved" }

end

# Sample HTML
html = <<-HTML
  <div class="products">

    <div class="product">
      <h3>Blue T-Shirt</h3>
      <p>A comfortable blue t-shirt made from 100% cotton.</p>
      <p>Price: $14.99</p>
    </div>

    <!-- More products -->

  </div>
HTML

# Send HTML to OpenAI
messages = [
  {role: "user", content: html}
]

# Function schema
functions = [
  {
    name: "extracted_data",
    description: "Extract product data from HTML",

    parameters: {
      type: "object",
      properties: {
        products: {
          type: "array",
          items: {
            type: "object",
            properties: {
              title: {type: "string"},
              description: {type: "string"},
              price: {type: "string"}
            }
          }
        }
      },
      required: ["products"]
    }
  }
]

openai = OpenAI::Client.new(access_token: 'sk-...')

# Call API
response = openai.chat(
  parameters: {
    model: "gpt-3.5-turbo",
    messages: messages,
    functions: functions
  }
)

# Parse the function call arguments and invoke our function
function_call = response.dig("choices", 0, "message", "function_call")
args = JSON.parse(function_call["arguments"])
extracted_data(args["products"])

Conclusion

Using OpenAI opens up an exciting new way to approach web scraping that wasn't possible before.

However, this approach also has some limitations:

  • The scraping code needs to handle CAPTCHAs, IP blocks and other anti-scraping measures
  • Running the scrapers on your own infrastructure can lead to IP blocks
  • Dynamic content needs specialized handling

A more robust solution is using a dedicated web scraping API like Proxies API. With Proxies API, you get:

  • Millions of proxy IPs for rotation to avoid blocks
  • Automatic handling of CAPTCHAs and IP blocks
  • Rendering of Javascript-heavy sites
  • Simple API access without needing to run scrapers yourself

With features like automatic IP rotation, user-agent rotation and CAPTCHA solving, Proxies API makes robust web scraping easy via a simple API:

curl "https://api.proxiesapi.com/?key=API_KEY&url=targetsite.com"

Get started now with 1000 free API calls to supercharge your web scraping!
