Web scraping extracts data from websites programmatically. It is useful for gathering information such as prices, inventory and reviews.
OpenAI's function calling feature offers a way to build more robust web scrapers: a language model reads the page and returns the data in a structure you define.
In this post, we will walk through a complete JavaScript code example that leverages OpenAI function calling to scrape product data from a sample ecommerce website.
Leveraging OpenAI Function Calling
OpenAI function calling provides a way to define schemas for the data you want extracted from a given input. When making an API request, you can specify a function name and parameters representing the expected output format.
OpenAI's natural language model will then analyze the provided input, extract relevant data from it, and return the extracted information structured according to the defined schema.
This pattern separates the raw data extraction capabilities of the AI model from your downstream data processing logic. Your code simply expects the data in a clean, structured format based on the function specification.
By leveraging OpenAI's natural language processing strengths for data extraction, you can create web scrapers that are resilient to changes in the underlying page structure and content. The business logic remains high-level and focused on data usage, while OpenAI handles the messy details of parsing and extracting information from complex HTML.
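To make this contract concrete, here is a minimal sketch of what the structured output looks like. The message object below is hand-written for illustration and is not real API output; it assumes the Chat Completions response shape, where a function call carries a name and an arguments field holding a JSON string.

```javascript
// Hand-written example of an assistant message containing a function call.
// Assumption: this mirrors the Chat Completions message shape; the values are made up.
const exampleMessage = {
  role: 'assistant',
  content: null,
  function_call: {
    name: 'extractedData',
    // Note: arguments is a JSON string, not an object
    arguments: '{"products":[{"title":"Blue T-Shirt","price":"$14.99"}]}'
  }
};

// Downstream code only has to parse the string to get structured data
const args = JSON.parse(exampleMessage.function_call.arguments);
console.log(args.products[0].title); // Blue T-Shirt
```

Everything after the JSON.parse call is ordinary JavaScript that never touches HTML.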
Why Use Function Calling
One key advantage of this technique is that the core scraper logic is largely insulated from changes in the HTML structure of the target site. Since OpenAI is responsible for analyzing the raw HTML and extracting the desired data, the JavaScript code makes no assumptions about specific elements, classes or selectors. As long as the HTML sent to OpenAI is the current page content, the same code keeps working. This makes the scraper more robust against site redesigns than scraping code that depends on specific HTML elements.
Overview
Here is an overview of the web scraping process we will implement:
- Send HTML representing the target page to OpenAI
- OpenAI analyzes the HTML and extracts the data we want
- OpenAI returns the extracted data structured according to our function schema
- Process the extracted data in JavaScript as needed
This allows creating a scraper that adapts to changes in page layouts. The core logic stays high-level while OpenAI handles analyzing the raw HTML.
The Setup
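The main library we need is the official OpenAI JavaScript library. The code in this post uses the version 3 interface (Configuration and OpenAIApi), so we pin that major version:
npm install openai@3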
This provides a JavaScript client for calling the OpenAI API endpoints.
Then in the code we initialize the OpenAI client, passing our API key:
const { Configuration, OpenAIApi } = require("openai");
const configuration = new Configuration({
  apiKey: process.env.OPENAI_API_KEY,
});
const openai = new OpenAIApi(configuration);
This openai instance can then be used to call API methods such as createChatCompletion(), the chat endpoint that supports function calling.
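So the key dependencies are Node.js, the openai package and an OpenAI API key stored in the OPENAI_API_KEY environment variable.
With these pieces, we can call the OpenAI API from a Node.js application to implement the web scraping example.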
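Since every request depends on the API key, it can help to fail early when it is missing. The sketch below is one way to do that; requireApiKey is a hypothetical helper introduced here for illustration, not part of the openai package.

```javascript
// Hypothetical helper (not part of the openai package): returns the API key
// from an environment-like object, or throws a clear error if it is missing.
function requireApiKey(env) {
  const key = env.OPENAI_API_KEY;
  if (typeof key !== 'string' || key.trim() === '') {
    throw new Error('OPENAI_API_KEY is not set');
  }
  return key;
}

// In a real script we would pass process.env; a plain object is used here
// so the example runs without any configuration.
const apiKey = requireApiKey({ OPENAI_API_KEY: 'sk-example' });
console.log(apiKey.length > 0); // true
```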
Sample HTML
First, we need some sample HTML representing the page content we want to scrape.
Here is sample HTML for a page listing 3 products:
<div class="products">
  <div class="product">
    <h3>Blue T-Shirt</h3>
    <p>A comfortable blue t-shirt made from 100% cotton.</p>
    <p>Price: $14.99</p>
  </div>
  <div class="product">
    <h3>Noise Cancelling Headphones</h3>
    <p>These wireless over-ear headphones provide active noise cancellation.</p>
    <p>Price: $199.99</p>
  </div>
  <div class="product">
    <h3>Leather Laptop Bag</h3>
    <p>Room enough for up to a 15" laptop. Made from genuine leather.</p>
    <p>Price: $49.99</p>
  </div>
</div>
This contains 3 product listings, each with a title, description and price.
Sending HTML to OpenAI
Next, we need to send this sample HTML to the OpenAI API. The HTML is passed in the messages array as the content of a user message:
const messages = [
  {
    role: 'user',
    content: html
  }
];
This will allow OpenAI to analyze the HTML structure.
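Real pages are much larger than this sample, and every character sent counts toward the model's input. As an optional refinement that is not part of the original example, the sketch below collapses whitespace and prepends a system message describing the task. buildMessages is a hypothetical helper name, and the system prompt wording is only an assumption about what works well.

```javascript
// Hypothetical helper: builds the messages array for a given HTML string.
function buildMessages(html) {
  // Collapse runs of whitespace so indentation does not inflate the request.
  // Caveat: this also alters whitespace inside <pre> blocks.
  const compactHtml = html.replace(/\s+/g, ' ').trim();
  return [
    // Optional system message telling the model what its job is
    { role: 'system', content: 'Extract product data from the HTML the user provides.' },
    { role: 'user', content: compactHtml }
  ];
}

const exampleMessages = buildMessages('<div class="product">\n  <h3>Blue T-Shirt</h3>\n</div>');
console.log(exampleMessages[1].content);
// <div class="product"> <h3>Blue T-Shirt</h3> </div>
```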
Defining Output Schema
We need to define the expected output schema so OpenAI knows what data to extract.
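We'll define a functions array containing a single function, extractedData, whose parameters describe the output we expect: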
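const functions = [
  {
    name: 'extractedData',
    description: 'Extract product data from HTML',
    parameters: {
      // The top-level schema must be an object, so the array is wrapped in a products property
      type: 'object',
      properties: {
        products: {
          type: 'array',
          items: {
            type: 'object',
            properties: {
              title: { type: 'string' },
              description: { type: 'string' },
              price: { type: 'string' }
            }
          }
        }
      },
      required: ['products']
    }
  }
];
This specifies that we want an object with a products array, where each product has a title, description and price.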
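The schema guides the model, but the model is not guaranteed to follow it exactly, so it is worth checking the result before using it. The sketch below assumes the arguments arrive as an object with a products array, as in the full example later in this post. validateProducts is a hypothetical helper written for illustration.

```javascript
// Hypothetical helper: checks that a parsed arguments object looks like
// { products: [{ title, description, price }, ...] } with string fields.
function validateProducts(args) {
  if (!args || !Array.isArray(args.products)) {
    return false;
  }
  const fields = ['title', 'description', 'price'];
  return args.products.every(product =>
    product !== null &&
    typeof product === 'object' &&
    fields.every(field => typeof product[field] === 'string')
  );
}

console.log(validateProducts({
  products: [{ title: 'Blue T-Shirt', description: 'Cotton', price: '$14.99' }]
})); // true
console.log(validateProducts({ products: [{ title: 'Blue T-Shirt' }] })); // false
```

For larger schemas a JSON Schema validation library would be a more thorough choice than a hand-written check like this one.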
Calling OpenAI API
Now we can call the OpenAI API, passing the HTML and function definition:
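const data = {
  model: 'gpt-3.5-turbo',
  messages,
  functions,
  // Ask the model to always respond by calling extractedData
  function_call: { name: 'extractedData' }
};
// Function calling is part of the chat API; run this inside an async function
const response = await openai.createChatCompletion(data);
The model analyzes the HTML and returns a function call whose arguments are a JSON string matching the schema we defined.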
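The extracted data still has to be dug out of the response object. The sketch below shows one way to do that defensively. It assumes the request was made with createChatCompletion from version 3 of the openai package, where the payload sits under response.data. getFunctionArguments is a hypothetical helper, and the mock response is hand-written rather than real API output.

```javascript
// Hypothetical helper: pulls the parsed function arguments out of a response.
// Assumption: `response` has the shape returned by createChatCompletion in
// version 3 of the openai package (payload under response.data).
function getFunctionArguments(response) {
  const choice = response.data.choices[0];
  const call = choice && choice.message && choice.message.function_call;
  if (!call) {
    throw new Error('The model did not return a function call');
  }
  try {
    return JSON.parse(call.arguments);
  } catch (err) {
    throw new Error('Function arguments were not valid JSON: ' + err.message);
  }
}

// Mock response for illustration only (not real API output)
const mockResponse = {
  data: {
    choices: [{
      message: {
        role: 'assistant',
        content: null,
        function_call: {
          name: 'extractedData',
          arguments: '{"products":[{"title":"Leather Laptop Bag","description":"Genuine leather.","price":"$49.99"}]}'
        }
      }
    }]
  }
};

console.log(getFunctionArguments(mockResponse).products[0].price); // $49.99
```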
Processing Extracted Data
Finally, we can process the extracted data in our JavaScript function:
function extractedData(products) {
  // Output product data
  products.forEach(product => {
    console.log(product.title);
    console.log(product.description);
    console.log(product.price);
  });
}
This simply loops through and prints each product's details. In a real script we would call it with the products array parsed from the function call arguments, and we could also save the data to a database.
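Because the schema declares price as a string, the values usually need converting before they can be stored or compared. The sketch below is one possible approach, assuming prices use "." as the decimal separator and "," for thousands. parsePrice is a hypothetical helper, not something the API provides.

```javascript
// Hypothetical helper: turns a price string such as "$1,299.50" into a number.
// Assumption: "." is the decimal separator and "," separates thousands.
function parsePrice(priceText) {
  const cleaned = priceText.replace(/[^0-9.]/g, '');
  const value = Number.parseFloat(cleaned);
  // Return null when no number could be read from the text
  return Number.isNaN(value) ? null : value;
}

const sampleProducts = [
  { title: 'Blue T-Shirt', price: '$14.99' },
  { title: 'Leather Laptop Bag', price: 'Price: $49.99' }
];

// Build rows with numeric prices, ready to be written to a database
const rows = sampleProducts.map(product => ({
  title: product.title,
  price: parsePrice(product.price)
}));

console.log(rows);
```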
Full Code Example
Here is the complete JavaScript code to scrape product data using OpenAI function calling:
// Extracted data function
function extractedData(products) {
  console.log('Extracted Product Data');
  products.forEach(product => {
    console.log(product.title);
    console.log(product.description);
    console.log(product.price);
    console.log('---');
  });
  return {
    status: 'saved'
  };
}
// Sample HTML
const html = `
<div class="products">
  <div class="product">
    <h3>Blue T-Shirt</h3>
    <p>A comfortable blue t-shirt made from 100% cotton.</p>
    <p>Price: $14.99</p>
  </div>
  <div class="product">
    <h3>Noise Cancelling Headphones</h3>
    <p>These wireless over-ear headphones provide active noise cancellation.</p>
    <p>Price: $199.99</p>
  </div>
  <div class="product">
    <h3>Leather Laptop Bag</h3>
    <p>Room enough for up to a 15" laptop. Made from genuine leather.</p>
    <p>Price: $49.99</p>
  </div>
</div>
`;
// Send HTML to OpenAI
const messages = [
  {role: 'user', content: html}
];
// Function schema
const functions = [
  {
    name: 'extractedData',
    description: 'Extract product data from HTML',
    parameters: {
      type: 'object',
      properties: {
        products: {
          type: 'array',
          items: {
            type: 'object',
            properties: {
              title: {
                type: 'string'
              },
              description: {
                type: 'string'
              },
              price: {
                type: 'string'
              }
            }
          }
        }
      },
      required: ['products']
    }
  }
];
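// OpenAI client (reads the API key from the environment instead of hard-coding it)
const { Configuration, OpenAIApi } = require('openai');
const configuration = new Configuration({
  apiKey: process.env.OPENAI_API_KEY
});
const openai = new OpenAIApi(configuration);
// Call API and hand the result to extractedData
async function main() {
  const response = await openai.createChatCompletion({
    model: 'gpt-3.5-turbo',
    messages,
    functions,
    function_call: { name: 'extractedData' }
  });
  // The arguments arrive as a JSON string matching the schema
  const call = response.data.choices[0].message.function_call;
  const { products } = JSON.parse(call.arguments);
  extractedData(products);
}
main().catch(console.error);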
Conclusion
Using OpenAI opens up an exciting new way to approach web scraping which wasn't possible before.
However, this approach also has some limitations: the model only extracts data from HTML you have already fetched, so it does nothing about IP blocks, user-agent checks or CAPTCHAs on the target site.
A more robust solution is to use a dedicated web scraping API like Proxies API.
With features like automatic IP rotation, user-agent rotation and CAPTCHA solving, Proxies API makes robust web scraping easy via a simple API:
curl "https://api.proxiesapi.com/?key=API_KEY&url=targetsite.com"
Get started now with 1000 free API calls to supercharge your web scraping!