Web Scraping using ChatGPT - Complete Guide with Examples

Sep 25, 2023 · 14 min read

Web scraping is the process of extracting data from websites automatically using code. ChatGPT has emerged as a powerful tool for web scraping due to its natural language capabilities. This article provides a comprehensive guide on web scraping using ChatGPT, covering the key concepts, approaches, examples and code snippets.

Introduction

ChatGPT has opened up new possibilities for automating web scraping without writing complex code. At its core, web scraping with ChatGPT involves:

  • Inspecting page elements to identify the data to be scraped
  • Providing detailed scraping instructions in natural language
  • Getting the code generated by ChatGPT to extract the required data
  • Validating and exporting the scraped data

The key benefit is that scrapers can be created with minimal coding knowledge. However, providing the right instructions is crucial for ChatGPT to generate accurate scrapers.

    This article covers various techniques for web scraping using ChatGPT with examples for scraping static and dynamic websites.

    Prerequisites

    Before diving into the steps, ensure the following:

  • Access to ChatGPT (premium plan recommended for unlimited queries)
  • Basic HTML, CSS knowledge to inspect elements
  • Python and libraries like Beautiful Soup, Selenium, and Pandas installed (see the install command after this list)
  • CSV viewer or Python/Jupyter Notebook to validate results
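
If any of these are missing, everything can be installed in one go (a sketch assuming pip and Python 3):

    pip install requests beautifulsoup4 selenium pandas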

Let's start simple

It's super easy for well-known websites. Let's take IMDb as an example. We just use this prompt:

    web scrape https://www.imdb.com/chart/moviemeter/ and get all the movie names using python and beautifulsoup

    ChatGPT responds with…

    import requests
    from bs4 import BeautifulSoup
    
    url = "https://www.imdb.com/chart/moviemeter/"
    
    response = requests.get(url)
    
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Finding all the title columns which have the movie names
        title_columns = soup.find_all('td', class_='titleColumn')
        
        movies = []
        for title_column in title_columns:
            # Extracting movie names from the title columns
            movie_name = title_column.a.get_text(strip=True)
            movies.append(movie_name)
            
        # Printing the list of movie names
        for i, movie in enumerate(movies, start=1):
            print(f"{i}. {movie}")
    else:
        print("Failed to retrieve the webpage. Status Code:", response.status_code)

    When you run this script, unfortunately it fails…

    Failed to retrieve the webpage. Status Code: 403

Now we are getting into the trenches. Let's modify the script a little bit. IMDb needs to see a browser signature before it will respond, so we add a User-Agent header. This is the new script:

    import requests
    from bs4 import BeautifulSoup
    
    url = "https://www.imdb.com/chart/moviemeter/"
    
    # Define headers to mimic a browser request
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }
    
    response = requests.get(url, headers=headers)
    
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        
        title_columns = soup.find_all('td', class_='titleColumn')
        
        movies = []
        for title_column in title_columns:
            movie_name = title_column.a.get_text(strip=True)
            movies.append(movie_name)
            
        for i, movie in enumerate(movies, start=1):
            print(f"{i}. {movie}")
    else:
        print("Failed to retrieve the webpage. Status Code:", response.status_code)

Now it downloads the webpage, but it doesn't extract any data. So we need to help ChatGPT some more…

    The titles are located in...<h3 class="ipc-title__text">Barbie</h3>

    It will change the code to this…

    import requests
    from bs4 import BeautifulSoup
    
    url = "https://www.imdb.com/chart/moviemeter/"
    
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }
    
    response = requests.get(url, headers=headers)
    
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Finding all the <h3> tags with class 'ipc-title__text' which have the movie names
        title_tags = soup.find_all('h3', class_='ipc-title__text')
        
        movies = []
        for title_tag in title_tags:
            # Extracting movie names from the <h3> tags
            movie_name = title_tag.get_text(strip=True)
            movies.append(movie_name)
            
        # Printing the list of movie names
        for i, movie in enumerate(movies, start=1):
            print(f"{i}. {movie}")
    else:
        print("Failed to retrieve the webpage. Status Code:", response.status_code)

This code works. When you run it, it gives us the results.
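
If you want to validate the results in a CSV viewer or a Jupyter Notebook, as suggested in the prerequisites, a small follow-up snippet (a sketch assuming the `movies` list produced by the script above) could be:

    import pandas as pd
    
    # `movies` is the list of titles produced by the scraper above;
    # placeholder values are shown here so the snippet runs on its own.
    movies = ["Example Movie 1", "Example Movie 2"]
    
    # Add a 1-based rank column and write to CSV for easy inspection
    df = pd.DataFrame({"Rank": range(1, len(movies) + 1), "Title": movies})
    df.to_csv("imdb_most_popular.csv", index=False)
    print(df.head())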

    Scraping Dynamic Websites

    Dynamic websites have content that changes without reloading the page, like infinite scroll, tabs, popups etc. Scraping them requires automated browsers like Selenium.

    Here is an example to scrape product data from an ecommerce site:

    Instructions

    Extract product titles and pricing from amazon.com using Python and Selenium
    
    Wait for 10 seconds for page load
    
    Locate elements with:
    - Tag: div
    - Class: product
    
    Within each element:
    - Find <h3> tag and extract text for title
    - Find <p> tag and extract text for price
    
    Click on 'Next Page' button to load more products
    
    Export scraped data to CSV file
    

    ChatGPT Generated Code

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    import csv
    
    driver = webdriver.Chrome()
    driver.get("https://amazon.com")
    
    driver.implicitly_wait(10)
    
    products = driver.find_elements(By.CSS_SELECTOR, "div.product")
    
    titles = []
    prices = []
    
    for product in products:
      title = product.find_element(By.TAG_NAME, "h3").text
      price = product.find_element(By.TAG_NAME, "p").text
    
      titles.append(title)
      prices.append(price)
    
    next_button = driver.find_element(By.ID, "next_page")
    next_button.click()
    
    # Extract data from next pages
    
    with open('output.csv', 'w', newline='') as file:
      writer = csv.writer(file)
      writer.writerow(['Title', 'Price'])
      writer.writerows(zip(titles, prices))
    

This code uses Selenium to load dynamic content, extract the data, and handle pagination (the generated script clicks the 'Next Page' button once; a sketch of a full pagination loop follows below). The data is exported to CSV for analysis. The following section explains how you can install Selenium if you don't have it on your system.
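
The generated script clicks 'Next Page' only once. A sketch of how the loop could be extended to walk through all pages (still assuming the hypothetical div.product, h3, p, and next_page selectors from the instructions above, with a placeholder URL) is:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.common.exceptions import NoSuchElementException
    import csv
    
    driver = webdriver.Chrome()
    driver.get("https://example-shop.com")  # placeholder URL
    driver.implicitly_wait(10)
    
    titles, prices = [], []
    
    while True:
        # Scrape every product card on the current page
        for product in driver.find_elements(By.CSS_SELECTOR, "div.product"):
            titles.append(product.find_element(By.TAG_NAME, "h3").text)
            prices.append(product.find_element(By.TAG_NAME, "p").text)
    
        # Move to the next page, or stop when there is no 'Next Page' button
        try:
            driver.find_element(By.ID, "next_page").click()
        except NoSuchElementException:
            break
    
    with open("output.csv", "w", newline="") as file:
        writer = csv.writer(file)
        writer.writerow(["Title", "Price"])
        writer.writerows(zip(titles, prices))
    
    driver.quit()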

    Installing Selenium

    To use Selenium for scraping dynamic websites, it needs to be installed along with a browser driver.

    Steps

    Install Selenium:

    pip install selenium
    

    Install browser driver:

    For Chrome:

    pip install chromedriver-autoinstaller
    

    For Firefox:

    pip install geckodriver-autoinstaller
    

    Import Selenium:

    from selenium import webdriver
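    # Note (an assumption, not shown in the original steps): if you installed
    # chromedriver-autoinstaller above, calling install() once downloads a
    # chromedriver that matches your Chrome version and puts it on the PATH:
    # import chromedriver_autoinstaller
    # chromedriver_autoinstaller.install()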
    
    driver = webdriver.Chrome() # For Chrome
    # driver = webdriver.Firefox() # For Firefox
    

    Pass driver to ChatGPT code:

    Provide the driver object to ChatGPT for dynamic scraping:

    Use the provided driver object for dynamic scraping:
    
    driver = webdriver.Chrome()
    

    ChatGPT will then generate code using this driver to scrape dynamic pages.

This covers the basics of setting up Selenium for web scraping with ChatGPT. The browser driver executes JavaScript and AJAX calls for you; steps like infinite scrolling still need explicit waits or scroll commands in the script, which ChatGPT can generate when you ask for them.
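
As an example of the kind of step you can ask ChatGPT to add, here is a small sketch using an explicit wait and a scroll (the URL and selector are placeholders):

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    driver = webdriver.Chrome()
    driver.get("https://example.com")  # placeholder URL
    
    # Wait up to 10 seconds for a specific element instead of sleeping blindly
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.product"))
    )
    
    # For infinite-scroll pages, scroll to the bottom to trigger loading of more items
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    
    driver.quit()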

    Key Takeaways

  • For dynamic sites, use Selenium and add steps like wait, click etc.
  • Provide element attributes like ID, class, tag name to locate data
  • Handle pagination or scrolling pages
  • Validate scraped data before exporting

Alternative Approach - Using ChatGPT “Advanced Data Analysis”

    An alternative approach provided by ChatGPT is using its code interpreter or Advanced Data Analysis. Here, instead of providing scraping instructions, the target page HTML can directly be uploaded.

    Suppose we want to scrape Amazon search results page…

    Just save the page as HTML on your disk.
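
You can do this straight from the browser (Ctrl+S / Cmd+S), or with a short script like the sketch below (the search URL is a placeholder, and Amazon may still block non-browser requests):

    import requests
    
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }
    url = "https://www.amazon.com/s?k=laptops"  # placeholder search URL
    
    response = requests.get(url, headers=headers)
    with open("amazon_search.html", "w", encoding="utf-8") as f:
        f.write(response.text)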

    Then just drag and drop it into the GPT4 - Advanced Data Analysis section.

    ChatGPT might need some help here in locating the title… We notice using Chrome’s Inspect element that the title is in the H2 tag. We paste the whole thing into ChatGPT

    Here is the prompt…

Extract product title and price from the attached page HTML using Python and BeautifulSoup. Export results to CSV. The title tag is in an H2 like this... <h2 class="a-size-mini a-spacing-none a-color-base s-line-clamp-2">

While it succeeded in getting the titles, you can see that it is struggling to get the prices, so we help ChatGPT out again…

    Success! ChatGPT gives us the data…

But let's not stop there. Let's ask it for the code that made the data possible. Use this prompt:

    give me the code that makes this possible

    ChatGPT should gladly return the code like this…

    import re
    import csv
    from bs4 import BeautifulSoup
    
    # Define the path to the HTML file
    file_path = 'path_to_your_html_file.html'
    
    # Read the HTML file and create a BeautifulSoup object
    with open(file_path, 'r', encoding='utf-8') as file:
        soup = BeautifulSoup(file.read(), 'html.parser')
    
    # Function to extract price from a specific section of the HTML
    def extract_price(section):
        # Look for span tags with class 'a-price' which usually contains the price
        price_tag = section.find('span', class_='a-price')
        # If a price tag is found, extract the price value from it
        if price_tag:
            # Extract the whole and fraction parts of the price and concatenate them
            whole_part_tag = price_tag.find('span', class_='a-price-whole')
            fraction_part_tag = price_tag.find('span', class_='a-price-fraction')
            if whole_part_tag and fraction_part_tag:
                # Concatenate the whole part, decimal point, and fraction part to form the full price
                whole_part = whole_part_tag.get_text(strip=True).replace(',', '')  # Remove commas in whole part
                fraction_part = fraction_part_tag.get_text(strip=True)
                return f"${whole_part}.{fraction_part}"
        return None  # Return None if no price is found
    
    # List to hold the extracted product title and price pairs
    products = []
    
    # Loop through each product title tag to extract title and price pairs
    for title_tag in soup.find_all('h2', class_='a-size-mini a-spacing-none a-color-base s-line-clamp-2'):
        title = title_tag.get_text(strip=True)  # Extract the product title
        # Get the parent section of the title tag which likely contains the price tag as well
        section = title_tag.find_parent('div')
        if section:  # Check if a parent div is found
            price = extract_price(section)  # Extract the price from the section
            products.append({'Title': title, 'Price': price})
    
    # Define the path to the CSV file
    csv_file_path = 'path_where_you_want_to_save_csv_file.csv'
    
    # Write the extracted data to the CSV file
    with open(csv_file_path, 'w', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=['Title', 'Price'])
        writer.writeheader()
        writer.writerows(products)
    
    print(f"Data has been written to {csv_file_path}")

    When to use ChatGPT Code Interpreter?

    Some benefits of using the code interpreter are:

  • HTML analysis instead of manual element inspection
  • Handles page layout and structure automatically
  • Can scrape multiple pages more easily

Use this approach when the target site has:

  • Complex or nested HTML structure
  • Element attributes that are difficult to identify manually
  • Pagination that needs automated handling

However, the instruction-based approach described earlier provides more control over the scraping logic.

    Full Code Examples

    Here are the full code samples created above for reference:

    Static Site Scraping

    import requests
    from bs4 import BeautifulSoup
    
    url = "https://www.imdb.com/chart/moviemeter/"
    
    # Define headers to mimic a browser request
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }
    
    response = requests.get(url, headers=headers)
    
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Finding all the <h3> tags with class 'ipc-title__text' which hold the movie names
        title_tags = soup.find_all('h3', class_='ipc-title__text')
        
        movies = []
        for title_tag in title_tags:
            movie_name = title_tag.get_text(strip=True)
            movies.append(movie_name)
            
        for i, movie in enumerate(movies, start=1):
            print(f"{i}. {movie}")
    else:
        print("Failed to retrieve the webpage. Status Code:", response.status_code)
    

    Dynamic Site Scraping

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    import csv
    
    driver = webdriver.Chrome()
    driver.get("<https://ecomsite.com>")
    
    driver.implicitly_wait(10)
    
    products = driver.find_elements(By.CSS_SELECTOR, "div.product")
    
    titles = []
    prices = []
    
    for product in products:
      title = product.find_element(By.TAG_NAME, "h3").text
      price = product.find_element(By.TAG_NAME, "p").text
    
      titles.append(title)
      prices.append(price)
    
    next_button = driver.find_element(By.ID, "next_page")
    next_button.click()
    
    with open('output.csv', 'w', newline='') as file:
      writer = csv.writer(file)
      writer.writerow(['Title', 'Price'])
      writer.writerows(zip(titles, prices))
    

    Code Interpreter Scraping

    import re
    import csv
    from bs4 import BeautifulSoup
    
    # Define the path to the HTML file
    file_path = 'path_to_your_html_file.html'
    
    # Read the HTML file and create a BeautifulSoup object
    with open(file_path, 'r', encoding='utf-8') as file:
        soup = BeautifulSoup(file.read(), 'html.parser')
    
    # Function to extract price from a specific section of the HTML
    def extract_price(section):
        # Look for span tags with class 'a-price' which usually contains the price
        price_tag = section.find('span', class_='a-price')
        # If a price tag is found, extract the price value from it
        if price_tag:
            # Extract the whole and fraction parts of the price and concatenate them
            whole_part_tag = price_tag.find('span', class_='a-price-whole')
            fraction_part_tag = price_tag.find('span', class_='a-price-fraction')
            if whole_part_tag and fraction_part_tag:
                # Concatenate the whole part, decimal point, and fraction part to form the full price
                whole_part = whole_part_tag.get_text(strip=True).replace(',', '')  # Remove commas in whole part
                fraction_part = fraction_part_tag.get_text(strip=True)
                return f"${whole_part}.{fraction_part}"
        return None  # Return None if no price is found
    
    # List to hold the extracted product title and price pairs
    products = []
    
    # Loop through each product title tag to extract title and price pairs
    for title_tag in soup.find_all('h2', class_='a-size-mini a-spacing-none a-color-base s-line-clamp-2'):
        title = title_tag.get_text(strip=True)  # Extract the product title
        # Get the parent section of the title tag which likely contains the price tag as well
        section = title_tag.find_parent('div')
        if section:  # Check if a parent div is found
            price = extract_price(section)  # Extract the price from the section
            products.append({'Title': title, 'Price': price})
    
    # Define the path to the CSV file
    csv_file_path = 'path_where_you_want_to_save_csv_file.csv'
    
    # Write the extracted data to the CSV file
    with open(csv_file_path, 'w', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=['Title', 'Price'])
        writer.writeheader()
        writer.writerows(products)
    
    print(f"Data has been written to {csv_file_path}")
    

    Questions you may have

    Does ChatGPT allow web scraping?

Yes. It is permitted as far as ChatGPT's terms are concerned.

How do you get ChatGPT to read a website?

You can use the Advanced Data Analysis tool: upload the HTML of the webpage, then ask ChatGPT to scrape it using the techniques described above.

    Where does ChatGPT scrape data from?

    It can’t scrape the internet on its own at the moment.

    Can ChatGPT read a web page?

It can't browse the internet at the moment, but you can use the Advanced Data Analysis tool to upload the HTML of the webpage and then ask it to scrape it using the techniques described above.

    Conclusion

    This article provided a comprehensive overview of web scraping using ChatGPT with various examples and code samples. The key takeaways are:

  • Inspect elements to identify data to be extracted
  • Provide detailed natural language instructions
  • Leverage ChatGPT's AI for generating accurate scrapers
  • Validate and refine the results through iterative prompting
  • Consider tools like code interpreter or visual scrapers based on use case

Web scraping can be automated without complex coding by strategically tapping into ChatGPT's capabilities. With the right techniques, it is possible to extract data from virtually any website.

    ChatGPT heralds an exciting new era in intelligent automation!

    However, this approach also has some limitations:

  • The scraped code needs to handle CAPTCHAs, IP blocks and other anti-scraping measures
  • Running the scrapers on your own infrastructure can lead to IP blocks
  • Dynamic content needs specialized handling

A more robust solution is using a dedicated web scraping API like Proxies API.

    With Proxies API, you get:

  • Millions of proxy IPs for rotation to avoid blocks
  • Automatic handling of CAPTCHAs, IP blocks
  • Rendering of Javascript-heavy sites
  • Simple API access without needing to run scrapers yourself

With features like automatic IP rotation, user-agent rotation, and CAPTCHA solving, Proxies API makes robust web scraping easy via a simple API:

    curl "https://api.proxiesapi.com/?key=API_KEY&url=targetsite.com"
    

    Get started now with 1000 free API calls to supercharge your web scraping!
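
For reference, the curl call above translates to Python like this (the key and target URL are placeholders):

    import requests
    
    # Placeholders: substitute your real API key and target URL
    params = {
        "key": "API_KEY",
        "url": "https://targetsite.com",
    }
    
    response = requests.get("https://api.proxiesapi.com/", params=params)
    print(response.status_code)
    print(response.text[:500])  # first 500 characters of the returned HTML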
