Web Scraping into Excel using ChatGPT

Sep 25, 2023 · 4 min read

Web scraping allows you to extract data from websites and save it in a structured format like Excel. With ChatGPT, you can generate Python code to scrape websites without any prior coding knowledge. In this article, we'll see how to use ChatGPT to scrape a book website into an Excel sheet.

Overview

Here's a quick overview of the process we'll cover:

  • Copy the target website URL
  • Generate Python scraping code with ChatGPT
  • Run the code to extract data
  • Format and output data to an Excel sheet

    Generate Scraping Code with ChatGPT

    To start, copy the URL of the website you want to scrape. For this example, we'll use a books website.

    Next, go to ChatGPT and enter this prompt:

    Generate Python code to scrape the title, link and price of all books from this URL into variables: [paste URL here]
    

    ChatGPT will provide Python code to scrape the requested data from the site. It will look something like this:

    import requests
    from bs4 import BeautifulSoup
    
    url = '[paste URL here]'
    
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    titles = []
    links = []
    prices = []
    
    for book in soup.find_all('div', class_='book'):
        title = book.h2.text
        link = book.a['href']
        price = book.find('span', class_='price').text
    
        titles.append(title)
        links.append(link)
        prices.append(price)
    

    This code uses the Requests library to download the webpage content, then BeautifulSoup to parse the HTML and extract the data we want into lists.
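
To see the parsing step in isolation, here is a minimal sketch that runs the same BeautifulSoup logic on an inline HTML string instead of a downloaded page. The markup in SAMPLE_HTML and the helper name extract_books are assumptions made for illustration; a real site will use different tags and class names, so the selectors need to be adjusted to match.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page; real sites will differ.
SAMPLE_HTML = """
<div class="book">
  <h2>Example Book One</h2>
  <a href="/books/1">Details</a>
  <span class="price">$10.00</span>
</div>
<div class="book">
  <h2>Example Book Two</h2>
  <a href="/books/2">Details</a>
  <span class="price">$12.50</span>
</div>
"""


def extract_books(html):
    """Return (titles, links, prices) lists parsed from the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    titles, links, prices = [], [], []
    for book in soup.find_all("div", class_="book"):
        titles.append(book.h2.text.strip())    # text of the first <h2>
        links.append(book.a["href"])           # href of the first <a>
        prices.append(book.find("span", class_="price").text.strip())
    return titles, links, prices


titles, links, prices = extract_books(SAMPLE_HTML)
print(titles)
print(links)
print(prices)
```

Testing the selectors on a saved or inline snippet like this makes it easier to debug them before sending requests to the live site.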

    Run the Code to Extract Data

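    Copy the ChatGPT-generated code into a Python file, install the two libraries it imports (pip install requests beautifulsoup4), and run it. The script scrapes the website and stores the extracted data in the titles, links and prices lists. It does not print anything on its own, so add print() calls if you want to inspect the results in the terminal.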

    You can modify the code as needed - for example, to extract additional data points or iterate through paginated content.
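
As a rough sketch of the pagination case, the function below follows "next" links from page to page and pauses between requests. It assumes the same hypothetical div.book markup as the generated code above, plus a link with class "next" pointing to the following page; both are assumptions, so check the real page's HTML. The names scrape_all_pages and fetch_html are also made up for this example.

```python
import time
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def scrape_all_pages(start_url, fetch_html, delay_seconds=1.0, max_pages=50):
    """Follow "next" links from start_url and collect book titles.

    fetch_html is any callable that takes a URL and returns the page HTML.
    """
    titles = []
    url = start_url
    pages_seen = 0
    while url and pages_seen < max_pages:
        soup = BeautifulSoup(fetch_html(url), "html.parser")
        for book in soup.find_all("div", class_="book"):
            titles.append(book.h2.text.strip())
        pages_seen += 1

        # Assumed pagination markup: <a class="next" href="...">
        next_link = soup.find("a", class_="next")
        url = urljoin(url, next_link["href"]) if next_link else None
        if url:
            time.sleep(delay_seconds)  # pause before requesting the next page
    return titles
```

With Requests, a call could look like scrape_all_pages(url, lambda u: requests.get(u, timeout=10).text). Passing the fetch function in as an argument also lets you try the loop on saved HTML before pointing it at a live site.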

    Format and Output Data to Excel

    To get the scraped data into an Excel sheet, modify the Python script to:

    1. Import the Pandas library
    2. Create a Pandas DataFrame from the extracted data lists
    3. Use the to_excel() method to export the DataFrame to an Excel file

    Here is how the script would look:

    # Imports
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    
    # Scraping code from the previous step goes here (fills titles, links, prices)
    
    # Create DataFrame
    df = pd.DataFrame({'Title': titles, 'Link': links, 'Price': prices})
    
    # Export to Excel
    df.to_excel('books.xlsx', index=False)
    

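    Now when you run the script, it generates an Excel file named books.xlsx containing the scraped data. Writing .xlsx files with to_excel() requires an Excel engine such as openpyxl to be installed alongside pandas.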

    Tips for Web Scraping with ChatGPT

  • Use precise prompts to get good scraping code from ChatGPT
  • Review and tweak the code as needed for your use case
  • Iterate through pages to scrape entire websites
  • Avoid scraping too aggressively to prevent getting blocked
  • Output data to formats like JSON or CSV for additional processing
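
For the last tip, here is a small sketch that writes the same three lists to CSV and JSON using only the standard library. The helper names rows_from_lists, write_csv and write_json are invented for this example, and the column names simply mirror the DataFrame used earlier.

```python
import csv
import json

FIELDS = ["Title", "Link", "Price"]


def rows_from_lists(titles, links, prices):
    """Combine the three parallel lists into one dict per book."""
    return [
        {"Title": title, "Link": link, "Price": price}
        for title, link, price in zip(titles, links, prices)
    ]


def write_csv(rows, path):
    """Write the rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def write_json(rows, path):
    """Write the rows to a JSON file as a list of objects."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)
```

After scraping, you would call rows = rows_from_lists(titles, links, prices) and then write_csv(rows, 'books.csv') or write_json(rows, 'books.json').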

    Full Python Code for Scraping Books Website

    Here is the full ChatGPT-generated Python code to scrape a books website into an Excel sheet:

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    
    # Trailing slash matters: the relative book links are appended to this URL below
    url = 'https://books.toscrape.com/'
    
    response = requests.get(url)
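    # Pass the raw bytes so BeautifulSoup detects the page encoding itself;
    # this keeps the pound sign in the prices from being garbled
    soup = BeautifulSoup(response.content, 'html.parser')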
    
    titles = []
    prices = []
    links = []
    
    for book in soup.find_all('article', class_='product_pod'):
    
        # Get title
        title = book.find('h3').find('a')['title']
        titles.append(title)
    
        # Get price
        price = book.find(class_='price_color').get_text()
        prices.append(price)
    
        # Get link
        link = book.find('h3').find('a')['href']
        links.append(url + link)
    
    # Create dataframe and export to Excel
    df = pd.DataFrame({'Title': titles, 'Price': prices, 'Link': links})
    df.to_excel('books.xlsx', index=False)
    

    This script scrapes the book title, price, and link from each book on the homepage. It stores the data in lists then exports to an Excel file.
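
The prices arrive as text with a currency symbol (for example '£51.77'), so Excel treats that column as text rather than numbers. If you want to sort or sum the prices, one option is to convert them before building the DataFrame. The helper below is a sketch, not part of the generated script; the name parse_price is made up, and it assumes prices use a dot as the decimal separator and an optional comma for thousands.

```python
import re


def parse_price(text):
    """Return the numeric part of a price string such as '£51.77' as a float."""
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return float(match.group())


print(parse_price("£51.77"))
```

In the full script, this would mean replacing prices.append(price) with prices.append(parse_price(price)).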

    So that's how you can use ChatGPT to generate web scraping code and output the data to Excel without coding experience.

    ChatGPT heralds an exciting new era in intelligent automation!

    However, this approach also has some limitations:

  • The scraping code needs to handle CAPTCHAs, IP blocks and other anti-scraping measures
  • Running the scrapers on your own infrastructure can lead to IP blocks
  • Dynamic content needs specialized handling

    A more robust solution is to use a dedicated web scraping API like Proxies API.

    With Proxies API, you get:

  • Millions of proxy IPs for rotation to avoid blocks
  • Automatic handling of CAPTCHAs and IP blocks
  • Rendering of JavaScript-heavy sites
  • Simple API access without needing to run scrapers yourself

    With features like automatic IP rotation, user-agent rotation and CAPTCHA solving, Proxies API makes robust web scraping easy via a simple API:

    curl "https://api.proxiesapi.com/?key=API_KEY&url=targetsite.com"
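
The same request can be sketched in Python. The endpoint and the key and url query parameters are taken from the curl example above; the function names build_request_url and fetch_via_proxy are hypothetical, and the sketch uses only the standard library. It also percent-encodes the target URL, which is an assumption about what the service expects, so check the API's own documentation before relying on it.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

API_ENDPOINT = "https://api.proxiesapi.com/"  # taken from the curl example


def build_request_url(api_key, target_url):
    """Return the API URL with the key and the percent-encoded target URL."""
    query = urlencode({"key": api_key, "url": target_url})
    return f"{API_ENDPOINT}?{query}"


def fetch_via_proxy(api_key, target_url, timeout=30):
    """Fetch target_url through the API and return the response body as text."""
    request_url = build_request_url(api_key, target_url)
    with urlopen(request_url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


print(build_request_url("API_KEY", "https://example.com"))
```

The HTML returned by fetch_via_proxy can then be passed to BeautifulSoup in place of the direct requests.get() response.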
    

    Get started now with 1000 free API calls to supercharge your web scraping!
