Scraping Craigslist Listings with Python

Oct 1, 2023 · 4 min read

This article will explain how to scrape Craigslist apartment listings using Python and BeautifulSoup. We will go through each line of code to understand what it is doing.

First we import the requests and BeautifulSoup modules:

import requests
from bs4 import BeautifulSoup

Requests allows us to make HTTP requests to web pages. BeautifulSoup helps parse and navigate HTML and XML documents.

Next we set the URL to scrape - in this case Craigslist San Francisco apartment listings:

url = 'https://sfbay.craigslist.org/search/apa'

We make a GET request to fetch the page content:

response = requests.get(url)
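
A bare requests.get(url) has no timeout and does not raise an error on a 4xx or 5xx response, so a failed fetch can go unnoticed. Here is a minimal sketch of a more defensive fetch. The fetch_html name, the 10-second timeout, and the User-Agent string are assumptions made for illustration and are not part of the original script.

```python
import requests

# Hypothetical User-Agent string, for illustration only.
USER_AGENT = "Mozilla/5.0 (compatible; listing-tutorial/0.1)"


def fetch_html(url, timeout=10):
    """Fetch a page and return its HTML, raising on network or HTTP errors."""
    headers = {"User-Agent": USER_AGENT}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Turn 4xx/5xx responses into exceptions instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

You could then call fetch_html(url) in place of requests.get(url).text and wrap it in a try/except for requests.exceptions.RequestException.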

Optionally, we can save the HTML content to a file for inspection:

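# Write with an explicit encoding so non-ASCII characters in listings do not fail on some platforms
with open('craigslist.html', 'w', encoding='utf-8') as f:
    f.write(response.text)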
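
Once the page is saved, you can work on the parsing code without fetching the page again on every run. Here is a small sketch, assuming the file was written as UTF-8. The load_saved_page helper is a hypothetical name and not part of the original script.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def load_saved_page(path="craigslist.html"):
    """Parse a previously saved HTML file instead of fetching the page again."""
    html = Path(path).read_text(encoding="utf-8")
    return BeautifulSoup(html, "html.parser")
```

Calling load_saved_page() returns a BeautifulSoup object that behaves like one built from response.text.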

Now we can parse the page with BeautifulSoup. We pass in the response text and specify 'html.parser' to parse as HTML:

soup = BeautifulSoup(response.text, 'html.parser')

If you inspect the source of a Craigslist search results page, you can see that the markup for each listing looks something like this:

<li class="cl-static-search-result" title="Situated in Sunnyvale!, Recycling Center, 1/BD">
    <a href="https://sfbay.craigslist.org/sby/apa/d/santa-clara-situated-in-sunnyvale/7666802370.html">
        <div class="title">Situated in Sunnyvale!, Recycling Center, 1/BD</div>

        <div class="details">
            <div class="price">$2,150</div>
            <div class="location">
                sunnyvale
            </div>
        </div>
    </a>
</li>

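Each listing is encapsulated in an li element with the cl-static-search-result class. Inside it, the title, price, and location each sit in a div with the matching class, and the link is the href of the a tag. Those are the elements we need to get all the data.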

Craigslist organizes listings in <li> tags with class "cl-static-search-result". We find all of them:

    listings = soup.find_all('li', class_='cl-static-search-result')
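
If you prefer CSS selectors, BeautifulSoup's select method should return the same elements as the find_all call. The sketch below uses a made-up three-item list, not real Craigslist markup, to show the two calls side by side.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; only the class name comes from the article.
html = """
<ul>
  <li class="cl-static-search-result">first</li>
  <li class="cl-static-search-result">second</li>
  <li class="other">ignored</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Keyword-argument style, as used in this article.
by_find_all = soup.find_all("li", class_="cl-static-search-result")

# Equivalent CSS selector style.
by_select = soup.select("li.cl-static-search-result")

print(len(by_find_all), len(by_select))
```

Both calls skip the li with the "other" class, so either style works for the rest of the script.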
    

    We loop through each listing and extract the info we want - title, price, location, and link:

    for listing in listings:
        a_tag = listing.find('a')
        link = a_tag['href']
    
        title = listing.find('div', class_='title')
        price = listing.find('div', class_='price')
        location = listing.find('div', class_='location')
    
        print(title.text, price.text, location.text, link)
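
Note that find returns None when an element is missing, so title.text or price.text raises an AttributeError on any listing that lacks that div. Here is a sketch of a more forgiving version of the loop, run against the sample markup shown earlier. The text_or_none and parse_listings helpers are hypothetical names introduced for illustration.

```python
from bs4 import BeautifulSoup

# The sample listing from this article, used so the example runs offline.
SAMPLE_HTML = """
<li class="cl-static-search-result" title="Situated in Sunnyvale!, Recycling Center, 1/BD">
  <a href="https://sfbay.craigslist.org/sby/apa/d/santa-clara-situated-in-sunnyvale/7666802370.html">
    <div class="title">Situated in Sunnyvale!, Recycling Center, 1/BD</div>
    <div class="details">
      <div class="price">$2,150</div>
      <div class="location">
        sunnyvale
      </div>
    </div>
  </a>
</li>
"""


def text_or_none(parent, css_class):
    """Return the stripped text of the first matching div, or None if it is absent."""
    tag = parent.find("div", class_=css_class)
    return tag.get_text(strip=True) if tag else None


def parse_listings(html):
    """Extract title, price, location, and link from each listing in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for listing in soup.find_all("li", class_="cl-static-search-result"):
        a_tag = listing.find("a")
        rows.append({
            "title": text_or_none(listing, "title"),
            "price": text_or_none(listing, "price"),
            "location": text_or_none(listing, "location"),
            "link": a_tag.get("href") if a_tag else None,
        })
    return rows


for row in parse_listings(SAMPLE_HTML):
    print(row["title"], row["price"], row["location"], row["link"])
```

Using get_text(strip=True) also removes the newlines and indentation that surround the location text in the raw markup.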
    

    The full code is:

    import requests
    from bs4 import BeautifulSoup
    
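    url = 'https://sfbay.craigslist.org/search/apa'
    
    response = requests.get(url)
    
    # Write with an explicit encoding so non-ASCII characters in listings do not fail on some platforms
    with open('craigslist.html', 'w', encoding='utf-8') as f:
        f.write(response.text)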
    
    soup = BeautifulSoup(response.text, 'html.parser')
    
    listings = soup.find_all('li', class_='cl-static-search-result')
    
    for listing in listings:
        a_tag = listing.find('a')
        link = a_tag['href']
    
        title = listing.find('div', class_='title')
        price = listing.find('div', class_='price')
        location = listing.find('div', class_='location')
    
        print(title.text, price.text, location.text, link)
    

    This walks through the code to scrape Craigslist apartment listings and extract key information from each listing.
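
The extracted price is a string such as "$2,150". If you want to sort or filter listings by price, a small conversion step helps. The sketch below assumes prices are whole-dollar amounts, as in the sample listing. The parse_price helper is a hypothetical addition and is not part of the script above.

```python
import re


def parse_price(price_text):
    """Convert a price string such as '$2,150' to an int, or None if it has no digits.

    Assumes whole-dollar prices; a value with cents would need different handling.
    """
    digits = re.sub(r"[^\d]", "", price_text or "")
    return int(digits) if digits else None


print(parse_price("$2,150"))
```

Returning None for empty or missing text keeps the conversion safe for listings that have no price.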

    This is great as a learning exercise, but a scraper like this sends every request from a single IP address and is prone to getting blocked. If you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

    Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • With millions of high speed rotating proxies located all over the world,
  • With our automatic IP rotation,
  • With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions),
  • With our automatic CAPTCHA solving technology,

    hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed by a simple API like below in any programming language.

    In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes. You can just get the data and parse it in any language like Node or PHP, or with any framework like Scrapy or Nutch. In all these cases you can just call the URL with render support like so:

    curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"
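
To stay in Python, the same request URL can be assembled with the standard library. The endpoint and the key, render, and url parameters are taken from the curl example above. The build_proxy_url helper is a hypothetical name, and the sketch assumes the service accepts a percent-encoded url parameter, as query strings normally are.

```python
from urllib.parse import urlencode

# Endpoint taken from the curl example above.
API_ENDPOINT = "http://api.proxiesapi.com/"


def build_proxy_url(target_url, api_key, render=True):
    """Build the API request URL for a target page (hypothetical helper)."""
    params = [("key", api_key)]
    if render:
        params.append(("render", "true"))
    # urlencode percent-encodes the target URL so its own ':' and '/' are safe.
    params.append(("url", target_url))
    return API_ENDPOINT + "?" + urlencode(params)


# API_KEY is a placeholder, as in the curl example.
proxied_url = build_proxy_url("https://sfbay.craigslist.org/search/apa", "API_KEY")
print(proxied_url)
```

The resulting string can be passed to requests.get in place of the original url, leaving the parsing code unchanged.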
    
    

    We have a running offer of 1000 API calls completely free. Register and get your free API Key.
