Stories from the web crawling and scraping trenches

The Complete Puppeteer Cheatsheet

Author: Mohan Ganesan

Date: Dec 6, 2023

Puppeteer is a Node.js library for automating UI testing, scraping, and screenshot testing using headless Chrome.

How to Tell if a Website is Scrapable

Author: Mohan Ganesan

Date: Feb 20, 2024

Determine whether a website can be scraped by checking its robots.txt file, analyzing the page source, checking for CAPTCHAs, and test-scraping a single page.
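
The robots.txt check can be sketched with Python's standard-library urllib.robotparser. This is a minimal illustration, not the article's own code: the robots.txt content, the "my-scraper" user agent, and the example.com URLs are all made up, and the rules are inlined so the sketch runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "my-scraper" and the example.com URLs are placeholders for illustration.
public_ok = is_allowed(ROBOTS_TXT, "my-scraper", "https://example.com/products/1")
private_ok = is_allowed(ROBOTS_TXT, "my-scraper", "https://example.com/private/data")
print(public_ok, private_ok)
```

Against a live site, the same parser can load the file itself through set_url() and read() instead of parse().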

How to Find Free Proxies & Rotate Them with Python

Author: Mohan Ganesan

Date: Oct 4, 2023

Use proxies in Python web scraping to rotate IP addresses and avoid getting blocked.
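
Proxy rotation itself is a small piece of logic. The sketch below cycles through a proxy list round-robin and yields a mapping in the shape that the requests library accepts for its proxies argument. The addresses are placeholders from a documentation-only IP range, and the function name is invented for this example.

```python
from itertools import cycle

# Placeholder proxy addresses (documentation-only IP range), not real proxies.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]


def rotating_proxies(proxies):
    """Yield one proxy mapping per request, looping over the list forever."""
    for proxy in cycle(proxies):
        yield {"http": proxy, "https": proxy}


rotation = rotating_proxies(PROXIES)
# Take four in a row to show the rotation wrapping back to the first proxy.
first_four = [next(rotation)["http"] for _ in range(4)]
print(first_four)
```

Each mapping would be passed along with one outgoing request, so consecutive requests leave from different IP addresses.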

The Ultimate Nokogiri Cheat Sheet for Ruby

Author: Mohan Ganesan

Date: Oct 31, 2023

Nokogiri is a powerful HTML/XML parsing and scraping library for Ruby. This cheat sheet covers its extensive capabilities.

The Ultimate DOMDocument Cheat Sheet for PHP

Author: Mohan Ganesan

Date: Oct 31, 2023

DOMDocument allows manipulating HTML/XML documents in PHP. This cheat sheet is a comprehensive reference for working with DOMDocument.

Downloading Images from a Website with JavaScript and cheerio

Author: Mohan Ganesan

Date: Oct 15, 2023

Learn how to use JavaScript and the cheerio library to download all the images from a Wikipedia page and extract data about dog breeds listed on the page.

What is the difference between Python ElementTree and BeautifulSoup?

Author: Mohan Ganesan

Date: Feb 5, 2024

ElementTree is best for working with valid XML documents, while BeautifulSoup is designed for parsing potentially malformed real-world HTML.
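
The difference shows up on a single malformed snippet. In the sketch below, ElementTree rejects markup with unclosed tags, while a lenient HTML parser keeps going. Python's built-in html.parser stands in for BeautifulSoup here so the example needs no third-party install; the sample markup and the helper names are invented for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Typical real-world HTML: the <li> tags are never closed.
MALFORMED = "<ul><li>one<li>two</ul>"


def parses_as_xml(markup: str) -> bool:
    """Return True only if ElementTree accepts the markup as well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Lenient parser that records every start tag it encounters."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(MALFORMED)

print(parses_as_xml(MALFORMED))  # strict XML parsing fails on the unclosed tags
print(collector.tags)            # the lenient parser still sees all three tags
```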

Scraping eBay Listings with Python and BeautifulSoup in 2023

Author: Mohan Ganesan

Date: Oct 5, 2023

eBay is a large online marketplace. This tutorial shows how to scrape and extract data from eBay listings using Python and BeautifulSoup.

Dealing with 403 Forbidden Errors in BeautifulSoup

Author: Mohan Ganesan

Date: Oct 6, 2023

Ways to handle and bypass 403 Forbidden errors in web scraping: checking error codes, using user agents, authenticating with login credentials, waiting and retrying, using proxies.

Making the Most of Proxies in aiohttp for Python

Author: Mohan Ganesan

Date: Feb 22, 2024

Learn how to use proxies with the aiohttp library in Python for privacy, geographic access, load balancing, and scraping.

Does Instagram allow scraping?

Author: Mohan Ganesan

Date: Feb 20, 2024

Instagram's terms allow limited scraping for non-commercial personal use. Best practices to avoid blocks include scraping slowly, varying user agents, avoiding logging in, and using proxies. Commercial scraping alternatives include the Instagram API and data resellers.

Scraping eBay Listings with Java and JSoup in 2023

Author: Mohan Ganesan

Date: Oct 5, 2023

eBay is a large online marketplace. This tutorial explains how to scrape and extract data from eBay listings using Java and the JSoup library.

What are the limitations of BeautifulSoup?

Author: Mohan Ganesan

Date: Feb 5, 2024

BeautifulSoup is a Python library for parsing and extracting data from HTML and XML documents. It struggles with modern JavaScript sites and cannot bypass most bot protections. CSS selectors and navigation logic can get complex. Consider alternatives like Scrapy, Puppeteer, or Playwright for professional web scraping.

How many tweets can you scrape?

Author: Mohan Ganesan

Date: Feb 20, 2024

Twitter provides a useful public API for accessing Tweets, but it does have rate limits in place to prevent abuse. Here are some key factors to consider for optimizing your data collection and respecting user privacy.

How does Amazon detect scraping?

Author: Mohan Ganesan

Date: Feb 20, 2024

Amazon strictly prohibits scraping their site. Use proxies, randomize delays, limit volume, and scrape selectively to avoid detection. Python code provided.

Scraping Reddit Posts in Node.js

Author: Mohan Ganesan

Date: Jan 9, 2024

Guide to scraping image URLs from a Reddit page using Node.js, focusing on identifying and extracting post blocks with images and metadata.

Dodging CAPTCHAs with Python for Web Scraping

Author: Mohan Ganesan

Date: Oct 4, 2023

CAPTCHAs are a major annoyance when scraping the web. This article explains how to automatically solve CAPTCHAs using Python libraries and services like 2Captcha and Proxies API.

Scraping Craigslist Listings with CSharp

Author: Mohan Ganesan

Date: Oct 1, 2023

Learn how to scrape Craigslist apartment listings using C# and HtmlAgilityPack. Avoid IP blocking with a rotating proxy server.

A Guide to BeautifulSoup's CSS Selector Capabilities

Author: Mohan Ganesan

Date: Oct 6, 2023

The BeautifulSoup library supports searching and extracting elements from HTML and XML documents using CSS selectors, making it a powerful tool for web scraping.

How To Use BeautifulSoup's find_all() Method

Author: Mohan Ganesan

Date: Oct 6, 2023

The find_all() method in BeautifulSoup finds all tags or strings in an HTML/XML document that match the given criteria and returns them as a list. It can search by string, regex, or function. It can also search within a specific tag and filter matches by attribute values. Mastering find_all() is key to effective web scraping with BeautifulSoup.

ParseHub Alternative - Simplify Web Scraping with ProxiesAPI

Author: Mohan Ganesan

Date: Sep 30, 2023

ParseHub is a visual web scraper with complex configuration and slow scraping speed. ProxiesAPI simplifies scraping with one API call, providing proxy rotation, browser identities, CAPTCHA solving, and JavaScript rendering.

Scraping Reddit Posts in Perl

Author: Mohan Ganesan

Date: Jan 9, 2024

Scrape Reddit with Perl: fetch pages with a UserAgent and parse the HTML to extract information from posts.

Scraping Craigslist Listings with Go

Author: Mohan Ganesan

Date: Oct 1, 2023

Learn how to scrape Craigslist apartment listings using Go and goquery. Avoid IP blocking with a rotating proxy server.

How do websites detect web scraping?

Author: Mohan Ganesan

Date: Feb 20, 2024

Websites use detection methods like traffic patterns, browser fingerprints, cookies, and user agents to catch scrapers. Tips to avoid detection include slowing down requests, rotating IPs, using real browser user agents, and maintaining sessions/cookies.
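
Slowing down requests usually means adding a randomized pause between them so the traffic does not arrive on a fixed beat. A minimal sketch, assuming a 2-second base delay with up to 1.5 seconds of jitter; both values and the function name are arbitrary choices for illustration, not recommendations from the article.

```python
import random


def delay_schedule(n_requests, base=2.0, jitter=1.5, seed=None):
    """Return one randomized pause (in seconds) per planned request."""
    rng = random.Random(seed)
    return [base + rng.uniform(0, jitter) for _ in range(n_requests)]


# Plan the pauses for five requests; a fixed seed makes the run repeatable.
delays = delay_schedule(5, seed=42)
for i, pause in enumerate(delays, start=1):
    # A real scraper would call time.sleep(pause) here before each request.
    print(f"request {i}: wait {pause:.2f}s")
```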

Scraping Yelp Business Listings in Kotlin

Author: Mohan Ganesan

Date: Dec 6, 2023

Extract key data points from Yelp business listings in San Francisco using Kotlin.

Scraping eBay Listings with C++ and libcurl in 2023

Author: Mohan Ganesan

Date: Oct 5, 2023

Scrape and extract key data from eBay listings using C++ and the libcurl library.

Scraping eBay Listings in Rust in 2023

Author: Mohan Ganesan

Date: Oct 5, 2023

Learn how to scrape and extract data from eBay listings using Rust with the reqwest and select crates.

Rayobyte Alternative - Simplify Web Scraping with ProxiesAPI

Author: Mohan Ganesan

Date: Sep 30, 2023

ProxiesAPI simplifies web scraping with easy pricing and handles proxies automatically. Rayobyte offers complex and expensive proxy management services. Get started with 1,000 free API requests at ProxiesAPI.com.

Scraping Booking.com Property Listings in Kotlin in 2023

Author: Mohan Ganesan

Date: Oct 15, 2023

Learn how to scrape property listings from Booking.com using Kotlin, Ktor, and kotlinx.html. Extract details like property name, location, ratings, etc.

Scraping eBay Listings with Kotlin and HttpClient in 2023

Author: Mohan Ganesan

Date: Oct 5, 2023

eBay is a large online marketplace. This tutorial explains how to scrape and extract data from eBay listings using Kotlin and the HttpClient library.

Scraping Reddit Posts with Rust

Author: Mohan Ganesan

Date: Jan 9, 2024

Code walkthrough for scraping Reddit using Rust to extract post information.

Using Proxies With Goutte in 2024

Author: Mohan Ganesan

Date: Jan 9, 2024

Proxies play a pivotal role in web scraping, preventing blocks and CAPTCHAs. Setting a proxy in Goutte involves using a custom HTTP client. Rotating proxies maximizes scraping before blocks. Proxies API simplifies proxies for seamless scraping.

Do I need to install Urllib in Python?

Author: Mohan Ganesan

Date: Feb 8, 2024

urllib is part of Python's standard library, so it ships with every standard Python distribution and needs no separate installation.
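
A quick way to confirm this is to import the package in a fresh interpreter with nothing installed through pip. The sketch below builds a query string and a request object without touching the network; the example.com URL and the query parameters are placeholders.

```python
import urllib.parse
import urllib.request

# Build a query string from a dict; spaces are encoded as "+".
query = urllib.parse.urlencode({"q": "web scraping", "page": 2})
url = f"https://example.com/search?{query}"

# Split the URL back into its components.
parts = urllib.parse.urlparse(url)

# Prepare (but do not send) a request; urlopen(request) would perform the fetch.
request = urllib.request.Request(url, headers={"User-Agent": "example-client"})

print(query)
print(parts.netloc, parts.path)
print(request.get_method())
```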

Google Search API: Unlocking the Power of Web Data

Author: Mohan Ganesan

Date: Apr 26, 2024

Google Search API is a powerful tool for developers and businesses to access web data. Proxies API offers a cost-effective alternative for integrating Google search functionality.

Scraping Booking.com Property Listings in Visual Basic in 2023

Author: Mohan Ganesan

Date: Oct 15, 2023

Learn how to scrape property listings from Booking.com using Visual Basic and HtmlAgilityPack. Use HttpClient to fetch HTML content and extract details like property name, location, ratings. Scale your web scraping with Proxies API.

Scraping eBay Listings with Visual Basic and HtmlDocument in 2023

Author: Mohan Ganesan

Date: Oct 5, 2023

eBay is a large online marketplace. This tutorial explains how to scrape and extract data from eBay listings using Visual Basic and the HtmlDocument library.

Parsing XML with BeautifulSoup

Author: Mohan Ganesan

Date: Oct 6, 2023

BeautifulSoup can parse and extract data from XML and HTML documents, making it useful for scraping and analyzing data. It can navigate and search the parsed tree, modify the tree, and output the modified XML. It can also convert a BeautifulSoup XML object back into a string and perform additional processing. Examples demonstrate parsing XML files, displaying extracted data in tables using Pandas, and saving extracted data to CSV files.

Scraping Craigslist Listings with Perl

Author: Mohan Ganesan

Date: Oct 1, 2023

Learn how to scrape Craigslist apartment listings using Perl and modules LWP::UserAgent and HTML::TreeBuilder. Avoid IP blocking with a rotating proxy server.

Scraping Reddit Posts with R

Author: Mohan Ganesan

Date: Jan 9, 2024

Scrape data from Reddit posts using R code, handling responses, extracting information, and iterating through multiple posts.

Scraping Booking.com Property Listings with JavaScript in 2023

Author: Mohan Ganesan

Date: Oct 15, 2023

Learn how to scrape property listings from Booking.com using JavaScript. Use Axios and Cheerio to fetch HTML content and extract details like property name, location, ratings, etc.

Scraping Data from Wikipedia with Perl

Author: Mohan Ganesan

Date: Dec 6, 2023

Scraping tabular data from Wikipedia using Perl. Extract and utilize structured data from Wikipedia pages.

Scraping Craigslist Listings with Ruby

Author: Mohan Ganesan

Date: Oct 1, 2023

Learn how to scrape Craigslist apartment listings using Ruby and Nokogiri. Avoid IP blocking with a rotating proxy server.

Scraping Business Listings from Yelp with Objective C

Author: Mohan Ganesan

Date: Dec 6, 2023

Scraping business listings from Yelp using Objective-C and proxies for data extraction.

Scraping eBay Listings with Ruby and Nokogiri in 2023

Author: Mohan Ganesan

Date: Oct 5, 2023

eBay is a large online marketplace. This tutorial explains how to scrape and extract data from eBay listings using Ruby and Nokogiri.

Scraping Craigslist Listings with Scala

Author: Mohan Ganesan

Date: Oct 1, 2023

Learn how to scrape Craigslist apartment listings using Scala and the play-ws library. Use XML parsing and a rotating proxy server to avoid IP blocking.

Scraping Booking.com Property Listings in Ruby in 2023

Author: Mohan Ganesan

Date: Oct 15, 2023

Learn how to scrape property listings from Booking.com using Ruby, Nokogiri, and OpenURI libraries. Use proxies for scaling web scraping.

Scraping Craigslist Listings with Visual Basic

Author: Mohan Ganesan

Date: Oct 1, 2023

Learn how to scrape Craigslist apartment listings using Visual Basic and HtmlAgilityPack library. Avoid IP blocking with a rotating proxy server.

Guide to Scraping Reddit Posts in Objective C

Author: Mohan Ganesan

Date: Jan 9, 2024

Parsing through an unfamiliar code base can be intimidating for beginner programmers. In this article, we'll walk step-by-step through a sample program that scrapes posts from Reddit using HTML parsing and XPath selectors.

Scraping eBay Listings with Scala and HTTP4S in 2023

Author: Mohan Ganesan

Date: Oct 5, 2023

eBay is a large online marketplace. This tutorial explains how to scrape and extract data from eBay listings using Scala and the HTTP4S library.

Scraping Booking.com Property Listings in Elixir in 2023

Author: Mohan Ganesan

Date: Oct 15, 2023

Learn how to scrape property listings from Booking.com using Elixir, HTTPoison, and Floki. Use proxies for scaling web scraping.

The Murky Legality of Scraping Public APIs

Author: Mohan Ganesan

Date: Feb 20, 2024

APIs provide easy access to public data, but scraping them may be illegal. Factors like rate limits and terms of service impact legality. Best practices include respecting restrictions, citing sources, and not selling or spamming with scraped data.

Scraping Yelp Business Listings using CSharp

Author: Mohan Ganesan

Date: Dec 6, 2023

Yelp is a popular review site with over 200 million reviews. This article explains how to scrape Yelp using proxies and HTML parsing with XPath.
