Stories from the Web Crawling trenches in HTML parsing

The Complete BeautifulSoup Cheatsheet with Examples

Author: Mohan Ganesan

Date: Oct 4, 2023

This cheatsheet covers the full BeautifulSoup 4 API with practical examples. It provides a comprehensive guide to web scraping and HTML parsing using Python's BeautifulSoup library.

The Ultimate Goquery Cheatsheet

Author: Mohan Ganesan

Date: Oct 31, 2023

Goquery is a Go library for easy HTML manipulation and extraction using jQuery-style syntax. Great for web scraping and building web apps.

The Ultimate Jsoup Cheatsheet in Java

Author: Mohan Ganesan

Date: Oct 31, 2023

Jsoup is a Java library for parsing and manipulating HTML using DOM, CSS, and jQuery-like methods.

Web Scraping in C++ - The Complete Guide

Author: Mohan Ganesan

Date: Feb 20, 2024

Web scraping is a cool way to gather data from websites using code. This guide explores how to use web scraping with high-performance C++ and important libraries. C++ is a good language for web scraping due to its speed, efficiency, and integration with popular scraping tools. The article provides a step-by-step example of scraping a webpage and extracting structured data. It also discusses challenges and best practices for web scraping, such as rotating user agents and handling dynamic content.

Web Scraping with Scala & ChatGPT

Author: Mohan Ganesan

Date: Sep 25, 2023

Scala is a great language for web scraping with ChatGPT. Use Scalaj and Jsoup libraries for HTTP requests and HTML parsing. ChatGPT can provide explanations and generate code snippets for scraping tasks.

Web Scraping All The Images From a Website in Node.js

Author: Mohan Ganesan

Date: Dec 13, 2023

Automate data collection from websites using web scraping with Node.js, axios, and cheerio. Extract dog breed information and images from a Wikipedia page.

Scraping eBay Listings with Python and BeautifulSoup in 2023

Author: Mohan Ganesan

Date: Oct 5, 2023

eBay is a large online marketplace. This tutorial shows how to scrape and extract data from eBay listings using Python and BeautifulSoup.

Downloading Images from a Website with C++ and cpp-selector

Author: Mohan Ganesan

Date: Oct 15, 2023

Learn how to use C++ and libraries like cpp-httplib and cpp-selector to scrape data and images from HTML tables and download them locally.

Scraping Multiple Pages in Java with JSoup

Author: Mohan Ganesan

Date: Oct 15, 2023

Web scraping in Java using JSoup to extract data from multiple pages. Use base URL pattern, loop through pages, send request, parse HTML, and extract data using selectors.
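The loop described above (base URL pattern, one request per page, parse, extract) can be sketched in a few lines. This is a minimal sketch in Python using only the standard library, not the Java and JSoup code from the article. The example.com base URL, the `page` query parameter, the `h2` elements with class `title`, and the injected `fetch` callable are all hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


def page_urls(base_url, first, last):
    """Build one URL per page number from a base URL pattern (assumed ?page=N)."""
    return [f"{base_url}?{urlencode({'page': n})}" for n in range(first, last + 1)]


class TitleParser(HTMLParser):
    """Collects the text of <h2 class="title"> elements (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "title") in attrs:
            self._in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data


def extract_titles(html):
    """Parse one page of HTML and return the stripped title strings."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]


def scrape_pages(fetch, base_url, first, last):
    """Loop over the pages; `fetch` is any callable mapping a URL to HTML text."""
    results = []
    for url in page_urls(base_url, first, last):
        results.extend(extract_titles(fetch(url)))
    return results
```

Passing `fetch` in as a parameter keeps the loop independent of the HTTP client, so the same function can be exercised with canned HTML before it is pointed at a live site.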

Scraping Wikipedia Pages with Node.js

Author: Mohan Ganesan

Date: Dec 6, 2023

Scrape Wikipedia using Node.js with axios and cheerio to extract structured data for various use cases.

Scraping All The Images From a Website in PHP

Author: Mohan Ganesan

Date: Dec 13, 2023

Scrape dog breed data from a Wikipedia page using PHP: send HTTP requests, parse the HTML, extract data, and download images. Overcome IP blocking with a rotating proxy service.

Web Scraping New York Times News Headlines in Go

Author: Mohan Ganesan

Date: Dec 6, 2023

Web scraping is the process of extracting data from websites using code. This article provides a tutorial on web scraping using the Go language and the goquery library. It covers the steps to send a GET request, parse HTML content, extract data, and handle common scraping challenges like IP blocking.

Scraping Wikipedia Tables with R

Author: Mohan Ganesan

Date: Dec 6, 2023

Learn how to scrape data from Wikipedia using R. Extract tables and data, handle errors, and work with scraped data. Get hands-on experience with the end-to-end process.

Web Scraping with Kotlin & ChatGPT

Author: Mohan Ganesan

Date: Sep 25, 2023

Kotlin is a great language for web scraping with ChatGPT. Use libraries like Ktor and Jsoup for HTTP requests and HTML parsing. ChatGPT can provide explanations and code snippets for scraping tasks.

Scraping all the Images from a Website with Rust

Author: Mohan Ganesan

Date: Dec 13, 2023

Learn how to use Rust for web scraping, including data extraction, image scraping, and error handling. Overcome IP blocking with a rotating proxy service like Proxies API.

Downloading Images from a Website with Ruby and Nokogiri

Author: Mohan Ganesan

Date: Oct 15, 2023

Learn how to use Ruby and Nokogiri to scrape data and images from HTML tables, download and save images, and overcome challenges like CAPTCHAs and IP blocks with Proxies API.

Building a Simple Proxy Rotator with Ruby and Nokogiri

Author: Mohan Ganesan

Date: Oct 2, 2023

Fetch and use public proxies in Ruby projects using Nokogiri and free proxy lists. Scale to thousands of links with a rotating proxy service like Proxies API.

Scraping Wikipedia in Java for Beginners

Author: Mohan Ganesan

Date: Dec 6, 2023

Web scraping is the process of extracting data from websites. This article provides a code example using Jsoup to scrape Wikipedia for data on US presidents. It also discusses handling IP blocking with a rotating proxy service.

Scraping All Images from a Website with Java

Author: Mohan Ganesan

Date: Dec 13, 2023

Web scraping is the process of extracting data from websites automatically. This article explains how to scrape dog breed images from a Wikipedia page using Java and the Jsoup library. It also discusses the use of CSS selectors and overcoming IP blocking.

Scraping All the Images from a Website with Go

Author: Mohan Ganesan

Date: Dec 13, 2023

This Go program scrapes dog breed images from a Wikipedia page using the goquery package.

Scraping All Images from a Website with Kotlin

Author: Mohan Ganesan

Date: Dec 13, 2023

Practical guide to scraping images from a website using Kotlin code. Learn how to extract data, download images, and overcome IP blocks.

Web Scraping with Ruby & ChatGPT

Author: Mohan Ganesan

Date: Sep 25, 2023

Web scraping in Ruby with Nokogiri, Mechanize, and ChatGPT. Get code snippets and explanations for scraping tasks.

Is Lxml better than BeautifulSoup?

Author: Mohan Ganesan

Date: Feb 5, 2024

Web scrapers extract data from websites using parser libraries like lxml and BeautifulSoup. lxml is faster and stricter about validity, while BeautifulSoup is more convenient and resilient.

Web Scraping Wikipedia Data in Go

Author: Mohan Ganesan

Date: Dec 6, 2023

Web scraping is the process of automatically collecting structured data from websites. This tutorial demonstrates how to scrape a Wikipedia table using Go and the goquery library.

Downloading Images from a Website with Perl and Mojo::DOM

Author: Mohan Ganesan

Date: Oct 15, 2023

Learn how to use Perl and modules like LWP::UserAgent and Mojo::DOM to download images of dog breeds from a Wikipedia page.

Web Scraping with CSharp & ChatGPT

Author: Mohan Ganesan

Date: Sep 25, 2023

Web scraping in C# using ChatGPT and HtmlAgilityPack for data extraction and code generation.

Scraping Multiple Pages with Python and BeautifulSoup

Author: Mohan Ganesan

Date: Oct 15, 2023

Web scraping using Python and BeautifulSoup to extract data from multiple pages. Make HTTP requests, parse HTML, and extract information.

Scraping Reddit Posts in CSharp

Author: Mohan Ganesan

Date: Jan 9, 2024

Download and parse a Reddit page using AngleSharp in C# to extract information from posts.

How to Build a Reddit Scraper in Java

Author: Mohan Ganesan

Date: Jan 9, 2024

Learn how to scrape Reddit posts in Java using HTML parsing, selectors, and user-agent headers.

Scraping Data from Wikipedia in C++

Author: Mohan Ganesan

Date: Dec 6, 2023

Scraping Wikipedia using cURL and Gumbo to extract details on US presidents from a table.

Scraping Wikipedia Tables With Rust

Author: Mohan Ganesan

Date: Dec 6, 2023

Scraping Wikipedia offers quick access to structured, readily available data and hands-on practice with web scraping concepts. This article provides a step-by-step guide to scraping data on US presidents with Rust.

Scraping New York Times News Headlines in R

Author: Mohan Ganesan

Date: Dec 6, 2023

Web scraping is the process of extracting data from websites automatically through code. This article provides a beginner's tutorial on web scraping using R to extract article titles and links from The New York Times for further analysis.

Downloading Images from a Website with Scala and rucola

Author: Mohan Ganesan

Date: Oct 15, 2023

Learn how to use Scala and libraries like scalaj-http and rucola to download images of dog breeds from a Wikipedia page.

What is the alternative to BeautifulSoup in Python?

Author: Mohan Ganesan

Date: Feb 5, 2024

BeautifulSoup is a popular Python library for parsing HTML, but there are alternatives like XML parsing, html.parser, and regular expressions.

Building a Simple Proxy Rotator with Scala and Scraping

Author: Mohan Ganesan

Date: Oct 2, 2023

A simple Scala proxy rotator using ScalaJS for web scraping, fetching and parsing proxies periodically from a proxy site.

Downloading Images from a Website with Objective-C and Ono

Author: Mohan Ganesan

Date: Oct 15, 2023

Learn how to use Objective-C with the AFNetworking and Ono libraries to download images from a Wikipedia page and scrape data.

Web Scraping in Python: A Comparison of Beautiful Soup, Selenium, and Scrapy

Author: Mohan Ganesan

Date: Oct 4, 2023

Web scraping with Python using Beautiful Soup, Selenium, and Scrapy. Each tool serves a different niche, from simple extraction to browser automation and large-scale scraping.

Web Scraping Wikipedia in Scala

Author: Mohan Ganesan

Date: Dec 6, 2023

Wikipedia scraping using Scala and Jsoup to extract structured data from tables. Simplified steps include importing libraries, defining URL, setting user agent, sending HTTP request, parsing HTML, extracting data, and printing scraped data.
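The first few steps listed above (define the URL, set a user agent, send the HTTP request) look roughly like this. This is a sketch in Python's standard library rather than the Scala and Jsoup code the article uses. The user agent string and the function names are assumptions made for illustration.

```python
from urllib.request import Request, urlopen

# Hypothetical user agent string; real scrapers usually identify themselves
# or mimic a browser here.
USER_AGENT = "Mozilla/5.0 (compatible; example-scraper/0.1)"


def build_request(url, user_agent=USER_AGENT):
    """Create a GET request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10.0):
    """Send the request and decode the body using the declared charset."""
    with urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Separating `build_request` from `fetch_html` makes it possible to inspect the headers that will be sent without touching the network.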

Scraping eBay Listings in Go in 2023

Author: Mohan Ganesan

Date: Oct 5, 2023

Step-by-step tutorial for extracting data from eBay listings using Go. Use net/http and github.com/PuerkitoBio/goquery packages for HTML parsing.

Downloading Images from a Website with Go and goquery

Author: Mohan Ganesan

Date: Oct 15, 2023

Learn how to use Go and goquery to download images from a Wikipedia page, extract data from HTML tables, and scrape websites. Use Proxies API for IP rotation and CAPTCHA solving.

Scraping eBay Listings in Rust in 2023

Author: Mohan Ganesan

Date: Oct 5, 2023

Learn how to scrape and extract data from eBay listings using Rust, reqwest, and select crates.

Scraping New York Times News Headlines in C++

Author: Mohan Ganesan

Date: Dec 6, 2023

Web scraping is a technique for extracting data from websites with code. This article explains how to scrape article titles and links from The New York Times using C++. It covers concepts like HTTP requests, HTML structure, libcurl, and Gumbo. It also mentions the challenges of IP blocking and suggests using a rotating proxy service like Proxies API.

Downloading Images from a Website with Elixir and Floki

Author: Mohan Ganesan

Date: Oct 15, 2023

Learn how to use Elixir and libraries like HTTPoison and Floki to download images from a Wikipedia page and extract data from HTML tables.

Extracting Structured Data by Scraping Wikipedia with Kotlin

Author: Mohan Ganesan

Date: Dec 6, 2023

Scraping Wikipedia using Jsoup to extract structured data on US presidents.

Downloading Images from a Website with Kotlin and Jsoup

Author: Mohan Ganesan

Date: Oct 15, 2023

Learn how to use Kotlin and Jsoup to download images from a Wikipedia page, extract data from HTML tables, and scrape websites. Use Proxies API for scaling web scraping.

Building a Simple Proxy Rotator with Objective-C

Author: Mohan Ganesan

Date: Oct 2, 2023

Fetch and parse proxies from free proxy pools to rotate and use in Objective-C projects, solving IP blocking problems with a rotating proxy service.

Scraping New York Times News Headlines with Rust

Author: Mohan Ganesan

Date: Dec 6, 2023

Automatically collect and analyze data from websites using web scraping in Rust. Learn how to make structured requests, parse HTML, and use CSS selectors to extract information.

Web Scraping with Objective-C & ChatGPT

Author: Mohan Ganesan

Date: Sep 25, 2023

Objective-C is a powerful language for web scraping on Apple platforms. ChatGPT is an AI assistant that provides explanations and code generation for scraping tasks.

Scraping Reddit Posts with Rust

Author: Mohan Ganesan

Date: Jan 9, 2024

Code walkthrough for scraping Reddit using Rust to extract post information.

Scraping all the Images from a Website with Ruby

Author: Mohan Ganesan

Date: Dec 13, 2023

Scraping dog breed information and images from Wikipedia using Ruby and the Nokogiri library. Save locally with breed name, group, and local name.

Scraping Reddit Posts in Kotlin

Author: Mohan Ganesan

Date: Jan 9, 2024

Scrape Reddit posts using Kotlin script, send HTTP requests, parse HTML, and extract key data using selectors.

Scraping Multiple Pages in Ruby with Nokogiri

Author: Mohan Ganesan

Date: Oct 15, 2023

Web scraping in Ruby using Nokogiri to extract data from multiple pages. Use base URL pattern, loop through pages, parse HTML, and extract data.

Scraping Multiple Pages in Objective-C with NSURLSession and XPathQuery

Author: Mohan Ganesan

Date: Oct 15, 2023

Scrape multiple pages in Objective-C using NSURLSession and XPathQuery to extract data programmatically from websites.

Web Scraping New York Times News Headlines in Ruby

Author: Mohan Ganesan

Date: Dec 6, 2023

Web scraping allows automatic data extraction from websites. This article demonstrates web scraping using Ruby, Nokogiri, and Net::HTTP. It covers CSS selectors, handling errors, and overcoming IP blocks.

Scraping New York Times News Headlines with Objective-C

Author: Mohan Ganesan

Date: Dec 6, 2023

Web scraping is a valuable skill for extracting data from websites. This beginner-friendly guide walks you through the process in Objective-C, from setting up the project to parsing HTML content. Learn how to simulate a browser request, send an HTTP GET request, handle errors, and extract the data you need. With the right techniques, web scraping can be a powerful tool for data analysis and building web applications.

Scraping Data from Wikipedia with Perl

Author: Mohan Ganesan

Date: Dec 6, 2023

Scraping tabular data from Wikipedia using Perl. Extract and utilize structured data from Wikipedia pages.

How to Scrape Reddit Posts in Go

Author: Mohan Ganesan

Date: Jan 9, 2024

Learn how to scrape Reddit using Go with a step-by-step guide. Extract information about posts using HTML parsing and HTTP requests.

Scraping New York Times News Headlines in VB

Author: Mohan Ganesan

Date: Dec 6, 2023

Web scraping is the process of extracting data from websites automatically through code. This article provides a step-by-step guide on how to scrape article titles and links from The New York Times website using HTML parsing and XPath queries.

Is API better than web scraping?

Author: Mohan Ganesan

Date: Feb 20, 2024

APIs vs web scraping: pros and cons of structured data retrieval and HTML parsing for flexible data access.

Web Scraping with Go & ChatGPT

Author: Mohan Ganesan

Date: Sep 25, 2023

Go is a great language for web scraping with ChatGPT's assistance. ChatGPT can provide explanations and generate code for tasks such as HTML parsing and CSV output. A web scraping API like Proxies API can handle anti-scraping measures and JavaScript rendering.

Scraping Real Estate Listings From Realtor with Go

Author: Mohan Ganesan

Date: Jan 9, 2024

Learn how to scrape real estate listing data from Realtor.com using Go and the goquery library. Use web scraping to collect and analyze housing data.

Scraping Craigslist Listings with Rust

Author: Mohan Ganesan

Date: Oct 1, 2023

Learn how to scrape Craigslist apartment listings using Rust and the reqwest and selectors crates.

Web Scraping Yelp Business Listings with Rust

Author: Mohan Ganesan

Date: Dec 6, 2023

Learn how to scrape Yelp business listings using Rust, including setting up the development environment, handling proxies, making HTTP requests, parsing HTML, and extracting business details.

Scraping New York Times News Headlines in Perl

Author: Mohan Ganesan

Date: Dec 6, 2023

Learn how to scrape the NYT website using Perl, LWP::UserAgent, and Mojo::DOM. Extract headlines and links programmatically.

Scraping Real Estate Listings From Realtor in Kotlin

Author: Mohan Ganesan

Date: Jan 9, 2024

Learn how to use Jsoup for web scraping to extract key details from real estate listings on Realtor.com. This comprehensive guide covers crafting GET requests, selecting HTML elements with CSS selectors, extracting and transforming text, and dealing with missing data. By the end, you'll be able to scrape details like broker name, status, price, beds, baths, square footage, lot size, and full address from any Realtor.com search page.

Scraping Reddit Posts with Ruby

Author: Mohan Ganesan

Date: Jan 9, 2024

Learn how to scrape data from Reddit using Ruby, Nokogiri, and open-uri. Collect public data, analyze posting trends, and build Reddit bots or apps.

Scraping Wikipedia Tables in Objective-C for Beginners

Author: Mohan Ganesan

Date: Dec 6, 2023

Gathering data by scraping websites is made easy with just 34 lines of code in Objective-C using the TFHpple library. Learn how to make HTTP requests, parse HTML content, extract data from a table, and clean and process the scraped content.

What are the features of BeautifulSoup?

Author: Mohan Ganesan

Date: Feb 5, 2024

Web scraping with BeautifulSoup: a powerful Python library for extracting data from websites using a simple API and CSS selectors.

Scraping New York Times News Headlines in Scala

Author: Mohan Ganesan

Date: Dec 6, 2023

Web scraping is a technique for extracting data from websites automatically. This article explains how to scrape article titles and links from The New York Times homepage using Scala and the Jsoup library.

Is BeautifulSoup faster than Selenium?

Author: Mohan Ganesan

Date: Feb 5, 2024

Web scraping involves extracting data from websites. BeautifulSoup is lightweight and efficient for scraping static content, while Selenium is necessary for dynamically loaded content. Together, they provide a comprehensive solution for web scraping.

Introduction to Scraping Reddit Posts in Scala

Author: Mohan Ganesan

Date: Jan 9, 2024

Beginner-friendly guide to scrape content from Reddit using Scala and Play Framework's WS library. Extract key information like post titles, permalinks, authors, and scores from Reddit posts on a webpage.

Scraping Yelp Business Listings using Ruby - A step by step guide

Author: Mohan Ganesan

Date: Dec 6, 2023

Learn how to scrape Yelp business listings using Ruby and Nokogiri, bypassing anti-bot mechanisms with premium proxies.

Scraping eBay Listings with Scala and HTTP4S in 2023

Author: Mohan Ganesan

Date: Oct 5, 2023

eBay is a large online marketplace. This tutorial explains how to scrape and extract data from eBay listings using Scala and the HTTP4S library.

Scraping Yelp Business Listings with Scala

Author: Mohan Ganesan

Date: Dec 6, 2023

Learn how to extract data from Yelp business listings using Scala and web scraping techniques.

Scraping Reddit Posts In C++

Author: Mohan Ganesan

Date: Jan 9, 2024

A C++ web scraping program that extracts post data from Reddit using HTML parsing and the curl library.

Accessing Data on Websites: APIs vs Web Scraping

Author: Mohan Ganesan

Date: Feb 20, 2024

APIs provide official, supported access points to data, while web scraping 'scrapes' data from sites in an unofficial manner.

Scraping Reddit Posts in Elixir

Author: Mohan Ganesan

Date: Jan 9, 2024

Web scraping tutorial using Elixir code to extract post information from Reddit. Learn how to install dependencies, make requests, parse HTML, and use CSS selectors.

Can I use Selenium with BeautifulSoup?

Author: Mohan Ganesan

Date: Feb 5, 2024

Web scraping with Selenium and BeautifulSoup allows for dynamic page access and data extraction, making them a powerful combination.

How long does it take to learn web scraping?

Author: Mohan Ganesan

Date: Feb 20, 2024

Learn web scraping in 0-3 months with Python or JavaScript. Master advanced techniques in 4-12 months. Keep leveling up your skills!

Scraping Real Estate Listings From Realtor Using Rust

Author: Mohan Ganesan

Date: Jan 9, 2024

Extract real estate listing data from Realtor.com using Rust, HTML parsing, and HTTP requests.

Scraping Real Estate Listings From Realtor in Perl

Author: Mohan Ganesan

Date: Jan 9, 2024

Step-by-step walkthrough of Perl code to scrape real estate listings from Realtor.com using XPath selectors.

What are the three basic parts of a scraper?

Author: Mohan Ganesan

Date: Feb 22, 2024

Web scrapers allow you to programmatically extract data from websites, transform it into a structured format like a CSV or JSON file, and save it to your computer for further analysis.
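Assuming the three parts map onto the extract, transform, and save steps named above, a minimal sketch in Python's standard library might look like this. The table markup, the function names, and the choice of CSV over JSON are assumptions for illustration, not the article's own code.

```python
import csv
from html.parser import HTMLParser


class RowParser(HTMLParser):
    """Collects the cell text of every <tr> in the document, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._cell = ""

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.rows[-1].append(self._cell.strip())
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell += data


def extract(html):
    """Part 1: pull raw rows out of the HTML."""
    parser = RowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def transform(rows):
    """Part 2: turn rows into records keyed by the header row."""
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


def save_csv(records, out):
    """Part 3: write the records to any text file object as CSV."""
    if not records:
        return
    writer = csv.DictWriter(out, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
```

To write to disk, call `save_csv(records, f)` with a file opened as `open(path, "w", newline="")`. Swapping in `json.dump(records, f)` would give the JSON variant.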
