Stories from the Web Crawling trenches in C++

The Complete Libxml2 C++ Cheatsheet

Author: Mohan Ganesan

Date: Oct 31, 2023

Libxml2 is an XML processing library written in C for use in C/C++ applications. It provides DOM, SAX, XMLReader, XPath and XPointer support.

How to Build a Super Simple HTTP Proxy in C++ in just 30 lines of code

Author: Mohan Ganesan

Date: Oct 1, 2023

Build a basic HTTP proxy in C++ in 30 lines of code. Use a rotating proxy service to avoid IP blocking with an API.

Web Scraping in C++ - The Complete Guide

Author: Mohan Ganesan

Date: Feb 20, 2024

Web scraping is a cool way to gather data from websites using code. This guide explores how to use web scraping with high-performance C++ and important libraries. C++ is a good language for web scraping due to its speed, efficiency, and integration with popular scraping tools. The article provides a step-by-step example of scraping a webpage and extracting structured data. It also discusses challenges and best practices for web scraping, such as rotating user agents and handling dynamic content.

The Ultimate Gumbo C++ Cheatsheet

Author: Mohan Ganesan

Date: Oct 31, 2023

Gumbo is an HTML5 parsing library written in C and usable from C++ that makes it easy to parse HTML and extract data from it. This cheatsheet covers selecting, traversing, and working with nodes in the parsed DOM tree.

Downloading Images from a Website with C++ and cpp-selector

Author: Mohan Ganesan

Date: Oct 15, 2023

Learn how to use C++ and libraries like cpp-httplib and cpp-selector to scrape data and images from HTML tables and download them locally.

Using Proxies With C++ httplib in 2024

Author: Mohan Ganesan

Date: Jan 9, 2024

Using a proxy with C++ httplib is easy. Set up authentication, chain multiple proxies, customize settings, and troubleshoot issues. Proxies API offers a better solution for unblockable scraping.

Scraping Multiple Pages in C++ with cpp-netlib and cppxpath

Author: Mohan Ganesan

Date: Oct 15, 2023

Scrape data from multiple pages in C++ using the cpp-netlib and cppxpath libraries. Start from a base URL pattern, loop through the pages, send requests, parse the HTML, extract data using XPath, and print or store the scraped data. Proxies API can help overcome challenges like CAPTCHAs, IP blocks, and bot detection when scraping production-level sites.
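
As a rough illustration of the base URL pattern and page loop mentioned in this summary, here is a minimal sketch that uses only the C++ standard library. The helper name `build_page_urls`, the `{page}` placeholder, and the example.com URL are assumptions made for illustration; they are not taken from the article, and the request and XPath steps (cpp-netlib, cppxpath) are left out.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from the article): expands a base URL pattern
// containing the placeholder "{page}" into one URL per page number.
// Returns an empty list if the pattern has no placeholder.
std::vector<std::string> build_page_urls(const std::string& pattern,
                                         int first_page,
                                         int last_page) {
    const std::string placeholder = "{page}";
    std::vector<std::string> urls;

    const auto pos = pattern.find(placeholder);
    if (pos == std::string::npos) {
        return urls;  // nothing to substitute
    }

    for (int page = first_page; page <= last_page; ++page) {
        std::string url = pattern;
        // Swap the placeholder for the current page number.
        url.replace(pos, placeholder.size(), std::to_string(page));
        urls.push_back(url);
    }
    return urls;
}
```

Calling `build_page_urls("https://example.com/items?page={page}", 1, 3)` yields three URLs, one per page, which a scraper would then request and parse in turn.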

How to Scrape All the Images from a Website with C++

Author: Mohan Ganesan

Date: Dec 13, 2023

Scrape and download images from a website in C++ using libraries like libcurl and libxml2. Assumes some knowledge of HTML, CSS, and programming.

Scraping eBay Listings with C++ and libcurl in 2023

Author: Mohan Ganesan

Date: Oct 5, 2023

Scrape and extract key data from eBay listings using C++ and the libcurl library.

What are the fastest languages for web scraping?

Author: Mohan Ganesan

Date: Feb 5, 2024

Web scraping involves extracting data from websites. Choosing the right programming language is crucial for scraping large sites. C++ and Rust offer speed, while Go provides simplicity and speed.

Scraping Yelp Business Listings with C++

Author: Mohan Ganesan

Date: Dec 6, 2023

Extract business listing data from Yelp using C++ with the libcurl and Gumbo libraries.

Scraping New York Times News Headlines in C++

Author: Mohan Ganesan

Date: Dec 6, 2023

Web scraping is a technique for extracting data from websites. This article explains how to scrape article titles and links from The New York Times using C++. It covers concepts like HTTP requests, HTML structure, libcurl, and Gumbo. It also mentions the challenges of IP blocking and suggests using a rotating proxy service like Proxies API.

Downloading Images from URLs in C++

Author: Mohan Ganesan

Date: May 5, 2024

Download images efficiently using C++ with libcurl, Boost.Asio, Qt Network Module, OpenCV, or Poco Libraries.
