Stories from the Web Crawling trenches in user agents

Persisting Cookies with Python Requests for Effective Web Scraping

Author: Mohan Ganesan

Date: Oct 22, 2023

Cookies allow web scrapers to store and send session data. Python Requests library provides cookie persistence with Sessions, serialization, and rotating User Agents.

Web Scraping in C++ - The Complete Guide

Author: Mohan Ganesan

Date: Feb 20, 2024

Web scraping is a cool way to gather data from websites using code. This guide explores how to use web scraping with high-performance C++ and important libraries. C++ is a good language for web scraping due to its speed, efficiency, and integration with popular scraping tools. The article provides a step-by-step example of scraping a webpage and extracting structured data. It also discusses challenges and best practices for web scraping, such as rotating user agents and handling dynamic content.

Dealing with 403 Forbidden Errors in BeautifulSoup

Author: Mohan Ganesan

Date: Oct 6, 2023

Ways to handle and bypass 403 Forbidden errors in web scraping: checking error codes, using user agents, authenticating with login credentials, waiting and retrying, using proxies.

Does Instagram allow scraping?

Author: Mohan Ganesan

Date: Feb 20, 2024

Instagram's terms allow limited scraping for non-commercial personal use. Best practices to avoid blocks include scraping slowly, varying user agents, avoiding logging in, and using proxies. Commercial scraping alternatives include the Instagram API and data resellers.

How do websites detect web scraping?

Author: Mohan Ganesan

Date: Feb 20, 2024

Websites use detection methods like traffic patterns, browser fingerprints, cookies, and user agents to catch scrapers. Tips to avoid detection include slowing down requests, rotating IPs, using real browser user agents, and maintaining sessions/cookies.

Tired of getting blocked while scraping the web?

ProxiesAPI handles headless browsers and rotates proxies for you.
Get access to 1,000 free API credits, no credit card required!