Stories from the Web Crawling trenches in best practices

Web Scraping in Python - The Complete Guide

Author: Mohan Ganesan

Date: Feb 20, 2024

Build robust web crawlers using libraries like BeautifulSoup. Overcome scraping challenges and learn best practices for large scale scraping.

How to fix TooManyRedirects error in Python requests

Author: Mohan Ganesan

Date: Oct 22, 2023

The TooManyRedirects error in Python requests occurs when the request exceeds the default limit of 30 redirects. This article explains the causes of the error and provides solutions to fix it, including modifying redirect behavior, increasing max redirects, disabling redirects, and implementing custom redirect handling. It also offers best practices for handling redirects and answers frequently asked questions about the error.

Concurrency and Thread Safety in Python's asyncio

Author: Mohan Ganesan

Date: Mar 17, 2024

Python's asyncio module enables concurrency within a single thread using an event loop. Sharing data between coroutines is thread-safe. Multithreading requires new event loops and explicit synchronization. Blocking code must execute in threads to avoid blocking the event loop. Following these best practices ensures efficient, thread-safe asyncio code.

How to write URL in Python?

Author: Mohan Ganesan

Date: Feb 8, 2024

Best practices for handling URLs in Python for web applications, APIs, and scraping websites.

Using Rotating Proxies in rvest in 2024

Author: Mohan Ganesan

Date: Jan 9, 2024

Configuring proxies in rvest for web scraping. Learn how to set up proxies, rotate them dynamically, and implement best practices for optimal performance.

Does Amazon allow web scraping?

Author: Mohan Ganesan

Date: Feb 20, 2024

Web scraping refers to extracting data from websites automatically through code. Amazon's terms of service restrict scraping, but there are exceptions based on fair use principles. Best practices include respecting robots.txt, making distributed requests, and not republishing full copies.

How to Use Proxy in WGet in 2024

Author: Mohan Ganesan

Date: Jan 9, 2024

Web scraping guide on configuring proxies with Wget, including different methods, tips for effective usage, common errors and solutions, and best practices for high performance. Introduces Proxies API as a solution to overcome DIY proxy limits.

Securely Share Sessions Between Services with Aiohttp Session Proxy

Author: Mohan Ganesan

Date: Feb 22, 2024

Aiohttp session proxy allows secure sharing of session data between microservices, improving user experience and ensuring encryption. Best practices include setting environment variables, using HTTPS, and handling timeouts.

Avoiding Excess Characters When Writing Files in Python

Author: Mohan Ganesan

Date: Feb 3, 2024

When writing data to files in Python, be aware of extra characters like newlines and padding. Use file.write() instead of print() and clean string formatting for clean file output.

Understanding Asyncio Event Loops in Python

Author: Mohan Ganesan

Date: Mar 25, 2024

The event loop is the core of asyncio in Python, handling asynchronous code and callbacks. Properly managing the event loop is key to writing efficient asyncio programs.

What is the future in asyncio python ?

Author: Mohan Ganesan

Date: Mar 17, 2024

Asyncio enables asynchronous programming in Python. It is gaining popularity and offers performance improvements, new idioms, and integration with other languages. It is set to become an indispensable part of the Python ecosystem.

Python: The Go-To Language for Web Scraping

Author: Mohan Ganesan

Date: Feb 20, 2024

Web scraping with Python: learn why Python is the go-to language, its advantages, popular libraries, handling complex websites, and best practices.

Executing Asyncio Coroutines: How Often to Call run()

Author: Mohan Ganesan

Date: Mar 24, 2024

The asyncio.run() function is used to execute asyncio coroutine functions. It should generally only be called once per asyncio program to avoid unexpected behavior.

The Murky Legality of Scraping Public APIs

Author: Mohan Ganesan

Date: Feb 20, 2024

APIs provide easy access to public data, but scraping them may be illegal. Factors like rate limits and terms of service impact legality. Best practices include respecting restrictions, citing sources, and not selling or spamming with scraped data.

Tired of getting blocked while scraping the web?

ProxiesAPI handles headless browsers and rotates proxies for you.
Get access to 1,000 free API credits, no credit card required!