Stories from the Web Crawling trenches in performance

The Ultimate Loofah Cheatsheet for Ruby

Author: Mohan Ganesan

Date: Nov 4, 2023

Loofah is a Ruby library for parsing and manipulating HTML/XML documents. It provides a simple API for traversing, manipulating, and extracting data from markup. It also offers XSS sanitization and integrates with Rails. Loofah is built on top of Nokogiri, providing speed and Ruby idioms.

How to Build a Simple HTTP Proxy in Rust in just 40 lines

Author: Mohan Ganesan

Date: Oct 1, 2023

Rust is a great language for network programming. Learn how to build a basic HTTP proxy in just 40 lines of code. Also, discover the benefits of using a rotating proxy to avoid IP blocking.

Caching in Python

Author: Mohan Ganesan

Date: Dec 6, 2023

Learn how to cache API responses in Python to improve performance. Caching reduces API requests, improves speed, and lowers costs.

Keeping Sessions Alive with Persistent Connections in Python Requests

Author: Mohan Ganesan

Date: Feb 3, 2024

Using persistent sessions in Python Requests library improves performance and allows reusing connections for multiple requests.

Combining AsyncIO and Multiprocessing in Python

Author: Mohan Ganesan

Date: Mar 17, 2024

Python's asyncio library and multiprocessing module can be combined for improved resource utilization and cleaner code. Data passing between the two requires caution.

What is the fastest XML parser in Python?

Author: Mohan Ganesan

Date: Feb 5, 2024

Choosing the right XML parsing library is crucial for performance. lxml is the fastest option, taking only 0.35 seconds compared to over 2 seconds with xml.etree.ElementTree. It's well worth the extra setup.

What is PoolManager in urllib3?

Author: Mohan Ganesan

Date: Feb 20, 2024

Simplifying HTTP requests with PoolManager in Python. PoolManager manages a pool of connections for reusing, improving performance. Customize pool behavior for better resource usage.

Speed Up Your Python Web Requests: Requests vs. Urllib

Author: Mohan Ganesan

Date: Feb 3, 2024

Python's requests library provides a fast and simple interface for making HTTP requests, offering better performance than urllib for most use cases.

Unlocking Async Performance with Asyncio Redis

Author: Mohan Ganesan

Date: Mar 25, 2024

Redis is a popular in-memory data store known for its speed and versatility. By combining Redis with Python's asyncio module, you can build extremely fast and scalable applications.

The Definitive Guide to Handling Proxies in Go in 2024

Author: Mohan Ganesan

Date: Jan 9, 2024

Dealing with proxies in Go for web scraping: setup, security, privacy, performance, and troubleshooting. Proxies API offers a solution for developers.

Speed Up HTTP Requests: When to Use http.client over requests

Author: Mohan Ganesan

Date: Feb 3, 2024

Python offers options for HTTP requests with http.client and requests. http.client is faster for simple requests, while requests is more feature-rich. Use http.client for speed and requests for complex applications.

Tuning aiohttp Request Timeouts for Optimal Performance

Author: Mohan Ganesan

Date: Mar 3, 2024

Managing request timeouts in aiohttp is crucial for good performance. Default timeouts may cause resource exhaustion and unresponsive UI. Tuning timeouts based on application load and setting them globally can prevent failures and improve user experience.

Is Lxml better than BeautifulSoup?

Author: Mohan Ganesan

Date: Feb 5, 2024

Web scrapers extract data from websites using parser libraries like lxml and BeautifulSoup. lxml is faster and more valid, while BeautifulSoup is more convenient and resilient.

Benchmarking aiohttp Web Performance

Author: Mohan Ganesan

Date: Feb 22, 2024

The Python aiohttp library provides powerful async HTTP client/server functionality. Benchmarking quantifies metrics like requests per second, latency distributions, and resource usage to guide optimization and capacity planning.

Efficient URL Requests with urllib PoolManager

Author: Mohan Ganesan

Date: Feb 6, 2024

Making HTTP requests in Python is common. urllib's PoolManager helps in reusing connections to each host, boosting performance.

Whats the equivalent of pythons request package for rust?

Author: Mohan Ganesan

Date: Feb 3, 2024

Rust is a systems programming language focused on performance, reliability, and efficiency. reqwest is a popular HTTP client library for Rust, providing a similar developer experience to Python's requests package.

Making Asynchronous HTTP Requests in Python

Author: Mohan Ganesan

Date: Feb 3, 2024

Python Requests library provides simple interface for making HTTP requests. Supports synchronous and asynchronous requests using threads or processes.

Async IO vs Thread Pools in Python: When to Use Each

Author: Mohan Ganesan

Date: Mar 17, 2024

Python provides two major approaches for concurrent and parallel programming: asyncio and thread pools. Choosing the right concurrency tool can impact performance, scalability, and code complexity.

What is the difference between asyncio and multithreading python ?

Author: Mohan Ganesan

Date: Mar 17, 2024

Python developers often need to make their programs concurrent to improve performance. The two main options for concurrency in Python are asyncio and multithreading.

Is asyncio python better than threading?

Author: Mohan Ganesan

Date: Mar 17, 2024

Async IO vs Threading in Python: A Practical Comparison. Async IO and threading are two options for concurrency in Python. This article compares their strengths and weaknesses, including performance, scalability, and library compatibility.

Python Threads vs Processes: Which is Faster and When to Use Each

Author: Mohan Ganesan

Date: Mar 24, 2024

When writing Python programs, developers often wonder if it's better to use threads or processes. Processes are generally faster and more robust, but have higher overhead. Threads require less resources to create, but come with their own challenges.

Is asyncio part of Python?

Author: Mohan Ganesan

Date: Mar 17, 2024

Python's asyncio module enables non-blocking concurrency, improving performance, scalability, and user experience.

What is the fastest language for multithreading?

Author: Mohan Ganesan

Date: Mar 17, 2024

Multithreading improves performance. C++, Java, and Go are fastest. Optimize with thread pools, shared state, and reducing blocking.

Why is multithreading not faster in python?

Author: Mohan Ganesan

Date: Mar 24, 2024

Python's multithreading capabilities are limited by the Global Interpreter Lock (GIL), but can still provide performance benefits for I/O-bound tasks. Tips include using multiprocessing for CPU-bound tasks and avoiding shared memory between threads.

Running WSGI Apps with aiohttp

Author: Mohan Ganesan

Date: Feb 22, 2024

aiohttp library in Python allows running WSGI apps directly, providing better performance and leveraging aiohttp's features.

Choosing Between Curio and aiohttp for Async IO in Python

Author: Mohan Ganesan

Date: Feb 22, 2024

Python developers can choose between Curio and aiohttp for async IO. Curio is great for CPU-bound tasks, while aiohttp is ideal for IO-bound HTTP applications. Both libraries are well-optimized for performance.

Multithreading in Python: Choosing the Right Model

Author: Mohan Ganesan

Date: Mar 17, 2024

Multithreading in Python can improve performance and responsiveness. Choose the right model based on use case and tradeoffs. Options include threading, multiprocessing, and asyncio.

Is asyncio concurrent or parallel python?

Author: Mohan Ganesan

Date: Mar 17, 2024

Asyncio provides concurrency, not parallelism. It shines for I/O bound work and can achieve high performance. Use multiprocessing for CPU intensive tasks.

What is REST and soap API?

Author: Mohan Ganesan

Date: May 7, 2024

REST and SOAP are two types of APIs with key differences in architecture, data formats, verbs, and performance. REST is faster and more scalable, while SOAP offers more security and robust messaging.

Tired of getting blocked while scraping the web?

ProxiesAPI handles headless browsers and rotates proxies for you.
Get access to 1,000 free API credits, no credit card required!