Stories from the Web Crawling trenches in data collection

Caching in Python

Author: Mohan Ganesan

Date: Dec 6, 2023

Learn how to cache API responses in Python to improve performance. Caching reduces API requests, improves speed, and lowers costs.
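
As a rough illustration of the idea, the sketch below keeps API responses in memory for a fixed time so that repeated lookups skip the network. The `TTLCache` class, its `get_or_fetch` method, and the `fetch` callable are hypothetical names chosen for this example; they are not taken from the article.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds.

    Hypothetical helper for illustration only, not a library API.
    """

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get_or_fetch(self, key: str, fetch: Callable[[str], Any]) -> Any:
        """Return the cached value for key, calling fetch(key) only on a miss."""
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]  # fresh entry: no API request is made
        value = fetch(key)  # missing or expired: call the API again
        self._store[key] = (now, value)
        return value
```

In practice `fetch` would wrap whatever HTTP client is in use. The time-to-live is a trade-off: a longer value saves more requests but serves older data.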

How to Set and Change User Agent when using curl

Author: Mohan Ganesan

Date: Jan 9, 2024

Learn how to change cURL's user agent to avoid blocks and mimic real browsers for web scraping and API testing.

How to Use Proxy in PHP Curl in 2024

Author: Mohan Ganesan

Date: Jan 9, 2024

Web scraping with proxies in PHP cURL: learn how to bypass blocks, set up basic and advanced configurations, and integrate proxies effectively.

Web Scraping All The Images From a Website in Node.js

Author: Mohan Ganesan

Date: Dec 13, 2023

Automate data collection from websites using web scraping with Node.js, axios, and cheerio. Extract dog breed information and images from a Wikipedia page.

The Ultimate Guide to Rotating Proxies

Author: Mohan Ganesan

Date: Jan 9, 2024

Rotating proxies are dynamic proxy servers that automatically change the source IP address with each new request, providing enhanced anonymity and efficient large-scale data retrieval compared to static proxies.
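
A minimal client-side sketch of the same idea: cycle through a pool of proxy addresses so that each new request goes out through the next one. The `proxy_rotation` function is a hypothetical helper, and the addresses in the usage note are placeholders. The `{"http": ..., "https": ...}` mapping it yields is assumed to match the shape that HTTP clients such as `requests` accept for their `proxies` argument.

```python
from itertools import cycle
from typing import Dict, Iterable, Iterator


def proxy_rotation(proxy_urls: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield proxy settings forever, moving to the next proxy on every call to next().

    Hypothetical helper for illustration; a managed rotating-proxy service
    would normally do this on the server side.
    """
    pool = list(proxy_urls)
    if not pool:
        raise ValueError("need at least one proxy URL")
    # Round-robin: after the last proxy, start again from the first.
    return ({"http": url, "https": url} for url in cycle(pool))
```

Usage might look like `rotation = proxy_rotation(["http://proxy1.example.com:8080", "http://proxy2.example.com:8080"])`, followed by `next(rotation)` before each request. Round-robin is only one possible policy; random selection or weighting by past success rate are common alternatives.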

Web Scraping Wikipedia with CSharp

Author: Mohan Ganesan

Date: Dec 6, 2023

Learn how to scrape data from Wikipedia using C# and the HtmlAgilityPack library. Extract information from websites for data collection, analysis, and automation.

Web Scraping Wikipedia Data in Go

Author: Mohan Ganesan

Date: Dec 6, 2023

Web scraping is the process of automatically collecting structured data from websites. This tutorial demonstrates how to scrape a Wikipedia table using Go and the goquery library.

The Complete Guide to Datacenter Proxies

Author: Mohan Ganesan

Date: Jan 9, 2024

Datacenter proxies allow anonymous internet access. They act as intermediaries between users and websites, providing privacy and security. Forward proxies fetch web content for users, while reverse proxies distribute client traffic and add a protective layer. Datacenter proxies are used for accessing geo-restricted content, competitive price monitoring, gathering social media data, and more. Popular datacenter proxy providers include Bright Data, Oxylabs, and Smartproxy. Configuring datacenter proxies involves integrating server access credentials into programming scripts or browser settings. Choosing the right proxies depends on factors like shared vs. dedicated proxies, HTTP vs. SOCKS proxies, and rotating vs. static proxies. Pro tips for maximizing proxy usage include chaining multiple providers, automating IP cycling, persisting sessions, and caching common responses. Datacenter proxies are legal but usage should respect website terms. Proxies API is a SaaS platform that simplifies large-scale scraping by handling proxy configuration and rotation automatically.

Scraping New York Times News Headlines in Scala

Author: Mohan Ganesan

Date: Dec 6, 2023

Web scraping is a technique for extracting data from websites automatically. This article explains how to scrape article titles and links from The New York Times homepage using Scala and the Jsoup library.

Scraping Yelp Business Listings in NodeJS

Author: Mohan Ganesan

Date: Dec 6, 2023

Learn how to scrape business listings from Yelp using web scraping techniques and premium proxies with Node.js and Axios.

Managing Cookies in aiohttp for Effective Web Scraping

Author: Mohan Ganesan

Date: Mar 3, 2024

Properly managing cookies is essential for reliable and efficient web scraping with the Python aiohttp library. Take control of cookie persistence, security settings, and expiration to build robust crawlers.

Scraping YouTube Data: What's Allowed and Best Practices

Author: Mohan Ganesan

Date: Feb 20, 2024

YouTube allows limited web scraping for non-commercial personal use cases like academic research, but with significant restrictions and best practices to follow.

Scraping Reddit Posts with Ruby

Author: Mohan Ganesan

Date: Jan 9, 2024

Learn how to scrape data from Reddit using Ruby, Nokogiri, and open-uri. Collect public data, analyze posting trends, and build Reddit bots or apps.

Do hackers use web scraping?

Author: Mohan Ganesan

Date: Feb 20, 2024

Hackers use web scraping to steal data, but ethical scraping is done with permission and within reason. Scrapers are valuable tools for businesses, journalists, and academics.

How do I scrape Google without being banned?

Author: Mohan Ganesan

Date: Feb 20, 2024

Collect Google Search data without getting blocked by following guidelines, using APIs, proxies, delays, and randomizing identifiers.

Accessing Data on Websites: APIs vs Web Scraping

Author: Mohan Ganesan

Date: Feb 20, 2024

APIs provide official, supported access points to data, while web scraping extracts data from a site's pages in an unofficial manner.

How do websites detect web scraping?

Author: Mohan Ganesan

Date: Feb 20, 2024

Websites use detection methods like traffic patterns, browser fingerprints, cookies, and user agents to catch scrapers. Tips to avoid detection include slowing down requests, rotating IPs, using real browser user agents, and maintaining sessions/cookies.
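
Two of the tips above, slowing down requests and using real browser user agents, can be sketched as a small planning step run before each request. The `next_request_plan` function and the `USER_AGENTS` list are hypothetical, and the user agent strings are illustrative examples that would need to be kept current. Rotating IPs and maintaining sessions are left out of this sketch.

```python
import random
from typing import Dict, Optional, Tuple

# Illustrative browser user agent strings (assumed examples, not an official list).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.1 Safari/605.1.15",
]


def next_request_plan(min_delay: float, max_delay: float,
                      rng: Optional[random.Random] = None
                      ) -> Tuple[float, Dict[str, str]]:
    """Pick a random pause (in seconds) and a User-Agent header for the next request."""
    if min_delay > max_delay:
        raise ValueError("min_delay must not exceed max_delay")
    rng = rng or random.Random()
    delay = rng.uniform(min_delay, max_delay)  # irregular gaps look less robotic
    headers = {"User-Agent": rng.choice(USER_AGENTS)}
    return delay, headers
```

A caller would sleep for `delay` seconds (for example with `time.sleep`) and then send the request with `headers`.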

How many tweets can you scrape?

Author: Mohan Ganesan

Date: Feb 20, 2024

Twitter provides a useful public API for accessing Tweets, but it does have rate limits in place to prevent abuse. Here are some key factors to consider for optimizing your data collection and respecting user privacy.

How Google Leverages Data Collection Methods Like Web Scraping

Author: Mohan Ganesan

Date: Feb 20, 2024

Google relies on web scraping for data collection, SEO, AI models, Knowledge Graph, and local business info. However, this reliance raises ethical concerns.
