May 5th, 2020

Intro to Web Scraping with Python and BeautifulSoup

Today we are going to see how we can scrape Amazon reviews using Python and BeautifulSoup in a simple and elegant manner.

This article aims to get you started on real-world problem solving while keeping it super simple, so you get familiar with the process and see practical results as fast as possible.

So the first thing we need is to make sure we have Python 3 installed. If not, get Python 3 installed before you proceed.

Then you can install Beautiful Soup:

pip3 install beautifulsoup4

We will also need the requests, lxml, and soupsieve libraries to fetch the data, parse the HTML, and use CSS selectors. Install them with:

pip3 install requests soupsieve lxml

Once installed, open an editor and type in:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests

Now let's go to Amazon and inspect the data we can get. I have chosen a random USB charger page to scrape, located here.

This is how it looks.

The reviews section looks like this.

Back to our code now. Let's try to fetch this page while pretending to be a browser, like this:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}
url = 'https://www.amazon.com/Anker-Charger-PowerPort-PowerIQ-Foldable/dp/B071YMZ4LD/ref=pd_rhf_gw_s_gcx-rhf_0_3?_encoding=UTF8&ie=UTF8&pd_rd_i=B071YMZ4LD&pd_rd_r=K6T1ZZK2TV1653CYEFF5&pd_rd_w=pKJsd&pd_rd_wg=YKdY5&pf_rd_p=e4428c85-fd48-4538-856a-c8e08f1d4118&pf_rd_r=K6T1ZZK2TV1653CYEFF5&pf_rd_r=K6T1ZZK2TV1653CYEFF5&pf_rd_s=recent-history-footer&pf_rd_t=gateway&psc=1&refRID=K6T1ZZK2TV1653CYEFF5&smid=A294P4X9EWVXLJ#customerReviews'

response = requests.get(url, headers=headers)
print(response.text)

Save this as amazon_bs.py.

If you run it:

python3 amazon_bs.py

You will see the whole HTML page.

Now, let's use CSS selectors to get to the data we want. To do that, let's go back to Chrome and open the inspect tool. You can see that all the review title elements have a class called review-title.
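As a quick aside, here is how class selectors behave in BeautifulSoup on a tiny, made-up HTML fragment (the markup below is invented for illustration and only mimics the structure we saw in the inspector; it uses Python's built-in 'html.parser' so it needs no extra dependency):

```python
from bs4 import BeautifulSoup

# Invented fragment mimicking the review structure seen in the inspector
html = '''
<div class="review"><a class="review-title">Great charger</a></div>
<div class="review"><a class="review-title">Compact and fast</a></div>
'''

soup = BeautifulSoup(html, 'html.parser')

# '.review-title' matches every element carrying that class, in document order
titles = [el.get_text() for el in soup.select('.review-title')]
print(titles)  # → ['Great charger', 'Compact and fast']
```

The same `select('.review-title')` call is what we are about to run against the real page.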

Let's use CSS selectors to get this data like so.

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}
url = 'https://www.amazon.com/Anker-Charger-PowerPort-PowerIQ-Foldable/dp/B071YMZ4LD/ref=pd_rhf_gw_s_gcx-rhf_0_3?_encoding=UTF8&ie=UTF8&pd_rd_i=B071YMZ4LD&pd_rd_r=K6T1ZZK2TV1653CYEFF5&pd_rd_w=pKJsd&pd_rd_wg=YKdY5&pf_rd_p=e4428c85-fd48-4538-856a-c8e08f1d4118&pf_rd_r=K6T1ZZK2TV1653CYEFF5&pf_rd_r=K6T1ZZK2TV1653CYEFF5&pf_rd_s=recent-history-footer&pf_rd_t=gateway&psc=1&refRID=K6T1ZZK2TV1653CYEFF5&smid=A294P4X9EWVXLJ#customerReviews'

response = requests.get(url, headers=headers)

soup = BeautifulSoup(response.content, 'lxml')

print(soup.select('.review-title')[0].get_text())

This will print the title of the first review. We now need to get to all the reviews. Looking at the markup again, we notice that an element with the class 'review' holds each individual review's data together.

To get to them individually, we loop over the 'review' elements and look for the review title 'inside' each 'review':

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}
url = 'https://www.amazon.com/Anker-Charger-PowerPort-PowerIQ-Foldable/dp/B071YMZ4LD/ref=pd_rhf_gw_s_gcx-rhf_0_3?_encoding=UTF8&ie=UTF8&pd_rd_i=B071YMZ4LD&pd_rd_r=K6T1ZZK2TV1653CYEFF5&pd_rd_w=pKJsd&pd_rd_wg=YKdY5&pf_rd_p=e4428c85-fd48-4538-856a-c8e08f1d4118&pf_rd_r=K6T1ZZK2TV1653CYEFF5&pf_rd_r=K6T1ZZK2TV1653CYEFF5&pf_rd_s=recent-history-footer&pf_rd_t=gateway&psc=1&refRID=K6T1ZZK2TV1653CYEFF5&smid=A294P4X9EWVXLJ#customerReviews'

response = requests.get(url, headers=headers)

soup = BeautifulSoup(response.content, 'lxml')

for item in soup.select('.review'):
	try:
		print(item.select('.review-title')[0].get_text())
	except Exception:
		# some '.review' blocks may lack a title element; skip them
		print('')

And when you run it, you get the list of review titles.

Bingo!! We got the review titles.
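The scoping here matters: calling select on each 'review' element searches only inside that element, which is what keeps each title paired with the right review. A self-contained sketch with invented markup shows the idea:

```python
from bs4 import BeautifulSoup

# Invented fragment: two reviews, each with its own title and date
html = '''
<div class="review">
  <a class="review-title">Great charger</a>
  <span class="review-date">May 1, 2020</span>
</div>
<div class="review">
  <a class="review-title">Compact and fast</a>
  <span class="review-date">May 3, 2020</span>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

for item in soup.select('.review'):
    # select() on 'item' looks only inside this one review block
    title = item.select('.review-title')[0].get_text()
    date = item.select('.review-date')[0].get_text()
    print(title, '-', date)
```

If we instead called soup.select for each field separately at the top level, we would get two flat lists and would have to hope they line up; looping over '.review' blocks avoids that.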

Now, with the same process, we get the class names of all the other data: the review date, the star rating, the reviewer name, and the review text itself, like this:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}
url = 'https://www.amazon.com/Anker-Charger-PowerPort-PowerIQ-Foldable/dp/B071YMZ4LD/ref=pd_rhf_gw_s_gcx-rhf_0_3?_encoding=UTF8&ie=UTF8&pd_rd_i=B071YMZ4LD&pd_rd_r=K6T1ZZK2TV1653CYEFF5&pd_rd_w=pKJsd&pd_rd_wg=YKdY5&pf_rd_p=e4428c85-fd48-4538-856a-c8e08f1d4118&pf_rd_r=K6T1ZZK2TV1653CYEFF5&pf_rd_r=K6T1ZZK2TV1653CYEFF5&pf_rd_s=recent-history-footer&pf_rd_t=gateway&psc=1&refRID=K6T1ZZK2TV1653CYEFF5&smid=A294P4X9EWVXLJ#customerReviews'

response = requests.get(url, headers=headers)

soup = BeautifulSoup(response.content, 'lxml')

for item in soup.select('.review'):
	try:
		print('----------------------------------------')
		print(item.select('.review-title')[0].get_text())
		print(item.select('.review-date')[0].get_text())
		print(item.select('.review-rating')[0].get_text())
		print(item.select('.reviewText')[0].get_text())
		print(item.select('.a-profile-name')[0].get_text())
		print('----------------------------------------')
	except Exception:
		# skip '.review' blocks that are missing any of the fields
		print('')

That, when run, should print everything we need from each review.
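If you want the data somewhere more useful than stdout, the same loop can feed Python's built-in csv module instead of print. A minimal sketch (the field names and the stub HTML are my own choices for illustration; in the real script the soup would come from response.content):

```python
import csv
import io
from bs4 import BeautifulSoup

# Stub fragment standing in for the real response.content
html = '''
<div class="review">
  <a class="review-title">Great charger</a>
  <span class="review-date">May 1, 2020</span>
  <span class="a-profile-name">Alice</span>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# Collect one dict per review, skipping any block missing a field
rows = []
for item in soup.select('.review'):
    try:
        rows.append({
            'title': item.select('.review-title')[0].get_text().strip(),
            'date': item.select('.review-date')[0].get_text().strip(),
            'author': item.select('.a-profile-name')[0].get_text().strip(),
        })
    except IndexError:
        continue

# Write to CSV; swap the StringIO for open('reviews.csv', 'w', newline='')
# to get an actual file on disk
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['title', 'date', 'author'])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

DictWriter is handy here because each review becomes one dict, so adding a column later is just one more key and one more field name.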

If you want to use this in production and scale to thousands of links, you will find that Amazon blocks your IP quickly. In that scenario, using a rotating proxy service to rotate IPs is almost a must.

Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • With millions of high speed rotating proxies located all over the world
  • With our automatic IP rotation
  • With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
  • With our automatic CAPTCHA solving technology

Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed through a simple API, like the one below, from any programming language:

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
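From Python, the equivalent of that curl call is just another requests.get; the key and url query parameters below mirror the curl example (API_KEY is a placeholder for your real key). The sketch builds the request without sending it, so you can see the final URL:

```python
import requests

# Same parameters as the curl example; API_KEY is a placeholder
params = {'key': 'API_KEY', 'url': 'https://example.com'}

# Build (but don't send) the request to inspect the final URL
req = requests.Request('GET', 'http://api.proxiesapi.com/', params=params)
prepared = req.prepare()
print(prepared.url)

# To actually fetch through the proxy API:
# response = requests.get('http://api.proxiesapi.com/', params=params)
```

Passing the target URL via params lets requests handle the URL-encoding for you, which matters once the target URL itself contains query strings like the Amazon one above.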

We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
