May 4th, 2020
Scraping all the Links from a Website using Beautiful Soup

Here is a simple way to extract just the links from a website. This comes in handy when you want to count the number of links on a page, feed them to a high-speed web crawler, or run any other analysis on them.

First, we import Beautiful Soup and the requests module.

The code below fetches the HTML content of the URL with requests and hands it to Beautiful Soup so that we can query it.

from bs4 import BeautifulSoup
import requests

def getAllLinks(url):
    # Fetch the page and hand its HTML to Beautiful Soup, using the lxml parser
    html_page = requests.get(url)
    soup = BeautifulSoup(html_page.content, 'lxml')

Beautiful Soup has a findAll function, and we pass it the tag name 'a' to select all the hyperlinks on the page.

    links = []

    for link in soup.findAll('a'):
        href = link.get('href')
        links.append(href)

Notice how we append every link we find to a list.

Putting the whole code together, this is what it looks like.

from bs4 import BeautifulSoup
import requests

def getAllLinks(url):
    # Fetch the page and hand its HTML to Beautiful Soup, using the lxml parser
    html_page = requests.get(url)
    soup = BeautifulSoup(html_page.content, 'lxml')
    links = []

    # Skip anchors that have no href and links that only point within the page
    for link in soup.findAll('a'):
        href = link.get('href')
        if href and href[0] != '#':
            links.append(href)

    return links

print(getAllLinks("https://copyblogger.com"))

The check below ignores hyperlinks that only point to elements within the same page, as well as anchors that have no href at all.

        if href and href[0] != '#':
            links.append(href)

And when we run it, we simply get all the links.
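
Note that the hrefs come back exactly as they appear in the page, so relative paths like /blog stay relative. If you want absolute URLs, a minimal sketch using the standard library's urljoin could look like this (getAllAbsoluteLinks is just an illustrative name, not part of the code above):

from urllib.parse import urljoin

def getAllAbsoluteLinks(url):
    # Resolve every href returned by getAllLinks() against the base URL,
    # so relative paths like /about become full absolute URLs
    return [urljoin(url, href) for href in getAllLinks(url)]

print(getAllAbsoluteLinks("https://copyblogger.com"))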

In more advanced implementations, you will even need to rotate the User-Agent string so the website can't tell the requests are coming from the same browser! Welcome to web scraping.
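
As a rough sketch of what that can look like (the User-Agent strings below are only illustrative examples, and fetchWithRandomUserAgent is a hypothetical helper), you can pass a different User-Agent header to requests on each call:

import random
import requests

# A few illustrative User-Agent strings to rotate through
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def fetchWithRandomUserAgent(url):
    # Pick a random User-Agent for each request so successive
    # requests do not all advertise the same browser
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers)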

If you get a little more advanced, you will realize that the website can simply block your IP address, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.

Overcoming IP Blocks

Investing in a private rotating proxy service like Proxies API can, most of the time, make the difference between a web scraping project that gets the job done consistently and without headaches and one that never really works.

Plus, with the 1000 free API calls we are currently offering, you have almost nothing to lose by trying our rotating proxy and comparing notes. It takes only one line of integration, so it is hardly disruptive.

Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • With millions of high-speed rotating proxies located all over the world
  • With our automatic IP rotation
  • With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
  • With our automatic CAPTCHA solving technology

Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed with a simple API call like the one below, from any programming language.

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
