6 Ways you can scrape LinkedIn reliably

Mar 19th, 2023

There are many tools that can be used to scrape LinkedIn. Some are open source and others are extensions. I am going to avoid commercial tools as much as possible.

LinkedIn-Scraper (https://github.com/joeyism/linkedin-scraper)

LinkedIn-Scraper is a Python library that aims to simplify scraping LinkedIn profiles and public pages. It provides an interface to fetch LinkedIn profiles, search results, company profiles, and more. The library handles authentication, pagination, and HTML parsing, allowing users to focus on extracting the desired information.

Example Code:

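from linkedin_scraper import Person

# The library drives a browser, and LinkedIn generally requires a logged-in
# session, so check the project's README for the current setup steps.
person = Person("https://www.linkedin.com/in/example/")
print(person.name)
print(person.title)     # attribute names may differ between library versions
print(person.location)
# Access other profile attributes as needed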

Dux-Soup (https://www.dux-soup.com/)

Dux-Soup is a LinkedIn automation tool that includes scraping capabilities. It allows users to visit LinkedIn profiles, extract data, and save it to CSV or other formats. Dux-Soup operates within the browser and provides a point-and-click interface for scraping LinkedIn data.

Snov.io (https://snov.io/)

Snov.io is an email finder and verifier tool that also provides LinkedIn scraping capabilities. It offers a Chrome extension that allows users to extract LinkedIn profiles, including personal and professional details such as names, titles, and locations.

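PhantomJS (https://phantomjs.org/)

PhantomJS is a headless web browser that can be used for scraping dynamic web pages, including LinkedIn. It allows users to interact with web pages, execute JavaScript, and extract data. PhantomJS is scripted natively in JavaScript and can be driven from other languages such as Python through Selenium. Note that PhantomJS development was suspended in 2018 and Selenium 4 removed its PhantomJS driver, so for new projects a headless Chrome or Firefox is usually the better choice.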

Example Code (Python with Selenium and PhantomJS):

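from selenium import webdriver

# webdriver.PhantomJS exists only in Selenium 3.x and needs the phantomjs
# binary on your PATH; it was removed in Selenium 4.
driver = webdriver.PhantomJS()
driver.get('https://www.linkedin.com/')
# Code to interact with the page and extract desired information
driver.quit()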

Scrapy (https://scrapy.org/)

Scrapy is a powerful web scraping framework written in Python. While not specific to LinkedIn, it can be used to scrape LinkedIn data. Scrapy provides a robust and flexible environment for building web scrapers, allowing users to define rules and pipelines for extracting and processing data.

Example Code:

import scrapy
from scrapy.crawler import CrawlerProcess


class LinkedInSpider(scrapy.Spider):
    name = 'linkedin'
    start_urls = ['https://www.linkedin.com/']

    def parse(self, response):
        # Replace this with selectors for the data you need;
        # as a placeholder we only yield the page title.
        yield {'title': response.css('title::text').get()}


# Instantiate and run the spider
process = CrawlerProcess()
process.crawl(LinkedInSpider)
process.start()
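
The pipelines mentioned above are plain Python classes with a `process_item` method that Scrapy calls for every item a spider yields. The following is a minimal sketch, not part of the original example: the class name `CleanProfilePipeline` and the field names `name`, `title`, and `location` are assumptions chosen to match the profile fields used elsewhere in this article.

```python
class CleanProfilePipeline:
    """Hypothetical pipeline that tidies whitespace in scraped profile fields."""

    # Field names are assumptions for this sketch; use whatever your spider yields.
    FIELDS = ("name", "title", "location")

    def process_item(self, item, spider):
        for field in self.FIELDS:
            value = item.get(field)
            if isinstance(value, str):
                # Collapse newlines and repeated spaces left over from the HTML.
                item[field] = " ".join(value.split())
        # Returning the item passes it on to the next pipeline (or the exporter).
        return item
```

To try it, you would list the class in the `ITEM_PIPELINES` setting of your project, for example under a hypothetical path such as `myproject.pipelines.CleanProfilePipeline`.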

Using Python and Beautiful Soup

import requests
from bs4 import BeautifulSoup

url = 'https://www.linkedin.com/in/example/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

response = requests.get(url, headers=headers, timeout=10)
soup = BeautifulSoup(response.content, 'html.parser')


def text_of(tag, class_name):
    # LinkedIn changes its class names often and may serve a login page,
    # so return None instead of crashing when the element is missing.
    element = soup.find(tag, class_=class_name)
    return element.text.strip() if element else None


name = text_of('li', 'inline t-24 t-black t-normal break-words')
title = text_of('h2', 'mt1 t-18 t-black t-normal break-words')
location = text_of('li', 't-16 t-black t-normal inline-block')

print("Name:", name)
print("Title:", title)
print("Location:", location)

This is great as a learning exercise, but it is easy to see that the script itself is prone to getting blocked because every request comes from a single IP. If you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.
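
To make the idea of IP rotation concrete, here is a minimal sketch of doing it by hand: it cycles through a pool of proxies and retries a blocked fetch through the next one. The proxy addresses, the function name `fetch_with_rotation`, and the injected `fetch` callable are all assumptions made for illustration; in a real script `fetch` would wrap something like `requests.get(url, proxies={"https": proxy})`.

```python
import itertools
from typing import Callable, Optional, Sequence

# Hypothetical proxy pool; replace with proxies you actually have access to.
PROXIES = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]


def fetch_with_rotation(
    url: str,
    fetch: Callable[[str, str], Optional[str]],
    proxies: Sequence[str] = PROXIES,
    max_attempts: int = 5,
) -> Optional[str]:
    """Try `url` through successive proxies until one returns a body.

    `fetch(url, proxy)` is an injected callable (an assumption of this sketch)
    that returns the page body, or None when the request was blocked or failed.
    """
    if not proxies:
        raise ValueError("at least one proxy is required")
    pool = itertools.cycle(proxies)  # round-robin over the pool
    for _ in range(max_attempts):
        proxy = next(pool)
        body = fetch(url, proxy)
        if body is not None:
            return body
    return None  # every attempt was blocked
```

Passing the fetcher in as a parameter keeps the rotation logic separate from the HTTP library, which also makes it easy to test with a stand-in function. A small hand-rolled pool like this still gets blocked once every address in it is flagged, which is where a large managed pool helps.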

Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

With millions of high-speed rotating proxies located all over the world, automatic IP rotation, automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions), and automatic CAPTCHA solving technology, hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed by a simple API like below in any programming language.

In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes. You can just get the data and parse it in any language like Node or PHP, or with any framework like Scrapy or Nutch. In all these cases you can just call the URL with render support like so.

curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"
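
The same call can be built from Python. This is a minimal sketch that only assembles the request URL from the `key`, `render`, and `url` parameters shown in the curl command; the helper name `build_request_url` is an assumption, as is the choice to leave out `render` when rendering is not wanted.

```python
from urllib.parse import urlencode

# Endpoint taken from the curl example above.
API_ENDPOINT = "http://api.proxiesapi.com/"


def build_request_url(api_key: str, target_url: str, render: bool = True) -> str:
    """Return the API URL that fetches `target_url` through the proxy service."""
    params = {"key": api_key, "url": target_url}
    if render:
        params["render"] = "true"
    # urlencode percent-escapes the target URL so that its own query string
    # is not mistaken for parameters of the API call.
    return f"{API_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_request_url("API_KEY", "https://example.com"))
```

The resulting string can be passed to `requests.get` or any other HTTP client, and the response body parsed exactly as in the Beautiful Soup example.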

We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
