Retrieving and Parsing Text from URLs with Python's urllib

Feb 8, 2024 · 2 min read

The urllib package in Python's standard library provides tools for retrieving content from URLs, which you can then parse with other modules. Because it ships with Python, there is nothing extra to install.

Fetching Text Content

To fetch text content from a URL, you can use urllib.request.urlopen():

import urllib.request

with urllib.request.urlopen('http://example.com') as response:
    html = response.read()

This opens the URL, reads the response body as a bytes object, and stores it in the html variable. To work with it as text, decode it first, for example with html.decode('utf-8').
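Since read() returns bytes, a common next step is decoding with the charset the server declares. The following is a minimal sketch under stated assumptions: the helper names decode_body and fetch_text, the 10-second timeout, and the UTF-8 fallback are illustrative choices, not part of urllib itself.

```python
import urllib.request
from typing import Optional


def decode_body(raw: bytes, charset: Optional[str] = None) -> str:
    """Decode response bytes, falling back to UTF-8 when no charset is given."""
    # errors="replace" substitutes malformed bytes instead of raising.
    return raw.decode(charset or "utf-8", errors="replace")


def fetch_text(url: str, timeout: float = 10.0) -> str:
    """Fetch a URL and return its body as a str (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # get_content_charset() reads the charset from the Content-Type
        # header and returns None when the server does not declare one.
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)


# Example call (requires network access):
# text = fetch_text("http://example.com")
```

Keeping the decoding step in its own function makes it easy to check without touching the network.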

You can also read line by line by treating the response as a file object. Each line is a bytes object that still includes its trailing newline:

with urllib.request.urlopen('http://example.com') as response:
    for line in response:
        print(line)
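If you want clean text lines instead of raw bytes, one option is a small generator that decodes each line and strips the line ending. This is a sketch, assuming UTF-8 content; iter_text_lines is a hypothetical helper name, and it works on any binary file-like object, including the response returned by urlopen().

```python
import io
from typing import BinaryIO, Iterator


def iter_text_lines(stream: BinaryIO, encoding: str = "utf-8") -> Iterator[str]:
    """Yield decoded lines from a binary stream without trailing newlines."""
    for raw_line in stream:
        # Each raw_line is bytes ending in b"\n" or b"\r\n" (except possibly
        # the last one), so decode it and strip the line terminator.
        yield raw_line.decode(encoding).rstrip("\r\n")


# Demonstration with an in-memory stream standing in for a response:
sample = io.BytesIO(b"first line\r\nsecond line\n")
lines = list(iter_text_lines(sample))

# With a real response (requires network access):
# with urllib.request.urlopen("http://example.com") as response:
#     for line in iter_text_lines(response):
#         print(line)
```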

Parsing Text

Once you have retrieved the text content, you may want to parse it to extract relevant information.

For example, to parse HTML you can use a parser like Beautiful Soup. To parse JSON, you can use the built-in json module.
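Beautiful Soup is a third-party package, so as a dependency-free illustration here is a rough sketch that swaps in the standard library's html.parser to pull out a page's title. The TitleParser class and extract_title function are hypothetical names chosen for this example; for real-world, messy HTML a dedicated parser like Beautiful Soup is usually more forgiving.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only accumulate text while we are between <title> and </title>.
        if self._in_title:
            self.title += data


def extract_title(html_text: str) -> str:
    """Return the stripped <title> text, or an empty string if none exists."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title.strip()
```

The function expects a str, so decode the bytes returned by response.read() before passing them in.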

Here's an example parsing JSON from a URL:

import json
import urllib.request

with urllib.request.urlopen("http://api.example.com") as response:
    # Decode the bytes (UTF-8 by default) and parse the JSON text
    data = json.loads(response.read().decode())
    print(data["key"])

This fetches the JSON data, decodes the bytes to text, parses it to a Python dict with json.loads(), and accesses a key's value.
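In practice a response may not be valid JSON, or may lack the key you expect. A defensive variant might look like the sketch below; extract_key is a hypothetical helper, and returning a default value on failure is just one possible design choice (raising an exception is equally reasonable).

```python
import json
from typing import Any


def extract_key(raw: bytes, key: str, default: Any = None,
                encoding: str = "utf-8") -> Any:
    """Parse JSON bytes and return data[key], or default on any problem."""
    try:
        data = json.loads(raw.decode(encoding))
    except ValueError:
        # Both UnicodeDecodeError and json.JSONDecodeError subclass ValueError.
        return default
    # The top-level JSON value may be a list, string, or number, not a dict.
    if isinstance(data, dict):
        return data.get(key, default)
    return default


# Example with a real response (requires network access):
# with urllib.request.urlopen("http://api.example.com") as response:
#     value = extract_key(response.read(), "key")
```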

Handling Errors

Make sure to wrap calls to urlopen() in try/except blocks to handle errors gracefully:

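import urllib.error
import urllib.request

try:
    with urllib.request.urlopen('http://example.com') as response:
        html = response.read()
except urllib.error.URLError as e:
    # HTTPError is a subclass of URLError, so it is caught here too
    print(f"URL Error: {e.reason}")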

This catches common problems such as connection failures, HTTP error responses, and redirect loops.
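When you need to tell an HTTP status error apart from a lower-level connection failure, you can check for urllib.error.HTTPError before the more general URLError. The sketch below illustrates the idea; describe_error and fetch_or_none are hypothetical helper names, and the 10-second timeout is an arbitrary assumption.

```python
import urllib.error
import urllib.request
from typing import Optional


def describe_error(exc: Exception) -> str:
    """Turn a urllib exception into a short human-readable message."""
    # HTTPError subclasses URLError, so it has to be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        return f"HTTP error {exc.code}: {exc.reason}"
    if isinstance(exc, urllib.error.URLError):
        return f"URL error: {exc.reason}"
    return f"Unexpected error: {exc}"


def fetch_or_none(url: str, timeout: float = 10.0) -> Optional[bytes]:
    """Return the response body, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as exc:
        print(describe_error(exc))
        return None


# Example call (requires network access):
# body = fetch_or_none("http://example.com")
```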

Overall, urllib offers a straightforward way to programmatically access text content from the web in Python without needing third-party libraries.
