urllib read

Feb 8, 2024 · 2 min read

The urllib module in Python provides useful functionality for retrieving data from URLs. This allows you to easily download and read web pages into your Python programs.

Fetching Web Pages

To fetch a web page, you first need to import urllib.request:

import urllib.request

Then you can use urllib.request.urlopen() to open a handle to the page. For example:

with urllib.request.urlopen('http://example.com') as response:
   html = response.read()

This reads the raw HTML content into the html variable as a bytes object.
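Note that read() also accepts a size argument, so a large response can be consumed in chunks rather than loaded into memory all at once. A minimal sketch (the helper name read_in_chunks is illustrative):

```python
import urllib.request

def read_in_chunks(url, chunk_size=8192):
    """Read a response body in fixed-size chunks and return the full bytes."""
    chunks = []
    with urllib.request.urlopen(url) as response:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:  # empty bytes means end of stream
                break
            chunks.append(chunk)
    return b''.join(chunks)
```

This is handy when you want to write a large download to disk as it arrives instead of holding the whole body in memory.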

Decoding and Parsing

Since the content is bytes, you'll typically want to decode it to a string:

html = html.decode() 
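Calling decode() with no argument assumes UTF-8, which can fail on pages that use another encoding. A safer sketch is to ask the response headers for the declared charset and fall back to UTF-8 (the helper name fetch_text is illustrative):

```python
import urllib.request

def fetch_text(url):
    """Fetch a URL and decode the body using the declared charset, if any."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()
        # get_content_charset() reads the charset from the Content-Type
        # header; it returns None when the server doesn't declare one.
        charset = response.headers.get_content_charset() or 'utf-8'
        return raw.decode(charset)
```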

Now you can parse or process the HTML however you want, such as extracting data or searching the content.

The html.parser module in the standard library can parse HTML. The base HTMLParser class accepts input through feed(), though its handler methods do nothing until you override them in a subclass:

from html.parser import HTMLParser
parser = HTMLParser()
parser.feed(html)
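On its own, the base HTMLParser silently discards what it sees; to extract anything you subclass it and override the handler methods you need. A sketch that collects link targets (the class name LinkCollector is illustrative):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

parser = LinkCollector()
parser.feed('<a href="http://example.com">Example</a>')
print(parser.links)  # ['http://example.com']
```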

Handling Errors

urllib raises exceptions for problems like invalid URLs or network failures. You can handle them with try/except:

import urllib.error

try:
   with urllib.request.urlopen('http://badurl') as response:
      html = response.read()
except urllib.error.URLError as e:
   print(f"Failed with error: {e}")

This prints a nice error message instead of crashing your program.
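If you want to distinguish failure modes, urllib.error defines HTTPError (the server answered with an error status like 404) as a subclass of URLError (the request never got a usable answer, e.g. a DNS failure), so the more specific exception is caught first. A sketch (the fetch helper is illustrative):

```python
import urllib.error
import urllib.request

def fetch(url):
    """Return the response body, or None after printing why the fetch failed."""
    try:
        with urllib.request.urlopen(url) as response:
            return response.read()
    except urllib.error.HTTPError as e:
        # The server responded, but with an error status code.
        print(f"Server error: {e.code} {e.reason}")
    except urllib.error.URLError as e:
        # The server could not be reached at all.
        print(f"Failed to reach server: {e.reason}")
```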

Practical Example: Checking Broken Links

A handy use case is writing a web crawler that checks for broken links by trying to open URLs and catching errors. This can help find dead pages on your site.
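A minimal sketch of such a checker, assuming you already have a plain list of URLs to test (the function name find_broken_links is illustrative):

```python
import urllib.error
import urllib.request

def find_broken_links(urls):
    """Return the subset of urls that could not be opened."""
    broken = []
    for url in urls:
        try:
            # A timeout keeps one dead server from stalling the whole run.
            with urllib.request.urlopen(url, timeout=10):
                pass
        except (urllib.error.URLError, ValueError):
            # URLError covers network/HTTP failures; ValueError covers
            # malformed URLs with no recognizable scheme.
            broken.append(url)
    return broken
```

A fuller crawler would also extract links from each fetched page (for example with an HTMLParser subclass) and feed them back into this check.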

The urllib module provides the capabilities needed to easily fetch and read web pages in Python. With some parsing and error handling, it enables practical programs for web scraping, checking links, and more.
