Decoding URL Responses with Python's urllib

When you fetch data from a URL with Python's urllib module, the response body is returned as bytes. To work with that data as text, you decode the bytes into a string using the character encoding the server used. The bytes.decode() and str.encode() methods handle the conversion in each direction.

import urllib.request

response = urllib.request.urlopen("http://example.com")
html_bytes = response.read() # Read response body as bytes

To decode bytes to a string, we need to know the character encoding that was used to encode the bytes. Common encodings are UTF-8, ASCII, and Latin-1.

We can usually find the encoding in the response headers:

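encoding = response.headers.get_content_charset("utf-8") # Get encoding from headers, default to UTF-8

If the headers do not specify an encoding, get_content_charset() would otherwise return None, and decode(None) raises a TypeError. Passing "utf-8" as the fallback is a reasonable default, since most modern pages use it.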
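To see how the charset lookup behaves without making a network request, here is a minimal sketch that parses a Content-Type value offline. It assumes the header object behaves like email.message.Message, which is the base class of the headers object urllib returns; the helper name charset_from_content_type is hypothetical.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset declared in a Content-Type value, or `default`.

    Hypothetical helper for illustration; it mirrors what
    response.headers.get_content_charset() does on a real response.
    """
    headers = Message()
    headers["Content-Type"] = content_type
    # failobj is returned when the header has no charset parameter
    return headers.get_content_charset(failobj=default)


print(charset_from_content_type("text/html; charset=ISO-8859-1"))  # iso-8859-1
print(charset_from_content_type("text/html"))                      # utf-8
```

Note that the charset name comes back lowercased, which is fine because Python's codec lookup is case-insensitive.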

Once we have the encoding, we can decode the bytes:

html_string = html_bytes.decode(encoding) # Decode bytes
print(html_string)

The decode() method converts bytes to a string using the provided encoding.
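Servers sometimes declare the wrong encoding, or one Python does not recognize. One possible way to guard against that is sketched below; the helper name decode_body and the choice to fall back to UTF-8 with replacement characters are assumptions for illustration, not the only reasonable policy.

```python
def decode_body(body, declared_charset=None, default="utf-8"):
    """Decode response bytes, falling back to `default` on failure.

    Hypothetical helper: tries the declared charset first, then decodes
    with the default encoding, replacing undecodable bytes with U+FFFD.
    """
    encoding = declared_charset or default
    try:
        return body.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # UnicodeDecodeError: bytes are invalid for this encoding
        # LookupError: the encoding name is unknown to Python
        return body.decode(default, errors="replace")


print(decode_body(b"caf\xc3\xa9"))            # café
print(decode_body(b"caf\xe9", "latin-1"))     # café
```

With errors="replace" the call never raises, at the cost of possibly losing characters, so it suits display or logging better than exact data processing.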

The reverse operation, encode(), converts a string into bytes:

data = "hello world"
data_bytes = data.encode(encoding) # Encode string to bytes
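The encoding matters because the same string can map to different bytes. The short sketch below, using an example string chosen for illustration, shows a round trip and what happens when the bytes are decoded with the wrong encoding.

```python
text = "café"

utf8_bytes = text.encode("utf-8")      # é becomes two bytes
latin1_bytes = text.encode("latin-1")  # é becomes one byte

print(utf8_bytes)    # b'caf\xc3\xa9'
print(latin1_bytes)  # b'caf\xe9'

# Decoding with the same encoding restores the original string
round_trip = utf8_bytes.decode("utf-8")

# Decoding with a different encoding silently produces garbled text
mismatched = utf8_bytes.decode("latin-1")
print(mismatched)    # cafÃ©
```

Latin-1 can decode any byte sequence without raising an error, so a mismatch like this shows up as garbled characters rather than an exception.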

When posting data to a URL, the data usually needs to be URL-encoded (percent-encoded) first and then converted to bytes, because urllib only accepts a request body as bytes:

from urllib.parse import quote_plus 

data = "hello world"
url_encoded_data = quote_plus(data) # URL encode string
data_bytes = url_encoded_data.encode(encoding) # Encode to bytes
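For form submissions with several fields, urllib.parse.urlencode() is a common alternative to calling quote_plus() on each value. The sketch below builds a POST request without sending it; the URL path and field names are placeholders chosen for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder form fields; urlencode applies quote_plus to each value
form_fields = {"q": "hello world", "lang": "en"}

# URL-encoded output is plain ASCII, so encoding to bytes is lossless
body = urlencode(form_fields).encode("ascii")
print(body)  # b'q=hello+world&lang=en'

# Supplying data makes this a POST request; nothing is sent yet
request = Request(
    "http://example.com/search",  # hypothetical endpoint
    data=body,
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)
print(request.get_method())  # POST
```

Passing the request object to urllib.request.urlopen() would send it, and the response body can then be decoded as described earlier.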

In summary, urllib request and response bodies are bytes, and encode() and decode() convert between those bytes and strings. Specifying the correct text encoding is the key to avoiding decoding errors and garbled text.
