Automating Downloads in Python with urllib and wget

Feb 8, 2024 · 3 min read

Python provides several modules for programmatically downloading files and web content. Two commonly used modules are urllib and wget. While they share some overlapping functionality, each has unique capabilities that make them useful in different scenarios.

urllib - Downloading in Pure Python

The urllib module is part of Python's standard library, making it widely available without any extra installations. urllib provides functions for fetching URLs, handling redirects, parsing response data, encoding/decoding URLs, and more.

A basic example of using urllib to download a file:

import urllib.request

url = 'http://example.com/file.zip'
urllib.request.urlretrieve(url, 'file.zip') 

This downloads the file from the URL and saves it locally as file.zip.

Some key advantages of urllib:

  • No extra dependencies - included with Python by default
  • More control from within Python code
  • Supports FTP, file, and HTTP/HTTPS URLs
  • Handles redirects, proxies, cookies, and compression
  • Powerful URL encoding/decoding functions
  • Extensible with custom URL opener objects

The main downside is that the API involves dealing with lower-level details instead of a simple download interface. But overall, urllib excels when you need downloading capabilities directly from Python.
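To see what "more control" looks like in practice, you can build a Request object with custom headers and hand it to urlopen (urlretrieve has no header hook). A minimal sketch - the URL and User-Agent string here are placeholders:

```python
import urllib.request

# Hypothetical URL and header value, for illustration only
url = 'http://example.com/file.zip'
req = urllib.request.Request(url, headers={'User-Agent': 'my-downloader/1.0'})

# urlopen() honors the custom header:
# with urllib.request.urlopen(req) as resp:
#     data = resp.read()
```

The same Request object can also carry POST data, a different HTTP method, or extra headers like Referer, which is where urllib pulls ahead of a plain urlretrieve call.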

wget - Feature-rich Command Line Tool

wget is a popular command line program available on Linux, macOS, and Windows (via wget for Windows). wget can download web content and files, but it also has advanced capabilities like:

  • Resume interrupted downloads
  • Recursively download page contents
  • Download galleries/sections
  • Spider through links to capture entire sites
  • Restrict downloads based on rules
  • Authentication, cookies, and sessions
  • Adaptive bandwidth throttling
  • Output logging and reporting

This power and flexibility have made wget a go-to tool for web scraping and archiving websites.

To simply download a file with wget:

wget http://example.com/file.zip

Some advantages of using wget:

  • Robust feature set for complex jobs
  • Handles unstable connections reliably
  • Scripting capabilities to automate tasks
  • No Python package to install - runs as a standalone binary

The main downside is that it's an external command-line tool, so you need to execute wget and then parse its output from Python code.
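Driving wget from Python usually means the subprocess module. A minimal sketch, assuming wget is on your PATH (the URLs and function names here are mine, not part of any library):

```python
import subprocess

def build_wget_cmd(url, output, resume=False):
    """Assemble a wget command line: -q quiet, -O output file,
    -c continue a partial download."""
    cmd = ['wget', '-q', '-O', output]
    if resume:
        cmd.append('-c')
    cmd.append(url)
    return cmd

def wget_download(url, output, resume=False):
    # returncode 0 means success; wget writes its messages to stderr
    return subprocess.run(build_wget_cmd(url, output, resume),
                          capture_output=True, text=True)
```

Checking the returned CompletedProcess's returncode and stderr is usually more reliable than scraping stdout.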

Choosing the Right Tool

So which one should you use? Here are some guidelines:

  • urllib - you need downloading fully inside Python code, and less complexity is better
  • wget - you require advanced features like recursive crawling, or want a battle-tested tool
  • Both - use urllib for simple one-off downloads and wget for the heavy lifting

The great news is you can choose either tool based on your specific requirements. And it's totally fine to use both in conjunction when building an application with lots of data scraping or processing.

Practical Examples

Let's look at some practical code snippets for common use cases.

Download a file only if the local copy is stale, using urllib:

import urllib.request
import time
import os.path

url = 'http://example.com/data.csv'
file = 'data.csv'

# Re-download when the local copy is missing or more than a day old
if not os.path.exists(file) or (os.path.getmtime(file) < time.time() - 86400):
    urllib.request.urlretrieve(url, file)

This re-downloads the file only when the local copy is missing or more than one day old.
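Instead of a fixed one-day window, you can also compare against the server's Last-Modified header. A sketch assuming the server actually sends that header (the helper name and URLs are mine):

```python
import urllib.request
from email.utils import parsedate_to_datetime

def remote_is_newer(last_modified_header, local_mtime):
    """Compare an HTTP Last-Modified value to a local file's mtime."""
    remote_ts = parsedate_to_datetime(last_modified_header).timestamp()
    return remote_ts > local_mtime

# Usage sketch: issue a HEAD request and compare before downloading
# req = urllib.request.Request(url, method='HEAD')
# with urllib.request.urlopen(req) as resp:
#     if remote_is_newer(resp.headers['Last-Modified'], os.path.getmtime(file)):
#         urllib.request.urlretrieve(url, file)
```

A HEAD request fetches only headers, so the comparison costs almost nothing before you commit to the full download.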

Resume a failed download with wget:

wget -c http://example.com/large_file.zip

The -c flag continues the download instead of starting from scratch.
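urllib has no built-in equivalent of -c, but you can sketch a resume with an HTTP Range header - assuming the server supports byte ranges and replies 206 Partial Content (the URL and function names here are mine):

```python
import os
import urllib.request

def build_range_request(url, start):
    """Build a Request asking for bytes from `start` onward."""
    req = urllib.request.Request(url)
    if start:
        req.add_header('Range', f'bytes={start}-')
    return req

def resume_download(url, path):
    # Append any bytes past what we already have on disk
    start = os.path.getsize(path) if os.path.exists(path) else 0
    req = build_range_request(url, start)
    with urllib.request.urlopen(req) as resp, open(path, 'ab') as f:
        f.write(resp.read())
```

This is one of those cases where wget's battle-tested -c is simpler; the urllib version is worth it only when everything must stay inside Python.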

So take advantage of both urllib and wget for your Python downloading tasks. Choose the right tool or combine them as needed to create robust solutions.
