Extracting URLs from Text in Python

Feb 20, 2024 · 2 min read

When working with text data in Python, you may need to identify and extract URLs (web addresses) from strings and text documents. Python's standard library includes modules such as re and urllib.parse that help you detect, extract, and validate links in text.

Using Regular Expressions

One of the most common ways to find URLs is with regular expressions (regex). Here is an example regex pattern that will match most URLs:

import re

url_regex = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"

text = "Visit my blog at https://www.myblog.com and my wiki at http://example.wiki.org!" 

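# The pattern contains capture groups, so re.findall would return tuples; take each full match instead
print([match.group(0) for match in re.finditer(url_regex, text)])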

This will print out a list of all matches:

['https://www.myblog.com', 'http://example.wiki.org']

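The pattern is case-insensitive and matches URLs that start with "http://" or "https://", hosts that start with "www.", and bare domains followed by a slash (such as "example.com/page"). It also leaves trailing punctuation, like the "!" in the example, out of the match. Note that the scheme-less form only recognizes domain suffixes of two to four letters, so a longer suffix such as ".online" needs an explicit scheme or a "www." prefix to match.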
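If you only care about links with an explicit scheme, a much shorter pattern may be enough. The following is a minimal sketch, not the pattern shown above: SIMPLE_URL_RE and extract_urls are hypothetical names, and the set of trailing characters stripped is an assumption you may need to adjust for your data.

```python
import re

# Simplified, hypothetical pattern: only matches URLs with an explicit http(s) scheme
SIMPLE_URL_RE = re.compile(r'https?://[^\s<>"]+', re.IGNORECASE)

def extract_urls(text):
    """Return the unique URLs in text, in order of first appearance."""
    seen = {}
    for match in SIMPLE_URL_RE.finditer(text):
        # Assumption: trailing punctuation belongs to the sentence, not the URL
        url = match.group(0).rstrip(".,!?;:)")
        seen.setdefault(url, None)  # dicts preserve insertion order
    return list(seen)

sample = "Docs: https://example.com/docs. Again: https://example.com/docs, and http://example.org/a?b=1!"
print(extract_urls(sample))
# ['https://example.com/docs', 'http://example.org/a?b=1']
```

Stripping a trailing ")" is a trade-off: it cleans up links written inside parentheses but will also truncate URLs that legitimately end with one.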

Validating URLs

We can take it a step further and check that extracted strings are well-formed URLs using the urlparse function from Python's urllib.parse module:

from urllib.parse import urlparse

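def is_valid_url(url):
    try:
        result = urlparse(url)
        # An absolute URL needs both a scheme and a network location
        return all([result.scheme, result.netloc])
    except ValueError:
        # urlparse raises ValueError for some malformed inputs, e.g. an unclosed IPv6 bracket
        return False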

print(is_valid_url("https://example.com")) # True
print(is_valid_url("example")) # False

This checks only that the string has a scheme such as "http" and a network location (the host part). It does not confirm that the host exists or that the URL is reachable.
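Because any scheme passes that check, strings such as "ftp://..." or "mailto:..." URLs with a host would also count as valid. If you only want web links, a stricter variant might look like the sketch below. The name is_http_url and the ALLOWED_SCHEMES set are assumptions for illustration, not part of urllib.

```python
from urllib.parse import urlparse

# Assumption: only ordinary web links are wanted
ALLOWED_SCHEMES = {"http", "https"}

def is_http_url(url):
    """Return True if url parses as an http(s) URL with a host name."""
    try:
        result = urlparse(url)
        # urlparse lowercases the scheme, so "HTTPS://..." is accepted too
        return result.scheme in ALLOWED_SCHEMES and bool(result.hostname)
    except ValueError:
        return False

print(is_http_url("https://example.com"))         # True
print(is_http_url("ftp://example.com/file.txt"))  # False
print(is_http_url("example"))                     # False
```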

Practical Usage

Some use cases where you may want to find URLs:

  • Extracting links from a scraped web page to crawl
  • Validating user-entered URLs from a form
  • Finding malicious links in chat messages
  • Gathering anchors from Markdown/HTML documents
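The key is choosing the right technique based on your data source and end goal. Regex is flexible, but a pattern with nested quantifiers, like the one above, can backtrack heavily and slow down on large or adversarial inputs, and it can still produce false positives, so validate the results where accuracy matters.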
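For HTML input, one alternative to regex is to read the links from the markup itself. The following is a minimal sketch using the standard library's html.parser module; the LinkCollector class name and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

html_doc = '<p>See <a href="https://example.com/docs">docs</a> and <a href="/about">about</a>.</p>'
collector = LinkCollector()
collector.feed(html_doc)
print(collector.links)
# ['https://example.com/docs', '/about']
```

Relative links such as "/about" come back as written, so they would still need to be resolved against the page's address (for example with urllib.parse.urljoin) before crawling.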

Hopefully this gives you a starter kit for effectively detecting links in text with Python!
