Parsing XML with BeautifulSoup

Oct 6, 2023 · 5 min read

While BeautifulSoup is designed mainly for parsing HTML, it also handles XML documents well once you select an XML parser. Here's how to use BeautifulSoup to scrape and analyze XML files or responses.

Loading the XML

Loading an XML document into a BeautifulSoup object is the same process as with HTML:

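from bs4 import BeautifulSoup

# Open in binary mode so the parser can honor the encoding declared in the XML prolog
with open("file.xml", "rb") as f:
    data = f.read()

# The second argument selects the parser
soup = BeautifulSoup(data, "xml")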

Notice that we explicitly tell it to use the "xml" parser. This parser is provided by the lxml package, which must be installed alongside BeautifulSoup.
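To see why the parser choice matters, here is a minimal sketch comparing the "xml" parser with Python's built-in "html.parser" on the same input. The sample markup and variable names are made up for illustration; the point is only that the XML parser keeps tag and attribute names exactly as written, while the HTML parser lowercases them.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with mixed-case names, as is common in XML
data = "<Catalog><Book ID='1'><Author>Thoreau</Author></Book></Catalog>"

xml_soup = BeautifulSoup(data, "xml")           # requires lxml
html_soup = BeautifulSoup(data, "html.parser")  # built into Python

# find_all(True) returns every tag in document order
xml_names = [tag.name for tag in xml_soup.find_all(True)]
html_names = [tag.name for tag in html_soup.find_all(True)]

print(xml_names)   # case preserved
print(html_names)  # lowercased by the HTML parser

# Searches against the XML tree are case-sensitive
book = xml_soup.find("Book")
print(book["ID"])
print(xml_soup.find("book"))  # no match, so this is None
```

If your XML uses mixed-case names, parsing it with an HTML parser can silently break lookups, so passing "xml" is usually the safer choice.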

Navigating the Tree

You can navigate and search the parsed XML tree using the same methods as HTML:

titles = soup.find_all("title")

first_title = titles[0]
print(first_title.text)

Tag and attribute names match those defined in the XML exactly, including case.
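The navigation calls above can be sketched end to end as follows. The small <library> document is an assumption made up for this example. It also shows that find() returns None when nothing matches, which is worth checking before reading .text.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document for illustration
xml = """<library>
  <song length="short"><title>Intro</title></song>
  <song length="long"><title>Epic</title></song>
</library>"""

soup = BeautifulSoup(xml, "xml")

# find_all returns a list of every matching tag
titles = [t.get_text() for t in soup.find_all("title")]
print(titles)

# find returns only the first match
first_song = soup.find("song")
print(first_song["length"])

# find returns None when there is no match, so guard before using the result
missing = soup.find("album")
album_name = missing.get_text() if missing is not None else "(no album)"
print(album_name)
```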

Finding by Attributes

Searching by attributes works the same way:

songs = soup.find_all("song", {"length": "short"})

This finds all <song> tags whose "length" attribute equals "short".
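A short sketch of a few equivalent ways to filter by attribute, using an invented <library> document. The dictionary form, the keyword form, and the attrs={"name": True} form (which matches any tag that has the attribute at all) are standard find_all usage; the sample data and variable names are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document; the third song has no length attribute
xml = """<library>
  <song length="short"><title>Intro</title></song>
  <song length="long"><title>Epic</title></song>
  <song><title>Untitled Demo</title></song>
</library>"""

soup = BeautifulSoup(xml, "xml")

# Dictionary form, as in the article
short_songs = soup.find_all("song", {"length": "short"})

# Keyword form, equivalent for simple attribute names
same_songs = soup.find_all("song", length="short")

# True matches any value: every song that has a length attribute
with_length = soup.find_all("song", attrs={"length": True})

# .get() avoids a KeyError when the attribute is absent
lengths = [song.get("length", "unknown") for song in soup.find_all("song")]

print([s.find("title").get_text() for s in short_songs])
print(len(with_length))
print(lengths)
```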

Modifying the Tree

You can also modify and add to the XML tree:

new_tag = soup.new_tag("priority")
new_tag.string = "urgent"

first_title.append(new_tag)

This adds a new <priority> tag inside the first <title> element.

Outputting XML

To output the modified XML document, use prettify():

print(soup.prettify())

This will print out the new XML with indentation.

You can also convert a BeautifulSoup XML object back into a string, perform additional processing, and write it back out to a file.

Here is an example that parses an XML string with BeautifulSoup and extracts some data:

from bs4 import BeautifulSoup

xml = """
<catalog>
  <book id="1">
    <author>Mark Twain</author>
    <title>The Adventures of Huckleberry Finn</title>
    <genre>Novel</genre>
    <price>7.99</price>
  </book>
  <book id="2">
    <author>J.K. Rowling</author>
    <title>Harry Potter and the Philosopher's Stone</title>
    <genre>Fantasy</genre>
    <price>6.99</price>
  </book>
</catalog>
"""

# Load XML and parse
soup = BeautifulSoup(xml, "xml")

# Find all book tags
books = soup.find_all('book')

# Print out author and title for each book
for book in books:
    author = book.find("author").text
    title = book.find("title").text
    print(f"{title} by {author}")

This would print:

The Adventures of Huckleberry Finn by Mark Twain
Harry Potter and the Philosopher's Stone by J.K. Rowling

We locate the <book> elements, then extract the inner <author> and <title> text for each.

Here is an example of parsing the XML and displaying the extracted book data in a table using BeautifulSoup and Pandas:

from bs4 import BeautifulSoup
import pandas as pd

xml = """
<catalog>
  <book id="1">
    <author>Mark Twain</author>
    <title>The Adventures of Huckleberry Finn</title>
    <genre>Novel</genre>
    <price>7.99</price>
  </book>
  <book id="2">
    <author>J.K. Rowling</author>
    <title>Harry Potter and the Philosopher's Stone</title>
    <genre>Fantasy</genre>
    <price>6.99</price>
  </book>
</catalog>
"""

soup = BeautifulSoup(xml, 'xml')

books = []

# Build one dictionary per <book>, reading the id attribute and child tag text
for book in soup.find_all('book'):
    book_data = {
        "id": book['id'],
        "author": book.find('author').text,
        "title": book.find('title').text,
        "genre": book.find('genre').text,
        "price": float(book.find('price').text)
    }
    books.append(book_data)

df = pd.DataFrame(books)
print(df)

We extract each book's fields into a dictionary, collect the dictionaries in a list, then convert the list to a Pandas DataFrame for a tabular display.

This provides a simple way to parse XML and view the extracted data in table format using Python. The DataFrame could also easily be output to CSV or other formats.

Here is an example of using BeautifulSoup to parse an RSS feed and save the extracted data to a CSV file:

import csv

import requests
from bs4 import BeautifulSoup

feed_url = "https://www.example.com/feed.rss"

response = requests.get(feed_url)
soup = BeautifulSoup(response.content, "xml")

items = soup.find_all("item")

# newline='' avoids blank rows on Windows; the with block closes the file for us
with open('feed.csv', 'w', newline='', encoding='utf-8') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['Title', 'Link', 'Published'])

    for item in items:
        title = item.find("title").text
        link = item.find("link").text
        pub_date = item.find("pubDate").text
        csv_writer.writerow([title, link, pub_date])

This loads and parses the RSS feed, then extracts the title, link, and publish date for each <item> in the feed.

We write this data out row by row into a CSV file using the csv module.

The end result is a feed.csv file containing the data extracted from the RSS feed in tabular format.

This demonstrates how BeautifulSoup can parse XML formats like RSS and extract their data into structured datasets readable by other programs.
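As a closing sketch, the modification and output steps described earlier can be combined into one round trip: append a tag, serialize the tree with str(), write it to a file, and parse it back. The catalog content and the catalog_out.xml filename are assumptions chosen for illustration, and a temporary directory is used so the example does not overwrite anything.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical input document
xml = '<catalog><book id="1"><title>Walden</title></book></catalog>'
soup = BeautifulSoup(xml, "xml")

# Add a <priority> child to the first <book>
book = soup.find("book")
priority = soup.new_tag("priority")
priority.string = "urgent"
book.append(priority)

# str(soup) gives compact XML; soup.prettify() would add indentation instead
output = str(soup)

# Write to a temporary location (the filename is an illustrative choice)
out_path = os.path.join(tempfile.mkdtemp(), "catalog_out.xml")
with open(out_path, "w", encoding="utf-8") as f:
    f.write(output)

# Read the file back in binary mode and confirm the new tag survived
with open(out_path, "rb") as f:
    reloaded = BeautifulSoup(f, "xml")

print(reloaded.find("priority").get_text())
```

Using str() rather than prettify() for the saved file avoids introducing extra whitespace into text nodes, which can matter if another program consumes the XML.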