Parsing XML with BeautifulSoup

Oct 6, 2023 · 5 min read

While BeautifulSoup is mainly designed for parsing HTML, it can also handle XML documents quite well with just a little configuration. Here's how to leverage BeautifulSoup for scraping and analyzing XML files or responses.

Loading the XML

Loading an XML document into a BeautifulSoup object is the same process as with HTML:

from bs4 import BeautifulSoup

with open("file.xml") as f:
    data = f.read()

soup = BeautifulSoup(data, "xml")

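Note that we explicitly request the "xml" parser. It relies on the lxml package, so lxml must be installed; without it, BeautifulSoup raises a FeatureNotFound error.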
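
The constructor also accepts a string or bytes directly, which is useful when the XML comes from an API response rather than a file. The following is a minimal sketch, assuming lxml is installed; the inline document and its tag names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical inline document standing in for the contents of file.xml
xml_bytes = b"""<?xml version="1.0" encoding="UTF-8"?>
<library>
  <title>First</title>
  <title>Second</title>
</library>"""

# Passing bytes lets the parser use the encoding declared in the XML prolog
soup = BeautifulSoup(xml_bytes, "xml")

root = soup.find("library")
print(root.name)                    # name of the root element
print(len(root.find_all("title")))  # number of <title> elements under it
```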

Navigating the Tree

You can navigate and search the parsed XML tree using the same methods as HTML:

titles = soup.find_all("title")

first_title = titles[0]
print(first_title.text)

Tag and attribute names match those defined in the XML exactly, including case. Unlike the HTML parsers, the "xml" parser does not lowercase them.
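
To make the navigation calls concrete, here is a small sketch on an inline document. The playlist markup is an assumption invented for this example; it also shows that name matching is case-sensitive under the "xml" parser.

```python
from bs4 import BeautifulSoup

# Illustrative document; the element names are assumptions for this sketch
xml = """<playlist>
  <song length="short"><title>Intro</title><Artist>Ana</Artist></song>
  <song length="long"><title>Finale</title><Artist>Ben</Artist></song>
</playlist>"""

soup = BeautifulSoup(xml, "xml")

# find_all returns every match in document order
titles = [t.text for t in soup.find_all("title")]

# find returns only the first match, or None when nothing matches
first_song = soup.find("song")
artist = first_song.find("Artist")    # exact-case name matches
missing = first_song.find("artist")   # different case, so no match

print(titles)
print(artist.text, missing)
```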

Finding by Attributes

Searching by attributes works the same:

songs = soup.find_all("song", {"length": "short"})

This finds all <song> tags whose "length" attribute is "short".
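
As a hedged sketch of a few related attribute lookups (the playlist document and its id values are invented for illustration): passing True as the attribute value matches any tag that has the attribute, and tag.get() reads an attribute without raising when it is absent.

```python
from bs4 import BeautifulSoup

# Illustrative document; the third song deliberately has no "length" attribute
xml = """<playlist>
  <song length="short" id="1">Intro</song>
  <song length="long" id="2">Finale</song>
  <song id="3">Untitled</song>
</playlist>"""

soup = BeautifulSoup(xml, "xml")

# Exact attribute match, as in the article
short = soup.find_all("song", {"length": "short"})

# True matches any <song> that has a "length" attribute, whatever its value
with_length = soup.find_all("song", attrs={"length": True})

# tag["length"] raises KeyError when the attribute is missing; get() does not
lengths = [song.get("length", "unknown") for song in soup.find_all("song")]

print([s["id"] for s in short])
print([s["id"] for s in with_length])
print(lengths)
```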

Modifying the Tree

You can also modify and add to the XML tree:

new_tag = soup.new_tag("priority")
new_tag.string = "urgent"

first_title.append(new_tag)

This adds a new <priority> tag to the first <title> element.

Outputting XML

To output the modified XML document, use prettify():

print(soup.prettify())

This prints the modified XML with indentation.

You can also convert a BeautifulSoup XML object back into a string, perform additional processing, and write it back out to a file.

Here is an example demonstrating parsing an XML document with BeautifulSoup and extracting some data:

from bs4 import BeautifulSoup

xml = """
<catalog>
  <book id="1">
    <author>Mark Twain</author>
    <title>The Adventures of Huckleberry Finn</title>
    <genre>Novel</genre>
    <price>7.99</price>
  </book>
  <book id="2">
    <author>J.K. Rowling</author>
    <title>Harry Potter and the Philosopher's Stone</title>
    <genre>Fantasy</genre>
    <price>6.99</price>
  </book>
</catalog>
"""

# Load XML and parse
soup = BeautifulSoup(xml, "xml")

# Find all book tags
books = soup.find_all("book")

# Print out author and title for each book
for book in books:
    author = book.find("author").text
    title = book.find("title").text
    print(f"{title} by {author}")

This would print:

The Adventures of Huckleberry Finn by Mark Twain
Harry Potter and the Philosopher's Stone by J.K. Rowling

We locate the <book> elements, then extract the inner <author> and <title> text for each.

Here is an example of parsing the XML and displaying the extracted book data in a table using BeautifulSoup and Pandas:

from bs4 import BeautifulSoup
import pandas as pd

xml = """
<catalog>
  <book id="1">
    <author>Mark Twain</author>
    <title>The Adventures of Huckleberry Finn</title>
    <genre>Novel</genre>
    <price>7.99</price>
  </book>
  <book id="2">
    <author>J.K. Rowling</author>
    <title>Harry Potter and the Philosopher's Stone</title>
    <genre>Fantasy</genre>
    <price>6.99</price>
  </book>
</catalog>
"""

soup = BeautifulSoup(xml, "xml")

books = []

for book in soup.find_all("book"):
    # Collect the id attribute and the child element text for each book
    book_data = {
        "id": book["id"],
        "author": book.find("author").text,
        "title": book.find("title").text,
        "genre": book.find("genre").text,
        "price": float(book.find("price").text)
    }
    books.append(book_data)

df = pd.DataFrame(books)
print(df)

We extract each book's fields into a dictionary, store the dictionaries in a list, then convert the list to a Pandas DataFrame for tabular display.

This provides a simple way to parse XML and view the extracted data in table format using Python. The DataFrame could also easily be output to CSV or other formats.

Here is an example of using BeautifulSoup to parse an RSS feed and save the extracted data to a CSV file:

import csv

import requests
from bs4 import BeautifulSoup

feed_url = "https://www.example.com/feed.rss"

response = requests.get(feed_url)
soup = BeautifulSoup(response.content, "xml")

items = soup.find_all("item")

# newline="" stops the csv module from writing blank rows on Windows;
# the with block closes the file even if an error occurs
with open("feed.csv", "w", newline="", encoding="utf-8") as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(["Title", "Link", "Published"])

    for item in items:
        title = item.find("title").text
        link = item.find("link").text
        pub_date = item.find("pubDate").text
        csv_writer.writerow([title, link, pub_date])

This loads and parses the RSS feed, then extracts the title, link, and publish date for each <item> in the feed.

We write this data out row by row into a CSV file using the csv module.

The end result is a feed.csv file containing the extracted RSS data in tabular format.

This demonstrates how BeautifulSoup can parse XML formats like RSS and extract their data into structured datasets readable by other programs.
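
The RSS example assumes every <item> carries a title, a link, and a pubDate. When one is absent, item.find(...) returns None and reading .text raises AttributeError. Below is a minimal offline sketch of a more defensive version. The inline feed, the child_text helper, and the in-memory buffer are illustrative assumptions, not part of the original example.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical feed; the second item deliberately has no <pubDate>
rss = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <link>https://www.example.com/first</link>
      <pubDate>Fri, 06 Oct 2023 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <link>https://www.example.com/second</link>
    </item>
  </channel>
</rss>"""


def child_text(item, name):
    """Return the stripped text of a direct child element, or "" if absent."""
    node = item.find(name, recursive=False)
    return node.get_text(strip=True) if node is not None else ""


soup = BeautifulSoup(rss, "xml")

# Write to an in-memory buffer so the sketch leaves no file behind
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Title", "Link", "Published"])

for item in soup.find_all("item"):
    writer.writerow([child_text(item, n) for n in ("title", "link", "pubDate")])

print(buffer.getvalue())
```

Missing fields become empty cells instead of stopping the export. To write a real file, swap the buffer for open("feed.csv", "w", newline="", encoding="utf-8").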