Parsing tables and XML with Beautiful Soup 4




Welcome to part 3 of the web scraping with Beautiful Soup 4 tutorial mini-series. In this tutorial, we're going to look more closely at scraping just the data you want, using a table as the example, and then at parsing XML documents.

We begin with our same starting code:

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
soup = bs.BeautifulSoup(source,'lxml')

This page just has one table, so we can get away with doing:

table = soup.table

Or we could do:

table = soup.find('table')

Either of these will work for us. Next, we can find the table rows within the table:

table_rows = table.find_all('tr')

Then we can iterate through the rows, find the td tags, and then print out each of the table data tags:

for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)

Output:

[]
['Python', '932914021', 'Definitely']
['Pascal', '532', 'Unlikely']
['Lisp', '1522', 'Uncertain']
['D#', '12', 'Possibly']
['Cobol', '3', 'No.']
['Fortran', '52124', 'Yes.']
['Haskell', '24', 'lol.']

The first row is empty, since it has table header (th) tags, not table data (td) tags.
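One way to put that header row to use is to read the th cells separately and pair them with each data row. The sketch below runs on a small inline HTML string rather than the live page. The header names (Language, Points, Verdict) are made up for illustration, and html.parser is used only so the example needs nothing beyond bs4 itself:

```python
import bs4 as bs

# Hypothetical stand-in for the page's table; the header names are assumptions.
html = """
<table>
  <tr><th>Language</th><th>Points</th><th>Verdict</th></tr>
  <tr><td>Python</td><td>932914021</td><td>Definitely</td></tr>
  <tr><td>Pascal</td><td>532</td><td>Unlikely</td></tr>
</table>
"""

soup = bs.BeautifulSoup(html, 'html.parser')
table = soup.find('table')

headers = []
rows = []
for tr in table.find_all('tr'):
    th = tr.find_all('th')
    td = tr.find_all('td')
    if th:
        # Header row: remember the column names
        headers = [i.text for i in th]
    elif td:
        # Data row: pair each cell with its column name
        rows.append(dict(zip(headers, [i.text for i in td])))

for row in rows:
    print(row)
```

Each row comes out as a dictionary keyed by column name, and the header row no longer shows up as an empty list.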

While this works just fine, since the topic is scraping tables, I will also show a method that doesn't call Beautiful Soup directly, using Pandas (if you don't have it, you can do pip install pandas, but the install may take some time). Note that read_html still relies on an HTML parser such as lxml being installed:

import pandas as pd

dfs = pd.read_html('https://pythonprogramming.net/parsememcparseface/',header=0)
for df in dfs:
    print(df)

Pandas is a data analysis library, and is better suited for working with table data in many cases, especially if you're planning to do any sort of analysis with it. If you are interested in Pandas and data analysis, you can check out the Pandas for Data Analysis tutorial series.
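As a rough illustration of why a DataFrame is convenient for analysis, the sketch below parses a small inline table (wrapped in StringIO so no network access is needed) and then sorts and sums a column. The header names are assumptions, not the real page's headers, and the example assumes pandas and lxml are installed:

```python
from io import StringIO

import pandas as pd

# Hypothetical table; the column names are made up for illustration.
html = """
<table>
  <tr><th>Language</th><th>Points</th><th>Verdict</th></tr>
  <tr><td>Python</td><td>932914021</td><td>Definitely</td></tr>
  <tr><td>Pascal</td><td>532</td><td>Unlikely</td></tr>
  <tr><td>Lisp</td><td>1522</td><td>Uncertain</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per table found
dfs = pd.read_html(StringIO(html), header=0)
df = dfs[0]

# Numeric-looking columns are parsed as numbers, so they sort and sum directly
top = df.sort_values('Points', ascending=False)
print(top)
print(df['Points'].sum())
```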

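Finally, let's talk about parsing XML. XML uses tags much like HTML, but it is stricter: every tag must be closed, tag names are case-sensitive, and the tag names are chosen by whoever designed the document format rather than coming from a fixed set. We can use a variety of libraries to parse XML, including standard library options, but, since this is a Beautiful Soup 4 tutorial, let's talk about how to do it with BS4.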
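For comparison, here is a minimal sketch of the standard library route, using xml.etree.ElementTree on a small made-up document. The element and attribute names are invented for illustration and don't come from any real feed:

```python
import xml.etree.ElementTree as ET

# Made-up XML document; element and attribute names are assumptions.
xml_data = """<languages>
  <language name="Python"><points>932914021</points></language>
  <language name="Pascal"><points>532</points></language>
</languages>"""

root = ET.fromstring(xml_data)

points = {}
for lang in root.findall('language'):
    # Attributes are read with .get(), child elements with .find()
    points[lang.get('name')] = int(lang.find('points').text)

print(points)
```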

One of the most common reasons that you might deal with an XML document is if you are trying to scrape a sitemap for a website. PythonProgramming.net has a sitemap.xml, so we'll use that.

The sitemap looks like:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
	<url>
		<loc>
		https://pythonprogramming.net/pickle-data-analysis-python-pandas-tutorial/
		</loc>
		<lastmod>2016-10-15</lastmod>
	</url>
	<url>
		<loc>
		https://pythonprogramming.net/training-testing-machine-learning-tutorial/
		</loc>
		<lastmod>2016-10-15</lastmod>
	</url>
</urlset>

To parse XML, we need to change some of our initial code:

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/sitemap.xml').read()
soup = bs.BeautifulSoup(source,'xml')

Note that we're grabbing source data from a new link, and that when we call bs.BeautifulSoup, our second parameter is now xml rather than lxml. The xml parser is still provided by the lxml library, so lxml needs to be installed.

Now, say we just want to grab the URLs:

for url in soup.find_all('loc'):
    print(url.text)
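If you want more than the URL, you can loop over the url tags and pull out each child you care about. The sketch below does this on an inline copy of the sitemap excerpt shown earlier, so it runs without fetching anything. The .strip() call is there because the loc text in that excerpt is surrounded by whitespace; whether the live sitemap needs it is an assumption:

```python
import bs4 as bs

# Inline copy of the sitemap excerpt, so no network request is needed
xml_data = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>
        https://pythonprogramming.net/pickle-data-analysis-python-pandas-tutorial/
        </loc>
        <lastmod>2016-10-15</lastmod>
    </url>
    <url>
        <loc>
        https://pythonprogramming.net/training-testing-machine-learning-tutorial/
        </loc>
        <lastmod>2016-10-15</lastmod>
    </url>
</urlset>"""

soup = bs.BeautifulSoup(xml_data, 'xml')

pages = []
for url in soup.find_all('url'):
    # Strip the surrounding whitespace and newlines from each value
    loc = url.find('loc').text.strip()
    lastmod = url.find('lastmod').text.strip()
    pages.append((loc, lastmod))

for loc, lastmod in pages:
    print(lastmod, loc)
```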
