Welcome to part 2 of the web scraping with Beautiful Soup 4 tutorial mini-series. In this tutorial, we're going to talk about navigating source code to get just the slice of data we want.
We'll begin with the same starting code:
import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
soup = bs.BeautifulSoup(source, 'lxml')
Now, rather than working with the entire soup, we can pull a single tag out into its own object (a Beautiful Soup Tag) and work with just that. An example might be:
nav = soup.nav
Next, we can grab the links from just the nav bar:
for url in nav.find_all('a'):
    print(url.get('href'))
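To see that this really limits the search to the nav bar, here is a minimal sketch that runs without a network connection. The HTML string is a made-up stand-in rather than the real page, and it uses the built-in 'html.parser' instead of 'lxml' so nothing extra needs installing; both are assumptions for illustration only.

```python
import bs4 as bs

# Hypothetical stand-in markup; the real page's structure may differ.
html = """
<html><body>
<nav><a href="/">Home</a><a href="/about/">About</a><a>No href</a></nav>
<p>See <a href="/elsewhere/">another link</a> outside the nav.</p>
</body></html>
"""

soup = bs.BeautifulSoup(html, 'html.parser')

nav = soup.nav  # the first <nav> tag, or None if the page has none

nav_links = []
if nav is not None:
    for url in nav.find_all('a'):
        href = url.get('href')  # .get returns None when the attribute is missing
        if href is not None:
            nav_links.append(href)

print(nav_links)
```

The link inside the paragraph is skipped because we only searched inside nav, and the anchor with no href is skipped by the None check.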
Here, soup.nav grabs the first nav tag that it can find (the navigation bar). You could also go for soup.body to get the body section, then grab the .text of each paragraph from there:
body = soup.body
for paragraph in body.find_all('p'):
    print(paragraph.text)
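As a rough sketch of what .text gives back, the example below runs the same loop over a small made-up HTML string (not the real page) with the built-in 'html.parser'. It collects the results in a list and strips surrounding whitespace, which is an optional extra step rather than something the loop above does.

```python
import bs4 as bs

# Hypothetical stand-in markup for illustration only.
html = """
<html><head><title>Demo</title></head>
<body>
<p>First <b>bold</b> paragraph.</p>
<div><p>  Nested paragraph.  </p></div>
</body></html>
"""

soup = bs.BeautifulSoup(html, 'html.parser')
body = soup.body

# find_all searches all descendants, so the <p> inside the <div> is found too.
# .text joins the text of a tag and everything nested inside it.
paragraphs = [paragraph.text.strip() for paragraph in body.find_all('p')]

print(paragraphs)
```

Note that the text inside the b tag comes through as part of the first paragraph, since .text includes the text of child tags.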
Finally, sometimes there might be multiple tags with the same name but different classes, and you might want to grab information from a specific tag with a specific class. For example, the page that we're working with has a div tag with the class of "body". We can work with this data like so:
for div in soup.find_all('div', class_='body'):
    print(div.text)
Note the class_='body', which allows us to work with a specific class of tag. The trailing underscore is there because class is a reserved word in Python.
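Here is a small sketch of class filtering on made-up markup (the div contents are invented for illustration, and 'html.parser' is used in place of 'lxml'). It also shows the attrs dictionary form, which should select the same tags.

```python
import bs4 as bs

# Hypothetical stand-in markup; class names other than "body" are invented.
html = """
<div class="header">Site title</div>
<div class="body">Main content</div>
<div class="body wide">More content</div>
<div class="footer">Copyright</div>
"""

soup = bs.BeautifulSoup(html, 'html.parser')

# class_ matches a tag if any one of its classes equals 'body'.
body_divs = [div.text for div in soup.find_all('div', class_='body')]

# The same filter written with an attrs dictionary.
same_divs = [div.text for div in soup.find_all('div', attrs={'class': 'body'})]

print(body_divs)
```

A div with several classes, such as "body wide", still matches, because the filter checks each class on the tag individually.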
In the next tutorial, we're going to cover working with tables and XML.