Welcome to part 2 of the web scraping with Beautiful Soup 4 tutorial mini-series. In this tutorial, we're going to talk about navigating source code to get just the slice of data we want.
We'll begin with the same starting code:
import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
soup = bs.BeautifulSoup(source, 'lxml')
Now, rather than working with the entire soup, we can pull out a single tag and work with just that part of the document. An example might be:
nav = soup.nav
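To see what that attribute access is doing, here is a minimal sketch run against a small inline HTML string rather than the live page. The markup, the id values, and the variable names are made up for illustration, and it uses the built-in 'html.parser' so that it runs without lxml. Accessing soup.nav behaves like soup.find('nav'): it returns the first matching tag, or None when there is no such tag.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup; the real page's structure may differ.
html = """
<html><body>
<nav id="top"><a href="/">Home</a><a href="/about/">About</a></nav>
<nav id="footer"><a href="/contact/">Contact</a></nav>
<p><a href="/elsewhere/">A link outside the nav</a></p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Attribute access returns the FIRST <nav> only, like soup.find("nav").
nav = soup.nav
first_nav = soup.find("nav")

# Searching from nav only sees links nested inside that tag.
nav_links = [a.get("href") for a in nav.find_all("a")]

# Searching from the whole soup sees every link in the document.
all_links = [a.get("href") for a in soup.find_all("a")]

# A tag that does not exist gives None, so check before chaining calls.
missing = soup.aside

print(nav_links)
print(len(all_links))
print(missing)
```

Since a missing tag comes back as None, calling find_all on the result would raise an AttributeError, so it is worth checking for None when scraping pages you don't control.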
Next, we can grab the links from just the nav bar:
for url in nav.find_all('a'):
    print(url.get('href'))
In this case, we're grabbing the first nav tag that we can find (the navigation bar). You could also go for soup.body to get the body section, then grab the .text from there:
body = soup.body
for paragraph in body.find_all('p'):
    print(paragraph.text)
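As a rough illustration of what .text gives you, the sketch below runs the same idea on a small inline document. The markup is hypothetical and the parser is the built-in 'html.parser'. It also shows get_text with a separator and strip=True, which is not used in this tutorial but is one way to tidy up surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup for illustration only.
html = """<html><head><title>Demo</title></head><body>
<p>First <b>bold</b> paragraph.</p>
<p>
  Second paragraph.
</p>
<div>Not a paragraph.</div>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")
body = soup.body

# .text joins all the strings inside a tag, including nested tags,
# and keeps whatever whitespace was in the source.
raw = [p.text for p in body.find_all("p")]

# get_text(" ", strip=True) strips each piece of text and joins the
# pieces with a single space, which removes the stray newlines.
clean = [p.get_text(" ", strip=True) for p in body.find_all("p")]

print(raw)
print(clean)
```

Note that the div is skipped, since we only asked for p tags inside the body.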
Finally, sometimes there might be multiple tags with the same name but different classes, and you might want to grab information from a specific tag with a specific class. For example, the page that we're working with has a div tag with the class of "body". We can work with this data like so:
for div in soup.find_all('div', class_='body'):
    print(div.text)
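Here, we're filtering with class_='body', which allows us to work with a specific class of tag. The trailing underscore is there because class is a reserved word in Python.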
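Here is a small sketch of how class filtering behaves, again on made-up inline markup with the built-in 'html.parser'. The class names other than "body" are hypothetical. It shows that class_ matches a tag when "body" is any one of its classes, that the attrs dictionary form does the same thing, and that dropping the tag name matches every tag with that class.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup for illustration only.
html = """
<div class="body">Main content</div>
<div class="body wide">Wide content</div>
<div class="sidebar">Sidebar</div>
<span class="body">A span, not a div</span>
"""

soup = BeautifulSoup(html, "html.parser")

# Keyword form: div tags whose class list includes "body".
by_keyword = [d.text for d in soup.find_all("div", class_="body")]

# Equivalent dictionary form, handy when you'd rather avoid class_.
by_attrs = [d.text for d in soup.find_all("div", attrs={"class": "body"})]

# With no tag name, any tag carrying the class is matched.
any_tag = [t.name for t in soup.find_all(class_="body")]

print(by_keyword)
print(by_attrs)
print(any_tag)
```

If you only want tags whose class attribute is exactly "body" and nothing else, you would need an extra check on the tag's class list after the search.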
In the next tutorial, we're going to cover working with tables and XML.