Python Programming Tutorials

Navigation with Beautiful Soup 4

Welcome to part 2 of the web scraping with Beautiful Soup 4 tutorial mini-series. In this tutorial, we're going to talk about navigating source code to get just the slice of data we want.

We'll begin with the same starting code:

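# Beautiful Soup for parsing, urllib for fetching the page
import bs4 as bs
import urllib.request

# Download the raw HTML, then parse it with the lxml parser
source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
soup = bs.BeautifulSoup(source, 'lxml')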

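Now, rather than working with the entire soup, we can narrow our focus to a single tag. Accessing a tag name as an attribute of the soup returns the first matching tag as its own object, which we can search just like the soup itself. For example: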

nav = soup.nav
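
To see what soup.nav actually hands back, here is a minimal sketch that runs without a network connection. The sample_html string is a made-up stand-in for the real page, and 'html.parser' is used only so the sketch doesn't depend on lxml being installed; both are assumptions for illustration.

```python
import bs4 as bs

# Hypothetical stand-in markup; the real page's structure may differ.
sample_html = """
<html><body>
  <nav><a href="/">Home</a><a href="/about/">About</a></nav>
  <nav><a href="/footer/">Footer link</a></nav>
</body></html>
"""

soup = bs.BeautifulSoup(sample_html, 'html.parser')

nav = soup.nav                # first <nav> in document order
same_nav = soup.find('nav')   # the explicit way to ask for the same thing

# The result is a Tag from the parsed tree, searchable like the soup itself.
is_tag = isinstance(nav, bs.Tag)

# Asking for a tag that isn't on the page gives None rather than an error.
missing = soup.aside

# Only the links inside the first <nav> are collected.
nav_links = [a.get('href') for a in nav.find_all('a')]

print(is_tag, nav is same_nav, missing, nav_links)
```

Because a missing tag comes back as None, it's worth checking the result before calling find_all on it when you aren't sure the page has that tag.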

Next, we can grab the links from just the nav bar:

for url in nav.find_all('a'):
    print(url.get('href'))

In this case, we're grabbing the first nav tag that we can find (the navigation bar). You could also use soup.body to get the body section, then grab the .text of each paragraph from there:

body = soup.body
for paragraph in body.find_all('p'):
    print(paragraph.text)
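
The same idea can be tried offline. The sketch below uses a made-up sample_html string and 'html.parser' (both assumptions, not the tutorial's page or parser) to show that find_all searches the whole subtree under body by default, and that .text includes the text of nested tags.

```python
import bs4 as bs

# Hypothetical stand-in markup, not the tutorial's actual page.
sample_html = """
<html>
  <head><title>Sample</title></head>
  <body>
    <p>  First paragraph.  </p>
    <div><p>Nested <b>bold</b> paragraph.</p></div>
  </body>
</html>
"""

soup = bs.BeautifulSoup(sample_html, 'html.parser')
body = soup.body

# find_all descends through every level below body, so the <p> inside
# the <div> is found too. .text joins the text of all nested tags;
# .strip() trims the surrounding whitespace.
paragraphs = [p.text.strip() for p in body.find_all('p')]

# recursive=False limits the search to direct children of body.
direct = body.find_all('p', recursive=False)

print(paragraphs)
print(len(direct))
```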

Finally, a page may contain several tags with the same name but different classes, and you may want the information from a tag with one specific class. For example, the page we're working with has a div tag with the class "body". We can work with this data like so:

for div in soup.find_all('div', class_='body'):
    print(div.text)

Note the class_='body' argument, which restricts the search to tags with that class. The trailing underscore is needed because class is a reserved word in Python.
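
As a rough illustration of how class matching behaves, the sketch below runs three equivalent searches against a made-up sample_html string (an assumption for illustration, parsed with 'html.parser' rather than lxml). It also shows that a tag with several classes still matches when any one of them is "body".

```python
import bs4 as bs

# Hypothetical stand-in markup, not the tutorial's actual page.
sample_html = """
<div class="body">Main content</div>
<div class="sidebar">Side content</div>
<div class="body wide">Wide content</div>
<p class="body">Not a div</p>
"""

soup = bs.BeautifulSoup(sample_html, 'html.parser')

# Keyword form used in the tutorial.
by_keyword = [d.text for d in soup.find_all('div', class_='body')]

# The same filter written as an attrs dictionary.
by_attrs = [d.text for d in soup.find_all('div', attrs={'class': 'body'})]

# The same filter written as a CSS selector.
by_css = [d.text for d in soup.select('div.body')]

print(by_keyword)
print(by_attrs)
print(by_css)
```

All three lists hold the two div tags whose class includes "body"; the p tag is skipped because the tag name doesn't match, and the sidebar div because its class doesn't.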

In the next tutorial, we're going to cover working with tables and XML.
