Welcome to a tutorial on web scraping with Beautiful Soup 4. Beautiful Soup is a Python library aimed at helping programmers who are trying to scrape data from websites.
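To use Beautiful Soup, you need to install it:
$ pip install beautifulsoup4
Beautiful Soup also relies on a parser. This tutorial uses lxml, which is installed separately. If you do not name a parser, Beautiful Soup picks the best one it finds installed, and Python's built-in html.parser is always available as a fallback. You may already have lxml, but you should check (open IDLE and attempt to import lxml). If not, do:
$ pip install lxml
or, on Debian or Ubuntu with Python 3:
$ apt-get install python3-lxml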
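The import check can also be scripted rather than done by hand in IDLE. The following is a minimal sketch, not part of the original tutorial: the helper name is_installed is made up for illustration, and falling back to html.parser when lxml is missing is an assumption about what you would want.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# bs4 is the import name of the beautifulsoup4 package.
for name in ("bs4", "lxml"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")

# Assumed fallback: use Python's built-in parser when lxml is absent.
parser = "lxml" if is_installed("lxml") else "html.parser"
print("parser to use:", parser)
```

The resulting parser string could then be passed as the second argument to BeautifulSoup in place of a hard-coded 'lxml'.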
To begin, we need HTML. I have created an example page for us to work with.
Next, we need to import Beautiful Soup and urllib, and grab the page's source code:
import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
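A bare urlopen call raises an exception if the site is down or the page is missing. As a hedged sketch, here is one way to wrap the request; the function name fetch_source, the 10-second timeout, and the choice to return None on failure are all assumptions made for illustration, not something the tutorial prescribes.

```python
import urllib.error
import urllib.request


def fetch_source(url, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    try:
        # The with-block closes the connection once the body is read.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as error:
        # URLError also covers HTTPError (e.g. a 404), which subclasses it.
        print(f"Could not fetch {url}: {error}")
        return None


# Usage (needs network access, so it is left commented out here):
# source = fetch_source('https://pythonprogramming.net/parsememcparseface/')
```

Returning None keeps the example short; in a larger script you might prefer to let the exception propagate.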
Then, we create the "soup." This is a beautiful soup object:
soup = bs.BeautifulSoup(source, 'lxml')
If you do print(soup) and print(source), they look similar, but source is just the raw response data (bytes), while soup is an object that we can actually interact with by tag, like so:
# title of the page
print(soup.title)

# get the tag's name:
print(soup.title.name)

# get the text inside the tag:
print(soup.title.string)

# beginning navigation:
print(soup.title.parent.name)

# getting specific values (the first <p> tag):
print(soup.p)
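To see what each of these calls returns without fetching the live page, here is a small offline sketch. The HTML string is made-up sample markup, and it uses Python's built-in html.parser (an assumption, so that it runs even without lxml installed).

```python
from bs4 import BeautifulSoup

# Made-up sample markup standing in for the fetched page.
html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <p>First paragraph</p>
    <p>Second paragraph</p>
  </body>
</html>
"""

# html.parser ships with Python; swap in 'lxml' to match the tutorial.
soup = BeautifulSoup(html, "html.parser")

print(soup.title)              # <title>Sample page</title>
print(soup.title.name)         # title
print(soup.title.string)       # Sample page
print(soup.title.parent.name)  # head
print(soup.p)                  # <p>First paragraph</p>
```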
Finding paragraph tags (<p>) is a fairly common task. In the case above, we're just finding the first one. What if we wanted to find them all?
print(soup.find_all('p'))
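The result of find_all behaves like a list, so it can be measured and indexed. A quick sketch on made-up markup, again using html.parser as an assumption so it runs without lxml:

```python
from bs4 import BeautifulSoup

# Made-up sample markup.
html = "<p>One</p><p>Two</p><p>Three</p>"
soup = BeautifulSoup(html, "html.parser")

paragraphs = soup.find_all("p")
print(len(paragraphs))  # 3
print(paragraphs[0])    # <p>One</p>

# limit caps the number of results returned.
print(soup.find_all("p", limit=2))  # [<p>One</p>, <p>Two</p>]

# find() returns only the first match, or None when nothing matches.
print(soup.find("p"))      # <p>One</p>
print(soup.find("table"))  # None
```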
We can also iterate through them:
for paragraph in soup.find_all('p'):
    print(paragraph.string)
    print(str(paragraph.text))
The difference between .string and .text is that .string produces a NavigableString object, and .text is just typical Unicode text. Notice that if the paragraph we call .string on contains more than one child (for example, text mixed with child tags), we will get None returned.
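The None case is easy to trip over, so here is a short sketch of both behaviors. The two paragraphs are made-up sample markup, and html.parser is used as an assumption so that the example runs without lxml.

```python
from bs4 import BeautifulSoup

# One paragraph with text only, one with text mixed with a child tag.
html = "<p>Plain text only</p><p>Text with a <b>bold</b> child</p>"
soup = BeautifulSoup(html, "html.parser")
plain, mixed = soup.find_all("p")

print(type(plain.string).__name__)  # NavigableString
print(type(plain.text).__name__)    # str

# The second paragraph has several children, so .string gives up.
print(mixed.string)  # None
print(mixed.text)    # Text with a bold child
```

When we only need the text and do not care about the tag structure, .text is the safer choice of the two.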
Another common task is to grab links. For example:
for url in soup.find_all('a'):
    print(url.get('href'))
In this case, if we just grabbed the .text from the tag, we'd get the anchor text, but we actually want the link itself. That's why we're using .get('href') to get the true URL.
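Two details are worth a sketch here: .get('href') returns None for an anchor that has no href attribute, and the value may be a relative path. The markup and the base_url below are hypothetical, and resolving relative links with urljoin from the standard library is one possible approach, not something the tutorial itself covers.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://example.com/blog/"  # hypothetical page address
html = (
    '<a href="/about">About us</a>'
    '<a href="https://example.com/docs">Docs</a>'
    '<a name="top">Anchor without a link</a>'
)
soup = BeautifulSoup(html, "html.parser")

# Anchor text next to the raw href value (None when href is missing).
for tag in soup.find_all("a"):
    print(repr(tag.text), "->", tag.get("href"))

# Skip anchors without href and resolve relative paths against base_url.
links = [
    urljoin(base_url, tag.get("href"))
    for tag in soup.find_all("a")
    if tag.get("href")
]
print(links)  # ['https://example.com/about', 'https://example.com/docs']
```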
Finally, you may just want to grab text. You can use .get_text() on a Beautiful Soup object, including the full soup:
print(soup.get_text())
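By default, get_text() glues the text pieces together exactly as they appear, which can run words from neighboring tags into each other. A short sketch on made-up markup shows the separator and strip arguments, which help tidy the output (html.parser is used as an assumption so it runs without lxml):

```python
from bs4 import BeautifulSoup

# Made-up sample markup with no whitespace between the tags.
html = "<div><h1>Title</h1><p>First <b>bold</b> line</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Default: pieces are concatenated with nothing in between.
print(soup.get_text())  # TitleFirst bold line

# Strip each piece and join the pieces with a single space.
print(soup.get_text(separator=" ", strip=True))  # Title First bold line
```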
This concludes the introduction to Beautiful Soup. In the next tutorial, we're going to cover navigating a page's elements to pull out exactly what you want.