Python Programming Tutorials

How to Parse a Website with regex and urllib Python Tutorial

In this video, we use two of Python 3's standard library modules, re and urllib, to parse paragraph data from a website. As we saw earlier, when you use Python 3 and urllib to fetch a website, you get all of the HTML data, just like using "view source" on a web page. This HTML is great when a browser renders it, but it is incredibly messy to read as raw source. For this reason, we need to build something that can sift through the mess and pull out just the article data that we are interested in. There are web scraping libraries out there, such as BeautifulSoup, which are aimed at this same sort of task.

On to the code:

import urllib.request
import re

url = 'http://pythonprogramming.net/parse-website-using-regular-expressions-urllib/'

# Build the request, send it, and read the raw response body (a bytes object)
req = urllib.request.Request(url)
resp = urllib.request.urlopen(req)
respData = resp.read()

Up to this point, everything should look pretty typical, as you've seen it all before. We specify our URL, build our request, make the request, and then store the response body in respData. There is no values dict to encode this time, since we are not sending any data along with the request. We can print respData out if we want to see what we're working with, but printing the full source is not always the greatest idea. Many webpages, especially larger ones, have a very large amount of code in their source, and printing all of it can take quite a while in IDLE. Personally, I prefer to just view the source in the browser. In Google Chrome, for example, control+u will open view-source.
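If you do want to peek at the response from inside Python without flooding the console, one option is to print only the first few hundred characters. This is a minimal sketch: the helper name preview and the 500-character default are assumptions made for illustration, and the sample bytes stand in for a real respData so the example runs offline.

```python
def preview(raw_bytes, limit=500, encoding='utf-8'):
    """Return the first `limit` characters of a response body as text.

    `preview` is a hypothetical helper; `raw_bytes` is what resp.read() returns.
    """
    # errors='replace' avoids a crash if the page is not valid UTF-8
    text = raw_bytes.decode(encoding, errors='replace')
    return text[:limit]


# Stand-in for respData so the example runs without a network connection
sample = b'<html><body><p>Hello</p></body></html>' * 100
print(preview(sample, limit=60))
```

With a live response you would call preview(respData) instead of passing the sample bytes.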

Alternatively, you should be able to just right-click on the page and select view-source. Once there, you want to look for your "target data." In our case, we just want the paragraph text. If you're looking for something specific, I suggest you copy some of the "thing" you are looking for. So in the case of specific paragraph text, highlight some of it on the page, copy it, then view the source. Once there, do a find operation (control+f will usually open one), then paste in what you copied. That should take you to the spot in the source, where you can look for identifiers near your target. Paragraph data is paragraph data because the page author tells the browser it is. This usually means that there are literal paragraph tags around what we want, which look like:
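The same "find it, then look at what surrounds it" step can also be done in code. The sketch below is only an illustration: show_context is a hypothetical helper name, and the source string is made up to stand in for a decoded page.

```python
def show_context(source, snippet, width=40):
    """Return the text surrounding the first occurrence of `snippet`.

    Hypothetical helper that mimics control+f in a view-source window.
    Returns None when the snippet is not present.
    """
    index = source.find(snippet)
    if index == -1:
        return None
    # Clamp the window so it never runs past either end of the string
    start = max(0, index - width)
    end = min(len(source), index + len(snippet) + width)
    return source[start:end]


# Made-up page source standing in for the decoded response
source = '<div id="main"><p>Regular expressions are handy.</p></div>'
print(show_context(source, 'expressions', width=15))
```

The tags that show up on either side of the snippet are the identifiers you would build a pattern around.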

<p>text goes here</p>

Some websites get fancy with their HTML and do things like

<p class="derp">text here</p>

...keep this in mind, since the simple pattern below will not match an opening tag that carries attributes. That said, most websites just use simple paragraph tags, so let's show that:

paragraphs = re.findall(r'<p>(.*?)</p>',str(respData))

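The above regular expression states: find anything that starts with an opening paragraph tag. Then, in our parentheses, we say exactly "what" we're looking for: the . is any character except for a newline, the * means zero or more repetitions of that character, and the ? after the * makes the match non-greedy, so it stops at the first closing tag it reaches rather than the last one on the page. After that, we have a closing paragraph tag. re.findall returns every non-overlapping match, and because the pattern contains one group, we get back a list of just the text captured inside the parentheses. Note that str(respData) gives the printable representation of the bytes, so line breaks in the page show up as a literal backslash followed by n rather than as real newlines, which is why the . can still match across them. We can then iterate through the list with: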

for eachP in paragraphs:
    print(eachP)

The output should be a bunch of paragraph data from our website.
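For pages that put attributes on their paragraph tags, or that break a paragraph across several lines, a slightly more tolerant version can help. This is a sketch under stated assumptions rather than the tutorial's own code: extract_paragraphs and PARAGRAPH are hypothetical names, the page is assumed to be UTF-8, and the sample bytes are invented so the example runs offline. It decodes the bytes instead of calling str() on them, so real newlines are preserved and re.DOTALL is needed for the . to match across them.

```python
import re

# Allow optional attributes after "<p" and let "." match newlines too.
# \b keeps the pattern from matching other tags such as <pre> or <param>.
PARAGRAPH = re.compile(r'<p\b[^>]*>(.*?)</p>', re.DOTALL | re.IGNORECASE)


def extract_paragraphs(raw_bytes, encoding='utf-8'):
    """Return the inner text of each <p>...</p> block in a response body."""
    html = raw_bytes.decode(encoding, errors='replace')
    return [match.strip() for match in PARAGRAPH.findall(html)]


# Invented sample standing in for respData
sample = b'''<p>plain paragraph</p>
<p class="derp">paragraph with
a line break</p>
<pre>not a paragraph</pre>'''

for each_p in extract_paragraphs(sample):
    print(each_p)
```

Any tags nested inside a paragraph, such as links or bold text, are still left in the captured text. A regular expression is not a full HTML parser, so this only goes as far as the simple cases.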
