Welcome to part 12 of the intermediate Python programming tutorial series. In this part, we're going to talk more about the built-in library: multiprocessing. Here, we're going to be covering the beginnings of building a spider, using the multiprocessing library. The idea here will be to quickly access and process many websites at the same time.
If you're just now joining us, you may want to start with the multiprocessing tutorial, as this is meant to simply be an example of what we learned.
To begin, let's make some imports:
from multiprocessing import Pool
import bs4 as bs
import random
import requests
import string
We will obviously be using multiprocessing, and we're going to use the Pool so we can access the returned values from a process. Next, we're going to make use of the Beautiful Soup library for parsing the HTML. If you're not familiar with Beautiful Soup, you can check out the Beautiful Soup miniseries. We'll be using string (along with random) to generate random strings, and requests to actually make the request and grab the source code.
def random_starting_url():
    starting = ''.join(random.SystemRandom().choice(string.ascii_lowercase) for _ in range(3))
    url = ''.join(['http://', starting, '.com'])
    return url
Now, with a spider, we need to figure out at least where to begin. Once it has a starting point, a spider will simply continue crawling around, networking out to other websites via links. To figure out where to begin, we're going to write a function that generates a random combination of three characters; then we'll slap an "http://" and a ".com" on it, and we've got a probably decent starting place, since most three-letter .com domain names have at least something on them. If one doesn't, no matter, since we're going to start by parsing a handful of these.
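As a quick sanity check, every URL this function returns has the same fixed shape, which you can verify by printing a few (the exact letters are random each time, of course):

```python
import random
import string

def random_starting_url():
    # three random lowercase letters, wrapped in http:// and .com
    starting = ''.join(random.SystemRandom().choice(string.ascii_lowercase) for _ in range(3))
    url = ''.join(['http://', starting, '.com'])
    return url

for _ in range(3):
    print(random_starting_url())  # something like http://qzj.com
```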
Many times, websites will have local links, where the link doesn't actually start with http or https, and instead starts with a slash, like /login/. A browser knows this is really a link to http://thewebsite.com/login/, but our program won't without us telling it:
def handle_local_links(url, link):
    if link.startswith('/'):
        return ''.join([url, link])
    else:
        return link
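To see it in action: a local link gets the base URL prepended, while a full link passes through untouched (the URLs below are just examples):

```python
def handle_local_links(url, link):
    # local links start with a slash; join them onto the base URL
    if link.startswith('/'):
        return ''.join([url, link])
    else:
        return link

print(handle_local_links('http://thewebsite.com', '/login/'))
# http://thewebsite.com/login/
print(handle_local_links('http://thewebsite.com', 'http://pythonprogramming.net'))
# http://pythonprogramming.net
```

Note this simple join assumes the base URL has no trailing slash; a more robust spider might use urllib.parse.urljoin instead.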
Now, we need to find those links!
def get_links(url):
    resp = requests.get(url)
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    body = soup.body
    links = [link.get('href') for link in body.find_all('a')]
    links = [handle_local_links(url, link) for link in links]
    links = [str(link.encode("ascii")) for link in links]
    return links
In this function, we're grabbing the source code, then parsing it with Beautiful Soup. That said, there could be some issues. First, the domain may not have a server. If it does have a server, maybe there's nothing on it being returned. If they do have a website, maybe they don't allow bot connections. If we are able to connect and read the source code, we might not find any links at all. Thus, we have a few exceptions to handle for:
def get_links(url):
    try:
        resp = requests.get(url)
        soup = bs.BeautifulSoup(resp.text, 'lxml')
        body = soup.body
        links = [link.get('href') for link in body.find_all('a')]
        links = [handle_local_links(url, link) for link in links]
        links = [str(link.encode("ascii")) for link in links]
        return links
    except TypeError as e:
        print(e)
        print('Got a TypeError, probably got a None that we tried to iterate over')
        return []
    except IndexError as e:
        print(e)
        print('We probably did not find any useful links, returning empty list')
        return []
    except AttributeError as e:
        print(e)
        print('Likely got None for links, so we are throwing this')
        return []
    except Exception as e:
        print(str(e))
        # log this error
        return []
The final exception could arguably be excluded and we could further handle the explicit errors. In general, it's a bad idea to just silently move along on errors. At the very least, you might want to log the error either in a plain text file or even using something like Python's logging standard library.
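As a sketch of that logging idea, you might send errors to a file with the standard logging module (the file name spider_errors.log and the logger name are just examples):

```python
import logging

# a named logger that writes errors to a file instead of silently passing
logger = logging.getLogger('spider')
logger.setLevel(logging.ERROR)
logger.addHandler(logging.FileHandler('spider_errors.log'))

try:
    resp = None
    resp.text  # stand-in for a failed request or parse
except AttributeError as e:
    logger.error('Failed while parsing: %s', e)
```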
Now, using multiprocessing, let's put it all together:
def main():
    how_many = 50
    p = Pool(processes=how_many)
    parse_us = [random_starting_url() for _ in range(how_many)]
    data = p.map(get_links, parse_us)
    data = [url for url_list in data for url in url_list]
    p.close()

    with open('urls.txt', 'w') as f:
        f.write(str(data))

if __name__ == '__main__':
    main()
In this case, we're taking the list of all URLs and writing them to a file, but with a full spider, you'd actually take that list, make it the new parse_us, and the loop would go on forever.
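A minimal sketch of that loop might look like the following. To keep it self-contained and runnable, get_links is replaced here with a stub that invents two outbound links per page; in a real spider you'd use the get_links from above, loop forever instead of a fixed number of rounds, and add things like deduplication and politeness delays:

```python
from multiprocessing import Pool

def get_links(url):
    # stub standing in for the real get_links above: pretend
    # every page links out to two new pages
    return [url + '/a', url + '/b']

def crawl(seed_urls, rounds=3):
    # each round's results become the next round's parse_us
    parse_us = seed_urls
    for _ in range(rounds):  # a real spider would loop forever
        with Pool(processes=4) as p:
            data = p.map(get_links, parse_us)
        parse_us = [url for url_list in data for url in url_list]
    return parse_us

if __name__ == '__main__':
    # 1 seed -> 2 -> 4 -> 8 links after 3 rounds
    print(len(crawl(['http://abc.com'])))
```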
If you're more curious about what each line is doing, check out the video version of this tutorial above, as each line is explained.
In the next tutorial, we're going to introduce Object Oriented Programming (OOP).