Multiprocessing Spider Example




Welcome to part 12 of the intermediate Python programming tutorial series. In this part, we're going to talk more about the built-in library: multiprocessing. Here, we're going to be covering the beginnings of building a spider using the multiprocessing library. The idea here will be to quickly access and process many websites at the same time.

If you're just now joining us, you may want to start with the multiprocessing tutorial, as this is meant to simply be an example of what we learned.

To begin, let's make some imports:

from multiprocessing import Pool
import bs4 as bs
import random
import requests
import string

We will obviously be using multiprocessing, and we're using Pool specifically so we can access the values returned from each process. Next, we're going to make use of the Beautiful Soup library for parsing the HTML. If you're not familiar with Beautiful Soup, you can check out the Beautiful Soup miniseries. We'll be using random and string to generate random strings, and requests to actually make the requests and grab the source code.

def random_starting_url():
    starting = ''.join(random.SystemRandom().choice(string.ascii_lowercase) for _ in range(3))
    url = ''.join(['http://', starting, '.com'])
    return url

Now, with a spider, we need to figure out at least where to begin. Once it has a starting point, a spider will simply continue crawling around, networking out to other websites via links. To figure out where to begin, we're going to write a function that generates a random combination of three characters, then slap an "http://" on the front and a ".com" on the end, and we've probably got a decent starting place, since most three-letter .com domain names have at least something on them. If one doesn't, no matter, since we're going to start by parsing a handful of these.
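Just to see what these starting points look like, here's a quick throwaway snippet (not part of the spider itself) that prints a handful of them; the output will be different on every run:

# print a few randomly generated starting urls, e.g. http://qjd.com
for _ in range(5):
    print(random_starting_url())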

Many times, websites will have local links, where the link doesn't actually start with http or https, and instead starts with a slash, like /login/. A browser knows this is really a link to http://thewebsite.com/login/, but our program won't unless we tell it:

def handle_local_links(url,link):
    if link.startswith('/'):
        return ''.join([url,link])
    else:
        return link
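For example, using a couple of made-up URLs just to illustrate the behavior:

print(handle_local_links('http://thewebsite.com', '/login/'))
# http://thewebsite.com/login/
print(handle_local_links('http://thewebsite.com', 'http://example.com/contact'))
# http://example.com/contact (already a full link, so it comes back untouched)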

Now, we need to find those links!

def get_links(url):
    # grab the page source and parse it with Beautiful Soup
    resp = requests.get(url)
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    body = soup.body
    # pull the href out of every anchor tag, fixing up any local links
    links = [link.get('href') for link in body.find_all('a')]
    links = [handle_local_links(url,link) for link in links]
    # keep the links as plain ascii strings, dropping any other characters
    links = [link.encode('ascii', 'ignore').decode() for link in links]
    return links

In this function, we're grabbing the source code, then parsing it with Beautiful Soup. That said, there could be some issues. First, the domain may not have a server at all. If it does have a server, maybe nothing is being returned. If there is a website, maybe it doesn't allow bot connections. If we are able to connect and read the source code, we might not find any links at all. Thus, we have a few exceptions to handle:

def get_links(url):
    try:
        resp = requests.get(url)
        soup = bs.BeautifulSoup(resp.text, 'lxml')
        body = soup.body
        links = [link.get('href') for link in body.find_all('a')]
        links = [handle_local_links(url,link) for link in links]
        links = [link.encode('ascii', 'ignore').decode() for link in links]
        return links

    except TypeError as e:
        print(e)
        print('Got a TypeError, probably got a None that we tried to iterate over')
        return []
    except IndexError as e:
        print(e)
        print('We probably did not find any useful links, returning empty list')
        return []
    except AttributeError as e:
        print(e)
        print('Likely got None for links, so we are throwing this')
        return []
    except Exception as e:
        print(str(e))
        # log this error 
        return []

The final, catch-all exception could arguably be excluded, with the explicit errors handled further instead. In general, it's a bad idea to just silently move along on errors. At the very least, you might want to log the error, either to a plain text file or with something like Python's logging standard library.
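If you wanted to go the logging route, a minimal sketch might look something like this; the spider.log file name and the format string are just assumptions for illustration, not something the spider requires:

import logging

# append anything at WARNING level or above to a plain text file
logging.basicConfig(filename='spider.log', level=logging.WARNING,
                    format='%(asctime)s %(levelname)s %(message)s')

try:
    raise ValueError('stand-in for a failed request or parse')
except Exception:
    # logging.exception records the message plus the full traceback
    logging.exception('Could not get links')

With that in place, the print calls in the except blocks above could just as easily become logging.exception calls.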

Now, using multiprocessing, let's put it all together:

def main():
    how_many = 50
    p = Pool(processes=how_many)
    # generate our random starting points
    parse_us = [random_starting_url() for _ in range(how_many)]

    # map each starting url to a process running get_links
    data = p.map(get_links, parse_us)
    # flatten the list of lists into a single list of urls
    data = [url for url_list in data for url in url_list]
    p.close()

    with open('urls.txt','w') as f:
        f.write(str(data))

if __name__ == '__main__':
    main()

In this case, we're taking the list of all of the URLs and writing them to a file, but, with a full spider, you'd take that list, make it the new parse_us, and keep that loop going indefinitely, as sketched below.
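A rough sketch of that idea might look like the following. The while loop, the append mode on the file, and the cap on how many found URLs get fed back in each round are all just illustrative choices, not the one true way to do it:

def main():
    how_many = 50
    p = Pool(processes=how_many)
    parse_us = [random_starting_url() for _ in range(how_many)]

    # keep crawling: the links we find become the next batch we parse
    while parse_us:
        data = p.map(get_links, parse_us)
        found = [url for url_list in data for url in url_list]

        with open('urls.txt', 'a') as f:
            f.write(str(found) + '\n')

        # cap the next round so the pool isn't handed an enormous batch;
        # a real spider would also keep track of urls it has already visited
        parse_us = found[:how_many]

    p.close()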

If you're more curious about what each line is doing, check out the video version of this tutorial above, as each line is explained.

In the next tutorial, we're going to introduce Object Oriented Programming (OOP).
