Simple website scraping

This is just a very basic example of web scraping on your own.

For more advanced parsing, you can improve the regular expression, or look into a module like Beautiful Soup.

I pasted the code from this specific video below; the regular expressions could be made more tolerant than what's shown.
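As one example of a more tolerant pattern (my own suggestion, shown in Python 3 syntax, not from the video): the title regex below ignores case, allows attributes on the tag, and lets the match span newlines.

```python
import re

# A small sample document to test against (hypothetical, for illustration).
html = """<HTML><head>
<Title>
  Example Feed
</Title></head>
<body><p>hello</p></body></HTML>"""

# re.IGNORECASE matches <Title> as well as <title>; re.DOTALL lets .*?
# cross newlines inside the tag; [^>]* tolerates tag attributes.
title_pattern = re.compile(r'<title[^>]*>(.*?)</title>', re.IGNORECASE | re.DOTALL)

titles = [t.strip() for t in title_pattern.findall(html)]
print(titles)  # ['Example Feed']
```

The plain `<title>(.*?)</title>` used in the video misses both of those cases, which is part of why modules like Beautiful Soup are the usual recommendation for anything beyond a quick script.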

import time
import urllib2
from urllib2 import urlopen
import re
from cookielib import CookieJar
import datetime

# A cookie jar plus a browser-like User-agent header, so sites
# treat the script like an ordinary visitor.
cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

def main():
    try:
        page = ''  # URL of the page to scrape
        sourceCode = opener.open(page).read()
        #print sourceCode
        try:
            titles = re.findall(r'<title>(.*?)</title>', sourceCode)
            links = re.findall(r'<link.*?href=\"(.*?)\"', sourceCode)
            #for title in titles:
            #    print title
            for link in links:
                if '.rdf' in link:
                    print 'let\'s visit:', link
                    linkSource = opener.open(link).read()
                    content = re.findall(r'<p>(.*?)</p>', linkSource)
                    for theContent in content:
                        print theContent
        except Exception, e:
            print str(e)
    except Exception, e:
        print str(e)

main()
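The script above is Python 2 only: `urllib2` and `cookielib` became `urllib.request` and `http.cookiejar` in Python 3, and `.read()` now returns bytes. A minimal Python 3 sketch of the same idea (the page URL is left blank, as in the original) might look like:

```python
import re
import urllib.request
from http.cookiejar import CookieJar

# Same setup as the Python 2 version: cookie jar plus a
# browser-like User-agent header.
cj = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

def scrape(page):
    try:
        # .read() returns bytes in Python 3, so decode before regexing
        source_code = opener.open(page).read().decode('utf-8', 'replace')
        links = re.findall(r'<link.*?href=\"(.*?)\"', source_code)
        for link in links:
            if '.rdf' in link:
                print("let's visit:", link)
                link_source = opener.open(link).read().decode('utf-8', 'replace')
                for paragraph in re.findall(r'<p>(.*?)</p>', link_source):
                    print(paragraph)
    except Exception as e:
        # Mirror the original's behavior: print the error and move on
        print(str(e))
```

Errors (bad URL, network failure) are printed rather than raised, matching the try/except style of the original.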


The next tutorial:

  • Simple RSS feed scraping
  • Simple website scraping
  • More Parsing/Scraping
  • Installing the Natural Language Toolkit (NLTK)
  • NLTK Part of Speech Tagging Tutorial
  • Named Entity Recognition NLTK tutorial
  • Building a Knowledge-base
  • More Named Entity Recognition with NLTK
  • Pulling related Sentiment about Named Entities
  • Populating a knowledge-base
  • What next?
  • Accuracy Testing
  • Building back-testing
  • Machine learning and Sentiment Analysis