
Simple RSS feed scraping

The first step in using NLTK, or in doing any natural language processing, is acquiring data. There are many ways to do this, and this tutorial shows one very basic method. Since many websites offer RSS feeds of their content, we'll start by pulling the titles and links out of an RSS feed.

import re
import urllib2
from cookielib import CookieJar

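# Build an opener that stores cookies and sends a browser-like User-agent,
# since some sites reject requests from the default Python user agent.
cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/5.0')]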

def main():
    try:
        page = 'http://www.huffingtonpost.com/feeds/index.xml'
        sourceCode = opener.open(page).read()
        #print sourceCode

        try:
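            # Capture the text inside every <title> and <link> element of the feed.
            titles = re.findall(r'<title>(.*?)</title>',sourceCode)
            links = re.findall(r'<link>(.*?)</link>',sourceCode)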
            for title in titles:
                print title
            for link in links:
                print link
        except Exception, e:
            print str(e)

    except Exception,e:
        # Report network or HTTP errors instead of crashing.
        print str(e)

main()
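
The listing above is written for Python 2. As a rough sketch of the same idea on Python 3, where urllib2 and cookielib became urllib.request and http.cookiejar, something like the following should work. The names extract_titles_and_links, fetch_feed, and SAMPLE_FEED are made up for this illustration and are not part of the original tutorial; the sample feed is invented data so the parsing step can be tried without a network connection.

```python
import re
import urllib.request
from http.cookiejar import CookieJar

# Hypothetical sample feed (invented data) so the parsing can run offline.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Example Feed</title>
<link>http://example.com/</link>
<item><title>First story</title><link>http://example.com/1</link></item>
<item><title>Second story</title><link>http://example.com/2</link></item>
</channel></rss>"""

# Same patterns as the tutorial; DOTALL lets a value span line breaks.
TITLE_RE = re.compile(r'<title>(.*?)</title>', re.DOTALL)
LINK_RE = re.compile(r'<link>(.*?)</link>', re.DOTALL)


def extract_titles_and_links(source_code):
    """Return (titles, links) found in an RSS document given as a string."""
    titles = [t.strip() for t in TITLE_RE.findall(source_code)]
    links = [l.strip() for l in LINK_RE.findall(source_code)]
    return titles, links


def fetch_feed(url):
    """Download a feed with a cookie-aware opener and a browser-like User-agent."""
    cj = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    with opener.open(url, timeout=10) as response:
        # Assumes the feed is UTF-8; undecodable bytes are replaced.
        return response.read().decode('utf-8', errors='replace')


if __name__ == '__main__':
    # Swap SAMPLE_FEED for fetch_feed(some_feed_url) to scrape a live feed.
    titles, links = extract_titles_and_links(SAMPLE_FEED)
    for title in titles:
        print(title)
    for link in links:
        print(link)
```

Keeping the download and the parsing in separate functions makes the regular expressions easy to check against a known string. Note that the first title and link belong to the feed itself rather than to a story, and that plain regular expressions will not unwrap values a feed encloses in CDATA sections.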
