Simple RSS feed scraping

The first step to using NLTK, or doing any natural language processing, is acquiring data. There are many ways to do this, but I would like to at least show one very basic method. Since many websites offer RSS feeds of their content, we're going to cover how to pull titles and links from an RSS feed.

import re
import urllib2
from cookielib import CookieJar

# Build a cookie-aware opener with a browser-style User-agent,
# since some sites refuse requests from the default urllib2 agent.
cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

def main():
    try:
        page = 'http://www.huffingtonpost.com/feeds/index.xml'
        sourceCode = opener.open(page).read()
        #print sourceCode

        try:
            # Each story's title and link live inside <title> and
            # <link> tags; the non-greedy .*? grabs just the contents.
            titles = re.findall(r'<title>(.*?)</title>',sourceCode)
            links = re.findall(r'<link>(.*?)</link>',sourceCode)
            for title in titles:
                print title
            for link in links:
                print link
        except Exception, e:
            print str(e)

    except Exception, e:
        print str(e)

main()
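
Running this should print every story title first, then every story link. Regular expressions are fine for a quick pull like this, but since an RSS feed is just XML, a sturdier option is Python's built-in xml.etree.ElementTree parser. Below is a minimal sketch of the same scrape, assuming the feed is standard RSS 2.0, where each story sits inside its own <item> tag:

import urllib2
from xml.etree import ElementTree

# Fetch the feed; you could just as easily reuse the cookie-aware
# opener built above.
feed = urllib2.urlopen('http://www.huffingtonpost.com/feeds/index.xml').read()
root = ElementTree.fromstring(feed)

# In RSS 2.0 each story is an <item> element with its own <title>
# and <link>, so this skips the channel-level tags that the broad
# regex above also matches.
for item in root.iter('item'):
    print item.find('title').text
    print item.find('link').text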
		

The next tutorial: Simple website scraping