Python Programming Tutorials

Building an Inverted Index Using Python and NLTK

Hi,

I need to build a Python program that reads a set of .txt files (some Gutenberg files) and uses the NLTK library to tokenize, normalize, stem, and remove stop words, and then builds an inverted index of all tokens in all files. The idea is to look up each token in all of the provided files and build an inverted index that lists each token along with its corresponding occurrences.

This is what I have coded so far.


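import os

from nltk.corpus import PlaintextCorpusReader, stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Needs the NLTK stop word list and tokenizer data (installed via nltk.download).
stop_words = set(stopwords.words("english"))
# Pass the punctuation marks as one list; union() treats each string
# argument as an iterable of characters.
stop_words = stop_words.union([",", "(", ")", "[", "]", "{", "}", "#", "@", "!", ":", ";", ".", "?"])

path = "gutenberg"
file_type = r".*\.txt"  # escape the dot so it matches a literal ".txt"
# The reader is used here only to list the matching files.
reader = PlaintextCorpusReader(path, file_type)
file_names = reader.fileids()
print(file_names)

stemmed_tokens = []
porter = PorterStemmer()

for file_name in file_names:
    link = os.path.join(path, file_name)
    # Python 3: decode while reading; str objects have no .decode() method.
    with open(link, encoding="utf-8") as f:
        raw = f.read()
    tokens = word_tokenize(raw.lower())
    for w in tokens:
        # Filter before stemming: a stemmed stop word may no longer match the list.
        if w in stop_words or w.isdigit():
            continue
        x = porter.stem(w)
        if x not in stemmed_tokens:
            stemmed_tokens.append(x)

print(stemmed_tokens)
print("")
print(stop_words)
print("")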
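
The code above only collects the unique stems; it does not yet record where each one occurs. Below is a rough sketch of the index step I have in mind, not a finished implementation. The function name build_inverted_index, the shape of docs (file name mapped to a list of lowercased tokens), and the toy data are my own assumptions for illustration, and the identity "stemmer" just keeps the example free of NLTK.

```python
from collections import defaultdict


def build_inverted_index(docs, stop_words, stem):
    """Map each stem to {file_name: [token positions]}.

    docs       -- dict of file name -> list of lowercased tokens (assumed shape)
    stop_words -- set of tokens to skip
    stem       -- callable that stems one token, e.g. PorterStemmer().stem
    """
    index = defaultdict(dict)
    for file_name, tokens in docs.items():
        for position, token in enumerate(tokens):
            # Skip stop words and pure numbers before stemming.
            if token in stop_words or token.isdigit():
                continue
            postings = index[stem(token)]
            postings.setdefault(file_name, []).append(position)
    return dict(index)


# Toy data standing in for the tokenized Gutenberg files.
docs = {
    "a.txt": ["the", "whale", "swims"],
    "b.txt": ["a", "whale", "and", "the", "sea"],
}
# Identity function used as a placeholder stemmer for this example.
index = build_inverted_index(docs, {"the", "a", "and"}, lambda t: t)
print(index["whale"])  # {'a.txt': [1], 'b.txt': [1]}
```

With the real files, docs would be filled from word_tokenize(raw.lower()) for each file, and porter.stem would be passed as the stem argument. Storing positions per file is one possible choice; a set of file names per stem would be enough if only "which files contain this token" is needed.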

Thanks in advance!
