I need to build a Python program that reads a set of txt files (some Gutenberg files), then uses the NLTK library to tokenize, normalize, stem, and remove stop words, and finally builds an inverted index over all tokens in all files. So basically the idea is to search for each token in all of the provided files and build an inverted index that shows each token along with its corresponding occurrences.
This is what I have coded so far.
```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

porter = PorterStemmer()
stop_words = set(stopwords.words('english'))
stemmed_tokens = []

# file_names is the list of Gutenberg .txt files
for link in file_names:
    link = 'gutenberg/' + link
    # open with an explicit encoding; str has no .decode() in Python 3
    raw = open(link, encoding='utf-8').read()
    tokens = word_tokenize(raw.lower())
    for w in tokens:
        x = porter.stem(w)
        if x not in stop_words and x not in stemmed_tokens and not x.isdigit():
            stemmed_tokens.append(x)
```
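For the inverted-index part, here is a minimal sketch of the direction I'm thinking of. It assumes the per-file stemming above has already produced a dict mapping each filename to its list of stemmed tokens (the `docs` variable and `build_inverted_index` name are hypothetical, and NLTK is left out here so the snippet runs on its own):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict mapping filename -> list of (already stemmed) tokens.
    Returns a dict mapping each token -> {filename: occurrence count}."""
    index = defaultdict(lambda: defaultdict(int))
    for name, tokens in docs.items():
        for tok in tokens:
            index[tok][name] += 1
    # convert the nested defaultdicts to plain dicts for the final result
    return {tok: dict(postings) for tok, postings in index.items()}

# hypothetical sample input standing in for the stemmed Gutenberg files
docs = {
    "a.txt": ["whale", "sea", "whale"],
    "b.txt": ["sea", "ship"],
}
index = build_inverted_index(docs)
print(index["whale"])  # {'a.txt': 2}
print(index["sea"])    # {'a.txt': 1, 'b.txt': 1}
```

With this shape, looking up a token gives its postings (which files it occurs in, and how many times), which is the "token along with its corresponding occurrences" output I'm after.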