Converting words to Features with NLTK




In this tutorial, we're going to build on the previous video by compiling feature lists of words from positive reviews and from negative reviews, so we can start to see trends in the kinds of words that show up in each category.

To start, our code:

import nltk
import random
from nltk.corpus import movie_reviews

# Each document is a (list of words, category) tuple, where category is 'pos' or 'neg'
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Shuffle so positive and negative reviews are mixed before we later split for training/testing
random.shuffle(documents)

# Lowercase every word in the corpus, then count word frequencies
all_words = []

for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)

# Keep the 3,000 most common words as our feature vocabulary
# (most_common returns (word, count) pairs sorted by frequency)
word_features = [w for (w, count) in all_words.most_common(3000)]
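
If you want a quick sanity check that the frequency distribution looks reasonable before moving on, you can peek at the most frequent entries and at the count for a single word. This is just an optional check, assuming the same all_words variable as above:

# Optional sanity check: the most frequent "words" are mostly punctuation and stopwords
print(all_words.most_common(15))
print(all_words["stupid"])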

Mostly the same as before, only now with a new variable, word_features, which contains the top 3,000 most common words. Next, we're going to build a quick function that checks which of these top 3,000 words appear in a given document, marking each word's presence with True or False:

def find_features(document):
    # For each of the top 3,000 words, record whether it appears in the
    # document (True) or not (False)
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features

Next, we can print one of these feature sets, using one of the known-negative reviews, like so:

print(find_features(movie_reviews.words('neg/cv000_29416.txt')))
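
Printing the full dictionary dumps 3,000 booleans to the terminal. If you just want to spot-check a few entries, a small sketch like the following works (the islice import and the features variable here are just for illustration, using the same find_features function as above):

from itertools import islice

# Build the feature dict for one negative review, then look at the first ten entries
features = find_features(movie_reviews.words('neg/cv000_29416.txt'))
for word, present in islice(features.items(), 10):
    print(word, present)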

Then we can do this for all of our documents, pairing each document's dictionary of feature booleans with its positive or negative category:

featuresets = [(find_features(rev), category) for (rev, category) in documents]
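
As a quick sanity check (the numbers below assume the standard movie_reviews corpus of 1,000 positive and 1,000 negative reviews), each entry in featuresets is a (features, category) pair:

# 2,000 labeled feature sets in total; their order is random because documents was shuffled
print(len(featuresets))       # 2000
print(featuresets[0][1])      # 'pos' or 'neg'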

Awesome, now that we have our features and labels, what's next? Typically, the next step is to train an algorithm and then test it. We'll do exactly that, starting with the Naive Bayes classifier, in the next tutorial!
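
To give a rough preview of where that's headed (the next tutorial covers this properly), one common pattern is to split the shuffled feature sets into a training set and a testing set, train on one, and measure accuracy on the other. The 1,900/100 split below is just an illustrative choice:

# Preview sketch: train on the first 1,900 feature sets, test on the remaining 100
training_set = featuresets[:1900]
testing_set = featuresets[1900:]

classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Naive Bayes accuracy:", nltk.classify.accuracy(classifier, testing_set))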

The next tutorial: Naive Bayes Classifier with NLTK