With this new dataset and new classifier, we're ready to move forward. As you probably noticed, this new dataset takes even longer to train against, since it's a larger set. As shown before, we can save a ton of time by pickling, or serializing, the trained classifiers, which are just Python objects.
You've already been shown how to use pickle to do this, so I encourage you to attempt it on your own. In case you need help, I will paste the full code to do that here...but seriously, do it yourself!
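As a quick refresher before the full script, the whole pattern boils down to pickle.dump to save and pickle.load to restore. The file name here is just a placeholder:

import pickle

# save a trained classifier (any Python object) to disk
save_file = open("my_classifier.pickle", "wb")
pickle.dump(classifier, save_file)
save_file.close()

# later, load it back instead of retraining
load_file = open("my_classifier.pickle", "rb")
classifier = pickle.load(load_file)
load_file.close()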
This process will take a while, so you may want to just go run some errands. It took me about 30-40 minutes to run in full, and I am running an i7-3930K. For the typical processor in the year I am writing this (2015), it may be hours. This is a one-and-done process, however.
import nltk
import random
#from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier
import pickle
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from nltk.classify import ClassifierI
from statistics import mode
from nltk.tokenize import word_tokenize


class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)

    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)

        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf


short_pos = open("short_reviews/positive.txt","r").read()
short_neg = open("short_reviews/negative.txt","r").read()

# move this up here
all_words = []
documents = []

# j is adjective, r is adverb, and v is verb
#allowed_word_types = ["J","R","V"]
allowed_word_types = ["J"]

for p in short_pos.split('\n'):
    documents.append( (p, "pos") )
    words = word_tokenize(p)
    pos = nltk.pos_tag(words)
    for w in pos:
        if w[1][0] in allowed_word_types:
            all_words.append(w[0].lower())

for p in short_neg.split('\n'):
    documents.append( (p, "neg") )
    words = word_tokenize(p)
    pos = nltk.pos_tag(words)
    for w in pos:
        if w[1][0] in allowed_word_types:
            all_words.append(w[0].lower())


# the pickled_algos/ directory must already exist before we write into it
save_documents = open("pickled_algos/documents.pickle","wb")
pickle.dump(documents, save_documents)
save_documents.close()

all_words = nltk.FreqDist(all_words)

word_features = list(all_words.keys())[:5000]

save_word_features = open("pickled_algos/word_features5k.pickle","wb")
pickle.dump(word_features, save_word_features)
save_word_features.close()


def find_features(document):
    words = word_tokenize(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features


featuresets = [(find_features(rev), category) for (rev, category) in documents]

random.shuffle(featuresets)
print(len(featuresets))

# save the featuresets too, so sentiment_mod.py can load them
# later without recomputing them
save_featuresets = open("pickled_algos/featuresets.pickle","wb")
pickle.dump(featuresets, save_featuresets)
save_featuresets.close()

testing_set = featuresets[10000:]
training_set = featuresets[:10000]


classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Original Naive Bayes Algo accuracy percent:", (nltk.classify.accuracy(classifier, testing_set))*100)
classifier.show_most_informative_features(15)

###############
save_classifier = open("pickled_algos/originalnaivebayes5k.pickle","wb")
pickle.dump(classifier, save_classifier)
save_classifier.close()

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy percent:", (nltk.classify.accuracy(MNB_classifier, testing_set))*100)

save_classifier = open("pickled_algos/MNB_classifier5k.pickle","wb")
pickle.dump(MNB_classifier, save_classifier)
save_classifier.close()

BernoulliNB_classifier = SklearnClassifier(BernoulliNB())
BernoulliNB_classifier.train(training_set)
print("BernoulliNB_classifier accuracy percent:", (nltk.classify.accuracy(BernoulliNB_classifier, testing_set))*100)

save_classifier = open("pickled_algos/BernoulliNB_classifier5k.pickle","wb")
pickle.dump(BernoulliNB_classifier, save_classifier)
save_classifier.close()

LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(training_set)
print("LogisticRegression_classifier accuracy percent:", (nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100)

save_classifier = open("pickled_algos/LogisticRegression_classifier5k.pickle","wb")
pickle.dump(LogisticRegression_classifier, save_classifier)
save_classifier.close()

LinearSVC_classifier = SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(training_set)
print("LinearSVC_classifier accuracy percent:", (nltk.classify.accuracy(LinearSVC_classifier, testing_set))*100)

save_classifier = open("pickled_algos/LinearSVC_classifier5k.pickle","wb")
pickle.dump(LinearSVC_classifier, save_classifier)
save_classifier.close()

##NuSVC_classifier = SklearnClassifier(NuSVC())
##NuSVC_classifier.train(training_set)
##print("NuSVC_classifier accuracy percent:", (nltk.classify.accuracy(NuSVC_classifier, testing_set))*100)

SGDC_classifier = SklearnClassifier(SGDClassifier())
SGDC_classifier.train(training_set)
print("SGDClassifier accuracy percent:", nltk.classify.accuracy(SGDC_classifier, testing_set)*100)

save_classifier = open("pickled_algos/SGDC_classifier5k.pickle","wb")
pickle.dump(SGDC_classifier, save_classifier)
save_classifier.close()
You only need to run this once, though you can always run it again if you want. Now you are ready to create the sentiment analysis module. Here's the file, which we're going to call sentiment_mod.py:
#File: sentiment_mod.py

import nltk
import random
#from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier
import pickle
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from nltk.classify import ClassifierI
from statistics import mode
from nltk.tokenize import word_tokenize


class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)

    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)

        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf


documents_f = open("pickled_algos/documents.pickle", "rb")
documents = pickle.load(documents_f)
documents_f.close()

word_features5k_f = open("pickled_algos/word_features5k.pickle", "rb")
word_features = pickle.load(word_features5k_f)
word_features5k_f.close()


def find_features(document):
    words = word_tokenize(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features


featuresets_f = open("pickled_algos/featuresets.pickle", "rb")
featuresets = pickle.load(featuresets_f)
featuresets_f.close()

random.shuffle(featuresets)
print(len(featuresets))

testing_set = featuresets[10000:]
training_set = featuresets[:10000]


open_file = open("pickled_algos/originalnaivebayes5k.pickle", "rb")
classifier = pickle.load(open_file)
open_file.close()

open_file = open("pickled_algos/MNB_classifier5k.pickle", "rb")
MNB_classifier = pickle.load(open_file)
open_file.close()

open_file = open("pickled_algos/BernoulliNB_classifier5k.pickle", "rb")
BernoulliNB_classifier = pickle.load(open_file)
open_file.close()

open_file = open("pickled_algos/LogisticRegression_classifier5k.pickle", "rb")
LogisticRegression_classifier = pickle.load(open_file)
open_file.close()

open_file = open("pickled_algos/LinearSVC_classifier5k.pickle", "rb")
LinearSVC_classifier = pickle.load(open_file)
open_file.close()

open_file = open("pickled_algos/SGDC_classifier5k.pickle", "rb")
SGDC_classifier = pickle.load(open_file)
open_file.close()


voted_classifier = VoteClassifier(
                                  classifier,
                                  LinearSVC_classifier,
                                  MNB_classifier,
                                  BernoulliNB_classifier,
                                  LogisticRegression_classifier)


def sentiment(text):
    feats = find_features(text)
    return voted_classifier.classify(feats), voted_classifier.confidence(feats)
So here, there's really nothing new besides the final function, which is quite simple; this function is the crux of what we will be interacting with from here on out. The function, which we're calling "sentiment," takes one parameter: the text. From there, we build the features with the find_features function we created long ago. Then, all we need to do is use our voted_classifier to return not only the classification, but also the confidence in that classification. Since five classifiers vote, the confidence is simply the fraction that agree with the majority: a 3-2 split gives 0.6, a 4-1 split gives 0.8, and a unanimous vote gives 1.0.
With that, we can now use this file, and its sentiment function, as a module. Here's an example script that might utilize the module:
import sentiment_mod as s

print(s.sentiment("This movie was awesome! The acting was great, plot was wonderful, and there were pythons...so yea!"))
print(s.sentiment("This movie was utter junk. There were absolutely 0 pythons. I don't see what the point was at all. Horrible movie, 0/10"))
As expected, the movie with pythons obviously did very well with reviewers, and the movie without any pythons was junk. Both classifications came back with 100% confidence as well.
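In tuple form, that corresponds to output along these lines (classification first, confidence second):

('pos', 1.0)
('neg', 1.0)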
It took me about 5 seconds to import the module, since we pickled the classifiers, as compared to the 30-ish minutes it took without pickling. Yay for pickling. Your time will vary greatly depending on your processor. If you continue down this path, you may also want to look into joblib.
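joblib's dump and load mirror pickle's interface but are often faster for large objects. A minimal sketch, assuming joblib is installed (pip install joblib) and using a hypothetical file name:

from joblib import dump, load

# save a trained classifier
dump(classifier, "pickled_algos/originalnaivebayes5k.joblib")

# restore it later without retraining
classifier = load("pickled_algos/originalnaivebayes5k.joblib")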
Now that we have this awesome module, and it works easily, what can we do? I propose we take to Twitter to perform live sentiment analysis!