Now it's time to choose an algorithm, separate our data into training and testing sets, and press go! The first algorithm we're going to use is the Naive Bayes classifier. It's a popular algorithm for text classification, so it's only fitting that we try it first. Before we can train and test it, however, we need to split the data into a training set and a testing set.
You could train and test on the same dataset, but the classifier would then be graded on examples it has already seen, which makes the accuracy look better than it really is, so you should never train and test against the exact same data. Since we've shuffled our dataset, we'll assign the first 1,900 shuffled reviews, consisting of both positive and negative reviews, as the training set. Then we can test against the last 100 to see how accurate we are.
This is called supervised machine learning, because we're showing the machine data and telling it "hey, this data is positive," or "this data is negative." Then, after that training is done, we show the machine some new data and ask it, based on what we taught it before, what it thinks the category of the new data is.
We can split the data with:
# set that we'll train our classifier with
training_set = featuresets[:1900]

# set that we'll test against.
testing_set = featuresets[1900:]
Next, we can define and train our classifier like:
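To see the split in isolation, here is a minimal sketch that doesn't need the real reviews. The stand-in featuresets list, the review_id key, and the fixed seed are all made up for illustration; the only thing taken from the tutorial is the 1,900/100 slicing.

```python
import random

# Hypothetical stand-in for the real featuresets: 2,000 (features, label)
# pairs, tagged with a made-up "review_id" so we can tell them apart.
featuresets = [
    ({"review_id": i}, "pos" if i % 2 == 0 else "neg") for i in range(2000)
]

random.seed(0)  # fixed seed only so this sketch is repeatable
random.shuffle(featuresets)

# Same slicing as in the tutorial: first 1,900 to train, last 100 to test.
training_set = featuresets[:1900]
testing_set = featuresets[1900:]

# Sanity check: no review should land in both sets.
train_ids = {features["review_id"] for features, _ in training_set}
test_ids = {features["review_id"] for features, _ in testing_set}
overlap = train_ids & test_ids

print(len(training_set), len(testing_set), len(overlap))
```

Because the slices don't overlap, every review the classifier is tested on is one it never saw during training.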
classifier = nltk.NaiveBayesClassifier.train(training_set)
Here we invoke the Naive Bayes classifier and use .train() to train it, all in one line.
Easy enough. Now that it's trained, we can test it:
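If you want to poke at the training call on its own, here is a small sketch on toy data. The toy_training_set, the "great" and "boring" features, and the "pos"/"neg" labels are invented for illustration; the point is just the shape of the input, a list of (feature dictionary, label) pairs, which is what NaiveBayesClassifier.train() takes.

```python
import nltk

# Hypothetical toy data: each item is a (feature_dict, label) pair.
toy_training_set = [
    ({"great": True, "boring": False}, "pos"),
    ({"great": True, "boring": False}, "pos"),
    ({"great": True, "boring": False}, "pos"),
    ({"great": False, "boring": True}, "pos"),
    ({"great": False, "boring": True}, "neg"),
    ({"great": False, "boring": True}, "neg"),
    ({"great": False, "boring": True}, "neg"),
    ({"great": True, "boring": False}, "neg"),
]

# train() is called on the class itself and hands back a trained classifier.
toy_classifier = nltk.NaiveBayesClassifier.train(toy_training_set)

# Ask the trained classifier to label a new, unlabeled feature dictionary.
prediction = toy_classifier.classify({"great": True, "boring": False})
print(prediction)
```

With this toy data, "great" shows up mostly in the positive examples, so the classifier should lean positive for a review that has it.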
print("Classifier accuracy percent:",(nltk.classify.accuracy(classifier, testing_set))*100)
Boom, you have your answer. In case you missed it, the reason we can "test" the data is that we still have the correct answers. So, in testing, we show the computer the data without giving it the correct answer. If its guess matches what we know the answer to be, then the computer got it right. Given the shuffling that we've done, you and I might get different accuracy, but you should see something in the 60-75% range on average.
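The idea behind that accuracy number can be sketched in a few lines of plain Python. This is not NLTK's code, just an illustration of the same idea; the accuracy helper, toy_classify, and toy_testing_set are all hypothetical names made up for the example.

```python
def accuracy(classify, labeled_examples):
    """Fraction of examples whose predicted label equals the known label."""
    correct = 0
    for features, true_label in labeled_examples:
        # The classifier only sees the features, never the true label.
        if classify(features) == true_label:
            correct += 1
    return correct / len(labeled_examples)


def toy_classify(features):
    """Hypothetical stand-in for a trained classifier."""
    return "pos" if features.get("great") else "neg"


toy_testing_set = [
    ({"great": True}, "pos"),   # guessed pos, correct
    ({"great": False}, "neg"),  # guessed neg, correct
    ({"great": True}, "neg"),   # guessed pos, wrong
    ({"great": False}, "neg"),  # guessed neg, correct
]

toy_accuracy = accuracy(toy_classify, toy_testing_set)
print("Toy accuracy percent:", toy_accuracy * 100)
```

Three of the four guesses match the known answers, so the toy classifier scores 75%.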
Next, we can take it a step further to see what the most valuable words are when it comes to positive or negative reviews:
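One way to do this is NLTK's show_most_informative_features() method, which prints the features that best separate the labels. Here is a minimal sketch on toy data; the toy_training_set and its features are invented for illustration, and the 2 is an arbitrary count. On the classifier trained above you would call something like classifier.show_most_informative_features(15), where 15 is likewise just a choice.

```python
import nltk

# Hypothetical toy data in the same (feature_dict, label) shape as before.
toy_training_set = [
    ({"great": True, "boring": False}, "pos"),
    ({"great": True, "boring": False}, "pos"),
    ({"great": True, "boring": False}, "pos"),
    ({"great": False, "boring": True}, "pos"),
    ({"great": False, "boring": True}, "neg"),
    ({"great": False, "boring": True}, "neg"),
    ({"great": False, "boring": True}, "neg"),
    ({"great": True, "boring": False}, "neg"),
]

toy_classifier = nltk.NaiveBayesClassifier.train(toy_training_set)

# Print the top 2 features along with the ratio between the labels.
toy_classifier.show_most_informative_features(2)

# The same ranking is available as a list of (feature_name, value) pairs.
top_features = toy_classifier.most_informative_features(2)
```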
This is going to vary again for each person, but you should see something like:
What this tells you is the ratio of occurrences in negative to positive reviews, or vice versa, for every word. So here, we can see that the term "insulting" appears 10.6 times as often in negative reviews as it does in positive reviews. For "ludicrous", the ratio is 10.1.
Now, let's say you were totally content with your results, and you wanted to move forward, maybe using this classifier to predict things right now. It would be very impractical to retrain the classifier every time you needed to use it. As such, you can save the classifier using the pickle module. Let's do that next.
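To make the ratio concrete, here is a rough sketch that computes it from raw counts on a handful of made-up reviews. The presence_ratio helper and toy_reviews are hypothetical, and this is a simplification: NLTK works from smoothed probability estimates rather than raw counts, so its figures won't match a plain count ratio exactly.

```python
def presence_ratio(word, reviews):
    """Rate of `word` in negative reviews divided by its rate in positive ones."""
    containing = {"pos": 0, "neg": 0}
    totals = {"pos": 0, "neg": 0}
    for words, label in reviews:
        totals[label] += 1
        if word in words:
            containing[label] += 1
    neg_rate = containing["neg"] / totals["neg"]
    pos_rate = containing["pos"] / totals["pos"]
    # Note: this raw version breaks if the word never appears in a positive
    # review (division by zero), which is one reason smoothing is useful.
    return neg_rate / pos_rate


# Hypothetical toy reviews: (set of words in the review, label).
toy_reviews = [
    ({"insulting", "plot"}, "neg"),
    ({"insulting", "dull"}, "neg"),
    ({"insulting", "long"}, "neg"),
    ({"slow", "dull"}, "neg"),
    ({"insulting", "funny"}, "pos"),
    ({"great", "plot"}, "pos"),
    ({"great", "funny"}, "pos"),
    ({"warm", "funny"}, "pos"),
]

ratio = presence_ratio("insulting", toy_reviews)
print(ratio)
```

Here "insulting" is in 3 of 4 negative reviews and 1 of 4 positive ones, so it shows up 3 times as often on the negative side.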