Python Programming Tutorials

Scikit-Learn Sklearn with NLTK

We've seen by now how easy it can be to use classifiers out of the box, and now we want to try some more! A widely used Python module for this is Scikit-learn (sklearn).

If you would like to learn more about the Scikit-learn Module, I have some tutorials on machine learning with Scikit-Learn.

Luckily for us, the people behind NLTK foresaw the value of incorporating the sklearn module into the NLTK classifier methodology. As such, they created SklearnClassifier, a wrapper that lets you use a sklearn classifier through the usual NLTK classifier interface. To use it, you just need to import it like:

from nltk.classify.scikitlearn import SklearnClassifier
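As a rough illustration of what the wrapper does, here is a minimal sketch on a made-up toy data set. The feature names (great, boring), the four training examples, and the toy_ variable names are assumptions for illustration only, not data from this series. The input has the same shape NLTK classifiers expect: a list of (feature dictionary, label) pairs.

```python
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import BernoulliNB

# Toy (feature dict, label) pairs; invented for illustration only.
toy_training_set = [
    ({"great": True, "boring": False}, "pos"),
    ({"great": True, "boring": False}, "pos"),
    ({"great": False, "boring": True}, "neg"),
    ({"great": False, "boring": True}, "neg"),
]

# Wrap the sklearn estimator so it exposes NLTK's train/classify methods.
toy_classifier = SklearnClassifier(BernoulliNB())
toy_classifier.train(toy_training_set)

# classify() takes a single feature dict and returns a label.
toy_label = toy_classifier.classify({"great": True, "boring": False})
print(toy_label)
```

The wrapper handles the conversion from feature dictionaries to the numeric matrix that sklearn expects, so the rest of your NLTK code does not need to change.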

From here, you can use just about any of the sklearn classifiers. For example, let's bring in a couple more variations of the Naive Bayes algorithm:

from sklearn.naive_bayes import MultinomialNB,BernoulliNB

With this, how might we use them? It turns out, this is very simple:

# nltk, training_set and testing_set are assumed to be defined as in the earlier tutorials.
MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
# accuracy() returns a fraction, so multiply by 100 to report a percent.
print("MultinomialNB accuracy percent:", nltk.classify.accuracy(MNB_classifier, testing_set)*100)

BNB_classifier = SklearnClassifier(BernoulliNB())
BNB_classifier.train(training_set)
print("BernoulliNB accuracy percent:", nltk.classify.accuracy(BNB_classifier, testing_set)*100)
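Keep in mind that nltk.classify.accuracy returns a fraction between 0 and 1 rather than a percentage: it is the share of test items whose predicted label matches the true label. The sketch below mirrors that calculation in plain Python. The helper fraction_correct and the sample labels are hypothetical, used only for illustration.

```python
def fraction_correct(predicted, gold):
    """Return the share of positions where predicted and gold labels agree."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)


# Made-up labels standing in for classifier output and the true answers.
predicted_labels = ["pos", "neg", "pos", "pos"]
gold_labels = ["pos", "neg", "neg", "pos"]

score = fraction_correct(predicted_labels, gold_labels)
print("accuracy fraction:", score)
print("accuracy percent:", score * 100)
```

Three of the four made-up predictions match, so the fraction is 0.75, which is 75 when reported as a percent.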

It is as simple as that. Let's bring in some more:

from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC

Now, all of our classifiers should look something like:

print("Original Naive Bayes Algo accuracy percent:", (nltk.classify.accuracy(classifier, testing_set))*100)
classifier.show_most_informative_features(15)

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy percent:", (nltk.classify.accuracy(MNB_classifier, testing_set))*100)

BernoulliNB_classifier = SklearnClassifier(BernoulliNB())
BernoulliNB_classifier.train(training_set)
print("BernoulliNB_classifier accuracy percent:", (nltk.classify.accuracy(BernoulliNB_classifier, testing_set))*100)

LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(training_set)
print("LogisticRegression_classifier accuracy percent:", (nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100)

SGDClassifier_classifier = SklearnClassifier(SGDClassifier())
SGDClassifier_classifier.train(training_set)
print("SGDClassifier_classifier accuracy percent:", (nltk.classify.accuracy(SGDClassifier_classifier, testing_set))*100)

SVC_classifier = SklearnClassifier(SVC())
SVC_classifier.train(training_set)
print("SVC_classifier accuracy percent:", (nltk.classify.accuracy(SVC_classifier, testing_set))*100)

LinearSVC_classifier = SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(training_set)
print("LinearSVC_classifier accuracy percent:", (nltk.classify.accuracy(LinearSVC_classifier, testing_set))*100)

NuSVC_classifier = SklearnClassifier(NuSVC())
NuSVC_classifier.train(training_set)
print("NuSVC_classifier accuracy percent:", (nltk.classify.accuracy(NuSVC_classifier, testing_set))*100)
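Since every block above repeats the same three steps, the pattern can also be written as a loop. This is only a sketch: it uses a tiny invented data set (toy_training_set, toy_testing_set) in place of the training_set and testing_set used in this tutorial, and it leaves out some of the classifiers to keep the example small.

```python
import nltk
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.svm import LinearSVC

# Toy (feature dict, label) pairs; invented for illustration only.
toy_training_set = [
    ({"great": True, "boring": False}, "pos"),
    ({"great": True, "boring": False}, "pos"),
    ({"great": False, "boring": True}, "neg"),
    ({"great": False, "boring": True}, "neg"),
]
toy_testing_set = [
    ({"great": True, "boring": False}, "pos"),
    ({"great": False, "boring": True}, "neg"),
]

# Map a display name to an untrained sklearn estimator.
estimators = {
    "MultinomialNB": MultinomialNB(),
    "BernoulliNB": BernoulliNB(),
    "LogisticRegression": LogisticRegression(),
    "LinearSVC": LinearSVC(),
}

# Wrap, train, and score each estimator in turn.
results = {}
for name, estimator in estimators.items():
    wrapped = SklearnClassifier(estimator)
    wrapped.train(toy_training_set)
    results[name] = nltk.classify.accuracy(wrapped, toy_testing_set) * 100
    print(name, "accuracy percent:", results[name])
```

Keeping the scores in a dictionary also makes it easy to sort or filter the classifiers afterwards, for example to drop any that score below some threshold.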

The result of running this should give you something along the lines of:

Original Naive Bayes Algo accuracy percent: 63.0
Most Informative Features
                thematic = True              pos : neg    =      9.1 : 1.0
                secondly = True              pos : neg    =      8.5 : 1.0
                narrates = True              pos : neg    =      7.8 : 1.0
                 rounded = True              pos : neg    =      7.1 : 1.0
                 supreme = True              pos : neg    =      7.1 : 1.0
                 layered = True              pos : neg    =      7.1 : 1.0
                  crappy = True              neg : pos    =      6.9 : 1.0
               uplifting = True              pos : neg    =      6.2 : 1.0
                     ugh = True              neg : pos    =      5.3 : 1.0
                   mamet = True              pos : neg    =      5.1 : 1.0
                 gaining = True              pos : neg    =      5.1 : 1.0
                   wanda = True              neg : pos    =      4.9 : 1.0
                   onset = True              neg : pos    =      4.9 : 1.0
               fantastic = True              pos : neg    =      4.5 : 1.0
                kentucky = True              pos : neg    =      4.4 : 1.0
MNB_classifier accuracy percent: 66.0
BernoulliNB_classifier accuracy percent: 72.0
LogisticRegression_classifier accuracy percent: 64.0
SGDClassifier_classifier accuracy percent: 61.0
SVC_classifier accuracy percent: 45.0
LinearSVC_classifier accuracy percent: 68.0
NuSVC_classifier accuracy percent: 59.0

So, we can see that SVC is wrong more often than it is right, straight out of the gate, so we should probably dump that one. But then what? The next thing we can try is to use all of these algorithms at once. An algo of algos! To do this, we can create another classifier, and make the result of that classifier based on what the other algorithms said. Sort of like a voting system, so we'll just need an odd number of algorithms to avoid tied votes between the two labels. That's what we'll be talking about in the next tutorial.
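The voting idea can be sketched in a few lines of plain Python. This is only an illustration, not the implementation from the next tutorial: majority_vote is a hypothetical helper, and the votes are made-up labels standing in for the outputs of several trained classifiers.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most common label and the share of votes it received."""
    if not votes:
        raise ValueError("need at least one vote")
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


# Made-up votes from three hypothetical classifiers on one review.
winner, agreement = majority_vote(["pos", "neg", "pos"])
print(winner, agreement)
```

With two possible labels and an odd number of voters, one label always gets more votes than the other, which is why an odd number of algorithms is convenient.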

The next tutorial: Combining Algorithms with NLTK