Scikit-Learn Sklearn with NLTK

We've seen by now how easy it can be to use classifiers out of the box, and now we want to try some more! The go-to Python module for this is Scikit-learn (sklearn).

If you would like to learn more about the Scikit-learn Module, I have some tutorials on machine learning with Scikit-Learn.

Luckily for us, the people behind NLTK foresaw the value of incorporating the sklearn module into the NLTK classifier methodology. As such, they created SklearnClassifier, a wrapper of sorts that lets an sklearn estimator be used like any other NLTK classifier. To use it, you just need to import it like:

from nltk.classify.scikitlearn import SklearnClassifier

From here, you can use just about any of the sklearn classifiers. For example, let's bring in a couple more variations of the Naive Bayes algorithm:

from sklearn.naive_bayes import MultinomialNB, BernoulliNB

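With this, how might we use them? It turns out, this is very simple: wrap the sklearn classifier, train it on the training_set from the earlier tutorials, then score it against testing_set:

# Wrap, train, then score each classifier (accuracy is a fraction, so multiply by 100)
MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MultinomialNB accuracy percent:", nltk.classify.accuracy(MNB_classifier, testing_set)*100)

BNB_classifier = SklearnClassifier(BernoulliNB())
BNB_classifier.train(training_set)
print("BernoulliNB accuracy percent:", nltk.classify.accuracy(BNB_classifier, testing_set)*100)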
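If you want to see the whole wrap, train, score cycle without loading the movie review data, here is a minimal sketch. The tiny featuresets below are made up purely for illustration (they are not the data from this series); the only thing carried over is the (feature dictionary, label) format that NLTK classifiers expect.

```python
import nltk
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy featuresets in NLTK's (feature_dict, label) format.
training_set = [
    ({"great": True, "awful": False}, "pos"),
    ({"great": True, "awful": False}, "pos"),
    ({"great": False, "awful": True}, "neg"),
    ({"great": False, "awful": True}, "neg"),
]
testing_set = [
    ({"great": True, "awful": False}, "pos"),
    ({"great": False, "awful": True}, "neg"),
]

# The wrapper must be trained before it can classify or be scored.
MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)

# nltk.classify.accuracy returns a fraction between 0 and 1.
mnb_accuracy = nltk.classify.accuracy(MNB_classifier, testing_set)
print("MultinomialNB accuracy percent:", mnb_accuracy * 100)

# Classify a single, previously unseen feature dictionary.
mnb_label = MNB_classifier.classify({"great": True, "awful": False})
print("Predicted label:", mnb_label)
```

On data this clean the classifier gets everything right, which is obviously not what we should expect from real reviews.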

It is as simple as that. Let's bring in some more:

from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
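Since every wrapped classifier is trained and scored in exactly the same way, one compact way to try several at once is a loop over a dictionary. This is only a sketch under assumptions: the `candidates` and `results` names and the toy featuresets are hypothetical, and only a few of the imported classifiers are shown.

```python
import nltk
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.svm import LinearSVC

# Hypothetical toy featuresets; swap in your own training_set / testing_set.
training_set = [
    ({"great": True, "awful": False}, "pos"),
    ({"great": True, "awful": False}, "pos"),
    ({"great": False, "awful": True}, "neg"),
    ({"great": False, "awful": True}, "neg"),
]
testing_set = [
    ({"great": True, "awful": False}, "pos"),
    ({"great": False, "awful": True}, "neg"),
]

# Map a display name to an untrained sklearn estimator.
candidates = {
    "MultinomialNB": MultinomialNB(),
    "BernoulliNB": BernoulliNB(),
    "LogisticRegression": LogisticRegression(),
    "LinearSVC": LinearSVC(),
}

results = {}
for name, estimator in candidates.items():
    # train() returns the wrapper itself, so the calls can be chained.
    clf = SklearnClassifier(estimator).train(training_set)
    results[name] = nltk.classify.accuracy(clf, testing_set) * 100
    print(name, "accuracy percent:", results[name])
```

Keeping the scores in a dictionary also makes it easy to sort or filter the classifiers later on.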

Now, all of our classifiers should look something like:

# Baseline: the NLTK Naive Bayes classifier trained in the earlier tutorials
print("Original Naive Bayes Algo accuracy percent:", (nltk.classify.accuracy(classifier, testing_set))*100)
classifier.show_most_informative_features(15)

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy percent:", (nltk.classify.accuracy(MNB_classifier, testing_set))*100)

BernoulliNB_classifier = SklearnClassifier(BernoulliNB())
BernoulliNB_classifier.train(training_set)
print("BernoulliNB_classifier accuracy percent:", (nltk.classify.accuracy(BernoulliNB_classifier, testing_set))*100)

LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(training_set)
print("LogisticRegression_classifier accuracy percent:", (nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100)

SGDClassifier_classifier = SklearnClassifier(SGDClassifier())
SGDClassifier_classifier.train(training_set)
print("SGDClassifier_classifier accuracy percent:", (nltk.classify.accuracy(SGDClassifier_classifier, testing_set))*100)

SVC_classifier = SklearnClassifier(SVC())
SVC_classifier.train(training_set)
print("SVC_classifier accuracy percent:", (nltk.classify.accuracy(SVC_classifier, testing_set))*100)

LinearSVC_classifier = SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(training_set)
print("LinearSVC_classifier accuracy percent:", (nltk.classify.accuracy(LinearSVC_classifier, testing_set))*100)

NuSVC_classifier = SklearnClassifier(NuSVC())
NuSVC_classifier.train(training_set)
print("NuSVC_classifier accuracy percent:", (nltk.classify.accuracy(NuSVC_classifier, testing_set))*100)

The result of running this should give you something along the lines of:

Original Naive Bayes Algo accuracy percent: 63.0
Most Informative Features
                thematic = True              pos : neg    =      9.1 : 1.0
                secondly = True              pos : neg    =      8.5 : 1.0
                narrates = True              pos : neg    =      7.8 : 1.0
                 rounded = True              pos : neg    =      7.1 : 1.0
                 supreme = True              pos : neg    =      7.1 : 1.0
                 layered = True              pos : neg    =      7.1 : 1.0
                  crappy = True              neg : pos    =      6.9 : 1.0
               uplifting = True              pos : neg    =      6.2 : 1.0
                     ugh = True              neg : pos    =      5.3 : 1.0
                   mamet = True              pos : neg    =      5.1 : 1.0
                 gaining = True              pos : neg    =      5.1 : 1.0
                   wanda = True              neg : pos    =      4.9 : 1.0
                   onset = True              neg : pos    =      4.9 : 1.0
               fantastic = True              pos : neg    =      4.5 : 1.0
                kentucky = True              pos : neg    =      4.4 : 1.0
MNB_classifier accuracy percent: 66.0
BernoulliNB_classifier accuracy percent: 72.0
LogisticRegression_classifier accuracy percent: 64.0
SGDClassifier_classifier accuracy percent: 61.0
SVC_classifier accuracy percent: 45.0
LinearSVC_classifier accuracy percent: 68.0
NuSVC_classifier accuracy percent: 59.0

So, we can see that SVC, with its default settings, is wrong more often than it is right (45% accuracy), so we should probably dump that one. But then what? The next thing we can try is to use all of these algorithms at once. An algo of algos! To do this, we can create another classifier, and base its result on what the other algorithms said. It works sort of like a voting system, so we'll want an odd number of algorithms to avoid tied votes. That's what we'll be talking about in the next tutorial.
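To make the voting idea a bit more concrete before then, here is a rough sketch in plain Python. The `majority_vote` helper and the hard-coded votes are hypothetical, not the implementation the next tutorial builds; the point is only to show why an odd number of voters is handy when there are two labels.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most common label and the share of votes it received."""
    counts = Counter(votes)
    label, count = counts.most_common(1)[0]
    return label, count / len(votes)


# Five hypothetical classifiers voting on one review: no tie is possible.
odd_result = majority_vote(["pos", "neg", "pos", "pos", "neg"])
print(odd_result)  # ('pos', 0.6)

# Four voters can split 2-2; the "winner" then only has half of the votes,
# so the result tells us nothing. An odd count avoids this with two labels.
even_label, even_share = majority_vote(["pos", "neg", "pos", "neg"])
print(even_share)  # 0.5
```

The share of votes can double as a rough confidence score: 5 out of 5 agreeing is a lot more convincing than 3 out of 5.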
