Named Entity Recognition with Stanford NER Tagger

Guest Post by Chuck Dishmon



An alternative to NLTK's named entity recognition (NER) classifier is provided by the Stanford NER tagger. This tagger is largely seen as the standard in named entity recognition, but since it uses an advanced statistical learning algorithm (a conditional random field, hence the "crf" in the model filenames) it's more computationally expensive than the option provided by NLTK.

A big benefit of the Stanford NER tagger is that it provides us with a few different models for pulling out named entities. We can use any of the following:

  • 3 class model for recognizing locations, persons, and organizations
  • 4 class model for recognizing locations, persons, organizations, and miscellaneous entities
  • 7 class model for recognizing locations, persons, organizations, times, money, percents, and dates
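As a rough sketch, the three models can be kept in a small lookup table so that switching between them is a one-argument change. The classifier filenames below are the ones I'd expect to find in the classifiers directory of the Stanford NER download (an assumption, so check them against your copy), and MODELS and model_path are hypothetical names used only for illustration:

```python
import os

# Assumed filenames from the Stanford NER download; verify against your copy.
# Each entry maps a class count to (classifier filename, entity labels).
MODELS = {
    3: ('english.all.3class.distsim.crf.ser.gz',
        ['LOCATION', 'PERSON', 'ORGANIZATION']),
    4: ('english.conll.4class.distsim.crf.ser.gz',
        ['LOCATION', 'PERSON', 'ORGANIZATION', 'MISC']),
    7: ('english.muc.7class.distsim.crf.ser.gz',
        ['LOCATION', 'PERSON', 'ORGANIZATION',
         'TIME', 'MONEY', 'PERCENT', 'DATE']),
}


def model_path(n_classes, classifier_dir):
    """Return the path of the n-class model file inside classifier_dir."""
    filename, _labels = MODELS[n_classes]
    return os.path.join(classifier_dir, filename)


print(model_path(3, '/usr/share/stanford-ner/classifiers'))
```

Every model also emits 'O' for tokens that are not part of a named entity, so that label is left out of the lists above.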

In order to move forward we'll need to download the models and a jar file, since the NER classifier is written in Java. These are available for free from the Stanford Natural Language Processing Group. Conveniently for us, NLTK provides a wrapper to the Stanford tagger so we can use it in the best language ever (ahem, Python)!

The parameters passed to the StanfordNERTagger class include:

  1. Classification model path (3 class model used below)
  2. Stanford tagger jar file path
  3. Training data encoding (the default has differed between NLTK versions, so we pass it explicitly below)

Here's how we set it up to tag a sentence with the 3 class model:

# -*- coding: utf-8 -*-

from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

# Arguments: classification model path, tagger jar path, encoding
st = StanfordNERTagger('/usr/share/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',
                       '/usr/share/stanford-ner/stanford-ner.jar',
                       encoding='utf-8')

text = 'While in France, Christine Lagarde discussed short-term stimulus efforts in a recent interview with the Wall Street Journal.'

# Split the sentence into word tokens, then tag each token
tokenized_text = word_tokenize(text)
classified_text = st.tag(tokenized_text)

print(classified_text)

Once we've tokenized by word and classified the sentence, we see the tagger produces a list of tuples as follows:

[('While', 'O'), ('in', 'O'), ('France', 'LOCATION'), (',', 'O'), ('Christine', 'PERSON'), ('Lagarde', 'PERSON'), ('discussed', 'O'), ('short-term', 'O'), ('stimulus', 'O'), ('efforts', 'O'), ('in', 'O'), ('a', 'O'), ('recent', 'O'), ('interview', 'O'), ('with', 'O'), ('the', 'O'), ('Wall', 'ORGANIZATION'), ('Street', 'ORGANIZATION'), ('Journal', 'ORGANIZATION'), ('.', 'O')]

Nice! Each token is tagged (using our 3 class model) with either 'PERSON', 'LOCATION', 'ORGANIZATION', or 'O'. The 'O' simply stands for other, i.e., non-named entities.
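Since multi-word names come back as separate tokens, it can help to merge neighbouring tokens that share a tag. Here's a minimal sketch using itertools.groupby from the standard library; group_entities is a hypothetical helper name, and the input is a shortened copy of the output above rather than a fresh run of the tagger:

```python
from itertools import groupby


def group_entities(classified_text):
    """Merge adjacent tokens sharing a non-'O' tag into (phrase, tag) pairs."""
    entities = []
    # groupby yields one run per stretch of consecutive identical tags
    for tag, chunk in groupby(classified_text, key=lambda pair: pair[1]):
        if tag != 'O':
            phrase = ' '.join(token for token, _ in chunk)
            entities.append((phrase, tag))
    return entities


# Shortened copy of the tagger output shown above
classified_text = [('While', 'O'), ('in', 'O'), ('France', 'LOCATION'),
                   (',', 'O'), ('Christine', 'PERSON'), ('Lagarde', 'PERSON'),
                   ('discussed', 'O'), ('the', 'O'), ('Wall', 'ORGANIZATION'),
                   ('Street', 'ORGANIZATION'), ('Journal', 'ORGANIZATION'),
                   ('.', 'O')]

print(group_entities(classified_text))
# [('France', 'LOCATION'), ('Christine Lagarde', 'PERSON'), ('Wall Street Journal', 'ORGANIZATION')]
```

One caveat: because the tags carry no begin/inside markers, two different entities of the same type sitting side by side would be merged into one phrase.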

The list is now ready for testing with annotated data, which we'll cover in the next tutorial.
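To give a feel for what that testing might look like, here's a minimal sketch that compares the tagger's output against hand-annotated tags token by token. The token_accuracy helper is a hypothetical name, and both lists below are made up for illustration (they are not real tagger output or a real annotated corpus):

```python
def token_accuracy(predicted, gold):
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError('predicted and gold must have the same length')
    if not gold:
        raise ValueError('cannot score an empty list')
    matches = sum(1 for (_, p_tag), (_, g_tag) in zip(predicted, gold)
                  if p_tag == g_tag)
    # float() keeps the division exact on Python 2 as well
    return matches / float(len(gold))


# Made-up example: the "tagger" misses the last word of the organization
predicted = [('Wall', 'ORGANIZATION'), ('Street', 'ORGANIZATION'),
             ('Journal', 'O'), ('.', 'O')]
gold = [('Wall', 'ORGANIZATION'), ('Street', 'ORGANIZATION'),
        ('Journal', 'ORGANIZATION'), ('.', 'O')]

print(token_accuracy(predicted, gold))  # 0.75
```

Plain token accuracy is only one possible measure. Because most tokens are 'O', it tends to look flattering, so treat this as a starting point.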
