Using BIO Tags to Create Readable Named Entity Lists

Guest Post by Chuck Dishmon

Now that we're done our testing, let's get our named entities in a nice readable format.

Again, we'll use the same short article from NBC news:

House Speaker John Boehner became animated Tuesday over the proposed Keystone Pipeline, castigating the Obama administration for not having approved the project yet.

Republican House Speaker John Boehner says there's "nothing complex about the Keystone Pipeline," and that it's time to build it.

"Complex? You think the Keystone Pipeline is complex?!" Boehner responded to a questioner. "It's been under study for five years! We build pipelines in America every day. Do you realize there are 200,000 miles of pipelines in the United States?"

The speaker went on: "And the only reason the president's involved in the Keystone Pipeline is because it crosses an international boundary. Listen, we can build it. There's nothing complex about the Keystone Pipeline -- it's time to build it."

Boehner said the president had no excuse at this point to not give the pipeline the go-ahead after the State Department released a report on Friday indicating the project would have a minimal impact on the environment.

Republicans have long pushed for construction of the project, which enjoys some measure of Democratic support as well. The GOP is considering conditioning an extension of the debt limit on approval of the project by Obama.

The White House, though, has said that it has no timetable for a final decision on the project.

Our NTLK output is already in a tree (only requiring one last step), so let's get our Stanford output there as well. We'll start by BIO tagging the tokens, with B assigned to the beginning of named entities, I assigned to inside, and O assigned to other. For instance, if we have the sentence "Barack Obama went to Greece today", we should BIO tag it as "Barack-B Obama-I went-O to-O Greece-B today-O." In order to do this we'll write a series of conditionals to examine 'O' tags for current and previous tokens.

# -*- coding: utf-8 -*-

import nltk
import os
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
from nltk import pos_tag
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize
from nltk.chunk import conlltags2tree
from nltk.tree import Tree


# Process text  
def process_text(txt_file):
	raw_text = open("/usr/share/news_article.txt").read()
	token_text = word_tokenize(raw_text)
	return token_text

# Stanford NER tagger    
def stanford_tagger(token_text):
	st = StanfordNERTagger('/usr/share/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',
	ne_tagged = st.tag(token_text)
# NLTK POS and NER taggers   
def nltk_tagger(token_text):
	tagged_words = nltk.pos_tag(token_text)
	ne_tagged = nltk.ne_chunk(tagged_words)

# Tag tokens with standard NLP BIO tags
def bio_tagger(ne_tagged):
		bio_tagged = []
		prev_tag = "O"
		for token, tag in ne_tagged:
			if tag == "O": #O
				bio_tagged.append((token, tag))
				prev_tag = tag
			if tag != "O" and prev_tag == "O": # Begin NE
				bio_tagged.append((token, "B-"+tag))
				prev_tag = tag
			elif prev_tag != "O" and prev_tag == tag: # Inside NE
				bio_tagged.append((token, "I-"+tag))
				prev_tag = tag
			elif prev_tag != "O" and prev_tag != tag: # Adjacent NE
				bio_tagged.append((token, "B-"+tag))
				prev_tag = tag
		return bio_tagged

Now we'll write the BIO tagged tokens into trees, so they're in the same formate as the NLTK output.

# Create tree       
def stanford_tree(bio_tagged):
	tokens, ne_tags = zip(*bio_tagged)
	pos_tags = [pos for token, pos in pos_tag(tokens)]

	conlltags = [(token, pos, ne) for token, pos, ne in zip(tokens, pos_tags, ne_tags)]
	ne_tree = conlltags2tree(conlltags)
	return ne_tree

Iterate through and parse out all the named entities.

# Parse named entities from tree
def structure_ne(ne_tree):
	ne = []
	for subtree in ne_tree:
		if type(subtree) == Tree: # If subtree is a noun chunk, i.e. NE != "O"
			ne_label = subtree.label()
			ne_string = " ".join([token for token, pos in subtree.leaves()])
			ne.append((ne_string, ne_label))
	return ne

We'll group all our additional functions together in our call:

def stanford_main():

def nltk_main():

And then call the functions:

if __name__ == '__main__':

Here's the nice looking output from Stanford:

[('House', 'ORGANIZATION'), ('John Boehner', 'PERSON'), ('Keystone Pipeline', 'ORGANIZATION'), ('Obama', 'PERSON'), ('Republican House', 'ORGANIZATION'), ('John Boehner', 'PERSON'), ('Keystone Pipeline', 'ORGANIZATION'), ('Keystone Pipeline', 'ORGANIZATION'), ('Boehner', 'PERSON'), ('America', 'LOCATION'), ('United States', 'LOCATION'), ('Keystone Pipeline', 'ORGANIZATION'), ('Keystone Pipeline', 'ORGANIZATION'), ('Boehner', 'PERSON'), ('State Department', 'ORGANIZATION'), ('Republicans', 'MISC'), ('Democratic', 'MISC'), ('GOP', 'MISC'), ('Obama', 'PERSON'), ('White House', 'LOCATION')]

And from NLTK:

[('House', 'ORGANIZATION'), ('John Boehner', 'PERSON'), ('Keystone Pipeline', 'PERSON'), ('Obama', 'ORGANIZATION'), ('Republican', 'ORGANIZATION'), ('House', 'ORGANIZATION'), ('John Boehner', 'PERSON'), ('Keystone Pipeline', 'ORGANIZATION'), ('Keystone Pipeline', 'ORGANIZATION'), ('Boehner', 'PERSON'), ('America', 'GPE'), ('United States', 'GPE'), ('Keystone Pipeline', 'ORGANIZATION'), ('Listen', 'PERSON'), ('Keystone', 'ORGANIZATION'), ('Boehner', 'PERSON'), ('State Department', 'ORGANIZATION'), ('Democratic', 'ORGANIZATION'), ('GOP', 'ORGANIZATION'), ('Obama', 'PERSON'), ('White House', 'FACILITY')]

Nicely chunked together and readable. Sweet!

That's all for now. For more tutorials, head to the:

  • Tokenizing Words and Sentences with NLTK
  • Stop words with NLTK
  • Stemming words with NLTK
  • Part of Speech Tagging with NLTK
  • Chunking with NLTK
  • Chinking with NLTK
  • Named Entity Recognition with NLTK
  • Lemmatizing with NLTK
  • The corpora with NLTK
  • Wordnet with NLTK
  • Text Classification with NLTK
  • Converting words to Features with NLTK
  • Naive Bayes Classifier with NLTK
  • Saving Classifiers with NLTK
  • Scikit-Learn Sklearn with NLTK
  • Combining Algorithms with NLTK
  • Investigating bias with NLTK
  • Improving Training Data for sentiment analysis with NLTK
  • Creating a module for Sentiment Analysis with NLTK
  • Twitter Sentiment Analysis with NLTK
  • Graphing Live Twitter Sentiment Analysis with NLTK with NLTK
  • Named Entity Recognition with Stanford NER Tagger
  • Testing NLTK and Stanford NER Taggers for Accuracy
  • Testing NLTK and Stanford NER Taggers for Speed
  • Using BIO Tags to Create Readable Named Entity Lists