Now that we're done with our testing, let's get our named entities into a nice, readable format.
Again, we'll use the same short article from NBC News:
House Speaker John Boehner became animated Tuesday over the proposed Keystone Pipeline, castigating the Obama administration for not having approved the project yet.
Republican House Speaker John Boehner says there's "nothing complex about the Keystone Pipeline," and that it's time to build it.
"Complex? You think the Keystone Pipeline is complex?!" Boehner responded to a questioner. "It's been under study for five years! We build pipelines in America every day. Do you realize there are 200,000 miles of pipelines in the United States?"
The speaker went on: "And the only reason the president's involved in the Keystone Pipeline is because it crosses an international boundary. Listen, we can build it. There's nothing complex about the Keystone Pipeline -- it's time to build it."
Boehner said the president had no excuse at this point to not give the pipeline the go-ahead after the State Department released a report on Friday indicating the project would have a minimal impact on the environment.
Republicans have long pushed for construction of the project, which enjoys some measure of Democratic support as well. The GOP is considering conditioning an extension of the debt limit on approval of the project by Obama.
The White House, though, has said that it has no timetable for a final decision on the project.
Our NLTK output is already in a tree (it needs only one last step), so let's get our Stanford output into a tree as well. We'll start by BIO tagging the tokens, with B assigned to the beginning of a named entity, I assigned to tokens inside one, and O assigned to other tokens (those outside any entity). For instance, if we have the sentence "Barack Obama went to Greece today", we should BIO tag it as "Barack-B Obama-I went-O to-O Greece-B today-O." To do this, we'll write a series of conditionals that compare the tag of the current token with the tag of the previous one.
# -*- coding: utf-8 -*-

import nltk
import os
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
from nltk import pos_tag
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize
from nltk.chunk import conlltags2tree
from nltk.tree import Tree

style.use('fivethirtyeight')

# Path to the article text (passed to process_text below)
txt_file = "/usr/share/news_article.txt"

# Process text: read the file and split it into tokens
def process_text(txt_file):
    with open(txt_file) as f:
        raw_text = f.read()
    token_text = word_tokenize(raw_text)
    return token_text

# Stanford NER tagger: returns a list of (token, tag) tuples
def stanford_tagger(token_text):
    st = StanfordNERTagger('/usr/share/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',
                           '/usr/share/stanford-ner/stanford-ner.jar',
                           encoding='utf-8')
    ne_tagged = st.tag(token_text)
    return ne_tagged

# NLTK POS and NER taggers: returns a tree of chunked named entities
def nltk_tagger(token_text):
    tagged_words = nltk.pos_tag(token_text)
    ne_tagged = nltk.ne_chunk(tagged_words)
    return ne_tagged

# Tag tokens with standard NLP BIO tags
def bio_tagger(ne_tagged):
    bio_tagged = []
    prev_tag = "O"
    for token, tag in ne_tagged:
        if tag == "O":  # O
            bio_tagged.append((token, tag))
            prev_tag = tag
            continue
        if tag != "O" and prev_tag == "O":  # Begin NE
            bio_tagged.append((token, "B-" + tag))
            prev_tag = tag
        elif prev_tag != "O" and prev_tag == tag:  # Inside NE
            bio_tagged.append((token, "I-" + tag))
            prev_tag = tag
        elif prev_tag != "O" and prev_tag != tag:  # Adjacent NE
            bio_tagged.append((token, "B-" + tag))
            prev_tag = tag
    return bio_tagged
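To see the BIO scheme on the "Barack Obama went to Greece today" sentence without firing up the Stanford tagger, here's a minimal standalone sketch. The helper name `to_bio` and the hand-written PERSON/LOCATION labels are assumptions for illustration only; they stand in for real tagger output and follow the same previous-tag comparison described above.

```python
# Hand-labelled (token, tag) pairs in the shape an NER tagger returns.
# These labels are illustrative assumptions, not real tagger output.
tagged = [
    ("Barack", "PERSON"), ("Obama", "PERSON"), ("went", "O"),
    ("to", "O"), ("Greece", "LOCATION"), ("today", "O"),
]

def to_bio(tagged_tokens):
    """Prefix entity tags with B- (first token) or I- (continuation)."""
    bio = []
    prev_tag = "O"
    for token, tag in tagged_tokens:
        if tag == "O":
            bio.append((token, "O"))          # not part of any entity
        elif tag == prev_tag:
            bio.append((token, "I-" + tag))   # same label as previous token
        else:
            bio.append((token, "B-" + tag))   # new entity starts here
        prev_tag = tag
    return bio

bio_example = to_bio(tagged)
print(bio_example)
# [('Barack', 'B-PERSON'), ('Obama', 'I-PERSON'), ('went', 'O'),
#  ('to', 'O'), ('Greece', 'B-LOCATION'), ('today', 'O')]
```

One thing to keep in mind: because the only signal is whether the label changed, two different entities with the same label sitting right next to each other would be merged into one.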
Now we'll write the BIO-tagged tokens into a tree, so they're in the same format as the NLTK output.
# Create tree
def stanford_tree(bio_tagged):
    tokens, ne_tags = zip(*bio_tagged)
    pos_tags = [pos for token, pos in pos_tag(tokens)]

    conlltags = [(token, pos, ne) for token, pos, ne in zip(tokens, pos_tags, ne_tags)]
    ne_tree = conlltags2tree(conlltags)
    return ne_tree
Next, we iterate through the tree and parse out all the named entities.
# Parse named entities from tree
def structure_ne(ne_tree):
    ne = []
    for subtree in ne_tree:
        if isinstance(subtree, Tree):  # If subtree is a noun chunk, i.e. NE != "O"
            ne_label = subtree.label()
            ne_string = " ".join([token for token, pos in subtree.leaves()])
            ne.append((ne_string, ne_label))
    return ne
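If you just want to sanity-check the grouping logic without NLTK installed, the same (entity, label) pairs can be pulled straight from BIO-tagged tuples. To be clear, this is a swapped-in alternative and not the tree-based approach above; the helper name `entities_from_bio` and the sample tags are hypothetical.

```python
def entities_from_bio(bio_tagged):
    """Collect (entity_string, label) pairs from (token, BIO-tag) tuples."""
    entities = []
    tokens, label = [], None
    for token, tag in bio_tagged:
        # An I- tag with the same label extends the entity being built.
        if tag.startswith("I-") and tag[2:] == label:
            tokens.append(token)
            continue
        # Any other tag closes the entity collected so far.
        if tokens:
            entities.append((" ".join(tokens), label))
        if tag == "O":
            tokens, label = [], None
        else:
            # A B- tag (or a stray I- tag with no open entity) starts a new one.
            tokens, label = [token], tag[2:]
    # Flush an entity that runs to the end of the input.
    if tokens:
        entities.append((" ".join(tokens), label))
    return entities

# Hand-written sample tags (an assumption, not real tagger output)
sample = [
    ("Barack", "B-PERSON"), ("Obama", "I-PERSON"), ("went", "O"),
    ("to", "O"), ("Greece", "B-LOCATION"), ("today", "O"),
]
found = entities_from_bio(sample)
print(found)
# [('Barack Obama', 'PERSON'), ('Greece', 'LOCATION')]
```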
We'll chain all of these functions together in two wrapper functions, one per tagger:
def stanford_main():
    print(structure_ne(stanford_tree(bio_tagger(stanford_tagger(process_text(txt_file))))))

def nltk_main():
    print(structure_ne(nltk_tagger(process_text(txt_file))))
And then call the functions:
if __name__ == '__main__':
    stanford_main()
    nltk_main()
Here's the nice-looking output from Stanford:
[('House', 'ORGANIZATION'), ('John Boehner', 'PERSON'), ('Keystone Pipeline', 'ORGANIZATION'), ('Obama', 'PERSON'), ('Republican House', 'ORGANIZATION'), ('John Boehner', 'PERSON'), ('Keystone Pipeline', 'ORGANIZATION'), ('Keystone Pipeline', 'ORGANIZATION'), ('Boehner', 'PERSON'), ('America', 'LOCATION'), ('United States', 'LOCATION'), ('Keystone Pipeline', 'ORGANIZATION'), ('Keystone Pipeline', 'ORGANIZATION'), ('Boehner', 'PERSON'), ('State Department', 'ORGANIZATION'), ('Republicans', 'MISC'), ('Democratic', 'MISC'), ('GOP', 'MISC'), ('Obama', 'PERSON'), ('White House', 'LOCATION')]
And from NLTK:
[('House', 'ORGANIZATION'), ('John Boehner', 'PERSON'), ('Keystone Pipeline', 'PERSON'), ('Obama', 'ORGANIZATION'), ('Republican', 'ORGANIZATION'), ('House', 'ORGANIZATION'), ('John Boehner', 'PERSON'), ('Keystone Pipeline', 'ORGANIZATION'), ('Keystone Pipeline', 'ORGANIZATION'), ('Boehner', 'PERSON'), ('America', 'GPE'), ('United States', 'GPE'), ('Keystone Pipeline', 'ORGANIZATION'), ('Listen', 'PERSON'), ('Keystone', 'ORGANIZATION'), ('Boehner', 'PERSON'), ('State Department', 'ORGANIZATION'), ('Democratic', 'ORGANIZATION'), ('GOP', 'ORGANIZATION'), ('Obama', 'PERSON'), ('White House', 'FACILITY')]
Nicely chunked together and readable. Sweet!
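Since the entities are now plain (string, label) tuples, they're easy to work with further. As one small sketch, here's how we might count repeated mentions with `collections.Counter`. The list is the Stanford output copied from above; the variable names are just for illustration.

```python
from collections import Counter

# Stanford output copied from the listing above
stanford_ne = [
    ('House', 'ORGANIZATION'), ('John Boehner', 'PERSON'),
    ('Keystone Pipeline', 'ORGANIZATION'), ('Obama', 'PERSON'),
    ('Republican House', 'ORGANIZATION'), ('John Boehner', 'PERSON'),
    ('Keystone Pipeline', 'ORGANIZATION'), ('Keystone Pipeline', 'ORGANIZATION'),
    ('Boehner', 'PERSON'), ('America', 'LOCATION'), ('United States', 'LOCATION'),
    ('Keystone Pipeline', 'ORGANIZATION'), ('Keystone Pipeline', 'ORGANIZATION'),
    ('Boehner', 'PERSON'), ('State Department', 'ORGANIZATION'),
    ('Republicans', 'MISC'), ('Democratic', 'MISC'), ('GOP', 'MISC'),
    ('Obama', 'PERSON'), ('White House', 'LOCATION'),
]

# Count how many times each (entity, label) pair was found
counts = Counter(stanford_ne)
top_entity, top_count = counts.most_common(1)[0]
print(top_entity, top_count)
# ('Keystone Pipeline', 'ORGANIZATION') 5
```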
That's all for now.