Welcome to part five of the Deep Learning with Neural Networks and TensorFlow tutorials. Now that we've covered a simple example of an artificial neural network, let's break this model down further and learn how we might approach things if we had data that wasn't preloaded and set up for us. This is usually the first challenge you will come up against after learning from demos. The demo works, which is great, and then you begin to wonder how to get your own data into the code. It's always a good idea to grab a dataset from somewhere and try to do it yourself, as it will give you a better idea of how everything works and what format your data needs to be in.
To do this, we will use two files: one of positive sentiment statements and one of negative sentiment statements. The pos file has ~5,000 positive sentiment statements, and the neg file has ~5,000 negative sentiment statements.
What we are going to try to do here is use a neural network to correctly identify sentiment, training with this data. Right away, we're faced with some challenges.
First, our data is in language/word format, not the numerical form we need; it has to be converted to a vector of features. So we begin pondering how we'll convert words to numbers, and then we make a second realization: our texts are not all the same length in words or characters. This is a big deal, since every featureset needs to be exactly the same length going into the network, both for training and, later, for testing.
One option we have is to compile a list of all unique words in the training set. Let's say that's 3,500 unique words. These words are our lexicon. Now, we create a training vector of zeros that is 1x3500 in size, alongside the list of unique words, which is also 1x3500. From here, for every word in our sample sentence, we check whether it is in the lexicon. If so, the value at that word's index is set to 1 in the training vector. This is a very simple bag-of-words model.
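As a quick sketch of what compiling such a lexicon might look like in plain Python (the sample sentences and the build_lexicon name are purely illustrative, not taken from our data files):

def build_lexicon(documents):
    # Collect every unique word across all documents.
    unique_words = set()
    for doc in documents:
        unique_words.update(doc.lower().split())
    return sorted(unique_words)

docs = ["I pulled my chair up to the table",
        "We set the table with a spoon"]
print(build_lexicon(docs))  # prints the sorted unique words from both sentences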
For further example, let's say our unique word list is [chair, table, spoon, television]. Let's then say we have a training sentence that is "I pulled my chair up to the table." We first create our training vector as a vector of zeros that is the same size as our unique word list. This would be a 1x4: [0 0 0 0]. Now, we iterate through all of the words in that sample sentence and, if a word is in the unique word list, we set that index's value in the training vector to 1. Since chair (index: 0) and table (index: 1) are in the unique word list, and no others are, our new training feature vector is [1 1 0 0]. Then the data is labeled as either positive or negative. Again, here, we will just use one-hot encoding and have the label vector be [POS, NEG], where positive data is [1,0] and negative data is [0,1].
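To make that concrete, here is a minimal sketch of this featurization in plain Python (the bag_of_words helper is just for illustration; it isn't the actual function we'll be writing):

lexicon = ['chair', 'table', 'spoon', 'television']

def bag_of_words(sentence, lexicon):
    # Start with a vector of zeros, one slot per lexicon word.
    features = [0] * len(lexicon)
    for word in sentence.lower().split():
        if word in lexicon:
            # Flip the slot for any lexicon word that appears in the sentence.
            features[lexicon.index(word)] = 1
    return features

sentence = "I pulled my chair up to the table"
print(bag_of_words(sentence, lexicon))  # [1, 1, 0, 0]
# Paired with a one-hot label: positive samples get [1, 0], negative samples get [0, 1].
print([bag_of_words(sentence, lexicon), [1, 0]])  # a positive featureset/label pair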
To aid us in the pre-processing, we're going to make use of NLTK (Natural Language Toolkit). Our main interest here is its word tokenizer, as well as the Lemmatizer. Word tokenizers split text into words for us, and a lemmatizer takes similar words and converts them into a single common word. The concept is very similar to stemming, except that a lemma is an actual word, one you could look up in a dictionary or something like WordNet. This will help us keep our lexicon much smaller, without losing too much value.
If you do not have NLTK, you need to install it.
pip3 install nltk
python3
import nltk
nltk.download()
This will either open a GUI or stay in a text-based (headless) downloader. Go ahead and just download everything. If you are in the GUI, choose to download all. If you are in the text version, type d, then all. Once that's done, you're ready to progress. If you're lost or confused, check out the first NLTK tutorial for installing NLTK.
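With NLTK installed and its data downloaded, a quick sanity check of the tokenizer and lemmatizer (just an illustrative snippet, not part of the script we're about to write) looks like:

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
words = word_tokenize("The cats were running across the tables")
print(words)                                     # ['The', 'cats', 'were', 'running', 'across', 'the', 'tables']
print([lemmatizer.lemmatize(w) for w in words])  # nouns collapse: 'cats' -> 'cat', 'tables' -> 'table'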
That's the plan, let's make it happen. We'll start by creating a create_sentiment_featuresets.py file, which we'll keep separate from the neural network/TensorFlow modeling program.
import nltk
from nltk.tokenize import word_tokenize
import numpy as np
import random
import pickle
from collections import Counter
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
hm_lines = 100000
These are just some necessary imports. NLTK has been explained, numpy is a given, random will be used to shuffle the data, Counter will be used for sorting the most common lemmas, and pickle will save the processed data so that we don't need to redo the work every time. We define the lemmatizer, and then we set the hm_lines value. 100,000 will cover all of the lines, since there are just over 10,000 in total. If you want to test something new, or shrink the total data size for a smaller computer/processor, you can set a smaller number here. I mostly used this for quickly testing new functions and the like; there's no reason to run through the entire set just to quickly test a different method.
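As a rough illustration of how hm_lines ends up being used (a hypothetical helper; the real functions come next, and the filename here is only a placeholder):

def read_capped(fname, hm_lines=100000):
    # Read at most hm_lines lines from a file; fewer if the file is shorter.
    with open(fname, 'r') as f:
        contents = f.readlines()
    return contents[:hm_lines]

# e.g. read_capped('positive_examples.txt', hm_lines=500) for a quick test run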