Welcome to part five of the Deep Learning with Neural Networks and TensorFlow tutorials. Now that we've covered a simple example of an artificial neural network, let's break this model down further and learn how we might approach things if we had data that wasn't preloaded and set up for us. This is usually the first challenge you will come up against after learning from demos. The demo works, which is great, and then you begin to wonder how to get your own data into the code. It's always a good idea to grab a dataset from somewhere and try to do it yourself, as it will give you a better idea of how everything works and what format your data needs to be in.
To do this, we will use two files: one of positive sentiment statements and one of negative sentiment statements. The pos file has ~5,000 positive sentiment statements, and the neg file has ~5,000 negative sentiment statements.
What we are going to try to do here is use a neural network to correctly identify sentiment, training with this data. Right away, we're faced with some challenges.
First, our data is in language/word format, not the numerical form we need; it has to be converted to a vector of features. So we begin pondering how we'll convert words to numbers, and then we make a second realization: our texts are not all the same length in words or characters. This is a big deal, since every featureset needs to be exactly the same length going into the network, both for training and, later, for testing.
One option we have is to compile a list of all unique words in the training set. Let's say that's 3,500 unique words. These words are our lexicon. Now, we create a training vector of zeros that is 1x3500 in size, alongside the list of unique words, which is also 1x3500. From here, for every word in our sample sentence, we check whether it is in the lexicon. If so, the value at that word's index is set to 1 in the training vector. This is a very simple bag-of-words model.
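As a quick sketch of what compiling such a lexicon might look like in plain Python (the sample sentences and the build_lexicon name are purely illustrative, not taken from our data files):

def build_lexicon(documents):
    # Collect every unique word across all documents.
    unique_words = set()
    for doc in documents:
        unique_words.update(doc.lower().split())
    return sorted(unique_words)

docs = ["I pulled my chair up to the table",
        "We set the table with a spoon"]
print(build_lexicon(docs))  # prints the sorted unique words from both sentences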
For further example, let's say our unique word list is [chair, table, spoon, television]. Let's then say we have a training sentence that is "I pulled my chair up to the table." We first create our training vector as a vector of zeros that is the same size as our unique word list. This would be a 1x4: [0 0 0 0]. Now, we iterate through all of the words in that sample sentence and, if a word is in the unique word list, we set that index's value in the training vector to 1. Since chair (index: 0) and table (index: 1) are in the unique word list, and no others are, our new training feature vector is [1 1 0 0]. Then the data is labeled as either positive or negative. Again, here, we will just use one-hot encoding and have the label vector be [POS, NEG], where positive data is [1,0] and negative data is [0,1].
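To make that concrete, here is a minimal sketch of this featurization in plain Python (the bag_of_words helper is just for illustration; it isn't the actual function we'll be writing):

lexicon = ['chair', 'table', 'spoon', 'television']

def bag_of_words(sentence, lexicon):
    # Start with a vector of zeros, one slot per lexicon word.
    features = [0] * len(lexicon)
    for word in sentence.lower().split():
        if word in lexicon:
            # Flip the slot for any lexicon word that appears in the sentence.
            features[lexicon.index(word)] = 1
    return features

sentence = "I pulled my chair up to the table"
print(bag_of_words(sentence, lexicon))  # [1, 1, 0, 0]
# Paired with a one-hot label: positive samples get [1, 0], negative samples get [0, 1].
print([bag_of_words(sentence, lexicon), [1, 0]])  # a positive featureset/label pair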
To aid us in the pre-processing, we're going to make use of NLTK (Natural Language Toolkit). Our main interest here is its word tokenizer, as well as the Lemmatizer. Word tokenizers split text into words for us, and a lemmatizer takes similar words and converts them into a single common word. The concept is very similar to stemming, except that a lemma is an actual word, one you could look up in a dictionary or something like WordNet. This will help us keep our lexicon much smaller, without losing too much value.
If you do not have NLTK, you need to install it.
pip3 install nltk
python3
import nltk
nltk.download()
This will either open a GUI or stay in a text-based (headless) downloader. Go ahead and just download everything. If you are in the GUI, choose to download all. If you are in the text version, type d, then all. Once that's done, you're ready to progress. If you're lost or confused, check out the first NLTK tutorial for installing NLTK.
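With NLTK installed and its data downloaded, a quick sanity check of the tokenizer and lemmatizer (just an illustrative snippet, not part of the script we're about to write) looks like:

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
words = word_tokenize("The cats were running across the tables")
print(words)                                     # ['The', 'cats', 'were', 'running', 'across', 'the', 'tables']
print([lemmatizer.lemmatize(w) for w in words])  # nouns collapse: 'cats' -> 'cat', 'tables' -> 'table'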
That's the plan, let's make it happen. We'll start by creating a create_sentiment_featuresets.py file, which we'll keep separate from the neural network/TensorFlow modeling program.
import nltk
from nltk.tokenize import word_tokenize
import numpy as np
import random
import pickle
from collections import Counter
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
hm_lines = 100000
These are just some necessary imports. NLTK has been explained, numpy is a given, random will be used to shuffle the data, Counter will be used for sorting the most common lemmas, and pickle will save the processed data so that we don't need to redo the work every time. We define the lemmatizer, and then we set the hm_lines value. 100,000 will cover all of the lines, since there are just over 10,000 in total. If you want to test something new, or shrink the total data size for a smaller computer/processor, you can set a smaller number here. I mostly used this for quickly testing new functions and the like; there's no reason to run through the entire set just to quickly test a different method.
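As a rough illustration of how hm_lines ends up being used (a hypothetical helper; the real functions come next, and the filename here is only a placeholder):

def read_capped(fname, hm_lines=100000):
    # Read at most hm_lines lines from a file; fewer if the file is shorter.
    with open(fname, 'r') as f:
        contents = f.readlines()
    return contents[:hm_lines]

# e.g. read_capped('positive_examples.txt', hm_lines=500) for a quick test run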