Deep Learning with our own Data

Welcome to part five of the Deep Learning with Neural Networks and TensorFlow tutorials. Now that we've covered a simple example of an artificial neural network, let's break this model down further and learn how we might approach this if we had some data that wasn't preloaded and set up for us. This is usually the first challenge you will come up against after you learn from demos. The demo works, and that's awesome, and then you begin to wonder how you can stuff the data you have into the code. It's always a good idea to grab a dataset from somewhere and try to do it yourself, as it will give you a better idea of how everything works and what formats you need data in.

To do this, we will use two files, a pos file and a neg file. The pos file has ~5,000 positive sentiment statements, and the neg file has ~5,000 negative sentiment statements.

What we are going to try to do here is use a neural network to correctly identify sentiment, training with this data. Right away, we're faced with some challenges.

First, our data is in language/word format, not numerical form, so it needs to be converted to a vector of features. So, then we begin pondering how we'll convert words to numbers, and then we make a second realization: our texts may not all be the same length in words or characters. This is a big deal, since we need all featuresets to be exactly the same length going into training and, of course, testing.

One option we have is to compile a list of all unique words in the training set. Let's say that's 3,500 unique words. These words are our lexicon. Now, we create a training vector of zeros that is 1x3500 in size, alongside our list of all unique words, which is also 1x3500. From here, for every word that is in our sample sentence, we check to see if it is in our unique word list. If so, the index of that word in the unique word list is set to 1 in the training vector. This is a very simple bag-of-words model.
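As a rough sketch of the lexicon step, assuming a tiny made-up corpus and plain whitespace splitting in place of a real tokenizer (the names `toy_sentences` and `build_lexicon` are hypothetical and not part of this tutorial's code), compiling the unique words might look like this:

```python
def build_lexicon(sentences):
    """Return a sorted list of the unique lowercase words in `sentences`."""
    unique_words = set()
    for sentence in sentences:
        # Whitespace splitting is only a stand-in for a real word tokenizer.
        unique_words.update(sentence.lower().split())
    # Sorting gives every word a stable index in the lexicon.
    return sorted(unique_words)


# Made-up sentences, for illustration only.
toy_sentences = [
    "the movie was great",
    "the movie was terrible",
    "great acting",
]

lexicon = build_lexicon(toy_sentences)
print(lexicon)
print(len(lexicon))
```

With real data the lexicon would be far larger, which is why the training vector in the text is 1x3500 rather than the handful of entries here.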

As a further example, let's say our unique word list is [chair, table, spoon, television]. Let's then say we have a training sentence that is "I pulled my chair up to the table." We first create our training vector as a vector of zeros that is the same size as our unique word list. This would be a 1x4: [0 0 0 0]. Now, we iterate through all of the words in that sample sentence and, if they are in the unique word list, we make that index's value in the training vector equal to 1. Since chair (index: 0) and table (index: 1) are in the unique word list, and no others are, our new training feature vector is [1 1 0 0]. Then, we label the data as either positive or negative. Again, here, we will just use one-hot encoding and have the label vector be [POS,NEG], where positive data is [1,0] and negative data is [0,1].
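The chair/table example can be sketched in a few lines. This is only an illustration under some assumptions: the helper name `sentence_to_features` is hypothetical, punctuation is stripped by hand instead of using a tokenizer, plain lists stand in for numpy arrays, and the positive label is chosen arbitrarily for the demo.

```python
lexicon = ["chair", "table", "spoon", "television"]


def sentence_to_features(sentence, lexicon):
    """Mark with 1 every lexicon word that appears in the sentence."""
    features = [0] * len(lexicon)
    for word in sentence.lower().split():
        # Strip trailing punctuation by hand; a real tokenizer would handle this.
        word = word.strip(".,!?")
        if word in lexicon:
            features[lexicon.index(word)] = 1
    return features


# One-hot labels in [POS, NEG] order, as described in the text.
POSITIVE = [1, 0]
NEGATIVE = [0, 1]

features = sentence_to_features("I pulled my chair up to the table.", lexicon)
# The label here is an arbitrary choice, just to show the pairing.
sample = [features, POSITIVE]

print(features)  # [1, 1, 0, 0]
print(sample)    # [[1, 1, 0, 0], [1, 0]]
```

Every sentence, however long, comes out as a vector of the same length as the lexicon, which is what solves the varying-length problem.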

To aid us in the pre-processing, we're going to make use of NLTK (Natural Language Toolkit). Our main interest here is the word tokenizer, as well as the lemmatizer. Word tokenizers separate words for us. A lemmatizer takes similar words and converts them into the same single word. The concept is very similar to stemming, only a lemma is an actual word, one you could look up in a dictionary or something like WordNet. This will help us keep our lexicon much smaller, without losing too much value.

If you do not have NLTK, you need to install it.

pip3 install nltk


Next, open a Python session and run:

import nltk
nltk.download()

This will either open a GUI or stay headless in text mode. In the GUI, just choose to download all. In the text version, type d, then all. Once that's done, you're ready to progress. If you're lost or confused, check out the first NLTK tutorial for installing NLTK.

That's the plan, so let's make it happen. We'll start by creating a file, which we'll keep separate from the neural network/TensorFlow modeling program.

import nltk
from nltk.tokenize import word_tokenize
import numpy as np
import random
import pickle
from collections import Counter
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
hm_lines = 100000

These are just some necessary imports. NLTK has been explained, numpy is a given, random will be used to shuffle the data, Counter will be used for counting and sorting the most common lemmas, and pickle will save the results so that we don't need to redo the processing every time. We define the lemmatizer, and then we set the hm_lines value. A value of 100,000 will process all of the lines, since there are just over 10,000 lines. If you want to test something new, or shrink the total data size for a smaller computer/processor, you can set a smaller number here. I mostly used this for quickly testing new functions, etc. There's no reason to run through the entire set just to quickly test a different method.
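To make the roles of these pieces a little more concrete, here is a minimal sketch using only the standard library. The sample lines, the tiny `hm_lines` value, and the in-memory pickle round trip are all assumptions made for illustration; this is not the processing code for this series.

```python
import pickle
import random
from collections import Counter

hm_lines = 3  # deliberately tiny cap, just for this demo

# Made-up lines standing in for the contents of a data file.
lines = [
    "good good fun",
    "bad boring bad",
    "good story",
    "this line is past the cap",
]

# Only the first hm_lines lines are processed.
words = []
for line in lines[:hm_lines]:
    words.extend(line.split())

# Counter tallies each word and can list the most common ones.
word_counts = Counter(words)
print(word_counts.most_common(2))  # [('good', 3), ('bad', 2)]

# random shuffles samples in place so their order carries no meaning.
samples = [[1, 0], [0, 1], [1, 1]]
random.shuffle(samples)

# pickle serializes the result so it can be reloaded later without reprocessing.
restored = pickle.loads(pickle.dumps(samples))
print(restored == samples)  # True
```

In practice the pickled bytes would be written to a file with pickle.dump, so the expensive preprocessing only has to run once.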
