10K samples compared to 1.6 million samples with Deep Learning




Welcome to part eight of the Deep Learning with Neural Networks and TensorFlow tutorials. In the last tutorial, we applied a deep neural network to our own dataset, but we didn't get very useful results. We're wondering what might happen if we significantly increase the size of the dataset. Before, we were using ~10,000 samples; how about we try 1.6 million samples?

The dataset that we will use this time is the Sentiment140 dataset from Stanford, which contains 1.6 million examples of positive and negative sentiment.

Now, the raw dataset probably isn't too large for you to fit into memory, but once we convert it to the bag-of-words model from before, it definitely will be. So, this time, we have to begin considering what to do when datasets are much larger. When working with large datasets, a few things change:

  • We want to buffer the data coming in rather than reading the entire file at once; for example, we can read it in segments of roughly 10 MB at a time with buffering (see the short sketch after this list).
  • We definitely need to run the data through our network in batches; our previous examples haven't actually required this.
  • Now that training takes much longer, we want to be able to save our progress, both so we can continue where we left off and so we don't have to re-train the model every single time.
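
To make the buffering point concrete, here's a minimal sketch of buffered, line-by-line reading (the file name here is just a placeholder; Python's buffering argument is in bytes, so roughly 10 MB is 10000000):

# we only ever hold one line (plus the read buffer) in memory at a time
with open('some_large_file.csv', buffering=10000000, encoding='latin-1') as f:
    for line in f:
        pass  # process each line here, one at a time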

With this in mind, we need to re-pre-process our data for this specific dataset. I am going to assume that you've been following along, so we will be running through this a bit quicker than usual. You can call this file whatever you like, we won't be importing it, we'll just be using it to pre-process our data. Something like data_preprocessing.py will suffice:

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import pickle
import numpy as np
import pandas as pd

lemmatizer = WordNetLemmatizer()

'''
polarity 0 = negative. 2 = neutral. 4 = positive.
id
date
query
user
tweet
'''

These are the packages we'll be using, along with some notes on the layout of the dataset.
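
One practical note: word_tokenize and WordNetLemmatizer rely on NLTK data packages that you may need to download once, if you haven't already (a one-time setup, assuming nltk is installed):

import nltk
nltk.download('punkt')    # tokenizer models used by word_tokenize
nltk.download('wordnet')  # corpus used by WordNetLemmatizer

With that out of the way, we'll first convert the sentiment values of the dataset: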

def init_process(fin,fout):
    outfile = open(fout,'a')
    with open(fin, buffering=200000, encoding='latin-1') as f:
        try:
            for line in f:
                line = line.replace('"','')
                initial_polarity = line.split(',')[0]
                if initial_polarity == '0':
                    initial_polarity = [1,0]
                elif initial_polarity == '4':
                    initial_polarity = [0,1]

                tweet = line.split(',')[-1]
                outline = str(initial_polarity)+':::'+tweet
                outfile.write(outline)
        except Exception as e:
            print(str(e))
    outfile.close()

Here, we simply pass a file in (the original dataset), then output the new file. We're modifying the sentiment label ([1,0] for negative, [0,1] for positive; the neutral rows, which only appear in the test file, are left as-is and end up being skipped later) and some of the formatting. You only need to run this function once each for the training and testing data, like so:

init_process('training.1600000.processed.noemoticon.csv','train_set.csv')
init_process('testdata.manual.2009.06.14.csv','test_set.csv')
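
After running this, each line in the new files looks roughly like the following (the tweet text here is just made up for illustration):

[0, 1]:::@someuser loving the new update!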

Next, we create our lexicon. This is very similar to our method before; the only difference is that this lexicon is built from one out of every 2,500 samples, which gives us a manageable sample of the data rather than reading all 1.6 million tweets:

def create_lexicon(fin):
    lexicon = []
    with open(fin, 'r', buffering=100000, encoding='latin-1') as f:
        try:
            counter = 1
            content = ''
            for line in f:
                counter += 1
                if (counter/2500.0).is_integer():
                    tweet = line.split(':::')[1]
                    content += ' '+tweet
                    words = word_tokenize(content)
                    words = [lemmatizer.lemmatize(i) for i in words]
                    lexicon = list(set(lexicon + words))
                    print(counter, len(lexicon))

        except Exception as e:
            print(str(e))

    with open('lexicon.pickle','wb') as f:
        pickle.dump(lexicon,f)

This lexicon creation only needs to be run once as well:

create_lexicon('train_set.csv')
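
If you want to sanity-check the lexicon, you can load the pickle and look at its length; this number is the input vector size the network below expects (ours came out to 2638 words, which is why you'll see 2638 hard-coded shortly, though yours may differ slightly):

import pickle

with open('lexicon.pickle','rb') as f:
    lexicon = pickle.load(f)
print(len(lexicon))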

Now, we can either vectorize the data into the bag-of-words model prior to training the network, or we can do it inline while training. At this dataset's size, it would still be possible to vectorize everything first, save it somewhere, and then feed it through the network, but that becomes impractical with much larger sets, so we'll vectorize the training data on the fly.
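
Just to be explicit about what "vectorizing" means here, it's the same counting scheme as in the previous tutorials; a minimal sketch (assuming the lexicon, lemmatizer, word_tokenize, and numpy imports from above):

def tweet_to_features(tweet, lexicon, lemmatizer):
    # lemmatize the tokens, then count how often each lexicon word appears
    words = [lemmatizer.lemmatize(w) for w in word_tokenize(tweet.lower())]
    features = np.zeros(len(lexicon))
    for word in words:
        if word in lexicon:
            features[lexicon.index(word)] += 1
    return features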

For our test set, however, we can easily vectorize everything ahead of time. In the full pre-processing script at the end of this tutorial, convert_to_vec writes the vectorized test set to processed-test-set.csv, and shuffle_data writes a shuffled copy of the training data to train_set_shuffled.csv, which the training loop expects. Once processed-test-set.csv exists, we can load it into arrays:

def create_test_data_pickle(fin):

    feature_sets = []
    labels = []
    counter = 0
    with open(fin, buffering=20000) as f:
        for line in f:
            try:
                features = list(eval(line.split('::')[0]))
                label = list(eval(line.split('::')[1]))

                feature_sets.append(features)
                labels.append(label)
                counter += 1
            except:
                pass
    print(counter)
    feature_sets = np.array(feature_sets)
    labels = np.array(labels)
    # note: despite the name, we never actually pickle these arrays here;
    # the testing code later rebuilds them straight from processed-test-set.csv

create_test_data_pickle('processed-test-set.csv')

Go ahead and run the pre-processing once (the full script, including convert_to_vec and shuffle_data, is listed near the end of this tutorial); it will create train_set.csv, test_set.csv, lexicon.pickle, processed-test-set.csv, and train_set_shuffled.csv. With those in place, we'll head over to our neural network Python script now:

import tensorflow as tf
import pickle
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

n_nodes_hl1 = 500
n_nodes_hl2 = 500

n_classes = 2

batch_size = 32
total_batches = int(1600000/batch_size)
hm_epochs = 10

x = tf.placeholder('float')
y = tf.placeholder('float')

# 2638 is the length of our lexicon (the size of each input feature vector); adjust if your lexicon differs
hidden_1_layer = {'f_fum':n_nodes_hl1,
                  'weight':tf.Variable(tf.random_normal([2638, n_nodes_hl1])),
                  'bias':tf.Variable(tf.random_normal([n_nodes_hl1]))}

hidden_2_layer = {'f_fum':n_nodes_hl2,
                  'weight':tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2])),
                  'bias':tf.Variable(tf.random_normal([n_nodes_hl2]))}

output_layer = {'f_fum':None,
                'weight':tf.Variable(tf.random_normal([n_nodes_hl2, n_classes])),
                'bias':tf.Variable(tf.random_normal([n_classes])),}

This should all look familiar. The only additions here are the batch_size and total_batches values, which we'll use shortly; with 1.6 million samples and a batch size of 32, that works out to 50,000 batches per epoch.

The deep neural network model itself is left unchanged:

def neural_network_model(data):
    l1 = tf.add(tf.matmul(data,hidden_1_layer['weight']), hidden_1_layer['bias'])
    l1 = tf.nn.relu(l1)
    l2 = tf.add(tf.matmul(l1,hidden_2_layer['weight']), hidden_2_layer['bias'])
    l2 = tf.nn.relu(l2)
    output = tf.matmul(l2,output_layer['weight']) + output_layer['bias']
    return output

Now, before we get into the next function, we'll add:

saver = tf.train.Saver()
tf_log = 'tf.log'

This is how we're going to save our model in the form of checkpoints as time goes on. Now we get to the training of the network:

def train_neural_network(x):
    prediction = neural_network_model(x)
    cost = tf.reduce_mean( tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=y) )
    optimizer = tf.train.AdamOptimizer(learning_rate=0.001).minimize(cost)
    with tf.Session() as sess:
        sess.run(tf.initialize_all_variables())
        try:
            epoch = int(open(tf_log,'r').read().split('\n')[-2])+1
            print('STARTING:',epoch)
        except:
            epoch = 1

        while epoch <= hm_epochs:
            if epoch != 1:
                saver.restore(sess,"model.ckpt")
            epoch_loss = 1
            with open('lexicon.pickle','rb') as f:
                lexicon = pickle.load(f)
            with open('train_set_shuffled.csv', buffering=20000, encoding='latin-1') as f:
                batch_x = []
                batch_y = []
                batches_run = 0
                for line in f:
                    label = line.split(':::')[0]
                    tweet = line.split(':::')[1]
                    current_words = word_tokenize(tweet.lower())
                    current_words = [lemmatizer.lemmatize(i) for i in current_words]

                    features = np.zeros(len(lexicon))

                    for word in current_words:
                        if word.lower() in lexicon:
                            index_value = lexicon.index(word.lower())
                            # OR DO +=1, test both
                            features[index_value] += 1
                    line_x = list(features)
                    line_y = eval(label)
                    batch_x.append(line_x)
                    batch_y.append(line_y)
                    if len(batch_x) >= batch_size:
                        _, c = sess.run([optimizer, cost], feed_dict={x: np.array(batch_x),
                                                                  y: np.array(batch_y)})
                        epoch_loss += c
                        batch_x = []
                        batch_y = []
                        batches_run +=1
                        print('Batch run:',batches_run,'/',total_batches,'| Epoch:',epoch,'| Batch Loss:',c,)

            saver.save(sess, "model.ckpt")
            print('Epoch', epoch, 'completed out of',hm_epochs,'loss:',epoch_loss)
            with open(tf_log,'a') as f:
                f.write(str(epoch)+'\n') 
            epoch +=1

train_neural_network(x)

Much of this should be familiar, but there are a few changes. First, note the:

        try:
            epoch = int(open(tf_log,'r').read().split('\n')[-2])+1
            print('STARTING:',epoch)
        except:
            epoch = 1

This is our way of tracking what epoch we're on, using a simple log file. I am still pretty new to TensorFlow myself; I would expect there to be a way to save the epoch number within the model itself, but I couldn't seem to get it working.
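
If you wanted to experiment with keeping the epoch counter inside the checkpoint itself, one plausible approach (just a sketch, not something this tutorial's scripts actually use) is a non-trainable tf.Variable that gets saved and restored along with the weights:

current_epoch = tf.Variable(1, trainable=False, name='current_epoch')
increment_epoch = tf.assign(current_epoch, current_epoch + 1)

# inside the session: after finishing an epoch, bump the counter and save
#     sess.run(increment_epoch)
#     saver.save(sess, "model.ckpt")
# on a later run, right after saver.restore(sess, "model.ckpt"):
#     epoch = sess.run(current_epoch)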

Next:

        while epoch <= hm_epochs:

            if epoch != 1:
                saver.restore(sess,"model.ckpt")
            epoch_loss = 1

            with open('lexicon.pickle','rb') as f:
                lexicon = pickle.load(f)

            with open('train_set_shuffled.csv', buffering=20000, encoding='latin-1') as f:
                batch_x = []
                batch_y = []
                batches_run = 0

This is how we keep going until we've completed as many epochs as we want. As each epoch begins, if we're not starting at the first one, we load in the checkpoint file. We load in the lexicon pickle, then begin reading the shuffled training set. From here, each line is vectorized and appended to the current batch; once the batch reaches batch_size, we run it through the network.

                for line in f:

                    label = line.split(':::')[0]
                    tweet = line.split(':::')[1]
                    current_words = word_tokenize(tweet.lower())
                    current_words = [lemmatizer.lemmatize(i) for i in current_words]

                    features = np.zeros(len(lexicon))

                    for word in current_words:
                        if word.lower() in lexicon:
                            index_value = lexicon.index(word.lower())
                            # OR DO +=1, test both
                            features[index_value] += 1

                    line_x = list(features)
                    line_y = eval(label)

                    batch_x.append(line_x)
                    batch_y.append(line_y)

                    if len(batch_x) >= batch_size:
                        _, c = sess.run([optimizer, cost], feed_dict={x: np.array(batch_x),
                                                                  y: np.array(batch_y)})
                        epoch_loss += c

                        batch_x = []
                        batch_y = []
                        batches_run +=1
                        
                        print('Batch run:',batches_run,'/',total_batches,'| Epoch:',epoch,'| Batch Loss:',c,)

This would be a full epoch at this point, so then we want to be sure to save it, and update our epoch log file:

            saver.save(sess, "model.ckpt")
            print('Epoch', epoch, 'completed out of',hm_epochs,'loss:',epoch_loss)
            with open(tf_log,'a') as f:
                f.write(str(epoch)+'\n') 
            epoch +=1

When done, we can test our accuracy as usual:

        correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))
        accuracy = tf.reduce_mean(tf.cast(correct, 'float'))

        feature_sets = []
        labels = []
        counter = 0
        with open('processed-test-set.csv', buffering=20000) as f:
            for line in f:
                try:
                    features = list(eval(line.split('::')[0]))
                    label = list(eval(line.split('::')[1]))

                    feature_sets.append(features)
                    labels.append(label)
                    counter += 1
                except:
                    pass
        print('Tested',counter,'samples.')

        test_x = np.array(feature_sets)
        test_y = np.array(labels)

        print('Accuracy:',accuracy.eval({x:test_x, y:test_y}))

Since we're using checkpoints now, and a larger dataset, we might actually want a separate testing function for accuracy:

def test_neural_network():
    prediction = neural_network_model(x)
    with tf.Session() as sess:
        sess.run(tf.initialize_all_variables())
      
        # restore the most recent checkpoint once; every epoch is saved to the same file
        try:
            saver.restore(sess,"model.ckpt")
        except Exception as e:
            print(str(e))


        correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))
        accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
        feature_sets = []
        labels = []
        counter = 0
        with open('processed-test-set.csv', buffering=20000) as f:
            for line in f:
                try:
                    features = list(eval(line.split('::')[0]))
                    label = list(eval(line.split('::')[1]))

                    #print(features)
                    #print(label)

                    feature_sets.append(features)
                    labels.append(label)
                    counter += 1
                except:
                    pass
        print('Tested',counter,'samples.')

        test_x = np.array(feature_sets)
        test_y = np.array(labels)
        print('Accuracy:',accuracy.eval({x:test_x, y:test_y}))


test_neural_network()

Just in case you missed something, the full pre-processing script:

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import pickle
import numpy as np
import pandas as pd

lemmatizer = WordNetLemmatizer()

'''
polarity 0 = negative. 2 = neutral. 4 = positive.
id
date
query
user
tweet
'''

def init_process(fin,fout):
	outfile = open(fout,'a')
	with open(fin, buffering=200000, encoding='latin-1') as f:
		try:
			for line in f:
				line = line.replace('"','')
				initial_polarity = line.split(',')[0]
				if initial_polarity == '0':
					initial_polarity = [1,0]
				elif initial_polarity == '4':
					initial_polarity = [0,1]

				tweet = line.split(',')[-1]
				outline = str(initial_polarity)+':::'+tweet
				outfile.write(outline)
		except Exception as e:
			print(str(e))
	outfile.close()

init_process('training.1600000.processed.noemoticon.csv','train_set.csv')
init_process('testdata.manual.2009.06.14.csv','test_set.csv')


def create_lexicon(fin):
	lexicon = []
	with open(fin, 'r', buffering=100000, encoding='latin-1') as f:
		try:
			counter = 1
			content = ''
			for line in f:
				counter += 1
				if (counter/2500.0).is_integer():
					tweet = line.split(':::')[1]
					content += ' '+tweet
					words = word_tokenize(content)
					words = [lemmatizer.lemmatize(i) for i in words]
					lexicon = list(set(lexicon + words))
					print(counter, len(lexicon))

		except Exception as e:
			print(str(e))

	with open('lexicon.pickle','wb') as f:
		pickle.dump(lexicon,f)

create_lexicon('train_set.csv')


def convert_to_vec(fin,fout,lexicon_pickle):
	with open(lexicon_pickle,'rb') as f:
		lexicon = pickle.load(f)
	outfile = open(fout,'a')
	with open(fin, buffering=20000, encoding='latin-1') as f:
		counter = 0
		for line in f:
			counter +=1
			label = line.split(':::')[0]
			tweet = line.split(':::')[1]
			current_words = word_tokenize(tweet.lower())
			current_words = [lemmatizer.lemmatize(i) for i in current_words]

			features = np.zeros(len(lexicon))

			for word in current_words:
				if word.lower() in lexicon:
					index_value = lexicon.index(word.lower())
					# OR DO +=1, test both
					features[index_value] += 1

			features = list(features)
			outline = str(features)+'::'+str(label)+'\n'
			outfile.write(outline)

		print(counter)

convert_to_vec('test_set.csv','processed-test-set.csv','lexicon.pickle')


def shuffle_data(fin):
	df = pd.read_csv(fin, error_bad_lines=False)
	df = df.iloc[np.random.permutation(len(df))]
	print(df.head())
	df.to_csv('train_set_shuffled.csv', index=False)
	
shuffle_data('train_set.csv')


def create_test_data_pickle(fin):

	feature_sets = []
	labels = []
	counter = 0
	with open(fin, buffering=20000) as f:
		for line in f:
			try:
				features = list(eval(line.split('::')[0]))
				label = list(eval(line.split('::')[1]))

				feature_sets.append(features)
				labels.append(label)
				counter += 1
			except:
				pass
	print(counter)
	feature_sets = np.array(feature_sets)
	labels = np.array(labels)

create_test_data_pickle('processed-test-set.csv')

Neural Network Script:

import tensorflow as tf
import pickle
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

n_nodes_hl1 = 500
n_nodes_hl2 = 500

n_classes = 2

batch_size = 32
total_batches = int(1600000/batch_size)
hm_epochs = 10

x = tf.placeholder('float')
y = tf.placeholder('float')

hidden_1_layer = {'f_fum':n_nodes_hl1,
                  'weight':tf.Variable(tf.random_normal([2638, n_nodes_hl1])),
                  'bias':tf.Variable(tf.random_normal([n_nodes_hl1]))}

hidden_2_layer = {'f_fum':n_nodes_hl2,
                  'weight':tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2])),
                  'bias':tf.Variable(tf.random_normal([n_nodes_hl2]))}

output_layer = {'f_fum':None,
                'weight':tf.Variable(tf.random_normal([n_nodes_hl2, n_classes])),
                'bias':tf.Variable(tf.random_normal([n_classes])),}

def neural_network_model(data):
    l1 = tf.add(tf.matmul(data,hidden_1_layer['weight']), hidden_1_layer['bias'])
    l1 = tf.nn.relu(l1)
    l2 = tf.add(tf.matmul(l1,hidden_2_layer['weight']), hidden_2_layer['bias'])
    l2 = tf.nn.relu(l2)
    output = tf.matmul(l2,output_layer['weight']) + output_layer['bias']
    return output

saver = tf.train.Saver()
tf_log = 'tf.log'

def train_neural_network(x):
    prediction = neural_network_model(x)
    cost = tf.reduce_mean( tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=y) )
    optimizer = tf.train.AdamOptimizer(learning_rate=0.001).minimize(cost)
    with tf.Session() as sess:
        sess.run(tf.initialize_all_variables())
        try:
            epoch = int(open(tf_log,'r').read().split('\n')[-2])+1
            print('STARTING:',epoch)
        except:
            epoch = 1

        while epoch <= hm_epochs:
            if epoch != 1:
                saver.restore(sess,"model.ckpt")
            epoch_loss = 1
            with open('lexicon.pickle','rb') as f:
                lexicon = pickle.load(f)
            with open('train_set_shuffled.csv', buffering=20000, encoding='latin-1') as f:
                batch_x = []
                batch_y = []
                batches_run = 0
                for line in f:
                    label = line.split(':::')[0]
                    tweet = line.split(':::')[1]
                    current_words = word_tokenize(tweet.lower())
                    current_words = [lemmatizer.lemmatize(i) for i in current_words]

                    features = np.zeros(len(lexicon))

                    for word in current_words:
                        if word.lower() in lexicon:
                            index_value = lexicon.index(word.lower())
                            # OR DO +=1, test both
                            features[index_value] += 1
                    line_x = list(features)
                    line_y = eval(label)
                    batch_x.append(line_x)
                    batch_y.append(line_y)
                    if len(batch_x) >= batch_size:
                        _, c = sess.run([optimizer, cost], feed_dict={x: np.array(batch_x),
                                                                  y: np.array(batch_y)})
                        epoch_loss += c
                        batch_x = []
                        batch_y = []
                        batches_run +=1
                        print('Batch run:',batches_run,'/',total_batches,'| Epoch:',epoch,'| Batch Loss:',c,)

            saver.save(sess, "model.ckpt")
            print('Epoch', epoch, 'completed out of',hm_epochs,'loss:',epoch_loss)
            with open(tf_log,'a') as f:
                f.write(str(epoch)+'\n') 
            epoch +=1

train_neural_network(x)

def test_neural_network():
    prediction = neural_network_model(x)
    with tf.Session() as sess:
        sess.run(tf.initialize_all_variables())
        # restore the most recent checkpoint once; every epoch is saved to the same file
        try:
            saver.restore(sess,"model.ckpt")
        except Exception as e:
            print(str(e))
            
        correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))
        accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
        feature_sets = []
        labels = []
        counter = 0
        with open('processed-test-set.csv', buffering=20000) as f:
            for line in f:
                try:
                    features = list(eval(line.split('::')[0]))
                    label = list(eval(line.split('::')[1]))
                    feature_sets.append(features)
                    labels.append(label)
                    counter += 1
                except:
                    pass
        print('Tested',counter,'samples.')
        test_x = np.array(feature_sets)
        test_y = np.array(labels)
        print('Accuracy:',accuracy.eval({x:test_x, y:test_y}))


test_neural_network()

How'd we do? After about 10 epochs, we're sitting around 74% accuracy. This isn't anything to write home about, but at least it's better than 50%. While we're thinking of it, it might be wise to confirm that our dataset is a perfect 50/50 split. What if our model always predicts positive sentiment, and 74% of our data is positive? We had better check:

import pandas as pd
from collections import Counter

df = pd.read_csv('train_set.csv',names=['sentiment','tweet'], delimiter=':::', engine='python')
print(Counter(df['sentiment']))

Sure enough, the counter comes back with 800,000 of each label, a perfect 50/50 split.

What if we actually liked the 74% accuracy, and we wanted to actually use this model? How might we do it? We already have everything we need at this point. All we need to do is take a string input, vectorize it according to our bag-of-words model, feed it through the neural network, and the output will be [1,0] (argmax of 0) for negative sentiment or [0,1] (argmax of 1) for positive sentiment, matching how we encoded the labels in init_process. Here's a quick script that simply uses the network:

import tensorflow as tf
import pickle
import numpy as np
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

n_nodes_hl1 = 500
n_nodes_hl2 = 500

n_classes = 2
hm_data = 2000000

batch_size = 32
hm_epochs = 10

x = tf.placeholder('float')
y = tf.placeholder('float')


current_epoch = tf.Variable(1)

hidden_1_layer = {'f_fum':n_nodes_hl1,
                  'weight':tf.Variable(tf.random_normal([2638, n_nodes_hl1])),
                  'bias':tf.Variable(tf.random_normal([n_nodes_hl1]))}

hidden_2_layer = {'f_fum':n_nodes_hl2,
                  'weight':tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2])),
                  'bias':tf.Variable(tf.random_normal([n_nodes_hl2]))}

output_layer = {'f_fum':None,
                'weight':tf.Variable(tf.random_normal([n_nodes_hl2, n_classes])),
                'bias':tf.Variable(tf.random_normal([n_classes])),}


def neural_network_model(data):

    l1 = tf.add(tf.matmul(data,hidden_1_layer['weight']), hidden_1_layer['bias'])
    l1 = tf.nn.relu(l1)

    l2 = tf.add(tf.matmul(l1,hidden_2_layer['weight']), hidden_2_layer['bias'])
    l2 = tf.nn.relu(l2)

    output = tf.matmul(l2,output_layer['weight']) + output_layer['bias']

    return output

saver = tf.train.Saver()

def use_neural_network(input_data):
    prediction = neural_network_model(x)
    with open('lexicon.pickle','rb') as f:
        lexicon = pickle.load(f)
        
    with tf.Session() as sess:
        sess.run(tf.initialize_all_variables())
        saver.restore(sess,"model.ckpt")
        current_words = word_tokenize(input_data.lower())
        current_words = [lemmatizer.lemmatize(i) for i in current_words]
        features = np.zeros(len(lexicon))

        for word in current_words:
            if word.lower() in lexicon:
                index_value = lexicon.index(word.lower())
                # OR DO +=1, test both
                features[index_value] += 1

        features = np.array(list(features))
        # neg: [1,0] , argmax: 0
        # pos: [0,1] , argmax: 1
        # (this matches init_process, where polarity 0/negative became [1,0] and 4/positive became [0,1])
        result = (sess.run(tf.argmax(prediction.eval(feed_dict={x:[features]}),1)))
        if result[0] == 0:
            print('Negative:',input_data)
        elif result[0] == 1:
            print('Positive:',input_data)

use_neural_network("He's an idiot and a jerk.")
use_neural_network("This was the best store i've ever seen.")

Output here should be:

Negative: He's an idiot and a jerk.
Positive: This was the best store i've ever seen.

I have hosted the checkpoint file and lexicon pickle for anyone who is interested, since running the 10-15 epochs to get a decent model can take a while.

In the next tutorial, we're going to be talking about using CUDA and the GPU version of TensorFlow.

The next tutorial: How to use CUDA and the GPU Version of Tensorflow for Deep Learning