Welcome to the 14th part of our Machine Learning with Python tutorial series. In the last part we introduced Classification, which is a supervised form of machine learning, and explained the K Nearest Neighbors algorithm intuition. In this tutorial, we're actually going to apply a simple example of the algorithm using Scikit-Learn, and then in the subsequent tutorials we'll build our own algorithm to learn more about how it works under the hood.
To exemplify classification, we're going to use a Breast Cancer Dataset, which is a dataset donated to the University of California, Irvine (UCI) collection from the University of Wisconsin-Madison. UCI has a large Machine Learning Repository. The datasets there are organized by the types of machine learning often used for them, data types, attribute types, topic areas, and a few others. It is very useful both for educational purposes and for machine learning algorithm development; I find myself coming back here frequently, and it's definitely worth a bookmark. From the Breast Cancer Dataset page, choose the Data Folder link. From there, grab breast-cancer-wisconsin.data and breast-cancer-wisconsin.names. These may not download, but may instead display in the browser. If that is the case for you, right click and save them.
After downloading, go ahead and open the breast-cancer-wisconsin.names file. Looking at this file, scrolling down to just after line 100, we get the names of the attributes (columns). With this information, we're going to manually add these labels to the breast-cancer-wisconsin.data file. Open that, and enter a new first line: id,clump_thickness,uniform_cell_size,uniform_cell_shape,marginal_adhesion,single_epi_cell_size,bare_nuclei,bland_chromation,normal_nucleoli,mitoses,class
Right out of the gate, you should be thinking about what our features will be and what our label will be. We're attempting to classify things, so it should be obvious that the classes are going to be whether the list of attributes leads to a benign or malignant tumor. Also, most of these columns appear to be of use, but are there any that are similar to the others or maybe useless? Absolutely: the id column is not something we actually want to feed into the classifier.
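If you want to sanity-check the columns and classes before going any further, a quick look at the dataframe works well. This is just a hypothetical inspection snippet and not part of the tutorial's main code (which follows in the next section); it assumes you saved the file, with the header line above, as breast-cancer-wisconsin.data.txt.
import pandas as pd

# Quick inspection; adjust the filename to match how you saved the data file.
df = pd.read_csv('breast-cancer-wisconsin.data.txt')

print(df.head())                   # first few rows, with the column names we just added
print(df['class'].value_counts())  # class 2 = benign, class 4 = malignant in this dataset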
Missing/bad data: This dataset also has some missing data in it, which we're going to need to clean! Let's start off with our imports, pulling in the data, and some cleaning:
import numpy as np
from sklearn import preprocessing, cross_validation, neighbors
import pandas as pd

# Adjust the filename to match how you saved the data file.
df = pd.read_csv('breast-cancer-wisconsin.data.txt')
# Replace missing values, marked with "?", with a large outlier value.
df.replace('?', -99999, inplace=True)
# The id column is not a useful feature, so drop it.
df.drop(['id'], 1, inplace=True)
After reading in the data, we take note that some columns have missing data, which shows up as a "?". The .names file informed us of this, but we would have discovered it eventually via an error if we attempted to feed this information through to a classifier. In this case, we're choosing to fill in a value of -99999 for any missing data. You can choose how you want to handle missing data (one alternative is sketched below), but, in the real world, you may find that 50% or more of your rows contain missing data in one of the columns, especially if you are collecting data with extensive attributes. -99999 isn't perfect, but it works well enough. Next, we drop the id column. When we are done, we'll comment out the dropping of the id column just to see what sort of impact including it might have.
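For example, if only a small fraction of rows are affected, simply dropping those rows is a common alternative. This is only a sketch of that idea, not something we use in the rest of the tutorial; it assumes the same data file and the same "?" placeholder as above.
import numpy as np
import pandas as pd

df = pd.read_csv('breast-cancer-wisconsin.data.txt')

# Alternative to filling with -99999: treat "?" as NaN and drop those rows entirely.
df.replace('?', np.nan, inplace=True)
df.dropna(inplace=True)
df.drop(['id'], axis=1, inplace=True)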
Next, we define our features (X) and labels (y):
X = np.array(df.drop(['class'], 1))
y = np.array(df['class'])
The features X are everything except for the class. Doing df.drop returns a new dataframe with our chosen column(s) dropped. The labels, y, are just the class column.
Now we create training and testing samples, using Scikit-Learn's cross_validation.train_test_split:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.2)
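Note that the cross_validation module was deprecated in later Scikit-Learn releases in favor of model_selection. If you are on a newer version and the import above fails, the equivalent call looks like the sketch below (same 80/20 split; random_state is an optional addition that just makes the split repeatable).
from sklearn.model_selection import train_test_split

# Same 80/20 split as above, with a fixed seed so results are reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)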
Define the classifier:
clf = neighbors.KNeighborsClassifier()
In this case, we're using the K Nearest Neighbors classifier from Sklearn.
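By default, KNeighborsClassifier uses 5 neighbors. If you want to experiment with k, or parallelize the neighbor search, you can pass those parameters explicitly; the line below is just a sketch of that, not something we tune in this tutorial.
# Explicitly set k (n_neighbors) and use all CPU cores for the neighbor search.
clf = neighbors.KNeighborsClassifier(n_neighbors=5, n_jobs=-1)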
Train the classifier:
clf.fit(X_train, y_train)
Test:
accuracy = clf.score(X_test, y_test)
print(accuracy)
The result should be about 95%, and that's out of the box without any tweaking. Very cool! Just for show, let's see what happens when we do indeed include truly meaningless and misleading data, by commenting out the dropping of the id column:
import numpy as np
from sklearn import preprocessing, cross_validation, neighbors
import pandas as pd

df = pd.read_csv('breast-cancer-wisconsin.data.txt')
df.replace('?', -99999, inplace=True)
#df.drop(['id'], 1, inplace=True)

X = np.array(df.drop(['class'], 1))
y = np.array(df['class'])

X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.2)

clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(accuracy)
The impact is staggering: accuracy drops from ~95% to ~60% on average. In the future, when AI rules the planet, note that you just need to feed it meaningless attributes to outsmart it! Interestingly enough, adding noise can be a way to help or hurt your algorithm. When combating your robot overlords, being able to distinguish between helpful noise and malicious noise may save your life!
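If you want to verify that "on average" claim yourself rather than trusting a single random split, you can repeat the train/test split several times and average the scores. Here is a minimal sketch of that experiment; average_accuracy is just a helper name made up for this sketch, and it assumes df has been read in with the id column still present and "?" already replaced, as in the script above.
def average_accuracy(df, drop_id, runs=25):
    # Average test accuracy over several random train/test splits.
    data = df.drop(['id'], 1) if drop_id else df
    X = np.array(data.drop(['class'], 1))
    y = np.array(data['class'])
    scores = []
    for _ in range(runs):
        X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.2)
        clf = neighbors.KNeighborsClassifier()
        clf.fit(X_train, y_train)
        scores.append(clf.score(X_test, y_test))
    return sum(scores) / len(scores)

print(average_accuracy(df, drop_id=True))   # with the id column dropped
print(average_accuracy(df, drop_id=False))  # with the id column included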
Next, you can probably guess how we'll make predictions if you followed the regression tutorial that used Scikit-Learn. First, we need some sample data. We can just make it up. For example, I will look at one of the lines in the data file and make something similar, merely shifting some of the values. You can also add noise to do further testing, provided the standard deviation is not outrageous (a small sketch of that appears near the end of this tutorial). Doing this is relatively safe, since you're not actually training on the falsified data, you're merely testing. I will just manually do this by making up a line:
example_measures = np.array([4,2,1,1,1,2,3,2,1])
Feel free to search the document for that list of features. It doesn't exist. Now you can do:
prediction = clf.predict(example_measures)
print(prediction)
...or depending on when you are watching this, you might not be able to! When doing that, I get a warning:
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
Okay, no problem. Do we have a single feature? Nope. Do we have a single sample? Yes! So we will use X.reshape(1, -1):
example_measures = np.array([4,2,1,1,1,2,3,2,1])
example_measures = example_measures.reshape(1, -1)
prediction = clf.predict(example_measures)
print(prediction)
Output:
0.95
[2]
The output here is first the accuracy (95%) and then the prediction (2, which is the benign class in this dataset; 4 is malignant), which is what we modeled our fake data to be.
What if we had two samples?
example_measures = np.array([[4,2,1,1,1,2,3,2,1],[4,2,1,1,1,2,3,2,1]])
example_measures = example_measures.reshape(2, -1)
prediction = clf.predict(example_measures)
print(prediction)
Darn this hard-coding. What if we don't know how many samples!?!
example_measures = np.array([[4,2,1,1,1,2,3,2,1],[4,2,1,1,1,2,3,2,1]])
example_measures = example_measures.reshape(len(example_measures), -1)
prediction = clf.predict(example_measures)
print(prediction)
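Finally, as mentioned earlier, another way to generate test samples is to take a realistic row and add a little noise to it. The snippet below is only an illustrative sketch: the noise scale is an arbitrary choice, and the values are clipped to the dataset's 1-10 attribute range described in the .names file.
base = np.array([4,2,1,1,1,2,3,2,1], dtype=float)

# Add small Gaussian noise to a realistic-looking row, then round and clip to the 1-10 range.
noisy_samples = base + np.random.normal(loc=0, scale=0.5, size=(5, len(base)))
noisy_samples = np.clip(np.round(noisy_samples), 1, 10)

print(clf.predict(noisy_samples))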
As you can see, implementing K Nearest Neighbors is not only easy, it's extremely accurate in this case. In the next tutorials, we're going to build our own K Nearest Neighbors algorithm from scratch, rather than using Scikit-Learn, in an attempt to learn more about the algorithm, understand how it works, and, most importantly, learn about one of its pitfalls.