Regression - How to Program R Squared

Welcome to part 11 of the Machine Learning with Python tutorial series.

Now that we know what we're looking for, let's actually calculate it in Python. The first step would be to calculate the squared error. A function for that might be something like:

def squared_error(ys_orig,ys_line):
    return sum((ys_line - ys_orig) * (ys_line - ys_orig))

With the above function, we can calculate the squared error between any line and the datapoints, so the same function works for both the regression line and the mean of the ys. That said, squared error is only one part of the coefficient of determination, so let's build that function next. Since the squared error function is only one line, you could fold it into the coefficient of determination function, but squared error is something you may well use elsewhere, so I will keep it as its own function. For r squared:

def coefficient_of_determination(ys_orig,ys_line):
    y_mean_line = [mean(ys_orig) for y in ys_orig]
    squared_error_regr = squared_error(ys_orig, ys_line)
    squared_error_y_mean = squared_error(ys_orig, y_mean_line)
    return 1 - (squared_error_regr/squared_error_y_mean)

What we've done here is calculate the y mean line using a one-line list comprehension. Then we calculate the squared error of the y mean line and of the regression line using the function from just above. Now, all we have left to do is calculate the r squared value, which is simply 1 minus the regression line's squared error divided by the y mean line's squared error. We return the value and we're done! All together now, skipping the graph part, the code is:

from statistics import mean
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style

xs = np.array([1,2,3,4,5], dtype=np.float64)
ys = np.array([5,4,6,5,6], dtype=np.float64)

def best_fit_slope_and_intercept(xs,ys):
    m = (((mean(xs)*mean(ys)) - mean(xs*ys)) /
         ((mean(xs)*mean(xs)) - mean(xs*xs)))
    b = mean(ys) - m*mean(xs)
    return m, b

def squared_error(ys_orig,ys_line):
    return sum((ys_line - ys_orig) * (ys_line - ys_orig))

def coefficient_of_determination(ys_orig,ys_line):
    y_mean_line = [mean(ys_orig) for y in ys_orig]
    squared_error_regr = squared_error(ys_orig, ys_line)
    squared_error_y_mean = squared_error(ys_orig, y_mean_line)
    return 1 - (squared_error_regr/squared_error_y_mean)

m, b = best_fit_slope_and_intercept(xs,ys)
regression_line = [(m*x)+b for x in xs]

r_squared = coefficient_of_determination(ys,regression_line)
print(r_squared)

##plt.plot(xs, regression_line, label='regression line')

Output: 0.321428571429
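If you want to double-check that number, here is a minimal sketch that recomputes it with plain NumPy array arithmetic. It is not part of the tutorial code: np.polyfit stands in for our own best_fit_slope_and_intercept function, and the names ss_res and ss_tot are just labels chosen for this example.

```python
import numpy as np

xs = np.array([1, 2, 3, 4, 5], dtype=np.float64)
ys = np.array([5, 4, 6, 5, 6], dtype=np.float64)

# np.polyfit with degree 1 returns the least-squares slope and intercept.
m, b = np.polyfit(xs, ys, 1)
regression_line = m * xs + b

# Squared error of the regression line against the data.
ss_res = np.sum((ys - regression_line) ** 2)
# Squared error of the y mean line against the data.
ss_tot = np.sum((ys - ys.mean()) ** 2)

r_squared = 1 - ss_res / ss_tot

print(m, b)
print(ss_res, ss_tot, r_squared)
```

Working it by hand gives the same thing: the slope is 0.3 and the intercept is 4.3, the regression line's squared error is 1.9, the y mean line's squared error is 2.8, and 1 - 1.9/2.8 is roughly 0.3214.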

That's a pretty low value, so our best-fit line isn't all that great according to this measure. Is r squared a good measure in this case? It may depend on what your goals are. In most cases, if you care about predicting exact future values, r squared is indeed very useful. If you're interested in predicting motion/direction, then our best fit line is actually pretty good so far, and r squared shouldn't carry as much weight. Look at our actual dataset though. We stuck with low, whole numbers, and the change from one value to the next was 20-50% at some points, which is a very high variance. It should not be all that surprising that, with this simple dataset, our best fit line still wasn't that descriptive of the actual data.

What we've just described, however, is an assumption. You know what they say about assume! While we can all, I hope, logically agree with the assumption, we need to come up with a way to test it. The algorithms involved so far are pretty basic. We have only a few layers going on here, so there is not too much room for error, but later on you are likely to have layers upon layers. Not just hierarchical layers for the algorithm itself to consider, but the algorithm will be composed of many layers of algorithms. Where possible, we need to test these to make sure our assumptions about how these algorithms are meant to act are true. Consider how simple it would be to screw up the order of operations in a function, and then, from there, disrupt the entire validity of thousands of lines of code after that!
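One cheap way to start testing is to feed the function inputs where we already know the answer. The following is only a sketch under a few assumptions: the two functions are rewritten here with np.asarray so they accept plain lists, and the variable names r2_perfect, r2_mean, and r2_bad are made up for this example. A line that hits every point should score exactly 1, the y mean line should score 0, and a line worse than the mean should go negative.

```python
import numpy as np

def squared_error(ys_orig, ys_line):
    # Convert both inputs so the subtraction works for lists as well as arrays.
    ys_orig = np.asarray(ys_orig, dtype=np.float64)
    ys_line = np.asarray(ys_line, dtype=np.float64)
    return np.sum((ys_line - ys_orig) ** 2)

def coefficient_of_determination(ys_orig, ys_line):
    ys_orig = np.asarray(ys_orig, dtype=np.float64)
    # Same y mean line as in the tutorial, built as an array.
    y_mean_line = np.full_like(ys_orig, ys_orig.mean())
    squared_error_regr = squared_error(ys_orig, ys_line)
    squared_error_y_mean = squared_error(ys_orig, y_mean_line)
    return 1 - (squared_error_regr / squared_error_y_mean)

ys = np.array([5, 4, 6, 5, 6], dtype=np.float64)

# A "line" that passes through every point has zero squared error.
r2_perfect = coefficient_of_determination(ys, ys)
# The y mean line compared against itself should score 0.
r2_mean = coefficient_of_determination(ys, np.full_like(ys, ys.mean()))
# A flat line far away from the data is worse than the mean.
r2_bad = coefficient_of_determination(ys, [10, 10, 10, 10, 10])

assert np.isclose(r2_perfect, 1.0)
assert np.isclose(r2_mean, 0.0)
assert r2_bad < 0

print(r2_perfect, r2_mean, r2_bad)
```

If someone later changed the return line to something like (1 - squared_error_regr) / squared_error_y_mean, the first two assertions would fail right away, which is exactly the kind of order-of-operations slip we are worried about.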

What we're going to do in the next tutorial is build a relatively simple dataset generator that will generate data according to our parameters. We can use this to manipulate data to our liking, and then test our algorithms against these datasets, changing parameters that, by our assumptions, should produce some sort of change. We can then compare our assumptions to the reality in hopes that they match up! In the case here, the assumptions are that we coded these algorithms correctly, and that the reason for the low coefficient of determination value was that the variation in y was actually quite large. We'll be testing this assumption in the next tutorial.
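To get a feel for the idea before building a real generator, here is a rough sketch with two hand-typed datasets. Everything in it is an assumption made for illustration: the arrays ys_tight and ys_noisy are invented, the helper r_squared_of_best_fit is a hypothetical name, and np.polyfit is used in place of our own slope and intercept function. Both datasets trend upward, but one hugs a line and the other jumps around.

```python
import numpy as np

def r_squared_of_best_fit(xs, ys):
    # Fit a straight line by least squares, then score it with r squared.
    m, b = np.polyfit(xs, ys, 1)
    regression_line = m * xs + b
    squared_error_regr = np.sum((ys - regression_line) ** 2)
    squared_error_y_mean = np.sum((ys - ys.mean()) ** 2)
    return 1 - (squared_error_regr / squared_error_y_mean)

xs = np.array([1, 2, 3, 4, 5, 6], dtype=np.float64)

# Invented data: small wiggles around an upward line.
ys_tight = np.array([1.1, 1.9, 3.0, 4.1, 4.9, 6.0], dtype=np.float64)
# Invented data: same general direction, much larger swings.
ys_noisy = np.array([3, 0, 5, 2, 7, 4], dtype=np.float64)

r2_tight = r_squared_of_best_fit(xs, ys_tight)
r2_noisy = r_squared_of_best_fit(xs, ys_noisy)

print(r2_tight, r2_noisy)
```

The tight dataset scores very close to 1 and the noisy one scores far lower, which is consistent with the assumption that large variation in y drags r squared down. Two hand-picked examples don't prove much on their own, though, which is why a generator with adjustable parameters is the better tool.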

The next tutorial: Creating Sample Data for Testing