Welcome to the 12th part of our Machine Learning with Python tutorial series. We've been learning about regression, and even coded our own very simple linear regression algorithm. Along with that, we've also built a coefficient of determination algorithm to check the accuracy and reliability of our best-fit line. We've discussed and shown how a best-fit line may not be a great fit, and also explained why our example was directionally correct, even if it was not exact. Now, however, we are at the point where we're using two top-level algorithms, each of which is composed of a handful of smaller algorithms. As we continue building this hierarchy of algorithms, we might find ourselves in trouble if just one of them has a tiny error, so we want to test our assumptions.
In the world of programming, the systematic testing of a program's individual components is generally called "unit testing." This is where, as a large program is built, each tiny sub-system is constantly checked. As updates and modifications are made to a large program, it can be very easy to accidentally change something that conflicts with a previous system. This can also be a problem with machine learning, but our main concern at this point is merely to test our assumptions. Eventually, you would be wise to create unit tests for your entire machine learning system, but for now we'll keep it as simple as possible.
Our assumption is that we created a best-fit line, which we then tested with a coefficient of determination method. We know (mathematically) that the r-squared value will be lower when the best-fit line is worse, and higher (closer to 1) when the best-fit line is better. Our assumption is that we've built a system that works this way. Our system has many parts, and even a simple order-of-operations error would throw everything out the window. How could we test our algorithm's behavior, so that we're confident everything is working as intended?
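For reference, the relationship described above can be written out as follows. This is only a restatement of the coefficient of determination used in this series, with notation chosen for this sketch: y_i are the original values, \hat{y}_i the values on the best-fit line, and \bar{y} the mean of the original values.

```latex
r^2 = 1 - \frac{SE_{\hat{y}}}{SE_{\bar{y}}}
    = 1 - \frac{\sum_i (\hat{y}_i - y_i)^2}{\sum_i (\bar{y} - y_i)^2}
```

A tight fit makes the numerator small relative to the denominator, pushing the value toward 1, while a line that does no better than the mean gives a value near 0.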
The idea here is to create a sample dataset that is defined by us. If we have a positively correlated dataset where the correlation is strong and tight, then r-squared should be higher than if the correlation is weaker and the points are more loosely scattered. The easier it would be for us to eyeball the data and estimate the line ourselves, the better the machine should perform. Let's build a system that generates example data according to parameters that we dictate:
To start, we'll build a skeleton function that mimics what the end-goal is:
import random
import numpy as np

def create_dataset(hm, variance, step=2, correlation=False):
    # Placeholders for now; the next steps fill these in.
    xs, ys = [], []
    return np.array(xs, dtype=np.float64), np.array(ys, dtype=np.float64)
We've got the beginnings of a function that takes the following parameters:
hm - Short for "how much." This is how many data points we want in the set. We could choose to have 10, or 10 million, for example.
variance - This will dictate how much each point can vary from the underlying trend value. The more variance, the less tight the data will be.
step - This will be how far to step on average per point, defaulting to 2.
correlation - This will be either False, 'pos', or 'neg' to indicate that we want no correlation, positive correlation, or negative correlation.
Note that we also imported random, which will help us generate seemingly random datasets.
Now we'll begin filling in the function:
def create_dataset(hm, variance, step=2, correlation=False):
    val = 1
    ys = []
    for i in range(hm):
        y = val + random.randrange(-variance, variance)
        ys.append(y)
Simple enough: we iterate through the range that we chose with the hm (how much) variable, appending the current value plus a random integer drawn between the negative variance and the positive variance. This gives us data, but currently no correlation if we wanted it. Let's add that:
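One detail of the random term is worth checking on its own: random.randrange(a, b) draws from a half-open range, so the upper bound is never returned. The snippet below is a small sketch that confirms this for the same call used in the function; the seed and the variance value of 10 are arbitrary choices for illustration.

```python
import random

random.seed(0)   # fixed seed so the sketch is repeatable
variance = 10    # example value, chosen only for illustration

# randrange(a, b) returns an integer n with a <= n < b,
# so the offset can reach -variance but never +variance.
offsets = [random.randrange(-variance, variance) for _ in range(10000)]

print(min(offsets), max(offsets))
```

In practice this means the noise is very slightly skewed toward negative values, which is harmless for our purposes. Note also that a variance of 0 would make the range empty, and randrange raises a ValueError in that case.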
def create_dataset(hm, variance, step=2, correlation=False):
    val = 1
    ys = []
    for i in range(hm):
        y = val + random.randrange(-variance, variance)
        ys.append(y)
        if correlation and correlation == 'pos':
            val += step
        elif correlation and correlation == 'neg':
            val -= step
Great, now we've got a good definition for y values. Next, let's create the xs, which are much easier, then return the whole thing:
def create_dataset(hm, variance, step=2, correlation=False):
    val = 1
    ys = []
    for i in range(hm):
        y = val + random.randrange(-variance, variance)
        ys.append(y)
        if correlation and correlation == 'pos':
            val += step
        elif correlation and correlation == 'neg':
            val -= step

    xs = [i for i in range(len(ys))]

    return np.array(xs, dtype=np.float64), np.array(ys, dtype=np.float64)
We're all set now. To create a sample dataset, all we need to do is something like:
xs, ys = create_dataset(40,40,2,correlation='pos')
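Before plugging the generator into the regression code, it may be worth sanity-checking the generator itself. The sketch below repeats create_dataset (lightly condensed) so that it runs on its own, then uses a variance of 1, an assumption made purely for this check: with that setting the random offset can only be -1 or 0, so a step of 2 always outweighs the noise and the direction of the data is guaranteed rather than merely likely.

```python
import random
import numpy as np

def create_dataset(hm, variance, step=2, correlation=False):
    # Same logic as the function built above, repeated for self-containment.
    val = 1
    ys = []
    for _ in range(hm):
        ys.append(val + random.randrange(-variance, variance))
        if correlation == 'pos':
            val += step
        elif correlation == 'neg':
            val -= step
    xs = list(range(len(ys)))
    return np.array(xs, dtype=np.float64), np.array(ys, dtype=np.float64)

xs_pos, ys_pos = create_dataset(20, 1, 2, correlation='pos')
xs_neg, ys_neg = create_dataset(20, 1, 2, correlation='neg')
xs_flat, ys_flat = create_dataset(20, 1, 2, correlation=False)

# Shape and dtype match what we asked for.
assert len(xs_pos) == len(ys_pos) == 20
assert ys_pos.dtype == np.float64

# Offsets of -1 or 0 cannot cancel a step of 2, so the trend is strict.
assert np.all(np.diff(ys_pos) > 0)
assert np.all(np.diff(ys_neg) < 0)

# Without correlation, val stays at 1, so every y is 0 or 1.
assert set(ys_flat) <= {0.0, 1.0}
print("dataset checks passed")
```

Checks like these only hold for such a small variance; with larger values the trend is statistical rather than strict, which is exactly what r-squared is meant to measure.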
Let's put it all together with the code from the previous linear regression tutorials:
from statistics import mean
import numpy as np
import random
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')


def create_dataset(hm, variance, step=2, correlation=False):
    val = 1
    ys = []
    for i in range(hm):
        y = val + random.randrange(-variance, variance)
        ys.append(y)
        if correlation and correlation == 'pos':
            val += step
        elif correlation and correlation == 'neg':
            val -= step

    xs = [i for i in range(len(ys))]

    return np.array(xs, dtype=np.float64), np.array(ys, dtype=np.float64)


def best_fit_slope_and_intercept(xs, ys):
    m = (((mean(xs)*mean(ys)) - mean(xs*ys)) /
         ((mean(xs)*mean(xs)) - mean(xs*xs)))

    b = mean(ys) - m*mean(xs)

    return m, b


def coefficient_of_determination(ys_orig, ys_line):
    y_mean_line = [mean(ys_orig) for y in ys_orig]

    squared_error_regr = sum((ys_line - ys_orig) * (ys_line - ys_orig))
    squared_error_y_mean = sum((y_mean_line - ys_orig) * (y_mean_line - ys_orig))

    print(squared_error_regr)
    print(squared_error_y_mean)

    r_squared = 1 - (squared_error_regr/squared_error_y_mean)

    return r_squared


xs, ys = create_dataset(40, 40, 2, correlation='pos')
m, b = best_fit_slope_and_intercept(xs, ys)
regression_line = [(m*x)+b for x in xs]
r_squared = coefficient_of_determination(ys, regression_line)
print(r_squared)

plt.scatter(xs, ys, color='#003F72', label='data')
plt.plot(xs, regression_line, label='regression line')
plt.legend(loc=4)
plt.show()
Running that exact code, you should get something similar to:
The coefficient of determination: 0.516508576011 (note that yours will not be identical, since the data is randomly generated).
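If you want the same numbers from run to run while debugging, one option is to seed Python's random module before generating data. The sketch below shows the effect on the random term alone; noisy_offsets is a hypothetical helper written for this illustration, and the seed value is arbitrary.

```python
import random

def noisy_offsets(n, variance):
    # Hypothetical helper: the same random term create_dataset adds to each y.
    return [random.randrange(-variance, variance) for _ in range(n)]

random.seed(42)              # arbitrary seed
first = noisy_offsets(5, 40)

random.seed(42)              # re-seeding restarts the same sequence
second = noisy_offsets(5, 40)

print(first == second)       # True: identical seeds give identical draws
```

Calling random.seed once at the top of the full script would make the generated dataset, and therefore the r-squared value, repeat exactly on every run.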
Great, so our assumption is that our r-squared/coefficient of determination should improve if we make the dataset more tightly correlated. How would we do that? Simple: lower variance!
Using xs, ys = create_dataset(40,10,2,correlation='pos'), our r-squared value is now 0.939865240568, which is much better, as expected. Let's test a negative correlation next with xs, ys = create_dataset(40,10,2,correlation='neg'). The r-squared value is 0.930242442156, which is reassuringly similar to the previous one, since the two datasets had the same parameters, just opposite directions.
At this point, our assumptions are panning out and passing the test: less variance should result in a higher r-squared/coefficient of determination, and more variance in a lower one. What about no correlation? This should be even lower, and actually quite close to zero, unless we get a crazy random permutation that actually has correlation anyway. Let's test it with xs, ys = create_dataset(40,10,2,correlation=False). The coefficient of determination is 0.0152650900427.
By now, I think we should feel confident that things are working how we intended!
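Since any single random dataset can be lucky or unlucky, one way to make these checks repeatable is to average r-squared over several generated datasets and assert the relationships we expect. The following is a minimal sketch under a few assumptions: average_r_squared is a hypothetical helper, the seed and trial count are arbitrary, and np.mean and np.sum stand in for the statistics.mean and sum calls used earlier, purely to keep the block compact.

```python
import random
import numpy as np

def create_dataset(hm, variance, step=2, correlation=False):
    val = 1
    ys = []
    for _ in range(hm):
        ys.append(val + random.randrange(-variance, variance))
        if correlation == 'pos':
            val += step
        elif correlation == 'neg':
            val -= step
    xs = np.arange(len(ys), dtype=np.float64)
    return xs, np.array(ys, dtype=np.float64)

def best_fit_slope_and_intercept(xs, ys):
    m = ((np.mean(xs) * np.mean(ys) - np.mean(xs * ys)) /
         (np.mean(xs) ** 2 - np.mean(xs * xs)))
    b = np.mean(ys) - m * np.mean(xs)
    return m, b

def coefficient_of_determination(ys_orig, ys_line):
    squared_error_regr = np.sum((ys_line - ys_orig) ** 2)
    squared_error_y_mean = np.sum((np.mean(ys_orig) - ys_orig) ** 2)
    return 1 - squared_error_regr / squared_error_y_mean

def average_r_squared(variance, correlation, trials=50):
    # Hypothetical helper: mean r-squared over several random datasets.
    scores = []
    for _ in range(trials):
        xs, ys = create_dataset(40, variance, 2, correlation=correlation)
        m, b = best_fit_slope_and_intercept(xs, ys)
        scores.append(coefficient_of_determination(ys, m * xs + b))
    return float(np.mean(scores))

random.seed(1)  # arbitrary seed so the checks are repeatable
tight_pos = average_r_squared(10, 'pos')
loose_pos = average_r_squared(40, 'pos')
tight_neg = average_r_squared(10, 'neg')
no_corr = average_r_squared(10, False)

# Less variance should give a higher r-squared.
assert tight_pos > loose_pos
# Direction should not matter when the other parameters match.
assert abs(tight_pos - tight_neg) < 0.05
# No correlation should score lowest of all.
assert no_corr < loose_pos
print(tight_pos, loose_pos, tight_neg, no_corr)
```

If a later change breaks the slope, intercept, or r-squared calculation, at least one of these assertions should fail, which is the kind of safety net unit testing is meant to provide.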
Now that you have an appreciation for simple linear regression, let's move on to classification in the next tutorial.