Hello and welcome to part 6 of the Data Analysis with Python and Pandas series, where we're going to be looking into using Pandas as the data pre-processing step for machine learning.
Let's start with a simple regression task, where we're attempting to price out the value of diamonds, using the following diamond dataset.
import pandas as pd
df = pd.read_csv("datasets/diamonds.csv", index_col=0)
df.head()
Now, the question is whether we can come up with some sort of formula that takes inputs like carat, cut, color, clarity, depth, table, x, y, and z, and predicts the price.
The basis of machine learning is math, so columns with string values like cut and clarity have to be converted to numbers.
I would like to start us off with linear regression, so it's also fairly ideal that our string classifications are ordinal, meaning they have a meaningful order. Let's see what all of our cuts are, for example:
df['cut'].unique()
Okay, we can take this and hard-code the order:
cut_class_dict = {"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5}
Next, let's check out clarity:
df['clarity'].unique()
FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3
- Taken from the dataset page, this is ordered best to worst, so now we need this in a dict too.
We also have color. D is the best, J is the worst.
clarity_dict = {"I3": 1, "I2": 2, "I1": 3, "SI2": 4, "SI1": 5, "VS2": 6, "VS1": 7, "VVS2": 8, "VVS1": 9, "IF": 10, "FL": 11}
color_dict = {"J": 1,"I": 2,"H": 3,"G": 4,"F": 5,"E": 6,"D": 7}
Now we map this:
df['cut'] = df['cut'].map(cut_class_dict)
df['clarity'] = df['clarity'].map(clarity_dict)
df['color'] = df['color'].map(color_dict)
df.head()
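One thing worth checking here: .map() silently converts anything that isn't a key in the dict to NaN, so a quick null count can save some debugging later:
print(df.isnull().sum())  # any value missing from the dicts above would show up here as a non-zero count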
Alright, let's see if we can train a regression model to figure this out. This will be what is called a "supervised" learning task. With supervised learning, your job will pretty much always be the same. You take the data you want to use to make a prediction, and separate it out into an array. Then you take the data you want to predict, and separate that out into another array.
Then, you feed the data you want to use to make the prediction (features) and then the correct values that you want to build a model to learn to map to (your labels) into some type of model.
Scikit-learn is a popular package used for doing regular machine learning (not deep learning usually, though you can do deep learning with sklearn). To get it:
pip install scikit-learn
While you pip install scikit-learn, you actually import things from sklearn.
Next, we pick a model. A super easy way to figure out what model you want is scikit-learn's choosing the right estimator flowchart.
This would suggest that we use an SGD Regressor. Want to learn more about machine learning? Check out the Machine learning series. All you need to know here is that this model takes our input features and fits them into an equation whose output gets as close as possible to the trained values we pass.
Then, later, we can either save some samples for true out of sample testing, or just make some up to see what the model says would be the price of the diamond.
If you've ever seen a home value estimate or something, this is how they are done. They take in a bunch of features and run them through a regression algorithm to come up with a value.
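To make that "equation" idea a bit more concrete, here's a toy sketch with completely made-up numbers (not our actual model): a linear regressor ultimately just learns one weight per feature plus a bias term.
import numpy as np
w = np.array([7000.0, 150.0, 80.0, 120.0, -5.0, -3.0, 20.0, 20.0, 20.0])  # one made-up weight per feature
b = -2000.0                                                               # made-up bias / intercept
x = np.array([0.3, 5, 7, 9, 61.5, 55.0, 4.3, 4.3, 2.7])                   # one made-up diamond's feature row
print(np.dot(w, x) + b)                                                   # the model's predicted price
Training (clf.fit) just means searching for the w and b that make these predictions as close as possible to the real prices.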
Okay, so our first job is to convert to features and labels. Always be careful in this step, making sure you don't accidentally leak information about your label into your features, which would tell the model more about the answer than you intend.
In machine learning, the standard is typically that featuresets are stored as a capital X and labels as a lowercase y.
import sklearn
from sklearn.linear_model import SGDRegressor
df = sklearn.utils.shuffle(df) # always shuffle your data to avoid any biases that may emerge b/c of some order.
X = df.drop("price", axis=1).values
y = df["price"].values
Recall that many methods will return a dataframe. So for X, we want all of the columns EXCEPT for the price one, so we can just drop it. Then we use .values to convert to a numpy array. Then, for our labels, y, we say this is just the price column's values. Great, but we probably want to save some of these values for testing the model after it's been trained, so we'll do something like:
test_size = 200
X_train = X[:-test_size]
y_train = y[:-test_size]
X_test = X[-test_size:]
y_test = y[-test_size:]
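As an aside, scikit-learn also has a helper that shuffles and splits in one step, if you'd rather not slice manually (same idea, just more convenient):
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=200)  # test_size can be a row count or a fraction like 0.2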
Now we can train and test our regressor!
clf = SGDRegressor(max_iter=1000)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
for X,y in list(zip(X_test, y_test))[:10]:
    print(clf.predict([X])[0], y)
Well, that's not very good. The score for these regression models is r-squared/coefficient of determination, so I am actually not even sure how we got -70999348.67836547, but apparently we did. R-squared usually falls between 0 and 1 (0% to 100%), where 1.0 is a perfect fit, but it can go arbitrarily negative when a model's predictions are worse than just always guessing the mean price, which is clearly what happened here.
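To see how a score can get that far below zero, here's a toy r-squared calculation with made-up numbers (not from our model): the score is 1 minus the ratio of the model's squared errors to the squared errors you'd get by just predicting the mean.
import numpy as np
y_true = np.array([500, 1500, 3000])            # made-up real prices
y_pred = np.array([90000, -40000, 250000])      # wildly-off predictions, like an unscaled SGD fit
ss_res = np.sum((y_true - y_pred) ** 2)         # model's squared error
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of just predicting the mean
print(1 - ss_res / ss_tot)                      # a large negative number
Let's try support vector regression instead: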
from sklearn import svm
clf = svm.SVR()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
for X,y in list(zip(X_test, y_test))[:10]:
    print(clf.predict([X])[0], y)
Well, the good news is some of these are at least close. We're in the same zip code at least! That took a while to run though. One difference between svm.SVR() and the SGDRegressor, according to the docs, is that svm.SVR() by default has an unlimited number of iterations. Let's try that with the SGDRegressor to be fair, by setting max_iter to something quite large. Apparently -1 isn't allowed, so 10,000 it is!
clf = SGDRegressor(max_iter=10000)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
for X,y in list(zip(X_test, y_test))[:10]:
    print(clf.predict([X])[0], y)
Ok no, it just isn't gonna work unless we tweak more. Let's go back to the svm.SVR() model and see if we can improve it.
One of the most common ways to improve a model like this is to scale the data. Let's try that.
import sklearn
from sklearn import svm, preprocessing
df = sklearn.utils.shuffle(df) # always shuffle your data to avoid any biases that may emerge b/c of some order.
X = df.drop("price", axis=1).values
X = preprocessing.scale(X)
y = df["price"].values
test_size = 200
X_train = X[:-test_size]
y_train = y[:-test_size]
X_test = X[-test_size:]
y_test = y[-test_size:]
clf = svm.SVR()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
for X,y in list(zip(X_test, y_test))[:10]:
    print(f"model predicts {clf.predict([X])[0]}, real value: {y}")
This improved our score a bit, so that's nice. We could keep tweaking things and probably improve this model further, but that's not quite the intention of this series, so this will do for now.
Any new diamond data you get would need to be combined into your main dataset, scaled right along with it, and then predicted from.
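If re-scaling the whole dataset every time sounds clunky, one alternative (not what we did above, just a sketch) is to fit a StandardScaler on the training rows and reuse it for any new ones:
from sklearn.preprocessing import StandardScaler
X = df.drop("price", axis=1).values              # unscaled features this time
y = df["price"].values
scaler = StandardScaler()
X_train = scaler.fit_transform(X[:-test_size])   # learn mean/std from the training rows only
X_test = scaler.transform(X[-test_size:])        # apply that same scaling to held-out rows
y_train, y_test = y[:-test_size], y[-test_size:]
clf = svm.SVR()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
# any brand-new diamond row would go through scaler.transform() before clf.predict()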
Okay, that's all for now! I hope you have enjoyed! If you're interested in doing more machine learning, definitely check out the tutorials here: Machine learning series