Welcome to the 35th part of our machine learning tutorial series. We've recently begun talking about clustering, but in this tutorial we're going to cover handling non-numeric data, which is of course not specific to clustering.
The data that we're going to be working with is the Titanic Dataset.
For a brief overview of the data and values:
Pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
survival: Survival (0 = No; 1 = Yes)
name: Name
sex: Sex
age: Age
sibsp: Number of Siblings/Spouses Aboard
parch: Number of Parents/Children Aboard
ticket: Ticket Number
fare: Passenger Fare (British pound)
cabin: Cabin
embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
boat: Lifeboat
body: Body Identification Number
home.dest: Home/Destination
The main focus of this dataset is typically the survival column. With supervised machine learning, chances are you would train against the survival column as the classification label. With clustering, however, we let the machine form the groups and, in effect, assign labels of its own. My first interest is whether those groups are clearly related to any of the columns, especially the survival column. For now we're doing flat clustering, which is where we tell the machine we want two groups, but later we'll also let the machine determine the number of groups.
For now, however, we're up against another issue. If we read this dataset into a pandas dataframe, we'll see that we get something like:
# https://pythonprogramming.net/static/downloads/machine-learning-data/titanic.xls
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
import numpy as np
from sklearn.cluster import KMeans
# sklearn.cross_validation was removed in scikit-learn 0.20 (replaced by
# sklearn.model_selection); it is not needed in this part, so it is not imported.
from sklearn import preprocessing
import pandas as pd

'''
Pclass Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
survival Survival (0 = No; 1 = Yes)
name Name
sex Sex
age Age
sibsp Number of Siblings/Spouses Aboard
parch Number of Parents/Children Aboard
ticket Ticket Number
fare Passenger Fare (British pound)
cabin Cabin
embarked Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
boat Lifeboat
body Body Identification Number
home.dest Home/Destination
'''

df = pd.read_excel('titanic.xls')
print(df.head())
   pclass  survived                                             name     sex  \
0       1         1                    Allen, Miss. Elisabeth Walton  female
1       1         1                   Allison, Master. Hudson Trevor    male
2       1         0                     Allison, Miss. Helen Loraine  female
3       1         0             Allison, Mr. Hudson Joshua Creighton    male
4       1         0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)  female

       age  sibsp  parch  ticket      fare    cabin embarked boat   body  \
0  29.0000      0      0   24160  211.3375       B5        S    2    NaN
1   0.9167      1      2  113781  151.5500  C22 C26        S   11    NaN
2   2.0000      1      2  113781  151.5500  C22 C26        S  NaN    NaN
3  30.0000      1      2  113781  151.5500  C22 C26        S  NaN  135.0
4  25.0000      1      2  113781  151.5500  C22 C26        S  NaN    NaN

                         home.dest
0                     St Louis, MO
1  Montreal, PQ / Chesterville, ON
2  Montreal, PQ / Chesterville, ON
3  Montreal, PQ / Chesterville, ON
4  Montreal, PQ / Chesterville, ON

After the non-numeric columns are converted to numbers, the same rows look like this:

   pclass  survived  name  sex      age  sibsp  parch  ticket      fare  \
0       1         1   110    0  29.0000      0      0     748  211.3375
1       1         1   839    1   0.9167      1      2     504  151.5500
2       1         0  1274    0   2.0000      1      2     504  151.5500
3       1         0   284    1  30.0000      1      2     504  151.5500
4       1         0   563    0  25.0000      1      2     504  151.5500

   cabin  embarked  boat   body  home.dest
0     52         1     1    NaN        173
1     44         1     6    NaN        277
2     44         1     0    NaN        277
3     44         1     0  135.0        277
4     44         1     0    NaN        277
The issue is, we've got non-numerical data here. The machine learning algorithm is going to require numbers. We can just drop the name column, since it has no use to us. Should we drop the sex column? I don't think so. It seems like a pretty important column, especially given our knowledge of "women and children first." What about the cabin column? Might it have mattered where on the ship you were? I suspect so! Maybe slightly less valuable is where you embarked from, but at this point we already know we're going to have to handle non-numerical data anyway.
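To see which columns will need converting, one option is to inspect the dataframe's dtypes. The sketch below uses a tiny hand-made dataframe (made-up values, not the real Titanic file) and pandas' select_dtypes to list the non-numeric columns. The names toy_df and non_numeric_columns are just for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic dataframe; the values are made up for illustration.
toy_df = pd.DataFrame({
    'pclass': [1, 1, 3],
    'sex': ['female', 'male', 'male'],
    'age': [29.0, 0.9167, 30.0],
    'cabin': ['B5', 'C22 C26', None],
})

# Keep only the columns whose dtype is not a NumPy number (int or float).
non_numeric_columns = toy_df.select_dtypes(exclude=[np.number]).columns.tolist()
print(non_numeric_columns)  # ['sex', 'cabin']
```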
There are many ways to handle non-numerical data; this is just the method I personally use. First, you will want to cycle through the columns in the pandas dataframe. For columns that are not numbers, you want to find their unique elements. This can be done by simply taking a set of the column values. From here, each unique element can be assigned a number as we iterate over that set, and that number becomes the new "numerical" value or "id" of the text data.
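As a rough illustration of that idea on plain Python data, here is a minimal sketch with a made-up column of values. The sorted() call is my addition rather than part of the method described above: iteration order over a set of strings can differ between runs, so sorting is one way to keep the ids reproducible.

```python
# Toy column of text values (made up for illustration).
column_contents = ['S', 'C', 'S', 'Q', 'C', 'S']

# sorted() is an addition here: it makes the ids reproducible between runs,
# since iteration order over a plain set of strings can change.
unique_elements = sorted(set(column_contents))

# Give each unique value the next integer id.
text_digit_vals = {value: idx for idx, value in enumerate(unique_elements)}

# Replace every text value with its id.
converted = [text_digit_vals[value] for value in column_contents]

print(text_digit_vals)  # {'C': 0, 'Q': 1, 'S': 2}
print(converted)        # [2, 0, 2, 1, 0, 2]
```

Without the sort, the same values would still each get a unique id, but which id goes with which value may vary from one run to the next.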
To begin:
def handle_non_numerical_data(df):
    columns = df.columns.values

    for column in columns:
Here we create the function, get the columns, and begin iterating through them. Continuing:
def handle_non_numerical_data(df):
    columns = df.columns.values

    for column in columns:
        text_digit_vals = {}
        def convert_to_int(val):
            return text_digit_vals[val]

        if df[column].dtype != np.int64 and df[column].dtype != np.float64:
            column_contents = df[column].values.tolist()
            unique_elements = set(column_contents)
Here, we've added an embedded function that looks up its parameter as a key in the text_digit_vals dictionary and returns the matching value. We aren't using it just yet, but we're about to. Next, while we're iterating through the columns, we check whether the column's dtype is neither np.int64 nor np.float64. If so, we convert the column to a list of its values, then take the set of that list to get just the unique values.
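One thing to be aware of with an exact dtype comparison is that other numeric dtypes, such as np.int32, would also be treated as non-numeric. The sketch below compares that check with pandas' is_numeric_dtype helper on a made-up dataframe. It is an alternative to consider, not what this tutorial's function uses, and the names toy_df and results are hypothetical.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Made-up data; 'pclass' is deliberately stored as int32 rather than int64.
toy_df = pd.DataFrame({
    'pclass': np.array([1, 2, 3], dtype=np.int32),
    'fare': [211.3375, 151.55, 7.25],
    'sex': ['female', 'male', 'male'],
})

results = {}
for column in toy_df.columns:
    dtype = toy_df[column].dtype
    # The tutorial's style of check: only exactly int64 or float64 count as numbers.
    exact_match = bool(dtype == np.int64 or dtype == np.float64)
    # pandas' broader check: any integer, float, or other numeric dtype.
    broad_match = bool(is_numeric_dtype(toy_df[column]))
    results[column] = (exact_match, broad_match)

print(results)
# {'pclass': (False, True), 'fare': (True, True), 'sex': (False, False)}
```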
def handle_non_numerical_data(df):
    columns = df.columns.values

    for column in columns:
        text_digit_vals = {}
        def convert_to_int(val):
            return text_digit_vals[val]

        if df[column].dtype != np.int64 and df[column].dtype != np.float64:
            column_contents = df[column].values.tolist()
            unique_elements = set(column_contents)
            x = 0
            for unique in unique_elements:
                if unique not in text_digit_vals:
                    text_digit_vals[unique] = x
                    x += 1

            df[column] = list(map(convert_to_int, df[column]))

    return df
Continuing along, for each of the unique elements we find, we create a new dictionary key that is that unique element, with a value of a new number. Once we've iterated through all of the unique values, we then use mapping to map the function we created before to the pandas column. Not sure what mapping is? Check out the Mapping a Function with Pandas tutorial.
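If mapping is new to you, here is a small sketch of what that last step does, using a made-up one-column dataframe and a hand-written lookup dictionary. It also shows pandas' own Series.map, which should give the same values here. Treat it as an illustration rather than a replacement for the function above.

```python
import pandas as pd

# Made-up data for illustration.
toy_df = pd.DataFrame({'sex': ['female', 'male', 'female', 'male']})

# Hypothetical lookup table, built by hand for this example.
text_digit_vals = {'female': 0, 'male': 1}

def convert_to_int(val):
    return text_digit_vals[val]

# Built-in map() applies the function to each value in the column...
via_builtin_map = list(map(convert_to_int, toy_df['sex']))

# ...and pandas' Series.map does the same job, returning a Series.
via_series_map = toy_df['sex'].map(convert_to_int).tolist()

print(via_builtin_map)  # [0, 1, 0, 1]
print(via_series_map)   # [0, 1, 0, 1]
```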
Now we can add a couple final lines:
df = handle_non_numerical_data(df)
print(df.head())
Full code:
# https://pythonprogramming.net/static/downloads/machine-learning-data/titanic.xls
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
import numpy as np
from sklearn.cluster import KMeans
# sklearn.cross_validation was removed in scikit-learn 0.20 (replaced by
# sklearn.model_selection); it is not needed in this part, so it is not imported.
from sklearn import preprocessing
import pandas as pd

'''
Pclass Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
survival Survival (0 = No; 1 = Yes)
name Name
sex Sex
age Age
sibsp Number of Siblings/Spouses Aboard
parch Number of Parents/Children Aboard
ticket Ticket Number
fare Passenger Fare (British pound)
cabin Cabin
embarked Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
boat Lifeboat
body Body Identification Number
home.dest Home/Destination
'''

df = pd.read_excel('titanic.xls')
#print(df.head())
# axis=1 is passed by keyword so the call also works on newer pandas versions.
df.drop(['body','name'], axis=1, inplace=True)
# Older pandas only: convert_objects was deprecated and later removed.
# Comment this line out if it gives warnings or errors.
df.convert_objects(convert_numeric=True)
df.fillna(0, inplace=True)
#print(df.head())

def handle_non_numerical_data(df):
    columns = df.columns.values

    for column in columns:
        # Maps each unique text value in this column to an integer id.
        text_digit_vals = {}
        def convert_to_int(val):
            return text_digit_vals[val]

        if df[column].dtype != np.int64 and df[column].dtype != np.float64:
            column_contents = df[column].values.tolist()
            unique_elements = set(column_contents)
            x = 0
            for unique in unique_elements:
                if unique not in text_digit_vals:
                    text_digit_vals[unique] = x
                    x += 1

            df[column] = list(map(convert_to_int, df[column]))

    return df

df = handle_non_numerical_data(df)
print(df.head())
Output:
   pclass  survived  sex      age  sibsp  parch  ticket      fare  cabin  \
0       1         1    1  29.0000      0      0     767  211.3375     80
1       1         1    0   0.9167      1      2     531  151.5500    149
2       1         0    1   2.0000      1      2     531  151.5500    149
3       1         0    0  30.0000      1      2     531  151.5500    149
4       1         0    1  25.0000      1      2     531  151.5500    149

   embarked  boat  home.dest
0         1     1        307
1         1    27         43
2         1     0         43
3         1     0         43
4         1     0         43
If df.convert_objects(convert_numeric=True) is giving you deprecation warnings or errors, feel free to just comment it out. I usually keep it there to be absolutely explicit, but the dataframe *should* already be read as numbers where there are numbers. For some reason, pandas will seemingly at random read some values in a column as strings, even when those strings are actual numbers. It makes no sense to me, so I just convert to numeric to be totally certain.
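On pandas versions where convert_objects is no longer available, one option for the same kind of safety check is pd.to_numeric, applied to a column you expect to be numeric. The sketch below uses a made-up column and is not claimed to be a drop-in equivalent of the call above. Note the assumption that unparseable values should become NaN, which is what errors='coerce' does.

```python
import pandas as pd

# Toy column where numbers were read as strings (made-up values).
toy_df = pd.DataFrame({'fare': ['211.3375', '151.55', 'unknown']})

# errors='coerce' turns anything unparseable into NaN instead of raising.
toy_df['fare'] = pd.to_numeric(toy_df['fare'], errors='coerce')

print(toy_df['fare'].dtype)     # float64
print(toy_df['fare'].tolist())  # [211.3375, 151.55, nan]
```

Because the result is assigned back to the column, the conversion actually sticks. Any NaN values it produces would then be filled by a later fillna(0), as in the full code above.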
Okay great, so we've got numbers, and now we can continue along to do flat clustering with this data!