What we're going to cover here is how to gather some basic statistics on our data sets: how many rows we have, the mean and standard deviation of each column, the minimums and maximums, and the percentiles in between. All of this is given to us with describe(). Then we can take it a step further and gather correlation and covariance information, including on a rolling basis. What makes this exceedingly interesting is how far a little manipulation takes you: with just a few calls, you can have some pretty great analysis already.
Let's get started. Note: if you don't have sp500_ohlc.csv, then you didn't follow the whole series. You should be able to do this entire tutorial with your own data, and we'll be grabbing fresh data soon anyway!
import pandas as pd

df = pd.read_csv('sp500_ohlc.csv', index_col='Date', parse_dates=True)

# add a column for the daily high-to-low spread
df['H-L'] = df.High - df.Low

print(df.describe())
Output:
               Open         High          Low        Close        Volume  \
count   2830.000000  2830.000000  2830.000000  2830.000000  2.830000e+03
mean    1172.834276  1181.118809  1163.808601  1172.820491  2.898598e+09
std      177.216544   176.393538   177.979827   177.132788  1.754116e+09
min      679.280000   695.270000   666.790000   676.530000  3.560700e+08
25%     1071.250000  1079.040000  1063.192500  1071.220000  1.426118e+09
50%     1178.880000  1186.540000  1170.795000  1178.825000  2.325460e+09
75%     1293.977500  1300.702500  1284.667500  1294.095000  4.134882e+09
max     1564.980000  1576.090000  1555.460000  1565.150000  1.145623e+10

          Adj Close          H-L
count   2830.000000  2830.000000
mean    1172.820491    17.310208
std      177.132788    11.273921
min      676.530000     2.900000
25%     1071.220000    10.010000
50%     1178.825000    14.505000
75%     1294.095000    21.077500
max     1565.150000   102.650000
As you can see, we get quite a bit of information here with just one quick invocation of .describe(). We get count, which is how many rows we have for each column. We then get mean, the average of all the data in that column. std is the standard deviation for each column. min is the minimum value in that column. 25% is where the 25th percentile mark is, and so on through the 50th (the median) and 75th. Finally, we get max, which is the highest value for that column.
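By the way, you don't have to go through describe() just to get one of these numbers; each statistic is also available as its own method. A quick sketch using the df from above:

print(df['H-L'].mean())            # average daily high-low spread
print(df['H-L'].std())             # its standard deviation
print(df['Close'].min(), df['Close'].max())
print(df['H-L'].quantile(0.25))    # the 25th percentile mark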
Next, we can calculate correlation with .corr():
print(df.corr())
Output:
               Open      High       Low     Close    Volume  Adj Close  \
Open       1.000000  0.998747  0.998277  0.996652 -0.060033   0.996652
High       0.998747  1.000000  0.998016  0.998467 -0.049252   0.998467
Low        0.998277  0.998016  1.000000  0.998569 -0.073954   0.998569
Close      0.996652  0.998467  0.998569  1.000000 -0.061102   1.000000
Volume    -0.060033 -0.049252 -0.073954 -0.061102  1.000000  -0.061102
Adj Close  0.996652  0.998467  0.998569  1.000000 -0.061102   1.000000
H-L       -0.133110 -0.109380 -0.171749 -0.142097  0.396885  -0.142097

                H-L
Open      -0.133110
High      -0.109380
Low       -0.171749
Close     -0.142097
Volume     0.396885
Adj Close -0.142097
H-L        1.000000
Here, we get the correlation of each column against every other column, laid out as a comparison table. Obviously, Open, High, Low, and Close are all very closely correlated, while Volume moves much more on its own.
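Since we mentioned rolling statistics at the start: correlation can also be computed over a moving window, which shows how a relationship changes over time rather than as one number for the whole data set. A minimal sketch, assuming a pandas version with the .rolling() API (older releases spelled this pd.rolling_corr(a, b, window)); the 50-day window is an arbitrary choice:

# 50-day rolling correlation between the H-L spread and Volume
roll_corr = df['H-L'].rolling(window=50).corr(df['Volume'])
print(roll_corr.tail())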
How about covariance?
print(df.cov())

Output:
                   Open          High           Low         Close  \
Open       3.140570e+04  3.122069e+04  3.148663e+04  3.128575e+04
High       3.122069e+04  3.111468e+04  3.133220e+04  3.119719e+04
Low        3.148663e+04  3.133220e+04  3.167682e+04  3.148096e+04
Close      3.128575e+04  3.119719e+04  3.148096e+04  3.137602e+04
Volume    -1.866176e+10 -1.523944e+10 -2.308814e+10 -1.898494e+10
Adj Close  3.128575e+04  3.119719e+04  3.148096e+04  3.137602e+04
H-L       -2.659442e+02 -2.175186e+02 -3.446199e+02 -2.837660e+02

                 Volume     Adj Close           H-L
Open      -1.866176e+10  3.128575e+04 -2.659442e+02
High      -1.523944e+10  3.119719e+04 -2.175186e+02
Low       -2.308814e+10  3.148096e+04 -3.446199e+02
Close     -1.898494e+10  3.137602e+04 -2.837660e+02
Volume     3.076921e+18 -1.898494e+10  7.848706e+09
Adj Close -1.898494e+10  3.137602e+04 -2.837660e+02
H-L        7.848706e+09 -2.837660e+02  1.271013e+02
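Covariance is essentially unscaled correlation: divide a covariance by the two standard deviations and you get the correlation coefficient back, which is why the covariance numbers vary so wildly in magnitude while correlations all sit between -1 and 1. A quick sanity check on two of the columns above:

# corr(x, y) == cov(x, y) / (std(x) * std(y))
manual = df['High'].cov(df['Low']) / (df['High'].std() * df['Low'].std())
print(manual)                       # should match...
print(df['High'].corr(df['Low']))  # ...this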
What if you just want to compare a few elements instead of everything?
print(df[['Volume', 'H-L']].corr())
Output:
          Volume       H-L
Volume  1.000000  0.396885
H-L     0.396885  1.000000
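The same slicing works for covariance too, and if all you want is the single number rather than a table, you can compare one Series directly against another:

print(df[['Volume', 'H-L']].cov())    # the 2x2 covariance table
print(df['Volume'].corr(df['H-L']))   # just the scalar, ~0.397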
So now that you're somewhat comfortable with this, let's check out a popular use for correlation and covariance tables. There are many websites out there, either paid services or heavily advertised ones, that generate correlation matrix (and sometimes covariance matrix) tables. Turns out, doing this in pandas is incredibly easy!
Let's start a new program specifically for this:
import datetime
import pandas as pd
import pandas.io.data

start = datetime.datetime(2011, 10, 1)
end = datetime.datetime(2014, 1, 1)

C = pd.io.data.get_data_yahoo('C', start=start, end=end)
AAPL = pd.io.data.get_data_yahoo('AAPL', start=start, end=end)
MSFT = pd.io.data.get_data_yahoo('MSFT', start=start, end=end)
TSLA = pd.io.data.get_data_yahoo('TSLA', start=start, end=end)
BAC = pd.io.data.get_data_yahoo('BAC', start=start, end=end)
BBRY = pd.io.data.get_data_yahoo('BBRY', start=start, end=end)
CMG = pd.io.data.get_data_yahoo('CMG', start=start, end=end)
EBAY = pd.io.data.get_data_yahoo('EBAY', start=start, end=end)
JPM = pd.io.data.get_data_yahoo('JPM', start=start, end=end)
SBUX = pd.io.data.get_data_yahoo('SBUX', start=start, end=end)
TGT = pd.io.data.get_data_yahoo('TGT', start=start, end=end)
WFC = pd.io.data.get_data_yahoo('WFC', start=start, end=end)

print(C.head())
Up to this point, we've grabbed a bunch of data for the various stocks we want in our correlation matrix. We're nowhere near a matrix table yet, but we're getting there. I've printed C.head() as a reminder of the data we're looking at. We're only interested in comparing one price column per stock, and generally people use the adjusted close for that. Let's do that, starting by removing the other columns from C, which is Citigroup.
del C['Open']
del C['High']
del C['Low']
del C['Close']
del C['Volume']
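As an aside, deleting column by column works fine, but selecting just the column you want does the same job in one line; purely a matter of taste:

# equivalent to the five del statements above
C = C[['Adj Close']]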
That's done; we're just left with Adj Close now. Next, we can refer to C by whatever name we want (note that assigning it to a new variable just binds a second name to the same DataFrame, it doesn't make a copy).
corComp = C
That'll do for a name. Now, since every one of these DataFrames calls its price column Adj Close, we're going to want to rename each column to its ticker so we can compare them easily:
corComp.rename(columns={'Adj Close': 'C'}, inplace=True)
So now our table is just one column, Citigroup's adjusted close. Let's add the others.
corComp['BAC'] = BAC['Adj Close']
corComp['MSFT'] = MSFT['Adj Close']
corComp['TSLA'] = TSLA['Adj Close']
corComp['AAPL'] = AAPL['Adj Close']
corComp['BBRY'] = BBRY['Adj Close']
corComp['CMG'] = CMG['Adj Close']
corComp['EBAY'] = EBAY['Adj Close']
corComp['JPM'] = JPM['Adj Close']
corComp['SBUX'] = SBUX['Adj Close']
corComp['TGT'] = TGT['Adj Close']
corComp['WFC'] = WFC['Adj Close']

print(corComp.head())
Output:
                C   BAC   MSFT   TSLA   AAPL   BBRY     CMG   EBAY    JPM  \
Date
2011-10-03  23.04  5.47  22.65  23.73  51.15  20.50  292.70  28.11  26.30
2011-10-04  24.32  5.70  23.40  23.66  50.86  21.00  301.09  29.35  28.02
2011-10-05  24.63  5.71  23.91  25.37  51.65  23.60  300.79  30.53  28.56
2011-10-06  25.94  6.21  24.32  26.96  51.53  24.42  298.99  31.44  29.99
2011-10-07  24.55  5.84  24.24  26.99  50.49  23.36  294.59  31.07  28.43

             SBUX    TGT    WFC
Date
2011-10-03  34.81  44.70  21.53
2011-10-04  35.82  46.27  22.48
2011-10-05  36.59  46.54  22.75
2011-10-06  37.02  48.55  23.56
2011-10-07  37.75  48.22  22.79
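By the way, if typing out a line per ticker gets old, a loop over the DataFrames we already downloaded does the same thing. A sketch, assuming the variables from above (column order may come out slightly different):

others = {'BAC': BAC, 'MSFT': MSFT, 'TSLA': TSLA, 'AAPL': AAPL,
          'BBRY': BBRY, 'CMG': CMG, 'EBAY': EBAY, 'JPM': JPM,
          'SBUX': SBUX, 'TGT': TGT, 'WFC': WFC}

# pull each stock's adjusted close into its own column
for name, frame in others.items():
    corComp[name] = frame['Adj Close']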
At this point, we've got a pandas DataFrame whose columns are stock tickers and whose rows are adjusted close prices by date! Converting this to a correlation matrix is just one call away:
print(corComp.corr())
Output:
              C       BAC      MSFT      TSLA      AAPL      BBRY       CMG  \
C      1.000000  0.975169  0.732407  0.792107 -0.172905 -0.249992  0.450620
BAC    0.975169  1.000000  0.767743  0.812945 -0.027634 -0.392026  0.494348
MSFT   0.732407  0.767743  1.000000  0.783313  0.207460 -0.530131  0.775090
TSLA   0.792107  0.812945  0.783313  1.000000 -0.103821 -0.404356  0.677697
AAPL  -0.172905 -0.027634  0.207460 -0.103821  1.000000 -0.694610  0.093031
BBRY  -0.249992 -0.392026 -0.530131 -0.404356 -0.694610  1.000000 -0.332754
CMG    0.450620  0.494348  0.775090  0.677697  0.093031 -0.332754  1.000000
EBAY   0.796745  0.846395  0.527071  0.528979  0.137681 -0.464328  0.108140
JPM    0.979039  0.979418  0.789477  0.786151 -0.043577 -0.361171  0.508082
SBUX   0.863705  0.909169  0.868984  0.913326  0.034459 -0.491639  0.758022
TGT    0.808486  0.825841  0.618687  0.561540  0.075709 -0.452150  0.172996
WFC    0.905780  0.954583  0.856305  0.831263  0.137327 -0.577703  0.561800

           EBAY       JPM      SBUX       TGT       WFC
C      0.796745  0.979039  0.863705  0.808486  0.905780
BAC    0.846395  0.979418  0.909169  0.825841  0.954583
MSFT   0.527071  0.789477  0.868984  0.618687  0.856305
TSLA   0.528979  0.786151  0.913326  0.561540  0.831263
AAPL   0.137681 -0.043577  0.034459  0.075709  0.137327
BBRY  -0.464328 -0.361171 -0.491639 -0.452150 -0.577703
CMG    0.108140  0.508082  0.758022  0.172996  0.561800
EBAY   1.000000  0.810438  0.666961  0.895417  0.823806
JPM    0.810438  1.000000  0.895216  0.815957  0.947179
SBUX   0.666961  0.895216  1.000000  0.674836  0.930716
TGT    0.895417  0.815957  0.674836  1.000000  0.848994
WFC    0.823806  0.947179  0.930716  0.848994  1.000000
Finally, you might want to save this table to a CSV file so you can do further analysis with it:
fancy = corComp.corr()
fancy.to_csv('bigmoney.csv')
And that's it! So now you know how to create your own correlation tables. You can make these sorts of tables from more than just correlation: covariance, standard deviation, or even your own custom functions all work the same way.
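For example, all of the following are one-liners on the same corComp frame; the file names here are just placeholders:

corComp.cov().to_csv('covariance.csv')                # covariance matrix
corComp.pct_change().std().to_csv('volatility.csv')   # std of daily % returns

# a custom function applied column by column
print(corComp.apply(lambda col: col.max() / col.min()))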