Pandas Correlation matrix and Statistics Information on Data




What we're going to cover here is how to gather some basic statistics on our data sets. This includes things like the row count, the mean, the standard deviation, and the minimum and maximum of each column, all of which describe() gives us in one shot. Then we can take it a step further and compute correlation and covariance between columns (and even rolling versions of them). What makes this exceedingly interesting is that, with just a little manipulation, you already have some pretty great analysis.

Let's get started. Note: if you don't have sp500_ohlc.csv, it's because you haven't followed the whole series. You should be able to work through this entire tutorial with your own data, and we'll be getting new data soon anyway!

import pandas as pd
from pandas import DataFrame

df = pd.read_csv('sp500_ohlc.csv', index_col = 'Date', parse_dates=True)
df['H-L'] = df.High - df.Low

print(df.describe())

Output:

              Open         High          Low        Close        Volume  \
count  2830.000000  2830.000000  2830.000000  2830.000000  2.830000e+03   
mean   1172.834276  1181.118809  1163.808601  1172.820491  2.898598e+09   
std     177.216544   176.393538   177.979827   177.132788  1.754116e+09   
min     679.280000   695.270000   666.790000   676.530000  3.560700e+08   
25%    1071.250000  1079.040000  1063.192500  1071.220000  1.426118e+09   
50%    1178.880000  1186.540000  1170.795000  1178.825000  2.325460e+09   
75%    1293.977500  1300.702500  1284.667500  1294.095000  4.134882e+09   
max    1564.980000  1576.090000  1555.460000  1565.150000  1.145623e+10   

         Adj Close          H-L  
count  2830.000000  2830.000000  
mean   1172.820491    17.310208  
std     177.132788    11.273921  
min     676.530000     2.900000  
25%    1071.220000    10.010000  
50%    1178.825000    14.505000  
75%    1294.095000    21.077500  
max    1565.150000   102.650000  

As you can see, we get quite a bit of information here with just one quick invocation of .describe(). count is how many rows we have for each column. mean is the average of all the data in that column. std is the standard deviation for each column. min is the minimum value in that column, 25% is where the 25th percentile mark is, and so on through 50% (the median) and 75%. Finally, max is the highest value in that column.
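If you only want a single statistic rather than the whole table, remember that .describe() just returns another DataFrame, so you can index into it with .loc. A minimal sketch with a toy stand-in frame (the column names mirror ours, the numbers are made up):

```python
import pandas as pd

# Toy stand-in for sp500_ohlc.csv (values are made up)
df = pd.DataFrame({'High': [10.0, 12.0, 11.0, 13.0],
                   'Low':  [ 9.0, 10.0, 10.5, 12.0]})
df['H-L'] = df['High'] - df['Low']

stats = df.describe()

# Pull out individual statistics with .loc['statistic', 'column']
print(stats.loc['count', 'H-L'])  # → 4.0
print(stats.loc['mean', 'H-L'])   # → 1.125
print(stats.loc['max', 'H-L'])    # → 2.0
```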

Next, we can calculate correlation with .corr():

print(df.corr())

Output:

               Open      High       Low     Close    Volume  Adj Close  \
Open       1.000000  0.998747  0.998277  0.996652 -0.060033   0.996652   
High       0.998747  1.000000  0.998016  0.998467 -0.049252   0.998467   
Low        0.998277  0.998016  1.000000  0.998569 -0.073954   0.998569   
Close      0.996652  0.998467  0.998569  1.000000 -0.061102   1.000000   
Volume    -0.060033 -0.049252 -0.073954 -0.061102  1.000000  -0.061102   
Adj Close  0.996652  0.998467  0.998569  1.000000 -0.061102   1.000000   
H-L       -0.133110 -0.109380 -0.171749 -0.142097  0.396885  -0.142097   

                H-L  
Open      -0.133110  
High      -0.109380  
Low       -0.171749  
Close     -0.142097  
Volume     0.396885  
Adj Close -0.142097  
H-L        1.000000  

Here, we get the correlation of each column with every other column, laid out as a comparison table. Obviously, Open, High, Low, and Close are all very closely correlated, while Volume correlates only weakly (and negatively) with price.
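If you only need one pairwise value from that table, you don't have to build the whole matrix; Series.corr() computes it directly. A sketch with made-up numbers chosen so the answers are obvious:

```python
import pandas as pd

# Made-up data: Low is exactly High - 1, and Volume moves exactly opposite to High
df = pd.DataFrame({'High':   [10.0, 12.0, 11.0, 13.0],
                   'Low':    [ 9.0, 11.0, 10.0, 12.0],
                   'Volume': [5e6, 3e6, 4e6, 2e6]})

# One pairwise correlation at a time, no full matrix needed
high_low = df['High'].corr(df['Low'])
high_vol = df['High'].corr(df['Volume'])
print(high_low)  # → 1.0 (Low is a perfect linear function of High)
print(high_vol)  # → -1.0 (Volume here is a perfect inverse of High)
```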

How about covariance?

print(df.cov())

Output:
                   Open          High           Low         Close  \
Open       3.140570e+04  3.122069e+04  3.148663e+04  3.128575e+04   
High       3.122069e+04  3.111468e+04  3.133220e+04  3.119719e+04   
Low        3.148663e+04  3.133220e+04  3.167682e+04  3.148096e+04   
Close      3.128575e+04  3.119719e+04  3.148096e+04  3.137602e+04   
Volume    -1.866176e+10 -1.523944e+10 -2.308814e+10 -1.898494e+10   
Adj Close  3.128575e+04  3.119719e+04  3.148096e+04  3.137602e+04   
H-L       -2.659442e+02 -2.175186e+02 -3.446199e+02 -2.837660e+02   

                 Volume     Adj Close           H-L  
Open      -1.866176e+10  3.128575e+04 -2.659442e+02  
High      -1.523944e+10  3.119719e+04 -2.175186e+02  
Low       -2.308814e+10  3.148096e+04 -3.446199e+02  
Close     -1.898494e+10  3.137602e+04 -2.837660e+02  
Volume     3.076921e+18 -1.898494e+10  7.848706e+09  
Adj Close -1.898494e+10  3.137602e+04 -2.837660e+02  
H-L        7.848706e+09 -2.837660e+02  1.271013e+02  
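Correlation and covariance are tied together: the correlation of two columns is just their covariance divided by the product of their standard deviations. You can verify that straight from the two tables; a sketch with toy numbers:

```python
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 4.0, 3.0],
                   'y': [2.0, 1.0, 5.0, 4.0]})

# corr(x, y) = cov(x, y) / (std(x) * std(y))
cov_xy = df.cov().loc['x', 'y']
corr_from_cov = cov_xy / (df['x'].std() * df['y'].std())

print(corr_from_cov)
print(df.corr().loc['x', 'y'])  # the two values match (up to floating-point noise)
```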

What if you just want to compare a few elements instead of everything?

print(df[['Volume','H-L']].corr())

Output:

          Volume       H-L
Volume  1.000000  0.396885
H-L     0.396885  1.000000
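The intro mentioned rolling correlation; in newer versions of pandas that goes through the .rolling() interface (older versions used pd.rolling_corr). A sketch with toy numbers:

```python
import pandas as pd

# Made-up Volume and H-L values, just to show the mechanics
df = pd.DataFrame({'Volume': [1.0, 2.0, 3.0, 2.0, 5.0, 4.0],
                   'H-L':    [0.5, 1.0, 1.5, 1.2, 2.0, 1.8]})

# Correlation over a sliding 3-row window; the first two rows
# come out NaN because the window isn't full yet
rolling = df['Volume'].rolling(window=3).corr(df['H-L'])
print(rolling)
```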

So now that you're somewhat comfortable with this, let's check out a popular use for correlation and covariance tables. There are many websites out there, either paid services or heavily advertised ones, that create correlation matrix (and sometimes covariance matrix) tables. Turns out, doing this in Pandas is incredibly easy!

Let's start a new program specifically for this:

import datetime
import pandas.io.data  # in newer pandas, this lives in the separate pandas-datareader package
import pandas as pd

start = datetime.datetime(2011, 10, 1)
end = datetime.datetime(2014, 1, 1)

C    = pd.io.data.get_data_yahoo('C',    start=start, end=end)
AAPL = pd.io.data.get_data_yahoo('AAPL', start=start, end=end)
MSFT = pd.io.data.get_data_yahoo('MSFT', start=start, end=end)
TSLA = pd.io.data.get_data_yahoo('TSLA', start=start, end=end)
BAC  = pd.io.data.get_data_yahoo('BAC',  start=start, end=end)
BBRY = pd.io.data.get_data_yahoo('BBRY', start=start, end=end)
CMG  = pd.io.data.get_data_yahoo('CMG',  start=start, end=end)
EBAY = pd.io.data.get_data_yahoo('EBAY', start=start, end=end)
JPM  = pd.io.data.get_data_yahoo('JPM',  start=start, end=end)
SBUX = pd.io.data.get_data_yahoo('SBUX', start=start, end=end)
TGT  = pd.io.data.get_data_yahoo('TGT',  start=start, end=end)
WFC  = pd.io.data.get_data_yahoo('WFC',  start=start, end=end)
print(C.head())

Up to this point, we've grabbed a bunch of data for the various stocks we want in our correlation matrix. We're nowhere near a matrix table yet, but we're getting there. I've printed C.head() as a reminder of what the data looks like. We're only interested in comparing one of the prices across stocks, and people generally use the adjusted close for that. So let's remove the other columns from C, which is Citigroup.

del C['Open']
del C['High']
del C['Low']
del C['Close']
del C['Volume']
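Those five del statements work fine, but the same thing can be done in one line by selecting just the column you want to keep. A sketch with a toy frame (made-up values, same column names):

```python
import pandas as pd

# Toy stand-in for the downloaded C frame
C = pd.DataFrame({'Open': [1.0], 'High': [2.0], 'Low': [0.5],
                  'Close': [1.5], 'Volume': [1e6], 'Adj Close': [1.4]})

# Equivalent to deleting the other five columns one by one
C = C[['Adj Close']]
print(C.columns.tolist())  # → ['Adj Close']
```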

That's done; we're left with just Adj Close. Now we can rename C to whatever we want:

corComp = C

That'll do for a name (keep in mind this is just another name for the same object, not a copy). Now, instead of calling the column Adj Close, let's call it by the stock's ticker so comparisons read easily:

corComp.rename(columns={'Adj Close': 'C'}, inplace=True)

So now our table is just one column, Citigroup's adjusted close. Let's add the others:

corComp['BAC'] = BAC['Adj Close']
corComp['MSFT'] = MSFT['Adj Close']
corComp['TSLA'] = TSLA['Adj Close']

corComp['AAPL'] = AAPL['Adj Close']
corComp['BBRY'] = BBRY['Adj Close']
corComp['CMG'] = CMG['Adj Close']
corComp['EBAY'] = EBAY['Adj Close']
corComp['JPM'] = JPM['Adj Close']
corComp['SBUX'] = SBUX['Adj Close']
corComp['TGT'] = TGT['Adj Close']
corComp['WFC'] = WFC['Adj Close']

print(corComp.head())

Output:

                C   BAC   MSFT   TSLA   AAPL   BBRY     CMG   EBAY    JPM  \
Date                                                                        
2011-10-03  23.04  5.47  22.65  23.73  51.15  20.50  292.70  28.11  26.30   
2011-10-04  24.32  5.70  23.40  23.66  50.86  21.00  301.09  29.35  28.02   
2011-10-05  24.63  5.71  23.91  25.37  51.65  23.60  300.79  30.53  28.56   
2011-10-06  25.94  6.21  24.32  26.96  51.53  24.42  298.99  31.44  29.99   
2011-10-07  24.55  5.84  24.24  26.99  50.49  23.36  294.59  31.07  28.43   

             SBUX    TGT    WFC  
Date                             
2011-10-03  34.81  44.70  21.53  
2011-10-04  35.82  46.27  22.48  
2011-10-05  36.59  46.54  22.75  
2011-10-06  37.02  48.55  23.56  
2011-10-07  37.75  48.22  22.79 
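Assigning the columns one at a time works, but pd.concat can build the same comparison frame in one call, naming each column after its ticker as it goes. A sketch with toy stand-ins for three of the downloaded frames (made-up prices):

```python
import pandas as pd

idx = pd.date_range('2011-10-03', periods=3)
# Toy stand-ins for the frames returned by get_data_yahoo
C    = pd.DataFrame({'Adj Close': [23.04, 24.32, 24.63]}, index=idx)
BAC  = pd.DataFrame({'Adj Close': [5.47, 5.70, 5.71]}, index=idx)
MSFT = pd.DataFrame({'Adj Close': [22.65, 23.40, 23.91]}, index=idx)

# The keys of the dict become the column names
corComp = pd.concat({'C': C['Adj Close'],
                     'BAC': BAC['Adj Close'],
                     'MSFT': MSFT['Adj Close']}, axis=1)
print(corComp.head())
```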

At this point, we've got a Pandas DataFrame whose columns are stock tickers and whose rows are their adjusted closes by date! Converting this to a correlation matrix is just one call away:

print(corComp.corr())

Output:

             C       BAC      MSFT      TSLA      AAPL      BBRY       CMG  \
C     1.000000  0.975169  0.732407  0.792107 -0.172905 -0.249992  0.450620   
BAC   0.975169  1.000000  0.767743  0.812945 -0.027634 -0.392026  0.494348   
MSFT  0.732407  0.767743  1.000000  0.783313  0.207460 -0.530131  0.775090   
TSLA  0.792107  0.812945  0.783313  1.000000 -0.103821 -0.404356  0.677697   
AAPL -0.172905 -0.027634  0.207460 -0.103821  1.000000 -0.694610  0.093031   
BBRY -0.249992 -0.392026 -0.530131 -0.404356 -0.694610  1.000000 -0.332754   
CMG   0.450620  0.494348  0.775090  0.677697  0.093031 -0.332754  1.000000   
EBAY  0.796745  0.846395  0.527071  0.528979  0.137681 -0.464328  0.108140   
JPM   0.979039  0.979418  0.789477  0.786151 -0.043577 -0.361171  0.508082   
SBUX  0.863705  0.909169  0.868984  0.913326  0.034459 -0.491639  0.758022   
TGT   0.808486  0.825841  0.618687  0.561540  0.075709 -0.452150  0.172996   
WFC   0.905780  0.954583  0.856305  0.831263  0.137327 -0.577703  0.561800   

          EBAY       JPM      SBUX       TGT       WFC  
C     0.796745  0.979039  0.863705  0.808486  0.905780  
BAC   0.846395  0.979418  0.909169  0.825841  0.954583  
MSFT  0.527071  0.789477  0.868984  0.618687  0.856305  
TSLA  0.528979  0.786151  0.913326  0.561540  0.831263  
AAPL  0.137681 -0.043577  0.034459  0.075709  0.137327  
BBRY -0.464328 -0.361171 -0.491639 -0.452150 -0.577703  
CMG   0.108140  0.508082  0.758022  0.172996  0.561800  
EBAY  1.000000  0.810438  0.666961  0.895417  0.823806  
JPM   0.810438  1.000000  0.895216  0.815957  0.947179  
SBUX  0.666961  0.895216  1.000000  0.674836  0.930716  
TGT   0.895417  0.815957  0.674836  1.000000  0.848994  
WFC   0.823806  0.947179  0.930716  0.848994  1.000000  

Finally, you might want to save this table to a CSV file for further analysis:

fancy = corComp.corr()
fancy.to_csv('bigmoney.csv')
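Once you have the matrix, a common next step is ranking the pairs. Stacking the matrix into a long Series makes that a one-liner; a sketch with a toy 3x3 matrix standing in for corComp.corr():

```python
import pandas as pd

# Toy correlation matrix (made-up values, symmetric, ones on the diagonal)
fancy = pd.DataFrame([[1.00, 0.98, 0.73],
                      [0.98, 1.00, 0.77],
                      [0.73, 0.77, 1.00]],
                     index=['C', 'BAC', 'MSFT'],
                     columns=['C', 'BAC', 'MSFT'])

# Flatten to one row per pair, drop the trivial diagonal, then rank
# (each pair appears twice because the matrix is symmetric)
pairs = fancy.stack()
pairs = pairs[pairs < 1.0].sort_values(ascending=False)
print(pairs.head())
```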

And that's it! So now you know how to create your own correlation tables. You can make these sorts of tables from more than just correlation: covariance, standard deviation, or even your own custom functions.

The next tutorial:

  • Intro to Pandas and Saving to a CSV and reading from a CSV
  • Pandas Column manipulation
  • Pandas Column Operations (basic math operations and moving averages)
  • Pandas 2D Visualization of Pandas data with Matplotlib, including plotting dates
  • Pandas 3D Visualization of Pandas data with Matplotlib
  • Pandas Standard Deviation
  • Pandas Correlation matrix and Statistics Information on Data
  • Pandas Function mapping for advanced Pandas users