## Pandas Correlation matrix and Statistics Information on Data

What we're going to cover here is how to gather some basic statistics information on our data sets. This includes information like how many rows, the average of all of the data, standard deviation for all of the data... max and min % swing on all data. All of this is given to us with describe(). Then, we can take it a step further and gather rolling information of correlation and covariance. What makes this exceedingly interesting, is how it ends up translating, to the point where you can do some basic manipulation and have some pretty great analysis already.

Let's get started. Note: if you don't have sp500_ohlc.csv, then you didn't follow the whole series. You should be able to do this entire tutorial with your own data and then soon we'll be getting new data anyway!

```import pandas as pd
from pandas import DataFrame

df = pd.read_csv('sp500_ohlc.csv', index_col = 'Date', parse_dates=True)
df['H-L'] = df.High - df.Low

print(df.describe())
```

Output:

```              Open         High          Low        Close        Volume  \
count  2830.000000  2830.000000  2830.000000  2830.000000  2.830000e+03
mean   1172.834276  1181.118809  1163.808601  1172.820491  2.898598e+09
std     177.216544   176.393538   177.979827   177.132788  1.754116e+09
min     679.280000   695.270000   666.790000   676.530000  3.560700e+08
25%    1071.250000  1079.040000  1063.192500  1071.220000  1.426118e+09
50%    1178.880000  1186.540000  1170.795000  1178.825000  2.325460e+09
75%    1293.977500  1300.702500  1284.667500  1294.095000  4.134882e+09
max    1564.980000  1576.090000  1555.460000  1565.150000  1.145623e+10

count  2830.000000  2830.000000
mean   1172.820491    17.310208
std     177.132788    11.273921
min     676.530000     2.900000
25%    1071.220000    10.010000
50%    1178.825000    14.505000
75%    1294.095000    21.077500
max    1565.150000   102.650000
```

As you can see, we get quite a bit of information here with just one quick invocation of .describe(). We get count, which is how many rows we have for each column. We then get mean, or the average, of all the data in that column. STD is standard deviation for each column. Min is the minimum value in that row. 25% is where the 25th percentile mark is, and so on through 75%. Finally, we get max, which is the highest value for that column.

Next, we can calculate correlation with .corr():

```print(df.corr())
```

Output:

```               Open      High       Low     Close    Volume  Adj Close  \
Open       1.000000  0.998747  0.998277  0.996652 -0.060033   0.996652
High       0.998747  1.000000  0.998016  0.998467 -0.049252   0.998467
Low        0.998277  0.998016  1.000000  0.998569 -0.073954   0.998569
Close      0.996652  0.998467  0.998569  1.000000 -0.061102   1.000000
Volume    -0.060033 -0.049252 -0.073954 -0.061102  1.000000  -0.061102
Adj Close  0.996652  0.998467  0.998569  1.000000 -0.061102   1.000000
H-L       -0.133110 -0.109380 -0.171749 -0.142097  0.396885  -0.142097

H-L
Open      -0.133110
High      -0.109380
Low       -0.171749
Close     -0.142097
Volume     0.396885
H-L        1.000000
```

Here, we get the correlation of each column compared to the other one. As you can see, we get a table of comparison. Obviously, Open, High, Low, and Close are all very closely correlated.

```print(df.cov())
```
Output:
```                   Open          High           Low         Close  \
Open       3.140570e+04  3.122069e+04  3.148663e+04  3.128575e+04
High       3.122069e+04  3.111468e+04  3.133220e+04  3.119719e+04
Low        3.148663e+04  3.133220e+04  3.167682e+04  3.148096e+04
Close      3.128575e+04  3.119719e+04  3.148096e+04  3.137602e+04
Volume    -1.866176e+10 -1.523944e+10 -2.308814e+10 -1.898494e+10
Adj Close  3.128575e+04  3.119719e+04  3.148096e+04  3.137602e+04
H-L       -2.659442e+02 -2.175186e+02 -3.446199e+02 -2.837660e+02

Open      -1.866176e+10  3.128575e+04 -2.659442e+02
High      -1.523944e+10  3.119719e+04 -2.175186e+02
Low       -2.308814e+10  3.148096e+04 -3.446199e+02
Close     -1.898494e+10  3.137602e+04 -2.837660e+02
Volume     3.076921e+18 -1.898494e+10  7.848706e+09
H-L        7.848706e+09 -2.837660e+02  1.271013e+02
```

What if you just want to compare a few elements instead of everything?

```print( df[['Volume','H-L']].corr())
```

Output:

```          Volume       H-L
Volume  1.000000  0.396885
H-L     0.396885  1.000000
```

So now that you're somewhat comfortable with this, let's check out a popular use for correlation and covariance tables. There are many websites out there that either are a paid service, or a heavily advertised that create a correlation matrix, and sometimes co-variance, matrix tables. Turns out, doing this in Pandas is incredibly easy!

Let's start a new program specifically for this:

```import datetime
import pandas.io.data
import pandas as pd

C = pd.io.data.get_data_yahoo('C',
start=datetime.datetime(2011, 10, 1),
end=datetime.datetime(2014, 1, 1))
AAPL = pd.io.data.get_data_yahoo('AAPL',
start=datetime.datetime(2011, 10, 1),
end=datetime.datetime(2014, 1, 1))
MSFT = pd.io.data.get_data_yahoo('MSFT',
start=datetime.datetime(2011, 10, 1),
end=datetime.datetime(2014, 1, 1))
TSLA = pd.io.data.get_data_yahoo('TSLA',
start=datetime.datetime(2011, 10, 1),
end=datetime.datetime(2014, 1, 1))
BAC = pd.io.data.get_data_yahoo('BAC',
start=datetime.datetime(2011, 10, 1),
end=datetime.datetime(2014, 1, 1))
BBRY = pd.io.data.get_data_yahoo('BBRY',
start=datetime.datetime(2011, 10, 1),
end=datetime.datetime(2014, 1, 1))
CMG = pd.io.data.get_data_yahoo('CMG',
start=datetime.datetime(2011, 10, 1),
end=datetime.datetime(2014, 1, 1))
EBAY = pd.io.data.get_data_yahoo('EBAY',
start=datetime.datetime(2011, 10, 1),
end=datetime.datetime(2014, 1, 1))
JPM = pd.io.data.get_data_yahoo('JPM',
start=datetime.datetime(2011, 10, 1),
end=datetime.datetime(2014, 1, 1))
SBUX = pd.io.data.get_data_yahoo('SBUX',
start=datetime.datetime(2011, 10, 1),
end=datetime.datetime(2014, 1, 1))
TGT = pd.io.data.get_data_yahoo('TGT',
start=datetime.datetime(2011, 10, 1),
end=datetime.datetime(2014, 1, 1))
WFC = pd.io.data.get_data_yahoo('WFC',
start=datetime.datetime(2011, 10, 1),
end=datetime.datetime(2014, 1, 1))
```

Up to this point, we can see that we've grabbed a bunch of data for various stocks that we want to create a correlation matrix with. Right now, we're nowhere near a matrix table for these stocks, but we're getting there. I've printed C.head() to give us a reminder of the data that we're looking at. We're only interested in comparing one of the prices to the rest. Generally, people will use the adjusted closes for comparison. Let's do that. Now, let's remove the other columns for C, or Citigroup.

```del C['Open']
del C['High']
del C['Low']
del C['Close']
del C['Volume']
```

That's done, we're just left with Adj Close now. Now we can just rename C to whatever we want.

```corComp = C
```

That'll do for a name. Now, instead of calling these columns all Adj Close, we're going to want to call them the stock name, so we can compare easily:

```corComp.rename(columns={'Adj Close': 'C'}, inplace=True)
```

So now our table is just 1 column of Citigroup's adjusted close, let's add the others.

```corComp['BAC'] = BAC['Adj Close']

```

Output:

```                C   BAC   MSFT   TSLA   AAPL   BBRY     CMG   EBAY    JPM  \
Date
2011-10-03  23.04  5.47  22.65  23.73  51.15  20.50  292.70  28.11  26.30
2011-10-04  24.32  5.70  23.40  23.66  50.86  21.00  301.09  29.35  28.02
2011-10-05  24.63  5.71  23.91  25.37  51.65  23.60  300.79  30.53  28.56
2011-10-06  25.94  6.21  24.32  26.96  51.53  24.42  298.99  31.44  29.99
2011-10-07  24.55  5.84  24.24  26.99  50.49  23.36  294.59  31.07  28.43

SBUX    TGT    WFC
Date
2011-10-03  34.81  44.70  21.53
2011-10-04  35.82  46.27  22.48
2011-10-05  36.59  46.54  22.75
2011-10-06  37.02  48.55  23.56
2011-10-07  37.75  48.22  22.79
```

At this point, we've now got a Pandas set of data that consists of columns of stock names and their close prices by date! Converting this to a correlation matrix is just one call away:

```print( corComp.corr())
```

Output:

```             C       BAC      MSFT      TSLA      AAPL      BBRY       CMG  \
C     1.000000  0.975169  0.732407  0.792107 -0.172905 -0.249992  0.450620
BAC   0.975169  1.000000  0.767743  0.812945 -0.027634 -0.392026  0.494348
MSFT  0.732407  0.767743  1.000000  0.783313  0.207460 -0.530131  0.775090
TSLA  0.792107  0.812945  0.783313  1.000000 -0.103821 -0.404356  0.677697
AAPL -0.172905 -0.027634  0.207460 -0.103821  1.000000 -0.694610  0.093031
BBRY -0.249992 -0.392026 -0.530131 -0.404356 -0.694610  1.000000 -0.332754
CMG   0.450620  0.494348  0.775090  0.677697  0.093031 -0.332754  1.000000
EBAY  0.796745  0.846395  0.527071  0.528979  0.137681 -0.464328  0.108140
JPM   0.979039  0.979418  0.789477  0.786151 -0.043577 -0.361171  0.508082
SBUX  0.863705  0.909169  0.868984  0.913326  0.034459 -0.491639  0.758022
TGT   0.808486  0.825841  0.618687  0.561540  0.075709 -0.452150  0.172996
WFC   0.905780  0.954583  0.856305  0.831263  0.137327 -0.577703  0.561800

EBAY       JPM      SBUX       TGT       WFC
C     0.796745  0.979039  0.863705  0.808486  0.905780
BAC   0.846395  0.979418  0.909169  0.825841  0.954583
MSFT  0.527071  0.789477  0.868984  0.618687  0.856305
TSLA  0.528979  0.786151  0.913326  0.561540  0.831263
AAPL  0.137681 -0.043577  0.034459  0.075709  0.137327
BBRY -0.464328 -0.361171 -0.491639 -0.452150 -0.577703
CMG   0.108140  0.508082  0.758022  0.172996  0.561800
EBAY  1.000000  0.810438  0.666961  0.895417  0.823806
JPM   0.810438  1.000000  0.895216  0.815957  0.947179
SBUX  0.666961  0.895216  1.000000  0.674836  0.930716
TGT   0.895417  0.815957  0.674836  1.000000  0.848994
WFC   0.823806  0.947179  0.930716  0.848994  1.000000
```

Finally, you might want to take this data and maybe have it as a csv file, which you can do further analysis with:

```fancy = corComp.corr()
fancy.to_csv('bigmoney.csv')
```

And that's it! So now you know how to create your own correlation tables. You can make these sorts of tables based on more than just correlation. It can be co-variance, standard deviation, or even your own custom functions.

The next tutorial:

• Intro to Pandas and Saving to a CSV and reading from a CSV

• Pandas Column manipulation

• Pandas Column Operations (basic math operations and moving averages)

• Pandas 2D Visualization of Pandas data with Matplotlib, including plotting dates

• Pandas 3D Visualization of Pandas data with Matplotlib

• Pandas Standard Deviation

• Pandas Correlation matrix and Statistics Information on Data
• Pandas Function mapping for advanced Pandas users