Recall the globalLandTempHist.txt dataset, which contains the global land temperature of Earth over the past 300 years. Also recall the equation for Pearson's correlation coefficient between two attributes $x$ and $y$ of a dataset,
\[r_{xy} = \frac{ \sum_{i=1}^{n} ~ (x_i-\overline{x})(y_i-\overline{y}) }{ \sqrt{\sum_{i=1}^{n} ~ (x_i-\overline{x})^2} ~\sqrt{\sum_{i=1}^{n} ~ (y_i-\overline{y})^2} }\]
where $n$ represents the number of data points and $\overline{x}$ and $\overline{y}$ are the sample means of the two attributes. We wish to compute the Pearson correlation coefficient between the year and the temperature anomaly attributes in the land temperature data mentioned above. To do so, we will take the following steps:
- Write a function named `getPearsonCor(data1,data2)` that returns the Pearson correlation coefficient according to the equation above.
- Read the dataset (using the Pandas Python library, for example) and make sure to exclude lines of data that contain `nan` values. If you are using Python, you can use the Pandas `dropna()` method to remove rows of data with `nan`.
- Pass the two columns of data to your function to compute the Pearson correlation coefficient. You should obtain a positive correlation, indicating that the global land temperature has increased with time over the past 300 years.
- Use the Pearson correlation coefficient calculator from an established library in the language of your choice to verify your calculation of the Pearson correlation. Within Python, you can use Numpy's `corrcoef()` function to compute the correlation.
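The first step can be sketched directly from the equation. The following is a minimal sketch, not the reference solution: it assumes both inputs are one-dimensional, of equal length, and free of `nan` values, and the toy arrays `x` and `y` are made-up illustration data rather than values from the temperature dataset.

```python
import numpy as np

def getPearsonCor(data1, data2):
    """Return Pearson's r between two equal-length 1-D sequences (no NaN)."""
    x = np.asarray(data1, dtype=float)
    y = np.asarray(data2, dtype=float)
    if x.ndim != 1 or x.shape != y.shape:
        raise ValueError("data1 and data2 must be 1-D and of equal length")
    # Deviations of each attribute from its own mean.
    dx = x - x.mean()
    dy = y - y.mean()
    # Numerator and denominator of the equation for r_xy.
    numerator = np.sum(dx * dy)
    denominator = np.sqrt(np.sum(dx**2)) * np.sqrt(np.sum(dy**2))
    return numerator / denominator

# Hypothetical toy data, used only to exercise the function.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
r = getPearsonCor(x, y)
print("Pearson's r (toy data) = {}".format(r))
```

Because the normalization factor $1/(n-1)$ appears in both the covariance and the two standard deviations, it cancels out, which is why the sketch works with plain sums.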
def genCovMat(Data, Mean = None):
    """
    Generate and return the covariance matrix of the input data.
    The columns of data must be individual attributes.
    The rows of data must be individual observations.
    Please pass a clean matrix of all real values (no NA, no NaN).

    Parameters
    ----------
    Data
        The input Numpy matrix of data of all numeric values.
    Mean
        The mean of the input data along the columns (attributes)
        (**optional**, default = numpy.mean(Data, axis = 0))
    """
    import numpy as np
    if Mean is None: Mean = np.mean(Data, axis = 0)
    ndim = len(Data[0,:]) # number of attributes (columns)
    npnt = len(Data[:,0]) # number of observations (rows)
    normFac = 1 / (npnt - 1) # unbiased sample covariance normalization
    CovMat = np.zeros((ndim,ndim))
    # Fill the lower triangle and mirror it, since the matrix is symmetric.
    for irow in range(ndim):
        for icol in range(irow+1):
            CovMat[irow,icol] = normFac * np.dot( Data[:,irow] - Mean[irow] , Data[:,icol] - Mean[icol] )
            CovMat[icol,irow] = CovMat[irow,icol]
    return CovMat

def genCorMat(Data, Mean = None):
    """
    Generate and return the correlation matrix of the input data.
    The columns of data must be individual attributes.
    The rows of data must be individual observations.
    Please pass a clean matrix of all real values (no NA, no NaN).

    Parameters
    ----------
    Data
        The input Numpy matrix of data of all numeric values.
    Mean
        The mean of the input data along the columns (attributes)
        (**optional**, default = numpy.mean(Data, axis = 0))
    """
    import numpy as np
    CovMat = genCovMat(Data,Mean)
    ndim = len(CovMat[0,:])
    CorMat = np.ones((ndim,ndim)) # the diagonal elements are 1 by definition
    # Normalize each off-diagonal covariance by the two standard deviations.
    for irow in range(ndim):
        for icol in range(irow+1):
            if icol != irow:
                CorMat[irow,icol] = CovMat[irow,icol] / np.sqrt(CovMat[irow,irow] * CovMat[icol,icol])
                CorMat[icol,irow] = CorMat[irow,icol]
    return CorMat
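
# Read the global land temperature history data.
import pandas as pd
# The separator is passed by keyword because recent Pandas versions no longer accept it positionally.
# A multi-character separator requires the Python parsing engine.
df = pd.read_csv('http://www.cdslab.org/recipes/programming/stat-covmat/globalLandTempHist.txt', sep = ', ', engine = 'python')
# Drop the rows that contain NaN values and renumber the remaining rows.
df = df.dropna()
df = df.reset_index(drop=True)
# Compute the correlation matrix; element [0,1] is the correlation between the two columns.
CorMat = genCorMat(df.values)
print("Pearson's r = {}".format(CorMat[0,1]))
# Verify the result against SciPy's implementation.
from scipy.stats import pearsonr
psn = pearsonr(df.values[:,0], df.values[:,1])
print("Pearson's r using SciPy = {}".format(psn[0]))

Pearson's r = 0.3236486317130343
Pearson's r using SciPy = 0.3236486317130345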
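The problem statement also suggests Numpy's `corrcoef()` for the verification step. The sketch below illustrates that route on a small made-up table so that it runs without downloading the dataset. The column names `Year` and `Anomaly` and all values are hypothetical stand-ins, not the actual contents of globalLandTempHist.txt.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the temperature table (values are invented).
df = pd.DataFrame({
    "Year":    [1750, 1751, 1752, 1753, 1754, 1755],
    "Anomaly": [-0.5, np.nan, -0.3, 0.1, np.nan, 0.4],
})

# Remove the rows that contain NaN, as required before computing r.
clean = df.dropna().reset_index(drop=True)

# By default corrcoef() treats each ROW as a variable;
# rowvar=False tells it that the columns are the attributes.
cor_mat = np.corrcoef(clean.values, rowvar=False)
r_numpy = cor_mat[0, 1]

# Cross-check: normalize the sample covariance by the two standard deviations.
cov_mat = np.cov(clean.values, rowvar=False)
r_from_cov = cov_mat[0, 1] / np.sqrt(cov_mat[0, 0] * cov_mat[1, 1])

print("Pearson's r using NumPy = {}".format(r_numpy))
print("Pearson's r from np.cov = {}".format(r_from_cov))
```

With the real data, replacing the hypothetical frame with the one read from the file should give a value that agrees with the results printed above up to floating-point round-off.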