Problem

Recall the globalLandTempHist.txt dataset that consisted of the global land temperature of Earth over the past 300 years. Also recall the Spearman correlation rank coefficient is merely the Pearson’s correlation coefficient of the ranks of two attributes in a dataset. In other words, given a dataset, like the above temperature, one can first compute the ranks of attributes. These ranks are basically a reordering of the indices of each data attribute, such that with the new indices data becomes ordered. For example, the rank of the following dataset,

\[D = [2.5, -3, 1, 5.5, 0, 10] ~,\]

is,

\[R = [4, 1, 3, 5, 2, 6] ~,\]

Once the ranks of a pair of attributes are computed, one can readily use the original definition of Pearson’s correlation coefficient to compute the Spearman’s rank correlation coefficient of data via the equation,

\[r_{xy} = \frac{ \sum_1^{n} ~ (R_i-\overline{R})(S_i-\overline{S}) }{ \sqrt{\sum_1^{n} ~ (R_i-\overline{R})^2} ~\sqrt{\sum_1^{n} ~ (S_i-\overline{S})^2} }\]

where $n$ represents the number of data points and $R$ and $S$ are the corresponding ranks of the data attributes $X$ and $Y$. Given this introduction, we wish to compute the Spearman correlation coefficient between the year and the temperature anomaly attribute in the land temperature data mentioned in the above. To do so, we will take the following steps:

  1. Use an existing package in the language of your choice to compute the rank of the two data attributes. In the case of Python, you can use numpy’s argsort() to obtain the attributes’ ranks.
  2. Write a function named genSpearmanCor(data1,data2) that returns the Spearman correlation coefficient according to the equation in the above using the ranks.
  3. Now, read the dataset (using Pandas Python library, for example) and make sure to exlude lines of data that contain nan values. If you are using Python, you can get help from Pandas dropna() method to remove rows of data with nan.
  4. Now, pass the two columns of data to your function to compute the Spearman correlation coefficient. You should obtain a positive correlation indicating that the global land temperature has increased with time over the past 300 years.
  5. Now, use the Spearman correlation coefficient calculator from an established library in the language of your choice to verify your calculation of the Spearman correlation. Within Python, you can use Numpy’s corrcoef() function to compute the correlation.
  6. Compare your answer with what you get from an external package for computing the Spearman’s correlation coefficient. In case of Python, you can use SciPy’s scipy.stats.spearmanr function.

Solution

MATLAB

To be provided…

Python
def genCovMat(Data, Mean = None):
    """
    Generate and return the covariance matrix of the input data.
    The columns of data must be individual attributes.
    The rows of data must be individual observations.
    Please pass clean matrix of all real values (no NA, no NaN).
    
    Parameters
    ----------
        Data
            The input Numpy matrix of data of all numeric values.
    
        Mean
            The mean of the input data along the columns (attributes)
            (**optional**, default = numpy.mean(Data))
    """
    import numpy as np
    if Mean is None: Mean = np.mean(Data, axis = 0)
    ndim = len(Data[0,:])
    npnt = len(Data[:,0])
    normFac = 1 / (npnt - 1)
    CovMat = np.zeros((ndim,ndim))
    for irow in range(ndim):
        for icol in range(irow+1):
            CovMat[irow,icol] = normFac * np.dot( Data[:,irow] - Mean[irow] , Data[:,icol] - Mean[icol] )
            CovMat[icol,irow] = CovMat[irow,icol] 
    return CovMat

def genCorMat(Data, Mean = None):
    """
    Generate and return the correlation matrix of the input data.
    The columns of data must be individual attributes.
    The rows of data must be individual observations.
    Please pass clean matrix of all real values (no NA, no NaN).
    
    Parameters
    ----------
        Data
            The input Numpy matrix of data of all numeric values.
    
        Mean
            The mean of the input data along the columns (attributes)
            (**optional**, default = numpy.mean(Data))
    """
    import numpy as np
    CovMat = genCovMat(Data,Mean)
    ndim = len(CovMat[0,:])
    CorMat = np.ones((ndim,ndim))
    for irow in range(ndim):
        for icol in range(irow+1):
            if icol != irow:
                CorMat[irow,icol] = CovMat[irow,icol] / np.sqrt(CovMat[irow,irow] * CovMat[icol,icol])
                CorMat[icol,irow] = CorMat[irow,icol]
    return CorMat
    
# Read the global land temperature history data
import pandas as pd
df = pd.read_csv('http://www.cdslab.org/recipes/programming/stat-covmat/globalLandTempHist.txt', ', ')
df = df.dropna()
df = df.reset_index(drop=True)

# Get the rank of data
Indx = np.zeros(np.shape(df.values))
for icol in range(len(Indx[0,:])): Indx[:,icol] = np.argsort(df.values[:,icol], axis = -1)
CorMat = genCorMat(Indx)
print("Spearman's r = {}".format(CorMat[0,1]))

from scipy.stats import spearmanr
spr = spearmanr(Indx)
print("Spearman's r using SciPy = {}".format(spr[0]))
Spearman's r = 0.38459936259841604
Spearman's r using SciPy = 0.3845993625984161

Comments