Recall the globalLandTempHist.txt dataset, which contains the global land temperature of Earth over the past 300 years. Also recall the equation for Kendall’s rank correlation coefficient between two attributes of a dataset:
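\[\tau_{xy} = \frac{ \text{#concordant pairs} - \text{#discordant pairs} }{ \sqrt{\text{#concordant pairs} + \text{#discordant pairs} + \text{#extra-$y$}} ~ \sqrt{\text{#concordant pairs} + \text{#discordant pairs} + \text{#extra-$x$}} }\]
where the concordance or discordance of a pair of data points is determined by how the plane is partitioned around a given point. Another point lying above and to the right, or below and to the left, forms a concordant pair with it. A point lying above and to the left, or below and to the right, forms a discordant pair. A point with the same $x$ but a different $y$ forms an extra-$y$ pair, a point with the same $y$ but a different $x$ forms an extra-$x$ pair, and a point tied in both coordinates is not counted at all.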
Here, $n$ represents the number of data points, so there are $n(n-1)/2$ distinct pairs to classify. We wish to compute Kendall’s rank correlation coefficient between the year and the temperature anomaly attributes of the land temperature data mentioned above. To do so, we will take the following steps:
- Write a function named `getKendallCor(data1,data2)` that returns Kendall’s rank correlation coefficient according to the equation above.
- Now, read the dataset (using the Pandas Python library, for example) and make sure to exclude lines of data that contain `nan` values. If you are using Python, you can use the Pandas `dropna()` method to remove rows of data with `nan`.
- Now, pass the two columns of data to your function to compute Kendall’s rank correlation coefficient. You should obtain a positive correlation, indicating that the global land temperature has increased with time over the past 300 years.
- Now, use the Kendall’s rank correlation coefficient calculator from an established library in the language of your choice to verify your calculation of the Kendall’s rank correlation. Within Python, you can use Scipy’s `stats.kendalltau()` function to compute the correlation.
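
One possible way to approach the first step is sketched below. It is a minimal, unoptimized sketch that counts every pair directly, assuming the inputs are equal-length numeric sequences with no `nan` values, that a pair tied in $x$ only counts as extra-$y$, that a pair tied in $y$ only counts as extra-$x$, and that pairs tied in both are skipped. The small `years` and `anomalies` lists are made-up toy values for illustration, not rows from globalLandTempHist.txt.

```python
import math


def getKendallCor(data1, data2):
    """Kendall's rank correlation by direct counting over all pairs (O(n^2))."""
    x = list(data1)
    y = list(data2)
    if len(x) != len(y):
        raise ValueError("data1 and data2 must have the same length")

    concordant = discordant = extra_x = extra_y = 0
    n = len(x)
    for i in range(n - 1):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue              # tied in both attributes: not counted
            elif dx == 0:
                extra_y += 1          # tied in x only (assumed "extra-y")
            elif dy == 0:
                extra_x += 1          # tied in y only (assumed "extra-x")
            elif (dx > 0) == (dy > 0):
                concordant += 1       # both differences have the same sign
            else:
                discordant += 1       # differences have opposite signs

    denominator = (math.sqrt(concordant + discordant + extra_y)
                   * math.sqrt(concordant + discordant + extra_x))
    if denominator == 0:
        return float("nan")           # undefined when one attribute is constant
    return (concordant - discordant) / denominator


# Toy data (hypothetical values): 9 concordant pairs and 1 discordant pair
# out of 10, so the coefficient is approximately 0.8.
years = [1900, 1901, 1902, 1903, 1904]
anomalies = [-0.3, -0.1, -0.2, 0.0, 0.4]
print(getKendallCor(years, anomalies))
```

On the real dataset, the two columns left after `dropna()` would be passed in the same way. Scipy’s `stats.kendalltau()` computes the tie-adjusted tau-b variant by default, so its result should agree with this function up to floating-point rounding. The double loop visits all $n(n-1)/2$ pairs, which is acceptable for a few thousand rows but slow for much larger inputs.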