ParaMonte Fortran 2.0.0
Parallel Monte Carlo and Machine Learning Library
See the latest version documentation.
pm_sampleCov Module Reference

This module contains classes and procedures for computing the properties related to the covariance matrices of a random sample. More...

Data Types

interface  getCov
 Generate and return the (optionally unbiased) covariance matrix of a pair of (potentially weighted) time series x(1:nsam) and y(1:nsam) or of an input (potentially weighted) array of shape (ndim, nsam) or (nsam, ndim) where ndim is the number of data dimensions (the number of data attributes) and nsam is the number of data points.
More...
 
interface  getCovMerged
 Generate and return the merged covariance of a sample resulting from the merger of two separate (potentially weighted) samples \(A\) and \(B\).
More...
 
interface  setCov
 Return the covariance matrix corresponding to the input (potentially weighted) correlation matrix or return the biased sample covariance matrix of the input array of shape (ndim, nsam) or (nsam, ndim) or a pair of (potentially weighted) time series x(1:nsam) and y(1:nsam) where ndim is the number of data dimensions (the number of data attributes) and nsam is the number of data points.
More...
 
interface  setCovMean
 Return the covariance matrix and mean vector corresponding to the input (potentially weighted) input sample of shape (ndim, nsam) or (nsam, ndim) or a pair of (potentially weighted) time series x(1:nsam) and y(1:nsam) where ndim is the number of data dimensions (the number of data attributes) and nsam is the number of data points.
More...
 
interface  setCovMeanMerged
 Return the merged covariance and mean of a sample resulting from the merger of two separate (potentially weighted) samples \(A\) and \(B\).
More...
 
interface  setCovMeanUpdated
 Return the covariance and mean of a sample that results from the merger of two separate (potentially weighted) non-singular \(A\) and singular \(B\) samples.
More...
 
interface  setCovMerged
 Return the merged covariance of a sample resulting from the merger of two separate (potentially weighted) samples \(A\) and \(B\).
More...
 
interface  setCovUpdated
 Return the covAariance resulting from the merger of two separate (potentially weighted) non-singular and singular samples \(A\) and \(B\).
More...
 

Variables

character(*, SK), parameter MODULE_NAME = "@pm_sampleCov"
 

Detailed Description

This module contains classes and procedures for computing the properties related to the covariance matrices of a random sample.

Covariance

The concept of variance can be generalized to measure the covariation of any pair of data attributes.
The sample covariance matrix is a \(K\)-by- \(K\) matrix \(\mathbf{Q} = \left[\tilde\Sigma_{jk}\right]\) with entries,

\begin{equation} \tilde\Sigma_{jk} = \frac{1}{n} \sum_{i=1}^{n} \left( x_{ij} - \hat\mu_j \right) \left( x_{ik} - \hat\mu_k \right) ~, \end{equation}

where \(n\) is the number of observations in the sample, \(\hat\mu\) is the sample mean vector, and \(\Sigma_{jk}\) is an estimate of the covariance between the \(j\)th variable and the kth variable of the population underlying the data.

The diagonal elements of the matrix \(\tilde\Sigma_{jj}\) are known as the sample variance.

Biased sample covariance

The above formula yields a biased estimate of the covariance matrix of the sample.
Intuitively, the sample covariance relies on the difference between each observation and the sample mean, but the sample mean is slightly correlated with each observation since it is defined in terms of all observations.
Therefore, unless the sample mean is known a priori, the above equation yields a biased estimate of the covariance with sample mean as a proxy for the true mean of the population.
Note that the bias is noticeable only when the sample size is small (e.g., \(<10\)).

Unbiased sample covariance

A popular fix to the definition of sample covariance to remove its bias is to apply the Bessel correction to the equation above, yielding the unbiased covariance estimate as,

\begin{eqnarray} \hat\Sigma_{jk} &=& \frac{\xi}{n} \sum_{i=1}^{n} \left( x_{ij} - \hat\mu_j \right) \left( x_{ik} - \hat\mu_k \right) ~, &=& \frac{1}{n - 1} \sum_{i=1}^{n} \left( x_{ij} - \hat\mu_j \right) \left( x_{ik} - \hat\mu_k \right) ~, \end{eqnarray}

where \(\xi = \frac{n}{n - 1}\) is the Bessel bias correction factor.

Biased weighted sample covariance

\begin{equation} \tilde{\Sigma}^w_{jk} = \frac{ \sum_{i = 1}^{n} \left( x_{ij} - \hat\mu^w_j \right) \left( x_{ik} - \hat\mu^w_k \right) } { \left( \sum_{i=1}^{n} w_i \right) } ~. \end{equation}

where n = nsam is the number of observations in the sample, \(w_i\) are the weights of individual data points, the superscript \(^w\) signifies the sample weights, and \(\hat\mu^w\) is the weighted mean of the sample.
When the sample size is small, the above equation yields a biased estimate of the covariance.

Unbiased weighted sample covariance

There is no unique generic equation for the unbiased covariance of a weighted sample.
However, depending on the types of the weights involved, a few popular definitions exist.

  1. The unbiased covariance of a sample with frequency, count, or repeat weights can be computed via the following equation,

    \begin{equation} \hat\Sigma^w_{jk} = \frac{ \sum_{i = 1}^{n} \left( x_{ij} - \hat\mu^w_j \right) \left( x_{ik} - \hat\mu^w_k \right) } { \left( \sum_{i=1}^{n} w_i \right) - 1} ~. \end{equation}

    Frequency weights represent the number of duplications of each observation in the sample whose population covariance is to be estimated.
    Therefore, the frequency weights are expected to be integers or whole numbers.
  2. The unbiased covariance of a sample with reliability weights, also sometimes confusingly known as probability weights or importance weights, can be computed by the following equation,

    \begin{equation} \hat\Sigma^w_{jk} = \frac{ \sum_{i=1}^{n} w_i } { \left( \sum_{i=1}^{n} w_i \right)^2 - \left( \sum_{i=1}^{n} w_i^2 \right) } \sum_{i = 1}^{n} \left( x_{ij} - \hat\mu^w_j \right) \left( x_{ik} - \hat\mu^w_k \right) ~. \end{equation}

    1. Reliability weights weights, also known as reliability weights or sampling weights represent the probability of a case (or subject) being selected into the sample from a population.
    2. Application of the term unbiased to the above equation is controversial as some believe that bias cannot be correct without the knowledge of the sample size, which is lost in normalized weights.
    3. Reliability weights are frequently (but not necessarily) normalized, meaning that \(\sum^{i = 1}_{n} w_i = 1\).

Covariance matrix vs. correlation matrix

The covariance matrix \(\Sigma\) is related to the correlation matrix \(\rho\) by the following equation,

\begin{equation} \Sigma_{ij} = \rho_{ij} \times \sigma_{i} \times \sigma_{j} ~, \end{equation}

where \(\Sigma\) represents the covariance matrix, \(\rho\) represents the correlation matrix, and \(\sigma\) represents the standard deviations.

See also
pm_sampling
pm_sampleACT
pm_sampleCCF
pm_sampleCor
pm_sampleCov
pm_sampleConv
pm_sampleECDF
pm_sampleMean
pm_sampleNorm
pm_sampleQuan
pm_sampleScale
pm_sampleShift
pm_sampleWeight
pm_sampleAffinity
pm_sampleVar
Box and Tiao, 1973, Bayesian Inference in Statistical Analysis, Page 421.
Updating mean and variance estimates: an improved method, D.H.D. West, 1979.
Geisser and Cornfield, 1963, Posterior distributions for multivariate normal parameters.
Benchmarks:


Benchmark :: The runtime performance of setCov vs. setCovMean.

1! Test the performance of Cholesky factorization computation using an assumed-shape interface vs. explicit-shape interface.
2program benchmark
3
4 use pm_kind, only: IK, LK, RKG => RKD, SK
5 use pm_sampleCov, only: uppDia
6 use pm_bench, only: bench_type
7
8 implicit none
9
10 integer(IK) :: itry, ntry
11 integer(IK) :: i
12 integer(IK) :: iarr
13 integer(IK) :: fileUnit
14 integer(IK) , parameter :: NARR = 18_IK
15 real(RKG) , allocatable :: sample(:,:)
16 type(bench_type), allocatable :: bench(:)
17 integer(IK) , parameter :: nsammax = 2**NARR
18 integer(IK) , parameter :: ndim = 5_IK, dim = 2
19 real(RKG) :: mean(ndim), cov(ndim,ndim)
20 integer(IK) :: nsam
21 real(RKG) :: dumm
22
23 bench = [ bench_type(name = SK_"setCov", exec = setCov, overhead = setOverhead) &
24 , bench_type(name = SK_"setCovMean", exec = setCovMean, overhead = setOverhead) &
25 ]
26
27 write(*,"(*(g0,:,' '))")
28 write(*,"(*(g0,:,' '))") "sample covariance benchmarking..."
29 write(*,"(*(g0,:,' '))")
30
31 open(newunit = fileUnit, file = "main.out", status = "replace")
32
33 write(fileUnit, "(*(g0,:,','))") "nsam", (bench(i)%name, i = 1, size(bench))
34
35 dumm = 0._RKG
36 loopOverMatrixSize: do iarr = 1, NARR - 1
37
38 nsam = 2**iarr
39 ntry = nsammax / nsam
40 allocate(sample(ndim, nsam))
41 write(*,"(*(g0,:,' '))") "Benchmarking setCov() vs. setCovMean()", nsam, ntry
42
43 do i = 1, size(bench)
44 bench(i)%timing = bench(i)%getTiming()
45 end do
46
47 write(fileUnit,"(*(g0,:,','))") nsam, (bench(i)%timing%mean / ntry, i = 1, size(bench))
48 deallocate(sample)
49
50 end do loopOverMatrixSize
51 write(*,"(*(g0,:,' '))") dumm
52
53 close(fileUnit)
54
55contains
56
57 !%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
58 ! procedure wrappers.
59 !%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
60
61 subroutine setOverhead()
62 do itry = 1, ntry
63 call setSample()
64 end do
65 end subroutine
66
67 subroutine setSample()
68 integer(IK) :: i
69 call random_number(sample)
70 end subroutine
71
72 subroutine setCov()
73 block
74 use pm_sampleCov, only: setCov
75 use pm_sampleMean, only: setMean
76 do itry = 1, ntry
77 call setSample()
78 call setMean(mean, sample, dim)
79 call setCov(cov, uppDia, mean, sample, dim)
80 dumm = dumm + cov(1,1) - mean(ndim)
81 end do
82 end block
83 end subroutine
84
85 subroutine setCovMean()
86 block
87 use pm_sampleCov, only: setCovMean
88 do itry = 1, ntry
89 call setSample()
90 call setCovMean(cov, uppDia, mean, sample, dim, sample(1:ndim, 1))
91 dumm = dumm + cov(1,1) - mean(ndim)
92 end do
93 end block
94 end subroutine
95
96end program benchmark
Generate and return an object of type timing_type containing the benchmark timing information and sta...
Definition: pm_bench.F90:574
Return the covariance matrix and mean vector corresponding to the input (potentially weighted) input ...
Return the covariance matrix corresponding to the input (potentially weighted) correlation matrix or ...
Return the (weighted) mean of a pair of time series or of an input sample of nsam observations with n...
This module contains abstract interfaces and types that facilitate benchmarking of different procedur...
Definition: pm_bench.F90:41
This module defines the relevant Fortran kind type-parameters frequently used in the ParaMonte librar...
Definition: pm_kind.F90:268
integer, parameter LK
The default logical kind in the ParaMonte library: kind(.true.) in Fortran, kind(....
Definition: pm_kind.F90:541
integer, parameter IK
The default integer kind in the ParaMonte library: int32 in Fortran, c_int32_t in C-Fortran Interoper...
Definition: pm_kind.F90:540
integer, parameter RKD
The double precision real kind in Fortran mode. On most platforms, this is an 64-bit real kind.
Definition: pm_kind.F90:568
integer, parameter SK
The default character kind in the ParaMonte library: kind("a") in Fortran, c_char in C-Fortran Intero...
Definition: pm_kind.F90:539
This module contains classes and procedures for computing the properties related to the covariance ma...
This module contains classes and procedures for computing the first moment (i.e., the statistical mea...
This is the class for creating benchmark and performance-profiling objects.
Definition: pm_bench.F90:386
subroutine bench(sort, arraySize)

Example Unix compile command via Intel ifort compiler
1#!/usr/bin/env sh
2rm main.exe
3ifort -fpp -standard-semantics -O3 -Wl,-rpath,../../../lib -I../../../inc main.F90 ../../../lib/libparamonte* -o main.exe
4./main.exe

Example Windows Batch compile command via Intel ifort compiler
1del main.exe
2set PATH=..\..\..\lib;%PATH%
3ifort /fpp /standard-semantics /O3 /I:..\..\..\include main.F90 ..\..\..\lib\libparamonte*.lib /exe:main.exe
4main.exe

Example Unix / MinGW compile command via GNU gfortran compiler
1#!/usr/bin/env sh
2rm main.exe
3gfortran -cpp -ffree-line-length-none -O3 -Wl,-rpath,../../../lib -I../../../inc main.F90 ../../../lib/libparamonte* -o main.exe
4./main.exe

Postprocessing of the benchmark output
1#!/usr/bin/env python
2
3import matplotlib.pyplot as plt
4import pandas as pd
5import numpy as np
6
7import os
8dirname = os.path.basename(os.getcwd())
9
10fontsize = 14
11
12df = pd.read_csv("main.out", delimiter = ",")
13colnames = list(df.columns.values)
14
15
18
19ax = plt.figure(figsize = 1.25 * np.array([6.4,4.6]), dpi = 200)
20ax = plt.subplot()
21
22for colname in colnames[1:]:
23 plt.plot( df[colnames[0]].values
24 , df[colname].values
25 , linewidth = 2
26 )
27
28plt.xticks(fontsize = fontsize)
29plt.yticks(fontsize = fontsize)
30ax.set_xlabel(colnames[0], fontsize = fontsize)
31ax.set_ylabel("Runtime [ seconds ]", fontsize = fontsize)
32ax.set_title(" vs. ".join(colnames[1:])+"\nLower is better.", fontsize = fontsize)
33ax.set_xscale("log")
34ax.set_yscale("log")
35plt.minorticks_on()
36plt.grid(visible = True, which = "both", axis = "both", color = "0.85", linestyle = "-")
37ax.tick_params(axis = "y", which = "minor")
38ax.tick_params(axis = "x", which = "minor")
39ax.legend ( colnames[1:]
40 #, loc='center left'
41 #, bbox_to_anchor=(1, 0.5)
42 , fontsize = fontsize
43 )
44
45plt.tight_layout()
46plt.savefig("benchmark." + dirname + ".runtime.png")
47
48
51
52ax = plt.figure(figsize = 1.25 * np.array([6.4,4.6]), dpi = 200)
53ax = plt.subplot()
54
55plt.plot( df[colnames[0]].values
56 , np.ones(len(df[colnames[0]].values))
57 , linestyle = "--"
58 #, color = "black"
59 , linewidth = 2
60 )
61for colname in colnames[2:]:
62 plt.plot( df[colnames[0]].values
63 , df[colname].values / df[colnames[1]].values
64 , linewidth = 2
65 )
66
67plt.xticks(fontsize = fontsize)
68plt.yticks(fontsize = fontsize)
69ax.set_xlabel(colnames[0], fontsize = fontsize)
70ax.set_ylabel("Runtime compared to {}".format(colnames[1]), fontsize = fontsize)
71ax.set_title("Runtime Ratio Comparison. Lower means faster.\nLower than 1 means faster than {}().".format(colnames[1]), fontsize = fontsize)
72ax.set_xscale("log")
73ax.set_yscale("log")
74plt.minorticks_on()
75plt.grid(visible = True, which = "both", axis = "both", color = "0.85", linestyle = "-")
76ax.tick_params(axis = "y", which = "minor")
77ax.tick_params(axis = "x", which = "minor")
78ax.legend ( colnames[1:]
79 #, bbox_to_anchor = (1, 0.5)
80 #, loc = "center left"
81 , fontsize = fontsize
82 )
83
84plt.tight_layout()
85plt.savefig("benchmark." + dirname + ".runtime.ratio.png")

Visualization of the benchmark output

Benchmark moral
  1. The procedures under the generic interface setCov take the sample mean as input and return the covariance matrix.
  2. The procedures under the generic interface setCovMean compute both the sample mean and covariance matrix in one pass.
  3. The performance of the two methods appears to depend significantly on the compiler used.
  4. But in general, the one-pass algorithm of setCovMean appears to perform equally or slightly better than the two-pass algorithm of setCov.
Test:
test_pm_sampleCov
Bug:

Status: See Unresolved, See this page for more information.

Source: GNU Fortran Compiler gfortran
Description: Ideally, there should be only one generic interface in this module for computing the biased/corrected/weighted variance.
This requires ability to resolve the different weight types, which requires custom derived types for weights.
Fortran PDTs are ideal for such use cases. However, the implementation of PDTs is far from complete in GNU Fortran Compiler gfortran.

Remedy (as of ParaMonte Library version 2.0.0): Given that the importance of GNU Fortran Compiler gfortran support, separate generic interfaces were instead developed for different sample weight types.
Once the GNU Fortran Compiler gfortran PDT bugs are resolved, the getVar generic interface can be extended to serve as a high-level wrapper for the weight-specific generic interfaces in this module.
Todo:
Normal Priority: The inclusion of bias correction in the calculation of covariance is a frequentist abomination and shenanigan that must be eliminated in the future.
The correction factor should be computed separately from the actual covariance calculation.


Final Remarks


If you believe this algorithm or its documentation can be improved, we appreciate your contribution and help to edit this page's documentation and source file on GitHub.
For details on the naming abbreviations, see this page.
For details on the naming conventions, see this page.
This software is distributed under the MIT license with additional terms outlined below.

  1. If you use any parts or concepts from this library to any extent, please acknowledge the usage by citing the relevant publications of the ParaMonte library.
  2. If you regenerate any parts/ideas from this library in a programming environment other than those currently supported by this ParaMonte library (i.e., other than C, C++, Fortran, MATLAB, Python, R), please also ask the end users to cite this original ParaMonte library.

This software is available to the public under a highly permissive license.
Help us justify its continued development and maintenance by acknowledging its benefit to society, distributing it, and contributing to it.

Author:
Amir Shahmoradi, Nov 24, 2020, 4:19 AM, Dallas, TX
Fatemeh Bagheri, Thursday 12:45 AM, August 20, 2021, Dallas, TX
Amir Shahmoradi, Monday March 6, 2017, 2:48 AM, Institute for Computational Engineering and Sciences (ICES), The University of Texas at Austin.

Variable Documentation

◆ MODULE_NAME

character(*, SK), parameter pm_sampleCov::MODULE_NAME = "@pm_sampleCov"

Definition at line 182 of file pm_sampleCov.F90.