ParaMonte Fortran 2.0.0
Parallel Monte Carlo and Machine Learning Library
pm_sampleVar Module Reference

This module contains classes and procedures for computing the properties related to the variances of a random sample.
More...

Data Types

interface  getVar
 Generate and return the variance of the input sample of type complex or real of shape (nsam) or (ndim, nsam) or (nsam, ndim) where ndim is the number of data dimensions (the number of data attributes) and nsam is the number of data points.
More...
 
interface  getVarCorrection
 Generate and return the bias correction factor for the computation of the variance of a (weighted) sample.
More...
 
interface  getVarMerged
 Generate and return the (weighted) merged variance of a complex or real sample resulting from the merger of two separate (weighted) samples \(A\) and \(B\).
More...
 
interface  setVar
 Return the variance of an input (weighted) sample of type complex or real of shape (nsam) or (ndim, nsam) or (nsam, ndim) where ndim is the number of data dimensions (the number of data attributes) and nsam is the number of data points.
More...
 
interface  setVarMean
 Return the (weighted) sample variance and mean of an input time series of nsam observations, or of an input sample of nsam observations with ndim attributes, optionally weighted by the input weight, and optionally also return sum(weight).
More...
 
interface  setVarMeanMerged
 Return the (weighted) merged variance and mean of a complex or real sample resulting from the merger of two separate (weighted) samples \(A\) and \(B\).
More...
 
interface  setVarMerged
 Return the (weighted) merged variance of a complex or real sample resulting from the merger of two separate (weighted) samples \(A\) and \(B\).
More...
 

Variables

character(*, SK), parameter MODULE_NAME = "@pm_sampleVar"
 

Detailed Description

This module contains classes and procedures for computing the properties related to the variances of a random sample.

Variance

Variance is the squared deviation from the mean of a random variable.
Equivalently, the variance is the square of the standard deviation.
Variance is a measure of dispersion, meaning it is a measure of how far a set of numbers is spread out from their average value.
It is the second central moment of a distribution, and the covariance of the random variable with itself.
It is frequently represented by \(\Sigma\), \(\sigma^2\), \(s^2\), \(\up{Var}(X)\), or \(\mathbb{V}(X)\).

Variance as a measure of dispersion

An advantage of variance as a measure of dispersion is that it is more amenable to algebraic manipulation than other measures of dispersion such as the expected absolute deviation.
For example, the variance of a sum of uncorrelated random variables is equal to the sum of their variances.
A disadvantage of the variance for practical applications is that, unlike the standard deviation, its units differ from the random variable, which is why the standard deviation is more commonly reported as a measure of dispersion once the calculation is finished.

Population vs. sample variance

There are two distinct concepts that are both called variance.
One, as discussed above, is part of a theoretical probability distribution and is defined by an equation.
The other variance is a characteristic of a set of observations.
When variance is calculated from observations, those observations are typically measured from a real world system.
If all possible observations of the system are present then the calculated variance is called the population variance.
Normally, however, only a subset is available, and the variance calculated from this is called the sample variance.
The variance calculated from a sample is considered an estimate of the full population variance.
There are multiple ways to calculate an estimate of the population variance, as discussed in the section below.
The two kinds of variance are closely related.
To see how, consider that a theoretical probability distribution can be used as a generator of hypothetical observations.
If an infinite number of observations are generated using a distribution, then the sample variance calculated from that infinite set will match the value calculated using the distribution's equation for variance.

History

The term variance was apparently first introduced by Ronald Fisher in his 1918 paper The Correlation Between Relatives on the Supposition of Mendelian Inheritance.

Definition

The variance of a random variable \(X\) is the expected value of the squared deviation from the mean of \(X\), \(\mu= \up{E}[X]\):

\begin{equation} \up{Var}(X) = \up{E}\left[(X - \mu)^2\right] ~. \end{equation}

This definition encompasses random variables that are generated by processes that are discrete, continuous, neither, or mixed.
The variance can also be thought of as the covariance of a random variable with itself:

\begin{equation} \up{Var}(X) = \up{Cov}(X, X) ~. \end{equation}

The variance is also equivalent to the second cumulant of the probability distribution that generates \(X\).
The variance is typically designated as \(\up{Var}(X)\), or sometimes as \(V(X)\) or \(\mathbb{V}(X)\), or symbolically as \(\sigma_X^2\) or simply \(\sigma^2\).
The expression for the variance can be expanded as follows:

\begin{equation} \begin{aligned} \up{Var}(X) &= \up{E} \left[ (X - \up{E}[X])^2\right] \\ &= \up{E} \left[ X^2 - 2X\up{E}[X] + \up{E} [X]^{2} \right] \\ &= \up{E} \left[ X^2 \right] - 2\up{E}[X]\up{E}[X] + \up{E}[X]^{2} \\ &= \up{E} \left[ X^2 \right] - \up{E}[X]^{2} \end{aligned} \end{equation}

In other words, the variance of \(X\) is equal to the mean of the square of \(X\) minus the square of the mean of \(X\).
However, this formulation should not be used for computations in floating point arithmetic, because it suffers from catastrophic cancellation when the two terms are similar in magnitude.
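This numerical caveat can be demonstrated with a short standalone Fortran snippet; all names below are chosen for illustration and are not part of this module. With single-precision data shifted far from zero, the expanded form loses essentially all significant digits, while the two-pass formula remains accurate (the true variance of the data below is 14).

    program demo_cancellation
        use iso_fortran_env, only: RK => real32
        implicit none
        real(RK) :: x(3), avg, varTwoPass, varExpanded
        ! Large offset with a small spread: the true (biased) variance is 14.
        x = [10000004._RK, 10000007._RK, 10000013._RK]
        avg = sum(x) / size(x)
        varTwoPass = sum((x - avg)**2) / size(x)        ! stable two-pass formula
        varExpanded = sum(x**2) / size(x) - avg**2      ! E[X^2] - E[X]^2: catastrophic cancellation
        print *, varTwoPass, varExpanded
    end program demo_cancellation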

Population variance and sample variance

Real-world observations, such as measurements of yesterday's rainfall throughout the day, typically cannot be complete sets of all possible observations that could be made.
As such, the variance calculated from the finite set will in general not match the variance that would have been calculated from the full population of possible observations.
This means that one estimates the mean and variance from a limited set of observations by using an estimator equation.
The estimator is a function of the sample of \(n\) observations drawn without observational bias from the whole population of potential observations.
The simplest estimators for population mean and population variance are simply the mean and variance of the sample, the sample mean and (uncorrected) sample variance.
These are consistent estimators (they converge to the correct value as the number of samples increases), but they can be improved.
Estimating the population variance by taking the sample variance is close to optimal in general, but can be improved in two ways.
The sample variance is computed as an average of squared deviations about the (sample) mean, by dividing by \(n\).
However, using values other than \(n\) improves the estimator in various ways.
Four common values for the denominator are \(n\), \(n − 1\), \(n + 1\), and \(n − 1.5\):

  1. \(n\) is the simplest (population variance of the sample),
  2. \(n − 1\) eliminates bias,
  3. \(n + 1\) minimizes mean squared error for the normal distribution,
  4. \(n − 1.5\) yields a nearly unbiased estimate of the standard deviation for the normal distribution.

Firstly, if the true population mean is unknown, then the sample variance (which uses the sample mean in place of the true mean) is a biased estimator:
It underestimates the variance by a factor of \((n − 1) / n\).
Correcting by this factor (dividing by \(n − 1\) instead of \(n\)) is called the Bessel correction.
The resulting estimator is unbiased, and is called the corrected sample variance or unbiased sample variance.
For example, when \(n = 1\) the variance of a single observation about the sample mean (itself) is obviously zero regardless of the population variance.
If the mean is determined in some other way than from the same samples used to estimate the variance then this bias does not arise and the variance can safely be estimated as that of the samples about the (independently known) mean.

Biased sample variance

In many practical situations, the true variance of a population is not known a priori and must be computed.
When dealing with extremely large populations, it is not possible to count every object in the population, so the computation must be performed on a sample of the population.
This is generally referred to as sample variance or empirical variance.
Sample variance can also be applied to the estimation of the variance of a continuous distribution from a sample of that distribution.
We take a sample with replacement of \(n\) values \(X_1, \ldots, X_n\) from the population and estimate the variance on the basis of this sample.
Directly taking the variance of the sample data gives the average of the squared deviations:

\begin{equation} \tilde{\sigma}_X^2 = \frac{1}{n} \sum_{i = 1}^{n} \left( X_i - \overline{X} \right)^2 = \left( \frac{1}{n} \sum_{i = 1}^{n} X_i^2 \right) - \overline{X}^2 = \frac{1}{n^2} \sum_{i, j ~:~ i < j} \left( X_i - X_j \right)^2 ~. \end{equation}

Here, \(\overline{X}\) denotes the sample mean:

\begin{equation} \overline{X} = \frac{1}{n} \sum_{i = 1}^n X_i ~. \end{equation}

Since the \(X_i\) are selected randomly, both \(\overline{X}\) and \(\tilde{\sigma}_X^2\) are random variables.
Their expected values can be evaluated by averaging over the ensemble of all possible samples \(X_i\) of size \(n\) from the population.
For \(\tilde{\sigma}_X^2\) this gives:

\begin{equation} \begin{aligned} \up{E}[\tilde{\sigma}_X^2] &= \up{E} \left[ \frac{1}{n} \sum_{i = 1}^n \left( X_i - \frac{1}{n} \sum_{j = 1}^n X_{j} \right)^2 \right] \\ &= \frac{1}{n} \sum_{i = 1}^n \up{E} \left[ X_i^2 - \frac{2}{n} X_i \sum_{j = 1}^n X_j + \frac{1}{n^2} \sum_{j = 1}^n X_j \sum_{k = 1}^n X_k \right] \\ &= \frac{1}{n} \sum_{i = 1}^n \left( \frac{n - 2}{n} \up{E} \left[ X_i^2 \right] - \frac{2}{n} \sum_{j \neq i} \up{E} \left[ X_i X_j \right] + \frac{1}{n^2} \sum_{j = 1}^n \sum_{k \neq j}^n \up{E} \left[ X_j X_k \right] + \frac{1}{n^2} \sum_{j = 1}^n \up{E}\left[ X_j^2 \right] \right) \\ &= \frac{1}{n} \sum_{i = 1}^n \left[ \frac{n - 2}{n} \left( \sigma^2 + \mu^2 \right) - \frac{2}{n}(n - 1)\mu^2 + \frac{1}{n^2} n (n - 1) \mu^2 + \frac{1}{n} \left(\sigma^2 + \mu^2 \right)\right] \\ &= \frac{n - 1}{n} \sigma^2 ~. \end{aligned} \end{equation}

Hence \(\tilde{\sigma}_X^2\) gives an estimate of the population variance that is biased by a factor of \(\frac{n - 1}{n}\).
For this reason, \(\tilde{\sigma}_X^2\) is referred to as the biased sample variance.
The bias-correction factor in this case is \(\xi = \frac{n}{n - 1}\).

Unbiased sample variance

Correcting for this bias yields the unbiased sample variance, denoted \(\sigma^2\):

\begin{equation} \sigma^{2} = \frac{n}{n-1} \tilde{\sigma}_X^2 = \frac{n}{n - 1} \left[ \frac{1}{n} \sum_{i = 1}^n \left(X_i - \overline{X} \right)^2\right] = \frac{1}{n-1}\sum_{i = 1}^n\left(X_i - \overline{X} \right)^2 ~. \end{equation}

Either estimator may be simply referred to as the sample variance when the version can be determined by context.
The same proof is also applicable for samples taken from a continuous probability distribution.
The use of the term \(n − 1\) is called the Bessel correction, and it is also used in sample covariance and the sample standard deviation (the square root of variance).
The square root is a concave function and thus introduces negative bias (by the Jensen inequality), which depends on the distribution, and thus the corrected sample standard deviation is biased.
The unbiased estimation of standard deviation is a technically involved problem, though for the normal distribution using the term \(n − 1.5\) yields an almost unbiased estimator.
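As a minimal standalone Fortran sketch of the two estimators (illustrative names only; this is not the implementation behind the getVar or setVar interfaces of this module):

    pure function getVarSample(x, correction) result(var)
        ! Sketch: biased sample variance (divide by n) vs. the
        ! Bessel-corrected, unbiased sample variance (divide by n - 1).
        real, intent(in) :: x(:)            ! sample of nsam observations
        logical, intent(in) :: correction   ! .true. applies Bessel's correction
        real :: var, avg
        avg = sum(x) / size(x)              ! sample mean
        var = sum((x - avg)**2)
        if (correction) then
            var = var / (size(x) - 1)       ! unbiased (corrected) sample variance
        else
            var = var / size(x)             ! biased sample variance
        end if
    end function getVarSample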

Biased weighted sample variance

Given a weighted sample of \(n\) observations \(X_{1:n}\) with a weighted sample mean \(\hat\mu_w\), the biased variance of the weighted sample is computed as,

\begin{equation} \hat\sigma_w^2 = \frac{1}{\sum_{i=1}^{n} w_i} \sum_{i = 1}^n w_i (X_i - \hat\mu_w)^2 ~, \end{equation}

where n = nsam is the number of observations in the sample, \(w_i\) are the weights of individual data points, and \(\hat\mu_w\) is the weighted mean of the sample.
When the sample size is small, the above equation yields a biased estimate of the variance.

Unbiased weighted sample variance

There is no unique generic equation for the unbiased variance of a weighted sample.
However, depending on the types of the weights involved, a few popular definitions exist, as illustrated in the sketch following the list below.

  1. The unbiased variance of a sample with frequency, count, or repeat weights can be computed via the following equation,

    \begin{equation} \hat\sigma_w^2 = \frac{\xi}{\sum_{i=1}^{n} w_i} \sum_{i=1}^{n} w_i (X_i - \hat\mu_w)^2 = \frac{1}{\left( \sum_{i=1}^{n} w_i \right) - 1} \sum_{i=1}^{n} w_i (X_i - \hat\mu_w)^2 ~, \end{equation}

    where the bias correction factor \(\xi\) is,

    \begin{equation} \xi = \frac{\sum_{i=1}^{n} w_i}{\left( \sum_{i=1}^{n} w_i \right) - 1} ~. \end{equation}

    Frequency weights represent the number of duplications of each observation in the sample whose population variance is to be estimated.
    Therefore, the frequency weights are expected to be integers or whole numbers.
  2. The unbiased variance of a sample with reliability weights, also sometimes confusingly known as probability weights or importance weights, can be computed by the following equation,

    \begin{equation} \hat\sigma_w^2 = \frac{\xi}{\sum_{i=1}^{n}w_i} \sum_{i=1}^{n} w_i \left(X_i - \hat\mu_w\right)^2 = \frac{\sum_{i=1}^{n} w_i}{\left(\sum_{i=1}^{n}w_i\right)^2 - \left(\sum_{i=1}^{n}w_i^2\right)} \sum_{i=1}^{n} w_i \left(X_i - \hat\mu_w\right)^2 ~, \end{equation}

    where the bias correction factor \(\xi\) is,

    \begin{equation} \xi = \frac{\left(\sum_{i=1}^{n} w_i\right)^2}{\left(\sum_{i=1}^{n}w_i\right)^2 - \left(\sum_{i=1}^{n}w_i^2\right)} ~. \end{equation}

    1. Reliability weights, also known as sampling weights, represent the probability of a case (or subject) being selected into the sample from a population.
    2. Application of the term unbiased to the above equation is controversial, as some argue that the bias cannot be corrected without knowledge of the sample size, which is lost when the weights are normalized.
    3. Reliability weights are frequently (but not necessarily) normalized, meaning that \(\sum_{i = 1}^{n} w_i = 1\).
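The following standalone Fortran sketch (illustrative names, not the library implementation) computes the unbiased weighted sample variance under either weighting convention, using the two correction factors given above.

    pure function getVarWeightedUnbiased(x, w, frequency) result(var)
        ! Sketch: unbiased weighted sample variance with the two
        ! bias-correction factors given above.
        real, intent(in) :: x(:)            ! sample of nsam observations
        real, intent(in) :: w(:)            ! sample weights
        logical, intent(in) :: frequency    ! .true. for frequency (count) weights, .false. for reliability weights
        real :: var, mu, wsum, xi
        wsum = sum(w)
        mu = sum(w * x) / wsum                          ! weighted sample mean
        var = sum(w * (x - mu)**2) / wsum               ! biased weighted variance
        if (frequency) then
            xi = wsum / (wsum - 1.)                     ! correction for frequency weights
        else
            xi = wsum**2 / (wsum**2 - sum(w**2))        ! correction for reliability weights
        end if
        var = xi * var                                  ! unbiased weighted variance
    end function getVarWeightedUnbiased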

Variance updating

Consider a sample \(X_{1..n}\) of size \(n\) with weights \(w_{1..n}\).
The weighted mean of this sample can be expressed as,

\begin{equation} \large \hat\mu_{1:n} = \frac{1}{w_{1:n}} \sum_{i = 1}^{n} w_i X_i ~, \end{equation}

where \(w_{1:n} = \sum w_{1..n}\) and the weighted biased variance of the sample can be expressed as,

\begin{eqnarray} \large \tilde{\sigma}_{1:n}^2 = \frac{\hat\sigma_{1:n}^2}{\xi_{1:n}} &=& \frac{1}{w_{1:n}} \sum_{i = 1}^{n} w_i (X_i - \hat\mu_{1:n})^2 \\ &=& \frac{1}{w_{1:n}} \sum_{i = 1}^{n} w_i X_i^2 - \hat\mu_{1:n}^2 ~, \end{eqnarray}

where \(\xi_{1:n}\) is the appropriate bias-correction factor (which can be one).
This implies,

\begin{equation} \large \sum_{i = 1}^{n} w_i X_i^2 = w_{1:n} \left(\frac{\hat\sigma_{1:n}^2}{\xi_{1:n}} + \hat\mu_{1:n}^2\right) ~. \end{equation}

The above can be used to express the variance of the sample \(X_{1..n+m}\) resulting from the merger of two separate samples \(X_{1..n}\) and \(X_{n+1..n+m}\) as,

\begin{equation} \large \frac{\hat\sigma_{1:n+m}^2}{\xi_{1:n+m}} = \frac{1}{w_{1:n+m}} \left( w_{1:n} \left( \frac{\hat\sigma_{1:n}^2}{\xi_{1:n}} + \hat\mu_{1:n}^2 \right) + w_{n+1:n+m} \left(\frac{\hat\sigma_{n+1:n+m}^2}{\xi_{n+1:n+m}} + \hat\mu_{n+1:n+m}^2 \right) - w_{1:n+m} \hat\mu_{1:n+m}^2 \right) ~. \end{equation}
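The merging rule above can be sketched in standalone Fortran as follows; the names are illustrative and the bias-correction factors are assumed to be one (i.e., biased variances). See the setVarMerged and setVarMeanMerged interfaces of this module for the corresponding library routines.

    pure subroutine setVarMeanMergedSketch(varAB, meanAB, varA, meanA, wsumA, varB, meanB, wsumB)
        ! Sketch: merge the biased (xi = 1) variances and means of two (weighted) samples A and B.
        real, intent(out) :: varAB, meanAB          ! biased variance and mean of the merged sample
        real, intent(in)  :: varA, meanA, wsumA     ! biased variance, mean, and weight sum of sample A
        real, intent(in)  :: varB, meanB, wsumB     ! biased variance, mean, and weight sum of sample B
        real :: wsumAB
        wsumAB = wsumA + wsumB
        meanAB = (wsumA * meanA + wsumB * meanB) / wsumAB
        varAB = (wsumA * (varA + meanA**2) + wsumB * (varB + meanB**2)) / wsumAB - meanAB**2
    end subroutine setVarMeanMergedSketch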

Note
Note that the effects of bias-correction in computing the variance become noticeable only for small sample sizes (i.e., when nsam is small).
For a two- or higher-dimensional sample, if the variance is to be computed for the entire sample (as opposed to computing it along a particular dimension), simply pass reshape(sample, shape = [size(sample)]) to the appropriate getVar interface.
Alternatively, a 1D pointer of the same size as the multidimensional sample can be passed to the procedure.
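For instance, a hypothetical usage of this tip might look like the following sketch, where the exact getVar argument list and return kind are assumed from its documented behavior rather than verified against the interface documentation:

    program example_flattened_variance
        use pm_sampleVar, only: getVar
        implicit none
        real :: sample(3, 1000)     ! a 2D sample with ndim = 3 attributes and nsam = 1000 points
        real :: varAll
        call random_number(sample)
        ! Variance of all elements of the sample, treated as a single 1D sample.
        varAll = getVar(reshape(sample, shape = [size(sample)]))
        print *, varAll
    end program example_flattened_variance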
While it is tempting to extend this generic interface to weight arguments of type integer or real of various kinds, such extensions do not appear to add any tangible benefits beyond making the interface more flexible for the end user.
But such extensions would certainly make the maintenance and future extensions of this interface difficult and complex.
In the case of integer (frequency) weights,
  1. The summation sum(weight) involved in the computation may lead to an integer-overflow if individual weights are too large.
  2. Avoiding overflow would then require coercing the weights to real before summation, which adds an extra layer of unnecessary type coercion.
  3. Furthermore, according to the coercion rules of the Fortran standard, if an integer is multiplied with a real, the integer value must be first converted to real of the same kind as the real value, then multiplied.
  4. The type coercion to real will have to happen a second time when the weights are multiplied with the data values.
  5. Each integer-to-real type coercion costs roughly as much as a real multiplication on modern hardware (see, e.g., this thread).
By contrast,
  1. Real-valued weights, even when the weights are counts, do not require type coercion as long as the real values in the computation are of the same kind, as is the case here.
  2. Floating-point multiplication tends to be faster than integer multiplication on most modern architectures.
  3. However, real-valued weight summation is 4-8 times more expensive than integer addition, though still cheaper than real multiplication.
Considering all of the above factors, there do not appear to be any performance benefits to providing dedicated interfaces for weight arguments of different types and kinds.
The following list compares the cost and latencies of some of the basic operations involving integer and real numbers.
  1. Central Processing Unit (CPU):
    1. Integer add: 1 cycle
    2. 32-bit integer multiply: 10 cycles
    3. 64-bit integer multiply: 20 cycles
    4. 32-bit integer divide: 69 cycles
    5. 64-bit integer divide: 133 cycles
  2. On-chip Floating Point Unit (FPU):
    1. Floating point add: 4 cycles
    2. Floating point multiply: 7 cycles
    3. Double precision multiply: 8 cycles
    4. Floating point divide: 23 cycles
    5. Double precision divide: 36 cycles

Generalization of variance

Variance of complex variables

If \(x\) is a scalar complex-valued random variable, with values in \(\mathbb{C}\), then its variance is \(\up{E} \left[(x-\mu )(x-\mu )^{*}\right]\), where \(x^{*}\) is the complex conjugate of \(x\).
This variance is a real scalar.
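A minimal standalone sketch of this definition (illustrative names, not the module implementation):

    pure function getVarComplexSample(x) result(var)
        ! Sketch: the variance of a complex sample is the sample mean of
        ! (x - mu) * conjg(x - mu) = |x - mu|**2, a real, non-negative scalar.
        complex, intent(in) :: x(:)
        complex :: mu
        real :: var
        mu = sum(x) / size(x)
        var = sum(real((x - mu) * conjg(x - mu))) / size(x)
    end function getVarComplexSample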

Matrix variance of vector-valued variables

If \(X\) is a vector-valued random variable, with values in \(\mathbb{R}^{n}\), and thought of as a column vector, then a natural generalization of variance is \(\up{E}\left[(X-\mu)(X-\mu)^{\up{T}}\right]\), where \(\mu = \up{E}(X)\) and \(X^{\up{T}}\) is the transpose of \(X\), and so is a row vector.
The result is a positive semi-definite square matrix, commonly referred to as the variance-covariance matrix (or simply as the covariance matrix).

Matrix variance of complex vector-valued variables

If \(X\) is a vector- and complex-valued random variable, with values in \(\mathbb {C}^{n}\), then the covariance matrix is \(\up{E} \left[(X-\mu)(X-\mu)^{\dagger}\right]\), where \(X^{\dagger}\) is the conjugate transpose of \(X\).
This matrix is also positive semi-definite and square.

Extension of scalar variance to higher dimensions

Another generalization of variance for vector-valued random variables \(X\), which results in a scalar value rather than in a matrix, is the generalized variance \(\det(C)\), the determinant of the covariance matrix.
The generalized variance can be shown to be related to the multidimensional scatter of points around their mean.

A different generalization is obtained by considering the variance of the Euclidean distance between the random variable and its mean.
This results in \(\up{E} \left[(X-\mu)^{\up {T}}(X-\mu)\right] = \up{tr}(C)\), which is the trace of the covariance matrix.
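For a bivariate (2 x 2) covariance matrix, both scalar generalizations reduce to elementary expressions, as in the following standalone sketch (illustrative names):

    pure subroutine getScalarVariances(cov, genVar, totVar)
        ! Sketch: the generalized variance is det(C); the total variance is tr(C).
        real, intent(in)  :: cov(2, 2)  ! covariance matrix of a bivariate random variable
        real, intent(out) :: genVar     ! generalized variance: det(C)
        real, intent(out) :: totVar     ! total variance: tr(C)
        genVar = cov(1, 1) * cov(2, 2) - cov(1, 2) * cov(2, 1)
        totVar = cov(1, 1) + cov(2, 2)
    end subroutine getScalarVariances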

Summary

  1. Variance is a measure of dispersion in data.
  2. If the population mean is known a priori independent of the current sample based upon which the variance is to be computed, then the computed sample variance is an unbiased estimate of the true population variance.
  3. If the population mean is unknown a priori and must be computed from the same sample based upon which the variance is to be computed, then the computed sample variance is a biased estimate of the true population variance.
  4. This variance bias can be corrected by applying appropriate correction factors to the computed variances.
  5. The most famous such correction is called the Bessel correction.
See also
pm_sampling
pm_sampleACT
pm_sampleCCF
pm_sampleCor
pm_sampleCov
pm_sampleConv
pm_sampleECDF
pm_sampleMean
pm_sampleNorm
pm_sampleQuan
pm_sampleScale
pm_sampleShift
pm_sampleWeight
pm_sampleAffinity
pm_sampleVar
Box and Tiao, 1973, Bayesian Inference in Statistical Analysis, Page 421.
West, D.H.D., 1979, Updating mean and variance estimates: an improved method.
Geisser and Cornfield, 1963, Posterior distributions for multivariate normal parameters.
Test:
test_pm_sampleVar
Bug:

Status: Unresolved. See this page for more information.

Source: GNU Fortran Compiler gfortran
Description: Ideally, there should be only one generic interface in this module for computing the biased/corrected/weighted variance.
This requires the ability to resolve the different weight types, which in turn requires custom derived types for the weights.
Fortran PDTs are ideal for such use cases. However, the implementation of PDTs is far from complete in the GNU Fortran Compiler gfortran.

Remedy (as of ParaMonte Library version 2.0.0): Given the importance of GNU Fortran Compiler gfortran support, separate generic interfaces were instead developed for the different sample weight types.
Once the GNU Fortran Compiler gfortran PDT bugs are resolved, the getVar generic interface can be extended to serve as a high-level wrapper for the weight-specific generic interfaces in this module.


Final Remarks


If you believe this algorithm or its documentation can be improved, we appreciate your contribution and help to edit this page's documentation and source file on GitHub.
For details on the naming abbreviations, see this page.
For details on the naming conventions, see this page.
This software is distributed under the MIT license with additional terms outlined below.

  1. If you use any parts or concepts from this library to any extent, please acknowledge the usage by citing the relevant publications of the ParaMonte library.
  2. If you regenerate any parts/ideas from this library in a programming environment other than those currently supported by this ParaMonte library (i.e., other than C, C++, Fortran, MATLAB, Python, R), please also ask the end users to cite this original ParaMonte library.

This software is available to the public under a highly permissive license.
Help us justify its continued development and maintenance by acknowledging its benefit to society, distributing it, and contributing to it.

Todo:
Normal Priority: The inclusion of bias correction in the calculation of variance is a frequentist abomination and shenanigan that must be eliminated in the future.
The correction factor should be computed separately from the actual variance calculation.
Author:
Amir Shahmoradi, Nov 24, 2020, 4:19 AM, Dallas, TX
Fatemeh Bagheri, Thursday 12:45 AM, August 20, 2021, Dallas, TX
Amir Shahmoradi, Monday March 6, 2017, 2:48 AM, Institute for Computational Engineering and Sciences (ICES), The University of Texas at Austin.

Variable Documentation

◆ MODULE_NAME

character(*, SK), parameter pm_sampleVar::MODULE_NAME = "@pm_sampleVar"

Definition at line 403 of file pm_sampleVar.F90.