Computing and removing the autocorrelation of a dataset

Problem

Consider the following Banana function.

def getLogFuncBanana(point):

    import numpy as np
    from scipy.stats import multivariate_normal as mvn
    from scipy.special import logsumexp
    
    NPAR = 2                        # sum(Banana,gaussian) normalization factor
    normfac = 0.3                   # sum(Banana,gaussian) normalization factor
    lognormfac = np.log(normfac)    # sum(Banana,gaussian) normalization factor
    a, b = 0.7, 1.5                 # parameters of the Banana function
    MeanB = [ -5.0 , 0. ]           # mean vector of Banana function
    MeanG = [  3.5 , 0. ]           # mean vector of Gaussian function
    CovMatB     = np.reshape([0.25,0.,0.,0.81], newshape = (NPAR,NPAR))    # Covariance matrix of Banana function
    CovMatG     = np.reshape([0.15,0.,0.,0.15], newshape = (NPAR,NPAR))    # Covariance matrix of Gaussian function

    LogProb = np.zeros(2)

    # transformed parameters that transform the Gaussian to the Banana function

    pointSkewed = [ -point[0], +point[1] ]

    # Gaussian function

    LogProb[0] = lognormfac + mvn.logpdf(x = pointSkewed, mean = MeanG, cov = CovMatG) # logProbBanana

    # Do variable transformations for the Skewed-Gaussian (banana) function.

    pointSkewed[1]  = pointSkewed[1] * a
    pointSkewed[0]  = pointSkewed[0] / a - (pointSkewed[1]**2 + a**2) * b

    # Banana function

    LogProb[1] = mvn.logpdf(x = pointSkewed, mean = MeanB, cov = CovMatB) # logProbBanana

    return logsumexp(LogProb)

We wish to generate random sample from the distribution function represented by this Python function. We do so via the ParaMonte library’s ParaDRAM MCMC sampler,

!pip install --upgrade --user paramonte
import paramonte as pm
sim = pm.paradram()
sim.spec.chainSize = 30000
sim.runSampler( ndim = 2
              , getLogFunc = getLogFuncBanana
              )

This sampler outputs an MCMC chain that we can subsequently visualize,

%matplotlib notebook
import numpy as np
chain = sim.readChain(renabled = True)[0]
chain.df["Banana Function Value"] = np.exp(chain.df.SampleLogFunc.values)

chain.plot.contour3()
chain.plot.contour3.savefig(fname = "bananaFuncContour3.png")

chain.plot.scatter3.scatter.kws.s = 0.03
chain.plot.scatter3.scatter.kws.cmap = "winter"
chain.plot.scatter3(zcolumns = "Banana Function Value", ccolumns = "Banana Function Value")
chain.plot.scatter3.savefig(fname = "bananaFuncScatter3.png")

Now, this MCMC chain is a time-series data, meaning that we can compute its autocorrelation (for each data attribute). The ParaMonte library does this for us automatically which we can visualize via,

chain.stats.autocorr.plot.line()
chain.stats.autocorr.plot.line.savefig(fname = "bananaCompactChainACF.png")

Obviously, the three attributes of this chain are autocorrelated. But, we can remove traces of autocorrelation by choosing an appropriate step by which we jump over (skip) the data to thin (or reduce or decorrelate or refine the chain). Choose such an appropriate step size and refine the data in chain.df and then compute the autocorrelation of the refined data via scipy.signal.correlate function. Then, visualize it similar to the above illustration by the ParaMonte library to ensure the refinement process has truly removed the autocorrelation from your data.

import numpy as np
from scipy.signal import correlate

attribute = attribute - np.mean(attribute)

nlag = len(attribute) - 1
acf = np.zeros(nlag)

acf = correlate ( attribute
                , attribute
                , mode = "full"
                )[nlag:2*nlag]
acf = acf / acf[0]

Computing and removing the autocorrelation of a dataset

Problem

Amir Shahmoradi

Comments