Problem Part A

Consider the following web-page address: https://cdslaborg.github.io/DataRepos_SwiftBat/index.html. This is an HTML data table containing data from the NASA Swift satellite. Each row in this table represents information about a Gamma-Ray Burst (GRB) detection that Swift has made in past years. Corresponding to each of the event IDs, there might exist a file that contains some attributes of the event, which we wish to plot so that we can understand how the attributes behave against each other.

For example, for the first event in this table, GRB170406x (00745966), there is a data file hidden in a directory at https://cdslaborg.github.io/DataRepos_SwiftBat/ep_flu/GRB00745966_ep_flu.txt. Notice how the event ID (inside parentheses) is inserted into the web address. For each event in the GRB event table, there might exist one such text file hidden on the web directory, with an address beginning with https://cdslaborg.github.io/DataRepos_SwiftBat/ep_flu/ followed by the GRB's ID and ending with .txt.

Our goal here is to fetch all of these files from the website and save them locally on our computer, then read their contents one by one and plot the two columns of data from all of the files together on a single plot.

Write a script named fetchDataFromWeb, in a language of your own choice, that uses this web address: https://raw.githubusercontent.com/cdslaborg/DataRepos_SwiftBat/master/triggers.txt to read the list of all GRB events and then writes the entire table in triggers.txt to a local file of the same name on your device.

Problem Part B

Now, add to your script another set of commands that uses the event IDs stored in this file to generate the corresponding web addresses, such as,

https://cdslaborg.github.io/DataRepos_SwiftBat/ep_flu/GRB00745966_ep_flu.txt.

Your script should then use each generated web address to read the content of the page and store it in a local file on your device with the same name as on the web page (for example, for the web page given above, the filename would be GRB00745966_ep_flu.txt).

Python
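Here is a minimal Python sketch of Parts A and B, assuming the URLs given above and using only the standard library; the local file and variable names are illustrative, not prescribed by the problem.

#!/usr/bin/env python

import os
import urllib.request
import urllib.error

triggersUrl = "https://raw.githubusercontent.com/cdslaborg/DataRepos_SwiftBat/master/triggers.txt"
dataReposUrl = "https://cdslaborg.github.io/DataRepos_SwiftBat/ep_flu/"

# Part A: fetch triggers.txt and write it to a local file of the same name.
triggers = urllib.request.urlopen(triggersUrl).read().decode()
with open("triggers.txt", "w") as fout:
    fout.write(triggers)

# Part B: for each event ID, try to fetch the corresponding data file.
os.makedirs("ep_flu", exist_ok = True)
missing = 0
for eventId in triggers.split():
    fileName = "GRB" + eventId + "_ep_flu.txt"
    try:
        content = urllib.request.urlopen(dataReposUrl + fileName).read().decode()
    except urllib.error.HTTPError:
        missing += 1  # not every event has a corresponding data file
        continue
    with open(os.path.join("ep_flu", fileName), "w") as fout:
        fout.write(content)

print("{} files were missing from the repository.".format(missing))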

Problem Part C

Now write another script, named plotDatafromFile, that reads all of the downloaded files one by one and plots the contents of all of them together on a single scatter plot like the following,


Problem Part D

Write a script that performs a linear fit to the data that you have plotted, and report the best-fit slope and intercept that you obtain.
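No reference implementation for this part appears in the solutions below; the following is a minimal Python sketch that fits a line in log-log space, assuming the data files were downloaded to ./ep_flu/ as in Part B and filtered the same way as in the plotting scripts below.

import os
import numpy as np

# Pool the log-transformed (fluence, epeak) pairs from all downloaded files.
x, y = [], []
for file in os.listdir("ep_flu"):
    if file.endswith("ep_flu.txt"):
        data = np.loadtxt(os.path.join("ep_flu", file), skiprows = 1)
        data = np.atleast_2d(data)         # guard against single-row files
        if data.size != 0 and all(data[:,1] < 0.0):
            x.append(data[:,1])            # log(fluence): already in log space
            y.append(np.log(data[:,0]))    # log(epeak)
x = np.concatenate(x)
y = np.concatenate(y)

# Fit log(epeak) = slope * log(fluence) + intercept.
slope, intercept = np.polyfit(x, y, deg = 1)
print("best-fit slope = {:.4f}, intercept = {:.4f}".format(slope, intercept))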

Problem Part E

Use ParaMonte or any other Monte Carlo sampling package of your choice to also compute the uncertainties associated with the slope and intercept of the linear fit to the data.
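The solutions below do not cover this part. As one possible approach (using the emcee package rather than ParaMonte, purely as an illustration), one can sample the posterior of the slope, intercept, and an intrinsic log-scatter parameter, starting from the pooled arrays x, y and the polyfit estimates produced by the sketch in Part D above:

import numpy as np
import emcee

def getLogPosterior(params, x, y):
    """Gaussian likelihood with unknown scatter sigma and flat priors."""
    slope, intercept, logSigma = params
    if not (-10. < logSigma < 10.): return -np.inf  # crude flat prior bounds
    sigma = np.exp(logSigma)
    residual = y - (slope * x + intercept)
    return -0.5 * np.sum(residual**2 / sigma**2 + np.log(2 * np.pi * sigma**2))

ndim, nwalkers = 3, 32
start = np.array([slope, intercept, 0.0])  # start near the polyfit solution
p0 = start + 1e-4 * np.random.randn(nwalkers, ndim)
sampler = emcee.EnsembleSampler(nwalkers, ndim, getLogPosterior, args = (x, y))
sampler.run_mcmc(p0, 5000, progress = True)

samples = sampler.get_chain(discard = 1000, flat = True)
print("slope     = {:.4f} +/- {:.4f}".format(samples[:,0].mean(), samples[:,0].std()))
print("intercept = {:.4f} +/- {:.4f}".format(samples[:,1].mean(), samples[:,1].std()))

The reported standard deviations of the marginal posterior samples serve as the uncertainties of the slope and intercept.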

Problem Part F

By now, you may have realized that each file that you downloaded and visualized above contains information for a single observed event, and that the rows in each file correspond to different possible realizations of the same observation. In other words, the rows together represent the uncertainty in the observation corresponding to the file. The scatter plot that you have made illustrates all such possible realizations of all observations in a single visualization. While the result looks beautiful here, sometimes it does not, and you may need to reduce the data for a better illustration. For example, one common method is to represent each observation by the mean value of its data and add $1\sigma$ standard-deviation error bars to the mean values in the plots.

Now, using the data-reduction method suggested above, make an error-bar plot of the data. For reference, here is an example error-bar plot of the reduced data, showing only the means and standard deviations.


Compare this graph with the previous one that you generated.

Problem Part G

Although the above graph looks simpler than the original plot containing all thousands of points, the error bars do not correctly represent the uncertainties. Why? Because the y-error is correlated with the x-error in most observations. To see this, compare the original plot of all data points with the reduced-data plot. How can we fix this problem, while keeping the plot simple, using only reduced data? The remedy is to use 2D ellipses to represent the $1\sigma$ standard deviations of the data in both attributes. Unlike simple error bars, ellipses can capture potential correlations between the two attributes.

Now, instead of representing the attribute error bars independently and individually in the plot, use a single ellipsoid to represent the $1\sigma$ uncertainty of each data point. For this, you will have to form the bivariate covariance matrix of the uncertainty for each observation. Then you pass this covariance matrix along with the mean to a function that returns a set of representative points on the boundary of the ellipse corresponding to the mean and covariance matrix. Then you plot these representative points as a closed line on the current plot. You repeat this process for all observations to obtain the full illustration as depicted below.
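As a side note on the geometry involved (not part of the original problem statement): if $\mu$ is the mean vector and $\Sigma$ is the $2\times2$ covariance matrix with Cholesky factor $L$ (so that $\Sigma = LL^T$), then the points

$$ p(\theta) = \mu + L \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix} ~, \quad \theta \in [0, 2\pi) ~, $$

trace the boundary of the $1\sigma$ ellipse. This is the construction used by the boundary-generating sketch given in the solution further below.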


Unlike the previous plots, this ellipsoidal illustration reveals something new that we had not seen before: the bivariate uncertainties of observations at high epeak values are positively correlated, whereas those in the low-epeak part of the plot are negatively correlated.

To get the color map seen in the plot, you can use this helper script to define the color of each ellipsoid drawn in the figure.

#### set up color mapping via correlations

cmap = Struct()
cmap.values = []
for i in range(0,stat.ndata): cmap.values.append(stat.cor[i][0,1])
cmap.values = np.array(cmap.values)
cmap.name = "coolwarm"

import matplotlib.colors as colors
import matplotlib.cm as cmx

cmapObject = plt.get_cmap(cmap.name)
#cNorm = colors.Normalize( vmin = np.min(cmap.values), vmax = np.max(cmap.values), clip = True )
cNorm = colors.Normalize( vmin=-1., vmax=1., clip = True )
mappable = cmx.ScalarMappable( norm = cNorm, cmap = cmapObject )
cmap._rgba = np.zeros( ( len(cmap.values) , 4) )
for i, rho in enumerate(cmap.values): cmap._rgba[i,:] = mappable.to_rgba(cmap.values[i])

Here in this code, stat is a data structure containing summary statistics of the data, including the mean (stat.avg), covariance (stat.cov), and correlation (stat.cor) matrices for each observation. Describe what each line of this helper script does in your answer.

Solution Part A-D

MATLAB

Here is an example implementation of the MATLAB script to fetch data from the web, fetchDataFromWeb.m,

% define URL variables
dataReposUrl = 'https://cdslaborg.github.io/DataRepos_SwiftBat/';

% create data directory
dataDirectory = 'ep_flu/';
mkdir(dataDirectory);

% read the list of event IDs and save the trigger table locally under the same name
triggerList = webread([dataReposUrl,'triggers.txt'], 'ContentType', 'text');
fid = fopen('triggers.txt','w');
fprintf(fid,'%s',triggerList);
fclose(fid);
triggerList = strsplit(strtrim(triggerList),'\n')';

missingFileCounter = 0;
for i = 1:length(triggerList)
    filename = ['GRB',strtrim(triggerList{i}),'_ep_flu.txt'];
    myUrl = [dataReposUrl,'ep_flu/',filename];
    disp(['Fetching file ',filename,'...']);
    try
        data = webread(myUrl);
    catch
        missingFileCounter = missingFileCounter + 1;
        warning('file not found');
        continue;
    end
    fid = fopen([dataDirectory,filename],'w');
    fprintf(fid,'%s',data);
    fclose(fid);
end

disp(['There were a total of ', num2str(missingFileCounter), ' files missing in this repository.']);

Here is an example implementation of the MATLAB script to make the scatter plot, plotDatafromFile.m,

close all;

% Now read all data files and plot the results
dataDirectory = './ep_flu/';
triggerListDirectory = './';
triggerList = fileread([triggerListDirectory,'triggers.txt']);
triggerList = strsplit(strtrim(triggerList),'\n');
missingFileCounter = 0;

% first read data from files and put them all in a single Cell array:

if exist('MyDataCell','var')
    warning('data cell array already exists in the MATLAB environment. Using the available data...');
else
    MyDataCell = cell(1,length(triggerList));
    for i = 1:length(triggerList)
        disp(['reading data from file number ', num2str(i)]);
        filePath = [dataDirectory,'GRB',strtrim(triggerList{i}),'_ep_flu.txt'];
        if exist(filePath,'file')
            MyDataCell{i} = table2array(readtable(filePath));
        else
            missingFileCounter = missingFileCounter + 1;
            disp(['missing file detected: #', num2str(missingFileCounter)]);
        end
    end
end

% now plot the data:

figure(); hold on; box on;
eventCounter = 0;
for i = 1:length(triggerList)
    disp(['plotting data from file number ', num2str(i)]);
    if ~isempty(MyDataCell{i}) && all(MyDataCell{i}(:,2)<0)
        eventCounter = eventCounter + 1;
        scatter( exp(MyDataCell{i}(:,2)) ...
                , MyDataCell{i}(:,1) ...
                , 1 ...
                , 'MarkerFaceColor', 'red' ...
                , 'MarkerEdgeColor', 'red' ...
                , 'MarkerFaceAlpha', .01 ...
                , 'MarkerEdgeAlpha', .01 ...
                );
    end
end
set( gca ...
   , 'xscale','log' ...
   , 'yscale','log' ...
   , 'xlim', [1e-8 1e-1] ...
   );
xlabel( 'Fluence [ ergs/cm^2 ]' ...
      , 'interpreter', 'tex' ...
      , 'fontsize', 12 ...
      );
ylabel( 'E_{peak}' ...
      , 'interpreter', 'tex' ...
      , 'fontsize', 12 ...
      );
title( ['Plot of E_{peak} vs. Fluence for ', num2str(eventCounter), ' Swift GRB events' ] ...
     , 'interpreter', 'tex' ...
     , 'fontsize', 12 ...
     );
saveas( gcf ...
      , 'SwiftDataPlot.png' ...
      );
Python

Here is an example implementation of the script in Python, readPlotWebData.py,

#!/usr/bin/env python

import urllib.request
import urllib.error

def parseTable(html):
    #Each "row" of the HTML table will be a list, and the items
    #in that list will be the TD data items.
    ourTable = []

    #We keep these set to NONE when not actively building a
    #row of data or a data item.
    ourTD = None    #Stores one table data item
    ourTR = None    #List to store each of the TD items in.


    #State we keep track of
    inTable = False
    inTR = False
    inTD = False

    #Start looking for a start tag at the beginning!
    tagStart = html.find("<", 0)

    while( tagStart != -1):
        tagEnd = html.find(">", tagStart)

        if tagEnd == -1:    #We are done, return the data!
            return ourTable

        tagText = html[tagStart+1:tagEnd]

        #only look at the text immediately following the <
        tagList = tagText.split()
        tag = tagList[0]
        tag = tag.lower()

        #Watch out for TABLE (start/stop) tags!
        if tag == "table":      #We entered the table!
            inTable = True
        if tag == "/table":     #We exited a table.
            inTable = False

        #Detect/Handle Table Rows (TR's)
        if tag == "tr":
            inTR = True
            ourTR = []      #Started a new Table Row!

        #If we are at the end of a row, add the data we collected
        #so far to the main list of table data.
        if tag == "/tr":
            inTR = False
            ourTable.append(ourTR)
            ourTR = None

        #We are starting a Data item!
        if tag== "td":
            inTD = True
            ourTD = ""      #Start with an empty item!
            
        #We are ending a data item!
        if tag == "/td":
            inTD = False
            if ourTD is not None and ourTR is not None:
                ourTR.append( ourTD.strip() )   #Remove extra spaces
            ourTD = None
            

        #Look for the NEXT start tag. Anything between the current
        #end tag and the next Start Tag is potential data!
        tagStart = html.find("<", tagEnd+1)
        
        #If we are in a Table, and in a Row and also in a TD,
        # Save anything that's not a tag! (between tags)
        #
        #Note that this may happen multiple times if the table
        #data has tags inside of it!
        #e.g. <td>some <b>bold</b> text</td>
        #
        #Because of this, we need to be sure to put a space between each
        #item that may have tags separating them. We remove any extra
        #spaces (above) before we append the ourTD data to the ourTR list.
        if inTable and inTR and inTD:
            ourTD = ourTD + html[tagEnd+1:tagStart] + " "
            #print("td:", ourTD)   #for debugging


    #If we end the while loop looking for the next start tag, we
    #are done, return ourTable of data.
    return(ourTable)
        
def fetchHtmlTable(link,outPath):
    response = urllib.request.urlopen(link)
    html_bytes = response.read()
    html = html_bytes.decode()
    with open(outPath+link.split('/')[-1] , 'w') as fout:
        fout.write(html)
    
    dataTable = parseTable(html)
    with open(outPath+link.split('/')[-1]+'.tab' , 'w') as fout:
        for row in dataTable:
            for item in row:
                fout.write('{:>30}'.format(item))
            fout.write('\n')
    return dataTable
    
def fetchBatFiles(dataTable,outPath,fileName = 'ep_flu.txt'):  # fetch individual files in Swift repository
    root = 'https://cdslaborg.github.io/DataRepos_SwiftBat/ep_flu/'
    counter = 0
    missing = 0
    for row in dataTable:
        grb_id = row[0].split('(')[-1][:-1]
        #print(grb_id)
        try:
            counter += 1
            link = root + 'GRB' + grb_id + '_' + fileName
            response = urllib.request.urlopen(link)
            webFile_bytes = response.read()
            webFile = webFile_bytes.decode()
            with open(outPath+'/GRB'+grb_id+'_'+fileName , 'w') as fout: fout.write(webFile)
        except urllib.error.HTTPError:
            missing += 1
            print('\nurl: {} does not exist.'.format(link))
            print('file for GRB ID: {} is missing. {} files missing out of {} events so far.\n'.format(grb_id,missing,counter))
    

def plotBatFiles(inPath,figFile):
    import os
    import numpy as np
    import matplotlib.pyplot as plt
    ax = plt.gca()
    ax.set_xlabel('Fluence [ ergs/cm^2 ]')
    ax.set_ylabel('Epeak [ keV ]')
    ax.axis([1.0e-8, 1.0e-1, 1.0, 1.0e4]) # [xmin, xmax, ymin, ymax]
    ax.set_yscale('log')
    ax.set_xscale('log')
    #plt.hold('on')
    counter = 0
    for file in os.listdir(inPath):
        if file.endswith("ep_flu.txt"):
            data = np.loadtxt(os.path.join(inPath, file), skiprows=1)
            if data.size!=0 and all(data[:,1]<0.0):
                counter += 1
                if counter%100==0 : print('Reading file # {}: {}'.format(counter,file))
                data[:,1] = np.exp(data[:,1])
                #colors = np.full(data[:,1].size,np.random.rand())
                ax.scatter(data[:,1],data[:,0],s=1,alpha=0.05,c='r',edgecolors='none')
    ax.set_title('Plot of Epeak vs. Fluence for {} Swift GRB events'.format(counter))
    plt.savefig(figFile)
    plt.show()

def main():
    import sys
    if len(sys.argv)!=3:
        print("""
        USAGE:
              ./readPlotWebData.py <web address: https://cdslaborg.github.io/DataRepos_SwiftBat/index.html> <path to output files, e.g.: ./>
              """)
    else:
        link = sys.argv[1]
        outPath = sys.argv[2]
        #fetchBatFiles( fetchHtmlTable(link,outPath) , outPath , fileName = 'ep_flu.txt')
        figFile = 'SwiftDataPlot.png'
        plotBatFiles(outPath,figFile)

if __name__ == "__main__":
    main()

Solution Part F

Python

First, we write a function that extracts the first two statistical moments of the possible realizations of each observation given in separate files. Then, we can visualize the first two moments of all observations in a single plot,

import os
import numpy as np
import matplotlib.pyplot as plt

class Struct(): pass
def genMeanStd(dirPath = "data"):
    """
    Return an object of class Struct containing the means and
    the standard deviations of the two attributes as components.
    """
    stat = Struct()
    stat.ndata = 0
    stat.logepeak = Struct()
    stat.logfluence = Struct()
    stat.logepeak.avg = []
    stat.logepeak.std = []
    stat.logfluence.avg = []
    stat.logfluence.std = []
    for file in os.listdir(dirPath):
        if file.endswith("ep_flu.txt"):
            data = np.loadtxt(os.path.join(dirPath, file), skiprows=1)
            if data.size!=0 and all(data[:,1]<0.0):
                stat.ndata += 1
                if stat.ndata%100==0 : print('Reading file # {}: {}'.format(stat.ndata,file))
                data[:,0] = np.log(data[:,0])
                stat.logepeak.std.append(np.std(data[:,0]))
                stat.logepeak.avg.append(np.mean(data[:,0]))
                stat.logfluence.std.append(np.std(data[:,1]))
                stat.logfluence.avg.append(np.mean(data[:,1]))
    return stat

stat = genMeanStd(dirPath = "data")

# Now visualize the summarized data

fig = plt.figure(figsize = 1.5*np.array([6.4,4.8]), dpi = 150)
ax = plt.subplot()
ax.set_xlabel(r"Log( Fluence [ ergs/cm$^2$ ] )", fontsize = 17)
ax.set_ylabel(r"Log( Epeak [ keV ] )", fontsize = 17)
ax.axis(np.log([1.0e-8, 1.0e-1, 1.0, 1.0e4])) # [xmin, xmax, ymin, ymax]

ax.errorbar ( stat.logfluence.avg
            , stat.logepeak.avg
            , yerr = stat.logepeak.std
            , xerr = stat.logfluence.std
            , elinewidth = 1
            , alpha = 0.4
            , ecolor = 'r'
            , fmt = "o"
            )

ax.set_title('Plot of Epeak vs. Fluence for {} Swift GRB events'.format(stat.ndata), fontsize = 17)
plt.savefig("SwiftDataPlotReduced.png")
plt.show()

Solution Part G

Python

We can modify our previously written script to return the first two moments of each event as a mean vector of length two (because each event is described by the two attributes, epeak and fluence) and a covariance matrix (cov), which describes both the spread of the attributes and their correlation with each other.

import os
import numpy as np
import matplotlib.pyplot as plt

class Struct(): pass

def genCorMat(covMat):
    """
    Return the correlation matrix corresponding to the input covariance matrix.
    """
    nd = len(covMat[:,0])
    corMat = np.zeros((nd,nd))
    # Compute the Standard deviation vector
    stdVec = np.zeros(nd)
    for i in range(nd): stdVec[i] = np.sqrt(covMat[i][i])
    for i in range(nd):
        corMat[i][i] = 1.
        for j in range(i+1,nd):
            corMat[i][j] = covMat[i][j] / (stdVec[i] * stdVec[j])
            corMat[j][i] = corMat[i][j]
    return corMat

def genMeanCorCov(dirPath = "data"):
    """
    Return an object of class Struct containing the means and
    the standard deviations of the two attributes as components.
    """
    stat = Struct()
    stat.ndata = 0
    stat.avg = []
    stat.cov = []
    stat.cor = []
    for file in os.listdir(dirPath):
        if file.endswith("ep_flu.txt"):
            data = np.loadtxt(os.path.join(dirPath, file), skiprows=1)
            if data.size != 0 and all(data[:,1]<0.0):
                stat.ndata += 1
                if stat.ndata%100 == 0: print('Reading file # {}: {}'.format(stat.ndata,file))
                data[:,0] = np.log(data[:,0])
                stat.avg.append(np.mean(data, axis = 0))
                stat.cov.append(np.cov(data, rowvar = False))
                stat.cor.append(genCorMat(stat.cov[-1]))
    return stat

stat = genMeanCorCov(dirPath = "data")

#### Now visualize the summarized data

fig = plt.figure(figsize = 1.5*np.array([6.4,4.8]), dpi = 150)
ax = plt.subplot()
ax.set_xlabel(r"Log( Fluence [ ergs/cm$^2$ ] )", fontsize = 17)
ax.set_ylabel(r"Log( Epeak [ keV ] )", fontsize = 17)
ax.axis([1.0e-8, 1.0e-1, 1.0, 1.0e4]) # [xmin, xmax, ymin, ymax]
ax.set_xscale("log")
ax.set_yscale("log")

#### set up color mapping via correlations

cmap = Struct()                     # empty container to hold the color-map specifications
cmap.values = []                    # will hold the correlation coefficient of each observation
for i in range(0,stat.ndata): cmap.values.append(stat.cor[i][0,1])  # off-diagonal element of each 2x2 correlation matrix
cmap.values = np.array(cmap.values) # convert the list of correlations to a numpy array
cmap.name = "coolwarm"              # name of the matplotlib color map to draw colors from

import matplotlib.colors as colors  # provides the Normalize class for mapping values onto [0,1]
import matplotlib.cm as cmx         # provides ScalarMappable for converting values to RGBA colors

cmapObject = plt.get_cmap(cmap.name)    # fetch the colormap object by name
#cNorm = colors.Normalize( vmin = np.min(cmap.values), vmax = np.max(cmap.values), clip = True )
cNorm = colors.Normalize( vmin=-1., vmax=1., clip = True )  # map the full correlation range [-1,+1] onto [0,1]
mappable = cmx.ScalarMappable( norm = cNorm, cmap = cmapObject )    # combine the norm and the colormap
cmap._rgba = np.zeros( ( len(cmap.values) , 4) )    # preallocate one RGBA color per observation
for i, rho in enumerate(cmap.values): cmap._rgba[i,:] = mappable.to_rgba(rho)   # assign each ellipse a color by its correlation
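
# The helper getEllipsoidBoundary() used below is not defined in the original
# script; the following is a minimal sketch of one possible implementation,
# which maps points on the unit circle through the Cholesky factor of the
# covariance matrix and shifts them by the mean (see the parameterization
# given in the Part G problem statement).
def getEllipsoidBoundary(covMat, avgVec, npoint = 100):
    """
    Return a 2-by-npoint array of points on the boundary of the 1-sigma
    ellipse corresponding to the input covariance matrix and mean vector.
    """
    theta = np.linspace(0., 2. * np.pi, npoint)
    circle = np.vstack((np.cos(theta), np.sin(theta)))  # points on the unit circle
    bcrd = np.linalg.cholesky(covMat) @ circle          # scale and rotate by the Cholesky factor
    bcrd[0,:] += avgVec[0]                              # shift to the mean of the first attribute
    bcrd[1,:] += avgVec[1]                              # shift to the mean of the second attribute
    return bcrd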

iskip = 1
for i in range(0,stat.ndata,iskip):
    bcrd = getEllipsoidBoundary(stat.cov[i],stat.avg[i])
    ax.plot ( np.exp(bcrd[1,:])
            , np.exp(bcrd[0,:])
            , color = cmap._rgba[i]
            , alpha = 0.25
            , linewidth = 0.7
            )

ax.set_title('Plot of Epeak vs. Fluence for {} Swift GRB events'.format(stat.ndata), fontsize = 17)
plt.savefig("SwiftDataPlotEllipse.png")
plt.show()
