Consider the following web-page address: https://cdslaborg.github.io/DataRepos_SwiftBat/index.html. This is a data table (in HTML) containing data from the NASA Swift satellite. Each row in this table represents information about a Gamma-Ray Burst (GRB) detection that Swift has made in past years. Corresponding to each of the event IDs, there may exist a file containing attributes of the event, which we wish to plot and understand against each other.
For example, for the first event in this table, GRB170406x (00745966), there is a data file hidden in a directory at https://cdslaborg.github.io/DataRepos_SwiftBat/ep_flu/GRB00745966_ep_flu.txt. Notice how the event ID (inside parentheses) is inserted into the web address. Now, for each event in the GRB event table, there might exist one such text file hidden on the web directory, with an address beginning with https://cdslaborg.github.io/DataRepos_SwiftBat/ep_flu/ followed by the GRB's ID and ending with .txt.
Our goal here is to fetch all these files from the website and save them locally on our computer. Then read their contents one by one and plot the two columns of data in all of these files together on a single plot.
Write a script named fetchDataFromWeb, in the language of your own choice, that uses this web address: https://raw.githubusercontent.com/cdslaborg/DataRepos_SwiftBat/master/triggers.txt to read the list of all GRB events and then writes the entire table of triggers.txt to a local file with the same name on your device.
Now, add to your script another set of commands that uses the event IDs stored in this file to generate the corresponding web addresses, like https://cdslaborg.github.io/DataRepos_SwiftBat/ep_flu/GRB00745966_ep_flu.txt. The script should then use each generated web address to read the content of the page and store it in a local file on your device with the same name as on the web (for the given example, the filename would be GRB00745966_ep_flu.txt).
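For instance, in Python, the address and local filename for a given event ID could be generated as follows (the helper name is illustrative, not part of the assignment):

```python
# Sketch: build the data-file URL and the local filename from a Swift event ID,
# following the address pattern described above.
def make_url_and_filename(event_id):
    filename = "GRB" + event_id + "_ep_flu.txt"
    url = "https://cdslaborg.github.io/DataRepos_SwiftBat/ep_flu/" + filename
    return url, filename

url, filename = make_url_and_filename("00745966")
```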
Warning: Some of the web addresses for the given event IDs do not exist. Therefore, you should use exception-handling constructs to avoid runtime errors that could terminate your code prematurely.
Tip: For the events for which the web address does not exist, for example, https://cdslaborg.github.io/DataRepos_SwiftBat/ep_flu/GRB00680331_ep_flu.txt, attempting to fetch the page will raise a urllib.error.HTTPError exception in Python. Write your code such that it can smoothly skip these events when the exception occurs.
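In Python, a minimal fetch helper that skips missing pages might look like the following sketch (the function name fetch_text is illustrative):

```python
import urllib.request
import urllib.error

def fetch_text(url):
    """Return the decoded page content, or None if the URL does not exist."""
    try:
        with urllib.request.urlopen(url) as response:
            return response.read().decode()
    except urllib.error.HTTPError as err:
        # a missing page raises HTTPError (e.g. code 404); skip it gracefully
        print("skipping missing file: {} ({})".format(url, err.code))
        return None
```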
Now write another script named plotDatafromFile, that reads all of the files in your downloaded set one by one and plots the contents of all of them together on a single scatter plot like the following,

Warning: Remember that some of the data files stored on your computer might be empty, and some others might contain unusable data: if any value in the second column of a file is larger than $0$, the data have a different scientific meaning than the rest (the meaning of which goes beyond the scope of what we are trying to do here), and such files can simply be ignored. Therefore, your script must check for non-emptiness of each file, that is, make sure the file does indeed contain some numerical data, and must also check that all values in the second column are negative. Once you have done all of these checks, one final manipulation of the data remains: the values in the second column of the data files are the natural logarithm (base $e$) of the data, so you will have to transform them back to the original non-log values via the mathematical function exp() before plotting (note that the plot in the figure above is a log-log plot, and we want to regenerate it exactly). Also, in order to find out how many files you have plotted in the figure, define a counter variable which increases by one unit each time a new non-empty, negative-second-column data file is read and plotted (ignoring files that have positive values in the second column as well as files that are empty). In the end, set a title for your plot, label the axes, and save it to an output file, for example, named SwiftDataPlot.png.
Write a script that performs a linear fit to the data you have plotted, and report the best-fit slope and intercept that you obtain.
Use ParaMonte or any other Monte Carlo sampling package of your choice to also compute the uncertainties associated with the slope and intercept of the linear fit to the data.
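No example implementation is given for this part in the text above, so here is a hedged sketch using NumPy only: np.polyfit for the best-fit line, and a simple bootstrap Monte Carlo (a stand-in for ParaMonte, which is not shown here) for the parameter uncertainties. The synthetic x and y arrays below are placeholders for the reduced log-fluence and log-epeak values from your own data.

```python
import numpy as np

rng = np.random.default_rng(12345)

# Hypothetical reduced data: per-event mean log(fluence) as x and mean
# log(epeak) as y. Replace these with the arrays from your own script.
x = rng.uniform(-16.0, -4.0, size=200)
slope_true, intercept_true = 0.4, 8.0
y = slope_true * x + intercept_true + rng.normal(0.0, 0.3, size=200)

# Best-fit slope and intercept via ordinary least squares.
slope, intercept = np.polyfit(x, y, deg=1)

# Bootstrap Monte Carlo: refit many resampled data sets, and take the
# standard deviation of the fitted parameters as their uncertainty.
nboot = 1000
params = np.empty((nboot, 2))
for i in range(nboot):
    idx = rng.integers(0, len(x), size=len(x))
    params[i] = np.polyfit(x[idx], y[idx], deg=1)
slope_err, intercept_err = params.std(axis=0)

print("slope = {:.3f} +/- {:.3f}".format(slope, slope_err))
print("intercept = {:.3f} +/- {:.3f}".format(intercept, intercept_err))
```

A full Bayesian treatment with ParaMonte would instead sample the posterior of the slope and intercept; the bootstrap above is only one simple way to attach uncertainties to the fit.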
Tip: For this part, you can get help from the instructor during the lectures to explain the concepts to you.
By now, you may have realized that each file that you downloaded and visualized above contains information for a single observed event, and that the rows in each file correspond to different possible realizations of the same observation. In other words, the rows together represent the uncertainty in the observation corresponding to the file. The above scatter plot illustrates all such possible realizations of all observations in a single visualization. While the result looks beautiful, sometimes it does not, and you may need to reduce the data for a better illustration. For example, one common method is to represent each observation with the mean value of the data and add $1\sigma$ standard-deviation error bars to the mean values in the plots.
Now, using the above suggested data-reduction method, make an error-bar plot of the data. For reference, here is an example error-bar plot of the reduced data, showing only the means and standard deviations.

Compare this graph with the previous one that you generated.
Although the above graph looks simpler than the original plot containing all thousands of points, the error bars do not correctly represent the uncertainties. Why? Because the y-error is correlated with x-error in most observations. To see this, compare the original plot of all data points with the latter reduced-data plot. How can we fix this problem, while keeping the plot simple, using only reduced data? The remedy is to use 2D ellipses to represent the $1\sigma$ standard deviations of data on both attributes. Unlike simple error bars, ellipses can capture potential correlations among the two attributes.
Now, instead of representing the attribute error bars independently and individually in the plot, use a single ellipsoid to represent the $1\sigma$ uncertainty for each data point. For this, you will have to form the bivariate covariance matrix of the uncertainty for each observation. Then you pass this covariance matrix, along with the mean, to this function to get a set of representative points on the boundary of the ellipse corresponding to the mean and covariance matrix. Then you plot these representative points as a closed line on the current plot. Repeat this process for all observations to obtain the full illustration as depicted below.
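The original ellipse-boundary helper (invoked later in this document as getEllipsoidBoundary) is not shown, so here is one possible sketch of it. It assumes a positive-definite 2x2 covariance matrix and uses the Cholesky factor to map points on the unit circle onto the $1\sigma$ ellipse:

```python
import numpy as np

def getEllipsoidBoundary(covMat, avgVec, npoint=100):
    """Return a 2 x npoint array of points on the 1-sigma boundary ellipse
    of the given 2D covariance matrix, centered on the mean vector."""
    # points on the unit circle
    theta = np.linspace(0.0, 2.0 * np.pi, npoint)
    circle = np.vstack((np.cos(theta), np.sin(theta)))
    # the Cholesky factor of the covariance matrix maps the unit circle
    # onto the covariance ellipse; shift by the mean to center it
    return np.linalg.cholesky(covMat) @ circle + np.asarray(avgVec).reshape(2, 1)
```

Every returned point p satisfies (p - avg)^T cov^{-1} (p - avg) = 1, which is exactly the $1\sigma$ contour of a bivariate Gaussian with that mean and covariance.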

Unlike the previous plots, this ellipsoidal illustration reveals something we had not seen before: the bivariate uncertainties of observations at high epeak values are positively correlated, whereas those in the low epeak part of the plot are negatively correlated.
To get the color map seen in the plot, you can use this helper script to define the color for each ellipsoid drawn in the figure.
#### set up color mapping via correlations
cmap = Struct()
cmap.values = []
for i in range(stat.ndata): cmap.values.append(stat.cor[i][0,1])
cmap.values = np.array(cmap.values)
cmap.name = "coolwarm"
import matplotlib.colors as colors
import matplotlib.cm as cmx
cmapObject = plt.get_cmap(cmap.name)
#cNorm = colors.Normalize( vmin = np.min(cmap.values), vmax = np.max(cmap.values), clip = True )
cNorm = colors.Normalize( vmin = -1., vmax = 1., clip = True )
mappable = cmx.ScalarMappable( norm = cNorm, cmap = cmapObject )
cmap._rgba = np.zeros( ( len(cmap.values), 4 ) )
for i, rho in enumerate(cmap.values): cmap._rgba[i,:] = mappable.to_rgba(rho)
Here in this code, stat is a data structure containing summary statistics of the data, including the mean (stat.avg), covariance (stat.cov), and correlation (stat.cor) matrices for each observation. Describe what each line of this helper script does in your answer.
Here is an example implementation of the MATLAB script to fetch data from the web, fetchDataFromWeb.m,
% define URL variables
dataReposUrl = 'https://cdslaborg.github.io/DataRepos_SwiftBat/';
% create the local data directory
dataDirectory = 'ep_flu/';
mkdir(dataDirectory);
% read the list of event IDs and save a local copy of triggers.txt
triggerList = webread([dataReposUrl,'triggers.txt'], 'ContentType', 'text');
fid = fopen('triggers.txt','w');
fprintf(fid,'%s',triggerList);
fclose(fid);
triggerList = strsplit(triggerList,'\n')';
missingFileCounter = 0;
for i = 1:length(triggerList)
    % strtrim removes any stray carriage-return or whitespace from the ID
    filename = ['GRB',strtrim(triggerList{i}),'_ep_flu.txt'];
    myUrl = [dataReposUrl,'ep_flu/',filename]; % note the slash after ep_flu
    disp(['Fetching file ',filename,'...']);
    try
        data = webread(myUrl);
    catch
        missingFileCounter = missingFileCounter + 1;
        warning('file not found');
        continue;
    end
    fid = fopen([dataDirectory,filename],'w');
    fprintf(fid,'%s',data);
    fclose(fid);
end
disp(['There were a total of ', num2str(missingFileCounter), ' files missing in this repository.']);
Here is an example implementation of the MATLAB script to make the scatter plot, plotDatafromFile.m,
close all;
% Now read all data files and plot the results
dataDirectory = './ep_flu/';
triggerListDirectory = './';
triggerList = fileread([triggerListDirectory,'triggers.txt']);
triggerList = strsplit(triggerList,'\n');
missingFileCounter = 0;
% first read data from the files and put them all in a single cell array:
if exist('MyDataCell','var')
    warning('data cell array already exists in the MATLAB environment. Using the available data...');
else
    MyDataCell = cell(1,length(triggerList));
    for i = 1:length(triggerList)
        disp(['reading data from file number ', num2str(i)]);
        % strtrim removes any stray carriage-return or whitespace from the ID
        filePath = [dataDirectory,'GRB',strtrim(triggerList{i}),'_ep_flu.txt'];
        if exist(filePath,'file')
            MyDataCell{i} = table2array(readtable(filePath));
        else
            missingFileCounter = missingFileCounter + 1;
            disp(['missing file detected: #', num2str(missingFileCounter)]);
        end
    end
end
% now plot the data:
figure(); hold on; box on;
eventCounter = 0;
for i = 1:length(triggerList)
    disp(['plotting data from file number ', num2str(i)]);
    if ~isempty(MyDataCell{i}) && all(MyDataCell{i}(:,2)<0)
        eventCounter = eventCounter + 1;
        scatter( exp(MyDataCell{i}(:,2)) ... % x: fluence, transformed back from its natural log
               , MyDataCell{i}(:,1) ...      % y: epeak
               , 1 ...
               , 'MarkerFaceColor', 'red' ...
               , 'MarkerEdgeColor', 'red' ...
               , 'MarkerFaceAlpha', .01 ...
               , 'MarkerEdgeAlpha', .01 ...
               );
    end
end
set( gca ...
   , 'xscale', 'log' ...
   , 'yscale', 'log' ...
   , 'xlim', [1e-8 1e-1] ...
   );
xlabel( 'Fluence [ ergs/cm^2 ]', 'interpreter', 'tex', 'fontsize', 12 );
ylabel( 'E_{peak}', 'interpreter', 'tex', 'fontsize', 12 );
title( ['Plot of E_{peak} vs. Fluence for ', num2str(eventCounter), ' Swift GRB events'] ...
     , 'interpreter', 'tex', 'fontsize', 12 );
saveas( gcf, 'SwiftDataPlot.png' );
Here is an example implementation of the script in Python, readPlotWebData.py,
#!/usr/bin/env python
import urllib.request
import urllib.error

def parseTable(html):
    # Each "row" of the HTML table will be a list, and the items
    # in that list will be the TD data items.
    ourTable = []
    # We keep these set to None when not actively building a
    # row of data or a data item.
    ourTD = None # stores one table data item
    ourTR = None # list to store each of the TD items in
    # State we keep track of
    inTable = False
    inTR = False
    inTD = False
    # Start looking for a start tag at the beginning!
    tagStart = html.find("<", 0)
    while tagStart != -1:
        tagEnd = html.find(">", tagStart)
        if tagEnd == -1: # We are done, return the data!
            return ourTable
        tagText = html[tagStart+1:tagEnd]
        # only look at the text immediately following the <
        tagList = tagText.split()
        tag = tagList[0].lower()
        # Watch out for TABLE (start/stop) tags!
        if tag == "table": # We entered the table!
            inTable = True
        if tag == "/table": # We exited a table.
            inTable = False
        # Detect/handle table rows (TRs)
        if tag == "tr":
            inTR = True
            ourTR = [] # started a new table row!
        # If we are at the end of a row, add the data we collected
        # so far to the main list of table data.
        if tag == "/tr":
            inTR = False
            ourTable.append(ourTR)
            ourTR = None
        # We are starting a data item!
        if tag == "td":
            inTD = True
            ourTD = "" # start with an empty item!
        # We are ending a data item!
        if tag == "/td":
            inTD = False
            if ourTD is not None and ourTR is not None:
                ourTR.append(ourTD.strip()) # remove extra spaces
            ourTD = None
        # Look for the NEXT start tag. Anything between the current
        # end tag and the next start tag is potential data!
        tagStart = html.find("<", tagEnd+1)
        # If we are in a table, and in a row, and also in a TD,
        # save anything that's not a tag! (between tags)
        #
        # Note that this may happen multiple times if the table
        # data has tags inside of it!
        # e.g. <td>some <b>bold</b> text</td>
        #
        # Because of this, we need to be sure to put a space between each
        # item that may have tags separating them. We remove any extra
        # spaces (above) before we append the ourTD data to the ourTR list.
        if inTable and inTR and inTD:
            ourTD = ourTD + html[tagEnd+1:tagStart] + " "
            #print("td:", ourTD) # for debugging
    # If we end the while loop looking for the next start tag, we
    # are done; return ourTable of data.
    return ourTable
def fetchHtmlTable(link, outPath):
    response = urllib.request.urlopen(link)
    html = response.read().decode()
    # save the raw HTML page locally, under the same name as on the web
    with open(outPath + link.split('/')[-1], 'w') as fout:
        fout.write(html)
    dataTable = parseTable(html)
    # also save the parsed table in a fixed-width text format
    with open(outPath + link.split('/')[-1] + '.tab', 'w') as fout:
        for row in dataTable:
            for item in row:
                fout.write('{:>30}'.format(item))
            fout.write('\n')
    return dataTable
def fetchBatFiles(dataTable, outPath, fileName='ep_flu.txt'): # fetch individual files in the Swift repository
    # template address; '$grb_id$' is replaced with each event's ID below
    root = 'https://cdslaborg.github.io/DataRepos_SwiftBat/ep_flu/GRB$grb_id$_' + fileName
    counter = 0
    missing = 0
    for row in dataTable:
        grb_id = row[0].split('(')[-1][:-1] # extract the ID inside the parentheses
        try:
            counter += 1
            link = root.replace('$grb_id$', grb_id)
            response = urllib.request.urlopen(link)
            webFile = response.read().decode()
            with open(outPath + '/GRB' + grb_id + '_' + fileName, 'w') as fout:
                fout.write(webFile)
        except urllib.error.HTTPError:
            missing += 1
            print('\nurl: {} does not exist.'.format(link))
            print('file for GRB ID: {} is missing. {} files missing out of {} events so far.\n'.format(grb_id, missing, counter))
def plotBatFiles(inPath, figFile):
    import os
    import numpy as np
    import matplotlib.pyplot as plt
    ax = plt.gca()
    ax.set_xlabel('Fluence [ ergs/cm^2 ]')
    ax.set_ylabel('Epeak [ keV ]')
    ax.axis([1.0e-8, 1.0e-1, 1.0, 1.0e4]) # [xmin, xmax, ymin, ymax]
    ax.set_yscale('log')
    ax.set_xscale('log')
    counter = 0
    for file in os.listdir(inPath):
        if file.endswith("ep_flu.txt"):
            # atleast_2d ensures single-row files still index as data[:,1]
            data = np.atleast_2d(np.loadtxt(os.path.join(inPath, file), skiprows=1))
            if data.size != 0 and all(data[:,1] < 0.0):
                counter += 1
                if counter % 100 == 0: print('Plotting file # {}: {}'.format(counter, file))
                data[:,1] = np.exp(data[:,1]) # transform log(fluence) back to fluence
                ax.scatter(data[:,1], data[:,0], s=1, alpha=0.05, c='r', edgecolors='none')
    ax.set_title('Plot of Epeak vs. Fluence for {} Swift GRB events'.format(counter))
    plt.savefig(figFile)
    plt.show()
def main():
    import sys
    if len(sys.argv) != 3:
        print("""
        USAGE:
        ./readPlotWebData.py <web address: https://cdslaborg.github.io/DataRepos_SwiftBat/index.html> <path to output files, e.g.: ./>
        """)
    else:
        link = sys.argv[1]
        outPath = sys.argv[2]
        # comment the following line out after the first run to skip re-fetching
        fetchBatFiles(fetchHtmlTable(link, outPath), outPath, fileName='ep_flu.txt')
        figFile = 'SwiftDataPlot.png'
        plotBatFiles(outPath, figFile)

if __name__ == "__main__":
    main()
First, we write a function that extracts the first two statistical moments of the possible realizations of each observation given in separate files. Then, we can visualize the first two moments of all observations in a single plot,
import os
import numpy as np
import matplotlib.pyplot as plt

class Struct(): pass
def genMeanStd(dirPath="data"):
    """
    Return an object of class Struct containing the means and
    the standard deviations of the two attributes as components.
    """
    stat = Struct()
    stat.ndata = 0
    stat.logepeak = Struct()
    stat.logfluence = Struct()
    stat.logepeak.avg = []
    stat.logepeak.std = []
    stat.logfluence.avg = []
    stat.logfluence.std = []
    for file in os.listdir(dirPath):
        if file.endswith("ep_flu.txt"):
            # atleast_2d ensures single-row files still index as data[:,1]
            data = np.atleast_2d(np.loadtxt(os.path.join(dirPath, file), skiprows=1))
            if data.size != 0 and all(data[:,1] < 0.0):
                stat.ndata += 1
                if stat.ndata % 100 == 0: print('Reading file # {}: {}'.format(stat.ndata, file))
                data[:,0] = np.log(data[:,0]) # epeak to log; fluence (column 1) is already in log
                stat.logepeak.std.append(np.std(data[:,0]))
                stat.logepeak.avg.append(np.mean(data[:,0]))
                stat.logfluence.std.append(np.std(data[:,1]))
                stat.logfluence.avg.append(np.mean(data[:,1]))
    return stat
stat = genMeanStd(dirPath = "data")

# Now visualize the summarized data
fig = plt.figure(figsize = 1.5*np.array([6.4,4.8]), dpi = 150)
ax = plt.subplot()
ax.set_xlabel(r"Log( Fluence [ ergs/cm$^2$ ] )", fontsize = 17)
ax.set_ylabel(r"Log( Epeak [ keV ] )", fontsize = 17)
ax.axis(np.log([1.0e-8, 1.0e-1, 1.0, 1.0e4])) # [xmin, xmax, ymin, ymax]
ax.errorbar( stat.logfluence.avg
           , stat.logepeak.avg
           , yerr = stat.logepeak.std
           , xerr = stat.logfluence.std
           , elinewidth = 1
           , alpha = 0.4
           , ecolor = 'r'
           , fmt = "o"
           )
ax.set_title('Plot of Epeak vs. Fluence for {} Swift GRB events'.format(stat.ndata), fontsize = 17)
plt.savefig("SwiftDataPlotReduced.png")
plt.show()
We can modify our previously written script to return the first two moments of each event as a mean vector of length two (because there are two attributes, epeak and fluence, describing each event) and a covariance matrix (cov), which describes both the spread of the attributes and their correlation with each other.
import os
import numpy as np
import matplotlib.pyplot as plt

class Struct(): pass
def genCorMat(covMat):
    """
    Return the correlation matrix corresponding to the input covariance matrix.
    """
    nd = len(covMat[:,0])
    corMat = np.zeros((nd,nd))
    # compute the standard deviation vector from the diagonal of the covariance matrix
    stdVec = np.sqrt(np.diag(covMat))
    for i in range(nd):
        corMat[i][i] = 1.
        for j in range(i+1, nd):
            corMat[i][j] = covMat[i][j] / (stdVec[i] * stdVec[j])
            corMat[j][i] = corMat[i][j]
    return corMat
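As a sanity check (not part of the original script), the loop-based computation above can be compared against an equivalent vectorized NumPy formulation, which divides the covariance matrix elementwise by the outer product of the standard deviations:

```python
import numpy as np

def genCorMatVectorized(covMat):
    """Correlation matrix from a covariance matrix, without explicit loops."""
    std = np.sqrt(np.diag(covMat))
    return covMat / np.outer(std, std)

cov = np.array([[4.0, 1.2], [1.2, 1.0]])
cor = genCorMatVectorized(cov)
```

The two implementations agree for any valid covariance matrix; the vectorized form is simply shorter and faster for large matrices.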
def genMeanCorCov(dirPath="data"):
    """
    Return an object of class Struct containing the mean vector and the
    covariance and correlation matrices of the two attributes for each event.
    """
    stat = Struct()
    stat.ndata = 0
    stat.avg = []
    stat.cov = []
    stat.cor = []
    for file in os.listdir(dirPath):
        if file.endswith("ep_flu.txt"):
            # atleast_2d ensures single-row files still index as data[:,1]
            data = np.atleast_2d(np.loadtxt(os.path.join(dirPath, file), skiprows=1))
            if data.size != 0 and all(data[:,1] < 0.0):
                stat.ndata += 1
                if stat.ndata % 100 == 0: print('Reading file # {}: {}'.format(stat.ndata, file))
                data[:,0] = np.log(data[:,0]) # epeak to log; fluence (column 1) is already in log
                stat.avg.append(np.mean(data, axis = 0))
                stat.cov.append(np.cov(data, rowvar = False))
                stat.cor.append(genCorMat(stat.cov[-1]))
    return stat
stat = genMeanCorCov(dirPath = "data")

#### Now visualize the summarized data
fig = plt.figure(figsize = 1.5*np.array([6.4,4.8]), dpi = 150)
ax = plt.subplot()
# the plotted values are the actual (non-log) attributes on log-scaled axes
ax.set_xlabel(r"Fluence [ ergs/cm$^2$ ]", fontsize = 17)
ax.set_ylabel(r"Epeak [ keV ]", fontsize = 17)
ax.axis([1.0e-8, 1.0e-1, 1.0, 1.0e4]) # [xmin, xmax, ymin, ymax]
ax.set_xscale("log")
ax.set_yscale("log")

#### set up color mapping via correlations
cmap = Struct()
cmap.values = []
for i in range(stat.ndata): cmap.values.append(stat.cor[i][0,1])
cmap.values = np.array(cmap.values)
cmap.name = "coolwarm"
import matplotlib.colors as colors
import matplotlib.cm as cmx
cmapObject = plt.get_cmap(cmap.name)
#cNorm = colors.Normalize( vmin = np.min(cmap.values), vmax = np.max(cmap.values), clip = True )
cNorm = colors.Normalize( vmin = -1., vmax = 1., clip = True )
mappable = cmx.ScalarMappable( norm = cNorm, cmap = cmapObject )
cmap._rgba = np.zeros( ( len(cmap.values), 4 ) )
for i, rho in enumerate(cmap.values): cmap._rgba[i,:] = mappable.to_rgba(rho)

#### draw one 1-sigma ellipse per event, colored by its correlation coefficient
iskip = 1
for i in range(0, stat.ndata, iskip):
    bcrd = getEllipsoidBoundary(stat.cov[i], stat.avg[i])
    ax.plot( np.exp(bcrd[1,:]) # x: fluence, transformed back from its natural log
           , np.exp(bcrd[0,:]) # y: epeak, transformed back from its natural log
           , color = cmap._rgba[i]
           , alpha = 0.25
           , linewidth = 0.7
           )
ax.set_title('Plot of Epeak vs. Fluence for {} Swift GRB events'.format(stat.ndata), fontsize = 17)
plt.savefig("SwiftDataPlotEllipse.png")
plt.show()