This is page describes the final semester project that will serve as the final exam for this course. Please submit all your efforts for this project (all files, data, and results) in DSP2019F/exams/final/ directory in your private repository for this course. Don’t forget to push your answers to your remote Github repository by 2 PM, Wednesday, Dec 11, 2019. Note: I strongly urge you to attend the future lectures until the end of the semester and seek help from the instructor (Amir) to tackle this project.

Inside the directory for the project (DSP2019F/exams/final/) create three other folders: data, src, results. The data folder contains the input data for this project. The src folder should contain all the codes that you write for this project, and the results folder should contain all the results generated by your code.

For your final project, you can pick one of the following two projects:

## Nonlinear Regression

### Data reduction and visualization

Our goal in this project is to fit a mathematical model of the growth of living cells to real experimental data for the growth of a cancer tumor in the brain of a rat. You can download the data in the form of a MATLAB data file for this project from here. Write a set of separate Python codes that perform the following tasks one after the other, and output all the results to the results folder described above. Since you have multiple Python codes each in a separate file for different purposes, you should also write a main Python code, such that when the user of your codes runs on the Bash command line,

main.py


then all the necessary Python codes to generate all the results will be called by this main script.

Initially at time $t=0 ~\mathrm{[days]}$, $100,000\pm10,000$ brain tumor cells are injected to the brain of the rat. These cells are then allowed to grow for 10 days. Then starting at day 10, the brain of the rat is imaged using an MRI machine.

Each image results in a 4-dimensional double-precision MATLAB matrix cells(:,:,:,:), corresponding to dimensions cells(y,x,z,time). This data is collected from MRI imaging of the rat’s brain almost every other day for two weeks. For example, cells(:,:,:,1) contains the number of cells at each point in space (y,x,z) at the first time point, or, cells(:,:,10,1) represents a (XY) slice of MRI at $z=1$ and $t=1 [days]$.

Therefore, the vector of times at which we have the number of tumor cells measured would be,

in units of days. Given this data set,

1.  First write a Python script that reads the input MATLAB binary file containing cell numbers at different positions in the rat’s brain measured by MRI, on different days.

2.  Write Python codes that generate a set of figures as similar as possible to the following figures (specific color-codes of the curves and figures do not matter, focus more on the format of the plots and its parts).

### Obtaining the error in tumor cell count

3.  Our assumption here is that the uncertainty in the total number of tumor cells at each time point is given by the number of tumor cells at the boundary of the tumor. Therefore,

• (This part is optional extra credit.) you will have to write a Python code that identifies the boundary of the tumor at each time point and then sums over the count cells in all boundary points and uses that as the error in the number of tumor cell counts.

• If you did not solve the above optional part, then assume that the uncertainty in the count of tumor cells at any given point in time is just $5\%$ of the total count of tumor cells. For the illustration of the error bars, you will need Python functions such as pyplot.errorbar() of matplotlib module. In the end, you should get and save a figure in your project’s figure folder like the following figure,

Note that this part of the project is completely independent of the modeling part described in the following section.

### The mathematical model of tumor growth

4.  Now our goal is to fit the time evolution of the growth of this tumor, using a mathematical model. To do so, we need to find the best-fit parameters of the model. The mathematical model we will use here is called the Gompertzian growth model. Here, we will use a slightly modified form of the Gompertzian function of the following form,

where $N(t,\lambda,c)$ is the predicted number of tumor cells at time $t$, $N_0$ is the initial number of tumor cells at time $t=0$ days, $\lambda$ is the growth rate parameter of the model, and $c$ is just another parameter of the model. We already know the initial value of the number of tumor cells, $N_0=100,000\pm10,000$. Therefore, we can fix $N_0$ to $100,000$ in the equation of the model given above.

However, we don’t know the values of the parameters $\lambda$ and $c$. Thus, we would like to find their best values given the input tumor cell data using some Python optimization algorithm.

This Gompertzian growth model is called our physical model for this problem, because it describes the physics of our problem (The physics/biology of the tumor growth).

#### Combining the physical model with a regression model

Now, if our physical model was ideally perfect in describing the data, the curve of the model prediction would pass through all the points in the growth curve plot of the above figure, thus providing a perfect description of data. This is, however, never the case, as it is famously said all models are wrong, but some are useful. In other words, the model prediction never matches observation perfectly.

Therefore, we have to seek for the parameter values that can bring the model prediction us as close as possible to data. To do so, we define a statistical model in addition to the physical model described above. In other words, we have to define a statistical regression model (the renowned least-squares method) that gives us the probability $\pi(\log N_{obs}|\log N(t))$ of observing individual data points at each of the given times,

Note that our statistical model given above is a Normal probability density function, with its mean parameter represented by the log of the output of our physical model, $\log N(t,\lambda,c)$, and its standard deviation represented by $\sigma$, which is unknown, and we seek to find it. The symbol $\pi$, whenever it appears with parentheses, like $\pi()$, it means the probability of the entity inside the parentheses. However, whenever it appears alone, it means the famous number PI, $\pi\approx 3.1415$.

Why do we use the logarithm of the number of cells instead of using the number of cells directly? The reason behind it is slightly complicated. A simple (but not entirely correct argument) is the following: We do so, because the tumor cell counts at later times become extremely large numbers, on the order of several million cells (For example, look at the number of cells in the late stages of the tumor growth, around $t=20$ days). Therefore, to make sure that we don’t hit any numerical precision limits of the computer when dealing with such huge numbers, we work with the logarithm of the number of tumor cells instead of their true non-logarithmic values.

We have seven data points, so the overall probability of observing all of data $\mathcal{D}$ together (the time vector and the logarithm of the number of cells at different times) given the parameters of the model, $\mathcal{L}(\mathcal{D}|\lambda,c,\sigma)$, is the product of their individual probabilities of observations given by the above equation,

Frequently, however, you would want to work with $\log\mathcal{L}$ instead of $\mathcal{L}$. This is again because the numbers involved are extremely small often below the precision limits of the computer. So, by taking the logarithm of the numbers, we work instead with the number’s exponent, which looks just like a normal number (not so big, not so small). So, by taking the log, the above equation becomes,

5.  Now the goal is to use an optimization algorithm in Python, such as fmin() of scipy package, to find the most likely set of the parameters of the model $\lambda,c,\sigma$ that give the highest likelihood of obtaining the available data, which is given by the number $\log\mathcal{L}(\mathcal{D}|\lambda,c,\sigma)$ from the above equation. So we want to find the set of parameters for which this number given by the above equation is maximized. You can also use any Python optimization function or method that you wish, to obtain the best parameters.

However, if you use fmin() of scipy package, then note that this function finds the minimum of an input function, not the maximum. What we want is to find the maximum of $\log\mathcal{L}(\mathcal{D}|\lambda,c,\sigma)$. What is the solution then? Very simple. We can multiply the value of $\log\mathcal{L}(\mathcal{D}|\lambda,c,\sigma)$ by a negative sign so that the maximum value is converted to a minimum. But, note that the position (the set of parameter values) at which this minimum occurs, will remain the same as the maximum position for $\log\mathcal{L}(\mathcal{D}|\lambda,c,\sigma)$.

So, now rewrite your likelihood function above by multiplying its final result (which is just number) by a negative sign. Then you pass this modified function to fmin() of scipy package and you find the optimal parameters. Note that fmin() of scipy package takes as input also a set of initial staring parameter values to initiate the search for the optimal parameters. You can use $(\lambda,c,\sigma) = [10,0.1,1]$ as your starting point given to fmin() of scipy package to search for the optimal values of the parameters.

Then redraw the above tumor evolution curve and show the result from the model prediction as well, like the following,

Report also your best-fit parameters in a file and submit them with all the figures and your codes to your exam folder repository.

## Hierarchical Clustering

Consider the set of (x,y) coordinates of 1000 points in this file: points.txt. Plotting these points would yield a scatter plot like the black points in the following plot,

The red points on this plot delineate the borders of the three ellipsoids from which these points have been drawn. Suppose, we did not have any a priori knowledge of these ellipsoids and we wanted to guess them to the best of our knowledge using Machine Learning methods, in particular, clustering techniques.

The problem here, however, is slightly complex than the above supposition: We may not even know, a priori, how many clusters exist in our data set. Many clustering techniques have been developed over the past decades to automatically answer the question of how many clusters exist in a dataset and where and which objects belong to what clusters.

Here, we want to focus on a very special approach. To predict the original ellipsoids from which these points are drawn, we can start with a very simple assumption: suppose all points came from one single ellipsoid. We can build this hypothetical ellipsoid by constructing the covariance matrix of the set of points in the dataset and then scale it properly such that the ellipsoid covers all the points in our dataset. Here is a procedure to this in Python,

First, we read the data set and visualize it,

%matplotlib notebook
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

Point = np.array([Data.x,Data.y])

fig = plt.figure( figsize=(4.5, 4) \
, dpi= 100 \
, facecolor='w' \
, edgecolor='w' \
) # create figure object
ax = fig.add_subplot(1,1,1) # Get the axes instance

ax.plot( Point[0,:] \
, Point[1,:] \
, 'r.' \
, markersize = 1 \
) # plot with color red, as line

ax.set_xlabel('X')
ax.set_ylabel('Y')
fig.savefig('points.png', dpi=200) # save the figure to an external file
plt.show() # display the figure


This will display the following figure,

Now, here is a script that computes the covariance matrix of a given sample of points (here, our dataset),

def getMinVolPartition(Point):
import numpy as np
npoint = len(Point[0,:])
ndim = len(Point[:,0])
ncMax = npoint // (ndim + 1) # max number of clusters possible
BoundingEllipsoidCenter = np.array([np.mean(Point[0,:]),np.mean(Point[1,:])])
SampleCovMat = np.mat(np.cov(Point))
SampleInvCovMat = np.mat(np.linalg.inv(SampleCovMat))
PointNormed = np.mat(np.zeros((ndim,npoint)))
for idim in range(ndim):
PointNormed[idim,:] = Point[idim] - BoundingEllipsoidCenter[idim]
MahalSq = PointNormed.T * SampleInvCovMat * PointNormed
maxMahalSq = np.max(MahalSq)
BoundingEllipsoidVolume = np.linalg.det(SampleCovMat) * maxMahalSq**ndim
BoundingEllipsoidCovMat = SampleCovMat * maxMahalSq
print(
"""
nd = {}
np = {}
ncMax = {}
SampleCovMat = {}
InvCovMat = {}
max(MahalSq) = {}
BoundingEllipsoidCenter = {}
BoundingEllipsoidCovMat = {}
BoundingEllipsoidVolume = {}
""".format( ndim
, npoint
, ncMax
, SampleCovMat[:]
, SampleInvCovMat
, maxMahalSq
, BoundingEllipsoidCenter
, BoundingEllipsoidCovMat
, BoundingEllipsoidVolume
))
return BoundingEllipsoidCenter, BoundingEllipsoidCovMat


Calling this function would give an output like the following,

getMinVolPartition(Point)

    nd = 2
np = 1000
ncMax = 333
SampleCovMat = [[1.0761723  0.36394188]
[0.36394188 0.71635847]]
InvCovMat = [[ 1.12198982 -0.5700206 ]
[-0.5700206   1.68554491]]
max(MahalSq) = 14.185346024371276
BoundingEllipsoidCenter = [6.44826263 6.14296536]
BoundingEllipsoidCovMat = [[15.26587652  5.16264153]
[ 5.16264153 10.16179275]]
BoundingEllipsoidVolume = 128.47580579408614

(array([6.44826263, 6.14296536]), matrix([[15.26587652,  5.16264153],
[ 5.16264153, 10.16179275]]))


where, the variable BoundingEllipsoidCenter is a vector of lebght two, representing the center of the bounding ellipsoid of these points, the variable BoundingEllipsoidCovMat represents the 2-by-2 covariance matrix of this bounding ellipsoid, and the variable BoundingEllipsoidVolume is the determinant of this bounding ellipsoid, essentially representing the volume encosed by it.

To visualize this bounding ellipsoid, we can use the following code,

def getRandMVU(numRandMVU,MeanVec,CovMat,isInside=True):
"""
generates numRandMVU uniformly-distributed random points from
inside an ndim-dimensional ellipsoid with Covariance Matrix CovMat,
centered at MeanVec[0:ndim].
Output:
Numpy matrix of shape numRandMVU by ndim
"""
import numpy as np
ndim = len(MeanVec)
AvgStdMVN = np.zeros(ndim)
CovStdMVN = np.eye(ndim)
RandStdMVN = np.random.multivariate_normal(AvgStdMVN,CovStdMVN,numRandMVU)
DistanceSq = np.sum(RandStdMVN**2, axis=1)
#print(len(DistanceSq))
if isInside:
UnifRnd = np.random.random((numRandMVU,))
UnifRnd = (UnifRnd**(1./ndim)) / np.sqrt(DistanceSq)

CholeskyLower = np.linalg.cholesky(np.mat(CovMat))
#print(CholeskyLower[1,0])
RandMVU = np.zeros(np.shape(RandStdMVN))
for iRandMVU in range(numRandMVU):
if isInside:
RandStdMVN[iRandMVU] *= UnifRnd[iRandMVU]
else:
RandStdMVN[iRandMVU] /= np.sqrt(DistanceSq[iRandMVU])
for i in range(ndim):
RandMVU[iRandMVU,i] = RandMVU[iRandMVU,i] + CholeskyLower[i,i] * RandStdMVN[iRandMVU,i]
for j in range(i+1,ndim):
RandMVU[iRandMVU,j] = RandMVU[iRandMVU,j] + CholeskyLower[j,i] * RandStdMVN[iRandMVU,i]
RandMVU[iRandMVU] += MeanVec
return RandMVU


The above code takes an input covariance matrix CovMat corresponding to an ellipsoid of interest centered at MeanVec, then outputs a set of numRandMVU points that lie on the boundary of this ellipsoid. If the optional argument isInside is set to False, then the output random points will be uniformly distributed inside the ellipsoid.

Here is an illustration of the bounding ellipsoid of the points we are interested to classify in this problem,

MeanVec, CovMat = getMinVolPartition(Point)
RandMVU = getRandMVU( numRandMVU=10000
, MeanVec=MeanVec
, CovMat=CovMat
, isInside = False
)
%matplotlib notebook
import matplotlib.pyplot as plt

fig = plt.figure( figsize=(4.5, 4) \
, dpi= 100 \
, facecolor='w' \
, edgecolor='w' \
) # create figure object

# plot the points
plt.plot( Point[0,:] \
, Point[1,:] \
, 'r.' \
, markersize = 2 \
)

# plot the center point
plt.plot( MeanVec \
, MeanVec \
, 'b.' \
, markersize = 10 \
)

# plot the bounding ellipsoid
plt.scatter(RandMVU[:,0],RandMVU[:,1],1)

ax.set_xlabel('X')
ax.set_ylabel('Y')
plt.show()


So far, we have been able to find an ellipsoid that encloses all of the points in our problem. But here is the second question: Does this single ellipsoid accurately describe the original ellipsoid(s) from which the points were drawn? and does it really represent the least-volume bounding ellipsoid for all of these points?

One way to ensure that this ellipsoid is indeed the least-volume ellipsoid is to check and see if the points are uniformly distributed inside our single ellipsoid. But this turns out to be a very challenging task.

An easier way to see if the single ellipsoid is a good fit to our points is to compute the area of the single ellipsoid, then move on to assume that our data came from two ellipsoids instead of a single ellipsoid. At this point, we can use the K-means clustering method to find the two clusters from which these points could have been drawn.

Now, here is the critical step: We compute the sum of the areas enclosed by these two ellipsoids (which could overlap, but that is fine, we proceed as if they were not overlapping). Then we can compare this sum with the area of the original single ellipsoid in the above figure:

1. If the area of the single ellipsoid is smaller than the sum of the areas of the two child-ellipsoids, we assume that all of the points in our dataset came from the single ellipsoid, and stop further searches for potentially more clusters in our dataset.
2. However, if the area of the single ellipsoid is larger than the sum of the areas of the two child-ellipsoids, then we know that the two smaller ellipsoids are likely better fit to our dataset than a single ellipsoid. Therefore, our dataset was likely generated from two-ellipsoids.

But what if there are more than two ellipsoids responsible for the generation of the points? One way to test this hypothesis is to repeat the above procedure for the two child-ellipsoids and see whether any one of them can be replaced with two sub-child-ellipsoids instead. This procedure can be then repeated for as many times as needed, until the algorithm stops, implying that all of the child-ellipsoids have been found, or, the number of points for a sub-clustering task becomes 3 or less than 3, in which case no more clustering is possible, because we need at least three points to form a 2D ellipsoid.

Write an algorithm based on the above description and provided scripts that can classify all of the points in a given dataset into an automatically-determined number of ellipsoids, such that each point in the dataset is enclosed by at least one ellipsoid. The first graph above shows an example of a set of such ellipsoids illustrated by the green dots.

Note that the ellipsoids found by your algorithm are not unique, meaning that different runs of the algorithm could potentially yield different sets of best-fit ellipsoids. However, we can hope that each set of such ellipsoids found by the algorithm is a good approximation to the original ellipsoids from which the points were drawn.

Here is an animation of this algorithm at work, for a set of points with an evolving overall-shape over time,