A naive implementation of Kmedoids clustering

Problem K-medoids is a classical partitioning technique for clustering that splits a data set of $n$ objects into $k$ clusters, where the number $k$ of clusters is assumed to be known a priori. Unlike the K-means algorithm, however, the...
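
As a rough illustration of the idea, the sketch below implements a naive K-medoids search in plain Python. It assumes Euclidean distance and a greedy swap strategy (replace a medoid with a non-medoid whenever the total cost drops). The function names `dist`, `total_cost`, and `naive_kmedoids`, as well as the toy points, are hypothetical and not taken from the problem statement.

```python
import random

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)

def naive_kmedoids(points, k, seed=0):
    """Greedy swap search over medoid indices (an assumed, simple strategy)."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)  # medoids are indices of actual data points
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in range(len(points)):
                if candidate in medoids:
                    continue
                # Try replacing the i-th medoid with the candidate point.
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_cost(points, trial)
                if cost < best - 1e-12:
                    medoids, best, improved = trial, cost, True
    # Assign each point to the cluster of its nearest medoid.
    labels = [min(range(k), key=lambda j: dist(p, points[medoids[j]])) for p in points]
    return medoids, labels, best

# Toy data (made up for illustration): two well-separated groups.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_medoids, toy_labels, toy_cost = naive_kmedoids(toy_points, k=2)
print(sorted(toy_medoids), toy_labels, toy_cost)
```

The swap loop stops when no single replacement lowers the cost, so it is only guaranteed to reach a local optimum. Each pass evaluates the full cost for every candidate swap, which is acceptable for small data sets only.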

Online comparison of the Kmeans clustering algorithm with DBSCAN

Problem On this website, you will find an online simulator of the Kmeans clustering technique. Visit this page and choose the first option, I’ll Choose. You will be taken to a new page. On this new page, choose Smiley...

Online experimentation with DBSCAN clustering technique

Problem On this website, you will find an online simulator of the DBSCAN clustering technique. Visit this page and choose the first dataset option named Uniform. Recall from our lecture notes that the DBSCAN method has two free adjustable parameters...

Kmeans clustering - an implementation

Problem Consider this dataset points.txt. Write a script that reads this dataset and plots the second column of the dataset versus the first column, as in the following plot. Now write another script that applies the Kmeans clustering technique to this dataset...
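
For orientation, here is a minimal sketch of the Kmeans iteration (Lloyd's algorithm) in plain Python. It does not read points.txt. The toy points, the function name `kmeans`, and the defaults `n_iter` and `seed` are assumptions made for this illustration.

```python
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate between assignment and center update."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return centers, labels

# Toy data (made up for illustration): two square-shaped groups.
toy = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
toy_centers, toy_labels = kmeans(toy, k=2)
print(sorted(toy_centers), toy_labels)
```

The result of Kmeans generally depends on the initial centers, so in practice the algorithm is often restarted from several seeds and the run with the lowest within-cluster sum of squares is kept.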

Kmeans clustering: Determining the cluster number using the Elbow method

Problem Consider this dataset customers.csv, which contains the details of a mall’s customers. Our aim is to cluster the customers based on the relevant features “annual income” and “spending score”. Write a script that reads this...

Puzzle: Matchstick Wrong Equation

Problem Move just one matchstick in the following equation to make it hold.

Puzzle: How many living creatures are in the pond

Problem How many living creatures can you identify in this figure? (Hint: There are two).

Regression: Predicting the global land temperature of Earth in 2050 from the past data: Choosing the best model

Problem Consider this dataset, 1880_2020.csv, which contains the monthly global land and ocean temperature anomalies of the Earth from January 1880 to June 2020. As stated in the file, temperatures are in degrees Celsius and reported as anomalies...

Regression: Estimating the parameters of a linear model for a Normally-distributed sample

Problem Suppose we have observed a dataset comprised of events with one attribute as in this file: z.csv. Plotting these points would yield a histogram like the following plot. Now our goal is to form a hypothesis about this dataset,...

Regression: Estimating the parameters of a Normally-distributed sample

Problem Suppose we have observed a dataset comprised of $15027$ events with one attribute variable in this file: dataFull.csv. Plotting these points would yield a histogram like the following plot. Now our goal is to form a hypothesis about this...
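
One common way to estimate the parameters of a Normal model is the closed-form maximum-likelihood solution: the sample mean and the square root of the mean squared deviation (dividing by $n$). The sketch below assumes this approach. The problem itself may ask for a different method (for example, numerical optimization of the likelihood), and the function name `fit_normal` and the small sample are made up for illustration rather than taken from dataFull.csv.

```python
import math

def fit_normal(sample):
    """Maximum-likelihood estimates (mu, sigma) of a Normal distribution."""
    n = len(sample)
    mu = sum(sample) / n
    # The MLE of the variance divides by n, not n - 1.
    sigma = math.sqrt(sum((x - mu) ** 2 for x in sample) / n)
    return mu, sigma

# Hypothetical sample, not taken from dataFull.csv.
mu_hat, sigma_hat = fit_normal([2, 4, 4, 4, 5, 5, 7, 9])
print(mu_hat, sigma_hat)
```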

Computing the cross-correlation of sin() and cos()

Problem Generate two arrays corresponding to the values of $\sin(x)$ and $\cos(x+\pi/2)$ functions in the range $[0, 10\pi]$. Make a plot of the resulting arrays like the following illustration. Now use an FFT package in the language of your choice...
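
A possible starting point, assuming NumPy is the FFT package of choice, is sketched below. It computes a normalized circular cross-correlation through the correlation theorem (the inverse transform of one spectrum times the complex conjugate of the other). The function name `cross_correlation_fft`, the sample count `n`, and the choice of `endpoint=False` (so that the arrays cover exactly five periods) are assumptions of this sketch.

```python
import numpy as np

def cross_correlation_fft(a, b):
    """Normalized circular cross-correlation of two equal-length real arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    fa = np.fft.rfft(a - a.mean())
    fb = np.fft.rfft(b - b.mean())
    # Correlation theorem: cc[k] = sum_n a[n + k] * b[n] (indices wrap around).
    cc = np.fft.irfft(fa * np.conj(fb), n=len(a))
    # Dividing by n * std(a) * std(b) makes cc[0] the Pearson correlation.
    return cc / (len(a) * a.std() * b.std())

n = 1000  # assumed number of samples
x = np.linspace(0.0, 10.0 * np.pi, n, endpoint=False)
s = np.sin(x)
c = np.cos(x + np.pi / 2.0)  # identical to -sin(x)
cc = cross_correlation_fft(s, c)
print(cc[0], cc[n // 10])
```

Because $\cos(x+\pi/2) = -\sin(x)$, the two arrays are perfectly anti-correlated at zero lag and perfectly correlated at a lag of half a period, which is `n // 10` samples in this setup. For non-periodic data, zero-padding both arrays before the transform avoids the wrap-around of the circular correlation.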

Computing the cross-correlation of two data attributes

Problem Consider this dataset of carbon emissions history per country. Make a visualization of the global carbon emission data in the CSV file above by summing over the contributions of all countries per year to obtain an illustration...

Computing the autocorrelation of a dataset

Problem Recall the globalLandTempHist.txt dataset that consisted of the global land temperature of Earth over the past 300 years. Also recall that the autocorrelation of a time-series is defined as the correlation of a univariate dataset with itself, with some...
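
The definition above can be sketched in a few lines of plain Python, assuming the usual sample estimator in which the lagged covariance is divided by the zero-lag sum of squared deviations. The function name `autocorrelation` and the short series are hypothetical and unrelated to globalLandTempHist.txt.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a 1-D sequence at a non-negative integer lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    # Correlate the series with a copy of itself shifted by `lag` samples.
    cov = sum((series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag))
    return cov / var

# Hypothetical series for illustration.
trend = [1, 2, 3, 4, 5]
acf_lag0 = autocorrelation(trend, 0)
acf_lag1 = autocorrelation(trend, 1)
print(acf_lag0, acf_lag1)
```

With this normalization the value at lag 0 is always 1, and values at larger lags are based on fewer overlapping pairs, so they become noisier as the lag approaches the length of the series.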

Computing and removing the autocorrelation of a dataset

Problem Consider the following Banana function.

    def getLogFuncBanana(point):
        import numpy as np
        from scipy.stats import multivariate_normal as mvn
        from scipy.special import logsumexp
        NPAR = 2 # sum(Banana,gaussian) normalization factor
        normfac = 0.3 # sum(Banana,gaussian) normalization factor
        lognormfac = np.log(normfac) #...

Ugly visualization

Problem What is ugly in the following graph?

The population growths of the US states

Problem Which color scale has been used in the following visualization?

The cities with the most and least moderate temperature

Problem Consider the following plot displaying the temperatures of a number of US cities. Which city’s temperature varies the least throughout the year? Which city’s temperature varies the most throughout the year? Which city is the hottest in the...

Wrong visualization

Problem What is wrong with the following visualization?

Excel Bar plot

Problem Consider the following salary data.

Data Scientist | Physicist | Bioinformatician
---------------|-----------|-----------------
$110,000       | $122,000  | $58,000

Make a graph of this data in Microsoft Excel similar to the following visualization.