ParaMonte Fortran 2.0.0
Parallel Monte Carlo and Machine Learning Library
See the latest version documentation.
pm_knn Module Reference

This module contains procedures and generic interfaces for computing the nearest neighbor statistics of random samples.
More...

Data Types

interface  setKnnSorted
 Return the input distance matrix whose columns are sorted in ascending order on output, optionally only up to the kth row of each column, such that the kth row in the ith column is the kth nearest neighbor to the \(i^{th}\) point.
More...
 

Variables

character(*, SK), parameter MODULE_NAME = "@pm_knn"
 

Detailed Description

This module contains procedures and generic interfaces for computing the nearest neighbor statistics of random samples.

The k-nearest neighbors algorithm (k-NN) is a non-parametric supervised learning method first developed by Evelyn Fix and Joseph Hodges in 1951 and later expanded by Thomas Cover.
It is used for classification and regression and in both cases, the input consists of the k closest training examples in a data set.
The output depends on whether k-NN is used for classification or regression:

  1. In k-NN classification, the output is a class membership.
    An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small).
    If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
  2. In k-NN regression, the output is the property value for the object.
    This value is the average of the values of k nearest neighbors.
    If k = 1, then the output is simply assigned to the value of that single nearest neighbor.

k-NN is a type of classification where the function is only approximated locally and all computation is deferred until function evaluation.
Since this algorithm relies on distance for classification, if the features represent different physical units or come in vastly different scales then normalizing the training data can improve its accuracy dramatically.

Algorithm

The training examples are vectors in a multidimensional feature space, each with a class label.
The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples.
In the classification phase, k is a user-defined constant, and an unlabeled vector (a query or test point) is classified by assigning the label which is most frequent among the k training samples nearest to that query point.
A commonly used distance metric for continuous variables is Euclidean distance.
For discrete variables, such as for text classification, another metric can be used, such as the overlap metric (or Hamming distance).
In the context of gene expression microarray data, for example, k-NN has been employed with correlation coefficients, such as Pearson and Spearman, as a metric.
Often, the classification accuracy of k-NN can be improved significantly if the distance metric is learned with specialized algorithms such as Large Margin Nearest Neighbor or Neighborhood components analysis.

Drawbacks

A major drawback of the basic majority voting classification occurs when the class distribution is skewed.
That is, examples of a more frequent class tend to dominate the prediction of the new example, because they tend to be common among the k nearest neighbors due to their large number.
One way to overcome this problem is to weight the classification, taking into account the distance from the test point to each of its k nearest neighbors.
The class (or value, in regression problems) of each of the k nearest points is multiplied by a weight proportional to the inverse of the distance from that point to the test point.
Another way to overcome skew is by abstraction in data representation.
For example, in a self-organizing map (SOM), each node is a representative (a center) of a cluster of similar points, regardless of their density in the original training data. K-NN can then be applied to the SOM.

See also
pm_distanceMahal
pm_distanceEuclid
pm_distanceHellinger
pm_distanceManhattan
pm_distanceMinkowski
Test:
test_pm_distanceEuclid


Final Remarks


If you believe this algorithm or its documentation can be improved, we appreciate your contribution and help to edit this page's documentation and source file on GitHub.
For details on the naming abbreviations, see this page.
For details on the naming conventions, see this page.
This software is distributed under the MIT license with additional terms outlined below.

  1. If you use any parts or concepts from this library to any extent, please acknowledge the usage by citing the relevant publications of the ParaMonte library.
  2. If you regenerate any parts/ideas from this library in a programming environment other than those currently supported by this ParaMonte library (i.e., other than C, C++, Fortran, MATLAB, Python, R), please also ask the end users to cite this original ParaMonte library.

This software is available to the public under a highly permissive license.
Help us justify its continued development and maintenance by acknowledging its benefit to society, distributing it, and contributing to it.

Author:
Fatemeh Bagheri, Thursday 8:40 PM, July 20, 2023, Dallas, TX Amir Shahmoradi, Saturday 1:00 AM, September, 1, 2018, Dallas, TX

Variable Documentation

◆ MODULE_NAME

character(*, SK), parameter pm_knn::MODULE_NAME = "@pm_knn"

Definition at line 89 of file pm_knn.F90.