17.2. Similarity Measures#
This section provides a review of popular similarity and dissimilarity measures from the literature.
17.2.1. Introduction#
17.2.1.1. Similarity Function#
The following is an informal definition of a similarity function.
Definition 17.5 (Similarity function)
Let $X = \{ x_1, \dots, x_n \}$ be a dataset. A similarity function is a mapping $s : X \times X \to \mathbb{R}$ which grows as the data points look similar to each other and satisfies the following properties:

1. $0 \leq s(x, y) \leq 1$ for every $x, y \in X$.
2. $s(x, x) = 1$ for every $x \in X$.

Normally, we require that the similarity function is symmetric; i.e., $s(x, y) = s(y, x)$ for every $x, y \in X$.
There are rare cases where asymmetric similarity functions have been used.
Example 17.6 (Similarity from metrics)
Let $d : X \times X \to \mathbb{R}$ be a metric. Then $d$ satisfies the following properties:

1. $d(x, y) \geq 0$ for every $x, y \in X$.
2. $d(x, y) = 0$ if and only if $x = y$.
3. $d(x, y) = d(y, x)$ for every $x, y \in X$.
4. $d(x, z) \leq d(x, y) + d(y, z)$ for every $x, y, z \in X$.

One way to induce a similarity function is:

$$
s(x, y) = \frac{1}{1 + d(x, y)}.
$$

We can see that:

1. Since $d(x, y) \geq 0$ for every $x, y \in X$, hence $0 < s(x, y) \leq 1$.
2. Since $d(x, x) = 0$, hence $s(x, x) = 1$.
3. Since $d$ is symmetric, hence $s$ is symmetric.
4. Since $d(x, y) > 0$ whenever $x \neq y$, hence $s(x, y) < 1$ whenever $x \neq y$.

Thus, $s$ is indeed a similarity function.
Note that a metric satisfies the triangle inequality. However, this property was not required for the similarity function induced from the metric. Thus, a metric imposes additional structure that is more stringent than what a similarity measure needs.
For example, the squared Euclidean distance $d(x, y) = \| x - y \|_2^2$ fails the triangle inequality and is therefore not a metric, yet it induces a perfectly valid similarity function through the same construction.
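As a quick illustration, here is a minimal NumPy sketch of this construction (the helper name `induced_similarity` is ours, not from the text):

```python
import numpy as np

def induced_similarity(x, y, dist):
    """Similarity induced from a distance function: s = 1 / (1 + d)."""
    return 1.0 / (1.0 + dist(x, y))

def euclidean(x, y):        # a metric
    return np.linalg.norm(x - y)

def sq_euclidean(x, y):     # not a metric: fails the triangle inequality
    return np.sum((x - y) ** 2)

x, y = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(induced_similarity(x, x, euclidean))     # 1.0, since d(x, x) = 0
print(induced_similarity(x, y, euclidean))     # 1 / (1 + 5) ≈ 0.1667
print(induced_similarity(x, y, sq_euclidean))  # 1 / (1 + 25) ≈ 0.0385
```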
17.2.1.2. Proximity Matrices#
Definition 17.6 (Distance matrix)
Let $X = \{ x_1, \dots, x_n \}$ be a dataset and let $d$ be a distance function on it. The distance matrix of $X$ is the $n \times n$ matrix

$$
D = [d_{ij}]
$$

where $d_{ij} = d(x_i, x_j)$ is the distance between the points $x_i$ and $x_j$.
Definition 17.7 (Similarity matrix)
Let $X = \{ x_1, \dots, x_n \}$ be a dataset and let $s$ be a similarity function on it. The similarity matrix of $X$ is the $n \times n$ matrix

$$
S = [s_{ij}]
$$

where $s_{ij} = s(x_i, x_j)$ is the similarity between the points $x_i$ and $x_j$.
Definition 17.8 (Proximity matrix)
Let $X = \{ x_1, \dots, x_n \}$ be a dataset. A proximity matrix of $X$ is an $n \times n$ matrix whose entries measure either the similarity or the dissimilarity between pairs of points in $X$.
In other words, a proximity matrix is either a distance matrix or a similarity matrix.
Each entry in a proximity matrix is called a proximity index.
Remark 17.1 (Asymmetry in proximity)
There are several problems where a proximity measure may not be symmetric. For example, when the objects in the dataset are locations in a hilly area and the measure of distance is the time taken to travel from point A to point B, the time taken may be quite different going uphill than downhill. Such problems do give rise to asymmetric proximity matrices. However, we will normally be concerned with symmetric cases.
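As a sketch, the proximity matrices of a small dataset can be computed with SciPy; here we pair the Euclidean distance matrix with the induced similarity $s = 1/(1 + d)$ from Example 17.6 (the variable names are ours):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])   # n = 3 points, one per row

D = squareform(pdist(X, metric="euclidean"))  # n x n distance matrix
S = 1.0 / (1.0 + D)                           # induced similarity matrix

print(D)  # symmetric with a zero diagonal
print(S)  # symmetric with a unit diagonal
```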
Definition 17.9 (Proximity graph)
Let $X = \{ x_1, \dots, x_n \}$ be a dataset with a proximity matrix $P = [p_{ij}]$. The proximity graph of $X$ is a weighted graph whose vertices are the points in $X$ and whose edge between $x_i$ and $x_j$ carries the weight $p_{ij}$.
A symmetric proximity matrix leads to an undirected graph while an asymmetric proximity matrix leads to a directed graph.
17.2.1.3. Scatter and Covariance#
Definition 17.10 (Scatter matrix)
Let $X = \{ x_1, \dots, x_n \} \subseteq \mathbb{R}^m$ be a dataset and let

$$
\mu = \frac{1}{n} \sum_{i=1}^n x_i
$$

be its mean. Then the scatter matrix for $X$ is given by

$$
S = \sum_{i=1}^n (x_i - \mu)(x_i - \mu)^T.
$$
Definition 17.11 (Statistical scatter)
The trace of the scatter matrix is known as the statistical scatter of the data set. It is given by

$$
s = \operatorname{tr}(S) = \sum_{i=1}^n \| x_i - \mu \|_2^2.
$$

Recall that $\operatorname{tr}(AB) = \operatorname{tr}(BA)$. Hence

$$
\operatorname{tr}\bigl( (x_i - \mu)(x_i - \mu)^T \bigr)
= (x_i - \mu)^T (x_i - \mu) = \| x_i - \mu \|_2^2
$$

since this quantity is a scalar.
Let $\mathcal{C} = \{ C_1, \dots, C_k \}$ be a partition of $X$ into $k$ clusters, where the cluster $C_j$ contains $n_j$ points and has the mean

$$
\mu_j = \frac{1}{n_j} \sum_{x \in C_j} x.
$$

The within-cluster scatter matrix of the partition is given by

$$
S_W = \sum_{j=1}^k \sum_{x \in C_j} (x - \mu_j)(x - \mu_j)^T.
$$

The within-cluster scatter is the trace of this matrix given by

$$
s_W = \operatorname{tr}(S_W) = \sum_{j=1}^k \sum_{x \in C_j} \| x - \mu_j \|_2^2.
$$

The between-cluster scatter matrix for the partition is given by

$$
S_B = \sum_{j=1}^k n_j (\mu_j - \mu)(\mu_j - \mu)^T.
$$
We shall denote the dataset after mean removal as

$$
\tilde{X} = X - \mu,
$$

where the matrix-vector subtraction is done according to the standard broadcasting rules (the mean is subtracted from every data point).

The covariance matrix of the dataset is given by

$$
C = \frac{1}{n} \sum_{i=1}^n (x_i - \mu)(x_i - \mu)^T = \frac{1}{n} S.
$$

The sample covariance matrix, which is often used in statistics, is given by

$$
\hat{C} = \frac{1}{n - 1} S.
$$
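Here is a minimal NumPy sketch of these quantities, assuming the data points are arranged as the columns of `X` (the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 100))        # m = 2 features, n = 100 points (columns)
n = X.shape[1]

mu = X.mean(axis=1, keepdims=True)   # mean vector, shape (2, 1)
Xt = X - mu                          # mean removal via broadcasting

S = Xt @ Xt.T                        # scatter matrix, shape (2, 2)
scatter = np.trace(S)                # statistical scatter
C = S / n                            # covariance matrix
C_hat = S / (n - 1)                  # sample covariance matrix

assert np.allclose(C_hat, np.cov(X))  # np.cov uses the 1/(n-1) convention
```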
17.2.2. Common Proximity Measures#
We look at some of the common proximity measures for numerical data.
17.2.2.1. Euclidean Distance#

$$
d(x, y) = \| x - y \|_2 = \sqrt{ \sum_{i=1}^m (x_i - y_i)^2 }.
$$

17.2.2.2. Squared Euclidean Distance#

$$
d(x, y) = \| x - y \|_2^2 = \sum_{i=1}^m (x_i - y_i)^2.
$$
17.2.2.3. Manhattan Distance#
This is also known as the city-block distance:

$$
d(x, y) = \| x - y \|_1 = \sum_{i=1}^m | x_i - y_i |.
$$

The Manhattan segmental distance is a variant in which only some of the dimensions, selected by a subset $P \subseteq \{ 1, \dots, m \}$, participate:

$$
d_P(x, y) = \frac{\sum_{i \in P} | x_i - y_i |}{|P|}.
$$
17.2.2.4. Maximum Distance#
This is also known as the Chebyshev distance:

$$
d(x, y) = \| x - y \|_\infty = \max_{1 \leq i \leq m} | x_i - y_i |.
$$
17.2.2.5. Minkowski Distance#
This is the generalization of the Euclidean, Manhattan and Chebyshev distances:

$$
d(x, y) = \| x - y \|_p = \left( \sum_{i=1}^m | x_i - y_i |^p \right)^{1/p}
$$

for some $p \geq 1$. It reduces to the Manhattan distance for $p = 1$, the Euclidean distance for $p = 2$, and the Chebyshev distance as $p \to \infty$.
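A short sketch using SciPy's built-in implementations of this family of distances (the sample vectors are ours):

```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

print(distance.euclidean(x, y))       # 5.0  (p = 2)
print(distance.cityblock(x, y))       # 7.0  (p = 1, Manhattan)
print(distance.chebyshev(x, y))       # 4.0  (p -> infinity)
print(distance.minkowski(x, y, p=3))  # (3^3 + 4^3)^(1/3) ≈ 4.4979
```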
17.2.2.6. Mahalanobis Distance#
As discussed in Example 17.5, the Mahalanobis distance can address the issues related to the scale differences between different features and the correlation among features. It is given by

$$
d(x, y) = \sqrt{ (x - y)^T \hat{C}^{-1} (x - y) },
$$

where $\hat{C}$ is the sample covariance matrix of the dataset.

Mahalanobis distance is invariant to nonsingular transformations. Let $A$ be any nonsingular $m \times m$ matrix. Let the transformed dataset be $Y = \{ y_1, \dots, y_n \}$ with $y_i = A x_i$; its sample covariance matrix is $\hat{C}_Y = A \hat{C} A^T$. Then

$$
\begin{aligned}
d(y_i, y_j)
&= \sqrt{ (y_i - y_j)^T \hat{C}_Y^{-1} (y_i - y_j) } \\
&= \sqrt{ (x_i - x_j)^T A^T (A \hat{C} A^T)^{-1} A (x_i - x_j) } \\
&= \sqrt{ (x_i - x_j)^T \hat{C}^{-1} (x_i - x_j) } = d(x_i, x_j).
\end{aligned}
$$
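The following sketch computes the Mahalanobis distance with SciPy and numerically checks the invariance claim on a hypothetical random dataset (all names ours):

```python
import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(1)
# hypothetical correlated dataset with points as rows
X = rng.multivariate_normal([0.0, 0.0], [[4.0, 1.5], [1.5, 1.0]], size=500)

C_inv = np.linalg.inv(np.cov(X, rowvar=False))  # inverse sample covariance

x, y = X[0], X[1]
d = distance.mahalanobis(x, y, C_inv)
assert np.isclose(d, np.sqrt((x - y) @ C_inv @ (x - y)))

# invariance under a nonsingular transformation A
A = np.array([[2.0, 1.0],
              [0.0, 3.0]])
XA = X @ A.T                                     # y_i = A x_i for every point
CA_inv = np.linalg.inv(np.cov(XA, rowvar=False))
assert np.isclose(d, distance.mahalanobis(A @ x, A @ y, CA_inv))
```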
17.2.2.7. Chord Distance#
We normalize the data points to project them to the unit sphere in $\mathbb{R}^m$ and then measure the length of the chord joining the projected points:

$$
d(x, y) = \sqrt{ 2 - 2 \frac{ x^T y }{ \| x \|_2 \| y \|_2 } }.
$$

It is easy to see how this formula is arrived at. Assume that $x$ and $y$ are normalized; i.e., $\| x \|_2 = \| y \|_2 = 1$. Then the squared length of the chord between them is

$$
\| x - y \|_2^2 = \| x \|_2^2 + \| y \|_2^2 - 2 x^T y = 2 - 2 x^T y.
$$

When they are not normalized, we normalize them first, which replaces $x^T y$ by $\frac{x^T y}{\| x \|_2 \| y \|_2}$. Finally, for computing the length of the chord, we need to take the square root.
17.2.2.8. Geodesic Distance#
Once the data points have been normalized to the unit sphere, one can calculate the shortest path (arc) between any two data points along the surface of the unit sphere. This is known as the geodesic distance. It is given by

$$
d(x, y) = \arccos \left( \frac{ x^T y }{ \| x \|_2 \| y \|_2 } \right).
$$
17.2.2.9. Cosine Similarity#
The cosine similarity between two points is given by

$$
s(x, y) = \cos \theta = \frac{ x^T y }{ \| x \|_2 \| y \|_2 }.
$$

By the Cauchy-Schwarz inequality, it is easy to see that the cosine similarity is bounded between $-1$ and $1$.
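A small NumPy sketch of the three sphere-based measures above; the helper names are ours, and the `clip` call guards `arccos` against rounding errors:

```python
import numpy as np

def cosine_similarity(x, y):
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def chord_distance(x, y):
    # chord length between the projections of x and y on the unit sphere
    return np.sqrt(2.0 - 2.0 * cosine_similarity(x, y))

def geodesic_distance(x, y):
    # arc length on the unit sphere
    return np.arccos(np.clip(cosine_similarity(x, y), -1.0, 1.0))

x, y = np.array([1.0, 0.0]), np.array([1.0, 1.0])
print(cosine_similarity(x, y))  # cos(pi/4) ≈ 0.7071
print(chord_distance(x, y))     # ≈ 0.7654
print(geodesic_distance(x, y))  # pi/4 ≈ 0.7854
```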
17.2.3. Proximity Between Clusters#
Many clustering algorithms are hierarchical. In the context of agglomerative hierarchical clustering, one needs to merge clusters which are similar to each other in each iteration. Thus, it is imperative to have some measures of similarity or dissimilarity between clusters.
In this subsection, we shall consider two arbitrary clusters $C_1$ and $C_2$ of the dataset $X$, containing $n_1$ and $n_2$ points respectively. Let the mean of $C_1$ be

$$
\mu_1 = \frac{1}{n_1} \sum_{x \in C_1} x.
$$

Let the mean of $C_2$ be

$$
\mu_2 = \frac{1}{n_2} \sum_{x \in C_2} x.
$$
17.2.3.1. Mean Based Distance#
The mean based distance between two clusters is defined as

$$
d(C_1, C_2) = \| \mu_1 - \mu_2 \|_2.
$$
17.2.3.2. Nearest Neighbor Distance#

$$
d(C_1, C_2) = \min_{x \in C_1, \, y \in C_2} d(x, y).
$$
17.2.3.3. Farthest Neighbor Distance#

$$
d(C_1, C_2) = \max_{x \in C_1, \, y \in C_2} d(x, y).
$$
17.2.3.4. Average Neighbor Distance#

$$
d(C_1, C_2) = \frac{1}{n_1 n_2} \sum_{x \in C_1} \sum_{y \in C_2} d(x, y).
$$
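The following sketch (function names ours) computes the four cluster-to-cluster distances defined above directly from their definitions:

```python
import numpy as np
from scipy.spatial.distance import cdist

def mean_distance(C1, C2):
    return np.linalg.norm(C1.mean(axis=0) - C2.mean(axis=0))

def nearest_neighbor_distance(C1, C2):
    return cdist(C1, C2).min()

def farthest_neighbor_distance(C1, C2):
    return cdist(C1, C2).max()

def average_neighbor_distance(C1, C2):
    return cdist(C1, C2).mean()

C1 = np.array([[0.0, 0.0], [1.0, 0.0]])    # cluster with n1 = 2 points
C2 = np.array([[4.0, 3.0], [5.0, 3.0]])    # cluster with n2 = 2 points
print(mean_distance(C1, C2))               # distance between the means
print(nearest_neighbor_distance(C1, C2))   # single-link flavor
print(farthest_neighbor_distance(C1, C2))  # complete-link flavor
print(average_neighbor_distance(C1, C2))   # group-average flavor
```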
17.2.3.5. Statistical Distance#
17.2.3.6. Lance-Williams Formula#
Consider a step in an agglomerative hierarchical algorithm where clusters $C_i$ and $C_j$ are merged into a new cluster $C_i \cup C_j$. The following formula provides a recursive definition such that we can compute the distance between the merged cluster and any other cluster $C_k$ from distances already available before the merger:

$$
d(C_i \cup C_j, C_k)
= \alpha_i \, d(C_i, C_k) + \alpha_j \, d(C_j, C_k)
+ \beta \, d(C_i, C_j) + \gamma \, | d(C_i, C_k) - d(C_j, C_k) |,
$$

where the parameters $\alpha_i, \alpha_j, \beta, \gamma$ determine the merging strategy. Some common choices of the parameters are listed below.
| Algorithm | $\alpha_i$ | $\alpha_j$ | $\beta$ | $\gamma$ |
| --- | --- | --- | --- | --- |
| Single-link | $\frac{1}{2}$ | $\frac{1}{2}$ | $0$ | $-\frac{1}{2}$ |
| Complete-link | $\frac{1}{2}$ | $\frac{1}{2}$ | $0$ | $\frac{1}{2}$ |
| Group-average | $\frac{n_i}{n_i + n_j}$ | $\frac{n_j}{n_i + n_j}$ | $0$ | $0$ |
| Weighted group average | $\frac{1}{2}$ | $\frac{1}{2}$ | $0$ | $0$ |
| Centroid | $\frac{n_i}{n_i + n_j}$ | $\frac{n_j}{n_i + n_j}$ | $-\frac{n_i n_j}{(n_i + n_j)^2}$ | $0$ |
| Median | $\frac{1}{2}$ | $\frac{1}{2}$ | $-\frac{1}{4}$ | $0$ |

Here $n_i$ and $n_j$ denote the sizes of $C_i$ and $C_j$. For the centroid and median strategies, $d$ is taken to be the squared Euclidean distance.
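As an illustrative sketch (names ours), the following function applies one Lance-Williams update; with the single-link parameters from the table it reproduces $\min(d(C_i, C_k), d(C_j, C_k))$:

```python
def lance_williams(d_ik, d_jk, d_ij, alpha_i, alpha_j, beta, gamma):
    """Distance from the merged cluster Ci ∪ Cj to another cluster Ck."""
    return (alpha_i * d_ik + alpha_j * d_jk
            + beta * d_ij + gamma * abs(d_ik - d_jk))

# single-link parameters: alpha_i = alpha_j = 1/2, beta = 0, gamma = -1/2
d = lance_williams(d_ik=2.0, d_jk=5.0, d_ij=3.0,
                   alpha_i=0.5, alpha_j=0.5, beta=0.0, gamma=-0.5)
print(d)  # 2.0 == min(d_ik, d_jk), as expected for single-link
```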