# Similarity Measures

## Contents

# 17.2. Similarity Measures#

This section provides a review of popular similarity and dissimilarity measures in literature.

## 17.2.1. Introduction#

### 17.2.1.1. Similarity Function#

The following is an informal definition of a similarity function.

(Similarity function)

Let \(\bX\) be a dataset of \(n\) data points sampled from
the Euclidean space \(\RR^m\). Then a *similarity function*
is some function \(s : \RR^m \times \RR^m \to \RR\) for
any pair of data points \(\bx, \by \in \RR^m\)
with a structure

which grows as the data points look similar to each other and satisfies the following properties

\(0 \leq s(\bx, \by) \leq 1\).

\(s (\bx, \bx) = 1\) for every \(\bx \in \RR^m\).

\(s(\bx, \by) = s(\by, \bx)\).

Normally, we require that the similarity function is symmetric; i.e.,

There are rare cases where asymmetric similarity functions have been used.

(Similarity from metrics)

Let \(d\) be a metric (distance function) defined on \(\RR^n\). Then we have for any \(\bx, \by, \bz \in \RR^m\),

\(d (\bx, \bx) = 0\).

\(d (\bx, \by) \geq 0\).

\(d (\bx, \by) = d(\by, \bx)\).

\(d (\bx, \by) \leq d(\bx, \bz) + d(\bz, \by)\).

One way to induce a similarity function is:

We can see that

Since \(\exp(t) \geq 0\) for every \(t \in \RR\), hence \(s(\bx, \by) \geq 0\).

Since \(d(\bx, \by) \geq 0\), hence \(s(\bx, \by) \leq 1\).

Since \(d\) is symmetric, hence \(s\) is symmetric.

Since \(d(\bx, \bx) = 0\), hence \(s (\bx, \bx) = 1\).

Thus, \(s\) is indeed a similarity function.

Note that a metric satisfies the triangle inequality. However, this property was not required for the similarity function induced from the metric. Thus, a metric imposes an additional structure which is more stringent than the needs for a similarity measure.

For example, the squared distance function \(d^2\) satisfies the identity of indiscernibles, is nonnegative and symmetric. It doesnâ€™t satisfy triangle inequality. However, we can construct a similarity function

### 17.2.1.2. Proximity Matrices#

(Distance matrix)

Let \(d\) be a metric on the space \(\RR^m\).
Let \(\bX\) be a dataset of \(n\) points
\(\{ \bx_1, \dots, \bx_n \}\). Then the
*distance matrix* for \(\bX\) is defined as

where \(d_{i j} = d(\bx_i, \bx_j)\).

(Similarity matrix)

Let \(s\) be a similarity function on the space \(\RR^m\).
Let \(\bX\) be a dataset of \(n\) points
\(\{ \bx_1, \dots, \bx_n \}\). Then the
*similarity matrix* for \(\bX\) is defined as

where \(s_{i j} = s(\bx_i, \bx_j)\).

(Proximity matrix)

Let \(\bX\) be a dataset of \(n\) points
\(\{ \bx_1, \dots, \bx_n \}\). Then an \(n \times n\)
square matrix \(\bM\) where every
\((i,j)\)-th element is some measure of similarity
or distance between \(\bx_i\) and \(\bx_j\)
is known as a *proximity matrix*.

In other words, a proximity matrix is either a distance matrix or a similarity matrix.

Each entry in a proximity matrix is called
a *proximity index*.

(Asymmetry in proximity)

There are several problems where a proximity metric may not be symmetric. For example, when the objects in the dataset are locations in a hilly area and the measure of the distance is the time taken in going from point A to point B, the time taken may be quite different when going uphill or downhill. Such problems do give rise to asymmetrical proximity matrices. However, we will normally be concerned with symmetric cases.

(Proximity graph)

Let \(\bX\) be a dataset of \(n\) points
\(\{ \bx_1, \dots, \bx_n \}\).
Let \(\bM\) be a proximity matrix associated
with the dataset \(\bX\).
Then the corresponding *proximity graph*
\(\GGG\) is a weighted graph whose
nodes/vertices are the data points \(\bx_i\)
and the edges have weights equal to the
proximity indices.

A symmetric proximity matrix leads to an undirected graph while an asymmetric proximity matrix leads to a directed graph.

### 17.2.1.3. Scatter and Covariance#

(Scatter matrix)

Let \(\bX\) be a dataset of \(n\) points \(\{ \bx_1, \dots, \bx_n \}\) sampled from the Euclidean space \(\RR^n\). Let \(\bar{\bx}\) be the arithmetic mean of the data points.

Then the scatter matrix for \(\bX\) is defined as

(Statistical scatter)

The trace of the scatter matrix is known as the
*statistical scatter* of the data set.
It is given by

Recall that \(\Trace(\bA \bB) = \Trace (\bB \bA)\).

Hence

since this quantity is a scalar.

Let \(\CCC = \{C_1, \dots, C_k \}\) be a (hard) clustering (partition) of \(\bX\)
into \(k\) clusters.
Then for a given cluster \(C_j\), the *within-scatter-matrix* is
given by

where \(\bz_j\) is the mean of the cluster \(C_i\)

The *within-cluster* scatter matrix of the partition
is given by

The *within-cluster* scatter is the trace of this matrix given by

The *between-cluster* scatter matrix for the partition \(\CCC\) is
given by

We shall denote the dataset after mean removal as

where the matrix vector subtraction is done according to the standard broadcasting rules.

The covariance matrix of the dataset is given by

Sample covariance matrix is often used in statistics which is given by

## 17.2.2. Common Proximity Measures#

We look at some of the common proximity measures for numerical data.

### 17.2.2.1. Euclidean Distance#

### 17.2.2.2. Squared Euclidean Distance#

### 17.2.2.3. Manhattan Distance#

This is also known as city-block distance

Manhattan segmental distance is a variant in which only some of the dimensions selected by a subset \(P \subseteq \{ 1,\dots, m\}\) are used.

### 17.2.2.4. Maximum Distance#

This is also known as the *Chebyshev distance*.

### 17.2.2.5. Minkowski Distance#

This is the generalization of Euclidean, Manhattan and Chebyshev distances for some \(p \geq 1\).

### 17.2.2.6. Mahalanobis Distance#

As discussed in Example 17.5, Mahalanobis distance can address the issues related to the scale differences between different features and the correlation among features. This is given by

Mahalanobis distance is invariant to nonsingular transformations. Let \(\bC\) be any \(m \times m\) non-singular matrix.

Let \(\bY = \{ \by_1, \dots, \by_n \}\) be a new dataset constructed by transforming the dataset \(\bX\) as

Then

### 17.2.2.7. Chord Distance#

We normalize the data points to project them to the unit sphere in \(\RR^m\). Then we measure the length of the chord between the two points on the sphere. It is given by

It is easy to see how this formula is arrived at. Assume \(\bx, \by\) lie on the unit sphere. Then

When they are not normalized, we normalize them

Finally, for computing the length of the chord, we need to take the square root.

### 17.2.2.8. Geodesic Distance#

Once the data points have been normalized to the
unit sphere, one can calculate the shortest path (arc)
between any two data points along the surface
of the unit sphere. This is known as the
*geodesic distance*. It is given by

### 17.2.2.9. Cosine Similarity#

The cosine similarity between two points is given by

It is easy to see that cosine similarity is bound between \(0\) and \(1\), is symmetric and \(s_{\cos}(\bx, \bx) = 1\).

## 17.2.3. Proximity Between Clusters#

Many clustering algorithms are hierarchical. In the context of agglomerative hierarchical clustering, one needs to merge clusters which are similar to each other in each iteration. Thus, it is imperative to have some measures of similarity or dissimilarity between clusters.

In this subsection, we shall consider two arbitrary clusters of the dataset \(\bX\) denoted by \(C_1\) and \(C_2\) where

Let the mean of \(C_1\) be given by

Let the mean of \(C_2\) be given by

### 17.2.3.1. Mean Based Distance#

The mean based distance between two clusters is defined as

### 17.2.3.2. Nearest Neighbor Distance#

### 17.2.3.3. Farthest Neighbor Distance#

### 17.2.3.4. Average Neighbor Distance#

### 17.2.3.5. Statistical Distance#

### 17.2.3.6. Lance-Williams Formula#

Consider a step in an agglomerative hierarchical algorithm where clusters \(C_i\) and \(C_j\) are being merged into a new cluster \(C\) and we wish to consider the distance between another cluster \(C_k\) and the new cluster \(C\).

The following formula provides a recursive definition such that we can compute the distance \(d(C_k, C)\) using existing distance information.

where \(d\) is some distance function between two clusters.

Some common choices of the parameters \(\alpha_i, \alpha_j, \beta, \gamma\) are given in the following table.

Algorithm |
\(\alpha_i\) |
\(\alpha_j\) |
\(\beta\) |
\(\gamma\) |

Single-link |
\(\frac{1}{2}\) |
\(\frac{1}{2}\) |
\(0\) |
\(-\frac{1}{2}\) |

Complete-link |
\(\frac{1}{2}\) |
\(\frac{1}{2}\) |
\(0\) |
\(\frac{1}{2}\) |

Group-average |
\(\frac{n_i}{n_i + n_j}\) |
\(\frac{n_j}{n_i + n_j}\) |
\(0\) |
\(0\) |

Weighted group average |
\(\frac{1}{2}\) |
\(\frac{1}{2}\) |
\(0\) |
\(0\) |

Centroid |
\(\frac{n_i}{n_i + n_j}\) |
\(\frac{n_j}{n_i + n_j}\) |
\(\frac{-n_i n_j}{(n_i + n_j)^2}\) |
\(0\) |

Median |
\(\frac{1}{2}\) |
\(\frac{1}{2}\) |
\(-\frac{1}{4}\) |
\(0\) |