17.2. Similarity Measures

This section provides a review of popular similarity and dissimilarity measures in the literature.

17.2.1. Introduction

17.2.1.1. Similarity Function

The following is an informal definition of a similarity function.

Definition 17.5 (Similarity function)

Let $X$ be a dataset of $n$ data points sampled from the Euclidean space $\mathbb{R}^m$. Then a similarity function is a function $s : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}$ which, for any pair of data points $x, y \in \mathbb{R}^m$, has the structure

$$
s(x, y) = s(x_1, \dots, x_m, y_1, \dots, y_m),
$$

grows as the data points look more similar to each other, and satisfies the following properties:

  1. $0 \leq s(x, y) \leq 1$.

  2. $s(x, x) = 1$ for every $x \in \mathbb{R}^m$.

  3. $s(x, y) = s(y, x)$.

Normally, we require the similarity function to be symmetric; i.e.,

$$
s(x, y) = s(y, x).
$$

There are rare cases where asymmetric similarity functions have been used.

Example 17.6 (Similarity from metrics)

Let $d$ be a metric (distance function) defined on $\mathbb{R}^m$. Then for any $x, y, z \in \mathbb{R}^m$ we have

  1. $d(x, x) = 0$.

  2. $d(x, y) \geq 0$.

  3. $d(x, y) = d(y, x)$.

  4. $d(x, y) \leq d(x, z) + d(z, y)$.

One way to induce a similarity function from the metric is

$$
s(x, y) = \exp(-d(x, y)).
$$

We can see that

  1. since $\exp(t) > 0$ for every $t \in \mathbb{R}$, we have $s(x, y) \geq 0$;

  2. since $d(x, y) \geq 0$, we have $s(x, y) \leq 1$;

  3. since $d$ is symmetric, $s$ is symmetric;

  4. since $d(x, x) = 0$, we have $s(x, x) = 1$.

Thus, $s$ is indeed a similarity function.

Note that a metric satisfies the triangle inequality. However, this property was not required for the similarity function induced from the metric. Thus, a metric imposes an additional structure that is more stringent than the needs of a similarity measure.

For example, the squared distance function $d^2$ satisfies the identity of indiscernibles and is nonnegative and symmetric, but it does not satisfy the triangle inequality. We can nevertheless construct a similarity function

$$
s(x, y) = \exp(-d^2(x, y)).
$$
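As a quick check of these properties, here is a minimal sketch (assuming NumPy and SciPy are available; the array names are illustrative) that builds similarities from pairwise Euclidean distances via $s = \exp(-d)$.

```python
import numpy as np
from scipy.spatial.distance import cdist

# a small dataset of n = 4 points in R^2 (rows are data points)
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [3.0, 4.0]])

D = cdist(X, X, metric='euclidean')   # pairwise distances d(x_i, x_j)
S = np.exp(-D)                        # induced similarities s = exp(-d)

assert np.allclose(np.diag(S), 1.0)   # s(x, x) = 1
assert np.all((S >= 0) & (S <= 1))    # 0 <= s(x, y) <= 1
assert np.allclose(S, S.T)            # symmetry
```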

17.2.1.2. Proximity Matrices

Definition 17.6 (Distance matrix)

Let $d$ be a metric on the space $\mathbb{R}^m$. Let $X$ be a dataset of $n$ points $\{x_1, \dots, x_n\}$. Then the distance matrix for $X$ is defined as

$$
M_{\text{dist}}(X) = \begin{bmatrix}
0 & d_{12} & \dots & d_{1n} \\
d_{21} & 0 & \dots & d_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
d_{n1} & d_{n2} & \dots & 0
\end{bmatrix}
$$

where $d_{ij} = d(x_i, x_j)$.

Definition 17.7 (Similarity matrix)

Let $s$ be a similarity function on the space $\mathbb{R}^m$. Let $X$ be a dataset of $n$ points $\{x_1, \dots, x_n\}$. Then the similarity matrix for $X$ is defined as

$$
M_{\text{sim}}(X) = \begin{bmatrix}
1 & s_{12} & \dots & s_{1n} \\
s_{21} & 1 & \dots & s_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
s_{n1} & s_{n2} & \dots & 1
\end{bmatrix}
$$

where $s_{ij} = s(x_i, x_j)$.

Definition 17.8 (Proximity matrix)

Let $X$ be a dataset of $n$ points $\{x_1, \dots, x_n\}$. Then an $n \times n$ square matrix $M$ whose $(i, j)$-th element is some measure of similarity or distance between $x_i$ and $x_j$ is known as a proximity matrix.

In other words, a proximity matrix is either a distance matrix or a similarity matrix.

Each entry in a proximity matrix is called a proximity index.
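As an illustration, here is a minimal sketch of assembling the two proximity matrices directly from their definitions; `proximity_matrix` is a hypothetical helper name and NumPy is assumed to be available.

```python
import numpy as np

def proximity_matrix(X, pairwise):
    """Build the n x n matrix M with M[i, j] = pairwise(x_i, x_j)."""
    n = len(X)
    M = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            M[i, j] = pairwise(X[i], X[j])
    return M

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])

d = lambda x, y: np.linalg.norm(x - y)       # Euclidean metric
s = lambda x, y: np.exp(-d(x, y))            # induced similarity

M_dist = proximity_matrix(X, d)   # distance matrix: zeros on the diagonal
M_sim = proximity_matrix(X, s)    # similarity matrix: ones on the diagonal
```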

Remark 17.1 (Asymmetry in proximity)

There are several problems where a proximity measure may not be symmetric. For example, when the objects in the dataset are locations in a hilly area and the measure of distance is the time taken to go from point A to point B, the time taken may be quite different going uphill versus downhill. Such problems do give rise to asymmetric proximity matrices. However, we will normally be concerned with symmetric cases.

Definition 17.9 (Proximity graph)

Let $X$ be a dataset of $n$ points $\{x_1, \dots, x_n\}$. Let $M$ be a proximity matrix associated with the dataset $X$. Then the corresponding proximity graph $G$ is a weighted graph whose nodes/vertices are the data points $x_i$ and whose edges have weights equal to the corresponding proximity indices.

A symmetric proximity matrix leads to an undirected graph while an asymmetric proximity matrix leads to a directed graph.
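A proximity graph can be built with any graph library. The sketch below assumes `networkx` is installed and treats a symmetric proximity matrix as the edge weights of an undirected graph; for an asymmetric matrix one would use `nx.DiGraph` instead.

```python
import numpy as np
import networkx as nx

def proximity_graph(M):
    """Undirected weighted graph from a symmetric n x n proximity matrix."""
    n = M.shape[0]
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            G.add_edge(i, j, weight=M[i, j])
    return G

M = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 2.2],
              [2.0, 2.2, 0.0]])   # a small symmetric distance matrix
G = proximity_graph(M)
print(G[0][1]['weight'])          # 1.0
```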

17.2.1.3. Scatter and Covariance

Definition 17.10 (Scatter matrix)

Let $X$ be a dataset of $n$ points $\{x_1, \dots, x_n\}$ sampled from the Euclidean space $\mathbb{R}^m$. Let $\bar{x}$ be the arithmetic mean of the data points:

$$
\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i.
$$

Then the scatter matrix for $X$ is defined as

$$
M_t(X) = \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T.
$$

Definition 17.11 (Statistical scatter)

The trace of the scatter matrix is known as the statistical scatter of the dataset. It is given by

$$
\text{tr}(M_t(X)) = \sum_{i=1}^n (x_i - \bar{x})^T (x_i - \bar{x}).
$$

Recall that $\text{tr}(AB) = \text{tr}(BA)$ and that the trace is linear.

Hence

$$
\text{tr}(M_t(X))
= \text{tr}\left( \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T \right)
= \sum_{i=1}^n \text{tr}\left( (x_i - \bar{x})(x_i - \bar{x})^T \right)
= \sum_{i=1}^n (x_i - \bar{x})^T (x_i - \bar{x})
$$

since each term $(x_i - \bar{x})^T (x_i - \bar{x})$ is a scalar.
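The following NumPy sketch, with a small illustrative dataset, computes the scatter matrix from its definition and verifies that its trace equals the sum of squared deviations.

```python
import numpy as np

# rows of X are the n data points x_1, ..., x_n in R^m
X = np.array([[1.0, 2.0],
              [3.0, 0.0],
              [5.0, 4.0]])
x_bar = X.mean(axis=0)                       # arithmetic mean
D = X - x_bar                                # deviations x_i - x_bar

M_t = sum(np.outer(d, d) for d in D)         # scatter matrix M_t(X)
scatter = np.trace(M_t)                      # statistical scatter

# the trace equals the sum of squared deviations
assert np.isclose(scatter, np.sum(D ** 2))
```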

Let $C = \{C_1, \dots, C_k\}$ be a (hard) clustering (partition) of $X$ into $k$ clusters. Then for a given cluster $C_j$, the within-scatter matrix is given by

$$
M_t(C_j) = \sum_{i \in C_j} (x_i - z_j)(x_i - z_j)^T
$$

where $z_j$ is the mean of the cluster $C_j$

$$
z_j = \frac{1}{|C_j|} \sum_{i \in C_j} x_i.
$$

The within-cluster scatter matrix of the partition is given by

$$
M_w(C) = \sum_{j=1}^k M_t(C_j) = \sum_{j=1}^k \sum_{i \in C_j} (x_i - z_j)(x_i - z_j)^T. \tag{17.2}
$$

The within-cluster scatter is the trace of this matrix, given by

$$
\text{tr}(M_w(C)) = \sum_{j=1}^k \sum_{i \in C_j} (x_i - z_j)^T (x_i - z_j). \tag{17.3}
$$

The between-cluster scatter matrix for the partition $C$ is given by

$$
M_b(C) = M_t(X) - M_w(C).
$$
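Here is a minimal sketch, assuming the partition is encoded as an integer label per data point, that computes the within-cluster scatter matrix (17.2) and the between-cluster scatter matrix; the helper names are illustrative.

```python
import numpy as np

def scatter_matrix(X):
    """Scatter matrix of the rows of X (data points as rows)."""
    D = X - X.mean(axis=0)
    return D.T @ D

def within_between_scatter(X, labels):
    """Within- and between-cluster scatter matrices for a hard partition."""
    M_t = scatter_matrix(X)                          # total scatter M_t(X)
    M_w = sum(scatter_matrix(X[labels == j])         # (17.2)
              for j in np.unique(labels))
    M_b = M_t - M_w                                  # between-cluster scatter
    return M_w, M_b

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
labels = np.array([0, 0, 1, 1])
M_w, M_b = within_between_scatter(X, labels)
print(np.trace(M_w), np.trace(M_b))   # within- and between-cluster scatter
```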

We shall denote the dataset after mean removal as

$$
\tilde{X} = X - \bar{x}
$$

where the matrix-vector subtraction is done according to the standard broadcasting rules.

The covariance matrix of the dataset is given by

$$
\Sigma = \frac{1}{n} \tilde{X} \tilde{X}^T. \tag{17.4}
$$

The sample covariance matrix, which is often used in statistics, is given by

$$
S = \frac{1}{n-1} \tilde{X} \tilde{X}^T = \frac{n}{n-1} \Sigma. \tag{17.5}
$$
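The sketch below follows the convention of (17.4) that the data points are the columns of $X$; it performs mean removal by broadcasting and cross-checks the sample covariance against `np.cov`.

```python
import numpy as np

# data points as the columns of X (an m x n matrix), matching (17.4)
X = np.array([[1.0, 3.0, 5.0],
              [2.0, 0.0, 4.0]])
n = X.shape[1]

x_bar = X.mean(axis=1, keepdims=True)        # mean as an m x 1 column
X_tilde = X - x_bar                          # mean removal via broadcasting

Sigma = (X_tilde @ X_tilde.T) / n            # covariance matrix (17.4)
S = (X_tilde @ X_tilde.T) / (n - 1)          # sample covariance matrix (17.5)

# np.cov treats columns as observations and uses the n - 1 normalization
assert np.allclose(S, np.cov(X))
```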

17.2.2. Common Proximity Measures

We look at some of the common proximity measures for numerical data. In what follows, $x, y \in \mathbb{R}^m$ denote two data points.

17.2.2.1. Euclidean Distance

$$
d_2(x, y) = \sqrt{\sum_{i=1}^m (x_i - y_i)^2} = \sqrt{(x - y)^T (x - y)}. \tag{17.6}
$$

17.2.2.2. Squared Euclidean Distance

$$
d_2^2(x, y) = \sum_{i=1}^m (x_i - y_i)^2 = (x - y)^T (x - y). \tag{17.7}
$$

17.2.2.3. Manhattan Distance

This is also known as the city-block distance:

$$
d_1(x, y) = \sum_{i=1}^m |x_i - y_i|. \tag{17.8}
$$

The Manhattan segmental distance is a variant in which only the dimensions selected by a subset $P \subseteq \{1, \dots, m\}$ are used:

$$
d_P(x, y) = \frac{1}{|P|} \sum_{i \in P} |x_i - y_i|. \tag{17.9}
$$

17.2.2.4. Maximum Distance

This is also known as the Chebyshev distance:

$$
d_{\max}(x, y) = \max_{1 \leq i \leq m} |x_i - y_i|. \tag{17.10}
$$

17.2.2.5. Minkowski Distance

This is the generalization of the Euclidean, Manhattan and Chebyshev distances for some $p \geq 1$:

$$
d_p(x, y) = \left( \sum_{i=1}^m |x_i - y_i|^p \right)^{\frac{1}{p}}. \tag{17.11}
$$
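A short NumPy sketch of the Minkowski family follows; it shows that $p = 1$ and $p = 2$ recover the Manhattan and Euclidean distances of (17.8) and (17.6). The vectors are illustrative.

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance (17.11) for p >= 1."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

d_euclid = np.sqrt(np.sum((x - y) ** 2))     # Euclidean distance (17.6)
d_sq = np.sum((x - y) ** 2)                  # squared Euclidean distance (17.7)
d_city = np.sum(np.abs(x - y))               # Manhattan distance (17.8)
d_cheb = np.max(np.abs(x - y))               # Chebyshev distance (17.10)

assert np.isclose(minkowski(x, y, 1), d_city)
assert np.isclose(minkowski(x, y, 2), d_euclid)
```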

17.2.2.6. Mahalanobis Distance

As discussed in Example 17.5, the Mahalanobis distance can address the issues related to the scale differences between different features and the correlation among features. It is given by

$$
d_{\text{mah}}(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}. \tag{17.12}
$$

The Mahalanobis distance is invariant to nonsingular linear transformations. Let $C$ be any $m \times m$ nonsingular matrix.

Let $Y = \{y_1, \dots, y_n\}$ be a new dataset constructed by transforming the dataset $X$ as

$$
y_i = C x_i.
$$

Then

$$
d_{\text{mah}}(x_i, x_j) = d_{\text{mah}}(y_i, y_j)
$$

where the distance on the right-hand side is computed with the covariance matrix of $Y$.
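The following sketch, using a randomly generated illustrative dataset, estimates $\Sigma$ from the data and checks the invariance claim; as noted above, the covariance of the transformed dataset is used on the right-hand side.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                # rows are data points in R^3

def mahalanobis(x, y, Sigma_inv):
    d = x - y
    return np.sqrt(d @ Sigma_inv @ d)        # (17.12)

Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))

C = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.5, 3.0]])              # an arbitrary nonsingular matrix
Y = X @ C.T                                  # y_i = C x_i for each row x_i
Sigma_inv_Y = np.linalg.inv(np.cov(Y, rowvar=False))

# the distance is preserved when Y's own covariance matrix is used
assert np.isclose(mahalanobis(X[0], X[1], Sigma_inv),
                  mahalanobis(Y[0], Y[1], Sigma_inv_Y))
```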

17.2.2.7. Chord Distance

We normalize the data points to project them to the unit sphere in $\mathbb{R}^m$ and then measure the length of the chord between the two points on the sphere. It is given by

$$
d_{\text{chord}}(x, y) = \sqrt{2 - 2 \frac{\langle x, y \rangle}{\| x \|_2 \| y \|_2}}. \tag{17.13}
$$

It is easy to see how this formula is arrived at. Assume that $x, y$ lie on the unit sphere. Then

$$
\langle x - y, x - y \rangle = \| x \|_2^2 + \| y \|_2^2 - 2 \langle x, y \rangle = 2 - 2 \langle x, y \rangle.
$$

When the points are not normalized, we normalize the inner product:

$$
2 - 2 \frac{\langle x, y \rangle}{\| x \|_2 \| y \|_2}.
$$

Finally, for computing the length of the chord, we take the square root.

17.2.2.8. Geodesic Distance

Once the data points have been normalized to the unit sphere, one can calculate the shortest path (arc) between any two data points along the surface of the unit sphere. This is known as the geodesic distance. It is given by

$$
d_{\text{geo}}(x, y) = \arccos\left( 1 - \frac{d_{\text{chord}}^2(x, y)}{2} \right). \tag{17.14}
$$
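A small sketch of the chord and geodesic distances of (17.13) and (17.14) follows; `np.clip` guards the argument of `arccos` against floating-point round-off. The example vectors are illustrative.

```python
import numpy as np

def chord_distance(x, y):
    """Chord distance (17.13) between the unit-sphere projections of x and y."""
    cos_xy = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.sqrt(2.0 - 2.0 * cos_xy)

def geodesic_distance(x, y):
    """Arc length (17.14) between the unit-sphere projections of x and y."""
    d = chord_distance(x, y)
    return np.arccos(np.clip(1.0 - d ** 2 / 2.0, -1.0, 1.0))

x = np.array([1.0, 0.0])
y = np.array([1.0, 1.0])
print(chord_distance(x, y))      # ~0.765
print(geodesic_distance(x, y))   # ~0.785, i.e. pi/4
```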

17.2.2.9. Cosine Similarity

The cosine similarity between two points is given by

$$
s_{\cos}(x, y) = \frac{| \langle x, y \rangle |}{\| x \|_2 \| y \|_2}. \tag{17.15}
$$

It is easy to see that the cosine similarity is bounded between 0 and 1, is symmetric, and satisfies $s_{\cos}(x, x) = 1$.
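A minimal sketch of (17.15); the function name and vectors are illustrative.

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine similarity (17.15); the absolute value keeps it in [0, 1]."""
    return np.abs(np.dot(x, y)) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(x, y))   # 1.0 since x and y are collinear
print(cosine_similarity(x, x))   # 1.0
```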

17.2.3. Proximity Between Clusters

Many clustering algorithms are hierarchical. In the context of agglomerative hierarchical clustering, one needs to merge clusters which are similar to each other in each iteration. Thus, it is imperative to have some measures of similarity or dissimilarity between clusters.

In this subsection, we shall consider two arbitrary clusters of the dataset $X$ denoted by $C_1$ and $C_2$ where

$$
C_1 = \{ y_1, \dots, y_r \} \text{ and } C_2 = \{ z_1, \dots, z_s \}.
$$

Let the mean of $C_1$ be given by

$$
\bar{y} = \frac{1}{r} \sum_{i=1}^r y_i.
$$

Let the mean of $C_2$ be given by

$$
\bar{z} = \frac{1}{s} \sum_{i=1}^s z_i.
$$

17.2.3.1. Mean Based Distance

The mean based distance between two clusters is defined as

$$
d_{\text{mean}}(C_1, C_2) = d(\bar{y}, \bar{z}). \tag{17.16}
$$

17.2.3.2. Nearest Neighbor Distance

$$
d_{nn}(C_1, C_2) = \min_{1 \leq i \leq r, \, 1 \leq j \leq s} d(y_i, z_j). \tag{17.17}
$$

17.2.3.3. Farthest Neighbor Distance

$$
d_{fn}(C_1, C_2) = \max_{1 \leq i \leq r, \, 1 \leq j \leq s} d(y_i, z_j). \tag{17.18}
$$

17.2.3.4. Average Neighbor Distance

$$
d_{\text{ave}}(C_1, C_2) = \frac{1}{rs} \sum_{i=1}^r \sum_{j=1}^s d(y_i, z_j). \tag{17.19}
$$

17.2.3.5. Statistical Distance

$$
d_{\text{stat}}(C_1, C_2) = \frac{rs}{r+s} (\bar{y} - \bar{z})^T (\bar{y} - \bar{z}). \tag{17.20}
$$
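The cluster-to-cluster distances (17.16) through (17.20) can all be computed from a matrix of pairwise distances. Here is a minimal sketch with two illustrative clusters, assuming SciPy's `cdist` supplies the pairwise Euclidean distances.

```python
import numpy as np
from scipy.spatial.distance import cdist

# two illustrative clusters; rows are data points
C1 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # r = 3 points
C2 = np.array([[4.0, 4.0], [5.0, 5.0]])               # s = 2 points

D = cdist(C1, C2)                        # all pairwise distances d(y_i, z_j)
y_bar, z_bar = C1.mean(axis=0), C2.mean(axis=0)
r, s = len(C1), len(C2)

d_mean = np.linalg.norm(y_bar - z_bar)                        # (17.16)
d_nn = D.min()                                                # (17.17)
d_fn = D.max()                                                # (17.18)
d_ave = D.mean()                                              # (17.19)
d_stat = r * s / (r + s) * np.sum((y_bar - z_bar) ** 2)       # (17.20)
```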

17.2.3.6. Lance-Williams Formula

Consider a step in an agglomerative hierarchical algorithm where clusters $C_i$ and $C_j$ are being merged into a new cluster $C$ and we wish to compute the distance between another cluster $C_k$ and the new cluster $C$.

The following formula provides a recursive definition such that we can compute the distance $d(C_k, C)$ using existing distance information:

$$
d(C_k, C_i \cup C_j) = \alpha_i d(C_k, C_i) + \alpha_j d(C_k, C_j) + \beta d(C_i, C_j) + \gamma | d(C_k, C_i) - d(C_k, C_j) | \tag{17.21}
$$

where $d$ is some distance function between two clusters.

Some common choices of the parameters $\alpha_i, \alpha_j, \beta, \gamma$ are given in the following table, where $n_i$ and $n_j$ denote the number of points in $C_i$ and $C_j$ respectively.

| Algorithm | $\alpha_i$ | $\alpha_j$ | $\beta$ | $\gamma$ |
|---|---|---|---|---|
| Single-link | $\frac{1}{2}$ | $\frac{1}{2}$ | $0$ | $-\frac{1}{2}$ |
| Complete-link | $\frac{1}{2}$ | $\frac{1}{2}$ | $0$ | $\frac{1}{2}$ |
| Group-average | $\frac{n_i}{n_i + n_j}$ | $\frac{n_j}{n_i + n_j}$ | $0$ | $0$ |
| Weighted group average | $\frac{1}{2}$ | $\frac{1}{2}$ | $0$ | $0$ |
| Centroid | $\frac{n_i}{n_i + n_j}$ | $\frac{n_j}{n_i + n_j}$ | $-\frac{n_i n_j}{(n_i + n_j)^2}$ | $0$ |
| Median | $\frac{1}{2}$ | $\frac{1}{2}$ | $-\frac{1}{4}$ | $0$ |
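A minimal sketch of the recursive update (17.21) follows; the single-link and complete-link rows of the table recover the nearest-neighbor and farthest-neighbor distances, as the assertions check. The distance values are illustrative.

```python
def lance_williams(d_ki, d_kj, d_ij, alpha_i, alpha_j, beta, gamma):
    """Recursive update (17.21) for the distance to the merged cluster."""
    return (alpha_i * d_ki + alpha_j * d_kj
            + beta * d_ij + gamma * abs(d_ki - d_kj))

d_ki, d_kj, d_ij = 3.0, 5.0, 2.0

# single-link: alpha_i = alpha_j = 1/2, beta = 0, gamma = -1/2
assert lance_williams(d_ki, d_kj, d_ij, 0.5, 0.5, 0.0, -0.5) == min(d_ki, d_kj)

# complete-link: alpha_i = alpha_j = 1/2, beta = 0, gamma = +1/2
assert lance_williams(d_ki, d_kj, d_ij, 0.5, 0.5, 0.0, 0.5) == max(d_ki, d_kj)
```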