17.6. Spectral Clustering

Spectral clustering is a graph-based clustering algorithm [84].

We build a graph $G = \{T, W\}$ to obtain the clustering $C$ of $X$.

Each vertex in the graph represents a data point.
Each edge in the graph represents the similarity between two data points.
$T$ denotes the list of vertices.
$W$ denotes the adjacency matrix built from the similarities.

Once the graph has been built, the following steps are performed.

The degree of a vertex $t_{s} \in T$ is defined as $d_{s} = \sum_{j = 1}^{S} w_{s j}$.
The degree matrix $D$ is defined as the diagonal matrix with the degrees $\{d_{s}\}_{s = 1}^{S}$ on its diagonal.
The unnormalized graph Laplacian is defined as $L = D - W$.
The normalized graph Laplacian (see footnote 1) is defined as

$L_{rw} ≜ D^{- 1} L = I - D^{- 1} W$

The subscript $rw$ stands for random walk.
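The definitions above translate almost directly into NumPy. The following is a minimal sketch, assuming a small symmetric similarity matrix with positive degrees; the function name `random_walk_laplacian` and the example matrix are illustrative and not part of the original text.

```python
import numpy as np

def random_walk_laplacian(W):
    """Return (D, L, L_rw) for a symmetric, non-negative similarity matrix W."""
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)                        # degrees d_s = sum_j w_sj
    if np.any(d <= 0):
        raise ValueError("every vertex needs a positive degree")
    D = np.diag(d)                           # degree matrix
    L = D - W                                # unnormalized Laplacian
    L_rw = np.eye(len(d)) - W / d[:, None]   # I - D^{-1} W
    return D, L, L_rw

# Hypothetical 3-vertex similarity matrix, used only for illustration.
W = np.array([[0.0, 1.0, 0.5],
              [1.0, 0.0, 0.0],
              [0.5, 0.0, 0.0]])
D, L, L_rw = random_walk_laplacian(W)
```

Dividing each row of $W$ by its degree avoids forming $D^{- 1}$ explicitly. Every row of $L_{rw}$ sums to zero, which is the reason the all-ones vector is an eigenvector with eigenvalue $0$.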
We compute $L_{rw}$ and examine its eigenstructure to estimate the number of clusters $C$ and the label vector $L$.
If $C$ is known in advance, the first $C$ eigenvectors of $L_{rw}$, corresponding to the smallest eigenvalues, are usually taken, and the rows of the $S \times C$ matrix they form are clustered using the K-means algorithm [71].
Since we make no assumption about the number of clusters, we need to estimate it.
A simple way is to track the eigengap statistic.
After arranging the eigenvalues in increasing order, we choose the number $C$ such that the eigenvalues $λ_{1}, \dots, λ_{C}$ are very small and $λ_{C + 1}$ is large.
This is guided by the theoretical result that if a graph has $C$ connected components, then exactly $C$ eigenvalues of $L_{rw}$ are $0$.
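One simple reading of the eigengap rule is to pick the $C$ with the largest gap $λ_{C + 1} - λ_{C}$. The sketch below does this under that assumption; the function name `eigengap_estimate`, the `max_clusters` parameter and the toy similarity matrix are illustrative choices, not taken from [84].

```python
import numpy as np

def eigengap_estimate(W, max_clusters=None):
    """Return (C, sorted eigenvalues of L_rw) using the largest eigengap."""
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)
    L_rw = np.eye(len(d)) - W / d[:, None]
    # The eigenvalues of L_rw are real; drop round-off imaginary parts.
    eigvals = np.sort(np.linalg.eigvals(L_rw).real)
    if max_clusters is None:
        max_clusters = len(eigvals) - 1
    gaps = np.diff(eigvals[:max_clusters + 1])   # gaps[k - 1] = λ_{k+1} - λ_k
    return int(np.argmax(gaps)) + 1, eigvals

# Two groups of three points: strong links inside a group, weak links across.
W = np.full((6, 6), 0.01)
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)

C, eigvals = eigengap_estimate(W)
print(C)   # 2 for this example
```

Here the two smallest eigenvalues are $0$ and roughly $0.03$, while the remaining ones are close to $1.5$, so the largest gap falls after the second eigenvalue.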
However, when the data points are not clearly separated or noise is present, the gap is less pronounced and this approach becomes tricky.
A more robust approach is to analyze the eigenvectors, as described in [87].
The approach of [87], which uses a slightly different definition of the graph Laplacian $(D^{- 1 / 2} W D^{- 1 / 2})$ [62], has been adapted to work with the Laplacian $L_{rw}$ as defined above.
We estimate the number of clusters from the graph Laplacian.
It can easily be shown that $0$ is an eigenvalue of $L_{rw}$ with eigenvector $1_{S}$ [84].
Further, the multiplicity of the eigenvalue $0$ equals the number of connected components in $G$.
In fact, the adjacency matrix can be factored as

$W = \begin{bmatrix} W_{1} & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & W_{P} \end{bmatrix} \Gamma$

where $W_{p} \in \mathbb{R}^{S_{p} \times S_{p}}$ is the adjacency matrix for the $p$-th connected component of $G$, corresponding to the $p$-th cluster, and $\Gamma$ is the unknown permutation matrix.
The graph Laplacian for each $W_{p}$ has an eigenvalue $0$ with eigenvector $1_{S_{p}}$.
Thus, if we look at the $P$-dimensional eigenspace of $L_{rw}$ corresponding to the eigenvalue $0$, there exists a basis $\hat{V} \in \mathbb{R}^{S \times P}$ such that each row of $\hat{V}$ is a unit vector in $\mathbb{R}^{P}$ and the columns contain $S_{1}, \dots, S_{P}$ ones, respectively.
The eigenvectors actually obtained through any numerical method will be a rotated version of $\hat{V}$, given by $V = \hat{V} R$.
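The multiplicity claim and the structure of the zero eigenspace can be checked numerically. The sketch below is an illustration under stated assumptions: the component sizes and random similarities are made up, the vertices are relabelled by a random permutation to hide the block structure, and the eigenvectors are computed through the symmetric matrix $I - D^{- 1 / 2} W D^{- 1 / 2}$, which shares its eigenvalues with $L_{rw}$.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 3, 5]                            # S_1, ..., S_P (illustrative values)
S, P = sum(sizes), len(sizes)
component = np.repeat(np.arange(P), sizes)   # true component of each vertex

# Similarities are positive inside a component and zero across components.
W = rng.uniform(0.5, 1.0, size=(S, S))
W = (W + W.T) / 2
W *= component[:, None] == component[None, :]
np.fill_diagonal(W, 0.0)

# Relabel the vertices so that the block structure is hidden.
perm = rng.permutation(S)
W, component = W[np.ix_(perm, perm)], component[perm]

# L_rw = D^{-1/2} L_sym D^{1/2}, so both matrices have the same eigenvalues,
# and v = D^{-1/2} u maps an eigenvector u of L_sym to an eigenvector of L_rw.
d = W.sum(axis=1)
L_sym = np.eye(S) - W / np.sqrt(np.outer(d, d))
eigvals, U = np.linalg.eigh(L_sym)           # ascending eigenvalues
num_zero = int(np.sum(np.abs(eigvals) < 1e-8))
V = U[:, :num_zero] / np.sqrt(d)[:, None]

# Each zero-eigenvalue eigenvector of L_rw is constant on every component.
spread = max(np.ptp(V[component == p], axis=0).max() for p in range(P))
print(num_zero, spread < 1e-8)
```

The number of (numerically) zero eigenvalues matches the number of components, and the rows of $V$ belonging to the same component coincide, which is what one expects from a rotated version of the indicator basis $\hat{V}$.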
[87] suggests a cost function over the entries in $V$ such that the cost is minimized when the rows of $V$ are close to coordinate vectors.
It then estimates a rotation matrix, expressed as a product of Givens rotations, that rotates $V$ to minimize the cost.
The parameters of the rotation matrix are the angles of the Givens rotations, which are estimated through gradient descent.
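To make the parametrization concrete, the sketch below builds a rotation from Givens angles and evaluates an alignment cost. The cost used here (squared entries of each row divided by the squared largest entry of that row) is an assumption chosen because it is minimized exactly when every row has a single nonzero entry; it should not be read as the exact function from [87]. The helper names `givens_product` and `alignment_cost` are hypothetical, and the gradient descent over the angles is not shown.

```python
import numpy as np
from itertools import combinations

def givens_product(angles, C):
    """Product of C(C-1)/2 Givens rotations, one per coordinate pair (i, j)."""
    R = np.eye(C)
    for theta, (i, j) in zip(angles, combinations(range(C), 2)):
        G = np.eye(C)
        c, s = np.cos(theta), np.sin(theta)
        G[i, i], G[j, j] = c, c
        G[i, j], G[j, i] = -s, s
        R = R @ G
    return R

def alignment_cost(Z):
    """Assumed cost: sum over rows of squared entries / squared largest entry."""
    M = np.max(np.abs(Z), axis=1, keepdims=True)
    return float(np.sum(Z**2 / M**2))

# Indicator basis for two clusters of sizes 3 and 2 (illustrative).
V_hat = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
R = givens_product([0.7], C=2)           # an arbitrary rotation hiding the structure
V = V_hat @ R

cost_rotated = alignment_cost(V)
cost_aligned = alignment_cost(V @ R.T)   # undoing the rotation recovers V_hat
```

Each row contributes at least $1$ to this cost, with equality only when the row is a multiple of a coordinate vector, so the aligned matrix attains the minimum value $S = 5$ while the rotated one does not.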
Since $P$ is unknown, the algorithm is run over multiple values of $C$ and we choose the value that gives the minimum cost.
Note that we reuse the rotated version of $V$ obtained for a particular value of $C$ when we go on to examine $C + 1$ eigenvectors.
This may appear ad hoc, but it is seen to speed up the convergence of the gradient descent algorithm in the next iteration.
When $S$ is small, we can do a complete SVD of $L_{rw}$ to get the eigenvectors.
However, this is time-consuming when $S$ is large (say, 1000 or more).
An important question is how many eigenvectors we really need to examine.
As $C$ increases, the number of Givens rotation parameters grows as $C (C - 1) / 2$.
Thus, if we examine too many eigenvectors, we lose speed unnecessarily.
We can use the eigengap statistic described above to decide how many eigenvectors to examine.
Finally, we assign a label to each data point to identify the cluster it belongs to.
As described above, we maintain the rotated version of $V$ during the estimation of the rotation matrix.
Once we have zeroed in on the right value of $C$, assigning labels to $x^{s}$ is straightforward.
We simply perform non-maximum suppression on the rows of $V$, i.e., we keep the largest (magnitude) entry in each row of $V$ and set the rest to zero.
The column number of the largest entry in the $s$-th row of $V$ is the label $l_{s}$ for $x^{s}$.
This completes the clustering process.
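The label assignment step can be sketched in a few lines of NumPy. The function name `assign_labels` and the example matrix are illustrative; note that the labels returned here are zero-based column indices.

```python
import numpy as np

def assign_labels(V_rotated):
    """Non-maximum suppression on rows: keep the largest-magnitude entry only."""
    V = np.asarray(V_rotated, dtype=float)
    labels = np.argmax(np.abs(V), axis=1)        # column of the largest entry
    suppressed = np.zeros_like(V)
    rows = np.arange(V.shape[0])
    suppressed[rows, labels] = V[rows, labels]   # all other entries stay zero
    return labels, suppressed

# A hypothetical rotated eigenvector matrix for S = 3 points and C = 2 clusters.
V = np.array([[0.98, 0.05],
              [-0.02, 0.97],
              [0.91, -0.10]])
labels, suppressed = assign_labels(V)
print(labels.tolist())   # [0, 1, 0]
```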

While eigengap-based estimation of the number of clusters is quick, it requires an additional $K$-means step on the first $C$ eigenvectors to assign the labels. In contrast, eigenvector-based estimation of the number of clusters is involved and slow, but it lets us pick the labels very quickly.
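For the eigengap route, the extra $K$-means step can be sketched as follows. This is a minimal illustration under stated assumptions: `spectral_embedding` and `kmeans` are hypothetical helper names, the $K$-means loop is a plain Lloyd iteration with a deterministic farthest-point initialization chosen for reproducibility, and $C = 2$ is supplied by hand for the toy similarity matrix.

```python
import numpy as np

def spectral_embedding(W, C):
    """Rows of the first C eigenvectors of L_rw (smallest eigenvalues)."""
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)
    # Eigenvectors of L_rw are D^{-1/2} times those of the symmetric L_sym.
    L_sym = np.eye(len(d)) - W / np.sqrt(np.outer(d, d))
    _, U = np.linalg.eigh(L_sym)             # ascending eigenvalues
    return U[:, :C] / np.sqrt(d)[:, None]

def kmeans(points, C, n_iter=50):
    """Plain Lloyd iterations with farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, C):
        nearest = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(nearest)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dist, axis=1)
        for c in range(C):
            if np.any(labels == c):          # leave empty clusters unchanged
                centers[c] = points[labels == c].mean(axis=0)
    return labels

# Two groups of three points: strong links inside a group, weak links across.
W = np.full((6, 6), 0.01)
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)

labels = kmeans(spectral_embedding(W, C=2), C=2)
# Vertices 0-2 share one label and vertices 3-5 share the other.
```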

1: We specifically use the random walk version of the normalized graph Laplacian as defined in [84]. There are other ways to define a normalized graph Laplacian.
