# 22.1. Introduction#

When designing a dictionary for a particular application, we have several options . On the one hand we can go through the long list of analytically constructed or tunable dictionaries and select one of them as suitable for the application in concern. On the other hand, we can take up a number of real example signals from our application and try to construct a dictionary which is optimized for these.

Dictionary learning (DL) [30, 75] is a process which attempts to solve the problem of constructing a dictionary directly from the set of example signals. The atoms of a learnt dictionary come from the underlying empirical data of training set of example signals for the specific application. While analytically constructed dictionaries are typically meant for only specific applications, the learning method allows one to construct dictionaries for any family of signals which are amenable to the sparse and redundant representations model. This certainly comes at a cost. Learnt dictionaries have to be held completely in memory explicitly as they happen to be lack any structure. Thus they don’t provide an efficient implementation of analysis ($$\bDDD^H \bx$$) and synthesis ($$\bDDD \ba$$) operators. Thus using them in applications leads to more computational costs.

We start by formalizing the notation for Dictionary Learning (DL).

1. We consider a set of $$S$$ example signals put together in a signal matrix $$\bX \in \CC^{N \times S}$$.

2. Consider a dictionary $$\bDDD \in \CC^{N \times D}$$.

3. Let $$\ba^i \in \CC^{D}$$ be the sparsest possible representation of $$\bx^i$$ in $$\bDDD$$ with

$\bx^i = \bDDD \ba^i + \be^i.$
4. We put all $$\ba^i$$ together in a matrix $$\bA \in \CC^{D \times S}$$.

5. Then we have

$\bX = \bDDD \bA + \bE$

where $$\bE \in \CC^{N \times S}$$ represents approximation error.

6. We are looking for best dictionary from the set of possible dictionaries such that we are able to get sparse representations of $$\bx^i$$ with low approximation error.

7. We can quantify the notion of good approximation by putting an upper bound on the norm of approximation error as $$\| \be^i \|_2 \leq \epsilon$$.

8. Combining these ideas, we introduce the notion of a sparse signal model denoted as $$\MMM_{\bDDD, K, \epsilon}$$ which consists of a dictionary $$\bDDD$$ providing $$K$$-sparse representations for a class of signals (from which the example signals are drawn) with an upper bound on approximation error given by $$\epsilon$$.

9. The DL problem tries to learn best model $$\MMM$$ based on the example signals $$\bX$$.

1. We can formulate the DL problem as an optimization problem as follows:

(22.1)#$\begin{split}& \underset{\bDDD, \{\ba^i \}_{i=1}^{S} }{\text{minimize }} & & \sum_{i=1}^S \| \ba^i \|_0 \\ & \text{subject to } & & \| \bx^i - \bDDD \ba^i \|_2 \leq \epsilon, \; i = 1, \ldots, S.\end{split}$
2. In this version, we are trying to minimize total sparsity of $$\ba^i$$ while using the upper bound on approximation error as optimization constraint.

3. We are not enforcing sparsity constraint that $$\| \ba^i \|_0 \leq K \Forall 1 \leq i \leq S$$.

4. Alternatively we can also write:

(22.2)#$\begin{split} & \underset{\bDDD, \{\ba^i \}_{i=1}^{S} }{\text{minimize }} & & \sum_{i=1}^S \| \bx^i - \bDDD \ba^i \|_2^2 \\ & \text{subject to } & & \| \ba^i \|_0 \leq K, \; i = 1, \ldots, S.\end{split}$
5. In this version we are trying to minimize approximation error while keeping $$K$$-sparsity as optimization constraint.

6. We are not enforcing the constraint that $$\| \be^i \|_2 \leq \epsilon$$.

7. Unfortunately, there doesn’t exist any computationally tractable algorithm to solve these optimization problems.

8. An alternative is to consider a heuristic iterative approach presented in Algorithm 22.1.

Algorithm 22.1 (Dictionary learning: iterative approach)

Initialization:

1. Choose an initial dictionary $$\bDDD^0$$.

2. $$k \leftarrow 0$$ # iteration counter. \Repeat{halting criteria is true}{

Algorithm:

1. If halting criteria is met: break.

2. $$k \leftarrow k + 1$$.

3. Obtain representations $$\ba^i$$ for $$\bx^i$$ with the given $$\bDDD^{k-1}$$ and sparsity level $$K$$ using some sparse approximation algorithm.

4. Obtain new dictionary $$\bDDD^k$$ from the signals $$\bx^i$$ and their representations $$\ba^i$$.

Some possible halting criteria are:

• Stop after a fixed number of iterations.

• Stop when $$\| \be^i \|_2 \leq \epsilon$$ for every example.

• Stop when approximation errors stop improving.

For initialization of the algorithm, we have few options.

• We can start with a randomly generated dictionary.

• We can select $$D$$ examples from $$\bX$$ and put them together as our starting dictionary.

• We can start with some analytical dictionary suitable for the application domain.

## 22.1.1. Dictionary Learning Methods#

Two popular algorithms implementing this approach are MOD (Method of Optimal Directions) and K-SVD.

### 22.1.1.1. Method Of Optimal Directions#

In MOD , dictionary update is formulated as

$\bDDD^k = \text{arg} \underset{\bDDD}{\min} \| \bX - \bDDD \bA^k \|_{F}^2$

A straight forward least squares solution is obtained as

$\bDDD^k = \bX (\bA^k)^{\dag}.$

### 22.1.1.2. K-SVD#

K-SVD  is slightly different. Rather than recomputing whole dictionary at once by solving the LS problem, it updates atoms of $$\bDDD$$ one by one.

1. Let $$j$$ be the index of atom $$\bd_j$$ being updated.

2. Consider the error

$\bE_j = \bX - \sum_{k, k\neq j} \bd_k \bb_k^T$
3. $$\bb_k$$ refers to the $$k$$-th row of $$\bA$$; i.e. entries for atom $$\bd_k$$ in every example.

4. $$\bE_j$$ means the approximation error when the atom $$\bd_j$$ (and corresponding entries in $$\bA$$) has been dropped from consideration.

5. We can see that

$\bX - \bDDD \bA = \bX - \sum_{k=1}^D \bd_k \bb_k^T = \bX - \sum_{k, k\neq j} \bd_k \bb_k^T - \bd_j \bb_j^T = \bE_j - \bd_j \bb_j^T.$
6. Hence

$\| \bX - \bDDD \bA \|_F^2 = \| \bE_j - \bd_j \bb_j^T \|_F^2.$
7. On the L.H.S. we have total approximation error.

8. On the R.H.S. we have the same expressed in terms of $$\bE_j$$ and $$\bd_j$$ where $$\bE_j$$ doesn’t depend on $$\bd_j$$.

9. An optimal $$\bd_j$$ is one which can minimize R.H.S.

10. This can easily be obtained by rank-1 approximation of $$\bE_j$$.

11. The rank-1 approximation gives us both $$\bd_j$$ as well as the entries in the $$j$$-th row of $$\bA$$ given by $$\bb_j$$. There is a small catch though.

12. In general the rank-1 approximation can lead to a dense $$\bb_j$$ meaning the atom $$\bd_j$$ gets used in many signal representations. We wish to avoid that. We don’t want $$\bd_j$$ to appear in many signal representations.

13. For this, we identify representations $$\ba^i$$ in which atom $$\bd_j$$ appears, and let them be indexed by $$\Gamma$$.

14. We then restrict $$\bE_j$$ to signals indexed by $$\Gamma$$ to avoid a dense $$\bb_j$$.

15. Rank-1 approximation can be easily obtained using singular value decomposition.

16. We perform SVD of $$\bE_{j,\Gamma} = \bU \Sigma \bV^H$$.

17. We then pick $$\bu_1, \sigma_1, \bv_1$$.

18. $$\bu_1$$ is the new update for $$\bd_j$$.

19. Further $$\sigma_1 \bv_1$$ is the update for $$\bb_{j, \Gamma}$$.

20. Rest of $$\bb_j$$ is left with zero entries.

21. We repeat the process for each of the atoms in $$\bDDD$$ to obtain next update of $$\bDDD$$.