18.5. Sparse and Redundant Representations#

18.5.1. Dictionaries#

Definition 18.4 (Dictionary)

A dictionary for $$\CC^N$$ is a finite collection $$\bDDD$$ of unit-norm vectors which span the whole space.

The elements of a dictionary are called atoms and they are denoted by $$\phi_{\omega}$$ where $$\omega$$ is drawn from an index set $$\Omega$$.

The dictionary is written as

$\bDDD = \{\phi_{\omega} : \omega \in \Omega \}$

where

$\| \phi_{\omega} \|_2 = 1 \Forall \omega \in \Omega$

and every signal $$\bx \in \CC^N$$ can be expressed as

$\bx = \sum_{\omega \in \Omega} c_{\omega} \phi_{\omega}.$

We use the letter $$D$$ to denote the number of elements in the dictionary; i.e.,

$D = | \Omega |.$

This definition is adapted from [77].

The indices may have an interpretation, such as the time-frequency or time-scale localization of an atom, or they may simply be labels without any underlying meaning.

Note that the dictionary need not provide a unique representation for any vector $$\bx \in \CC^N$$, but it provides at least one representation for each $$\bx \in \CC^N$$.

When $$D=N$$ we have a set of unit-norm vectors which span the whole of $$\CC^N$$; i.e., a basis (not necessarily an orthonormal one). A dictionary cannot have $$D < N$$, since fewer than $$N$$ vectors cannot span $$\CC^N$$. The more interesting case is when $$D > N$$.
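To make the definition concrete, here is a minimal numerical sketch (in Python with NumPy; the random construction of the dictionary is purely illustrative, not from the text) of an over-complete dictionary with unit-norm atoms spanning $$\CC^N$$:

```python
import numpy as np

N, D = 4, 8  # ambient dimension and number of atoms (D > N: over-complete)
rng = np.random.default_rng(0)

# Draw D random complex atoms and normalize each column to unit l2-norm.
Phi = rng.standard_normal((N, D)) + 1j * rng.standard_normal((N, D))
Phi /= np.linalg.norm(Phi, axis=0)

print(np.allclose(np.linalg.norm(Phi, axis=0), 1.0))  # all atoms unit-norm
print(np.linalg.matrix_rank(Phi) == N)                # atoms span C^N
```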

18.5.2. Redundant Dictionaries and Sparse Signals#

With $$D > N$$, clearly there are more atoms than necessary to provide a representation of every signal $$\bx \in \CC^N$$. Thus such a dictionary can provide multiple representations of the same vector $$\bx$$. We call such dictionaries redundant dictionaries or over-complete dictionaries.

In contrast, a basis with $$D=N$$ is called a complete dictionary. A special class of signals consists of those which have a sparse representation in a given dictionary $$\bDDD$$.

Definition 18.5 ($$(\bDDD,K)$$-sparse signals)

A signal $$\bx \in \CC^N$$ is called $$(\bDDD,K)$$-sparse if it can be expressed as a linear combination of at most $$K$$ atoms from the dictionary $$\bDDD$$.

Normally, for sparse signals, we have $$K \ll D$$. It is usually expected that $$K \ll N$$ also holds.

Let $$\Lambda \subset \Omega$$ be a subset of indices with $$|\Lambda|=K$$.

Let $$\bx$$ be any signal in $$\CC^N$$ such that $$\bx$$ can be expressed as

$\bx = \sum_{\lambda \in \Lambda} b_{\lambda} \phi_{\lambda} \quad \text{where } b_{\lambda} \in \CC.$

Note that this need not be the only representation of $$\bx$$ in $$\bDDD$$; it is just one of the possible representations. Its special feature is that it is $$K$$-sparse; i.e., at most $$K$$ atoms from the dictionary are used.

Now there are $$\binom{D}{K}$$ ways in which we can choose a set of $$K$$ atoms from the dictionary $$\bDDD$$.

Thus the set of $$(\bDDD,K)$$-sparse signals is given by

$\Sigma_{(\bDDD,K)} = \{\bx \in \CC^N \ST \bx = \sum_{\lambda \in \Lambda} b_{\lambda} \phi_{\lambda}, \quad \Lambda \subseteq \Omega, | \Lambda | = K \}.$

This set $$\Sigma_{(\bDDD,K)}$$ is dependent on the chosen dictionary $$\bDDD$$. In the sequel, we will simply refer to it as $$\Sigma_K$$.

Example 18.15 ($$K$$-sparse signals for standard basis)

For the special case where $$\bDDD$$ is simply the standard basis of $$\CC^N$$, we have

$\Sigma_K = \{ \bx \in \CC^N \ST \| \bx \|_0 \leq K\};$

i.e., the set of signals which have $$K$$ or fewer non-zero entries.

Example 18.16 ($$K$$-sparse signals for orthonormal basis)

In contrast if we choose an orthonormal basis $$\Psi$$ such that every $$\bx \in \CC^N$$ can be expressed as

$\bx = \Psi \ba$

then with the dictionary $$\bDDD = \Psi$$, the set of $$K$$-sparse signals is given by

$\Sigma_K = \{ \bx = \Psi \ba \ST \| \ba \|_0 \leq K\}.$

We also note that for a specific choice of $$\Lambda \subseteq \Omega$$ with $$|\Lambda| = K$$, the set of vectors

$\{\bx \in \CC^N \ST \bx = \sum_{\lambda \in \Lambda} b_{\lambda} \phi_{\lambda} \} = \span \{\phi_{\lambda} \ST \lambda \in \Lambda \}$

is a subspace of $$\CC^N$$.

Thus the dictionary $$\bDDD$$ gives rise to $$\binom{D}{K}$$ such $$K$$-sparse subspaces, and the $$K$$-sparse signals lie in their union.
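As an illustrative sketch (assuming the same random unit-norm dictionary construction as above, which is not from the text), a $$(\bDDD, K)$$-sparse signal can be synthesized by choosing a support $$\Lambda$$ and coefficients $$b_{\lambda}$$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, K = 6, 12, 2

# An illustrative unit-norm dictionary (columns are the atoms).
Phi = rng.standard_normal((N, D)) + 1j * rng.standard_normal((N, D))
Phi /= np.linalg.norm(Phi, axis=0)

# Choose a support Lambda with |Lambda| = K and coefficients b_lambda.
Lam = rng.choice(D, size=K, replace=False)
b = rng.standard_normal(K) + 1j * rng.standard_normal(K)

# x lies in the K-dimensional subspace span{phi_lambda : lambda in Lambda},
# one of the binom(D, K) K-sparse subspaces.
x = Phi[:, Lam] @ b
```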

18.5.3. Sparse Approximation Problem#

In the sparse approximation problem, we attempt to approximate a given signal $$\bx \in \CC^N$$ as a linear combination of $$K$$ atoms from the dictionary $$\bDDD$$, where $$K \ll N$$ and typically $$N \ll D$$; i.e., the number of atoms in the dictionary $$\bDDD$$ is typically much larger than the ambient signal space dimension $$N$$.

Naturally, we wish to obtain a best possible sparse representation of $$\bx$$ over the atoms $$\phi_{\omega} \in \bDDD$$ which minimizes the approximation error.

Let $$\Lambda$$ denote the index set of atoms which are used to create a $$K$$-sparse representation of $$\bx$$ where $$\Lambda \subset \Omega$$ with $$|\Lambda| = K$$.

Let $$\bx_{\Lambda}$$ denote an approximation of $$\bx$$ over the set of atoms indexed by $$\Lambda$$.

Then we can write $$\bx_{\Lambda}$$ as

$\bx_{\Lambda} = \sum_{\lambda \in \Lambda} b_{\lambda} \phi_{\lambda} \quad \text{where } b_{\lambda} \in \CC.$

We collect all the complex coefficients $$b_{\lambda}$$ in the sum into a vector $$\bb_{\Lambda}$$. The approximation error is given by

$e = \| \bx - \bx_{\Lambda} \|_2.$

Clearly we would like to minimize the approximation error over all possible choices of $$K$$ atoms and corresponding set of coefficients $$\bb_{\Lambda}$$.

Thus the sparse approximation problem can be cast as a minimization problem given by

(18.18)#$\underset{|\Lambda| = K}{\text{min}} \, \underset{\bb_{\Lambda}}{\text{min}} \left \| \bx - \sum_{\lambda \in \Lambda} b_{\lambda} \phi_{\lambda} \right \|_2.$

If we choose a particular $$\Lambda$$, then the inner minimization problem becomes a straightforward least squares problem. But there are $$\binom{D}{K}$$ possible choices of $$\Lambda$$, and solving the inner least squares problem for each of them becomes prohibitively expensive.
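A brute-force sketch of this two-level minimization (a hypothetical helper in Python with NumPy, written here only to make the cost explicit) shows that the inner least squares problem is cheap, while the outer enumeration over supports is what explodes:

```python
import numpy as np
from itertools import combinations

def sparse_approx_exhaustive(Phi, x, K):
    """Brute-force solution of (18.18): solve the inner least squares
    problem for every K-atom support and keep the best. The number of
    supports is binom(D, K), so this is feasible only for tiny D and K."""
    D = Phi.shape[1]
    best_err, best_support, best_coeffs = np.inf, None, None
    for Lam in combinations(range(D), K):
        cols = list(Lam)
        # Inner minimization: ordinary least squares on the chosen atoms.
        b, *_ = np.linalg.lstsq(Phi[:, cols], x, rcond=None)
        err = np.linalg.norm(x - Phi[:, cols] @ b)
        if err < best_err:
            best_err, best_support, best_coeffs = err, Lam, b
    return best_support, best_coeffs, best_err
```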

We reemphasize here that in this formulation we are using a fixed dictionary $$\bDDD$$ while the vector $$\bx \in \CC^N$$ is arbitrary.

This problem is known as the $$(\bDDD, K)$$-SPARSE approximation problem.

A related problem is the $$(\bDDD, K)$$-EXACT-SPARSE problem, where it is known a priori that $$\bx$$ is a linear combination of at most $$K$$ atoms from the given dictionary $$\bDDD$$; i.e., $$\bx$$ is a $$K$$-sparse signal in $$\bDDD$$ as defined above.

This formulation simplifies the minimization problem (18.18), since it is known a priori that a $$0$$ approximation error can be achieved for $$K$$-sparse signals. The problem reduces to identifying, among the $$\binom{D}{K}$$ possible $$K$$-sparse subspaces, those which provide a $$K$$-sparse representation of $$\bx$$, and choosing one of them. It is imperative to note that even the $$K$$-sparse representation need not be unique.

Clearly the EXACT-SPARSE problem is simpler than the SPARSE approximation problem. Thus if the EXACT-SPARSE problem is NP-hard, then so is the harder SPARSE approximation problem. It is expected that solving the EXACT-SPARSE problem will provide insights into solving the SPARSE approximation problem.

In Theorem 18.8 we identified conditions under which a sparse representation for a given vector $$\bx$$ in a two-ortho basis is unique. It would be useful to obtain similar conditions for general dictionaries. Such conditions would help us guarantee the uniqueness of solutions to the EXACT-SPARSE problem.

18.5.4. Synthesis and Analysis#

The atoms of a dictionary $$\bDDD$$ can be organized into an $$N \times D$$ matrix as follows:

$\Phi \triangleq \begin{bmatrix} \phi_{\omega_1} & \phi_{\omega_2} & \dots & \phi_{\omega_D} \end{bmatrix}.$

where $$\Omega = \{\omega_1, \omega_2, \dots, \omega_D\}$$ is the index set for the atoms of $$\bDDD$$. We recall that $$\phi_{\omega} \in \CC^N$$; hence each atom has a column vector representation in the standard basis for $$\CC^N$$. The order of columns doesn’t matter as long as it remains fixed once chosen.

Thus in matrix terminology a representation of $$\bx \in \CC^N$$ in the dictionary can be written as

$\bx = \Phi \bb$

where $$\bb \in \CC^D$$ is a vector of coefficients to produce a superposition $$\bx$$ from the atoms of dictionary $$\bDDD$$. Clearly with $$D > N$$, $$\bb$$ is not unique. Rather, for every vector $$\bz \in \NullSpace(\Phi)$$, we have:

$\Phi (\bb + \bz) = \Phi \bb + \Phi \bz = \bx + \bzero = \bx.$

Definition 18.6 (Synthesis matrix)

The matrix $$\Phi$$ is called a synthesis matrix since $$\bx$$ is synthesized from the columns of $$\Phi$$ with the coefficient vector $$\bb$$.

We can also view the synthesis matrix $$\Phi$$ as a linear operator from $$\CC^D$$ to $$\CC^N$$.

There is another way to look at $$\bx$$ through $$\Phi$$.

Definition 18.7 (Analysis matrix)

The conjugate transpose $$\Phi^H$$ of the synthesis matrix $$\Phi$$ is called the analysis matrix. It maps a given vector $$\bx \in \CC^N$$ to a list of inner products with the dictionary:

$\bc = \Phi^H \bx$

where $$\bc \in \CC^D$$.

Note that in general $$\bx \neq \Phi (\Phi^H \bx)$$ unless $$\bDDD$$ is an orthonormal basis.
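The following sketch (illustrative, with a random over-complete dictionary) contrasts the synthesis and analysis operators and confirms numerically that synthesis applied after analysis does not reproduce $$\bx$$:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 4, 7
Phi = rng.standard_normal((N, D)) + 1j * rng.standard_normal((N, D))
Phi /= np.linalg.norm(Phi, axis=0)

b = rng.standard_normal(D) + 1j * rng.standard_normal(D)
x = Phi @ b            # synthesis: maps C^D to C^N
c = Phi.conj().T @ x   # analysis: inner products of x with all atoms, in C^D

# Synthesis after analysis does not recover x in general.
print(np.allclose(Phi @ c, x))  # False for a generic over-complete Phi
```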

Definition 18.8 ($$(\bDDD, K)$$ EXACT SPARSE problem)

With the help of synthesis matrix $$\Phi$$, the $$(\bDDD, K)$$ EXACT SPARSE problem can now be written as

(18.19)#$\begin{split}& \underset{\ba}{\text{minimize}} & & \| \ba \|_0 \\ & \text{subject to } & & \bx = \Phi \ba\\ & \text{and } & & \| \ba \|_0 \leq K\end{split}$

If $$\bx \notin \Sigma_K$$, then the EXACT SPARSE problem is infeasible. Otherwise, we are looking to find the sparsest possible solution.

Definition 18.9 ($$(\bDDD, K)$$ SPARSE approximation problem)

With the help of synthesis matrix $$\Phi$$, the $$(\bDDD, K)$$ SPARSE approximation problem can now be written as

(18.20)#$\begin{split}& \underset{\ba}{\text{minimize}} & & \| \bx - \Phi \ba \|_2 \\ & \text{subject to } & & \| \ba \|_0 \leq K.\end{split}$

This problem can be visualized as a projection of $$\bx$$ onto the set $$\Sigma_K$$. Hence, it always has a solution.

18.5.5. P-Norms#

This section lists some simple and useful results on the relationships between different $$p$$-norms. We also discuss some interesting properties of the $$\ell_1$$ norm specifically.

Definition 18.10 (Complex sign vector)

Let $$\bv \in \CC^N$$. Let the entries in $$\bv$$ be represented as

$v_k = r_k \exp (i \theta_k)$

where $$r_k = | v_k |$$ with the convention that $$\theta_k = 0$$ whenever $$r_k = 0$$.

The sign vector for $$\bv$$ denoted by $$\sgn(\bv)$$ is defined as

$\begin{split} \sgn(\bv) = \begin{bmatrix}\sgn(v_1) \\ \vdots \\ \sgn(v_N) \end{bmatrix} \end{split}$

where

$\begin{split} \sgn(v_k) = \begin{cases} \exp (i \theta_k) & \text{ if } r_k \neq 0;\\ 0 & \text{ if } r_k = 0. \end{cases} \end{split}$

Theorem 18.10 ($$\ell_1$$ norm as product of vector with its sign)

For any $$\bv \in \CC^N$$:

$\| \bv \|_1 = \sgn(\bv)^H \bv = \langle \bv , \sgn(\bv) \rangle.$

Proof. This follows from:

$\| \bv \|_1 = \sum_{k=1}^N r_k = \sum_{k=1}^N \left [r_k e^{i \theta_k} \right ] e^{- i \theta_k} = \sum_{k=1}^N v_k e^{- i \theta_k} = \sgn(\bv)^H \bv.$

Note that whenever $$v_k = 0$$, the corresponding $$0$$ entry in $$\sgn(\bv)$$ has no effect on the sum.
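A quick numerical check of Theorem 18.10 (an illustrative sketch; the helper name `csgn` is our own, not from the text):

```python
import numpy as np

def csgn(v):
    """Complex sign vector: exp(i*theta_k) where v_k != 0, and 0 otherwise."""
    s = np.zeros(v.shape, dtype=complex)
    nz = v != 0
    s[nz] = v[nz] / np.abs(v[nz])
    return s

rng = np.random.default_rng(3)
v = rng.standard_normal(5) + 1j * rng.standard_normal(5)
v[2] = 0  # include a zero entry; it contributes nothing to the sum

lhs = np.linalg.norm(v, 1)
rhs = np.vdot(csgn(v), v)  # np.vdot conjugates its first argument: sgn(v)^H v
print(np.isclose(lhs, rhs.real) and np.isclose(rhs.imag, 0.0))  # True
```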

Theorem 18.11 (Equivalence of $$\ell_1$$ and $$\ell_2$$ norms)

Suppose $$\bv \in \CC^N$$. Then

$\| \bv \|_2 \leq \| \bv\|_1 \leq \sqrt{N} \| \bv \|_2.$

Proof. For the lower bound, we go as follows

$\| \bv \|_2^2 = \sum_{i=1}^N | v_i|^2 \leq \left ( \sum_{i=1}^N | v_i|^2 + 2 \sum_{i < j} | v_i | | v_j| \right ) = \left ( \sum_{i=1}^N | v_i| \right )^2 = \| \bv \|_1^2.$

This gives us

$\| \bv \|_2 \leq \| \bv \|_1.$

We can write $$\ell_1$$ norm as

$\| \bv \|_1 = \langle \bv, \sgn (\bv) \rangle.$

By the Cauchy-Schwarz inequality we have

$\langle \bv, \sgn (\bv) \rangle \leq \| \bv \|_2 \| \sgn (\bv) \|_2.$

Since $$\sgn(\bv)$$ can have at most $$N$$ non-zero values, each with magnitude 1,

$\| \sgn (\bv) \|_2^2 \leq N \implies \| \sgn (\bv) \|_2 \leq \sqrt{N}.$

Thus, we get

$\| \bv \|_1 \leq \sqrt{N} \| \bv \|_2.$

Theorem 18.12 (Equivalence of $$\ell_2$$ and $$\ell_{\infty}$$ norms)

Let $$\bv \in \CC^N$$. Then

$\| \bv \|_2 \leq \sqrt{N} \| \bv \|_{\infty}.$

Proof. This follows from:

$\| \bv \|_2^2 = \sum_{i=1}^N | v_i |^2 \leq N \underset{1 \leq i \leq N}{\max} ( | v_i |^2) = N \| \bv \|_{\infty}^2.$

Thus

$\| \bv \|_2 \leq \sqrt{N} \| \bv \|_{\infty}.$

Theorem 18.13 (Relationship between $$p$$-norms)

Let $$\bv \in \CC^N$$. Let $$1 \leq p, q \leq \infty$$. Then

$\| \bv \|_q \leq \| \bv \|_p \text{ whenever } p \leq q.$

Proof. For $$q = \infty$$, we have $$\| \bv \|_{\infty} = \max_{1 \leq i \leq N} |v_i| \leq \left( \sum_{i=1}^N |v_i|^p \right)^{1/p} = \| \bv \|_p$$. Now suppose $$p \leq q < \infty$$. The claim is trivial for $$\bv = \bzero$$, so assume $$\bv \neq \bzero$$ and let $$\bw = \bv / \| \bv \|_p$$, so that $$\| \bw \|_p = 1$$. Then $$|w_i| \leq 1$$ for every $$i$$; hence $$|w_i|^q \leq |w_i|^p$$. Summing over $$i$$ gives $$\| \bw \|_q^q \leq \| \bw \|_p^p = 1$$; i.e., $$\| \bw \|_q \leq 1$$. Multiplying both sides by $$\| \bv \|_p$$ yields $$\| \bv \|_q \leq \| \bv \|_p$$.
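These norm relationships are easy to verify numerically; the following illustrative sketch checks Theorems 18.11 through 18.13 on a random complex vector:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 10
v = rng.standard_normal(N) + 1j * rng.standard_normal(N)

l1, l2, linf = (np.linalg.norm(v, p) for p in (1, 2, np.inf))
assert l2 <= l1 <= np.sqrt(N) * l2   # Theorem 18.11
assert l2 <= np.sqrt(N) * linf       # Theorem 18.12

# Theorem 18.13: the p-norm is non-increasing in p.
norms = [np.linalg.norm(v, p) for p in (1, 1.5, 2, 3, np.inf)]
assert all(a >= b for a, b in zip(norms, norms[1:]))
```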

Theorem 18.14

Let $$\bone \in \CC^N$$ be the vector of all ones; i.e., $$\bone = (1, \dots, 1)$$. Let $$\bv \in \CC^N$$ be some arbitrary vector. Let $$| \bv |$$ denote the vector of absolute values of entries in $$\bv$$; i.e., $$|v|_i = |v_i| \Forall 1 \leq i \leq N$$. Then

$\| \bv \|_1 = \bone^T | \bv | = \bone^H | \bv |.$

Proof. This follows from:

$\bone^T | \bv | = \sum_{i=1}^N | v |_i = \sum_{i=1}^N | v_i | = \| \bv \|_1.$

Finally, since $$\bone$$ has only real entries, its transpose and Hermitian transpose are the same.

Theorem 18.15

Let $$\OneMat \in \CC^{N \times N}$$ be a square matrix of all ones. Let $$\bv \in \CC^N$$ be some arbitrary vector. Then

$|\bv|^T \OneMat | \bv | = \| \bv \|_1^2.$

Proof. We know that

$\OneMat = \bone \bone^T$

Thus,

$|\bv|^T \OneMat | \bv | = |\bv|^T \bone \bone^T | \bv | = (\bone^T | \bv | )^T (\bone^T | \bv |) = \| \bv \|_1 \| \bv \|_1 = \| \bv \|_1^2.$

We used the fact that $$\| \bv \|_1 = \bone^T | \bv |$$.

Theorem 18.16 (An upper bound on the $$k$$-th largest value)

The $$k$$-th largest (in magnitude) entry of a vector $$\bx \in \CC^N$$, denoted by $$x_{(k)}$$, obeys

(18.21)#$| x_{(k)} | \leq \frac{\| \bx \|_1}{k}.$

Proof. Let $$n_1, n_2, \dots, n_N$$ be a permutation of $$\{ 1, 2, \dots, N \}$$ such that

$|x_{n_1} | \geq | x_{n_2} | \geq \dots \geq | x_{n_N} |.$

Thus, the $$k$$-th largest entry in $$\bx$$ is $$x_{n_k}$$. It is clear that

$\| \bx \|_1 = \sum_{i=1}^N | x_i | = \sum_{i=1}^N |x_{n_i} |.$

Obviously

$|x_{n_1} | \leq \sum_{i=1}^N |x_{n_i} | = \| \bx \|_1.$

Similarly

$k |x_{n_k} | = |x_{n_k} | + \dots + |x_{n_k} | \leq |x_{n_1} | + \dots + |x_{n_k} | \leq \sum_{i=1}^N |x_{n_i} | \leq \| \bx \|_1.$

Thus

$|x_{n_k} | \leq \frac{\| \bx \|_1}{k}.$
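The bound (18.21) can be checked directly by sorting magnitudes (an illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.standard_normal(8) + 1j * rng.standard_normal(8)

mags = np.sort(np.abs(x))[::-1]        # |x_(1)| >= |x_(2)| >= ...
k = np.arange(1, x.size + 1)
assert np.all(mags <= np.linalg.norm(x, 1) / k)  # (18.21) for every k
```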

18.5.6. Sparse Signals#

In this subsection we explore some useful properties of $$\Sigma_K$$, the set of $$K$$-sparse signals in the standard basis for $$\CC^N$$.

We recall that

$\Sigma_K = \{ \bx \in \CC^N \ST \| \bx \|_0 \leq K \}.$

We established before that this set is a union of $$\binom{N}{K}$$ subspaces of $$\CC^N$$, each of which is constructed by choosing an index set $$\Lambda \subset \{1, \dots, N \}$$ with $$| \Lambda | = K$$ that selects $$K$$ specific dimensions of $$\CC^N$$.

We first present some results which connect the $$\ell_1$$, $$\ell_2$$ and $$\ell_{\infty}$$ norms of vectors in $$\Sigma_K$$.

Theorem 18.17 (Relation between norms of sparse vectors)

Suppose $$\bu \in \Sigma_K$$. Then

$\frac{\| \bu\|_1}{\sqrt{K}} \leq \| \bu \|_2 \leq \sqrt{K} \| \bu \|_{\infty}.$

Proof. Due to Theorem 18.10, we can write $$\ell_1$$ norm as

$\| \bu \|_1 = \langle \bu, \sgn (\bu) \rangle.$

By the Cauchy-Schwarz inequality we have

$\langle \bu, \sgn (\bu) \rangle \leq \| \bu \|_2 \| \sgn (\bu) \|_2.$

Since $$\bu \in \Sigma_K$$, $$\sgn(\bu)$$ can have at most $$K$$ non-zero values each with magnitude 1. Thus, we have

$\| \sgn (\bu) \|_2^2 \leq K \implies \| \sgn (\bu) \|_2 \leq \sqrt{K}.$

Thus we get the lower bound

$\| \bu \|_1 \leq \| \bu \|_2 \sqrt{K} \implies \frac{\| \bu \|_1}{\sqrt{K}} \leq \| \bu \|_2.$

Now $$| u_i | \leq \max(| u_i |) = \| \bu \|_{\infty}$$. So we have

$\| \bu \|_2^2 = \sum_{i= 1}^{N} | u_i |^2 \leq K \| \bu \|_{\infty}^2$

since there are at most $$K$$ non-zero terms in the expansion of $$\| \bu \|_2^2$$. This establishes the upper bound:

$\| \bu \|_2 \leq \sqrt{K} \| \bu \|_{\infty}.$
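A numerical check of Theorem 18.17 on a random $$K$$-sparse vector (an illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(6)
N, K = 12, 3
u = np.zeros(N, dtype=complex)
idx = rng.choice(N, size=K, replace=False)
u[idx] = rng.standard_normal(K) + 1j * rng.standard_normal(K)  # u in Sigma_K

l1, l2, linf = (np.linalg.norm(u, p) for p in (1, 2, np.inf))
assert l1 / np.sqrt(K) <= l2 <= np.sqrt(K) * linf  # Theorem 18.17
```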

18.5.7. Compressible Signals#

In this subsection, we first look at some general results and definitions related to $$K$$-term approximations of arbitrary signals $$\bx \in \CC^N$$. We then define the notion of a compressible signal and study properties related to it.

18.5.7.1. K-term Approximation of General Signals#

Definition 18.11 (Restriction of a signal)

Let $$\bx \in \CC^N$$. Let $$T \subset \{ 1, 2, \dots, N\}$$ be any index set. Further let

$T = \{t_1, t_2, \dots, t_{|T|}\}$

such that

$t_1 < t_2 < \dots < t_{|T|}.$

Let $$\bx_T \in \CC^{|T|}$$ be defined as

(18.22)#$\bx_T = \begin{pmatrix} x_{t_1} & x_{t_2} & \dots & x_{t_{|T|}} \end{pmatrix}.$

Then $$\bx_T$$ is a restriction of the signal $$\bx$$ on the index set $$T$$.

Alternatively let $$\bx_T \in \CC^N$$ be defined as

(18.23)#$\begin{split}\bx_{T}(i) = \begin{cases} \bx(i) & \text{ if } i \in T;\\ 0 & \text{ otherwise}. \end{cases}\end{split}$

In other words, $$\bx_T \in \CC^N$$ keeps the entries in $$\bx$$ indexed by $$T$$ while setting all other entries to 0. Then we say that $$\bx_T$$ is obtained by masking $$\bx$$ with $$T$$.

As an abuse of notation, we will use any of the two definitions whenever we are referring to $$\bx_T$$. The definition being used should be obvious from the context.

Example 18.17 (Restrictions on index sets)

Let

$\bx = \begin{pmatrix} -1 & 5 & 8 & 0 & 0 & -3 & 0 & 0 & 0 & 0 \end{pmatrix} \in \CC^{10}.$

Let

$T = \{ 1, 3, 7, 8\}.$

Then

$\bx_T = \begin{pmatrix} -1 & 0 & 8 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix} \in \CC^{10}.$

Since $$|T| = 4$$, sometimes we will also write

$\bx_T = \begin{pmatrix} -1 & 8 & 0 & 0 \end{pmatrix} \in \CC^4.$
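Both versions of $$\bx_T$$ are one-liners with NumPy indexing; this illustrative sketch reproduces Example 18.17 (note the 0-based indices):

```python
import numpy as np

x = np.array([-1, 5, 8, 0, 0, -3, 0, 0, 0, 0], dtype=float)
T = np.array([1, 3, 7, 8]) - 1   # 0-based indices for T = {1, 3, 7, 8}

x_restriction = x[T]             # restriction: a vector in C^{|T|}
x_mask = np.zeros_like(x)
x_mask[T] = x[T]                 # mask: a vector in C^N, zero off T

print(x_restriction)  # [-1.  8.  0.  0.]
print(x_mask)         # [-1.  0.  8.  0.  0.  0.  0.  0.  0.  0.]
```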

Definition 18.12 ($$K$$-term signal approximation)

Let $$\bx \in \CC^N$$ be an arbitrary signal. Consider any index set $$T \subset \{1, \dots, N \}$$ with $$|T| = K$$. Then $$\bx_T$$ is a $$K$$-term approximation of $$\bx$$.

Clearly for any $$\bx \in \CC^N$$ there are $$\binom{N}{K}$$ possible $$K$$-term approximations of $$\bx$$.

Example 18.18 ($$K$$-term approximation)

Let

$\bx = \begin{pmatrix} -1 & 5 & 8 & 0 & 0 & -3 & 0 & 0 & 0 & 0 \end{pmatrix} \in \CC^{10}.$

Let $$T= \{ 1, 6 \}$$. Then

$\bx_T = \begin{pmatrix} -1 & 0 & 0 & 0 & 0 & -3 & 0 & 0 & 0 & 0 \end{pmatrix}$

is a $$2$$-term approximation of $$\bx$$.

If we choose $$T= \{7,8,9,10\}$$, the corresponding $$4$$-term approximation of $$\bx$$ is

$\begin{pmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}.$

Definition 18.13 ($$K$$-largest entries approximation)

Let $$\bx \in \CC^N$$ be an arbitrary signal. Let $$\lambda_1, \dots, \lambda_N$$ be indices of entries in $$\bx$$ such that

$| x_{\lambda_1} | \geq | x_{\lambda_2} | \geq \dots \geq | x_{\lambda_N} |.$

In case of ties, the order is resolved lexicographically; i.e., if $$|x_i| = |x_j|$$ and $$i < j$$ then $$i$$ will appear first in the sequence $$\{ \lambda_k \}$$.

Consider the index set $$\Lambda_K = \{ \lambda_1, \lambda_2, \dots, \lambda_K\}$$. The mask of $$\bx$$ over $$\Lambda_K$$, given by $$\bx_{\Lambda_K}$$, keeps the $$K$$ largest (in magnitude) entries of $$\bx$$ while setting all other entries to 0. This is known as the $$K$$ largest entries approximation of $$\bx$$.

This signal is denoted henceforth as $$\bx|_K$$; i.e.

(18.24)#$\bx|_K = \bx_{\Lambda_K}$

where $$\Lambda_K$$ is the index set corresponding to $$K$$ largest entries in $$\bx$$ (magnitude wise).

Example 18.19 (Largest entries approximation)

Let

$\bx = \begin{pmatrix} -1 & 5 & 8 & 0 & 0 & -3 & 0 & 0 & 0 & 0 \end{pmatrix}.$

Then

$\bx|_1 = \begin{pmatrix} 0 & 0 & 8 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}.$
$\bx|_2 = \begin{pmatrix} 0 & 5 & 8 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}.$
$\bx|_3 = \begin{pmatrix} 0 & 5 & 8 & 0 & 0 & -3 & 0 & 0 & 0 & 0 \end{pmatrix}.$
$\bx|_4 = \bx.$

All further $$K$$-largest-entries approximations are the same as $$\bx$$.
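Computing $$\bx|_K$$ amounts to a stable sort by magnitude; in the illustrative sketch below (the function name is our own), the stable sort resolves ties with the smaller index first, matching the lexicographic convention of Definition 18.13:

```python
import numpy as np

def k_largest_approx(x, K):
    """Return x|_K: keep the K entries of largest magnitude, zero the rest.
    A stable sort keeps the smaller index first among equal magnitudes."""
    order = np.argsort(-np.abs(x), kind='stable')
    out = np.zeros_like(x)
    out[order[:K]] = x[order[:K]]
    return out

x = np.array([-1, 5, 8, 0, 0, -3, 0, 0, 0, 0], dtype=float)
print(k_largest_approx(x, 2))  # [0. 5. 8. 0. 0. 0. 0. 0. 0. 0.]
print(np.array_equal(k_largest_approx(x, 4), x))  # True: x|_4 = x
```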

A pertinent question at this point is: which $$K$$-term approximation of $$\bx$$ is the best one? Certainly, in order to compare two approximations, we need some criterion. Let us choose the $$\ell_p$$ norm as the criterion. The next theorem characterizes the best $$K$$-term approximations in the $$\ell_p$$ norm sense.

Theorem 18.18 (Best $$K$$-term approximation for $$\ell_p$$ norms)

Let $$\bx \in \CC^N$$. Let the best $$K$$-term approximation of $$\bx$$ be obtained by the following optimization program:

(18.25)#$\begin{split}& \underset{T \subset \{1, \dots, N\}}{\text{maximize}} & & \| \bx_T \|_p \\ & \text{subject to } & & |T| = K.\end{split}$

where $$p \in [1, \infty]$$.

Let an optimal solution for this optimization problem be denoted by $$\bx_{T^*}$$. Then

$\| \bx|_K \|_p = \| \bx_{T^*} \|_p;$

i.e., the $$K$$-largest entries approximation of $$\bx$$ is an optimal solution to (18.25).

Proof. For $$p=\infty$$, the result is obvious. In the following, we focus on $$p \in [1, \infty)$$.

We note that maximizing $$\| \bx_T \|_p$$ is equivalent to maximizing $$\| \bx_T \|_p^p$$.

Let $$\lambda_1, \dots, \lambda_N$$ be indices of entries in $$\bx$$ such that

$| x_{\lambda_1} | \geq | x_{\lambda_2} | \geq \dots \geq | x_{\lambda_N} |.$

Further let $$\{ \omega_1, \dots, \omega_N\}$$ be any permutation of $$\{1, \dots, N \}$$. Clearly

$\| \bx|_K \|_p^{p} = \sum_{i=1}^K |x_{\lambda_i}|^{p} \geq \sum_{i=1}^K |x_{\omega_i}|^{p}.$

Thus if $$T^*$$ corresponds to an optimal solution of (18.25) then

$\| \bx|_K \|_p^{p} = \| \bx_{T^*} \|_p^{p}.$

Thus $$\bx|_K$$ is an optimal solution to (18.25).

This result establishes that whenever we are looking for a best $$K$$-term approximation of $$\bx$$ under any $$\ell_p$$ norm, all we have to do is pick the $$K$$ largest (in magnitude) entries of $$\bx$$.
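The claim of Theorem 18.18 can be verified by comparing the exhaustive search over all $$\binom{N}{K}$$ supports against the $$K$$-largest-entries shortcut (an illustrative sketch):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(7)
x = rng.standard_normal(8)
K, p = 3, 1.5

# Exhaustive search: maximize ||x_T||_p over all supports with |T| = K.
best = max(np.linalg.norm(x[list(T)], p)
           for T in combinations(range(x.size), K))

# The K largest magnitudes achieve the same optimum.
topK = np.sort(np.abs(x))[::-1][:K]
assert np.isclose(best, np.linalg.norm(topK, p))
```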

Definition 18.14 (Restriction of a matrix)

Let $$\Phi \in \CC^{M \times N}$$. Let $$T \subset \{ 1, 2, \dots, N\}$$ be any index set. Further let

$T = \{t_1, t_2, \dots, t_{|T|}\}$

such that

$t_1 < t_2 < \dots < t_{|T|}.$

Let $$\Phi_T \in \CC^{M \times |T|}$$ be defined as

(18.26)#$\Phi_T = \begin{bmatrix} \phi_{t_1} & \phi_{t_2} & \dots & \phi_{t_{|T|}} \end{bmatrix}.$

Then $$\Phi_T$$ is a restriction of the matrix $$\Phi$$ on the index set $$T$$.

Alternatively let $$\Phi_T \in \CC^{M \times N}$$ be defined as

(18.27)#$\begin{split}(\Phi_{T})_i = \begin{cases} \phi_i & \text{ if } i \in T;\\ \bzero & \text{ otherwise}. \end{cases}\end{split}$

In other words, $$\Phi_T \in \CC^{M \times N}$$ keeps the columns in $$\Phi$$ indexed by $$T$$ while setting all other columns to $$\bzero$$. Then we say that $$\Phi_T$$ is obtained by masking $$\Phi$$ with $$T$$.

As an abuse of notation, we will use any of the two definitions whenever we are referring to $$\Phi_T$$. The definition being used should be obvious from the context.

Theorem 18.19

Let $$\supp(\bx) = \Lambda$$. Then

$\Phi \bx = \Phi_{\Lambda} \bx_{\Lambda}.$

Proof. This follows from:

$\Phi \bx = \sum_{i=1}^N x_i \phi_i = \sum_{\lambda \in \Lambda} x_{\lambda} \phi_{\lambda} = \Phi_{\Lambda} \bx_{\Lambda}.$

The result remains valid whether we use the restriction or the mask version of the $$\bx_{\Lambda}$$ notation, as long as the same version is used for both $$\Phi$$ and $$\bx$$.

Corollary 18.1

Let $$S$$ and $$T$$ be two disjoint index sets such that for some $$\bx \in \CC^N$$

$\bx = \bx_T + \bx_S$

using the mask version of the $$\bx_{\Lambda}$$ notation. Then the following holds:

$\Phi \bx = \Phi_T \bx_T + \Phi_S \bx_S.$

Proof. Straightforward application of Theorem 18.19:

$\Phi \bx = \Phi \bx_T + \Phi \bx_S = \Phi_T \bx_T + \Phi_S \bx_S.$

Theorem 18.20

Let $$T$$ be any index set. Let $$\Phi \in \CC^{M \times N}$$ and $$\by \in \CC^M$$. Then

$[\Phi^H \by]_T = \Phi_T^H \by.$

Proof. Note that

$\begin{split} \Phi^H \by = \begin{bmatrix} \langle \phi_1 , \by \rangle\\ \vdots \\ \langle \phi_N , \by \rangle\\ \end{bmatrix} \end{split}$

Now let

$T = \{ t_1, \dots, t_K \}.$

Then

$\begin{split} [\Phi^H \by]_T = \begin{bmatrix} \langle \phi_{t_1} , \by \rangle\\ \vdots \\ \langle \phi_{t_K} , \by \rangle\\ \end{bmatrix} = \Phi_T^H \by. \end{split}$

The result remains valid whether we use the restriction or the mask version of $$\Phi_T$$ notation.
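Both identities reduce to column slicing in matrix form; a quick numerical confirmation (an illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(8)
M, N = 5, 10
Phi = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))

# Theorem 18.19 (restriction version): Phi x = Phi_Lambda x_Lambda.
x = np.zeros(N, dtype=complex)
Lam = [2, 5, 9]                      # supp(x) = Lambda
x[Lam] = rng.standard_normal(3) + 1j * rng.standard_normal(3)
assert np.allclose(Phi @ x, Phi[:, Lam] @ x[Lam])

# Theorem 18.20: [Phi^H y]_T = Phi_T^H y.
y = rng.standard_normal(M) + 1j * rng.standard_normal(M)
T = [0, 3, 4]
assert np.allclose((Phi.conj().T @ y)[T], Phi[:, T].conj().T @ y)
```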

18.5.7.2. Compressible Signals#

We will now define the notion of a compressible signal in terms of the decay rate of the magnitudes of its entries when sorted in descending order.

Definition 18.15 (Compressible signal)

Let $$\bx \in \CC^N$$ be an arbitrary signal. Let $$\lambda_1, \dots, \lambda_N$$ be indices of entries in $$\bx$$ such that

$| x_{\lambda_1} | \geq | x_{\lambda_2} | \geq \dots \geq | x_{\lambda_N} |.$

In case of ties, the order is resolved lexicographically; i.e., if $$|x_i| = |x_j|$$ and $$i < j$$ then $$i$$ will appear first in the sequence $$\{ \lambda_k \}$$. Define

(18.28)#$\widehat{\bx} = (x_{\lambda_1}, x_{\lambda_2}, \dots, x_{\lambda_N}).$

Given $$p \in (0, 1]$$, the signal $$\bx$$ is called $$p$$-compressible with magnitude $$R$$ if

(18.29)#$| \widehat{x}_i |\leq R \cdot i^{-\frac{1}{p}} \quad \forall i=1, 2,\dots, N.$

Theorem 18.21 ($$1$$-compressible signals)

Let $$\bx$$ be $$p$$-compressible with $$p=1$$. Then

$\| \bx \|_1 \leq R (1 + \ln (N)).$

Proof. Recalling $$\widehat{\bx}$$ from (18.28), it is straightforward to see that

$\|\bx\|_1 = \|\widehat{\bx}\|_1$

since the $$\ell_1$$ norm doesn’t depend on the ordering of entries in $$\bx$$.

Since $$\bx$$ is $$1$$-compressible, from (18.29) we have

$|\widehat{x}_i | \leq R \frac{1}{i}.$

This gives us

$\|\widehat{\bx}\|_1 \leq \sum_{i=1}^N R \frac{1}{i} = R \sum_{i=1}^N \frac{1}{i}.$

The sum on the R.H.S. is the $$N$$-th harmonic number $$H_N$$ (the sum of the reciprocals of the first $$N$$ natural numbers). A simple upper bound on harmonic numbers is

$H_k \leq 1 + \ln(k).$

This completes the proof.

We now demonstrate how a compressible signal is well approximated by a sparse signal.

Theorem 18.22 (Sparse approximation of compressible signals)

Let $$\bx$$ be a $$p$$-compressible signal with $$p < 1$$ and let $$\bx|_K$$ be its best $$K$$-term approximation. Then the $$\ell_1$$ norm of the approximation error satisfies

(18.30)#$\| \bx - \bx|_K\|_1 \leq C_p \cdot R \cdot K^{1 - \frac{1}{p}}$

with

$C_p = \left (\frac{1}{p} - 1 \right)^{-1}.$

Moreover the $$\ell_2$$ norm of the approximation error satisfies

(18.31)#$\| \bx - \bx|_K\|_2 \leq D_p \cdot R \cdot K^{\frac{1}{2} - \frac{1}{p}}$

with

$D_p = \left (\frac{2}{p} - 1 \right )^{-1/2}.$

Proof. Expanding the $$\ell_1$$ approximation error,

$\| \bx - \bx|_K\|_1 = \sum_{i=K+1}^N |x_{\lambda_i}| \leq R \sum_{i=K+1}^N i^{-\frac{1}{p}}.$

We now bound the R.H.S. sum by an integral (valid since $$i^{-\frac{1}{p}}$$ is decreasing in $$i$$).

$\sum_{i=K+1}^N i^{-\frac{1}{p}} \leq \int_{x=K}^N x^{-\frac{1}{p}} d x \leq \int_{x=K}^{\infty} x^{-\frac{1}{p}} d x.$

Now

$\int_{x=K}^{\infty} x^{-\frac{1}{p}} d x = \left [ \frac{x^{1-\frac{1}{p}}}{1-\frac{1}{p}} \right ]_{K}^{\infty} = C_p K^{1 - \frac{1}{p}}.$

The $$\ell_2$$ bound follows similarly: $$\| \bx - \bx|_K\|_2^2 \leq R^2 \sum_{i=K+1}^N i^{-\frac{2}{p}} \leq R^2 \int_{K}^{\infty} x^{-\frac{2}{p}} d x = D_p^2 \, R^2 \, K^{1 - \frac{2}{p}},$$ and taking square roots gives (18.31).

Example 18.20 (Sparse approximation for $$\frac{1}{2}$$-compressible signals)

Let $$p = \frac{1}{2}$$. Then

$C_p = \left (\frac{1}{p} - 1 \right)^{-1} = 1 \text{ and } D_p = \left (\frac{2}{p} - 1 \right )^{-1/2} = \frac{1}{\sqrt{3}}.$

Hence

$\| \bx - \bx|_K\|_1 \leq \frac{R}{K}$

and

$\| \bx - \bx|_K\|_2 \leq \frac{1}{\sqrt{3}} \frac{R}{K^{3/2}}.$

Both the $$\ell_1$$ and $$\ell_2$$ approximation error bounds decrease as $$K$$ increases, with the $$\ell_2$$ bound decaying faster by a factor of $$\sqrt{K}$$.
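As a final sanity check (an illustrative sketch, redefining the hypothetical `k_largest_approx` helper from the earlier sketch for self-containment), a signal lying exactly on the compressibility envelope $$|\widehat{x}_i| = R \cdot i^{-1/p}$$ with $$p = \frac{1}{2}$$ satisfies both bounds from Example 18.20:

```python
import numpy as np

def k_largest_approx(x, K):
    order = np.argsort(-np.abs(x), kind='stable')
    out = np.zeros_like(x)
    out[order[:K]] = x[order[:K]]
    return out

N, p, R = 10_000, 0.5, 1.0
i = np.arange(1, N + 1)
x = R * i ** (-1.0 / p)   # entries already sorted, exactly on the envelope

for K in (10, 100, 1000):
    err1 = np.linalg.norm(x - k_largest_approx(x, K), 1)
    err2 = np.linalg.norm(x - k_largest_approx(x, K), 2)
    assert err1 <= R / K                          # (18.30) with C_p = 1
    assert err2 <= R / (np.sqrt(3) * K ** 1.5)    # (18.31) with D_p = 1/sqrt(3)
```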