# 18.5. Sparse and Redundant Representations

## 18.5.1. Dictionaries

(Dictionary)

A *dictionary* for \(\CC^N\) is a finite collection \(\bDDD\) of unit-norm vectors
which span the whole space.

The elements of a dictionary are called *atoms* and they are denoted by \(\phi_{\omega}\)
where \(\omega\) is drawn from an index set \(\Omega\).

The dictionary is written as

\[
\bDDD = \{ \phi_{\omega} : \omega \in \Omega \}
\]

where

\[
\| \phi_{\omega} \|_2 = 1 \quad \Forall \omega \in \Omega,
\]

and every signal \(\bx \in \CC^N\) can be expressed as

\[
\bx = \sum_{\omega \in \Omega} b_{\omega} \phi_{\omega}.
\]

We use the letter \(D\) to denote the number of elements in the dictionary; i.e.,

\[
D = | \Omega |.
\]
This definition is adapted from [77].

The indices may have an interpretation, such as the time-frequency or time-scale localization of an atom, or they may simply be labels without any underlying meaning.

Note that the dictionary need not provide a unique representation for any vector \(\bx \in \CC^N\), but it provides at least one representation for each \(\bx \in \CC^N\).

When \(D=N\) we have a set of unit norm vectors which span the whole of \(\CC^N\). Thus we have a basis (not necessarily an orthonormal basis). A dictionary cannot have \(D < N\), since its atoms could not span the space. The more interesting case is when \(D > N\).

## 18.5.2. Redundant Dictionaries and Sparse Signals

With \(D > N\), clearly there are more atoms than necessary to provide
a representation of every signal \(\bx \in \CC^N\).
Thus such a dictionary is able to provide multiple representations of the same vector \(\bx\).
We call such dictionaries *redundant dictionaries* or *over-complete dictionaries*.
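This redundancy is easy to see in a tiny numerical sketch. The dictionary below is an illustrative assumption (three unit-norm atoms for \(\RR^2\), so \(D = 3 > N = 2\)); the helper name `synthesize` is ours, not from the text.

```python
import math

# Three unit-norm atoms spanning R^2: an overcomplete dictionary (D = 3 > N = 2).
atoms = [
    (1.0, 0.0),                            # phi_1: first standard basis vector
    (0.0, 1.0),                            # phi_2: second standard basis vector
    (1 / math.sqrt(2), 1 / math.sqrt(2)),  # phi_3: normalized (1, 1)
]

def synthesize(coeffs):
    """Form x = sum_k coeffs[k] * phi_k."""
    return tuple(sum(c * a[i] for c, a in zip(coeffs, atoms)) for i in range(2))

# Two different coefficient vectors produce the same signal x = (1, 1):
x1 = synthesize([1.0, 1.0, 0.0])
x2 = synthesize([0.0, 0.0, math.sqrt(2)])
```

Both coefficient vectors synthesize the same \(\bx\), so the representation is not unique.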

In contrast a basis with \(D=N\) is called a *complete dictionary*.
A special class of signals is those signals which have a sparse representation in a given
dictionary \(\bDDD\).

(\((\bDDD,K)\)-sparse signals)

A signal \(\bx \in \CC^N\) is called \((\bDDD,K)\)-sparse if it can be expressed as a linear combination of at most \(K\) atoms from the dictionary \(\bDDD\).

Normally, for sparse signals, we have \(K \ll D\). It is usually expected that \(K \ll N\) also holds.

Let \(\Lambda \subset \Omega\) be a subset of indices with \(|\Lambda|=K\).

Let \(\bx\) be any signal in \(\CC^N\) such that \(\bx\) can be expressed as

\[
\bx = \sum_{\lambda \in \Lambda} b_{\lambda} \phi_{\lambda}, \quad b_{\lambda} \in \CC.
\]
Note that this is not the only possible representation of \(\bx\) in \(\bDDD\); it is just one of the possible representations. The special feature of this representation is that it is \(K\)-sparse; i.e., at most \(K\) atoms from the dictionary are being used.

Now there are \(\binom{D}{K}\) ways in which we can choose a set of \(K\) atoms from the dictionary \(\bDDD\).

Thus the set of \((\bDDD,K)\)-sparse signals is given by

\[
\Sigma_{(\bDDD,K)} = \left \{ \bx \in \CC^N : \bx = \sum_{\lambda \in \Lambda} b_{\lambda} \phi_{\lambda} \text{ for some } \Lambda \subset \Omega \text{ with } |\Lambda| = K \right \}.
\]
This set \(\Sigma_{(\bDDD,K)}\) is dependent on the chosen dictionary \(\bDDD\). In the sequel, we will simply refer to it as \(\Sigma_K\).

(\(K\)-sparse signals for standard basis)

For the special case where \(\bDDD\) is nothing but the standard basis of \(\CC^N\),

\[
\Sigma_K = \{ \bx \in \CC^N : \| \bx \|_0 \leq K \};
\]

i.e., the set of signals which have \(K\) or fewer non-zero elements.

(\(K\)-sparse signals for orthonormal basis)

In contrast, if we choose an orthonormal basis \(\Psi\) such that every \(\bx \in \CC^N\) can be expressed as

\[
\bx = \Psi \ba,
\]

then with the dictionary \(\bDDD = \Psi\), the set of \(K\)-sparse signals is given by

\[
\Sigma_K = \{ \bx = \Psi \ba : \| \ba \|_0 \leq K \}.
\]
We also note that for a specific choice of \(\Lambda \subseteq \Omega\) with \(|\Lambda| = K\), the set of vectors

\[
\left \{ \bx \in \CC^N : \bx = \sum_{\lambda \in \Lambda} b_{\lambda} \phi_{\lambda} \right \}
\]

forms a subspace of \(\CC^N\).

So we have \(\binom{D}{K}\) \(K\)-sparse subspaces contained in the dictionary \(\bDDD\).
And the \(K\)-sparse signals lie in the *union of these subspaces*.

## 18.5.3. Sparse Approximation Problem

In the sparse approximation problem, we attempt to approximate a given signal \(\bx \in \CC^N\) as a linear combination of \(K\) atoms from the dictionary \(\bDDD\), where \(K \ll N\) and typically \(N \ll D\); i.e., the number of atoms in the dictionary \(\bDDD\) is typically much larger than the ambient signal space dimension \(N\).

Naturally, we wish to obtain a best possible sparse representation of \(\bx\) over the atoms \(\phi_{\omega} \in \bDDD\) which minimizes the approximation error.

Let \(\Lambda\) denote the index set of atoms which are used to create a \(K\)-sparse representation of \(\bx\) where \(\Lambda \subset \Omega\) with \(|\Lambda| = K\).

Let \(\bx_{\Lambda}\) denote an approximation of \(\bx\) over the set of atoms indexed by \(\Lambda\).

Then we can write \(\bx_{\Lambda}\) as

\[
\bx_{\Lambda} = \sum_{\lambda \in \Lambda} b_{\lambda} \phi_{\lambda}.
\]

We put all complex valued coefficients \(b_{\lambda}\) in the sum into a list \(\bb_{\Lambda}\). The approximation error is given by

\[
\bx - \bx_{\Lambda} = \bx - \sum_{\lambda \in \Lambda} b_{\lambda} \phi_{\lambda}.
\]
Clearly we would like to minimize the approximation error over all possible choices of \(K\) atoms and corresponding set of coefficients \(\bb_{\Lambda}\).

Thus the sparse approximation problem can be cast as a minimization problem given by

\[
\underset{\Lambda \subset \Omega, \; |\Lambda| = K}{\text{min}} \;\; \underset{\bb_{\Lambda}}{\text{min}} \; \left \| \bx - \sum_{\lambda \in \Lambda} b_{\lambda} \phi_{\lambda} \right \|_2. \tag{18.18}
\]
If we choose a particular \(\Lambda\), then the inner minimization problem becomes a straightforward least squares problem. But there are \(\binom{D}{K}\) possible choices of \(\Lambda\), and solving the inner least squares problem for each of them becomes prohibitively expensive.
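The combinatorial search can be sketched in code for tiny sizes. The dictionary, sizes, and helper names below are illustrative assumptions; the inner least squares problem is solved for real atoms via the normal equations with a small Gaussian elimination routine.

```python
import itertools
import math

def solve_normal_equations(cols, x):
    """Least squares coefficients over span(cols) by solving the
    K x K normal equations G b = A^T x (real case)."""
    K = len(cols)
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    G = [[dot(cols[i], cols[j]) for j in range(K)] for i in range(K)]
    rhs = [dot(c, x) for c in cols]
    for i in range(K):                       # forward elimination w/ pivoting
        p = max(range(i, K), key=lambda r: abs(G[r][i]))
        G[i], G[p] = G[p], G[i]
        rhs[i], rhs[p] = rhs[p], rhs[i]
        for r in range(i + 1, K):
            f = G[r][i] / G[i][i]
            G[r] = [a - f * b for a, b in zip(G[r], G[i])]
            rhs[r] -= f * rhs[i]
    b = [0.0] * K
    for i in reversed(range(K)):             # back substitution
        b[i] = (rhs[i] - dot(G[i][i + 1:], b[i + 1:])) / G[i][i]
    return b

def sparse_approx(atoms, x, K):
    """Exhaust all (D choose K) supports; return (support, residual norm)."""
    best = None
    for T in itertools.combinations(range(len(atoms)), K):
        cols = [atoms[i] for i in T]
        b = solve_normal_equations(cols, x)
        approx = [sum(bi * c[j] for bi, c in zip(b, cols)) for j in range(len(x))]
        r = math.sqrt(sum((xi - ai) ** 2 for xi, ai in zip(x, approx)))
        if best is None or r < best[1]:
            best = (T, r)
    return best

s = 1 / math.sqrt(2)
atoms = [(1.0, 0.0), (0.0, 1.0), (s, s)]     # D = 3 atoms in R^2
T_best, resid = sparse_approx(atoms, (2.0, 2.0), K=1)
```

For \(\bx = (2, 2)\) and \(K = 1\) the search picks the third atom with zero residual, while either standard basis atom leaves a residual of norm 2. The loop over `itertools.combinations` is exactly the \(\binom{D}{K}\) enumeration that makes this approach intractable at scale.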

We reemphasize here that in this formulation we are using a *fixed* dictionary \(\bDDD\)
while the vector \(\bx \in \CC^N\) is *arbitrary*.

This problem is known as \((\bDDD, K)\)-SPARSE approximation problem.

A related problem is known as the \((\bDDD, K)\)-EXACT-SPARSE problem, where it is known a priori that \(\bx\) is a linear combination of at most \(K\) atoms from the given dictionary \(\bDDD\); i.e., \(\bx\) is a \(K\)-sparse signal as defined above for the dictionary \(\bDDD\).

This formulation simplifies the minimization problem (18.18), since it is known a priori that a \(0\) approximation error can be achieved for \(K\)-sparse signals. The problem reduces to finding, among the \(\binom{D}{K}\) possible \(K\)-sparse subspaces, one which provides a \(K\)-sparse representation of \(\bx\). It is imperative to note that even the \(K\)-sparse representation need not be unique.

Clearly the EXACT-SPARSE problem is simpler than the SPARSE approximation problem. Thus if the EXACT-SPARSE problem is NP-hard, then so is the harder SPARSE approximation problem. It is expected that solving the EXACT-SPARSE problem will provide insights into solving the SPARSE problem.

In Theorem 18.8 we identified conditions under which a sparse representation for a given vector \(\bx\) in a two-ortho-basis is unique. It would be useful to get similar conditions for general dictionaries. Such conditions would help us guarantee the uniqueness of a solution to the EXACT-SPARSE problem.

## 18.5.4. Synthesis and Analysis

The atoms of a dictionary \(\bDDD\) can be organized into an \(N \times D\) matrix as follows:

\[
\Phi = \begin{bmatrix} \phi_{\omega_1} & \phi_{\omega_2} & \dots & \phi_{\omega_D} \end{bmatrix}
\]

where \(\Omega = \{\omega_1, \omega_2, \dots, \omega_D\}\) is the index set for the atoms of \(\bDDD\). We recall that \(\phi_{\omega} \in \CC^N\); hence each atom has a column vector representation in the standard basis for \(\CC^N\). The order of columns doesn’t matter as long as it remains fixed once chosen.

Thus in matrix terminology a representation of \(\bx \in \CC^N\) in the dictionary can be written as

\[
\bx = \Phi \bb
\]

where \(\bb \in \CC^D\) is a vector of coefficients to produce a superposition \(\bx\) from the atoms of dictionary \(\bDDD\). Clearly with \(D > N\), \(\bb\) is not unique. Rather, for every vector \(\bz \in \NullSpace(\Phi)\), we have:

\[
\bx = \Phi (\bb + \bz).
\]
(Synthesis matrix)

The matrix \(\Phi\) is called a *synthesis matrix* since \(\bx\)
is synthesized from the columns of
\(\Phi\) with the coefficient vector \(\bb\).

We can also view the synthesis matrix \(\Phi\) as a linear operator from \(\CC^D\) to \(\CC^N\).

There is another way to look at \(\bx\) through \(\Phi\).

(Analysis matrix)

The conjugate transpose \(\Phi^H\) of the synthesis matrix \(\Phi\) is called the *analysis matrix*.
It maps a given vector \(\bx \in \CC^N\) to a list of inner products with the dictionary:

\[
\bc = \Phi^H \bx
\]

where \(\bc \in \CC^D\).

Note that in general \(\bx \neq \Phi (\Phi^H \bx)\) unless \(\bDDD\) is an orthonormal basis.
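A quick real-valued sketch of this point, using an assumed overcomplete dictionary for \(\RR^2\) (the matrix and function names are illustrative; in the real case \(\Phi^H = \Phi^T\)):

```python
import math

s = 1 / math.sqrt(2)
# N x D synthesis matrix (rows), atoms are the columns: e1, e2, normalized (1,1).
Phi = [[1.0, 0.0, s],
       [0.0, 1.0, s]]

def synthesis(b):
    """x = Phi b: combine the atoms with coefficients b."""
    return [sum(Phi[i][j] * b[j] for j in range(3)) for i in range(2)]

def analysis(x):
    """c = Phi^T x: inner product of x with each atom (real case)."""
    return [sum(Phi[i][j] * x[i] for i in range(2)) for j in range(3)]

x = [1.0, 0.0]
c = analysis(x)              # [1, 0, 1/sqrt(2)]
x_roundtrip = synthesis(c)   # [1.5, 0.5], which differs from x
```

Since the atoms are not orthonormal as a set, `synthesis(analysis(x))` does not return \(\bx\), matching the remark above.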

(\((\bDDD, K)\) EXACT SPARSE problem)

With the help of the synthesis matrix \(\Phi\), the \((\bDDD, K)\) EXACT SPARSE problem can now be written as

\[
\begin{aligned}
& \underset{\bb}{\text{minimize}} & & \| \bb \|_0 \\
& \text{subject to} & & \bx = \Phi \bb, \\
& & & \| \bb \|_0 \leq K.
\end{aligned}
\]

If \(\bx \notin \Sigma_K\), then the EXACT SPARSE problem is infeasible. Otherwise, we are looking to find the sparsest possible solution.

(\((\bDDD, K)\) SPARSE approximation problem)

With the help of the synthesis matrix \(\Phi\), the \((\bDDD, K)\) SPARSE approximation problem can now be written as

\[
\begin{aligned}
& \underset{\bb}{\text{minimize}} & & \| \bx - \Phi \bb \|_2 \\
& \text{subject to} & & \| \bb \|_0 \leq K.
\end{aligned}
\]

This problem can be visualized as a projection of \(\bx\) onto the set \(\Sigma_K\). Hence, it always has a solution.

## 18.5.5. P-Norms

This section lists some simple and useful results on the relationships between different \(p\)-norms. We also discuss some interesting properties of the \(\ell_1\) norm specifically.

(Complex sign vector)

Let \(\bv \in \CC^N\). Let the entries in \(\bv\) be represented as

\[
v_k = r_k e^{i \theta_k}
\]

where \(r_k = | v_k |\), with the convention that \(\theta_k = 0\) whenever \(r_k = 0\).

The sign vector for \(\bv\), denoted by \(\sgn(\bv)\), is defined as

\[
\sgn(\bv) = \begin{bmatrix} \sgn(v_1) \\ \vdots \\ \sgn(v_N) \end{bmatrix}
\]

where

\[
\sgn(v_k) = \begin{cases} e^{i \theta_k} & \text{if } r_k \neq 0; \\ 0 & \text{if } r_k = 0. \end{cases}
\]
(\(\ell_1\) norm as product of vector with its sign)

For any \(\bv \in \CC^N\):

\[
\| \bv \|_1 = \sgn(\bv)^H \bv.
\]

Proof. This follows from:

\[
\sgn(\bv)^H \bv = \sum_{k=1}^N \overline{\sgn(v_k)} v_k = \sum_{k : r_k \neq 0} e^{-i \theta_k} r_k e^{i \theta_k} = \sum_{k=1}^N r_k = \| \bv \|_1.
\]

Note that whenever \(v_k = 0\), the corresponding \(0\) entry in \(\sgn(\bv)\) has no effect on the sum.
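The identity is easy to verify numerically for a complex vector; this is a small sketch with an illustrative test vector and helper names of our choosing.

```python
def sgn(z):
    """Complex sign: z/|z| for nonzero z, else 0."""
    return z / abs(z) if z != 0 else 0

def l1_norm(v):
    return sum(abs(z) for z in v)

def sign_inner_product(v):
    """sgn(v)^H v = sum_k conj(sgn(v_k)) * v_k."""
    return sum(sgn(z).conjugate() * z for z in v)

v = [3 + 4j, 0, -2j, 1.5]
# |3+4j| + |0| + |-2j| + |1.5| = 5 + 0 + 2 + 1.5 = 8.5
```

Each term `conj(sgn(v_k)) * v_k` collapses to \(|v_k|\), so the inner product is real and equals the \(\ell_1\) norm.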

(Equivalence of \(\ell_1\) and \(\ell_2\) norms)

Suppose \(\bv \in \CC^N\). Then

\[
\| \bv \|_2 \leq \| \bv \|_1 \leq \sqrt{N} \| \bv \|_2.
\]

Proof. For the lower bound, we proceed as follows:

\[
\| \bv \|_1^2 = \left ( \sum_{k=1}^N | v_k | \right )^2 \geq \sum_{k=1}^N | v_k |^2 = \| \bv \|_2^2.
\]

This gives us

\[
\| \bv \|_2 \leq \| \bv \|_1.
\]

We can write the \(\ell_1\) norm as

\[
\| \bv \|_1 = \sgn(\bv)^H \bv.
\]

By the Cauchy-Schwarz inequality we have

\[
\sgn(\bv)^H \bv \leq \| \sgn(\bv) \|_2 \| \bv \|_2.
\]

Since \(\sgn(\bv)\) can have at most \(N\) non-zero values, each with magnitude 1,

\[
\| \sgn(\bv) \|_2 \leq \sqrt{N}.
\]

Thus, we get

\[
\| \bv \|_1 \leq \sqrt{N} \| \bv \|_2.
\]

(Equivalence of \(\ell_2\) and \(\ell_{\infty}\) norms)

Let \(\bv \in \CC^N\). Then

\[
\| \bv \|_{\infty} \leq \| \bv \|_2 \leq \sqrt{N} \| \bv \|_{\infty}.
\]

Proof. This follows from:

\[
\| \bv \|_{\infty}^2 = \left ( \max_{k} | v_k | \right )^2 \leq \sum_{k=1}^N | v_k |^2 \leq N \max_{k} | v_k |^2 = N \| \bv \|_{\infty}^2.
\]

Thus

\[
\| \bv \|_{\infty} \leq \| \bv \|_2 \leq \sqrt{N} \| \bv \|_{\infty}.
\]

(Relationship between \(p\)-norms)

Let \(\bv \in \CC^N\). Let \(1 \leq p \leq q \leq \infty\). Then

\[
\| \bv \|_q \leq \| \bv \|_p \leq N^{\frac{1}{p} - \frac{1}{q}} \| \bv \|_q.
\]

Proof. For the first inequality, assume without loss of generality that \(\| \bv \|_p = 1\); then \(| v_k | \leq 1\) for every \(k\), so \(| v_k |^q \leq | v_k |^p\), giving \(\| \bv \|_q \leq 1 = \| \bv \|_p\). The second inequality follows by applying Hölder's inequality to \(\sum_k | v_k |^p \cdot 1\) with the conjugate exponents \(q/p\) and \(q/(q-p)\).

Let \(\bone \in \CC^N\) be the vector of all ones; i.e., \(\bone = (1, \dots, 1)\). Let \(\bv \in \CC^N\) be an arbitrary vector. Let \(| \bv |\) denote the vector of absolute values of entries in \(\bv\); i.e., \(| \bv |_i = |v_i| \Forall 1 \leq i \leq N\). Then

\[
\| \bv \|_1 = \bone^T | \bv | = \bone^H | \bv |.
\]

Proof. This follows from:

\[
\bone^T | \bv | = \sum_{i=1}^N | v_i | = \| \bv \|_1.
\]

Finally, since \(\bone\) consists only of real entries, its transpose and Hermitian transpose are the same.

Let \(\OneMat \in \CC^{N \times N}\) be a square matrix of all ones. Let \(\bv \in \CC^N\) be an arbitrary vector. Then

\[
\bv^H \OneMat \bv \leq \| \bv \|_1^2.
\]

Proof. We know that

\[
\OneMat = \bone \bone^T.
\]

Thus,

\[
\bv^H \OneMat \bv = \bv^H \bone \bone^T \bv = | \bone^T \bv |^2 \leq \left ( \bone^T | \bv | \right )^2 = \| \bv \|_1^2.
\]

We used the fact that \(\| \bv \|_1 = \bone^T | \bv |\).

(An upper bound on the \(k\)-th largest value)

The \(k\)-th largest (in magnitude) entry in a vector \(\bx \in \CC^N\), denoted by \(x_{(k)}\), obeys

\[
| x_{(k)} | \leq \frac{\| \bx \|_1}{k}.
\]

Proof. Let \(n_1, n_2, \dots, n_N\) be a permutation of \(\{ 1, 2, \dots, N \}\) such that

\[
| x_{n_1} | \geq | x_{n_2} | \geq \dots \geq | x_{n_N} |.
\]

Thus, the \(k\)-th largest entry in \(\bx\) is \(x_{n_k}\). It is clear that

\[
\| \bx \|_1 = \sum_{j=1}^N | x_{n_j} |.
\]

Obviously

\[
k \, | x_{n_k} | \leq \sum_{j=1}^k | x_{n_j} |.
\]

Similarly

\[
\sum_{j=1}^k | x_{n_j} | \leq \sum_{j=1}^N | x_{n_j} | = \| \bx \|_1.
\]

Thus

\[
| x_{(k)} | = | x_{n_k} | \leq \frac{\| \bx \|_1}{k}.
\]
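A numerical spot-check of this bound on a sample vector (the vector and helper name are illustrative assumptions):

```python
def kth_largest_magnitude(x, k):
    """Magnitude of the k-th largest (by magnitude) entry, 1-indexed."""
    return sorted((abs(v) for v in x), reverse=True)[k - 1]

x = [0.5, -3.0, 2.0, 0.0, -1.0, 4.0]
l1 = sum(abs(v) for v in x)   # 10.5

# |x_(k)| <= ||x||_1 / k for every k
bound_holds = all(
    kth_largest_magnitude(x, k) <= l1 / k for k in range(1, len(x) + 1)
)
```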

## 18.5.6. Sparse Signals

In this subsection we explore some useful properties of \(\Sigma_K\), the set of \(K\)-sparse signals in standard basis for \(\CC^N\).

We recall that

\[
\Sigma_K = \{ \bx \in \CC^N : \| \bx \|_0 \leq K \}.
\]

We established before that this set is a union of \(\binom{N}{K}\) subspaces of \(\CC^N\), each of which is constructed by an index set \(\Lambda \subset \{1, \dots, N \}\) with \(| \Lambda | = K\) choosing \(K\) specific dimensions of \(\CC^N\).

We first present some results which connect the \(\ell_1\), \(\ell_2\) and \(\ell_{\infty}\) norms of vectors in \(\Sigma_K\).

(Relation between norms of sparse vectors)

Suppose \(\bu \in \Sigma_K\). Then

\[
\frac{\| \bu \|_1}{\sqrt{K}} \leq \| \bu \|_2 \leq \sqrt{K} \| \bu \|_{\infty}.
\]

Proof. Due to Theorem 18.10, we can write the \(\ell_1\) norm as

\[
\| \bu \|_1 = \sgn(\bu)^H \bu.
\]

By the Cauchy-Schwarz inequality we have

\[
\| \bu \|_1 = \sgn(\bu)^H \bu \leq \| \sgn(\bu) \|_2 \| \bu \|_2.
\]

Since \(\bu \in \Sigma_K\), \(\sgn(\bu)\) can have at most \(K\) non-zero values, each with magnitude 1. Thus, we have

\[
\| \sgn(\bu) \|_2 \leq \sqrt{K}.
\]

Thus we get the lower bound

\[
\| \bu \|_1 \leq \sqrt{K} \| \bu \|_2 \iff \frac{\| \bu \|_1}{\sqrt{K}} \leq \| \bu \|_2.
\]

Now \(| u_i | \leq \max(| u_i |) = \| \bu \|_{\infty}\). So we have

\[
\| \bu \|_2^2 = \sum_{i} | u_i |^2 \leq K \| \bu \|_{\infty}^2,
\]

since there are at most \(K\) non-zero terms in the expansion of \(\| \bu \|_2^2\). This establishes the upper bound:

\[
\| \bu \|_2 \leq \sqrt{K} \| \bu \|_{\infty}.
\]

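Both inequalities are easy to verify numerically; the following sketch uses an illustrative \(3\)-sparse vector of our choosing.

```python
import math

u = [0.0, 3.0, 0.0, -4.0, 0.0, 1.0, 0.0, 0.0]   # K = 3 non-zero entries
K = sum(1 for v in u if v != 0)

l1 = sum(abs(v) for v in u)                      # 8.0
l2 = math.sqrt(sum(v * v for v in u))            # sqrt(26)
linf = max(abs(v) for v in u)                    # 4.0

# ||u||_1 / sqrt(K) <= ||u||_2 <= sqrt(K) * ||u||_inf
lower_ok = l1 / math.sqrt(K) <= l2 + 1e-12
upper_ok = l2 <= math.sqrt(K) * linf + 1e-12
```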
## 18.5.7. Compressible Signals

In this subsection, we first look at some general results and definitions related to \(K\)-term approximations of arbitrary signals \(\bx \in \CC^N\). We then define the notion of a compressible signal and study properties related to it.

### 18.5.7.1. K-term Approximation of General Signals

(Restriction of a signal)

Let \(\bx \in \CC^N\). Let \(T \subset \{ 1, 2, \dots, N\}\) be any index set. Further let

\[
T = \{ t_1, t_2, \dots, t_{|T|} \}
\]

such that

\[
t_1 < t_2 < \dots < t_{|T|}.
\]

Let \(\bx_T \in \CC^{|T|}\) be defined as

\[
\bx_T = \begin{bmatrix} x_{t_1} & x_{t_2} & \dots & x_{t_{|T|}} \end{bmatrix}^T.
\]

Then \(\bx_T\) is a *restriction* of the signal \(\bx\) on the index set \(T\).

Alternatively let \(\bx_T \in \CC^N\) be defined as

\[
(\bx_T)_i = \begin{cases} x_i & \text{if } i \in T; \\ 0 & \text{otherwise}. \end{cases}
\]
In other words, \(\bx_T \in \CC^N\) keeps the entries in \(\bx\)
indexed by \(T\) while sets all other entries to 0.
Then we say that \(\bx_T\) is obtained by *masking* \(\bx\) with \(T\).

As an abuse of notation, we will use either of the two definitions whenever we are referring to \(\bx_T\). The definition being used should be obvious from the context.

(Restrictions on index sets)

Let

\[
\bx = (-1, 5, 4, 2, 0, -3, 0, 3) \in \CC^8 \quad \text{and} \quad T = \{ 1, 3, 4, 8 \}.
\]

Then the restriction of \(\bx\) on \(T\) is

\[
\bx_T = (-1, 4, 2, 3) \in \CC^{|T|},
\]

while masking \(\bx\) with \(T\) gives

\[
\bx_T = (-1, 0, 4, 2, 0, 0, 0, 3) \in \CC^8.
\]

Since \(|T| = 4\), sometimes we will also write

\[
\bx_T \in \CC^4.
\]

(\(K\)-term signal approximation)

Let \(\bx \in \CC^N\) be an arbitrary signal.
Consider any index set \(T \subset \{1, \dots, N \}\)
with \(|T| = K\).
Then \(\bx_T\) is a *\(K\)-term approximation* of \(\bx\).

Clearly for any \(\bx \in \CC^N\) there are \(\binom{N}{K}\) possible \(K\)-term approximations of \(\bx\).

(\(K\)-term approximation)

Let

\[
\bx = (2, 0, -1, 0, 0, 7, 4, 1, -2, 5) \in \CC^{10}.
\]

Let \(T= \{ 1, 6 \}\). Then

\[
\bx_T = (2, 0, 0, 0, 0, 7, 0, 0, 0, 0)
\]

is a \(2\)-term approximation of \(\bx\).

If we choose \(T= \{7,8,9,10\}\), the corresponding \(4\)-term approximation of \(\bx\) is

\[
\bx_T = (0, 0, 0, 0, 0, 0, 4, 1, -2, 5).
\]

(\(K\)-largest entries approximation)

Let \(\bx \in \CC^N\) be an arbitrary signal. Let \(\lambda_1, \dots, \lambda_N\) be indices of entries in \(\bx\) such that

\[
| x_{\lambda_1} | \geq | x_{\lambda_2} | \geq \dots \geq | x_{\lambda_N} |.
\]
In case of ties, the order is resolved lexicographically; i.e., if \(|x_i| = |x_j|\) and \(i < j\) then \(i\) will appear first in the sequence \(\{ \lambda_k \}\).

Consider the index set \(\Lambda_K = \{ \lambda_1, \lambda_2, \dots, \lambda_K\}\).
The restriction of \(\bx\) on \(\Lambda_K\), given by \(\bx_{\Lambda_K}\),
keeps the \(K\) largest entries of \(\bx\)
while setting all other entries to 0. This is known
as the *\(K\)-largest entries approximation* of \(\bx\).

This signal is denoted henceforth as \(\bx|_K\); i.e.,

\[
\bx|_K = \bx_{\Lambda_K}
\]

where \(\Lambda_K\) is the index set corresponding to the \(K\) largest entries in \(\bx\) (magnitude wise).

(Largest entries approximation)

Let

\[
\bx = (0, 2, 0, -5, 0, 0, 1, 0, 0, 0) \in \CC^{10}.
\]

Then

\[
\bx|_1 = (0, 0, 0, -5, 0, 0, 0, 0, 0, 0), \quad
\bx|_2 = (0, 2, 0, -5, 0, 0, 0, 0, 0, 0), \quad
\bx|_3 = (0, 2, 0, -5, 0, 0, 1, 0, 0, 0).
\]

All further \(K\)-largest entries approximations are the same as \(\bx\).
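The \(K\)-largest entries approximation, with ties broken lexicographically (smaller index first), can be sketched as follows; the function name and test vector are illustrative.

```python
def k_largest_approximation(x, K):
    """Keep the K largest-magnitude entries of x, zero out the rest.
    Ties are broken by smaller index first (lexicographic order)."""
    # Sort indices by (-|x_i|, i): larger magnitude first, then smaller index.
    order = sorted(range(len(x)), key=lambda i: (-abs(x[i]), i))
    keep = set(order[:K])
    return [v if i in keep else 0 for i, v in enumerate(x)]

x = [0, 2, 0, -5, 0, 0, 1, 0, 0, 0]
```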

A pertinent question at this point is: which \(K\)-term approximation of \(\bx\) is the best one? Certainly, in order to compare two approximations we need some criterion. Let us choose the \(\ell_p\) norm as the criterion. The next result characterizes the best \(K\)-term approximation in the \(\ell_p\) norm sense.

(Best \(K\)-term approximation for \(\ell_p\) norms)

Let \(\bx \in \CC^N\). Let the best \(K\)-term approximation of \(\bx\) be obtained by the following optimization program:

\[
\begin{aligned}
& \underset{T \subset \{1, \dots, N\}}{\text{maximize}} & & \| \bx_T \|_p \\
& \text{subject to} & & |T| = K
\end{aligned} \tag{18.25}
\]

where \(p \in [1, \infty]\).

Let an optimal solution for this optimization problem be denoted by \(\bx_{T^*}\). Then

\[
\| \bx|_K \|_p = \| \bx_{T^*} \|_p;
\]

i.e., the \(K\)-largest entries approximation of \(\bx\) is an optimal solution to (18.25).

Proof. For \(p=\infty\), the result is obvious. In the following, we focus on \(p \in [1, \infty)\).

We note that maximizing \(\| \bx_T \|_p\) is equivalent to maximizing \( \| \bx_T \|_p^p\).

Let \(\lambda_1, \dots, \lambda_N\) be indices of entries in \(\bx\) such that

\[
| x_{\lambda_1} | \geq | x_{\lambda_2} | \geq \dots \geq | x_{\lambda_N} |.
\]

Further let \(\{ \omega_1, \dots, \omega_N\}\) be any permutation of \(\{1, \dots, N \}\). Clearly

\[
\sum_{j=1}^K | x_{\omega_j} |^p \leq \sum_{j=1}^K | x_{\lambda_j} |^p = \| \bx|_K \|_p^p.
\]

Thus if \(T^*\) corresponds to an optimal solution of (18.25) then

\[
\| \bx_{T^*} \|_p^p \leq \| \bx|_K \|_p^p.
\]

Thus \(\bx|_K\) is an optimal solution to (18.25).

This result helps us establish that whenever we are looking for a best \(K\)-term approximation of \(\bx\) under any \(\ell_p\) norm, all we have to do is to pickup the \(K\)-largest entries in \(\bx\).
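An exhaustive check on a small example (the vector and helper names are illustrative): searching all supports of size \(K\) for the one minimizing the \(\ell_p\) approximation error recovers exactly the indices of the \(K\) largest-magnitude entries.

```python
import itertools

def lp_error(x, T, p):
    """||x - x_T||_p^p, where x_T masks entries outside T to zero."""
    return sum(abs(v) ** p for i, v in enumerate(x) if i not in T)

def best_support(x, K, p):
    """Support of size K minimizing the l_p approximation error."""
    return min(itertools.combinations(range(len(x)), K),
               key=lambda T: lp_error(x, set(T), p))

x = [3, -1, 4, 0.5, -2]
T_star = set(best_support(x, K=2, p=1))
# The two largest magnitudes are 4 (index 2) and 3 (index 0).
```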

(Restriction of a matrix)

Let \(\Phi \in \CC^{M \times N}\). Let \(T \subset \{ 1, 2, \dots, N\}\) be any index set. Further let

\[
T = \{ t_1, t_2, \dots, t_{|T|} \}
\]

such that

\[
t_1 < t_2 < \dots < t_{|T|}.
\]

Let \(\Phi_T \in \CC^{M \times |T|}\) be defined as

\[
\Phi_T = \begin{bmatrix} \phi_{t_1} & \phi_{t_2} & \dots & \phi_{t_{|T|}} \end{bmatrix}
\]

where \(\phi_i\) denotes the \(i\)-th column of \(\Phi\).

Then \(\Phi_T\) is a *restriction* of the matrix \(\Phi\) on the index set \(T\).

Alternatively let \(\Phi_T \in \CC^{M \times N}\) be defined column-wise as

\[
(\Phi_T)_i = \begin{cases} \phi_i & \text{if } i \in T; \\ \bzero & \text{otherwise}. \end{cases}
\]
In other words, \(\Phi_T \in \CC^{M \times N}\)
keeps the columns in \(\Phi\)
indexed by \(T\) while sets all other columns to \(\bzero\).
Then we say that \(\Phi_T\) is obtained by *masking* \(\Phi\) with \(T\).

As an abuse of notation, we will use either of the two definitions whenever we are referring to \(\Phi_T\). The definition being used should be obvious from the context.

Let \(\supp(\bx) = \Lambda\). Then

\[
\Phi \bx = \Phi_{\Lambda} \bx_{\Lambda}.
\]

Proof. This follows from:

\[
\Phi \bx = \sum_{i=1}^N x_i \phi_i = \sum_{\lambda \in \Lambda} x_{\lambda} \phi_{\lambda} = \Phi_{\Lambda} \bx_{\Lambda}.
\]
The result remains valid whether we use the restriction or the mask version of \(\bx_{\Lambda}\) notation as long as same version is used for both \(\Phi\) and \(\bx\).
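A quick numerical check of \(\Phi \bx = \Phi_{\Lambda} \bx_{\Lambda}\) using the restriction version (the matrix and vector are illustrative; matrices are stored as lists of rows):

```python
def matvec(Phi, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(row[j] * x[j] for j in range(len(x))) for row in Phi]

Phi = [[1, 0, 2, -1],
       [0, 3, 1,  4]]        # a 2 x 4 matrix
x = [0, 5, 0, -2]            # supp(x) = {1, 3} (0-based indices)
Lam = [1, 3]

Phi_Lam = [[row[j] for j in Lam] for row in Phi]   # column restriction
x_Lam = [x[j] for j in Lam]                        # entry restriction

full = matvec(Phi, x)
restricted = matvec(Phi_Lam, x_Lam)
```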

Let \(S\) and \(T\) be two disjoint index sets such that for some \(\bx \in \CC^N\)

\[
\bx = \bx_S + \bx_T,
\]

using the mask version of \(\bx_{\Lambda}\) notation. Then the following holds:

\[
\Phi \bx = \Phi_S \bx_S + \Phi_T \bx_T.
\]

Proof. Straightforward application of Theorem 18.19:

\[
\Phi \bx = \Phi (\bx_S + \bx_T) = \Phi \bx_S + \Phi \bx_T = \Phi_S \bx_S + \Phi_T \bx_T.
\]
Let \(T\) be any index set. Let \(\Phi \in \CC^{M \times N}\) and \(\by \in \CC^M\). Then

\[
\left ( \Phi^H \by \right )_T = \Phi_T^H \by.
\]

Proof. Note that

\[
\Phi^H \by = \begin{bmatrix} \phi_1^H \by \\ \vdots \\ \phi_N^H \by \end{bmatrix}.
\]

Now let

\[
T = \{ t_1, t_2, \dots, t_{|T|} \}.
\]

Then

\[
\left ( \Phi^H \by \right )_T = \begin{bmatrix} \phi_{t_1}^H \by \\ \vdots \\ \phi_{t_{|T|}}^H \by \end{bmatrix} = \Phi_T^H \by.
\]

The result remains valid whether we use the restriction or the mask version of \(\Phi_T\) notation.

### 18.5.7.2. Compressible Signals

We will now define the notion of a compressible signal in terms of the decay rate of magnitude of its entries when sorted in descending order.

(Compressible signal)

Let \(\bx \in \CC^N\) be an arbitrary signal. Let \(\lambda_1, \dots, \lambda_N\) be indices of entries in \(\bx\) such that

\[
| x_{\lambda_1} | \geq | x_{\lambda_2} | \geq \dots \geq | x_{\lambda_N} |.
\]

In case of ties, the order is resolved lexicographically; i.e., if \(|x_i| = |x_j|\) and \(i < j\) then \(i\) will appear first in the sequence \(\{ \lambda_k \}\). Define

\[
\widehat{x}_k = x_{\lambda_k}, \quad 1 \leq k \leq N. \tag{18.28}
\]

The signal \(\bx\) is called *\(p\)-compressible* with magnitude \(R\)
if there exists \(p \in (0, 1]\) such that

\[
| \widehat{x}_k | \leq R \cdot k^{-\frac{1}{p}} \quad \Forall 1 \leq k \leq N. \tag{18.29}
\]
(\(1\)-compressible signals)

Let \(\bx\) be \(p\)-compressible with \(p=1\). Then

\[
\| \bx \|_1 \leq R (1 + \ln N).
\]

Proof. Recalling \(\widehat{x}\) from (18.28), it is straightforward to see that

\[
\| \bx \|_1 = \sum_{k=1}^N | \widehat{x}_k |
\]

since the \(\ell_1\) norm doesn’t depend on the ordering of entries in \(\bx\).

Now since \(\bx\) is \(1\)-compressible, from (18.29) we have

\[
| \widehat{x}_k | \leq \frac{R}{k}.
\]

This gives us

\[
\| \bx \|_1 \leq \sum_{k=1}^N \frac{R}{k} = R \sum_{k=1}^N \frac{1}{k}.
\]

The sum on the R.H.S. is the \(N\)-th harmonic number \(H_N\) (sum of reciprocals of the first \(N\) natural numbers). A simple upper bound on harmonic numbers is

\[
H_N \leq 1 + \ln N.
\]

This completes the proof.
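A numerical spot-check of the bound \(\| \bx \|_1 \leq R(1 + \ln N)\), using an assumed worst-case \(1\)-compressible signal with \(| \widehat{x}_k | = R/k\) exactly:

```python
import math

N, R = 1000, 2.0
x = [R / k for k in range((1), N + 1)]   # sorted, exactly 1-compressible

l1 = sum(abs(v) for v in x)              # equals R * H_N
bound = R * (1 + math.log(N))            # R * (1 + ln N)
```

Here \(\| \bx \|_1 = R\,H_N \approx R(\ln N + 0.577)\), safely below \(R(1 + \ln N)\).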

We now demonstrate how a compressible signal is well approximated by a sparse signal.

(Sparse approximation of compressible signals)

Let \(\bx\) be a \(p\)-compressible signal with \(p \in (0, 1)\) and let \(\bx|_K\) be its best \(K\)-term approximation. Then the \(\ell_1\) norm of the approximation error satisfies

\[
\| \bx - \bx|_K \|_1 \leq C_p \cdot R \cdot K^{1 - \frac{1}{p}}
\]

with

\[
C_p = \left ( \frac{1}{p} - 1 \right )^{-1}.
\]

Moreover, the \(\ell_2\) norm of the approximation error satisfies

\[
\| \bx - \bx|_K \|_2 \leq D_p \cdot R \cdot K^{\frac{1}{2} - \frac{1}{p}}
\]

with

\[
D_p = \left ( \frac{2}{p} - 1 \right )^{-1/2}.
\]

Proof. Expanding the \(\ell_1\) approximation error:

\[
\| \bx - \bx|_K \|_1 = \sum_{k=K+1}^N | \widehat{x}_k | \leq R \sum_{k=K+1}^N k^{-\frac{1}{p}}.
\]

We now approximate the R.H.S. sum with an integral:

\[
\sum_{k=K+1}^N k^{-\frac{1}{p}} \leq \int_{K}^{\infty} x^{-\frac{1}{p}} \, d x = \left [ \frac{x^{1 - \frac{1}{p}}}{1 - \frac{1}{p}} \right ]_{K}^{\infty}.
\]

Now, since \(1 - \frac{1}{p} < 0\) for \(p \in (0, 1)\),

\[
\left [ \frac{x^{1 - \frac{1}{p}}}{1 - \frac{1}{p}} \right ]_{K}^{\infty} = \left ( \frac{1}{p} - 1 \right )^{-1} K^{1 - \frac{1}{p}} = C_p K^{1 - \frac{1}{p}}.
\]

We can similarly show the result for the \(\ell_2\) norm.

(Sparse approximation for \(\frac{1}{2}\)-compressible signals)

Let \(p = \frac{1}{2}\). Then

\[
C_p = \left ( \frac{1}{p} - 1 \right )^{-1} = 1 \quad \text{and} \quad D_p = \left ( \frac{2}{p} - 1 \right )^{-1/2} = \frac{1}{\sqrt{3}}.
\]

Hence

\[
\| \bx - \bx|_K \|_1 \leq R \cdot K^{-1}
\]

and

\[
\| \bx - \bx|_K \|_2 \leq \frac{1}{\sqrt{3}} \cdot R \cdot K^{-3/2}.
\]

Both \(\ell_1\) and \(\ell_2\) approximation error bounds decrease polynomially as \(K\) increases, with the \(\ell_2\) bound decaying faster (as \(K^{-3/2}\) versus \(K^{-1}\)).
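The \(\ell_1\) bound for the \(p = \frac{1}{2}\) case can be spot-checked numerically with an assumed worst-case signal whose sorted magnitudes are exactly \(R\,k^{-2}\):

```python
N, R = 10000, 3.0
# p = 1/2 means |x_(k)| <= R * k^(-1/p) = R * k^(-2); take equality.
x_sorted = [R / (k * k) for k in range(1, N + 1)]

def tail_l1(K):
    """l1 error of the best K-term approximation for the sorted signal."""
    return sum(x_sorted[K:])

# ||x - x|_K||_1 <= R / K for several values of K
bounds_hold = all(tail_l1(K) <= R / K for K in (1, 5, 50, 500))
```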