# 18.4. Sparsity in Orthonormal Bases

We start this section with a quick review of orthonormal bases and orthogonal transforms for finite dimensional signals \(\bx \in \CC^N\). We look at several examples of sparse signals in different orthonormal bases. We then demonstrate that while an orthonormal basis provides a complete representation of every signal \(\bx \in \CC^N\), it is not an adequate tool for exploiting the sparsity in \(\bx\).

We present an uncertainty principle which explains why a pair of orthonormal bases (like the Dirac and Fourier bases) cannot both give a sparse representation of the same signal simultaneously.

We then demonstrate that a combination of two orthonormal bases can be quite useful in creating a redundant yet sparse representation of a larger class of signals which could not be sparsely represented in either of the two bases individually.

This motivates us to discuss more general over-complete signal dictionaries in the next section.

## 18.4.1. Orthonormal Bases and Orthogonal Transforms

In signal processing, we often convert a finite length time domain signal into a different domain
using finite length transforms. Some of the most common transforms are
*discrete Fourier transform*,
the *discrete cosine transform*, and the *Haar transform*.
They all belong to the class of transforms called orthogonal transforms.

Orthogonal transforms are characterized by a pair of equations

\[
\bx = \Psi \ba \tag{18.5}
\]

and

\[
\ba = \Psi^H \bx \tag{18.6}
\]
where \(\Psi\) is an *orthonormal basis* for the complex vector space \(\CC^N\).
In particular, the columns of \(\Psi\) are unit norm, and orthogonal to each other.
Thus if we write

\[
\Psi = \begin{bmatrix} \psi_1 & \psi_2 & \dots & \psi_N \end{bmatrix},
\]

then

\[
\langle \psi_i, \psi_j \rangle = \delta_{i j}.
\]

In other words:

\[
\Psi^H \Psi = \Psi \Psi^H = \bI.
\]

Eq. (18.5) is known as the synthesis equation (\(\bx\) is synthesized by columns of \(\Psi\)). Eq. (18.6) is known as the analysis equation as we compute the coefficients in \(\ba\) by taking the inner product of \(\bx\) with columns of \(\Psi\).

\(\Psi\) is known as the *synthesis operator* while \(\Psi^H\) is known as the *analysis operator*. Orthogonal transforms preserve the norm of the signal:

\[
\| \bx \|_2 = \| \ba \|_2.
\]

This result is commonly known as *Parseval’s identity* in the signal processing community.

More generally, orthogonal transforms preserve inner products:

\[
\langle \bx, \by \rangle = \langle \Psi^H \bx, \Psi^H \by \rangle.
\]
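These properties are easy to verify numerically. A minimal sketch with NumPy, using the unitary DFT matrix as a stand-in for \(\Psi\):

```python
import numpy as np

N = 8
# Unitary DFT matrix serves as an example orthonormal basis Psi
Psi = np.fft.ifft(np.eye(N), norm="ortho")

rng = np.random.default_rng(0)
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)

a = Psi.conj().T @ x   # analysis: coefficients of x in the basis
x_rec = Psi @ a        # synthesis: reconstruct x from its coefficients

# norm of a equals norm of x (Parseval), and x is perfectly reconstructed
```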

(Dirac basis and sparse signals)

The simplest orthogonal transform is the identity basis or the standard ordered basis for \(\CC^N\).

We have

\[
\Psi = \bI_N.
\]

In this basis

\[
\ba = \bI_N^H \bx = \bx.
\]

We will drop \(N\) from the suffix for convenience and refer to the matrix as \(\bI\) only.

This basis is also known as *Dirac basis*. The name *Dirac* comes from the
Dirac delta functions used in signal analysis in continuous time domain.

The basis consists of finite length impulses denoted by \(\be_i\) where

\[
\be_i(j) = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases}
\]
If a signal \(\bx\) consists of a linear combination of a few (\(K \ll N\)) impulses, then it is a sparse signal in this basis. For example, the signal

is \(3\)-sparse in Dirac basis since

can be expressed as a linear combination of just 3 impulses.

In contrast, if we consider a complex sinusoid in \(\CC^N\), there is no way to find a sparse representation for it in the Dirac basis.
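A small numerical illustration of the contrast (the impulse positions and amplitudes below are arbitrary choices for the example):

```python
import numpy as np

N = 64
# Three impulses: 3-sparse in the Dirac basis
x_impulses = np.zeros(N)
x_impulses[[3, 17, 40]] = [1.0, -2.0, 0.5]

# A single complex sinusoid: every entry is non-zero in the Dirac basis
n = np.arange(N)
x_sinusoid = np.exp(2j * np.pi * 5 * n / N)

sparsity_impulses = np.count_nonzero(x_impulses)   # 3
sparsity_sinusoid = np.count_nonzero(x_sinusoid)   # N
```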

## 18.4.2. Fourier Basis and Sparse Signals

The most popular finite length orthogonal transform is DFT (Discrete Fourier Transform).

We define the \(N\)-th root of unity as

\[
\omega_N = e^{\frac{2 \pi i}{N}}.
\]

Clearly

\[
\omega_N^N = 1.
\]

We define the synthesis matrix of DFT as

\[
\bF_N = \frac{1}{\sqrt{N}} \left[ \omega_N^{k n} \right]_{0 \leq k, n \leq N - 1}
\]

where

\[
(\bF_N)_{k, n} = \frac{1}{\sqrt{N}} \omega_N^{k n}.
\]

\(k\) iterates over rows of \(\bF_N\) while \(n\) iterates over columns of \(\bF_N\).

The definition is symmetric in \(k\) and \(n\). Hence

\[
\bF_N^T = \bF_N.
\]

Note that we have multiplied with \(\frac{1}{\sqrt{N}}\) to make sure that columns of \(\bF_N\) are unit norm.

The columns of \(\bF_N\) form the *Fourier basis* for signals in \(\CC^N\).
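The definition can be sketched directly in code; the following builds \(\bF_N\) from the formula above and checks that it is symmetric and unitary (the sign of the exponent is a convention):

```python
import numpy as np

def fourier_basis(N):
    """Synthesis matrix of the DFT: entries omega^(k*n) / sqrt(N)."""
    omega = np.exp(2j * np.pi / N)    # N-th root of unity
    k = np.arange(N).reshape(-1, 1)   # row index
    n = np.arange(N).reshape(1, -1)   # column index
    return omega ** (k * n) / np.sqrt(N)

F = fourier_basis(8)
# F is symmetric (the definition is symmetric in k and n) and unitary
```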

(Fourier basis for \(N=2\))

The 2nd root of unity is given by

\[
\omega_2 = e^{\pi i} = -1.
\]

Hence

\[
\bF_2 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}.
\]

In this case \(\bF_2\) is real, so the analysis matrix \(\bF_2^H\) equals \(\bF_2\).

(Fourier basis for \(N=3\))

The 3rd root of unity is given by

\[
\omega_3 = e^{\frac{2 \pi i}{3}}.
\]

Hence

\[
\bF_3 = \frac{1}{\sqrt{3}} \begin{bmatrix} 1 & 1 & 1 \\ 1 & \omega_3 & \omega_3^2 \\ 1 & \omega_3^2 & \omega_3 \end{bmatrix}.
\]

In this case the entries are complex, and \(\bF_3\) is symmetric but not Hermitian.

(Fourier basis for \(N=4\))

The 4th root of unity is given by

\[
\omega_4 = e^{\frac{\pi i}{2}} = i.
\]

Hence

\[
\bF_4 = \frac{1}{2} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & i & -1 & -i \\ 1 & -1 & 1 & -1 \\ 1 & -i & -1 & i \end{bmatrix}.
\]

In this case the entries take only the values \(\pm \frac{1}{2}\) and \(\pm \frac{i}{2}\).

We drop the suffix \(N\) wherever convenient and simply refer to the synthesis matrix as \(\bF\).

If a signal \(\bx\) is a linear combination of only a few (\(K \ll N\)) complex sinusoids, then \(\bx\) has a sparse representation in the DFT basis \(\bF\).

(Sparse signals in \(\bF\))

Consider the following signal

Its representation in \(\bF_4\) is given by

Clearly the signal is \(2\)-sparse in \(\bF_4\).

Now consider a signal \(\be_2\) which is sparse in the Dirac basis

Its representation in \(\bF_4\) is

Thus we see that while \(\be_2\) is sparse in \(\bI_4\), it is not at all sparse in \(\bF_4\).
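This can be checked numerically; a sketch using NumPy's unitary FFT as the Fourier basis:

```python
import numpy as np

N = 4
F = np.fft.ifft(np.eye(N), norm="ortho")  # unitary Fourier synthesis matrix

e2 = np.zeros(N)
e2[1] = 1.0                # an impulse: 1-sparse in the Dirac basis
b = F.conj().T @ e2        # its representation in the Fourier basis

# every Fourier coefficient has magnitude 1/sqrt(N): maximally dense
```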

## 18.4.3. An Uncertainty Principle

As we noted, the Dirac basis can give sparse representations for impulses but not for complex sinusoids. Conversely, the Fourier basis can give sparse representations for complex sinusoids but not for impulses.

Can we claim that a signal cannot have a sparse representation simultaneously in both the time domain (Dirac basis) and the frequency domain (Fourier basis)?

More generally, let \(\Psi\) and \(\Chi\) be any two arbitrary orthonormal bases for \(\CC^N\). For some \(\bx \in \CC^N\), let
\[
\bx = \Psi \ba = \Chi \bb
\]

where \(\ba\) and \(\bb\) are representations of \(\bx\) in \(\Psi\) and \(\Chi\) respectively. Can we claim something for the relationship between sparsity levels \(\| \ba \|_0\) and \(\| \bb \|_0\)?

The answer turns out to be yes, but it depends on how much the two bases are similar or close to each other. The results in this section were originally developed in [31].

(Proximity/Mutual coherence)

The *proximity* between two orthonormal bases \(\Psi\) and \(\Chi\) where

\[
\Psi = \begin{bmatrix} \psi_1 & \psi_2 & \dots & \psi_N \end{bmatrix}
\]

and

\[
\Chi = \begin{bmatrix} \chi_1 & \chi_2 & \dots & \chi_N \end{bmatrix}
\]

is defined as the maximum absolute value of inner products between the columns of the two bases:

\[
\mu(\Psi, \Chi) = \max_{1 \leq i, j \leq N} | \langle \psi_i, \chi_j \rangle |.
\]
This is also known as *mutual coherence* of the two orthonormal bases.

If the two bases are identical, then clearly \(\mu = 1\). If any vector in \(\Psi\) is very close to some vector in \(\Chi\) then we will have a very high value of \(\mu\) (close to 1).

If the vectors in \(\Psi\) and \(\Chi\) are highly dissimilar, then we will have very low value of \(\mu\) close to 0.

(Proximity or mutual coherence of Dirac and Fourier bases)

Consider the mutual coherence between Dirac and Fourier bases.

For some small values of \(N\) (the dimension of the ambient space \(\CC^N\)), the values are tabulated below.

| \(N\) | \(\mu\) |
|---|---|
| 2 | 0.7071 |
| 4 | 0.5000 |
| 6 | 0.4082 |
| 8 | 0.3536 |

For larger values of \(N\) we can see the variation of \(\mu\) in the plot below.
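The tabulated values are easy to reproduce; a sketch:

```python
import numpy as np

def mutual_coherence(Psi, Chi):
    # maximum absolute inner product between columns of the two bases
    return np.max(np.abs(Psi.conj().T @ Chi))

for N in (2, 4, 6, 8):
    F = np.fft.ifft(np.eye(N), norm="ortho")   # Fourier basis
    mu = mutual_coherence(np.eye(N), F)        # Dirac vs Fourier
    print(N, round(mu, 4))                     # 0.7071, 0.5, 0.4082, 0.3536
```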

We present some results related to mutual coherence of two orthonormal bases.

(Product of orthonormal bases)

The product of two orthonormal bases \(\Psi\) and \(\Chi\) for \(\CC^N\) given by \(\Psi^H \Chi\) forms an orthonormal basis by itself.

Proof. Consider the matrix \(\Psi^H \Chi\).

Any column of the product matrix is of the form

\[
\Psi^H \chi_j.
\]

But \(\Psi^H\) preserves norms, hence

\[
\| \Psi^H \chi_j \|_2 = \| \chi_j \|_2 = 1.
\]

Thus each column of the product \(\Psi^H \Chi\) is itself a unit norm vector.

Consider the inner product of two distinct columns of \(\Psi^H \Chi\):

\[
\langle \Psi^H \chi_i, \Psi^H \chi_j \rangle
= \chi_j^H \Psi \Psi^H \chi_i
= \chi_j^H \chi_i
= \langle \chi_i, \chi_j \rangle = 0
\quad \text{for } i \neq j.
\]

Thus, the columns of \(\Psi^H \Chi\) are orthogonal to each other.

Hence \(\Psi^H \Chi\) forms an orthonormal basis.

(The group of orthonormal bases)

A more general result is that the set of orthonormal bases forms a group under matrix multiplication. The identity element is the Dirac basis, the inverse of an orthonormal basis is also an orthonormal basis, the product of two orthonormal bases is an orthonormal basis, and matrix multiplication is associative.

(Mutual coherence bounds)

Mutual coherence of two orthonormal bases \(\Psi\) and \(\Chi\) for the complex vector space \(\CC^N\) is bounded by
\[
\frac{1}{\sqrt{N}} \leq \mu(\Psi, \Chi) \leq 1.
\]

Proof. Since the columns of both \(\Psi\) and \(\Chi\) are unit norm, by the Cauchy-Schwarz inequality \(|\langle \psi_i, \chi_j \rangle |\) cannot be greater than 1.

Now if \(\Psi = \Chi\) then we have \(\mu(\Psi, \Chi) = 1\). This proves the upper bound.

Now consider the matrix \(\Psi^H \Chi\), which forms an orthonormal basis by itself. Consider any column of this matrix:

\[
\Psi^H \chi_j = \begin{bmatrix} \langle \psi_1, \chi_j \rangle & \langle \psi_2, \chi_j \rangle & \dots & \langle \psi_N, \chi_j \rangle \end{bmatrix}^T.
\]

Since the column is unit norm, the sum of squares of the absolute values of its entries is 1. Thus the absolute values of all the entries cannot be simultaneously less than \(\frac{1}{\sqrt{N}}\).

Hence there exists an entry (in each column) such that

\[
| \langle \psi_i, \chi_j \rangle | \geq \frac{1}{\sqrt{N}}.
\]

Hence we get the lower bound on the mutual coherence of \(\Psi\) and \(\Chi\):

\[
\mu(\Psi, \Chi) \geq \frac{1}{\sqrt{N}}.
\]

(Mutual coherence of Dirac and Fourier bases)

Mutual coherence of Dirac and Fourier bases is \(\frac{1}{\sqrt{N}}\).

Proof. Theorem 18.3 shows that

\[
\mu(\bI, \bF) \geq \frac{1}{\sqrt{N}}.
\]

We just need to show that it is in fact an equality.

Consider the \(i\)-th column of \(\bI\), namely \(\be_i\), and the \(j\)-th column of \(\bF\) (indexing columns from 0):

\[
\frac{1}{\sqrt{N}} \begin{bmatrix} 1 & \omega^{j} & \omega^{2 j} & \dots & \omega^{(N-1) j} \end{bmatrix}^T,
\]

where \(\omega\) is the \(N\)-th root of unity. Then

\[
| \langle \be_i, \bF \be_j \rangle | = \frac{1}{\sqrt{N}} \left| \omega^{i j} \right| = \frac{1}{\sqrt{N}}.
\]

This doesn’t depend on the choice of \(i\) and \(j\). Hence

\[
\mu(\bI, \bF) = \frac{1}{\sqrt{N}}.
\]

With basic properties of mutual coherence in place, we are now ready to state an uncertainty principle on the sparsity levels of representations of the same signal \(\bx\) in two different orthonormal bases:

(Uncertainty principle)

For any arbitrary pair of orthonormal bases \(\Psi\), \(\Chi\) with mutual coherence \(\mu(\Psi, \Chi)\), and for any arbitrary non-zero vector \(\bx \in \CC^N\) with representations \(\ba\) and \(\bb\) respectively, the following inequality holds true:

\[
\| \ba \|_0 + \| \bb \|_0 \geq \frac{2}{\mu(\Psi, \Chi)}.
\]

Moreover for unit-length \(\bx\) we have:

\[
\| \ba \|_1 + \| \bb \|_1 \geq \frac{2}{\sqrt{\mu(\Psi, \Chi)}}.
\]

Proof. Dividing \(\bx\) by \(\| \bx \|_2\) doesn’t change the \(\ell_0\) “norms” of \(\ba\) and \(\bb\); i.e.,

\[
\left\| \frac{\ba}{\| \bx \|_2} \right\|_0 = \| \ba \|_0
\quad \text{and} \quad
\left\| \frac{\bb}{\| \bx \|_2} \right\|_0 = \| \bb \|_0.
\]
Hence without loss of generality, we will assume that \(\| \bx \|_2 = 1\).

We are given that \(\bx^H \bx = 1\), \(\bx = \Psi \ba\) and \(\bx = \Chi \bb\). Since \(\Psi\) and \(\Chi\) are orthonormal bases, we also have \(\| \ba \|_2 = \| \bb \|_2 = 1\).

We can write

\[
1 = \bx^H \bx = (\Psi \ba)^H (\Chi \bb) = \ba^H \left( \Psi^H \Chi \right) \bb = \sum_{i=1}^N \sum_{j=1}^N \overline{a_i} b_j \, \psi_i^H \chi_j,
\]

where we note that

\[
| \psi_i^H \chi_j | \leq \mu(\Psi, \Chi) \quad \forall \; i, j,
\]

and

\[
1 \leq \sum_{i=1}^N \sum_{j=1}^N |a_i| |b_j| \, | \psi_i^H \chi_j | \leq \mu(\Psi, \Chi) \, \| \ba \|_1 \| \bb \|_1.
\]

Hence we get the inequality

\[
\| \ba \|_1 \| \bb \|_1 \geq \frac{1}{\mu(\Psi, \Chi)}.
\]

Using the inequality between the arithmetic mean and the geometric mean,

\[
\frac{p + q}{2} \geq \sqrt{p q} \quad \text{for } p, q \geq 0,
\]

we get

\[
\| \ba \|_1 + \| \bb \|_1 \geq 2 \sqrt{\| \ba \|_1 \| \bb \|_1} \geq \frac{2}{\sqrt{\mu(\Psi, \Chi)}}.
\]

This is an uncertainty principle for the \(\ell_1\) norms of the two representations. We still have to get the uncertainty principle for the \(\ell_0\) case. Let

\[
A = \| \ba \|_0 \quad \text{and} \quad B = \| \bb \|_0.
\]

Consider the sets

\[
X_A = \{ \by \in \CC^N \mid \| \by \|_2 = 1, \; \| \Psi^H \by \|_0 = A \}
\]

and

\[
X_B = \{ \by \in \CC^N \mid \| \by \|_2 = 1, \; \| \Chi^H \by \|_0 = B \}.
\]

Clearly

\[
\bx \in X_A \cap X_B.
\]

The representations \(\bu = \Psi^H \by\) for vectors \(\by \in X_A\) have exactly \(A\) non-zero entries and are all unit norm. Which of them has the largest \(\ell_1\) norm \(\| \bu \|_1\)?

This can be written as an optimization problem of the form

\[
\underset{\by \in X_A}{\text{maximize}} \; \| \Psi^H \by \|_1. \tag{18.10}
\]
Let the optimal solution for this problem be \(\bv_a\) with corresponding representation \(\ba^* = \Psi^H \bv_a\). Clearly

\[
\| \ba \|_1 \leq \| \ba^* \|_1.
\]
Similarly, from the set \(X_B\) let us find the vector \(\bv_b\) whose representation in \(\Chi\) has maximum \(\ell_1\) norm:

\[
\underset{\by \in X_B}{\text{maximize}} \; \| \Chi^H \by \|_1. \tag{18.11}
\]
Let the optimal solution for this problem be \(\bv_b\) with corresponding representation \(\bb^* = \Chi^H \bv_b\). Clearly

\[
\| \bb \|_1 \leq \| \bb^* \|_1.
\]
Returning to the inequality

\[
\| \ba \|_1 \| \bb \|_1 \geq \frac{1}{\mu(\Psi, \Chi)},
\]

we can write

\[
\frac{1}{\mu(\Psi, \Chi)} \leq \| \ba \|_1 \| \bb \|_1 \leq \| \ba^* \|_1 \| \bb^* \|_1.
\]

An equivalent formulation of the optimization problem (18.10) is

\[
\underset{\ba \in \CC^N}{\text{maximize}} \; \| \ba \|_1
\quad \text{subject to} \quad \| \ba \|_2 = 1, \; \| \ba \|_0 = A. \tag{18.12}
\]

This formulation doesn’t require any specific mention of the basis \(\Psi\). Let the optimal value for this problem be given by

\[
g(A) = \| \ba^* \|_1.
\]

Here we consider the optimization problem to be parameterized by the \(\ell_0\)-“norm” of \(\ba\); i.e., \(\| \ba \|_0 = A\) and we write the optimal value as a function \(g\) of the parameter \(A\).

Then by symmetry, the optimal value for the problem (18.11) is

\[
g(B) = \| \bb^* \|_1.
\]

Thus we can write

\[
g(A) \, g(B) = \| \ba^* \|_1 \| \bb^* \|_1 \geq \frac{1}{\mu(\Psi, \Chi)}. \tag{18.13}
\]

This is our intended result since we have been able to write the inequality as a function of \(\ell_0\) “norm”s of the representations \(\ba\) and \(\bb\).

In order to complete the result, we need to find the solution of the optimization problem (18.12) given by the function \(g\).

Without loss of generality, let us assume that the \(A\) non-zero entries in the optimization variable \(\ba\) appear in its first \(A\) entries and rest are zero. This is fine since changing the order of entries in \(\ba\) doesn’t affect any of the norms of concern \(\| \ba \|_0\), \(\| \ba \|_1\) and \(\| \ba \|_2\).

Let us further assume that all non-zero entries of \(\ba\) are strictly positive real numbers. This assumption is valid since only absolute values are used in this problem. Specifically, for any \(\ba\) with non-zero complex entries \((a_1, a_2, \dots, a_A, 0, \dots, 0)\) there exists \(\ba'\) with positive entries \((|a_1|, |a_2|, \dots, |a_A|, 0, \dots, 0)\) such that \(\|\ba\|_0 = \|\ba'\|_0\), \(\|\ba\|_1 = \|\ba'\|_1\) and \(\|\ba\|_2 = \|\ba'\|_2\); hence solving the optimization problem for complex \(\ba\) is the same as solving it for \(\ba\) with strictly positive first \(A\) entries.

Using Lagrange multipliers, the \(\ell_0\) constraint vanishes (since the assumptions mentioned above allow us to focus on only the first \(A\) coordinates of \(\ba\)), and we obtain

\[
L(\ba, \lambda) = \sum_{i=1}^A a_i + \lambda \left( 1 - \sum_{i=1}^A a_i^2 \right).
\]

Differentiating w.r.t. \(a_i\) and equating to 0 we get

\[
1 - 2 \lambda a_i = 0 \implies a_i = \frac{1}{2 \lambda}.
\]

The \(\ell_2\) constraint requires

\[
\sum_{i=1}^A a_i^2 = \frac{A}{4 \lambda^2} = 1 \implies \lambda = \frac{\sqrt{A}}{2}.
\]

Thus

\[
a_i = \frac{1}{\sqrt{A}} \quad \text{for } 1 \leq i \leq A,
\]

and

\[
\| \ba \|_1 = \sum_{i=1}^A a_i = \sqrt{A}.
\]

Thus the optimal value of the optimization problem (18.12) is

\[
g(A) = \sqrt{A}.
\]

Similarly

\[
g(B) = \sqrt{B}.
\]

Putting back in (18.13) we get

\[
\sqrt{A B} = \sqrt{\| \ba \|_0 \| \bb \|_0} \geq \frac{1}{\mu(\Psi, \Chi)}.
\]

Applying the arithmetic mean-geometric mean inequality we get the desired result

\[
\| \ba \|_0 + \| \bb \|_0 \geq 2 \sqrt{\| \ba \|_0 \| \bb \|_0} \geq \frac{2}{\mu(\Psi, \Chi)}.
\]

This theorem suggests that if two orthonormal bases have low mutual coherence, then

- the two representations for \(\bx\) cannot be jointly \(\ell_1\)-short, and
- the two representations for \(\bx\) cannot be jointly sparse.

Challenge

Can we show that the above result is sharp? I.e., for a pair of orthonormal bases \(\Psi\) and \(\Chi\), is it always possible to find a non-zero vector \(\bx\) with corresponding representations \(\bx = \Psi \ba\) and \(\bx = \Chi \bb\) which satisfies the lower bound?

(Sparse representations with Dirac and Fourier bases)

We showed in Theorem 18.4 that

\[
\mu(\bI, \bF) = \frac{1}{\sqrt{N}}.
\]

Let \(\bx \in \CC^N\). Its representation in \(\bI\) is \(\bx\) itself. Let its representation in \(\bF\) be given by

\[
\bb = \bF^H \bx.
\]

Applying Theorem 18.5 we have

\[
\| \bx \|_0 + \| \bb \|_0 \geq \frac{2}{\mu(\bI, \bF)} = 2 \sqrt{N}.
\]

This tells us that a signal cannot have fewer than \(2 \sqrt{N}\) non-zeros in the time and frequency domains combined.
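A numerical sanity check of this bound; a sketch that counts the non-zeros (up to a small tolerance) of an impulse in both domains:

```python
import numpy as np

N = 16
F = np.fft.ifft(np.eye(N), norm="ortho")  # unitary Fourier basis

x = np.zeros(N)
x[0] = 1.0                                # 1-sparse in the time domain

a0 = np.count_nonzero(np.abs(x) > 1e-10)               # non-zeros in Dirac basis
b0 = np.count_nonzero(np.abs(F.conj().T @ x) > 1e-10)  # non-zeros in Fourier basis

# uncertainty principle: a0 + b0 >= 2 * sqrt(N)  (here 1 + 16 >= 8)
```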

## 18.4.4. Linear Combinations of Impulses and Sinusoids

What happens if a signal \(\bx\) is a linear combination of few complex sinusoids and few impulses?

The set of sinusoids and impulses involved in the construction of \(\bx\) specifies the degrees of freedom of \(\bx\). It indicates the inherent sparsity of \(\bx\), provided this set of component signals is known a priori.

In the absence of prior knowledge of the component signals of \(\bx\), we attempt to look for a sparse representation of \(\bx\) in one of the well understood orthonormal bases. Here we specifically consider the two bases: Dirac and Fourier.

While the Dirac basis can provide sparse representations for impulses, sinusoids have dense representations in the Dirac basis. Conversely, in the Fourier basis, complex sinusoids have sparse representations, yet impulses have dense representations. Thus neither of the two bases can provide a sparse representation for a combination of impulses and sinusoids.

The natural question arises: is there a way to come up with a sparse representation for such signals by combining the Dirac and Fourier bases?

## 18.4.5. Dirac Fourier Basis

Now we develop a representation of signals \(\bx \in \CC^N\) in terms of a combination of Dirac basis \(\bI\) and Fourier basis \(\bF\).

We define a new synthesis matrix

\[
\bH = \begin{bmatrix} \bI & \bF \end{bmatrix}.
\]

We can write \(\bI\) as

\[
\bI = \begin{bmatrix} \be_1 & \be_2 & \dots & \be_N \end{bmatrix}
\]

and \(\bF\) as

\[
\bF = \begin{bmatrix} f_1 & f_2 & \dots & f_N \end{bmatrix}.
\]

This enables us to write \(\bH\) as

\[
\bH = \begin{bmatrix} \be_1 & \dots & \be_N & f_1 & \dots & f_N \end{bmatrix}.
\]

We will look for a representation of \(\bx\) using the synthesis matrix \(\bH\) as

\[
\bx = \bH \ba
\]

where \(\ba \in \CC^{2N}\).

Since \(\bH\) has \(2N\) columns and only \(N\) rows, this representation is under-determined; and since \(\ColSpace (\bH ) = \CC^N\), there are always infinitely many possible representations of \(\bx\) in \(\bH\).

We would prefer to choose the sparsest representation, which can be stated as an optimization problem:

\[
\underset{\ba \in \CC^{2N}}{\text{minimize}} \; \| \ba \|_0
\quad \text{subject to} \quad \bx = \bH \ba.
\]
(Sparse representation using Dirac Fourier Basis)

Let \(N=4\). Then the Dirac Fourier basis is

Let

A sparse representation of \(\bx\) in \(\bH\) is

This representation is \(2\)-sparse.

Thus we see that the Dirac Fourier basis is able to provide a sparser representation of a linear combination of impulses and sinusoids than either of the individual orthonormal bases (the Dirac basis and the Fourier basis).

This gives us motivation to consider such combinations of bases which help us provide sparse representations for a larger class of signals. This is the objective of the next section.

In due course we will revisit the Dirac Fourier basis further in several examples.
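The construction can be sketched in code: a mixture of one impulse and one complex sinusoid is dense in either basis alone, yet 2-sparse in \(\bH\) (the positions and amplitudes below are arbitrary choices for illustration):

```python
import numpy as np

N = 16
I = np.eye(N)
F = np.fft.ifft(np.eye(N), norm="ortho")  # unitary Fourier synthesis matrix
H = np.hstack([I, F])                     # Dirac-Fourier dictionary, N x 2N

a = np.zeros(2 * N, dtype=complex)
a[5] = 1.0        # one impulse (Dirac part)
a[N + 3] = 2.0    # one sinusoid (Fourier part)
x = H @ a         # the mixture signal

dirac_nonzeros = np.count_nonzero(np.abs(x) > 1e-10)
fourier_nonzeros = np.count_nonzero(np.abs(F.conj().T @ x) > 1e-10)
# x is dense in both orthonormal bases individually, yet only 2-sparse in H
```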

## 18.4.6. Two-Ortho Basis

Before we leave this section, let us define general two-ortho bases.

(Two ortho basis)

Let \(\Psi\) and \(\Chi\) be any two \(\CC^{N \times N}\) matrices whose column vectors individually form orthonormal bases for the complex vector space \(\CC^N\).

We define

\[
\Eta = \begin{bmatrix} \Psi & \Chi \end{bmatrix}. \tag{18.16}
\]
The columns of \(\Eta\) form a *two-ortho basis* for \(\CC^N\).

Clearly columns of \(\Eta\) span \(\CC^N\).

(Coherence of a two ortho basis)

The *coherence* of a two ortho basis is defined as

\[
\mu(\Eta) = \max_{1 \leq i, j \leq N} | \langle \psi_i, \chi_j \rangle | = \mu(\Psi, \Chi).
\]

It is the magnitude of the largest inner product between a column of \(\Psi\) and a column of \(\Chi\).

We present a very interesting result about the null space of \(\Eta\).

(Denseness of vectors in null space of two ortho bases)

For a two ortho basis \(\Eta = \begin{bmatrix}\Psi & \Chi \end{bmatrix}\) with low coherence, the non-zero vectors in the null space of \(\Eta\) are not sparse.

Concretely,

\[
\bv \in \NullSpace(\Eta), \; \bv \neq \bzero \implies \| \bv \|_0 \geq \frac{2}{\mu(\Eta)}.
\]
Proof. Let \(\bv \in \NullSpace(\Eta)\) be some non-zero vector.

Now let \(\bv_{\psi}\) and \(\bv_{\chi}\) be the first \(N\) and last \(N\) entries of \(\bv\) respectively. Then

\[
\Eta \bv = \Psi \bv_{\psi} + \Chi \bv_{\chi} = \bzero
\implies \by = \Psi \bv_{\psi} = - \Chi \bv_{\chi}.
\]
If \(\by\) were \(\bzero\), then both \(\bv_{\psi}\) and \(\bv_{\chi}\) would have to be \(\bzero\) which is not the case since by assumption \(\bv \neq \bzero\).

We note that \(\bv_{\psi}\) and \(-\bv_{\chi}\) are representations of same vector \(\by\) in two different orthonormal bases \(\Psi\) and \(\Chi\) respectively, and \(\| -\bv_{\chi}\|_0 = \| \bv_{\chi} \|_0\).

Applying Theorem 18.5 we have

\[
\| \bv \|_0 = \| \bv_{\psi} \|_0 + \| \bv_{\chi} \|_0 \geq \frac{2}{\mu(\Eta)}.
\]
We also note that since the orthonormal bases preserve norms, we have

\[
\| \bv_{\psi} \|_2 = \| \by \|_2 = \| \bv_{\chi} \|_2
\implies \| \bv_{\psi} \|_2^2 = \| \bv_{\chi} \|_2^2 = \frac{1}{2} \| \bv \|_2^2.
\]
This shows us that the energy of a null space vector is evenly distributed in the two components corresponding to each orthonormal basis.
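The construction in the proof can be sketched directly: for any \(\by\), stacking its two representations with opposite signs gives a null-space vector whose energy splits evenly:

```python
import numpy as np

N = 16
Psi = np.eye(N)                             # Dirac basis
Chi = np.fft.ifft(np.eye(N), norm="ortho")  # Fourier basis
Eta = np.hstack([Psi, Chi])

rng = np.random.default_rng(1)
y = rng.standard_normal(N)

# v = [Psi^H y ; -Chi^H y] always lies in the null space of Eta
v = np.concatenate([Psi.conj().T @ y, -(Chi.conj().T @ y)])

# Eta @ v = y - y = 0, and each half of v carries half the energy
```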

Challenge

For a two-ortho basis \(\Eta = \begin{bmatrix}\Psi & \Chi \end{bmatrix}\) is it possible to find a null space vector \(\bv\) satisfying the lower bound

Let \(\bx \in \CC^N\) be any arbitrary signal. A representation of \(\bx\) in \(\Eta\) is a vector \(\ba \in \CC^{2N}\) such that

\[
\bx = \Eta \ba.
\]
Obviously for every \(\bz \in \NullSpace(\Eta)\), the vector \(\ba + \bz\) is also a representation of \(\bx\).

What we are particularly interested in are sparse representations of \(\bx\). A key concern for us is to ensure that a sparse representation of \(\bx\) in \(\Eta\) is unique. Under what conditions is this possible?

Formally, let \(\ba\) and \(\bb\) be two different representations of \(\bx\) in \(\Eta\). Can we say that if \(\ba\) is sparse then \(\bb\) won’t be sparse?

This is established in the next uncertainty principle.

(Uncertainty principle for representations in two ortho basis)

Let \(\bx \in \CC^N\) be any signal and let \(\Eta\) be a two ortho basis as defined in (18.16). Let \(\ba\) and \(\bb\) be two distinct representations of \(\bx\) in \(\Eta\); i.e.

\[
\bx = \Eta \ba = \Eta \bb, \quad \ba \neq \bb.
\]

Then the following holds:

\[
\| \ba \|_0 + \| \bb \|_0 \geq \frac{2}{\mu(\Eta)}.
\]

This is an uncertainty principle for the sparsity of distinct representations in two ortho basis.

Proof. Let

\[
\be = \ba - \bb
\]

be the difference vector of the representations of \(\bx\). Clearly

\[
\Eta \be = \Eta \ba - \Eta \bb = \bx - \bx = \bzero.
\]

Thus \(\be \in \NullSpace(\Eta)\). Applying Theorem 18.6, we get

\[
\| \be \|_0 \geq \frac{2}{\mu(\Eta)}.
\]

But since \(\be = \ba - \bb\), hence we have

\[
\| \ba \|_0 + \| \bb \|_0 \geq \| \ba - \bb \|_0 = \| \be \|_0 \geq \frac{2}{\mu(\Eta)}.
\]

Challenge

For a two-ortho basis \(\Eta = \begin{bmatrix}\Psi & \Chi \end{bmatrix}\) is it possible to find a vector \(\bx\) with two alternative representations \(\ba\) and \(\bb\) satisfying the lower bound

This theorem suggests that if \(\mu(\Eta)\) is small (i.e. the coherence between the two orthonormal bases is small), then two representations of \(\bx\) cannot be simultaneously sparse.

Rather, if a representation is sufficiently sparse, then all other representations of \(\bx\) are guaranteed to be non-sparse, giving us uniqueness of the sparse representation.

This is stated formally in the following *uniqueness* theorem.

(Uniqueness of sparse representation in two ortho basis)

If a representation of \(\bx\) in the two ortho basis \(\Eta = \begin{bmatrix}\Psi & \Chi \end{bmatrix}\) has fewer than \(\frac{1}{\mu(\Eta)}\) non-zero entries, then it is necessarily the sparsest one possible, and any other representation must be denser.

Proof. Let \(\ba\) be a candidate representation with

\[
\| \ba \|_0 < \frac{1}{\mu(\Eta)}.
\]

Let \(\bb\) be any other candidate representation. By applying Theorem 18.7 we have

\[
\| \ba \|_0 + \| \bb \|_0 \geq \frac{2}{\mu(\Eta)}.
\]

This gives us

\[
\| \bb \|_0 \geq \frac{2}{\mu(\Eta)} - \| \ba \|_0 > \frac{2}{\mu(\Eta)} - \frac{1}{\mu(\Eta)} = \frac{1}{\mu(\Eta)}.
\]

Thus we find that

\[
\| \bb \|_0 > \frac{1}{\mu(\Eta)} > \| \ba \|_0,
\]

which is true for every representation \(\bb\) of \(\bx\) in \(\Eta\) other than \(\ba\).

Hence \(\ba\) is the sparsest possible representation of \(\bx\) in \(\Eta\).

We note here that an arbitrary choice of two bases may not be helpful in coming up with a two ortho basis that provides sparse representations for our signals of interest. In the next few sections, we will explore this issue further in the more general context of signal dictionaries.

Challenge

Clearly, there are signals for which a sufficiently sparse (and unique) representation doesn’t exist in a given two-ortho basis. What kind of relationships may exist between the different (insufficiently) sparse representations of such signals?

## 18.4.7. Dirac-DCT Basis

The Dirac Fourier two-ortho basis is optimal in the sense that it has the smallest possible mutual coherence between its two constituent bases. However, one problem with the Dirac Fourier basis is that it requires us to work with complex numbers. A close second is the Dirac DCT basis, which consists of the Dirac basis and the DCT basis, both of which are real.

(Dirac DCT basis)

The Dirac-DCT basis is a two-ortho basis consisting of the union of the Dirac and the DCT bases.

This two ortho basis is suitable for real signals since both the Dirac and DCT bases are entirely real, i.e. they belong to \(\RR^{N \times N}\).

The two ortho basis is obtained by combining the \(N \times N\) identity matrix (Dirac basis) with the \(N \times N\) DCT matrix for signals in \(\RR^N\).

Let \(\Psi_{\text{DCT}, N}\) denote the DCT matrix for \(\RR^N\). Let \(\bI_N\) denote the identity matrix for \(\RR^N\). Then

\[
\bDDD_{\text{DCT}} = \begin{bmatrix} \bI_N & \Psi_{\text{DCT}, N} \end{bmatrix}.
\]
The \(k\)-th column of \(\Psi_{\text{DCT}, N}\), denoted \(\psi_k\), is given by

\[
\psi_k(n) = \sqrt{\frac{2}{N}} \, \Omega_k \cos \left( \frac{\pi (2 n - 1)(k - 1)}{2 N} \right), \quad 1 \leq n \leq N,
\]

with \(\Omega_k = \frac{1}{\sqrt{2}}\) for \(k=1\) and \(\Omega_k = 1\) for \(2 \leq k \leq N\).

Note that for \(k=1\), the entries become

\[
\psi_1(n) = \sqrt{\frac{2}{N}} \cdot \frac{1}{\sqrt{2}} = \frac{1}{\sqrt{N}}, \quad 1 \leq n \leq N.
\]

Thus, the \(\ell_2\) norm of \(\psi_1\) is 1. We can similarly verify that the \(\ell_2\) norms of the other columns are also 1.

### 18.4.7.1. Example

We show how a mixture signal consisting of impulses and sinusoids has a sparse representation in a Dirac DCT basis.

### 18.4.7.2. Coherence

The Dirac-DCT basis has a mutual coherence of \(\sqrt{\frac{2}{N}}\).

Proof. The mutual coherence of a two ortho basis in which one of the bases is the Dirac basis is given by the magnitude of the largest entry in the other basis.

For \(\Psi_{\text{DCT}, N}\), the largest value is obtained when \(\Omega_k = 1\) and the \(\cos\) term evaluates to 1.

Clearly,

\[ \mu (\bDDD_{\text{DCT}}) = \sqrt{\frac{2}{N}}. \]
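A sketch that builds \(\Psi_{\text{DCT}, N}\) from the column formula above, verifies orthonormality, and checks the coherence against \(\sqrt{2/N}\) (for finite \(N\) the largest \(\cos\) term sits slightly below 1, so the test allows a small margin):

```python
import numpy as np

def dct_basis(N):
    """Orthonormal DCT matrix whose k-th column follows the formula above (1-based k, n)."""
    n = np.arange(1, N + 1).reshape(-1, 1)   # row (time) index
    k = np.arange(1, N + 1).reshape(1, -1)   # column (frequency) index
    Omega = np.where(k == 1, 1 / np.sqrt(2), 1.0)
    return np.sqrt(2 / N) * Omega * np.cos(np.pi * (2 * n - 1) * (k - 1) / (2 * N))

N = 32
Psi_dct = dct_basis(N)
D = np.hstack([np.eye(N), Psi_dct])   # Dirac-DCT two-ortho basis
mu = np.max(np.abs(Psi_dct))          # coherence: largest entry of the DCT block
```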

### 18.4.7.3. Construction of Sparse Representation

We use sparse recovery/reconstruction algorithms to construct a sparse representation of a given signal in a two ortho basis. The algorithms will be discussed in later chapters. We give the time domain signal (e.g. the mixture signal above) and the two ortho basis to a sparse reconstruction algorithm. The algorithm attempts to construct the sparsest possible representation of the signal in the given two ortho basis. We reconstructed this mixture signal in the Dirac DCT basis using three different algorithms:

- Basis Pursuit
- Matching Pursuit
- Orthogonal Matching Pursuit