20.2. Basis Pursuit#

20.2.1. Introduction#

We recall the following sparse approximation problems. Given a signal $x \in \mathbb{C}^N$ which is known to have a sparse representation in a dictionary $\mathcal{D}$, the exact-sparse recovery problem is:

$$
\hat{a} = \underset{a \in \mathbb{C}^D}{\arg\min}\; \|a\|_0 \quad \text{subject to } x = \mathcal{D} a. \tag{20.4}
$$

When $x \in \mathbb{C}^N$ does not have a sparse representation in $\mathcal{D}$, a $K$-sparse approximation of $x$ in $\mathcal{D}$ can be obtained by solving the following problem:

$$
\hat{a} = \underset{a \in \mathbb{C}^D}{\arg\min}\; \|x - \mathcal{D} a\|_2 \quad \text{subject to } \|a\|_0 \leq K. \tag{20.5}
$$

Here x is modeled as x=Da+e where a denotes a sparse representation of x and e denotes the approximation error.

A different way to formulate the approximation problem is to provide an upper bound $\|e\|_2 \leq \epsilon$ on the acceptable approximation error and to seek the sparsest possible representation within this error bound:

$$
\hat{a} = \underset{a \in \mathbb{C}^D}{\arg\min}\; \|a\|_0 \quad \text{subject to } \|x - \mathcal{D} a\|_2 \leq \epsilon. \tag{20.6}
$$

Basis Pursuit (BP) [23] suggests the convex relaxation of (20.4) obtained by replacing the $\ell_0$-“norm” with the $\ell_1$-norm:

$$
\hat{a} = \underset{a \in \mathbb{C}^D}{\arg\min}\; \|a\|_1 \quad \text{subject to } x = \mathcal{D} a. \tag{20.7}
$$

For real signals, it can be implemented as a linear program. For complex signals, it can be implemented as a second order cone program.
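As a concrete illustration, here is a minimal sketch of solving (20.7) numerically. It assumes the `numpy` and `cvxpy` packages and a random dictionary (none of these choices come from the text); CVXPY reformulates the problem as an LP or SOCP internally.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
N, D, K = 32, 64, 4                      # signal length, number of atoms, sparsity

Dict = rng.standard_normal((N, D))       # random dictionary with unit-norm atoms
Dict /= np.linalg.norm(Dict, axis=0)

a_true = np.zeros(D)                     # synthesize a K-sparse representation
a_true[rng.choice(D, K, replace=False)] = rng.standard_normal(K)
x = Dict @ a_true

a = cp.Variable(D)                       # Basis Pursuit: min ||a||_1 s.t. x = D a
cp.Problem(cp.Minimize(cp.norm(a, 1)), [Dict @ a == x]).solve()
print("recovery error:", np.linalg.norm(a.value - a_true))
```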

In the presence of approximation error as in (20.6), where $x = \mathcal{D} a + e$ with $a$ being a $K$-sparse approximate representation of $x$ in $\mathcal{D}$, we can formulate the corresponding $\ell_1$-minimization problem as:

$$
\hat{a} = \underset{a \in \mathbb{C}^D}{\arg\min}\; \|a\|_1 \quad \text{subject to } \|x - \mathcal{D} a\|_2 \leq \epsilon \tag{20.8}
$$

where $\epsilon \geq \|e\|_2$ provides an upper bound on the approximation error. This version is known as basis pursuit with inequality constraints (BPIC).

The corresponding unconstrained problem, obtained by moving the constraint into the objective with a Lagrange multiplier, is

$$
\hat{a} = \underset{a \in \mathbb{C}^D}{\arg\min}\; \|a\|_1 + \lambda\|x - \mathcal{D} a\|_2^2. \tag{20.9}
$$

This is known as basis pursuit denoising (BPDN). With an appropriate choice of $\lambda$, the two problems BPIC and BPDN are equivalent. This formulation minimizes the $\ell_1$-norm subject to a penalty term on the approximation error. The Lagrangian constant $\lambda$ controls how heavily the approximation error is penalized.

Note that the constraint $\|x - \mathcal{D} a\|_2 \leq \epsilon$ is equivalent to $\|x - \mathcal{D} a\|_2^2 \leq \epsilon^2$. We have used the squared version to construct the BPDN problem since the term $\|x - \mathcal{D} a\|_2^2$ is easier to differentiate and work with.

Efficient solvers are available for the BP, BPIC, and BPDN problems using convex optimization techniques. They typically run in polynomial time but rely on fairly sophisticated algorithms. The good part is the guarantee that a global minimizer can be found (since the problems are convex). The not-so-good part is that convex optimization methods are still quite computationally intensive.

An alternative formulation of BPDN is as follows.

$$
\hat{a} = \underset{a \in \mathbb{C}^D}{\arg\min}\; \frac{1}{2}\|x - \mathcal{D} a\|_2^2 + \gamma\|a\|_1. \tag{20.10}
$$

The difference between the two formulations is essentially which term carries the Lagrangian constant ($\lambda$ or $\gamma$). By choosing $\lambda = 1/(2\gamma)$, the two formulations become identical up to a scale factor in the objective function. This formulation minimizes the approximation error subject to an $\ell_1$-norm penalty. Thus, the two formulations differ with respect to which term is minimized and which term is treated as the penalty.
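The penalized form (20.10) is equally easy to prototype. The sketch below (again assuming `cvxpy`; the helper name `bpdn` is ours) solves it for a given penalty $\gamma$. Larger values of $\gamma$ push the solution towards zero; smaller values emphasize data fidelity.

```python
import cvxpy as cp

def bpdn(Dict, x, gamma):
    """Solve (20.10): minimize 0.5 * ||x - D a||_2^2 + gamma * ||a||_1."""
    a = cp.Variable(Dict.shape[1])
    objective = 0.5 * cp.sum_squares(x - Dict @ a) + gamma * cp.norm(a, 1)
    cp.Problem(cp.Minimize(objective)).solve()
    return a.value
```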

Basis pursuit is not an algorithm but a principle: for most real-life problems, the solution of the $\ell_0$-minimization problem is the same as the solution of the corresponding $\ell_1$-minimization problem. Actual algorithms for solving the basis pursuit formulation of the sparse recovery problem come from the convex optimization literature.

We start our discussion with the analysis of the exact-sparse case.

As part of our theoretical analysis, we would like to explore conditions under which the problems (20.4) and (20.7) are equivalent, i.e., there exists a unique solution to both of them and the solutions coincide. Under such conditions, the NP-hard problem (20.4) can be replaced with the tractable problem (20.7), which is convex and solvable in polynomial time.

20.2.2. Two-Ortho-Case#

Further simplifying, we consider the case where the dictionary $\mathcal{D}$ is a two-ortho basis

$$
\mathcal{D} = \begin{bmatrix}\Psi & \Phi\end{bmatrix}
$$

with $\Psi$ and $\Phi$ both being orthonormal bases for $\mathbb{C}^N$.

  1. Clearly, DCN×2N and D=2N.

  2. We denote

    Ω={1,2,,2N}

    as the index set for the representation vectors a.

  3. The representation $a$ of a signal $x$ in $\mathcal{D}$ can be written as

    $$x = \mathcal{D} a = \begin{bmatrix}\Psi & \Phi\end{bmatrix}\begin{bmatrix}a^p \\ a^q\end{bmatrix} = \Psi a^p + \Phi a^q.$$

  4. We can assign

    $$k_p = \|a^p\|_0 \quad\text{and}\quad k_q = \|a^q\|_0.$$

  5. Total sparsity of $a$ is given by

    $$K = \|a\|_0 = k_p + k_q.$$

  6. Whenever $K \ll N$, we have a sparse representation.

  7. Further, let Sp{1,,N} be the support corresponding to ap part of a (i.e. Sp=supp(ap)) and Sq{1,,N} be the support corresponding to aq part of a (i.e. Sq=supp(aq)).

  8. Clearly, |Sp|=kp and |Sq|=kq.

  9. Note that Sp and Sq need not be disjoint.

  10. But, Sp and Sq+N are disjoint.

  11. In fact, supp(a)=Sp(Sq+N).

  12. 1pCN will denote the indicator vector for Sp; i.e., 1p(i)=0iSp and 1p(i)=1iSp.

  13. Similarly, 1qCN will denote the indicator vector for Sq.

  14. 1CN will denote the vector {1,,1}.

  15. Also, 1CN×N will denote a square matrix of all ones.

  16. Note that 1=11T.
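As a quick numerical aside, the following sketch (assuming `numpy`; the choice of bases is just an example) builds the two-ortho dictionary formed by the identity and the normalized DFT basis and computes its coherence, which evaluates to $1/\sqrt{N}$.

```python
import numpy as np

N = 64
Psi = np.eye(N)                               # standard basis
Phi = np.fft.fft(np.eye(N)) / np.sqrt(N)      # normalized DFT basis
Dict = np.hstack([Psi, Phi])                  # two-ortho dictionary [Psi Phi]

G = np.abs(Dict.conj().T @ Dict)              # magnitudes of pairwise inner products
np.fill_diagonal(G, 0)                        # ignore self inner products
print(G.max(), 1 / np.sqrt(N))                # coherence mu ~ 0.125 for N = 64
```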

We now state our main result for the equivalence of solutions of (20.4) and (20.7) for the two-ortho case. Going forward, we will simply use $\mu$ to refer to the coherence of $\mathcal{D}$ (i.e. $\mu(\mathcal{D})$).

Theorem 20.6

Let $\mathcal{D}$ be a two-ortho basis dictionary $\mathcal{D} = [\Psi\ \Phi]$. Let $x = \mathcal{D} a$, where $x$ is known. If a $K$-sparse representation $a$ exists with $k_p \geq k_q$ such that $(k_p, k_q)$ obey

$$
2\mu^2 k_p k_q + \mu k_p - 1 < 0; \tag{20.11}
$$

then a is the unique solution of both problems (20.4) and (20.7).

A weaker condition is: If

$$
\|a\|_0 = K = k_p + k_q < \frac{\sqrt{2} - 0.5}{\mu}; \tag{20.12}
$$

then a is a unique (K-sparse) solution to both (20.4) and (20.7).

Proof. We first show that a is a unique solution of (20.4).

  1. Towards the end of this proof, we show that (20.11) (20.12).

  2. Due to (20.12),

    $$\|a\|_0 = K = k_p + k_q < \frac{\sqrt{2} - 0.5}{\mu} = \frac{0.914}{\mu} < \frac{1}{\mu}.$$
  3. Thus, if a satisfies (20.11), then it is necessarily the sparsest possible representation of x in D due to Theorem 18.8.

  4. All other representations are denser (i.e. have more non-zero entries).

  5. Hence a is a unique solution of (20.4).

We next show that a is also a unique solution of (20.7).

  1. a is a feasible vector to (20.7) since x=Da though it need not be an optimal solution.

  2. We have to find criteria under which a is optimal and no other feasible vector b is optimal.

  3. Since a is the unique solution to (20.4), hence b0>a0 for every other feasible b for (20.7).

  4. We consider the set of alternative feasible vectors to (20.7) given by

    $$C = \{b \mid b \neq a,\ \|b\|_1 \leq \|a\|_1,\ \|b\|_0 > \|a\|_0 \text{ and } \mathcal{D}(a - b) = 0\}.$$

    This set contains all feasible vectors to (20.7) which are

    1. different from a

    2. have larger support (larger 0-“norm”)

    3. satisfy the linear system of equations x=Da

    4. have 1 norm less than or equal to a.

  5. If this set is nonempty, then there exists a solution to basis pursuit which is not same as a.

  6. If this set is empty, then the solutions of (20.4) and (20.7) coincide and are both unique.

  7. Writing e=bab=e+a, we have

    b1a1e+a1a10.
  8. Thus, we can rewrite C as

    $$C_s = \{e \mid e \neq 0,\ \|e + a\|_1 - \|a\|_1 \leq 0 \text{ and } \mathcal{D} e = 0\}.$$
  9. In order to show that C is empty, we will show that a larger set containing Cs is also empty.

  10. Essentially, we wish to consider a larger set whose emptiness can be checked easily against (20.11).

  11. If that larger set is empty due to (20.11), then C would also be empty and we would have completed the proof.

Emptiness of Cs

  1. We start by the requirement e+a1a10.

  2. Let a=[apaq] and e=[epeq] , where p and q refer to parts corresponding to the orthonormal bases Ψ and Φ respectively (as described at the beginning of this section).

  3. Note that even if ap and aq are sparse, ep and eq need not be.

  4. In fact, support of ep and eq could be very different from Sp and Sq.

  5. We can now write

    $$0 \geq \|e + a\|_1 - \|a\|_1 = \left(\sum_{i=1}^N |e^p_i + a^p_i| - |a^p_i|\right) + \left(\sum_{i=1}^N |e^q_i + a^q_i| - |a^q_i|\right) = \sum_{i \notin S_p}|e^p_i| + \sum_{i \notin S_q}|e^q_i| + \left(\sum_{i \in S_p}|e^p_i + a^p_i| - |a^p_i|\right) + \left(\sum_{i \in S_q}|e^q_i + a^q_i| - |a^q_i|\right).$$

    We are splitting the sum as follows.

    1. We first split the sum into ep,ap and eq,aq parts.

    2. We split the sum on the p part to sum over indices in Sp and indices not in Sp.

    3. We split the sum on the q part to sum over indices in Sq and indices not in Sq.

    4. For iSp, aip=0 leading to |eip+aip||aip|=|eip|.

    5. Ditto for iSq.

  6. We recall from triangle inequality on complex numbers that

    |x+y||y||x|x,yC

    which implies |x+y||y||x|.

  7. Thus,

    |eip+aip||aip||eip|iSp

    and

    |eiq+aiq||aiq||eiq|iSq.
  8. With this, the above condition can be relaxed as

    0e+a1a1iSp|eip|+iSq|eiq|iSp|eip|iSq|eiq|.
  9. Every e satisfying this inequality will also satisfy the condition e+a1a10.

  10. To simplify notation we can write

    iSp|eip|=1pT|ep| and iSq|eiq|=1qT|eq|.
  11. Then we have

    ep1=iSp|eip|+iSp|eip|iSp|eip|=ep1iSp|eip|=ep11pT|ep|.
  12. Similarly,

    iSq|eiq|=eq11qT|eq|.
  13. Thus,

    iSp|eip|+iSq|eiq|iSp|eip|iSq|eiq|=ep121pT|ep|+eq121qT|eq|.
  14. We can now define the set

    $$C_{s1} = \{e \mid e \neq 0,\ \|e^p\|_1 + \|e^q\|_1 - 2\,1_p^T|e^p| - 2\,1_q^T|e^q| \leq 0 \text{ and } \mathcal{D} e = 0\}.$$
  15. Clearly, CsCs1 and if Cs1 is empty, then Cs will also be empty.

  16. Note that this formulation of Cs1 is dependent only on the support of a and not on values in a.

  17. We now turn back to the requirement De=0 and relax it further.

  18. We note that,

    De=[ΨΦ][epeq]=Ψep+Φeq=0.
  19. Multiplying by ΨH we get

    ep+ΨHΦeq=0ep=ΨHΦeq

    since ΨHΨ=I (unitary matrix).

  20. Similarly multiplying with ΦH, we obtain

    ΦHΨep+eq=0eq=ΦHΨep.
  21. Note that entries in ΨHΦ and ΦHΨ are inner products between columns of D, hence their magnitudes are upper bounded by μ (coherence).

  22. Denote B=ΨHΦ and consider the product v=ΨHΦeq=Beq.

  23. Then

    vi=j=1NBijejq.
  24. Thus,

    |vi|=|j=1NBijejq|j=1N|Bijejq|μj=1N|ejq|=μ1T|eq|.
  25. Applying this result on ep we get,

    |ep|=|ΨHΦeq|μ1|eq|.
  26. Similarly,

    |eq|=|ΦHΨep|μ1|ep|.
  27. Note that since 1=11T, it is a rank-1 matrix.

  28. We now construct a set Cs2 as

    $$C_{s2} = \{e \mid e \neq 0,\ \|e^p\|_1 + \|e^q\|_1 - 2\,1_p^T|e^p| - 2\,1_q^T|e^q| \leq 0,\ |e^p| \preceq \mu\mathbf{1}|e^q| \text{ and } |e^q| \preceq \mu\mathbf{1}|e^p|\}.$$
  29. Clearly, Cs1Cs2 since for every eCs1, De=0eCs2.

  30. We now define fp=|ep| and fq=|eq| as the absolute value vectors.

  31. Correspondingly, let us define

    f=|e|=[fpfq].
  32. Clearly, ep1=1Tfp and eq1=1Tfq.

  33. Further fp0; i.e., every entry in fp is nonnegative.

  34. Similarly, fq0.

  35. We can then introduce a set Cf as

    $$C_f = \{f \mid f \neq 0,\ 1^Tf^p + 1^Tf^q - 2\,1_p^Tf^p - 2\,1_q^Tf^q \leq 0,\ f^p \preceq \mu\mathbf{1}f^q,\ f^q \preceq \mu\mathbf{1}f^p \text{ and } f^p \succeq 0, f^q \succeq 0\}.$$
  36. It is easy to see that if eCs2 then f=|e|Cf.

  37. Thus, if Cf is empty, then Cs2 should be empty too.

  38. We note that if fCf, then for all c>0, cfCf.

  39. Thus, in order to study (the emptiness of) Cf, it is sufficient to study unit 1-norm vectors fCf; i.e., there is no unit 1-norm vector in Cf, if and only if Cf is empty.

  40. Now

    f1=1Tf=1Tfp+1Tfq

    since f0.

  41. This leads to:

    f1=11Tfp+1Tfq=1.
  42. We construct the new set of unit 1-norm vectors

    $$C_r = \{f \mid f \neq 0,\ 1 - 2\,1_p^Tf^p - 2\,1_q^Tf^q \leq 0,\ f^p \preceq \mu\mathbf{1}f^q,\ f^q \preceq \mu\mathbf{1}f^p,\ 1^Tf^p + 1^Tf^q = 1 \text{ and } f^p \succeq 0, f^q \succeq 0\}.$$
  43. We have Cr=Cf=.

  44. Note that the constraint 121pTfp21qTfq0 can be rewritten as

    1pTfp+1qTfq12.
  45. The set Cr is much easier to analyze since

    • It has no explicit dependency on $\mathcal{D}$. The dictionary is represented only by a single parameter, its coherence $\mu$.

    • All constraints are simple linear constraints. Thus finding the elements of Cf can be formulated as a linear programming (feasibility) problem.

    • The order of non-zero entries inside fp and fq doesn’t have any influence on the requirements for f to belong to Cr. Thus, without loss of generality, we can focus on vectors for which the first kp entries are non-zero in fp and first kq entries are non-zero in fq respectively.

  46. We next verify that Cr must be empty under (20.11).

Emptiness of Cr

  1. In order to find vectors in Cr, we can solve the following linear program.

    $$\begin{aligned}\underset{f^p,\, f^q}{\text{maximize }} &\; 1_p^Tf^p + 1_q^Tf^q\\ \text{subject to } &\; f^p \preceq \mu\mathbf{1}f^q\\ &\; f^q \preceq \mu\mathbf{1}f^p\\ &\; 1^T(f^p + f^q) = 1\\ &\; f^p \succeq 0,\ f^q \succeq 0.\end{aligned} \tag{20.13}$$
  2. f=0 is a feasible vector for this linear program, hence a solution does exist for this program.

  3. What is interesting is the value of the objective function for the optimal solution.

  4. Let fp,fq be (an) optimal solution for this linear program.

  5. If 1pTfp+1qTfq12, then f satisfies all the requirements of Cr and Cr is indeed not empty.

  6. This doesn’t guarantee that C will also be non-empty though.

  7. On the contrary, if 1pTfp+1qTfq<12, then Cr is indeed empty (as one of the requirements cannot be met).

  8. Hence Cf is also empty leading to CCf being empty too.

  9. Thus, a condition which leads to

    1pTfp+1qTfq<12

    is a sufficient condition for equivalence of (20.4) and (20.7).

  10. Consider a feasible f for (20.13).

  11. Let fp1=1Tfp=c.

  12. Since 1T(fp+fq)=1, hence fq1=1Tfq=1c.

  13. We note that

    1fp=11Tfp=fp11=c1.
  14. Similarly,

    1fq=(1c)1.
  15. Thus, the first two constraints change into

    fp(1c)μ1fqcμ1.
  16. Since the objective is to maximize 1pTfp+1qTfq, it is natural to maximize non-zero entries in fp and fq corresponding to Sp and Sq.

  17. A straight-forward option is to choose first kp entries in fp to be (1c)μ and first kq entries in fq to be cμ.

  18. Other entries can be chosen arbitrarily to meet the requirement that 1T(fp+fq)=1.

  19. With this choice, we have

    $$1_p^Tf^p + 1_q^Tf^q = k_p(1-c)\mu + k_q c\mu = \mu\bigl(k_p - c(k_p - k_q)\bigr).$$
  20. We recall that we have chosen $k_p \geq k_q$.

  21. Thus, the expression is maximized if c is chosen to be as small as possible.

  22. The choice of c must meet following conditions on 1-norms. (Basically the sum of first kp terms of fp must not be more than the 1 norm of fp. Same for fq).

    fp1=1Tfp=ckp(1c)μfq1=1Tfq=1ckqcμ.
  23. Simplifying these inequalities we get

    $$c \geq k_p(1-c)\mu \iff c \geq \frac{k_p\mu}{1 + k_p\mu} \qquad 1 - c \geq k_q c\mu \iff c \leq \frac{1}{1 + k_q\mu}.$$
  24. Since these two conditions must be satisfied, hence we require kp,kq to meet

    $$\frac{k_p\mu}{1 + k_p\mu} \leq \frac{1}{1 + k_q\mu} \iff k_p k_q \leq \frac{1}{\mu^2}.$$
  25. We will verify later that this condition is met if (20.11) holds.

  26. Assuming the condition is met, obviously the smallest possible value of c is given by kpμ1+kpμ.

  27. The maximum value of objective function then becomes

    $$1_p^Tf^p + 1_q^Tf^q = \mu\bigl(k_p - c(k_p - k_q)\bigr) = \mu\left(k_p - \frac{k_p\mu}{1 + k_p\mu}(k_p - k_q)\right) = \frac{k_p\mu + k_pk_q\mu^2}{1 + k_p\mu}.$$
  28. Finally, for BP to succeed, we require this expression to be strictly less than half.

  29. This gives us

    $$\frac{k_p\mu + k_pk_q\mu^2}{1 + k_p\mu} < \frac{1}{2} \iff 2k_pk_q\mu^2 + k_p\mu - 1 < 0$$

    which is the sufficient condition for BP to succeed in the theorem.

We now show that (20.11) the weaker condition (20.12).

  1. From (20.11) we can write kq as

    2kpkqμ2+kpμ1<02kpkqμ2<1kpμkq<1kpμ2kpμ2.
  2. Thus,

    $$\|a\|_0 = k_p + k_q < k_p + \frac{1 - k_p\mu}{2k_p\mu^2} = \frac{2\mu^2k_p^2 + 1 - \mu k_p}{2\mu^2 k_p} = \frac{1}{\mu}\,\frac{2\mu^2k_p^2 + 1 - \mu k_p}{2\mu k_p}.$$
  3. We define u=μkp and rewrite above as

    $$\|a\|_0 < \frac{1}{\mu}\,\frac{2u^2 - u + 1}{2u}.$$
  4. The weaker condition can now be obtained by minimizing the upper bound on R.H.S. of this equation.

  5. We define

    $$f(u) = \frac{2u^2 - u + 1}{2u}.$$
  6. Differentiating and equating with 0, we get

    $$f'(u) = \frac{2u^2 - 1}{2u^2} = 0.$$
  7. The optimal value is obtained when $u = \pm\sqrt{0.5}$.

  8. Since both $\mu$ and $k_p$ are positive quantities, the negative value for $u$ is rejected and we get $u = \sqrt{0.5}$.

  9. This gives us

    $$\|a\|_0 < \frac{1}{\mu}\,\frac{2\cdot 0.5 - \sqrt{0.5} + 1}{2\sqrt{0.5}} = \frac{\sqrt{2} - 0.5}{\mu}.$$
  10. Lastly, the property that arithmetic mean is greater than or equal to geometric mean gives us

    $$k_pk_q \leq \frac{(k_p + k_q)^2}{4} < \frac{(\sqrt{2} - 0.5)^2}{4\mu^2} < \frac{1}{\mu^2}.$$

20.2.3. General Case#

We now consider the case where $\mathcal{D} \in \mathbb{C}^{N \times D}$ is an arbitrary (redundant) dictionary. We will require that $\mathcal{D}$ has full row rank. If $\mathcal{D}$ does not have full row rank, then the signals of interest live in the column span of $\mathcal{D}$, and we can work within that subspace, where $\mathcal{D}$ does have full row rank.

We develop sufficient conditions under which solutions of (20.4) and (20.7) match for the general case [28, 30].

Theorem 20.7

Let D be an arbitrary full rank redundant dictionary. Let x=Da, where x is known. If a sparse representation a exists obeying

$$
\|a\|_0 < \frac{1}{2}\left(1 + \frac{1}{\mu}\right), \tag{20.14}
$$

then a is the unique solution of both (20.4) and (20.7).

Proof. Due to Theorem 18.27, a is a unique solution for (20.4) since a satisfies (20.14). We need to show that it is also a unique solution to (20.7)

  1. For any other feasible b for (20.7), we have b0>a0 since it is unique sparsest solution of x=Da.

  2. We start with defining a set of alternative feasible vectors to (20.7):

    $$C = \{b \mid b \neq a,\ \|b\|_1 \leq \|a\|_1,\ \|b\|_0 > \|a\|_0 \text{ and } \mathcal{D}(b - a) = 0\}.$$

    This set contains all possible representations that

    1. are different from a

    2. have larger support

    3. satisfy Db=x

    4. have a better (or at least as good) 1-norm.

  3. We need to show that if (20.14) holds, then the set C will be empty.

  4. Otherwise, BP would choose a solution different than a.

  5. The condition b0>a0 is redundant under the assumption (20.14).

  6. Following the proof of Theorem 20.6, we define

    e=ba.
  7. We can then rewrite C as

    $$C_s = \{e \mid e \neq 0,\ \|e + a\|_1 - \|a\|_1 \leq 0, \text{ and } \mathcal{D} e = 0\}.$$
  8. Again, we will enlarge the set Cs and show that even the larger set is empty when (20.14) holds.

  9. We start with the requirement e+a1a10.

  10. A simple permutation of columns of D can bring the nonzero entries in a to the beginning.

  11. Thus, without loss of generality, we assume that first K entries in a are nonzero and the rest are zero.

  12. We can now rewrite the requirement as

    e+a1a1=j=1K(|ej+aj||aj|)+j>K|ej|0.
  13. Using the inequality |x+y||y||x|, we can relax above condition as

    j=1K|ej|+j>K|ej|0.
  14. Let 1K denote a vector with K ones at the beginning and rest zeros.

  15. Then,

    j=1K|ej|=1KT|e|.
  16. Further,

    j>K|ej|=e1j=1K|ej|=1T|e|1KT|e|.
  17. Thus, we can rewrite above inequality as

    1T|e|21KT|e|0.
  18. We can now define

    $$C_{s1} = \{e \mid e \neq 0,\ 1^T|e| - 2\,1_K^T|e| \leq 0, \text{ and } \mathcal{D} e = 0\}.$$
  19. Clearly CsCs1.

  20. We will now relax the requirement of De=0.

  21. Multiplying by DH, we get

    DHDe=0.
  22. If eCs1, it will also satisfy this equation.

  23. Moreover, if e satisfies this, then e belongs to the null space of DHD.

  24. Since D is full rank, hence e has to be in the null space of D also.

  25. Thus the two conditions De=0 and DHDe=0 are equivalent.

  26. We note that off-diagonal entries in DHD are bounded by μ while the main diagonal consists of all ones.

  27. So, we can write

    $$\mathcal{D}^H\mathcal{D} e = 0 \iff (\mathcal{D}^H\mathcal{D} - I + I)e = 0 \iff e = -(\mathcal{D}^H\mathcal{D} - I)e.$$
  28. Suppose v=Gu. Then vi=jGijuj.

  29. Thus

    |vi|=|jGijuj|j|Gijuj|=j|Gij||uj|.
  30. This gives us $|v| \preceq |G|\,|u|$ where $\preceq$ indicates component-wise inequality.

  31. Taking an entry-wise absolute value on both sides, we get

    $$|e| = |(\mathcal{D}^H\mathcal{D} - I)e| \preceq |\mathcal{D}^H\mathcal{D} - I|\,|e| \preceq \mu(\mathbf{1} - I)|e|.$$
  32. The last part is due to the fact that all entries in the vector |e| and the matrix |DHDI| are non-negative and the entries in |DHDI| are dominated by μ.

  33. Further,

    $$|e| \preceq \mu(\mathbf{1} - I)|e| \iff (1 + \mu)|e| \preceq \mu\mathbf{1}|e| = \mu\|e\|_1 1 \iff |e| \preceq \frac{\mu\|e\|_1}{1 + \mu}1.$$
  34. In the above we used the fact that 1|e|=11T|e|=1e1.

  35. We can now define a new set

    $$C_{s2} = \left\{e \mid e \neq 0,\ 1^T|e| - 2\,1_K^T|e| \leq 0 \text{ and } |e| \preceq \frac{\mu\|e\|_1}{1+\mu}1\right\}.$$
  36. Clearly, Cs1Cs2.

  37. We note that Cs2 is unbounded since if eCs2, then ceCs2c0.

  38. Thus, in order to study its behavior, it is sufficient to consider the set of vectors with unit norm vectors e1=1.

  39. We construct the new set as

    $$C_r = \left\{e \mid \|e\|_1 = 1,\ 1 - 2\,1_K^T|e| \leq 0 \text{ and } |e| \preceq \frac{\mu}{1+\mu}1\right\}.$$
  40. Note that we substituted $1^T|e| = \|e\|_1 = 1$ in formulating the description of $C_r$, and the condition $e \neq 0$ is automatically enforced since $\|e\|_1 = 1$.

  41. Clearly Cs2=Cr=.

  42. In order to satisfy the requirement 121KT|e|0, we need to have 1KT|e| as large as possible.

  43. Since this quantity only considers first K entries in e, hence the energy in e should be concentrated inside the first K entries to maximize this quantity.

  44. However, entries in e are restricted by the third requirement in Cr.

  45. We can maximize it by choosing

    |ej|=μ1+μ

    for first K entries in e.

  46. We then get

    121KT|e|=12Kμ1+μ0.
  47. This gives us

    $$1 - \frac{2K\mu}{1+\mu} \leq 0 \iff 1 + \mu \leq 2K\mu \iff 2K \geq \frac{1+\mu}{\mu} \iff K \geq \frac{1}{2}\left(1 + \frac{1}{\mu}\right).$$
  48. This is a necessary condition for Cr to be non-empty.

  49. Thus, if

    K<12(1+1μ)

    then, the requirement 121KT|e|0 is not satisfied and Cr is empty.

  50. Consequently, C is empty and the theorem is proved.
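To get a feel for the numbers involved in Theorem 20.7, the following sketch (assuming `numpy`; the dictionary size is an arbitrary choice) computes the coherence of a dictionary and the resulting sufficient sparsity level $\frac{1}{2}(1 + 1/\mu)$. For a random Gaussian dictionary of moderate size the bound is typically very small, which illustrates how conservative worst-case coherence bounds are.

```python
import numpy as np

def coherence(Dict):
    """Coherence mu(D): largest |<d_i, d_j>| over distinct unit-norm atoms."""
    Dn = Dict / np.linalg.norm(Dict, axis=0)
    G = np.abs(Dn.conj().T @ Dn)
    np.fill_diagonal(G, 0)
    return G.max()

rng = np.random.default_rng(42)
Dict = rng.standard_normal((64, 128))
mu = coherence(Dict)
print(f"mu = {mu:.3f}, BP guarantee holds for K < {0.5 * (1 + 1 / mu):.2f}")
```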

We present another result which is based on the $\mu_{1/2}(G)$ measure (Definition 18.33) of the Gram matrix of the dictionary. This result is due to [28].

Theorem 20.8

Let $x = \mathcal{D} a$ and $\|a\|_0 < \mu_{1/2}(G)$; then $a$ is the unique solution of both (20.4) and (20.7).

Proof. Unique solution of (20.4).

  1. Let Λ=supp(a) and K=|Λ|.

  2. We have K=a0<μ1/2(G).

  3. By Theorem 18.73

    $$\mathrm{spark}(\mathcal{D}) \geq 2\mu_{1/2}(G) + 1 \implies \mathrm{spark}(\mathcal{D}) > 2K + 1 \implies K < \frac{1}{2}\mathrm{spark}(\mathcal{D}).$$
  4. By Theorem 18.23, a is the unique sparsest solution.

Unique solution of (20.7)

  1. We need to show that for any other $a'$ satisfying $x = \mathcal{D} a'$, we must have $\|a'\|_1 > \|a\|_1$.

  2. We will show that any vector in the null space of D exhibits less than 50% concentration on Λ; i.e., for every hN(D)

    kΛ|hk|<12h1.
  3. Now

    Dh=0Gh=DHDh=0.
  4. Subtracting both sides with h we get

    Ghh=(GI)h=h.
  5. Let F denote an K×D matrix formed from the rows of GI corresponding to the indices in Λ.

  6. Then

    (GI)h=hFh1=kΛ|hk|.

    hk for some kΛ is the negative of the inner product of some row in F with h.

  7. We know that

    Fh1F1h1

    where F1 is the max-column-sum norm of F.

  8. This gives us

    F1h1kΛ|hk|.
  9. In any column of F the number of entries is K.

  10. One of them is 0 (corresponding to the diagonal entry in G).

  11. Thus, leaving it out, the remaining entries number $K - 1$.

  12. By assumption μ1/2(G)>K.

  13. Thus the magnitudes of any set of at most $K$ such entries in a column cannot sum to $\frac{1}{2}$ or more.

  14. This gives an upper bound on the max-column-sum of F; i.e.,

    F1<12.
  15. Thus, we get

    kΛ|hk|F1h1<12h1

    for every hN(D).

  16. The rest follows from the fact that for any other $a'$ such that $x = \mathcal{D} a = \mathcal{D} a'$, we have

    $$\|a'\|_1 > \|a\|_1$$

    whenever

    $$\sum_{k \in \Lambda}|h_k| < \frac{1}{2}\|h\|_1$$

    where $h = a' - a$ (thus $\mathcal{D} h = 0$). Indeed, $\|a'\|_1 = \|a + h\|_1 \geq \|a\|_1 - \sum_{k\in\Lambda}|h_k| + \sum_{k\notin\Lambda}|h_k| = \|a\|_1 + \|h\|_1 - 2\sum_{k\in\Lambda}|h_k| > \|a\|_1$.

20.2.4. BPIC#

In this subsection, we present a stability guarantee for BPIC.

Theorem 20.9

Consider an instance of the (20.8) problem defined by the triplet $(\mathcal{D}, x, \epsilon)$. Suppose that a vector $a \in \mathbb{C}^D$ is a feasible solution to (20.8) satisfying the sparsity constraint

$$
\|a\|_0 < \frac{1}{4}\left(1 + \frac{1}{\mu(\mathcal{D})}\right).
$$

The solution $\hat{a}$ of (20.8) must satisfy

$$
\|\hat{a} - a\|_2^2 \leq \frac{4\epsilon^2}{1 - \mu(\mathcal{D})(4\|a\|_0 - 1)}.
$$

Proof. As before, we define b=a^a.

  1. Then

    Db2=D(a^a)2=Da^x+xDa22ϵ.
  2. We now rewrite the inequality in terms of the Gram matrix G=DHD.

    $$4\epsilon^2 \geq \|\mathcal{D} b\|_2^2 = b^HGb = b^H(G - I + I)b = \|b\|_2^2 + b^H(G - I)b.$$
  3. It is easy to show that:

    |b|T|A||b|bHAb|b|T|A||b|

    whenever A is Hermitian.

    1. To see this just notice that bHAb is a real quantity.

    2. Hence bHAb=±|bHAb|.

    3. Now, using triangle inequality we can easily show that |bHAb||b|T|A||b|.

  4. Since GI is Hermitian, hence

    bH(GI)b|b|T|GI||b|.
  5. Now

    |b|T|GI||b|=i,j|bi||diHdjδij||bj|μ(D)i,j,ij|bi||bj|=μ(D)|b|T(1I)|b|.
  6. Only the off-diagonal terms of G remain in the sum, which are all dominated by μ(D).

  7. Thus we get

    $$4\epsilon^2 \geq \|b\|_2^2 - \mu(\mathcal{D})|b|^T(\mathbf{1} - I)|b| = (1 + \mu(\mathcal{D}))\|b\|_2^2 - \mu(\mathcal{D})|b|^T\mathbf{1}|b| = (1 + \mu(\mathcal{D}))\|b\|_2^2 - \mu(\mathcal{D})\|b\|_1^2.$$

    This is valid since $|v|^T\mathbf{1}|v| = \|v\|_1^2$.

  8. Since a^ is optimal solution of (20.8), hence

    a^1=b+a1a1b+a1a10.
  9. Let Λ=supp(a) and K=|Λ|.

  10. By a simple permutation of columns of D, we can bring the entries in a to the first K entries making Λ={1,,K}.

  11. We will make this assumption going forward without loss of generality.

  12. Let 1K be corresponding support vector (of ones in first K places and 0 in rest).

  13. From our previous analysis, we recall that

    b+a1a1b121KT|b|.
  14. Thus

    b121KT|b|0b121KT|b|.
  15. 1KT|b| is the sum of first K terms of |b|.

  16. Considering bΛ as a vector CK and using the 1-2 norm relation v1Kv2vCN, we get

    1KT|b|=bΛ1KbΛ2Kb2.
  17. Thus,

    b121KT|b|2Kb2.
  18. Putting this back in the previous inequality

    $$4\epsilon^2 \geq (1 + \mu(\mathcal{D}))\|b\|_2^2 - \mu(\mathcal{D})\|b\|_1^2 \geq (1 + \mu(\mathcal{D}))\|b\|_2^2 - 4K\mu(\mathcal{D})\|b\|_2^2 = \bigl(1 - (4K - 1)\mu(\mathcal{D})\bigr)\|b\|_2^2.$$
  19. We note that this inequality is valid only if

    1(4K1)μ(D)>0.
  20. This condition can be reformulated as

    a0=K<14(1+1μ(D)).
  21. Rewriting the bound on b22 we get

    b224ϵ2(1(4K1)μ(D))

    which is the desired result.
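As a quick empirical companion to this theorem, the sketch below (assuming `numpy` and `cvxpy`; the dimensions are arbitrary) solves the BPIC program (20.8) on a noisy sparse signal. For random dictionaries of this size the sparsity condition of Theorem 20.9 rarely holds, so the bound does not formally apply, yet the recovery error is still small in practice.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
N, D, K = 64, 128, 4
Dict = rng.standard_normal((N, D))
Dict /= np.linalg.norm(Dict, axis=0)

a_true = np.zeros(D)
a_true[rng.choice(D, K, replace=False)] = rng.standard_normal(K)
noise = 0.01 * rng.standard_normal(N)
x = Dict @ a_true + noise
eps = np.linalg.norm(noise)

a = cp.Variable(D)                    # BPIC: min ||a||_1 s.t. ||x - D a||_2 <= eps
cp.Problem(cp.Minimize(cp.norm(a, 1)), [cp.norm(x - Dict @ a, 2) <= eps]).solve()
print("||a_hat - a||_2 =", np.linalg.norm(a.value - a_true))
```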

20.2.5. BPDN#

In this subsection we will examine the 1 penalty problem (20.10) more closely.

$$
\hat{a} = \underset{a \in \mathbb{C}^D}{\arg\min}\; \frac{1}{2}\|x - \mathcal{D} a\|_2^2 + \gamma\|a\|_1. \tag{20.15}
$$

We will focus on following issues:

  • Some results from convex analysis useful for our study

  • Conditions for the minimization of (20.15) over coefficients a supported on a subdictionary DΛ

  • Conditions under which the unique minimizer for a subdictionary is also the global minimizer for (20.15)

  • Application of (20.15) for sparse signal recovery

  • Application of (20.15) for identification of sparse signals in presence of noise

  • Application of (20.15) for identification of sparse signals in presence of Gaussian noise

We recall some definitions and results from convex analysis which will help us understand the minimizers for (20.15) problem.

Convex analysis for real-valued functions on the vector space $(\mathbb{C}^n, \mathbb{R})$ (i.e., $\mathbb{C}^n$ viewed as a vector space over the real scalars) is developed using the bilinear inner product defined as

$$
\langle x, y \rangle_B = \mathrm{Re}(y^Hx).
$$

The subscript $B$ is there to distinguish it from the standard inner product for the complex coordinate space, $\langle x, y \rangle = y^Hx$. The two inner products are related as

$$
\langle x, y \rangle_B = \mathrm{Re}(\langle x, y \rangle).
$$

We consider real-valued functions over the inner product space $X = (\mathbb{C}^D, \langle\cdot,\cdot\rangle_B)$. Note that the dimension of $X$ as a real vector space is $2D$.

A real valued convex function f:XR satisfies the standard convexity inequality

f(θx+(1θ)y)θf(x)+(1θ)f(y)0θ1.

The objective function for the problem (20.15) is

$$
L(a) = \frac{1}{2}\|x - \mathcal{D} a\|_2^2 + \gamma\|a\|_1. \tag{20.16}
$$

Clearly, $L$ is a real-valued function over $X$ and it is easy to see that it is a convex function. Moreover, $L(a) \geq 0$ always.

We suggest the readers to review the material in Subgradients. For any function f:XR, its subdifferential set is defined as

$$
\partial f(x) \triangleq \{g \in X \mid f(y) \geq f(x) + \langle y - x, g\rangle_B \ \forall y \in X\}.
$$

The elements of the subdifferential set are called subgradients. If $f$ possesses a gradient at $x$, then it is the unique subgradient at $x$; i.e.,

$$
\partial f(x) = \{\nabla f(x)\}
$$

where $\nabla f(x)$ is the gradient of $f$ at $x$.

For convex function, the subdifferential of a sum is the (Minkowski) sum of the subdifferentials (Theorem 9.222); i.e.,

(f(x)+g(x))=f(x)+g(x)={h1+h2|h1f(x),h2g(x)}.

By Fermat’s optimality condition (Theorem 9.233), if f is a closed, proper convex function, then x is a global minimizer of f if and only if 0f(x).

We would be specifically interested in the subdifferential for the function a1.

Theorem 20.10

Let $z \in X$. The vector $g \in X$ lies in the subdifferential $\partial\|z\|_1$ if and only if

  • $|g_k| \leq 1$ whenever $z_k = 0$;

  • $g_k = \mathrm{sgn}(z_k)$ whenever $z_k \neq 0$.

The proof is skipped.

We recall that the signum function for complex numbers is defined as

$$
\mathrm{sgn}(re^{i\theta}) = \begin{cases} e^{i\theta} & \text{if } r > 0; \\ 0 & \text{if } r = 0.\end{cases}
$$

The subdifferential of the $\ell_1$ norm function for $\mathbb{R}^n$ is developed in Example 9.67. We note that the $\ell_1$ norm is differentiable exactly at those vectors which have no zero entries. Also, from the characterization above we can see that:

  1. $\|g\|_\infty = 1$ when $z \neq 0$.

  2. $\|g\|_\infty \leq 1$ when $z = 0$.
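A short sketch (assuming `numpy`; the helper names are ours) that checks the membership condition of Theorem 20.10 numerically:

```python
import numpy as np

def sgn(z):
    """Complex signum: z / |z| for nonzero entries, 0 otherwise."""
    out = np.zeros_like(z, dtype=complex)
    nz = z != 0
    out[nz] = z[nz] / np.abs(z[nz])
    return out

def in_l1_subdifferential(g, z, tol=1e-12):
    """Check the condition of Theorem 20.10: g lies in the subdifferential of ||z||_1."""
    zero = (z == 0)
    ok_zero = np.all(np.abs(g[zero]) <= 1 + tol)
    ok_nonzero = np.allclose(g[~zero], sgn(z)[~zero], atol=tol)
    return ok_zero and ok_nonzero

z = np.array([2.0 + 0j, 0.0, -1.0 + 1.0j])
g = sgn(z)
g[1] = 0.3            # any value with magnitude <= 1 is allowed on a zero coordinate
print(in_l1_subdifferential(g, z))   # True
```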

20.2.5.1. Restricted Minimizers#

  1. Suppose Λ index a subdictionary DΛ.

  2. DΛ is a linearly independent collection of atoms.

  3. Hence a unique 2 best approximation x^Λ of x using the atoms in DΛ can be obtained using the least square techniques.

  4. We define the orthogonal projection operator

    PΛ=DΛDΛ.
  5. And we get

    x^Λ=PΛx.
  6. The approximation is orthogonal to the residual; i.e., (xx^Λ)x^Λ.

  7. There is a unique coefficient vector cΛ supported on Λ that synthesizes the approximation x^Λ.

    cΛ=DΛx=DΛx^Λ.
  8. We also have

    x^Λ=PΛx=DΛcΛ.
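In code, these quantities reduce to a least-squares problem. A minimal sketch (assuming `numpy`; the support Λ below is an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 32, 64
Dict = rng.standard_normal((N, D))
Dict /= np.linalg.norm(Dict, axis=0)

Lam = [3, 17, 40]                       # index set Lambda (linearly independent atoms)
D_Lam = Dict[:, Lam]

x = rng.standard_normal(N)
c_Lam = np.linalg.pinv(D_Lam) @ x       # c_Lambda = pinv(D_Lambda) @ x
x_hat = D_Lam @ c_Lam                   # orthogonal projection P_Lambda x

print(abs((x - x_hat) @ x_hat))         # residual is orthogonal to the approximation (~0)
```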

Theorem 20.11 (Minimization of L over vectors supported on Λ)

Let Λ index a linearly independent collection of atoms in $\mathcal{D}$ and let $a$ minimize the objective function $L(a)$ in (20.16) over all coefficient vectors supported on Λ (i.e. $\mathrm{supp}(a) \subseteq \Lambda$). A necessary and sufficient condition on such a minimizer is that

$$
c_\Lambda - a = \gamma(\mathcal{D}_\Lambda^H\mathcal{D}_\Lambda)^{-1}g \tag{20.17}
$$

where $c_\Lambda = \mathcal{D}_\Lambda^{\dagger}x$ and the vector $g$ is drawn from $\partial\|a\|_1$. Moreover, the minimizer $a$ is unique.

Proof. Since we restrict $a$ to be supported on Λ (i.e. $a \in \mathbb{C}^\Lambda$), we have

$$
\mathcal{D} a = \mathcal{D}_\Lambda a_\Lambda.
$$

The objective function simplifies to

$$
L(a) = \frac{1}{2}\|x - \mathcal{D}_\Lambda a_\Lambda\|_2^2 + \gamma\|a_\Lambda\|_1.
$$
  1. We define x^Λ=PΛx.

  2. Now, both x^Λ and DΛaΛ belong to the column space of DΛ while xx^Λ is orthogonal to it.

  3. Hence

    xx^Λx^ΛDa.
  4. Thus, using the Pythagorean theorem, we get

    xDa22=xx^Λ+x^ΛDa22=xx^Λ22+x^ΛDa22.
  5. We can rewrite L(a) as

    L(a)=12xx^Λ22+12x^ΛDa22+γa1.
  6. Define

    F(a)=12x^ΛDa22+γa1.
  7. Then

    L(a)=12xx^Λ22+F(a).
  8. Note that the term xx^Λ22 is constant. It is the squared norm of the least square error.

  9. Thus, minimizing L(a) over the coefficient vectors supported on Λ is equivalent to minimizing F(a) over the same support set.

  10. Note that

    Da=DΛaΛ and a1=aΛ1.
  11. We can write F(a) as

    F(a)=12x^ΛDΛaΛ22+γaΛ1.
  12. Note that F(a) depends only on entries in a which are part of the support Λ.

  13. We can replace the variable aΛ with aCΛ and rewrite F(a) as

    F(a)=12x^ΛDΛa22+γa1aCΛ.
  14. Since atoms indexed by Λ are linearly independent, hence DΛ has full column rank.

  15. Thus, the quadratic term x^ΛDΛa22 is strictly convex.

  16. Since a1 is also convex, F(a) therefore is strictly convex and its minimizer is unique.

  17. Since F is strictly convex and unconstrained, hence 0F(a) is a necessary and sufficient condition for the coefficient vector a to minimize F(a).

  18. The gradient of the first (quadratic) term is

    (DΛHDΛ)aDΛHx^Λ.
  19. For the second term we have to consider its subdifferential a1.

  20. Thus, at a it follows that

    (DΛHDΛ)aDΛHx^Λ+γg=0.

    where g is some subgradient in a1.

  21. Premultiplying with (DΛHDΛ)1 we get

    aDΛx^Λ+γ(DΛHDΛ)1g=0DΛx^Λa=γ(DΛHDΛ)1g.
  22. Finally, we recall that DΛx^Λ=cΛ.

  23. Thus, we get the desired result

    cΛa=γ(DΛHDΛ)1g.

Some bounds follow as a result of this theorem. Readers are suggested to review the material in Matrix Norms.

Theorem 20.12

Suppose that Λ indexes a subdictionary $\mathcal{D}_\Lambda$ and let $a$ minimize the function $L$ over all coefficient vectors supported on Λ. Then the following bounds are in force:

$$
\|c_\Lambda - a\|_\infty \leq \gamma\|(\mathcal{D}_\Lambda^H\mathcal{D}_\Lambda)^{-1}\|_{\infty,\infty},
$$

$$
\|\mathcal{D}_\Lambda(c_\Lambda - a)\|_2 \leq \gamma\|\mathcal{D}_\Lambda^{\dagger}\|_{2,1}.
$$

Proof. We start with

(20.18)#cΛa=γ(DΛHDΛ)1g.
  1. we take the norm on both sides and apply some norm bounds

    cΛa=γ(DΛHDΛ)1gγ(DΛHDΛ)1gγ(DΛHDΛ)1.

    The last inequality is valid since from Theorem 20.10 we have: g1.

  2. Now let us premultiply (20.18) with DΛ and apply 2 norm

    $$\|\mathcal{D}_\Lambda(c_\Lambda - a)\|_2 = \gamma\|\mathcal{D}_\Lambda(\mathcal{D}_\Lambda^H\mathcal{D}_\Lambda)^{-1}g\|_2 = \gamma\|(\mathcal{D}_\Lambda^{\dagger})^Hg\|_2 \leq \gamma\|(\mathcal{D}_\Lambda^{\dagger})^H\|_{\infty,2}\|g\|_\infty = \gamma\|\mathcal{D}_\Lambda^{\dagger}\|_{2,1}\|g\|_\infty \leq \gamma\|\mathcal{D}_\Lambda^{\dagger}\|_{2,1}.$$
  3. In this derivation we used the identity $\|A^H\|_{\infty,2} = \|A\|_{2,1}$, the bound $\|Ax\|_q \leq \|A\|_{p,q}\|x\|_p$, and $\|g\|_\infty \leq 1$.

20.2.5.2. The Correlation Condition#

So far we have established a condition which ensures that a is a unique minimizer of L given that a is supported on Λ. We now establish a sufficient condition under which a is also a global minimizer for L(a).

Theorem 20.13 (Correlation condition for global minimizer)

Assume that Λ indexes a subdictionary. Let a minimize the function L over all coefficient vectors supported on Λ. Suppose that

$$
\|\mathcal{D}^H(x - \hat{x}_\Lambda)\|_\infty \leq \gamma\left[1 - \max_{\omega\notin\Lambda}|\langle\mathcal{D}_\Lambda^{\dagger}d_\omega, g\rangle|\right] \tag{20.19}
$$

where $g \in \partial\|a\|_1$ is determined by (20.17). Then $a$ is the unique global minimizer of $L$. Moreover, the condition

$$
\|\mathcal{D}^H(x - \hat{x}_\Lambda)\|_\infty \leq \gamma\left[1 - \max_{\omega\notin\Lambda}\|\mathcal{D}_\Lambda^{\dagger}d_\omega\|_1\right] \tag{20.20}
$$

guarantees that a is the unique global minimizer of L.

Proof. Let a be the unique minimizer of L over coefficient vectors supported on Λ. Then, the value of the objective function L(a) increases if we change any coordinate of a indexed in Λ.

What we need is a condition which ensures that the value of objective function also increases if we change any other component of a (not indexed by Λ).

  1. If this happens, then a will become a local minimizer of L.

  2. Further, since L is convex, a will also be global minimizer.

Towards this, let $\omega$ be some index not in Λ and let $e_\omega \in \mathbb{C}^D$ be the corresponding unit vector. Let $\delta e_\omega$ be a small perturbation introduced in the $\omega$-th coordinate ($\delta \in \mathbb{C}$ is a small scalar, though it need not be a positive real number). We need to find a condition which ensures

$$
L(a + \delta e_\omega) - L(a) > 0 \quad \forall \omega \notin \Lambda.
$$
  1. Let us expand the L.H.S. of this inequality:

    L(a+δeω)L(a)=[12xDaδdω2212xDa22]+γ[a+δeω1a1].

    Here we used the fact that Deω=dω.

  2. Note that since a is supported on Λ and ωΛ, hence

    a+δeω1=a1+δeω1.
  3. Thus

    a+δeω1a1=|δ|.
  4. We should also simplify the first bracket.

    xDa22=(xDa)H(xDa)=xHx+aHDHDaxHDaaHDHx.
  5. Similarly

    xDaδdω22=(xDaδdω)H(xDaδdω)=xHx+aHDHDaxHDaaHDHx(xDa)HδdωδdωH(xDa)+δdω22.
  6. Canceling the like terms we get

    δdω222Re(xDa,δdω).
  7. Thus,

    L(a+δeω)L(a)=12δdω22Re(xDa,δdω)+γ|δ|.
  8. Recall that since a is supported on Λ, hence Da=DΛa.

  9. We can further split the middle term by adding and subtracting DΛcΛ.

    Re(xDΛa,δdω)=Re(xDΛcΛ+DΛcΛDΛa,δdω)=Re(xDΛcΛ,δdω)+Re(DΛ(cΛa),δdω)
  10. Thus, we can write

    L(a+δeω)L(a)=12δdω22Re(xDΛcΛ,δdω)Re(DΛ(cΛa),δdω)+γ|δ|.
  11. The term 12δdω22 is strictly positive giving us

    L(a+δeω)L(a)>Re(xDΛcΛ,δdω)Re(DΛ(cΛa),δdω)+γ|δ|.
  12. Using lower triangle inequality we can write

    L(a+δeω)L(a)>γ|δ||xDΛcΛ,δdω||DΛ(cΛa),δdω|.
  13. Using linearity of inner product, we can take out |δ|:

    $$L(a + \delta e_\omega) - L(a) > |\delta|\left[\gamma - |\langle x - \mathcal{D}_\Lambda c_\Lambda, d_\omega\rangle| - |\langle\mathcal{D}_\Lambda(c_\Lambda - a), d_\omega\rangle|\right]. \tag{20.21}$$
  14. Let us simplify this expression. Since a is a unique minimizer over coefficients in CΛ, hence using Theorem 20.11

    $$c_\Lambda - a = \gamma(\mathcal{D}_\Lambda^H\mathcal{D}_\Lambda)^{-1}g \implies \mathcal{D}_\Lambda(c_\Lambda - a) = \gamma\mathcal{D}_\Lambda(\mathcal{D}_\Lambda^H\mathcal{D}_\Lambda)^{-1}g = \gamma(\mathcal{D}_\Lambda^{\dagger})^Hg.$$

    where ga1.

  15. Thus

    |DΛ(cΛa),dω|=γ|(DΛ)Hg,dω|=γ|DΛdω,g|

    using the fact that Ax,y=x,AHy.

  16. Also, we recall that x^Λ=DΛcΛ.

  17. Putting the back in (20.21) we obtain:

    $$L(a + \delta e_\omega) - L(a) > |\delta|\left[\gamma - \gamma|\langle\mathcal{D}_\Lambda^{\dagger}d_\omega, g\rangle| - |\langle x - \hat{x}_\Lambda, d_\omega\rangle|\right]. \tag{20.22}$$
  18. In (20.22), the L.H.S. is non-negative (our real goal) whenever the term in the bracket on the R.H.S. is non-negative (since |δ| is positive).

  19. Therefore we want that

    γγ|DΛdω,g||xx^Λ,dω|0.
  20. This can be rewritten as

    |xx^Λ,dω|γ[1|DΛdω,g|].
  21. Since this condition should hold for every ωΛ, hence we maximize the L.H.S. and minimize the R.H.S. over ωΛ.

  22. We get

    maxωΛ|xx^Λ,dω|minωΛγ[1|DΛdω,g|]=γ[1maxωΛ|DΛdω,g|].
  23. Recall that xx^Λ is orthogonal to the space spanned by atoms in DΛ.

  24. Hence

    maxωΛ|xx^Λ,dω|=maxω|xx^Λ,dω|=DH(xx^Λ).
  25. This gives us the desired sufficient condition

    DH(xx^Λ)γ[1maxωΛ|DΛdω,g|].
  26. This condition still uses g. We know that g1.

  27. Let us simplify as follows:

    |DΛdω,g|=|(DΛdω)Hg|=(DΛdω)Hg(DΛdω)Hg=(DΛdω)1g(DΛdω)1.
  28. Another way to understand this is as follows. For any vector vCD

    |v,g|=|i=1Dgivi|i=1D|gi||vi|[i=1D|vi|]gv1.
  29. Thus

    |DΛdω,g|DΛdω1.
  30. Thus, it is also sufficient that

    DH(xx^Λ)γ[1maxωΛ(DΛdω)1].

20.2.5.3. Exact Recovery Coefficient#

We recall that the Exact Recovery Coefficient for a subdictionary is defined as

$$
\mathrm{ERC}(\Lambda) = 1 - \max_{\omega\notin\Lambda}\|\mathcal{D}_\Lambda^{\dagger}d_\omega\|_1.
$$

Thus, the sufficient condition can be rewritten as

$$
\|\mathcal{D}^H(x - \hat{x}_\Lambda)\|_\infty \leq \gamma\,\mathrm{ERC}(\Lambda).
$$
  1. Note that the L.H.S. in both sufficient conditions is always non-negative.

  2. Hence, if the R.H.S. is negative (i.e. ERC(Λ)<0), the sufficient condition is useless.

  3. On the other hand if ERC(Λ)>0, then a sufficiently high γ can always be chosen to satisfy the condition in (20.20).

  4. At the same time, as $\gamma \to \infty$, the optimal minimizer tends to $a = 0$.
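The Exact Recovery Coefficient introduced above is straightforward to evaluate numerically. A sketch (assuming `numpy`; the function name `erc` and the dictionary are ours):

```python
import numpy as np

def erc(Dict, Lam):
    """ERC(Lambda) = 1 - max over w outside Lambda of ||pinv(D_Lambda) d_w||_1."""
    Lam = list(Lam)
    pinv = np.linalg.pinv(Dict[:, Lam])
    others = [w for w in range(Dict.shape[1]) if w not in Lam]
    return 1 - max(np.linalg.norm(pinv @ Dict[:, w], 1) for w in others)

rng = np.random.default_rng(7)
Dict = rng.standard_normal((64, 128))
Dict /= np.linalg.norm(Dict, axis=0)
print(erc(Dict, [0, 5, 9]))    # a positive value indicates a favorable support
```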

How do we interpret the L.H.S. DH(xx^Λ)?

Definition 20.2

Given a non-zero signal v and a dictionary D, define the function

$$
\mathrm{maxcor}(v) \triangleq \max_{\omega\in\Omega}\frac{|\langle v, d_\omega\rangle|}{\|v\|_2}.
$$

If v=0, then define maxcor(v)=0.

This is known as the maximum correlation [79] of a signal with a dictionary.

Essentially, for any signal we normalize it and then find out its maximum inner product (absolute value) with atoms in the dictionary D. Clearly 0maxcor(v)1.

We can see that

$$
\|\mathcal{D}^Hv\|_\infty = \mathrm{maxcor}(v)\,\|v\|_2.
$$

We can now interpret

DH(xx^Λ)=maxcor(xx^Λ)xx^Λ2.

Therefore, the sufficient condition in Theorem 20.13 is easiest to satisfy when the norm of the residual $x - \hat{x}_\Lambda$ and its maximum correlation with the dictionary are both small.
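A companion sketch for the maximum correlation (again `numpy`; the helper name `maxcor` is ours); it also confirms the identity $\|\mathcal{D}^Hv\|_\infty = \mathrm{maxcor}(v)\,\|v\|_2$ numerically:

```python
import numpy as np

def maxcor(Dict, v):
    """Maximum correlation of a signal v with the atoms of Dict (Definition 20.2)."""
    nv = np.linalg.norm(v)
    if nv == 0:
        return 0.0
    return np.max(np.abs(Dict.conj().T @ v)) / nv

rng = np.random.default_rng(9)
Dict = rng.standard_normal((32, 64))
Dict /= np.linalg.norm(Dict, axis=0)
v = rng.standard_normal(32)
lhs = np.max(np.abs(Dict.conj().T @ v))
print(np.isclose(lhs, maxcor(Dict, v) * np.linalg.norm(v)))   # True
```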

Since the maximum correlation of the residual never exceeds one, we obtain the following (much weaker) result.

Corollary 20.1

Let Λ index a subdictionary and let x be an input signal. Suppose that the residual vector xx^Λ satisfies

xx^Λ2γERC(Λ).

Then any coefficient vector a that minimizes the function L must be supported inside Λ.

20.2.5.4. Applications of 1 Penalization#

Having setup the basic results in place, we can now study the applications of (20.15).

Theorem 20.14 (BPDN reconstruction guarantees using ERC)

Let Λ index a subdictionary $\mathcal{D}_\Lambda$ for which $\mathrm{ERC}(\Lambda) \geq 0$. Suppose that $x$ is an input signal whose $\ell_2$-best approximation over Λ satisfies the correlation condition

$$
\|\mathcal{D}^H(x - \hat{x}_\Lambda)\|_\infty \leq \gamma\,\mathrm{ERC}(\Lambda).
$$

Let a solve the convex program (20.15) with parameter γ. We may conclude that:

  1. Support of a is contained in Λ.

  2. The error between a and the optimal coefficient vector cΛ satisfies

    $$\|a - c_\Lambda\|_\infty \leq \gamma\|(\mathcal{D}_\Lambda^H\mathcal{D}_\Lambda)^{-1}\|_{\infty,\infty}.$$
  3. In particular, supp(a) contains every index λ in Λ for which

    |cΛ(λ)|>γ(DΛHDΛ)1.
  4. Moreover, the minimizer a is unique.

Proof. .

  1. Since the correlation condition of Theorem 20.13 is satisfied, the vector $a$ which minimizes $L$ over coefficient vectors in $\mathbb{C}^\Lambda$ is also a global minimizer of $L$.

  2. Since aCΛ, hence supp(a)Λ.

  3. For claim 2, application of Theorem 20.12 gives us

    cΛaγ(DΛHDΛ)1.
  4. Moreover, since the minimizer of $L$ over $\mathbb{C}^\Lambda$ is unique (Theorem 20.11) and every global minimizer must be supported on Λ, $a$ is the unique global minimizer.

  5. For claim 3, suppose a(λ)=0 for some index λΛ for which

    |cΛ(λ)|>γ(DΛHDΛ)1.
  6. Then

    |a(λ)cΛ(λ)|=|cΛ(λ)|>γ(DΛHDΛ)1.
  7. But

    acΛ|a(λ)cΛ(λ)|.
  8. This violates the bound that

    acΛγ(DΛHDΛ)1.
  9. Thus, supp(a) contains every index λΛ for which

    |cΛ(λ)|>γ(DΛHDΛ)1.

We can formulate a simpler condition in terms of coherence of the dictionary.

Theorem 20.15 (BPDN reconstruction guarantees using coherence)

Suppose that $K\mu \leq \frac{1}{2}$. Assume that $|\Lambda| \leq K$; i.e., Λ contains at most $K$ indices. Suppose that $x$ is an input signal whose $\ell_2$-best approximation over Λ, denoted by $\hat{x}_\Lambda$, satisfies the correlation condition

$$
\|\mathcal{D}^H(x - \hat{x}_\Lambda)\|_\infty \leq \gamma\,\frac{1 - (2K - 1)\mu}{1 - (K - 1)\mu}.
$$

Let a solve the convex program (20.15) with parameter γ. We may conclude that:

  1. Support of a is contained in Λ and

  2. The distance between a and the optimal coefficient vector cΛ satisfies

$$
\|a - c_\Lambda\|_\infty \leq \frac{\gamma}{1 - (K - 1)\mu}.
$$
  3. In particular, supp(a) contains every index λ in Λ for which

    $$|c_\Lambda(\lambda)| > \frac{\gamma}{1 - (K - 1)\mu}.$$
  4. Moreover, the minimizer a is unique.

Proof. .

  1. We recall from Theorem 18.75 the coherence bounds on ERC as

    $$\mathrm{ERC}(\Lambda) \geq \frac{1 - (2K - 1)\mu}{1 - (K - 1)\mu}.$$
  2. Thus,

    DΛH(xx^Λ)γ1(2K1)μ1(K1)μγERC(Λ).
  3. A direct application of Theorem 20.14 validates claims 1 and 4.

  4. We recall from Theorem 18.37 the upper bound on norm of inverse Gram matrix of a subdictionary as

    $$\|G^{-1}\|_{\infty,\infty} = \|G^{-1}\|_{1,1} \leq \frac{1}{1 - \mu_1(K - 1)} \leq \frac{1}{1 - (K - 1)\mu}.$$
  5. Putting this in Theorem 20.14 validates claims 2 and 3.
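To close the subsection, here is a small end-to-end sketch (assuming `numpy` and `cvxpy`; the dimensions, support and the value of $\gamma$ are arbitrary choices, not prescribed by the theorems) that solves (20.15) on a noisy sparse signal and inspects the recovered support:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(11)
N, D = 64, 128
Dict = rng.standard_normal((N, D))
Dict /= np.linalg.norm(Dict, axis=0)

Lam = [4, 20, 77]                                   # support Lambda
c = np.zeros(D)
c[Lam] = [1.0, -2.0, 1.5]
x = Dict @ c + 0.01 * rng.standard_normal(N)        # noisy observation

gamma = 0.1
a = cp.Variable(D)
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(x - Dict @ a)
                       + gamma * cp.norm(a, 1))).solve()

recovered = np.flatnonzero(np.abs(a.value) > 1e-6)
print("recovered support:", recovered)              # typically contained in Lam
```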

20.2.5.5. Exact Sparse Reconstruction Problem#

We now show how one can reconstruct an exactly sparse signal solving the convex program (20.15).

Theorem 20.16 (BPDN exact sparse recovery guarantee)

Assume that Λ indexes a subdictionary for which $\mathrm{ERC}(\Lambda) \geq 0$. Choose an arbitrary coefficient vector $c_{opt}$ supported on Λ. Fix an input signal $x = \mathcal{D} c_{opt}$. Let $a(\gamma)$ denote the unique minimizer of (20.15) with parameter $\gamma$. We may conclude that:

  1. There is a positive number γ0 for which γ<γ0 implies that supp(a(γ))=Λ.

  2. In the limit as γ0, we have a(γ)copt.

Proof. .

  1. Since there is no noise, the best $\ell_2$ approximation of $x$ over Λ is

    $$\hat{x}_\Lambda = x$$

    itself, and the corresponding coefficient vector is

    $$c_\Lambda = c_{opt}.$$
  2. Therefore

    DΛH(xx^Λ)=0γERC(Λ).
  3. Thus, the correlation condition is in force for every positive value of γ.

  4. Thus, as per Theorem 20.14, minimizer a(γ) of the convex program (20.15) must be supported inside Λ.

  5. Moreover, we have

    a(γ)coptγ(DΛHDΛ)1.
  6. Clearly, as γ0, we have a(γ)copt.

  7. Finally, recall that supp(a(γ)) contains every index λ in Λ for which

    |copt(λ)|>γ(DΛHDΛ)1.
  8. In order for every index in Λ to be part of supp(a(γ)), we require

    $$\min_{\lambda\in\Lambda}\frac{|c_{opt}(\lambda)|}{\|(\mathcal{D}_\Lambda^H\mathcal{D}_\Lambda)^{-1}\|_{\infty,\infty}} > \gamma.$$
  9. Choosing the L.H.S. to be γ0, we get an explicit value of upper bound on γ such that γ<γ0 leads to complete discovery of support.