20.2. Basis Pursuit#
20.2.1. Introduction#
We recall the following sparse approximation problems.
Given a signal
When
Here
A different way to formulate the approximation problem is to provide an upper bound
to the acceptable approximation error
Basis Pursuit (BP) [23] suggests the convex relaxation of
(20.4)
by replacing
For real signals, it can be implemented as a linear program. For complex signals, it can be implemented as a second-order cone program.
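To make the linear-programming route concrete, here is a minimal sketch (not part of the original text) for real signals, assuming (20.7) is the usual equality-constrained minimization of the l1 norm. The function name `basis_pursuit_lp` and the dense constraint matrices are illustrative choices, not a prescribed implementation; `scipy.optimize.linprog` is used as a generic LP solver.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit_lp(Phi, y):
    """Sketch: solve min ||x||_1 s.t. Phi @ x = y via the standard LP lifting.

    Variables are stacked as z = [x, t]; we minimize sum(t) subject to
    -t <= x <= t and Phi @ x = y.  Real-valued signals only.
    """
    m, n = Phi.shape
    c = np.concatenate([np.zeros(n), np.ones(n)])      # objective: sum of t
    # inequality constraints:  x - t <= 0  and  -x - t <= 0
    A_ub = np.block([[ np.eye(n), -np.eye(n)],
                     [-np.eye(n), -np.eye(n)]])
    b_ub = np.zeros(2 * n)
    # equality constraint: Phi x = y
    A_eq = np.hstack([Phi, np.zeros((m, n))])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
                  bounds=[(None, None)] * n + [(0, None)] * n)
    return res.x[:n]
```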
In the presence of approximation error (20.6),
where
where
The dual problem constructed using Lagrange multipliers is
This is known as basis pursuit denoising (BPDN).
With appropriate choice of
Note that the constraint
Efficient solvers are available for the BP, BPIC, and BPDN problems using convex optimization techniques. They usually run in polynomial time but involve sophisticated algorithms to implement. The good part is the guarantee that a global optimum can be found (since the problem is convex). The not-so-good part is that convex optimization methods are still quite computationally intensive.
An alternative formulation of BPDN is as follows.
The difference between the two formulations is essentially which term the Lagrangian constant (
Basis pursuit is not an algorithm but a principle, which says that for most real-life problems,
the solution of
We start our discussion with the analysis of exact-sparse case.
As part of our theoretical analysis, we would like to explore conditions under which the problems (20.4) and (20.7) are equivalent, i.e., there exists a unique solution to both of them and the solutions are identical. Under such conditions, the NP-hard problem (20.4) can be replaced with the tractable problem (20.7), which is convex and solvable in polynomial time.
20.2.2. Two-Ortho-Case#
Further simplifying, we consider the case where the dictionary
with
Clearly,
and . We denote
as the index set for the representation vectors
. The representation
of a signal in can be written as
We can assign
Total sparsity of
is given by
Whenever
, we have a sparse representation. Further, let
be the support corresponding to part of (i.e. ) and be the support corresponding to part of (i.e. ). Clearly,
and . Note that
and need not be disjoint. But,
and are disjoint. In fact,
. will denote the indicator vector for ; i.e., and . Similarly,
will denote the indicator vector for . will denote the vector . Also,
will denote a square matrix of all ones. Note that
.
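As a concrete illustration (a sketch, not part of the original text), the coherence of a two-ortho dictionary can be computed directly from its Gram matrix. The Dirac-Fourier pair below is an assumed example, for which the coherence works out to 1/sqrt(n):

```python
import numpy as np

def coherence(D):
    """Mutual coherence: largest |<d_i, d_j>| over distinct unit-norm columns."""
    D = D / np.linalg.norm(D, axis=0)            # normalize atoms
    G = np.abs(D.conj().T @ D)                   # magnitudes of the Gram matrix
    np.fill_diagonal(G, 0.0)                     # ignore self inner products
    return G.max()

n = 64
dirac = np.eye(n)                                # identity (Dirac) basis
fourier = np.fft.fft(np.eye(n)) / np.sqrt(n)     # orthonormal DFT basis
D = np.hstack([dirac, fourier])                  # two-ortho dictionary
print(coherence(D), 1 / np.sqrt(n))              # both are approximately 1/8 for n = 64
```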
We now state our main result for equivalence of solutions of
(20.4) and
(20.7) for the two-ortho case.
Going forward,
we will simply use
Theorem 20.6
Let
then
A weaker condition is: If
then
Proof. We first show that
Towards the end of this proof, we show that (20.11)
(20.12). Due to (20.12),
Thus, if
satisfies (20.11), then it is necessarily the sparsest possible representation of in due to Theorem 18.8. All other representations are denser (i.e., have more non-zero entries).
Hence
is a unique solution of (20.4).
We next show that
is a feasible vector to (20.7), though it need not be an optimal solution. We have to find criteria under which
is optimal and no other feasible vector is optimal. Since
is the unique solution to (20.4), hence for every other feasible for (20.7). We consider the set of alternative feasible vectors to (20.7) given by
This set contains all feasible vectors to (20.7) which are
different from
have larger support (larger
-“norm”)
satisfy the linear system of equations
have
norm less than or equal to .
If this set is nonempty, then there exists a solution to basis pursuit which is not same as
. If this set is empty, then the solutions of (20.4) and (20.7) coincide and are both unique.
Writing
, we have
Thus, we can rewrite
as
In order to show that
is empty, we will show that a larger set containing is also empty. Essentially, we wish to consider a larger set whose emptiness can be checked easily against (20.11).
If that larger set is empty due to (20.11), then
would also be empty and we would have completed the proof.
Emptiness of
We start with the requirement
. Let
and , where and refer to the parts corresponding to the orthonormal bases and respectively (as described at the beginning of this section). Note that even if
and are sparse, and need not be. In fact, the support of
and could be very different from and . We can now write
We are splitting the sum as follows.
We first split the sum into
and parts. We split the sum on the
part to sum over indices in and indices not in . We split the sum on the
part to sum over indices in and indices not in . For
, leading to . Ditto for
.
We recall from triangle inequality on complex numbers that
which implies
. Thus,
and
With this, the above condition can be relaxed as
Every
satisfying this inequality will also satisfy the condition . To simplify notation we can write
Then we have
Similarly,
Thus,
We can now define the set
Clearly,
and if is empty, then will also be empty. Note that this formulation of
is dependent only on the support of and not on values in . We now turn back to the requirement
and relax it further. We note that,
Multiplying by
we get
since
(unitary matrix). Similarly, multiplying with
, we obtain
Note that entries in
and are inner products between columns of , hence their magnitudes are upper bounded by (coherence). Denote
and consider the product . Then
Thus,
Applying this result to
we get,
Similarly,
Note that since
, it is a rank-1 matrix. We now construct a set
as
Clearly,
since for every , . We now define
and as the absolute value vectors. Correspondingly, let us define
Clearly,
and . Further
; i.e., every entry in is nonnegative. Similarly,
. We can then introduce a set
as
It is easy to see that if
then . Thus, if
is empty, then should be empty too. We note that if
, then for all , . Thus, in order to study (the emptiness of)
, it is sufficient to study unit -norm vectors ; i.e., there is no unit -norm vector in , if and only if is empty. Now
since
. This leads to:
We construct the new set of unit
-norm vectors
We have
. Note that the constraint
can be rewritten as
The set
is much easier to analyze since
It has no explicit dependency on
. is represented only by a single parameter, its coherence . All constraints are simple linear constraints. Thus, finding the elements of
can be formulated as a linear programming (feasibility) problem. The order of non-zero entries inside
and doesn’t have any influence on the requirements for to belong to . Thus, without loss of generality, we can focus on vectors for which the first entries are non-zero in and the first entries are non-zero in , respectively.
We next verify that
must be empty under (20.11).
Emptiness of
In order to find vectors in
, we can solve the following linear program.
(20.13)#
is a feasible vector for this linear program, hence a solution does exist for this program. What is interesting is the value of the objective function at the optimal solution.
Let
be (an) optimal solution for this linear program. If
, then satisfies all the requirements of and is indeed not empty. This doesn’t guarantee that
will also be non-empty, though. On the contrary, if
, then is indeed empty (as one of the requirements cannot be met). Hence
is also empty, leading to being empty too. Thus, a condition which leads to
is a sufficient condition for the equivalence of (20.4) and (20.7).
Consider a feasible
for (20.13). Let
. Since
, hence . We note that
Similarly,
Thus, the first two constraints change into
Since the objective is to maximize
, it is natural to maximize the non-zero entries in and corresponding to and . A straightforward option is to choose the first
entries in to be and the first entries in to be . Other entries can be chosen arbitrarily to meet the requirement that
. With this choice, we have
We recall that we have chosen
. Thus, the expression is maximized if
is chosen to be as small as possible. The choice of
must meet the following conditions on -norms. (Basically, the sum of the first terms of must not be more than the norm of . The same holds for .) Simplifying these inequalities, we get
Since these two conditions must be satisfied, we require
to meet
We will verify later that this condition is met if (20.11) holds.
Assuming the condition is met, obviously the smallest possible value of
is given by . The maximum value of the objective function then becomes
Finally, for BP to succeed, we require this expression to be strictly less than half.
This gives us
which is the sufficient condition for BP to succeed in the theorem.
We now show that (20.11)
From (20.11) we can write
as
Thus,
We define
and rewrite the above as
The weaker condition can now be obtained by minimizing the upper bound on the R.H.S. of this equation.
We define
Differentiating and equating to 0, we get
The optimal value is obtained when
. Since both
and are positive quantities, the negative value for is rejected and we get . This gives us
Lastly, the property that the arithmetic mean is greater than or equal to the geometric mean gives us
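For orientation, here is a small numerical check (a hypothetical sketch, not from the original text) assuming the weaker bound takes the familiar Elad-Bruckstein form, sparsity below (sqrt(2) - 1/2)/mu; the exact expression is the one stated in the theorem above.

```python
import numpy as np

# Assumed form of the weaker two-ortho bound: ||x||_0 < (sqrt(2) - 0.5) / mu.
# For the Dirac-Fourier pair in dimension n, the coherence is mu = 1/sqrt(n).
n = 64
mu = 1.0 / np.sqrt(n)
bound = (np.sqrt(2) - 0.5) / mu
print(f"BP equivalence guaranteed for sparsity below {bound:.2f} (n = {n})")
```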
20.2.3. General Case#
We now consider the case where
We develop sufficient conditions under which solutions of (20.4) and (20.7) match for the general case [28, 30].
Theorem 20.7
Let
Proof. Due to Theorem 18.27,
For any other feasible
for (20.7), we have since it is the unique sparsest solution of . We start by defining a set of alternative feasible vectors to (20.7):
This set contains all possible representations that
are different from
have larger support
satisfy
have a better (or at least as good)
-norm.
We need to show that if (20.14) holds, then the set
will be empty. Otherwise, BP would choose a solution different from
. The condition
is redundant under the assumption (20.14). Following the proof of Theorem 20.6, we define
We can then rewrite
as
Again, we will enlarge the set
and show that even the larger set is empty when (20.14) holds. We start with the requirement
. A simple permutation of the columns of
can bring the nonzero entries in to the beginning. Thus, without loss of generality, we assume that the first
entries in are nonzero and the rest are zero. We can now rewrite the requirement as
Using the inequality
, we can relax the above condition as
Let
denote a vector with ones at the beginning and the rest zeros. Then,
Further,
Thus, we can rewrite the above inequality as
We can now define
Clearly
. We will now relax the requirement of
. Multiplying by
, we get
If
, it will also satisfy this equation. Moreover, if
satisfies this, then belongs to the null space of . Since
is full rank, has to be in the null space of also. Thus, the two conditions
and are equivalent. We note that the off-diagonal entries in
are bounded by while the main diagonal consists of all ones. So, we can write
Suppose
. Then . Thus
This gives us
where indicates component-wise inequality. Taking an entry-wise absolute value on both sides, we get
The last part is due to the fact that all entries in the vector
and the matrix are non-negative and the entries in are dominated by . Further,
In the above we used the fact that
. We can now define a new set
Clearly,
. We note that
is unbounded since if , then . Thus, in order to study its behavior, it is sufficient to consider the set of unit-norm vectors
. We construct the new set as
Note that we replaced
in formulating the description of and the condition is automatically enforced since . Clearly
. In order to satisfy the requirement
, we need to have as large as possible. Since this quantity only considers the first
entries in , the energy in should be concentrated inside the first entries to maximize this quantity. However, the entries in
are restricted by the third requirement in . We can maximize it by choosing
for the first
entries in . We then get
This gives us
This is a necessary condition for
to be non-empty. Thus, if
then the requirement
is not satisfied and is empty. Consequently,
is empty and the theorem is proved.
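To see the equivalence at work numerically, here is a small, hypothetical experiment (reusing the `basis_pursuit_lp` sketch from the introduction; the dictionary, sparsity level, and random seed are illustrative choices) in which a sufficiently sparse representation is recovered exactly by BP:

```python
import numpy as np

# A small recovery experiment reusing the basis_pursuit_lp sketch above.
rng = np.random.default_rng(0)
n, N, k = 32, 64, 3                      # signal dim, number of atoms, sparsity
D = rng.standard_normal((n, N))
D /= np.linalg.norm(D, axis=0)           # unit-norm atoms

x0 = np.zeros(N)
support = rng.choice(N, size=k, replace=False)
x0[support] = rng.standard_normal(k)
y = D @ x0                               # exactly sparse representation

x_bp = basis_pursuit_lp(D, y)            # solve the convex relaxation
print(np.max(np.abs(x_bp - x0)))         # typically negligible for small k
```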
We present another result which is based on
Proof. Unique solution of (20.4).
Let
and . We have
.
By Theorem 18.23,
is the unique sparsest solution.
Unique solution of (20.7)
We need to show that for any
satisfying , we must have . We will show that any vector in the null space of
exhibits less than % concentration on ; i.e., for every
Now
Subtracting both sides with
we get
Let
denote an matrix formed from the rows of corresponding to the indices in . Then
for some is the negative of the inner product of some row in with . We know that
where
is the max-column-sum norm of . This gives us
In any column of
the number of entries is . One of them is 0 (corresponding to the diagonal entry in
). Thus, leaving it aside, the rest of the entries are
. By assumption
. Thus, any set of entries in a column which is less than or equal to
entries cannot have a sum exceeding . This gives an upper bound on the max-column-sum of
; i.e.,
Thus, we get
for every
. The rest follows from the fact that for any other
such that , we know that
whenever
where
(thus ).
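The concentration argument can also be probed numerically. The sketch below (my own sampling check, not an exhaustive certificate, with `max_concentration` a hypothetical helper name) estimates how much of its l1 mass a null-space vector of a dictionary can place on its k largest-magnitude entries:

```python
import numpy as np
from scipy.linalg import null_space

def max_concentration(D, k, trials=2000, seed=1):
    """Empirically estimate how much l1 mass a null-space vector of D can
    place on its k largest-magnitude entries (a sampling check, not a proof)."""
    Z = null_space(D)                        # orthonormal basis of the null space
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(trials):
        h = Z @ rng.standard_normal(Z.shape[1])
        mags = np.sort(np.abs(h))[::-1]      # magnitudes in decreasing order
        worst = max(worst, mags[:k].sum() / mags.sum())
    return worst
```

For incoherent dictionaries and small k, a call such as `max_concentration(D, 3)` typically stays well below one half, which is the behaviour the proof exploits.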
20.2.4. BPIC#
In this subsection, we present a stability guarantee for BPIC.
Theorem 20.9
Consider an instance of the (20.8) problem
defined by the triplet
The solution
Proof. As before, we define
Then
We now rewrite the inequality in terms of the Gram matrix
. It is easy to show that:
whenever
is Hermitian. To see this, just notice that
is a real quantity. Hence
. Now, using the triangle inequality, we can easily show that
.
Since
is Hermitian, hence
Now
Only the off-diagonal terms of
remain in the sum, which are all dominated by . Thus, we get
This is valid since
. Since
is the optimal solution of (20.8), hence
Let
and . By a simple permutation of the columns of
, we can bring the entries in to the first entries, making . We will make this assumption going forward without loss of generality.
Let
be the corresponding support vector (ones in the first K places and 0 in the rest). From our previous analysis, we recall that
Thus
is the sum of the first terms of . Considering
as a vector and using the - norm relation , we get
Thus,
Putting this back in the previous inequality
We note that this inequality is valid only if
This condition can be reformulated as
Rewriting the bound on
we get
which is the desired result.
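For reference, the BPIC program itself can be handed to a generic convex solver. The following is a minimal sketch using cvxpy, assuming (20.8) is the usual l1 minimization subject to an l2 residual bound; the function name `bpic` and the default solver choice are incidental.

```python
import cvxpy as cp

def bpic(D, y, eps):
    """Sketch of basis pursuit with inequality constraint:
    minimize ||x||_1 subject to ||y - D x||_2 <= eps."""
    x = cp.Variable(D.shape[1])
    problem = cp.Problem(cp.Minimize(cp.norm1(x)),
                         [cp.norm(y - D @ x, 2) <= eps])
    problem.solve()
    return x.value
```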
20.2.5. BPDN#
In this subsection we will examine the
We will focus on the following issues:
Some results from convex analysis useful for our study
Conditions for the minimization of (20.15) over coefficients
supported on a subdictionary
Conditions under which the unique minimizer for a subdictionary is also the global minimizer for (20.15)
Application of (20.15) for sparse signal recovery
Application of (20.15) for identification of sparse signals in presence of noise
Application of (20.15) for identification of sparse signals in presence of Gaussian noise
We recall some definitions and results from convex analysis which will help us understand the minimizers of the (20.15) problem.
Convex analysis for real-valued functions over the vector space
The subscript
We consider real-valued functions over the inner product space
A real valued convex function
The objective function for the problem (20.15) is
Clearly,
We suggest that readers review the material in Subgradients.
For any function
The elements of subdifferential set are called subgradients.
If
where
For convex functions, the subdifferential of a sum is the (Minkowski) sum of the subdifferentials (Theorem 9.222); i.e.,
By Fermat’s optimality condition
(Theorem 9.233),
if
We are specifically interested in the subdifferential of the function
Theorem 20.10
Let
. .
The proof is skipped.
We recall that the signum function for complex numbers is defined as
The subdifferential for
when . when .
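The signum/subdifferential structure above is exactly what proximal methods exploit for this objective. Below is a minimal ISTA sketch (my own illustration, not the text's algorithm), assuming (20.15) has the usual penalized form, one half of the squared residual plus gamma times the l1 norm, with real-valued data; the step size uses the spectral norm of the dictionary.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1: entrywise soft thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def bpdn_ista(D, y, gamma, iters=500):
    """A minimal ISTA sketch for min_x 0.5*||y - D x||_2^2 + gamma*||x||_1
    (real-valued data assumed; step size 1/L with L = ||D||_2^2)."""
    L = np.linalg.norm(D, 2) ** 2            # Lipschitz constant of the gradient
    x = np.zeros(D.shape[1])
    for _ in range(iters):
        grad = D.T @ (D @ x - y)             # gradient of the quadratic term
        x = soft_threshold(x - grad / L, gamma / L)
    return x
```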
20.2.5.1. Restricted Minimizers#
Suppose
index a subdictionary . is a linearly independent collection of atoms. Hence a unique
best approximation of using the atoms in can be obtained using least squares techniques. We define the orthogonal projection operator
And we get
The approximation is orthogonal to the residual; i.e.,
. There is a unique coefficient vector
supported on that synthesizes the approximation . We also have
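In code, the restricted best approximation and its coefficient vector can be obtained with an ordinary least-squares solve over the chosen atoms. This is a sketch; `Lambda` is assumed to be a list of column indices selecting linearly independent atoms, and the helper name is illustrative.

```python
import numpy as np

def restricted_least_squares(D, y, Lambda):
    """Best approximation of y over the subdictionary indexed by Lambda.
    Returns the coefficient vector supported on Lambda and the orthogonal
    projection of y onto the span of those atoms (a sketch)."""
    D_sub = D[:, Lambda]
    c_sub, *_ = np.linalg.lstsq(D_sub, y, rcond=None)
    c = np.zeros(D.shape[1])
    c[Lambda] = c_sub
    return c, D_sub @ c_sub                  # coefficients and the projection
```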
Theorem 20.11 (Minimization of
Let
where
Proof. Since we are restricted
The objective function simplifies to
We define
. Now, both
and belong to the column space of while is orthogonal to it. Hence
Thus, using the Pythagorean theorem, we get
We can rewrite
as
Define
Then
Note that the term
is constant. It is the squared norm of the least squares error. Thus, minimizing
over the coefficient vectors supported on is equivalent to minimizing over the same support set. Note that
We can write
as
Note that
depends only on the entries in which are part of the support . We can replace the variable
with and rewrite as
Since the atoms indexed by
are linearly independent, has full column rank. Thus, the quadratic term
is strictly convex. Since
is also convex, is strictly convex and its minimizer is unique. Since
is strictly convex and unconstrained, is a necessary and sufficient condition for the coefficient vector to minimize . The gradient of the first (quadratic) term is
For the second term, we have to consider its subdifferential
. Thus, at
it follows that
where
is some subgradient in . Premultiplying with
we get
Finally, we recall that
. Thus, we get the desired result
Some bounds follow as a result of this theorem. Readers may wish to review the material in Matrix Norms.
Theorem 20.12
Suppose that
Proof. We start with
we take the
norm on both sides and apply some norm bounds
The last inequality is valid since, from Theorem 20.10, we have:
. Now let us premultiply (20.18) with
and apply the norm
In this derivation, we used facts like
, and .
20.2.5.2. The Correlation Condition#
So far we have established a condition which
ensures that
Theorem 20.13 (Correlation condition for global minimizer)
Assume that
where
guarantees that
Proof. Let
What we need is a condition which ensures that the value of the objective function
also increases if we change any other component of
If this happens, then
will become a local minimizer of . Further, since
is convex, will also be a global minimizer.
Towards this, let
Let us expand the L.H.S. of this inequality:
Here we used the fact that
. Note that since
is supported on and , hence
Thus
We should also simplify the first bracket.
Similarly,
Canceling the like terms, we get
Thus,
Recall that since
is supported on , hence . We can further split the middle term by adding and subtracting
. Thus, we can write
The term
is strictly positive, giving us
Using the lower triangle inequality, we can write
Using the linearity of the inner product, we can take out
:
(20.21)#
Let us simplify this expression. Since
is the unique minimizer over coefficients in , hence using Theorem 20.11
where
. Thus
using the fact that
. Also, we recall that
. Putting the back in (20.21), we obtain:
(20.22)#
In (20.22), the L.H.S. is non-negative (our real goal) whenever the term in the bracket on the R.H.S. is non-negative (since
is positive). Therefore, we want that
This can be rewritten as
Since this condition should hold for every
, we maximize the L.H.S. and minimize the R.H.S. over . We get
Recall that
is orthogonal to the space spanned by the atoms in . Hence
This gives us the desired sufficient condition
This condition still uses
. We know that . Let us simplify as follows:
Another way to understand this is as follows. For any vector
Thus
Thus, it is also sufficient that
20.2.5.3. Exact Recovery Coefficient#
We recall that the Exact Recovery Coefficient for a subdictionary is defined as
Thus, the sufficient condition can be rewritten as
Note that the L.H.S. in both sufficient conditions is always non-negative.
Hence, if the R.H.S. is negative (i.e.
), the sufficient condition is useless. On the other hand, if
, then a sufficiently high can always be chosen to satisfy the condition in (20.20). At the same time, as
, the optimum minimizer is .
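Numerically, the ERC of a subdictionary is straightforward to evaluate. The sketch below assumes the standard definition, ERC(Lambda) = 1 - max over atoms outside Lambda of the l1 norm of the pseudoinverse of the subdictionary applied to that atom, which is taken to match the definition recalled above; the function name is illustrative.

```python
import numpy as np

def exact_recovery_coefficient(D, Lambda):
    """ERC(Lambda) = 1 - max_{w not in Lambda} ||pinv(D_Lambda) d_w||_1.
    Sketch; Lambda must be a proper subset of the atom indices."""
    pinv = np.linalg.pinv(D[:, Lambda])
    others = [w for w in range(D.shape[1]) if w not in set(Lambda)]
    return 1.0 - max(np.linalg.norm(pinv @ D[:, w], 1) for w in others)
```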
How do we interpret the L.H.S.
Definition 20.2
Given a non-zero signal
If
This is known as the maximum correlation [79] of a signal with a dictionary.
Essentially, for any signal we normalize it and then find out
its maximum inner product (absolute value) with atoms in
the dictionary
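This quantity is a one-liner to compute. A small sketch (atoms assumed unit-norm, helper name illustrative):

```python
import numpy as np

def max_correlation(y, D):
    """maxcor(y) = max_w |<y / ||y||_2, d_w>| over the unit-norm atoms d_w of D."""
    return np.max(np.abs(D.conj().T @ y)) / np.linalg.norm(y)
```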
We can see that
We can now interpret
Therefore, the sufficient condition in Theorem 20.13 is
strongest when the magnitude of the residual
Since the maximum correlation of the residual never exceeds one, we obtain the following (much weaker) result.
Corollary 20.1
Let
Then any coefficient vector
20.2.5.4. Applications of Penalization#
Having set up the basic results, we can now study the applications of (20.15).
Theorem 20.14 (BPDN reconstruction guarantees using ERC)
Let
Let
Support of
is contained in .
The error between
and the optimal coefficient vector satisfies
In particular,
contains every index in for which
Moreover, the minimizer
is unique.
Proof.
Since the sufficient condition of the correlation condition (Theorem 20.13) is satisfied,
which minimizes over coefficient vectors in is also a global minimizer of . Since
, hence . For claim 2, an application of Theorem 20.12 gives us
Since the convex function
is strictly convex, is the unique global minimizer. For claim 3, suppose
for some index for which
Then
But
This violates the bound that
Thus,
contains every index for which
We can formulate a simpler condition in terms of coherence of the dictionary.
Theorem 20.15 (BPDN reconstruction guarantees using coherence)
Suppose that
Let
Support of
is contained in and
The distance between
and the optimal coefficient vector satisfies
In particular,
contains every index in for which
Moreover, the minimizer
is unique.
Proof.
We recall from Theorem 18.75 the coherence bounds on ERC as
Thus,
A direct application of Theorem 20.14 validates claims 1 and 4.
We recall from Theorem 18.37 the upper bound on the norm of the inverse Gram matrix of a subdictionary as
Putting this in Theorem 20.14 validates claims 2 and 3.
20.2.5.5. Exact Sparse Reconstruction Problem#
We now show how one can reconstruct an exactly sparse signal by solving the convex program (20.15).
Theorem 20.16 (BPDN exact sparse recovery guarantee)
Assume that
There is a positive number
for which implies that . In the limit as
, we have .
Proof.
Since there is no noise, the best
approximation of over itself and the corresponding coefficient vector is
Therefore
Thus, the correlation condition is in force for every positive value of
. Thus, as per Theorem 20.14, the minimizer
of the convex program (20.15) must be supported inside . Moreover, we have
Clearly, as
, we have . Finally, recall that
contains every index in for which
In order for every index in
to be part of , we require
Choosing the L.H.S. to be
, we get an explicit upper bound on such that leads to complete discovery of the support.
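As an illustrative, hypothetical experiment, one can reuse the `bpdn_ista` sketch together with the exact-sparse pair D, x0, y = D @ x0 from the earlier sketches and watch the minimizer approach the true coefficients as the penalty parameter shrinks:

```python
import numpy as np

# Reuses bpdn_ista and the exact-sparse test problem (D, x0, y) defined in the
# earlier sketches.  As gamma decreases, the BPDN minimizer (supported inside
# the true support when the theorem applies) approaches x0.
for gamma in (1e-1, 1e-2, 1e-3):
    x_gamma = bpdn_ista(D, y, gamma, iters=2000)
    print(gamma, np.max(np.abs(x_gamma - x0)))
```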