11.1. Basic Subgradient Method
The theory of subgradients for convex functions is discussed in Subgradients. This section discusses the basic subgradient method for optimization. The primary reference for this section is [15].
Recall that a vector $g \in \mathbb{R}^n$ is a subgradient of $f$ at a point $x$ if

$$f(y) \geq f(x) + \langle g, y - x \rangle \quad \text{for every } y.$$
11.1.1. Unconstrained Minimization
We start with the simple case of minimizing a convex function $f : \mathbb{R}^n \to \mathbb{R}$.

- The optimal value shall be denoted by $f^*$.
- We shall assume that the optimal value is finite (i.e., $f^* > -\infty$).
- The set of optimal minimizers shall be denoted by $X^*$.
- Since $f$ attains its optimal value, $X^*$ is nonempty.
- A global minimizer will be denoted by $x^*$.
The basic subgradient method uses the following iteration:

$$x^{k+1} = x^k - t_k g^k.$$

In this iteration:

- $x^k$ is the current ($k$-th) iterate.
- $g^k$ is any subgradient of $f$ at $x^k$. We have $g^k \in \partial f(x^k)$.
- $t_k > 0$ is the step size for the $k$-th iteration in the negative subgradient direction $-g^k$.
- $x^{k+1}$ is the new ($(k+1)$-th) iterate.
- $t_k \| g^k \|_2$ denotes the step length.
Thus, in each iteration of the subgradient method, we take a step in the direction of a negative subgradient.
11.1.1.1. Descent Directions
Observation 11.1 (Negative subgradient need not be a descent direction)
- Recall from Definition 10.22 that a descent direction of $f$ at $x$ is a direction along which the directional derivative is negative; i.e., $f'(x; d) < 0$.
- If $d$ is a descent direction at $x$, then there exists a step size $t > 0$ such that $f(x + t d) < f(x)$.
- If $f$ is a differentiable function and $d = -\nabla f(x) \neq 0$, then $f'(x; d) = -\| \nabla f(x) \|_2^2 < 0$. Hence, the negative gradient is always a descent direction if the gradient is nonzero.
- However, for a nondifferentiable function $f$, a negative subgradient may not be a descent direction.
- Recall from the max formula (Theorem 9.219) that

  $$f'(x; d) = \sup \{ \langle d, g \rangle \mid g \in \partial f(x) \}.$$

- Let $\tilde{g} \in \partial f(x)$ be the chosen subgradient at an iteration of the subgradient method. Then

  $$f'(x; -\tilde{g}) = \sup \{ \langle -\tilde{g}, g \rangle \mid g \in \partial f(x) \}.$$

- It is possible that the quantity $\langle \tilde{g}, g \rangle$ is negative for some other subgradient $g$ at $x$.
- Hence the directional derivative along $-\tilde{g}$ may be positive.
- This argument shows that a negative subgradient is not necessarily a descent direction; a small numerical example follows below.
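For a concrete illustration of this phenomenon, consider $f(x) = |x_1| + 2 |x_2|$ at the point $x = (1, 0)$, where the subdifferential is $\{1\} \times [-2, 2]$. The vector $g = (1, 2)$ is a valid subgradient, yet moving along $-g$ increases $f$. The following minimal Python/NumPy sketch (the function and the chosen point are assumptions for illustration, not from the text) checks this numerically.

```python
import numpy as np

# Illustrative function: f(x) = |x1| + 2|x2|.
# At x = (1, 0) the subdifferential is {1} x [-2, 2].
f = lambda x: abs(x[0]) + 2 * abs(x[1])

x = np.array([1.0, 0.0])
g = np.array([1.0, 2.0])  # a valid subgradient at x

# f(x - t*g) - f(x) = (1 - t) + 4t - 1 = 3t > 0 for small t > 0,
# so the negative subgradient -g is not a descent direction here.
for t in [1e-1, 1e-2, 1e-3]:
    print(t, f(x - t * g) - f(x))
```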
11.1.1.2. Tracking of the Best Estimate
- Since the negative subgradient $-g^k$ may not be a descent direction, it is quite likely that $f(x^{k+1}) > f(x^k)$.
- Even if the selected $-g^k$ is a descent direction at $x^k$, it is possible that the step size $t_k$ is such that $f(x^{k+1}) > f(x^k)$.
- This happens since typically no local line search is invoked to select an appropriate step size in a subgradient method.
- Since the negative subgradient itself may not be a descent direction, running a local line search would be futile.

The alternative is to track the best estimate of the optimal value so far using the following rule:

$$f^k_{\text{best}} = \min \{ f^{k-1}_{\text{best}}, f(x^k) \}.$$

If the best estimate has decreased, we also update the estimate of the minimizer from $x_{\text{best}}$ to $x^k$. It is easy to see that

$$f^k_{\text{best}} = \min \{ f(x^1), \dots, f(x^k) \}.$$

We can see that the sequence $\{ f^k_{\text{best}} \}$ is a nonincreasing sequence. Hence it has a limit.
The success of the subgradient method depends on whether the limit equals the optimal value.
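The tracking rule is just a running minimum over the function values seen so far. A minimal sketch, assuming the values $f(x^k)$ have been collected in an array (the numbers below are made up):

```python
import numpy as np

# Hypothetical sequence of f(x^k) values from a subgradient run; it need not be monotone.
f_values = np.array([5.0, 6.2, 4.1, 4.8, 3.7, 3.9])

# f_best^k = min(f(x^1), ..., f(x^k)) -- nonincreasing by construction.
f_best = np.minimum.accumulate(f_values)
print(f_best)  # [5.  5.  4.1 4.1 3.7 3.7]
```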
11.1.1.3. Step Size Selection
In a gradient descent method, one typically uses a line search to select the step size. Hence, the step size depends on the current iterate and the function values in its neighborhood.

In a subgradient method, the step size selection is different. Typically, the step sizes (or step lengths) are determined beforehand and do not depend on the data computed during the execution of the algorithm. The following are some common methods for step size selection; a short coding sketch of these rules is given after the list.
- Constant step size. $t_k = t$ for some positive constant $t$ which is independent of $k$.
- Constant step length. $t_k = \frac{c}{\| g^k \|_2}$ where $c > 0$ is a predefined step length. Note that with this choice of step size, we have

  $$\| x^{k+1} - x^k \|_2 = t_k \| g^k \|_2 = c.$$

- Square summable but not summable. We choose step sizes that satisfy the following constraints:

  $$t_k > 0, \quad \sum_{k=1}^{\infty} t_k^2 < \infty, \quad \sum_{k=1}^{\infty} t_k = \infty.$$

  As an example, fix some $a > 0$ and $b \geq 0$ and let $t_k = \frac{a}{b + k}$.

- Nonsummable diminishing. We choose step sizes that satisfy the following constraints:

  $$t_k > 0, \quad \lim_{k \to \infty} t_k = 0, \quad \sum_{k=1}^{\infty} t_k = \infty.$$

  As an example, fix some $a > 0$ and let $t_k = \frac{a}{\sqrt{k}}$.

- Nonsummable diminishing step lengths. We choose step sizes as $t_k = \frac{c_k}{\| g^k \|_2}$ where

  $$c_k > 0, \quad \lim_{k \to \infty} c_k = 0, \quad \sum_{k=1}^{\infty} c_k = \infty.$$
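The rules above are easy to encode as functions of the iteration index $k$ and, for the step-length based rules, the current subgradient norm. The sketch below is one possible encoding; the constants `t`, `c`, `a`, `b` are illustrative defaults, not values prescribed by the text.

```python
import numpy as np

def constant_step_size(k, g_norm, t=0.1):
    # t_k = t, independent of k
    return t

def constant_step_length(k, g_norm, c=0.1):
    # t_k = c / ||g^k||_2, so that each step has length c
    return c / g_norm

def square_summable(k, g_norm, a=1.0, b=0.0):
    # t_k = a / (b + k): square summable but not summable (k = 1, 2, ...)
    return a / (b + k)

def nonsummable_diminishing(k, g_norm, a=1.0):
    # t_k = a / sqrt(k): diminishing but not summable
    return a / np.sqrt(k)

def diminishing_step_length(k, g_norm, a=1.0):
    # t_k = c_k / ||g^k||_2 with c_k = a / sqrt(k)
    return a / (np.sqrt(k) * g_norm)
```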
11.1.1.4. The Subgradient Method
We now present the overall template for the subgradient method.
Algorithm 11.1 (The subgradient method)
Inputs

- $f$ : function to be minimized
- $x^1$ : initial guess for the minimizer
- $t_1$ : initial step size

Outputs

- $f_{\text{best}}$ : best estimate of the minimum value of $f$
- $x_{\text{best}}$ : the best estimate of the minimizer

Initialization

- $f_{\text{best}} = f(x^1)$.
- $x_{\text{best}} = x^1$.

General iteration: for $k = 1, 2, \dots$, repeat the following steps.

1. Select a subgradient $g^k$ from $\partial f(x^k)$.
2. Select a step size: $t_k$.
3. Update minimizer estimate: $x^{k+1} = x^k - t_k g^k$. Set $k = k + 1$.
4. Compute $f(x^k)$.
5. If $f(x^k) < f_{\text{best}}$:
   - Update best estimate of optimal value: $f_{\text{best}} = f(x^k)$.
   - Update best estimate of minimizer: $x_{\text{best}} = x^k$.
6. Otherwise, retain the current values of $f_{\text{best}}$ and $x_{\text{best}}$.
7. If a stopping criterion is met, then STOP.
For a concrete implementation, we shall need the following:
- A way to compute $f(x)$ at every $x$.
- A way to pick a subgradient $g \in \partial f(x)$ at every $x$.
- A step size selection strategy.
- A stopping criterion.

Note that there is no need to specify the complete subdifferential $\partial f(x)$ at every $x$; a single subgradient suffices (see the sketch below).
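The following is a minimal NumPy sketch of Algorithm 11.1; it is illustrative rather than a reference implementation. The caller supplies `f` (function values), `subgradient` (one subgradient per point), a `step_size` rule mapping the iteration index and subgradient norm to $t_k$, and an iteration budget as the stopping criterion.

```python
import numpy as np

def subgradient_method(f, subgradient, x1, step_size, max_iter=1000):
    """Basic subgradient method; returns the best value and best point found."""
    x = np.asarray(x1, dtype=float)
    f_best, x_best = f(x), x.copy()
    for k in range(1, max_iter + 1):
        g = subgradient(x)                    # any g in the subdifferential at x^k
        t = step_size(k, np.linalg.norm(g))   # predetermined step size rule
        x = x - t * g                         # x^{k+1} = x^k - t_k g^k
        fx = f(x)
        if fx < f_best:                       # track the best estimate so far
            f_best, x_best = fx, x.copy()
    return f_best, x_best

# Illustrative usage on f(x) = ||x||_1, whose minimum value is 0 at the origin.
f = lambda x: np.sum(np.abs(x))
subgrad = lambda x: np.sign(x)                # a valid subgradient of the l1 norm
f_best, x_best = subgradient_method(
    f, subgrad, x1=np.array([2.0, -3.0, 1.5]),
    step_size=lambda k, g_norm: 1.0 / np.sqrt(k),  # nonsummable diminishing rule
)
print(f_best, x_best)
```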
11.1.2. Convergence Analysis
Our goal is to show that

$$\lim_{k \to \infty} f^k_{\text{best}} = f^*.$$
We shall make the following assumptions in the analysis.
- $f : \mathbb{R}^n \to \mathbb{R}$ is a real valued convex function.
- The optimal value $f^*$ is finite and a minimizer $x^*$ exists.
- The norm of the subgradients is bounded; i.e., there exists $G > 0$ such that

  $$\| g \|_2 \leq G \quad \text{for every } g \in \partial f(x) \text{ and every } x.$$

  Recall from Theorem 9.232 that if a convex function is Lipschitz continuous, then its subgradients are bounded.

- A number $R$ that satisfies

  $$R \geq \| x^1 - x^* \|_2$$

  for some $x^* \in X^*$ is known beforehand. $R$ can be interpreted as an upper bound on the distance of the initial point to the set of optimal minimizers.
Theorem 11.1 (Suboptimality bound for subgradient method)
Let $\| x^1 - x^* \|_2 \leq R$. After $k$ iterations of the subgradient method, we have

$$f^k_{\text{best}} - f^* \leq \frac{\| x^1 - x^* \|_2^2 + \sum_{i=1}^k t_i^2 \| g^i \|_2^2}{2 \sum_{i=1}^k t_i}. \tag{11.2}$$

If the subgradients of $f$ are bounded as $\| g \|_2 \leq G$ for every subgradient, then

$$f^k_{\text{best}} - f^* \leq \frac{R^2 + G^2 \sum_{i=1}^k t_i^2}{2 \sum_{i=1}^k t_i}. \tag{11.3}$$
Proof. Let $x^*$ be a minimizer of $f$ with $\| x^1 - x^* \|_2 \leq R$. Hence $f(x^*) = f^*$.

We first develop an upper bound on the distance between the $(k+1)$-th iterate and the minimizer $x^*$:

$$
\begin{aligned}
\| x^{k+1} - x^* \|_2^2 &= \| x^k - t_k g^k - x^* \|_2^2 \\
&= \| x^k - x^* \|_2^2 - 2 t_k \langle g^k, x^k - x^* \rangle + t_k^2 \| g^k \|_2^2 \\
&\leq \| x^k - x^* \|_2^2 - 2 t_k (f(x^k) - f^*) + t_k^2 \| g^k \|_2^2,
\end{aligned}
$$

where the last inequality follows from the subgradient inequality $f^* = f(x^*) \geq f(x^k) + \langle g^k, x^* - x^k \rangle$.

Applying this inequality recursively, we get

$$\| x^{k+1} - x^* \|_2^2 \leq \| x^1 - x^* \|_2^2 - 2 \sum_{i=1}^k t_i (f(x^i) - f^*) + \sum_{i=1}^k t_i^2 \| g^i \|_2^2.$$

Using the fact that $\| x^{k+1} - x^* \|_2^2 \geq 0$, we obtain

$$2 \sum_{i=1}^k t_i (f(x^i) - f^*) \leq \| x^1 - x^* \|_2^2 + \sum_{i=1}^k t_i^2 \| g^i \|_2^2.$$

On the L.H.S., we can see that

$$\sum_{i=1}^k t_i (f(x^i) - f^*) \geq \left( \sum_{i=1}^k t_i \right) \min_{i=1,\dots,k} (f(x^i) - f^*) = \left( \sum_{i=1}^k t_i \right) (f^k_{\text{best}} - f^*).$$

Combining this with the previous inequality, we get

$$2 \left( \sum_{i=1}^k t_i \right) (f^k_{\text{best}} - f^*) \leq \| x^1 - x^* \|_2^2 + \sum_{i=1}^k t_i^2 \| g^i \|_2^2.$$

This can be rewritten as

$$f^k_{\text{best}} - f^* \leq \frac{\| x^1 - x^* \|_2^2 + \sum_{i=1}^k t_i^2 \| g^i \|_2^2}{2 \sum_{i=1}^k t_i}$$

as desired.

Applying the upper bound $G$ on the subgradient norms and the bound $\| x^1 - x^* \|_2 \leq R$, we obtain

$$f^k_{\text{best}} - f^* \leq \frac{R^2 + G^2 \sum_{i=1}^k t_i^2}{2 \sum_{i=1}^k t_i}$$

as desired.
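The bound (11.3) can be checked numerically on a problem where $f^*$, $G$, and $R$ are known. The sketch below uses the illustrative choice $f(x) = \|x\|_1$ on $\mathbb{R}^2$ (so $f^* = 0$ and $G = \sqrt{2}$), a constant step size, and a known starting point; none of these specific numbers come from the text.

```python
import numpy as np

f = lambda x: np.sum(np.abs(x))       # f* = 0; subgradient norms bounded by G = sqrt(2)
subgrad = lambda x: np.sign(x)

x = np.array([3.0, -4.0])             # ||x^1 - x*||_2 = 5, so take R = 5
R, G, t, k = 5.0, np.sqrt(2), 0.01, 2000

f_best = f(x)
for i in range(1, k + 1):
    x = x - t * subgrad(x)            # constant step size t_i = t
    f_best = min(f_best, f(x))

rhs = (R**2 + G**2 * t**2 * k) / (2 * t * k)   # R.H.S. of (11.3)
print(f_best, rhs)                    # observed gap (since f* = 0) stays below the bound
```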
Based on this result, we can provide upper bounds on the estimation error for different step size selection strategies.
11.1.2.1. Constant Step Size
Corollary 11.1 (Convergence of subgradient method with constant step size)
If $t_k = t$ for all $k$, then we have

$$f^k_{\text{best}} - f^* \leq \frac{R^2}{2 k t} + \frac{G^2 t}{2}.$$

The subgradient method converges to within $\frac{G^2 t}{2}$ of the optimal value; i.e.,

$$\lim_{k \to \infty} f^k_{\text{best}} - f^* \leq \frac{G^2 t}{2}.$$

Also, $f^k_{\text{best}} - f^* \leq G^2 t$ within at most $\frac{R^2}{G^2 t^2}$ iterations.

Proof. By putting $t_k = t$ into (11.3), we obtain

$$f^k_{\text{best}} - f^* \leq \frac{R^2 + G^2 t^2 k}{2 t k} = \frac{R^2}{2 k t} + \frac{G^2 t}{2}.$$

Now, $\frac{R^2}{2 k t} \leq \frac{G^2 t}{2}$ holds if and only if $k \geq \frac{R^2}{G^2 t^2}$. Hence, for every $k \geq \frac{R^2}{G^2 t^2}$, we have $f^k_{\text{best}} - f^* \leq G^2 t$.

Since $\frac{R^2}{2 k t} \to 0$ as $k \to \infty$, we see that $\lim_{k \to \infty} f^k_{\text{best}} - f^* \leq \frac{G^2 t}{2}$.
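As an illustrative numerical instance (the values of $R$, $G$, and $t$ are assumptions, not from the text): with $R = 1$, $G = 2$, and $t = 0.01$, the bound becomes $f^k_{\text{best}} - f^* \leq \frac{1}{0.02\, k} + 0.02$. The method therefore converges to within $0.02$ of $f^*$, and the bound drops below $G^2 t = 0.04$ once $k \geq \frac{R^2}{G^2 t^2} = 2500$.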
11.1.2.2. Constant Step Length
Corollary 11.2 (Convergence of subgradient method with constant step length)
Let $t_k = \frac{c}{\| g^k \|_2}$ for all $k$; i.e., the step length is fixed at $c$. Then we have

$$f^k_{\text{best}} - f^* \leq \frac{G R^2}{2 c k} + \frac{G c}{2}.$$

The subgradient method converges to within $\frac{G c}{2}$ of the optimal value; i.e.,

$$\lim_{k \to \infty} f^k_{\text{best}} - f^* \leq \frac{G c}{2}.$$

Proof. By putting $t_k = \frac{c}{\| g^k \|_2}$ into (11.2), we obtain

$$f^k_{\text{best}} - f^* \leq \frac{R^2 + \sum_{i=1}^k t_i^2 \| g^i \|_2^2}{2 \sum_{i=1}^k t_i} = \frac{R^2 + c^2 k}{2 \sum_{i=1}^k t_i}.$$

Also, note that

$$\sum_{i=1}^k t_i = \sum_{i=1}^k \frac{c}{\| g^i \|_2} \geq \frac{c k}{G}.$$

Putting this back, we have

$$f^k_{\text{best}} - f^* \leq \frac{G (R^2 + c^2 k)}{2 c k} = \frac{G R^2}{2 c k} + \frac{G c}{2}.$$

Taking the limit $k \to \infty$, we obtain $\lim_{k \to \infty} f^k_{\text{best}} - f^* \leq \frac{G c}{2}$.
11.1.2.3. Square Summable Step Sizes
Corollary 11.3 (Convergence of subgradient method with square summable step sizes)
Let the step sizes satisfy the rule

$$t_k > 0, \quad \sum_{k=1}^{\infty} t_k^2 < \infty, \quad \sum_{k=1}^{\infty} t_k = \infty.$$

Further, let $T = \sum_{k=1}^{\infty} t_k^2$. Then we have

$$f^k_{\text{best}} - f^* \leq \frac{R^2 + G^2 T}{2 \sum_{i=1}^k t_i}.$$

The subgradient method converges to $f^*$; i.e., $\lim_{k \to \infty} f^k_{\text{best}} = f^*$.

Proof. By putting the step size constraints into (11.3), we obtain

$$f^k_{\text{best}} - f^* \leq \frac{R^2 + G^2 \sum_{i=1}^k t_i^2}{2 \sum_{i=1}^k t_i} \leq \frac{R^2 + G^2 T}{2 \sum_{i=1}^k t_i}.$$

Taking the limit $k \to \infty$, the numerator on the R.H.S. stays finite while the denominator grows without bound (since the step sizes are not summable). Hence $f^k_{\text{best}} - f^* \to 0$.
11.1.2.4. Diminishing Step Sizes
Corollary 11.4 (Convergence of subgradient method with diminishing step sizes)
Let the step sizes satisfy the rule

$$t_k > 0, \quad \lim_{k \to \infty} t_k = 0, \quad \sum_{k=1}^{\infty} t_k = \infty.$$

Then the subgradient method converges to $f^*$; i.e., $\lim_{k \to \infty} f^k_{\text{best}} = f^*$.
Proof. We shall show that the R.H.S. of (11.3) converges to 0 as $k \to \infty$.

- Let $\epsilon > 0$.
- There exists an integer $N_1$ such that $t_i \leq \frac{\epsilon}{G^2}$ for all $i > N_1$ since $t_k \to 0$.
- There also exists an integer $N_2$ such that

  $$\sum_{i=1}^{N_2} t_i \geq \frac{1}{\epsilon} \left( R^2 + G^2 \sum_{i=1}^{N_1} t_i^2 \right)$$

  since the step size sequence is nonsummable.

- Let $N = \max \{ N_1, N_2 \}$.
- Then, for every $k > N$, we have

  $$\frac{R^2 + G^2 \sum_{i=1}^k t_i^2}{2 \sum_{i=1}^k t_i} = \frac{R^2 + G^2 \sum_{i=1}^{N_1} t_i^2}{2 \sum_{i=1}^k t_i} + \frac{G^2 \sum_{i=N_1+1}^k t_i^2}{2 \sum_{i=1}^k t_i}.$$

- We can see that

  $$\frac{R^2 + G^2 \sum_{i=1}^{N_1} t_i^2}{2 \sum_{i=1}^k t_i} \leq \frac{R^2 + G^2 \sum_{i=1}^{N_1} t_i^2}{2 \sum_{i=1}^{N_2} t_i} \leq \frac{\epsilon}{2}.$$

- Hence the first term is at most $\frac{\epsilon}{2}$.
- For every $i > N_1$, we have $G^2 t_i \leq \epsilon$. Hence

  $$G^2 \sum_{i=N_1+1}^k t_i^2 = \sum_{i=N_1+1}^k (G^2 t_i) t_i \leq \epsilon \sum_{i=N_1+1}^k t_i.$$

- We can now see that

  $$\frac{G^2 \sum_{i=N_1+1}^k t_i^2}{2 \sum_{i=1}^k t_i} \leq \frac{\epsilon \sum_{i=N_1+1}^k t_i}{2 \sum_{i=1}^k t_i} \leq \frac{\epsilon}{2}.$$

- Combining, we have

  $$\frac{R^2 + G^2 \sum_{i=1}^k t_i^2}{2 \sum_{i=1}^k t_i} \leq \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon$$

  for every $k > N$.

- Thus, for every $\epsilon > 0$, there exists $N$ such that for all $k > N$, we have $f^k_{\text{best}} - f^* \leq \epsilon$.
- Hence $\lim_{k \to \infty} f^k_{\text{best}} = f^*$.

Hence the subgradient method converges.
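A quick empirical illustration of this corollary (a sketch; the test function $f(x) = \|x\|_1$, with $f^* = 0$, and all constants are illustrative choices): a nonsummable diminishing step size keeps improving the best value, while a constant step size stalls near $G^2 t / 2$.

```python
import numpy as np

f = lambda x: np.sum(np.abs(x))        # f* = 0
subgrad = lambda x: np.sign(x)

def run(step_size, x1, iters=5000):
    x = np.array(x1, dtype=float)
    f_best = f(x)
    for k in range(1, iters + 1):
        x = x - step_size(k) * subgrad(x)
        f_best = min(f_best, f(x))
    return f_best

x1 = [5.03, -3.07]
print(run(lambda k: 0.1, x1))                # constant step size: stalls near G^2 t / 2 = 0.1
print(run(lambda k: 1.0 / np.sqrt(k), x1))   # diminishing, nonsummable: keeps decreasing
```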
11.1.2.5. Diminishing Step Lengths
Corollary 11.5 (Convergence of subgradient method with diminishing step lengths)
Let the step sizes be chosen as $t_k = \frac{c_k}{\| g^k \|_2}$ where

$$c_k > 0, \quad \lim_{k \to \infty} c_k = 0, \quad \sum_{k=1}^{\infty} c_k = \infty.$$

Then the subgradient method converges to $f^*$; i.e., $\lim_{k \to \infty} f^k_{\text{best}} = f^*$.

Proof. From (11.2) we have

$$f^k_{\text{best}} - f^* \leq \frac{R^2 + \sum_{i=1}^k t_i^2 \| g^i \|_2^2}{2 \sum_{i=1}^k t_i} = \frac{R^2 + \sum_{i=1}^k c_i^2}{2 \sum_{i=1}^k t_i}.$$

We have

$$\sum_{i=1}^k t_i = \sum_{i=1}^k \frac{c_i}{\| g^i \|_2} \geq \frac{1}{G} \sum_{i=1}^k c_i.$$

Hence we have

$$f^k_{\text{best}} - f^* \leq \frac{G \left( R^2 + \sum_{i=1}^k c_i^2 \right)}{2 \sum_{i=1}^k c_i}.$$

Following an argument similar to the proof of Corollary 11.4, the R.H.S. converges to 0 as $k \to \infty$. Hence $\lim_{k \to \infty} f^k_{\text{best}} = f^*$.
11.1.2.6. On the Suboptimality Bound
If we run the subgradient method for $k$ iterations, then (11.3) gives the suboptimality bound

$$f^k_{\text{best}} - f^* \leq \frac{R^2 + G^2 \sum_{i=1}^k t_i^2}{2 \sum_{i=1}^k t_i}.$$

A natural question is how tight this bound can be made by an appropriate selection of the step sizes $t_1, \dots, t_k$.

- Note that the R.H.S. is a convex and symmetric function of $t_1, \dots, t_k$.
- Hence, at the optimal value, we must have $t_1 = t_2 = \dots = t_k$.
- Let $t = t_1 = \dots = t_k$.
- Then the suboptimality bound reduces to

  $$\frac{R^2 + G^2 k t^2}{2 k t}.$$

- This is minimized at $t = \frac{R}{G \sqrt{k}}$.
- In other words, the finite sequence of step sizes that minimizes the suboptimality bound in (11.3) after exactly $k$ iterations is given by

  $$t_i = \frac{R}{G \sqrt{k}}, \quad i = 1, \dots, k.$$

- It must be noted that this suboptimality bound is dependent on the number of iterations.

This choice of constant step size gives us the bound

$$f^k_{\text{best}} - f^* \leq \frac{R G}{\sqrt{k}}.$$
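For example (illustrative numbers): with $R = 10$, $G = 1$, and a budget of $k = 10^4$ iterations, the optimal constant step size is $t = \frac{R}{G \sqrt{k}} = 0.1$ and the resulting guarantee is $f^k_{\text{best}} - f^* \leq \frac{R G}{\sqrt{k}} = 0.1$.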
Theorem 11.2 (A bound on the suboptimality bound)
Let $R, G > 0$ be fixed. Then, for every $k \geq 1$,

$$\min_{t_1, \dots, t_k} \frac{R^2 + G^2 \sum_{i=1}^k t_i^2}{2 \sum_{i=1}^k t_i} = \frac{R G}{\sqrt{k}},$$

where the L.H.S. is the step-size-selection dependent suboptimality bound on $f^k_{\text{best}} - f^*$ after $k$ iterations.

Accordingly, the number of steps required to achieve a guaranteed accuracy of $f^k_{\text{best}} - f^* \leq \epsilon$ is at least $\left( \frac{R G}{\epsilon} \right)^2$, irrespective of the choice of step sizes.
Proof. We have already shown that the suboptimality bound is minimized by the constant step size selection $t_i = \frac{R}{G \sqrt{k}}$, for which the bound equals $\frac{R G}{\sqrt{k}}$.

- Let $\epsilon > 0$ be the required guaranteed accuracy.
- Then $\epsilon \geq \frac{R G}{\sqrt{k}}$ since $\frac{R G}{\sqrt{k}}$ is the minimum guaranteed suboptimality bound after $k$ iterations as per (11.3).
- Hence we have $\sqrt{k} \geq \frac{R G}{\epsilon}$.
- Hence we have $k \geq \left( \frac{R G}{\epsilon} \right)^2$.
We note that the suboptimality bound of $\frac{R G}{\sqrt{k}}$ decreases only as $\frac{1}{\sqrt{k}}$ with the number of iterations.
Observation 11.2 (Interpretation of $\frac{R G}{\epsilon}$)
Initial uncertainty

- Assume that $f$ is Lipschitz continuous with the constant $G$.
- By Lipschitz continuity, we have

  $$f(x^1) - f^* \leq G \| x^1 - x^* \|_2 \leq G R.$$

- Hence $G R$ is an estimate of the initial uncertainty in $f^*$.

Reduction in uncertainty

- If our goal is to achieve $f^k_{\text{best}} - f^* \leq \epsilon$, then $\epsilon$ is an estimate of the final uncertainty in $f^*$.
- Hence $\frac{G R}{\epsilon}$ is the ratio of the initial uncertainty in $f^*$ to the final uncertainty in $f^*$.
- By Theorem 11.2, we require a minimum of $\left( \frac{G R}{\epsilon} \right)^2$ iterations to achieve this much reduction in uncertainty.
This analysis shows that the subgradient method is very slow.
To achieve a 1000x reduction in the uncertainty of $f^*$, we require at least $10^6$ iterations.