11.1. Basic Subgradient Method#

The theory of subgradients for convex functions is discussed in Subgradients. This section discusses basic subgradient method for optimization. Primary references for this section are [15].

Recall that a vector g is a subgradient of f at a point x if

f(y)f(x)+yx,gyRn.

11.1.1. Unconstrained Minimization#

We start with a simple case of minimizing a convex function f:RnR.

  1. The optimal value shall be denoted by f.

  2. We shall assume that the optimal value is finite (i.e., f>).

  3. The set of optimal minimizers shall be denoted by X.

  4. Since f attains its optimal value, hence X is nonempty.

  5. A global minimizer will be denoted by x.

The basic subgradient method uses the following iteration.

(11.1)#xk+1=xktkgk.

In this iteration

  1. xk is the current (k-th) iterate.

  2. gk is any subgradient of f at xk. We have gkf(xk).

  3. tk>0 is the step size for the k-th iteration in the negative subgradient direction gk.

  4. xk+1 is the new (k+1-th) iterate.

  5. tkgk2 denotes the step-length.

Thus, in each iteration of the subgradient method, we take a step in the direction of a negative subgradient.

11.1.1.1. Descent Directions#

Observation 11.1 (Negative subgradient need not be a descent direction)

  1. Recall from Definition 10.22 that a descent direction of f at x is a direction along which the directional derivative is negative; i.e., f(x;d)<0.

  2. If d is a descent direction at x, then there exists a step size t>0 such that

    f(x+td)<f(x).
  3. If f is a differentiable function and d=f(x), then

    f(x;f(x))=f(x)20.
  4. Hence, the negative gradient is always a descent direction if the gradient is nonzero.

  5. However, for a nondifferentiable function f, a negative subgradient may not be a descent direction.

  6. Recall from max formula (Theorem 9.219) that

    f(x;d)=sup{d,g|gf(x)}.
  7. Let g~ be the chosen subgradient at an iteration of the subgradient method.

  8. Then

    f(x;g~)=supgf(x){g~,g}=infgf(x){g~,g}.
  9. It is possible that the quantity g~,g is negative for some other subgradient g at x.

  10. Hence the directional derivative along g~ may be positive.

  11. This argument shows that a negative subgradient is not necessarily a descent direction.

11.1.1.2. Tracking of the Best Estimate#

  1. Since the negative subgradient gk may not be a descent direction, hence it is quite likely that f(xk+1)>f(xk).

  2. Even if the selected gk is a descent direction at xk, it is possible that the step size can be such that f(xk+1)>f(xk).

  3. This happens since typically there is no local line search method invoked to select an appropriate step size in a subgradient method.

  4. Since the negative subgradient itself may not be a descent direction, hence running a local line search will be futile.

  5. The alternative is to track the best estimate of the optimal value so far using the following rule

    fbestk=min{fbestk1,f(xk)}.
  6. If the best estimate has decreased, we also update the estimate of the local minimizer as xbest to xk+1.

  7. It is easy to see that

    fbestk=min{f(x1),,f(xk)}.
  8. We can see that the sequence {fbestk} is a nonincreasing sequence.

  9. Hence it has a limit.

  10. The success of the subgradient method depends on whether the limit equals the optimal value.

11.1.1.3. Step Size Selection#

In a gradient descent method, one typically uses a line search method to select the step size. Hence, it depends on current iterate and the function values in its neighborhood.

In a subgradient method, the step size selection is different. Typically, the step sizes (or step lengths) are determined beforehand and they don’t depend on the data computed during the execution of the algorithm. Following are some of the common methods for step size selection.

  1. Constant step size. tk=t for some positive constant t>0 which is independent of k.

  2. Constant step length.

    tk=cgk2

    where c>0 is a predefined step length. Note that with this choice of step size, we have:

    xk+1xk2=tkgk2=c.
  3. Square summable but not summable. We choose step sizes that satisfy the following constraints:

    tk>0k,k=1tk2<,k=1tk=.

    As an example, fix some a>0 and b0 and let

    tk=ab+k.
  4. Nonsummable diminishing. We choose step sizes that satisfy the following constraints:

    tk>0k,limktk=0,k=1tk=.

    As and example, fix some a>0 and let

    tk=ak.
  5. Nonsummable diminishing step lengths: We choose step sizes as tk=ckgk2 where

    ck>0,limkck=0,k=1ck=.

11.1.1.4. The Subgradient Method#

We now present the overall template for the subgradient method.

Algorithm 11.1 (The subgradient method)

Inputs

  1. f : function to be minimized

  2. x1: initial guess for the minimizer

  3. t1: initial step size

Outputs

  1. fbest1: Best estimate of minimum value of f

  2. xbest1 the best estimate of the minimizer

Initialization

  1. fbest=f(x1).

  2. xbest=x1.

General iteration: for k=1,2,, execute the following steps

  1. Select a subgradient gk from f(xk).

  2. Select a step size: tk.

  3. Update minimizer estimate: xk+1=xktkgk.

  4. k=k+1.

  5. Compute f(xk).

  6. if fbestk1>f(xk):

    1. Update best estimate of optimal value: fbestk=f(xk).

    2. Update best estimate of minimizer: xbestk=xk.

  7. Otherwise, retain current values

    1. fbestk=fbestk1.

    2. xbestk=xbestk1.

  8. If stopping criterion is met, then STOP.

For a concrete implementation, we shall need the following:

  1. A way to compute f(x) at every x.

  2. A way to pick a subgradient gf(x) at every x.

  3. A step size selection strategy.

  4. A stopping criterion.

Note that there is no need to specify the complete subdifferential at every x. The stopping criterion will be developed in the following as part of the convergence analysis of the algorithm.

11.1.2. Convergence Analysis#

Our goal is to show that fbestkf. Towards this, we will establish a suboptimality bound on the estimate of the optimal value after k iterations.

We shall make the following assumptions in the analysis.

  1. f is a real valued convex function.

  2. The optimal value f is finite and the minimizer x exists.

  3. The norm of the subgradients is bounded. i.e., there exists G>0 such that

    g2Ggf(x)x.

    Recall from Theorem 9.232 that if a convex function is Lipschitz continuous, then its subgradients are bounded.

  4. A number R that satisfies

    Rxx12

    for some xX is known beforehand. R can be interpreted as an upper bound on the distance of the initial point to the set of optimal minimizers.

    Rdist(x1,X).

Theorem 11.1 (Suboptimality bound for subgradient method)

Let x1x2R.

After k iterations of the subgradient method, we have

(11.2)#fbestkfR2+i=1kti2gi222i=1kti.

If the subgradients of f are bounded by gG for all gf(x) for every x, then this simplifies to:

(11.3)#fbestkfR2+G2i=1kti22i=1kti.

Proof. Let x be a minimizer of f. From the subgradient inequality, we have

f=f(x)f(xk)+xxk,gk.

Hence

xxk,gkff(xk).

We first develop an upper bound on the distance between the k+1-th iterate and a minimizer.

xk+1x22=xktkgkx22=(xkx)tkgk22=xkx222tkxkx,gk+tk2gk22xkx222tk(f(xk)f)+tk2gk22.

Applying this inequality recursively, we get

xk+1x22x1x222i=1kti(f(xi)f)+i=1kti2gi22.

Using the fact that xk+1x220 and x1x2R, we have

2i=1kti(f(xi)f)R2+i=1kti2gi22.

On the L.H.S., we can see that

i=1kti(f(xi)f)(i=1kti)mini=1,,k(f(xi)f)=(i=1kti)(fbestkf).

Combining this with the previous inequality, we get

2(i=1kti)(fbestkf)R2+i=1kti2gi22.

This can be rewritten as

fbestkfR2+i=1kti2gi222i=1kti

as desired.

Applying the upper bound on the subgradient g2G, we have

fbestkfR2+G2i=1kti22i=1kti

as desired.

Based on this result, we can provide upper bounds on the estimation error for different step size selection strategies.

11.1.2.1. Constant Step Size#

Corollary 11.1 (Convergence of subgradient method with constant step size)

If tk=t for all k, then we have

fbestkfR2+G2t2k2tk.

The subgradient method converges to within G2t/2 of the optimal value f. In other words,

limkfbestkfG2t2.

Also, fbestkfG2t within at most R2/(G2t2) steps.

Proof. By putting tk=t in (11.3), we obtain the desired upper bound. Taking the limit k, we see that

limk(fbestkf)G2t2.

Now,

R2+G2t2k2tkG2tR2+G2t2k2G2t2kR2G2t2kkR2G2t2.

Since fbestk is a nonincreasing sequence, hence for all kR2/(G2t2), we must have

fbestkfG2t.

11.1.2.2. Constant Step Length#

Corollary 11.2 (Convergence of subgradient method with constant step length)

Let c be the constant step length for the subgradient method. Let tk=cgk2 for every k. Then we have

fbestkfR2+c2k2ck/G.

The subgradient method converges to within Gc/2 of the optimal value f. In other words,

limkfbestkfGc2.

Proof. By putting ti=cgi2 in (11.2), we see that

fbestkfR2+i=1kc22i=1kti.

Also, note that

ti=cgi2ticGi=1ktickG1i=1kti1ck/G.

Putting this back, we have

fbestkfR2+c2k2ck/G.

Taking the limit k, we see that

limk(fbestkf)Gc2.

11.1.2.3. Square Summable Step Sizes#

Corollary 11.3 (Convergence of subgradient method with square summable step sizes)

Let the step sizes satisfy the rule

tk>0k,k=1tk2<,k=1tk=.

Further, let T=k=1tk2.

Then we have

fbestkfR2+G2T2i=1ktk.

The subgradient method converges to f.

Proof. By putting the step size constraints into (11.3), we obtain

fbestkfR2+G2T2i=1kti.

Taking the limit k, the denominator grows to while the numerator converges to R2+G2T. Hence

limkfbestkf=0.

11.1.2.4. Diminishing Step Sizes#

Corollary 11.4 (Convergence of subgradient method with diminishing step sizes)

Let the step sizes satisfy the rule

tk>0k,limktk=0,k=1tk=.

Then the subgradient method converges to f.

Proof. We shall show that the R.H.S. of (11.3) converges to 0 as k.

  1. Let ϵ>0.

  2. There exists integer n1 such that tiϵ/G2 for all i>n1 since tk0.

  3. There also exists an integer n2 such that

    i=1n2ti1ϵ(R2+G2i=1n1ti2)

    since ti is nonsummable.

  4. Let n=max{n1,n2}.

  5. Then, for every k>n, we have

    R2+G2i=1kti22i=1kti=R2+G2i=1n1ti22i=1kti+G2i=n1+1kti22i=1n1ti+2i=n1+1kti.
  6. We can see that

    i=1ktii=1n2ti1ϵ(R2+G2i=1n1ti2).
  7. Hence

    R2+G2i=1n1ti22i=1ktiϵ2.
  8. For every i>n1, we have ti<ϵ/G2.

  9. Hence

    G2i=n1+1kti2ϵi=n1+1kti.
  10. We can now see that

    G2i=n1+1kti22i=1n1ti+2i=n1+1kti<G2i=n1+1kti22i=n1+1ktiϵi=n1+1kti2i=n1+1kti=ϵ2.
  11. Combining, we have

    R2+G2i=1kti22i=1kti<ϵ2+ϵ2=ϵ

    for every k>n.

  12. Thus, for every ϵ>0, there exists n such that for all k>n, we have

    R2+G2i=1kti22i=1kti<ϵ.
  13. Hence

    limkR2+G2i=1kti22i=1kti=0.
  14. Hence the subgradient method converges.

11.1.2.5. Diminishing Step Lengths#

Corollary 11.5 (Convergence of subgradient method with diminishing step lengths)

Let the step sizes tk=ckgk2 be chosen such that

ck>0,limkck=0,k=1ck=

where ck is the step length for the k-th iteration.

Then the subgradient method converges to f.

Proof. From (11.2) we have

fbestkfR2+i=1kti2gi222i=1kti=R2+i=1kci22i=1kti.
  1. We have

    i=1kti=i=1kcigi2i=1kciG.
  2. Hence we have

    fbestkfR2+i=1kci2(2/G)i=1kci.
  3. Following an argument similar to the proof of Corollary 11.4, the R.H.S. converges to 0 as k.

  4. Hence

    limkfbestkf=0.

11.1.2.6. On the Suboptimality Bound#

If we run the subgradient method for k iterations, we get a suboptimality bound given by (11.3).

fbestkfR2+G2i=1kti22i=1kti.

A natural question is how tight can this bound be by an appropriate selection of step sizes t1,,tk.

  1. Note that the R.H.S. is a convex and symmetric function of t1,,tk.

  2. Hence, at the optimal value, we must have t1==tk.

  3. Let t=t1==tk.

  4. Then the suboptimality bound reduces to

    R2+G2kt22kt.
  5. This is minimized at t=RGk.

  6. In other words, the finite sequence of step-sizes t1,,tk that minimizes the suboptimality bound in (11.3) after exactly k iterations is given by

    ti=RGk,i=1,,k.
  7. It must be noted that this suboptimality bound is dependent on the number of iterations.

  8. This choice of constant step size gives us the bound

    fbestkfRGk.

Theorem 11.2 (A bound on the suboptimality bound)

Let t1,,tk be any choice of step sizes for the subgradient method for k iterations. Then we must have

R2+G2i=1kti22i=1ktiRGk

where the L.H.S. is the step-size selection dependent suboptimality bound on fbestkf.

Accordingly, the number of steps required to achieve a guaranteed accuracy of fbestkfϵ for some ϵ>0 is at least (RG/ϵ)2 as per the suboptimality bound (11.3) irrespective of the step size selection.

Proof. We have already shown that the suboptimality bound is minimized by the constant step size selection where ti=RGk for every i=1,,k and is given by RGk.

  1. Let ϵ>0 be the required guaranteed accuracy.

  2. Then ϵRGk since RGk is the minimum guaranteed suboptimality bound after k iterations as per (11.3).

  3. Hence we have kRGϵ.

  4. Hence we have k(RGϵ)2.

We note that the suboptimality bound of ϵ can be guaranteed in k iterations only if we use the constant step sizes of t=RGk.

Observation 11.2 (Interpretation of RG)

Initial uncertainty

  1. Assume that f is Lipschitz continuous with the constant G.

  2. By Lipschitz continuity, we have

    f1fGx1x2RG.
  3. Hence RG is an estimate of the initial uncertainty in f.

Reduction in uncertainty

  1. If our goal is to achieve fbestkfϵ, then ϵ is an estimate of the final uncertainty in f.

  2. Hence RG/ϵ is the ratio of initial uncertainty in f to the final uncertainty in f.

  3. By Theorem 11.2, we require a minimum of (RG/ϵ)2 iterations to achieve this much reduction in uncertainty.

This analysis shows that the subgradient method is very slow. To achieve a 1000x reduction in the uncertainty of f, we need a million iterations at least.