10.5. Constrained Optimization I#

In this section, we focus on objective functions of type \(f : \RR^n \to \RR\) which are convex and differentiable. Our goal is to minimize \(f\) over a convex set \(C \subseteq \dom f\).

We also concern ourselves with functions of type \(f : \VV \to \RERL\) where \(\VV\) is a Euclidean space and \(f\) is Fréchet differentiable over some open set \(C \subseteq \dom f\). For those results which concern with Fréchet differentiable functions, we request the readers to revise the concepts of Gateaux and Fréchet differentiability in Differentiation in Banach Spaces.

Main references for this section are [6, 17].

10.5.1. Stationary Points#

We first look at functions which are differentiable over a convex set. Here, we don’t require that the function itself be convex. Thus, we may not characterize the global optimality of a point. However, we can still characterize the local optimality of a point.

In Definition 8.6, we defined stationary points for a real valued function as points where the gradient vanishes; i.e. \(\nabla f(\bx) = \bzero\).

In this section, we wish to restrict the domain of feasible points to a convex set \(C\) and consider the problem

\[\begin{split} & \text{minimize } & & f(\bx) \\ & \text{subject to } & & \bx \in C \end{split}\]

where \(f\) is differentiable over the convex set \(C\).

We now introduce the notion of stationary points for the given optimization problem of minimizing \(f\) over \(C\).

Definition 10.21 (Stationary point for an optimization problem)

Let \(f : \RR^n \to \RR\) be a real valued function which is differentiable over a convex set \(C\).

Then, \(\ba \in C\) is called a stationary point of the problem of minimizing \(f\) over \(C\) if

\[ \nabla f(\ba)^T (\bx - \ba) \geq 0 \Forall \bx \in C. \]

If \(C = \RR^n\), then the condition will reduce to \(\nabla f(\ba) = \bzero\).

Theorem 10.45 (Local minimum points are stationary points)

Let \(f : \RR^n \to \RR\) be a real valued function which is differentiable over a convex set \(C\). Consider the optimization problem

\[\begin{split} & \text{minimize } & & f(\bx) \\ & \text{subject to } & & \bx \in C. \end{split}\]

If \(\ba \in C\) is a local minimum, then \(\ba\) must be a stationary point.

Proof. Let \(\ba\) be a local minimum and for contradiction assume that it is not a stationary point.

  1. Then, there exists \(\bx \in C\) such that \(\nabla f(\ba)^T (\bx - \ba) < 0\).

  2. Let \(t \in [0,1]\) be a parameter.

  3. Let \(\bz_t = t \bx + (1-t) \ba\).

  4. Since \(C\) is convex, hence \(\bz_t \in C\).

  5. Differentiating \(f(\bz_t)\) w.r.t. \(t\) at \(t=0\), we obtain

    \[ \left . \frac{d}{d t} f(\bz_t) \right |_{t=0} = \nabla f(\ba)^T (\bx - \ba) < 0. \]
  6. Thus, for small enough \(t\), \(f(\bz_t) < f(\ba)\).

  7. This contradicts the hypothesis that \(\ba\) is a local minimum.

  8. Thus, all local minimum points must be stationary points.

Example 10.6 (Unconstrained minimization)

Let \(f : \RR^n \to \RR\) be a real valued function which is differentiable. Consider the unconstrained optimization problem

\[ \text{minimize } f(\bx). \]

In this case, the feasible set is \(C = \RR^n\).

  1. If \(\ba\) is a stationary point for this problem, then

    \[ \nabla f(\ba)^T (\bx - \ba) \geq 0 \Forall \bx \in \RR^n. \]
  2. In particular, if we choose \(\bx = \ba - \nabla f(\ba)\), we get

    \[ \nabla f(\ba)^T (\bx - \ba) = \nabla f(\ba)^T (- \nabla f(\ba)) = - \| \nabla f(\ba) \|^2 \geq 0. \]
  3. This is true only if \(\nabla f(\ba) = \bzero\).

  4. Thus, for unconstrained minimization, the gradient vanishes at stationary points.

Theorem 10.46 (Stationary point as an orthogonal projection)

Let \(f : \RR^n \to \RR\) be a real valued function which is differentiable over a convex set \(C\). Consider the optimization problem

\[\begin{split} & \text{minimize } & & f(\bx) \\ & \text{subject to } & & \bx \in C. \end{split}\]

Let \(s > 0\). Then \(\ba \in C\) is a stationary point of the optimization problem if and only if

\[ \ba = P_C (\ba - s \nabla f(\ba)). \]

Proof. Recall from Theorem 10.7 that \(\bz \in C\) is the projection of \(\bx\) if and only if

\[ \langle \by - \bz, \bx - \bz \rangle \leq 0 \Forall \by \in C. \]
  1. Replace \(\bz = \ba\) and \(\bx = \ba - s \nabla f(\ba)\). We get

    \[\begin{split} & \langle \by - \ba, \ba - s \nabla f(\ba) - \ba \rangle \leq 0 \Forall \by \in C\\ & \iff s\langle \by - \ba, \nabla f(\ba) \rangle \geq 0 \Forall \by \in C\\ & \iff \nabla f(\ba)^T (\by - \ba) \geq 0 \Forall \by \in C. \end{split}\]
  2. But this is the same condition as the definition for a stationary point.

10.5.2. First Order Optimality Criteria#

We now pay our attention to the case where \(f\) is convex as well as differentiable. In this case, a point is a global optimal point if and only if it is a stationary point.

Theorem 10.47 (Optimality criterion for differentiable objective function)

Let \(f : \RR^n \to \RR\) be a differentiable convex function. Let \(C \subseteq \dom f\) be a convex set. Consider the minimization problem

\[\begin{split} & \text{minimize } & & f(\bx) \\ & \text{subject to } & & \bx \in C. \end{split}\]

Then, \(\bx \in C\) is an optimal point if and only if

(10.8)#\[\nabla f(\bx)^T (\by - \bx) \geq 0 \Forall \by \in C.\]

In other words, \(\bx\) is optimal if and only if it is a stationary point.

Proof. By Theorem 9.98

\[ f(\by) \geq f(\bx) + \nabla f(\bx)^T (\by - \bx) \]

for every \(\bx, \by \in \dom f\).

Assume that some \(\bx \in C\) satisfies the optimality criterion in (10.8).

  1. Let \(\by \in C\).

  2. Then, by differentiability

    \[ f(\by) \geq f(\bx) + \nabla f(\bx)^T (\by - \bx). \]
  3. By hypothesis \(\nabla f(\bx)^T (\by - \bx) \geq 0\).

  4. Thus, \(f(\by) \geq f(\bx)\).

  5. Since this is true for every \(\by \in C\), hence \(\bx\) is an optimal point for the minimization problem.

Now for the converse, assume that \(\bx \in C\) is an optimal point.

  1. For contradiction, assume that (10.8) doesn’t hold.

  2. Then, there exists \(\by \in C\) such that

    \[ \nabla f(\bx)^T (\by - \bx) < 0. \]
  3. Let \(t \in [0, 1]\) be a parameter.

  4. Let \(\bz(t) = t \by + (1-t) \bx\).

  5. Since \(C\) is convex, hence \(\bz(t) \in C\).

  6. Differentiating \(f(\bz(t))\) w.r.t. \(t\) at \(t=0\), we obtain

    \[ \left . \frac{d}{d t} f(\bz(t)) \right |_{t=0} = \nabla f(\bx)^T (\by - \bx) < 0. \]
  7. Thus, for small enough \(t\), \(f(\bz(t)) < f(\bx)\).

  8. Thus, \(\bx\) cannot be an optimal point for the minimization problem.

  9. This contradicts our hypothesis that \(\bx\) is an optimal point.

  10. Hence, (10.8) must hold for every \(\by \in C\).

Theorem 10.48 (Optimality criterion for unconstrained problem with differentiable objective function)

Let \(f : \RR^n \to \RR\) be a differentiable convex function. Consider the unconstrained minimization problem

\[ \text{minimize } f(\bx) \]

Then, \(\bx \in \RR^n\) is an optimal point if and only if \(\nabla f(\bx) = \bzero\).

Proof. In this case, the set of feasible points is \(C = \dom f\).

  1. Since \(f\) is differentiable, hence \(C\) is an open set.

  2. By Theorem 10.47, \(\bx\) is an optimal point if and only if

    \[ \nabla f(\bx)^T (\by - \bx) \geq 0 \Forall \by \in C. \]
  3. If \(\nabla f(\bx) = \bzero\), then this inequality is satisfied. Hence, \(\bx\) must be an optimal point.

  4. Now, assume that \(\bx\) is an optimal point.

  5. Since \(\bx \in C\) and \(C\) is open, hence there exists a closed ball \(B[\bx, r] \subseteq C\).

  6. Let \(\by = \bx - t \nabla f(\bx)\).

  7. For sufficiently small \(t > 0\), \(\by \in B[\bx, r] \subseteq C\).

  8. Then,

    \[ \nabla f(\bx)^T (\by - \bx) = \nabla f(\bx)^T (-t \nabla f(\bx)) = -t \| \nabla f(\bx) \|_2^2 \geq 0 \]

    must hold true for \(t > 0\).

  9. This means that \(\| \nabla f(\bx) \|_2 \leq 0\) must be true.

  10. Thus \(\nabla f(\bx) = \bzero\) must be true.

10.5.2.1. Nondifferentiability at the Boundary#

There are some specific results available for the unconstrained minimization of a convex function \(f\) is not differentiable everywhere in its domain. If \(\dom f\) is not open, then \(f\) may be differentiable at \(\interior \dom f\) but is not differentiable at the boundary points \((\dom f) \setminus (\interior dom f)\). The key questions are

  1. Under what conditions, the minimizer of \(f\) is a point of differentiability.

  2. Under what conditions, the minimizer of \(f\) may be at a point of nondifferentiability.

Theorem 10.49 (Zero gradient implies minimizer)

Let \(f: \VV \to \RERL\) with \(S = \dom f\) be a proper convex function which is differentiable over some open set \(U \subseteq S\). If \(\nabla f(\bx) = \bzero\) at some \(\bx \in U\), then \(\bx\) is one of the minimizers for \(f\).

Proof. Recall from Theorem 9.220 that \(f\) is differentiable at \(\bx\) if and only if

\[ \partial f(\bx) = \{ \nabla f (\bx) \}. \]
  1. Assume that \(f\) is differentiable at \(\bx \in U\) and \(\nabla f(\bx) = \bzero\).

  2. Then \(\nabla f (\bx)\) is the one and only subgradient of \(f\) at \(\bx\).

  3. Due to subgradient inequality

    \[ f(\by) \geq f(\bx) + \langle \by - \bx, \nabla f(\bx) \rangle \Forall \by \in \VV. \]
  4. Since \(\nabla f(\bx) = \bzero\), hence

    \[ f(\by) \geq f(\bx) \Forall \by \in \VV. \]
  5. Thus \(f\) attains a minimum at \(\bx\) and \(\bx\) is one of its minimizers.

Next we consider the special case of the minimization of a convex function \(f\) which is differentiable in the interior of its domain but the gradient never vanishes. In this case, if a minimizer for \(f\) exists, it must be at the boundary of its domain.

Theorem 10.50 (Minimizers at points of nondifferentiability)

Let \(f: \VV \to \RERL\) with \(S = \dom f\) be a proper convex function which is differentiable over some open set \(U \subseteq S\) and not differentiable over \(S \setminus U\). Assume that \(f\) attains its minimum value at some \(\ba \in S\); i.e., a minimizer of \(f\) exists. If \(\nabla f(\bx) \neq \bzero\) at every \(\bx \in U\), then the minimizer must be at some point of nondifferentiability. In other words, \(\ba \in S \setminus U\).

Proof. We are given that there exists a is a minimizer of \(f\) and for every \(\bx \in U\), \(\nabla f(\bx) \neq \bzero\).

  1. Pick any \(\bx \in U\).

  2. We need to show that \(\bx\) cannot be a minimizer.

  3. For contradiction, assume that \(\bx\) is a minimizer.

  4. By subgradient inequality

    \[ f(\by) \geq f(\bx) + \langle \by - \bx, \nabla f(\bx) \rangle \Forall \by \in \VV. \]
  5. Since \(\bx\) is a minimizer, hence we must have

    \[ \langle \by - \bx, \nabla f(\bx) \rangle \geq 0 \Forall \by \in \VV. \]
  6. Let \(\by = \bx - \nabla f(\bx)\).

  7. Then

    \[ \langle \by - \bx, \nabla f(\bx) \rangle = \langle - \nabla f(\bx), \nabla f(\bx) \rangle = - \| \nabla f(\bx) \|^2 < 0. \]
  8. This contradicts our assumption that \(\bx\) is a minimizer.

  9. Hence if \(\ba\) is a minimizer of \(f\) then \(\ba \in S \setminus U\).

We next show how the condition in (10.8) simplifies for specific optimization problem structures.

10.5.2.2. Equality Constraints#

Theorem 10.51 (Differentiable objective minimization with equality constraints)

Let \(f : \RR^n \to \RR\) be a differentiable convex function with \(\dom f = \RR^n\). Consider the minimization problem:

\[\begin{split} & \text{minimize } & & f(\bx) \\ & \text{subject to } & & \bA \bx = \bb \end{split}\]

where \(\bA \in \RR^{p \times n}\) and \(\bb \in \RR^p\) represent \(p\) linear equality constraints. Assume that the problem is feasible.

Then, \(\bx\) is an optimal point for the minimization problem if and only if there exists a vector \(\bv \in \RR^p\) such that

\[ \nabla f (\bx) + \bA^T \bv = \bzero. \]

Proof. The feasible set is given by \(C = \{ \bx \ST \bA \bx = \bb \}\). We recall from Theorem 10.47, that \(\bx\) is an optimal point if and only if \(\nabla f(\bx)^T (\by - \bx) \geq 0 \Forall \by \in C\).

  1. Thus, \(\bx\) is feasible if and only if \(\nabla f(\bx)^T (\by - \bx) \geq 0\) for every \(\by\) satisfying \(\bA \by = \bb\).

  2. Since both \(\bx\) and \(\by\) are feasible, hence \(\bA (\by - \bx) = \bzero\).

  3. Thus, \(\bz = \by - \bx \in \NullSpace(\bA)\).

  4. In fact, \(\by \in C\) if and only if \(\by = \bx + \bz\) for some \(\bz \in \NullSpace(\bA)\).

  5. Thus, the optimality criteria reduces to \(\nabla f(\bx)^T \bz \geq 0\) for every \(\bz \in \NullSpace(\bA)\).

  6. Note that \(\nabla f(\bx)^T \bz\) is a linear function of \(\bz\) as \(\nabla f(\bx)\) is a fixed vector.

  7. If a linear function is nonnegative on a subspace, then it must be identically zero on the subspace.

  8. Thus, \(\nabla f(\bx)^T \bz = 0\) for every \(\bz \in \NullSpace(\bA)\).

  9. In other words, \(\bx\) is optimal if and only if \(\nabla f(\bx) \perp \NullSpace(\bA)\).

  10. Recall that \(\NullSpace(\bA)^{\perp} = \ColSpace(\bA^T)\); i.e., the null space of \(\bA\) is orthogonal complement of the column space (range) of \(\bA^T\).

  11. Thus, \(\bx\) is optimal if and only if \(\nabla f(\bx) \in \ColSpace(\bA^T)\).

  12. In other words, \(\bx\) is optimal if and only if there exists a vector \(\bv \in \RR^p\) such that

    \[ \nabla f (\bx) + \bA^T \bv = \bzero. \]

This result is a Lagrange multiplier optimality condition to be discussed in more detail in later sections.

10.5.2.3. Nonnegative Orthant Constraints#

Example 10.7 (Differentiable objective minimization over nonnegative orthant)

Let \(f : \RR^n \to \RR\) be a differentiable convex function with \(\dom f = \RR^n\). Consider the minimization problem:

\[\begin{split} & \text{minimize } & & f(\bx) \\ & \text{subject to } & & \bx \succeq \bzero. \end{split}\]
  1. The feasible set is the nonnegative orthant \(\RR^n_+\).

  2. \(\bx\) is optimal if and only if \(\bx \succeq \bzero\) and \(\nabla f (\bx)^T (\by - \bx) \geq 0\) for every \(\by \succeq \bzero\).

  3. The term \(\nabla f (\bx)^T \by\) is unbounded below on \(\by \in \RR^n_+\) unless \(\nabla f (\bx) \in \RR^n_+\).

  4. Thus, \(\nabla f (\bx)\) must be nonnegative.

  5. Then, the minimum value for \(\nabla f (\bx)^T \by\) is 0.

  6. Consequently, the optimality condition reduces to \(-\nabla f (\bx)^T \bx \geq 0\) or \(\nabla f (\bx)^T \bx \leq 0\).

  7. But \(\bx \succeq \bzero\) and \(\nabla f (\bx) \succeq \bzero\).

  8. Thus, we must have \(\nabla f (\bx)^T \bx = 0\).`

  9. We note that

    \[ \nabla f (\bx)^T \bx = \sum_{i=1}^n (\nabla f (\bx))_i x_i. \]
  10. Thus, it is a sum of products of nonnegative numbers.

  11. So each term in the sum must be 0.

  12. Thus, \((\nabla f (\bx))_i x_i = 0\) must hold true for every \(i=1,\dots,n\).

  13. Thus, the optimality condition can be rephrased as

    \[ \bx \succeq \bzero \text{ and } \nabla f (\bx) \succeq \bzero \text{ and } (\nabla f (\bx))_i x_i = 0 \Forall i=1,\dots,n. \]

The condition \((\nabla f (\bx))_i x_i = 0\) for every \(i\) is known as complementarity. It means that for every \(i\) either \(x_i\) or \((\nabla f (\bx))_i\) or both must be 0. In other words, both \(x_i\) and \((\nabla f (\bx))_i\) cannot be nonzero at the same time.

Thus, the sparsity patterns of \(\bx\) and \(\nabla f (\bx)\) are complementary. In other words,

\[ \supp (\bx) \cap \supp (\nabla f (\bx)) = \EmptySet \]

where \(\supp (\bx)\) denotes the index set of nonzero entries of \(\bx\).

10.5.2.4. Unit Sum Set Constraint#

Example 10.8 (Minimization over unit sum set)

Let \(f : \RR^n \to \RR\) be a differentiable convex function with \(\dom f = \RR^n\). Consider the minimization problem:

\[\begin{split} & \text{minimize } & & f(\bx) \\ & \text{subject to } & & \bone^T \bx = 1. \end{split}\]
  1. The feasible set is given by

    \[ C = \{ \bx \in \RR^n \ST \bone^T \bx = 1 \} = \{ \bx \in \RR^n \ST \sum_{i=1}^n x_i = 1 \}. \]
  2. This set is also known as the unit sum set.

  3. A point \(\ba \in C\) is optimal if and only if

    \[ \nabla f(\ba)^T (\bx - \ba) \geq 0 \Forall \bx \in C. \]

    Let us call this condition (I).

  4. This is equivalent to the condition:

    \[ \frac{\partial f}{\partial x_1} (\ba) = \frac{\partial f}{\partial x_2} (\ba) = \dots = \frac{\partial f}{\partial x_n} (\ba). \]

    Let us call this condition (II).

  5. We shall show that (I) and (II) are equivalent.

We first show that (II) implies (I).

  1. Assume that some \(\ba \in C\) satisfies (II) with

    \[ \alpha = \frac{\partial f}{\partial x_1} (\ba) = \frac{\partial f}{\partial x_2} (\ba) = \dots = \frac{\partial f}{\partial x_n} (\ba). \]
  2. Then, for any \(\bx \in C\),

    \[\begin{split} \nabla f(\ba)^T (\bx - \ba) &= \sum_{i=1}^n \frac{\partial f}{\partial x_i} (\ba) (x_i - a_i) \\ &= \alpha \sum_{i=1}^n (x_i - a_i) \\ &= \alpha (\sum_{i=1}^n x_i - \sum_{i=1}^n a_i) \\ &= \alpha (1 - 1) = 0. \end{split}\]
  3. Thus, we see that \(\nabla f(\ba)^T (\bx - \ba) \geq 0\) indeed and (I) is satisfied.

Now, assume that (I) is satisfied for some \(\ba \in C\).

  1. For contradiction, assume that \(\ba\) doesn’t satisfy (II).

  2. Then, their exist \(i,j \in [1,\dots,n]\) such that

    \[ \frac{\partial f}{\partial x_i} (\ba) > \frac{\partial f}{\partial x_j} (\ba). \]
  3. Pick a vector \(\bx \in C\) with following definition

    \[\begin{split} x_k = \begin{cases} a_k & k \notin \{i, j \} \\ a_k - 1 & k = i \\ a_k + 1 & k = j. \end{cases} \end{split}\]
  4. Note that

    \[ \sum_{i=1}^n x_k = \sum_{i=1}^n a_k = 1. \]

    Thus, \(\bx \in C\) holds.

  5. Now,

    \[\begin{split} \nabla f(\ba)^T (\bx - \ba) &= \sum_{k=1}^n \frac{\partial f}{\partial x_k} (\ba) (x_k - a_k) \\ \nabla f(\ba)^T (\bx - \ba) &= \frac{\partial f}{\partial x_i} (\ba) (x_i - a_i) + \frac{\partial f}{\partial x_j} (\ba) (x_j - a_j) \\ &= - \frac{\partial f}{\partial x_i} (\ba) + \frac{\partial f}{\partial x_j} (\ba)\\ < 0. \end{split}\]
  6. This violates the hypothesis that (I) holds true.

  7. Thus, (II) must be true.

  8. Thus, (I) implies (II).

10.5.2.5. Unit Ball Constraint#

Example 10.9 (Minimization over unit ball)

Let \(f : \RR^n \to \RR\) be a convex function which is differentiable over \(B[\bzero, 1]\). Consider the minimization problem:

\[\begin{split} & \text{minimize } & & f(\bx) \\ & \text{subject to } & & \| \bx \| \leq 1. \end{split}\]
  1. The feasible set is given by

    \[ C = B[\bzero, 1] = \{ \bx \ST \| \bx \| \leq 1 \}. \]
  2. A point \(\ba \in B[\bzero, 1]\) is an optimal point if and only if

    \[ \nabla f(\ba)^T (\bx - \ba) \geq 0 \Forall \bx \in B[\bzero, 1]. \]
  3. This is equivalent to saying that

    \[ \underset{\| \bx \| \leq 1}{\inf} (\nabla f(\ba)^T \bx - \nabla f(\ba)^T \ba ) \geq 0. \]
  4. Recall from Theorem 8.11 that for any \(\bv \in \RR^n\), the optimal value of the problem

    \[ \inf \{ \bv^T \bx \ST \| \bx \| \leq 1 \} \]

    is \(-\| \bv \|\).

  5. Thus,

    \[ \underset{\| \bx \| \leq 1}{\inf} \nabla f(\ba)^T \bx = - \| \nabla f(\ba) \|. \]
  6. Thus, the inequality simplifies to

    \[ - \nabla f(\ba)^T \ba \geq \| \nabla f(\ba) \|. \]
  7. At the same time, by Cauchy Schwartz inequality,

    \[ - \nabla f(\ba)^T \ba \leq \| \nabla f(\ba) \| \| \ba \| \leq \| \nabla f(\ba) \| \]

    since \(\ba \in B[\bzero, 1]\).

  8. Thus, the inequality must be an equality, giving us, \(\ba\) is an optimal point

    \[ - \nabla f(\ba)^T \ba = \| \nabla f(\ba) \|. \]

We now have following possibilities for this condition.

  1. If \(\nabla f(\ba) = \bzero\), then the condition holds and \(\ba\) is indeed an optimal point.

  2. Otherwise, if \(\nabla f(\ba) \neq \bzero\), then \(\| \ba \| = 1\) must be true.

    1. For contradiction, if we assume that \( \| \ba \| < 1\).

    2. Then, by Cauchy Schwartz inequality

      \[ - \nabla f(\ba)^T \ba \leq \| \nabla f(\ba) \| \| \ba \| < \| \nabla f(\ba) \|, \]

      a contradiction.

  3. Thus, if \(\nabla f(\ba) \neq \bzero\), then \(\ba\) is an optimal point if and only if \( \| \ba \| = 1\) and

    \[ - \nabla f(\ba)^T \ba = \| \nabla f(\ba) \| = \| \nabla f(\ba) \| \| \ba \|. \]
  4. But this is possible only when there exists \(t \leq 0\) such that

    \[ \nabla f(\ba) = t \ba. \]
  5. Thus, if \(\nabla f(\ba) \neq \bzero\), then \(\ba\) is an optimal point if and only if \( \| \ba \| = 1\) and there exists \(t \leq 0\) such that \(\nabla f(\ba) = t \ba\).

10.5.3. Descent Directions Method#

We first consider the problem of unconstrained minimization of a continuously differentiable function \(f\).

Typical iterative algorithms which aim to find the solution \(\bx\) for the minimization problem start with an initial guess \(\bx_0\) and perform a step of the form

\[ \bx_{k+1} = \bx_k + t_k \bd_k \]

where \(\bx_k\) is the current guess (starting from \(\bx_0\)), \(\bd_k\) is a direction in which we move to make the next guess and \(t_k\) is a step size in that direction. \(\bx_{k+1}\) is the next guess obtained from current guess. We say that an algorithm has made progress if \(f(\bx_{k+1}) < f(\bx_k)\).

This brings us to the notion of a descent direction.

Definition 10.22 (Descent direction)

Let \(f : \VV \to \RR\) be a continuously differentiable function over \(\VV\). A nonzero vector \(\bd\) is called a descent direction of \(f\) at \(\bx\) if the directional derivative \(f'(\bx; \bd)\) is negative.

In other words,

\[ f'(\bx; \bd) = \langle \bd, \nabla f(\bx) \rangle < 0. \]

If the directional derivative is negative, then it is clear that for a small enough step in this direction, the value of \(f\) will decrease.

Lemma 10.4 (Descent property of descent direction)

Let \(f : \VV \to \RR\) be a continuously differentiable function over \(\VV\). Let \(\bx \in \VV\). Assume that \(\bd\) is a descent direction for \(f\). Then, there exists \(\epsilon > 0\) such that

\[ f(\bx + t \bd) < f(\bx) \]

for any \(t \in (0, \epsilon]\).

Proof. This follows from the negativity of the directional derivative.

  1. Recall from Definition 9.70 that

    \[ f'(\bx; \bd) = \lim_{t \to 0^+} \frac{f(\bx + t \bd) - f(\bx)}{t}. \]
  2. Since \(\bd\) is a descent direction, hence \(f'(\bx; \bd) < 0\).

  3. Thus,

    \[ \lim_{t \to 0^+} \frac{f(\bx + t \bd) - f(\bx)}{t} < 0. \]
  4. Thus, there exists \(\epsilon > 0\) such that

    \[ \frac{f(\bx + t \bd) - f(\bx)}{t} < 0 \]

    for every \(t \in (0, \epsilon]\).

10.5.3.1. Descent Directions Method#

Algorithm 10.1 (The descent directions method)

Initialization

  1. Pick \(\bx_0 \in \VV\) arbitrarily as initial solution.

General iteration: for \(k=0,1,2,\dots\), execute the following steps

  1. Pick a descent direction \(\bd_k\).

  2. Pick a step size \(t_k\) satisfying \(f(\bx + t_k \bd_k) < f(\bx_k)\).

  3. Update: \(\bx_{k+1} \leftarrow \bx_k + t_k \bd_k\).

  4. Check for convergence: If converged, then STOP.

Return \(\bx_{k+1}\) as the output.

This method is not really an actual algorithm. It can be considered as a template for an actual algorithm. Several aspects of the method need to be carefully chosen to come up with a viable algorithm.

  1. How to select the initial point \(\bx_0\)?

  2. How to choose the descent direction?

  3. How to select the step size?

  4. How to decide when to stop the iterations (stopping criterion)?

The key result that we establish here is to show that if the step size \(t_k\) is sufficient small in the descent direction \(\bd_k\), then there is sufficient decrease in the value of \(f\) going from \(\bx_k\) to \(\bx_{k+1}\).

Theorem 10.52 (Sufficient decrease condition for descent direction method)

Let \(f\) be a continuously differentiable function over \(\RR^n\). Let \(\bx \in \RR^n\). Assume that a nonzero \(\bd\) is a descent direction for \(f\) at \(\bx\). Let \(\alpha \in (0, 1)\). Then, there exists \(\epsilon > 0\) such that the inequality

\[ f(\bx) - f(\bx + t \bd) \geq - \alpha t \langle \bd, \nabla f(\bx) \rangle \]

holds for every \(t \in [0, \epsilon]\).

Proof. We proceed as follows.

  1. Since \(f\) is continuously differentiable, hence due to Theorem 5.1,

    \[ f(\bx + t \bd) = f(\bx) + t \nabla f(\bx)^T \bd + o(t \| \bd \|). \]
  2. Rearranging the terms and introducing an \(\alpha \in (0, 1)\), we obtain

    \[ f(\bx) - f(\bx + t \bd) = - \alpha t \nabla f(\bx)^T \bd - (1- \alpha) \nabla f(\bx)^T \bd - o(t \| \bd \|). \]
  3. Since \(\bd\) is a descent direction of \(f\), hence \(f(\bx; \bd) = \nabla f(\bx)^T \bd < 0\).

  4. Thus,

    \[ \lim_{t \to 0^+} \frac{(1- \alpha) \nabla f(\bx)^T \bd + o(t \| \bd \|)}{t} = (1- \alpha) \nabla f(\bx)^T \bd < 0. \]
  5. Hence, there exists \(\epsilon > 0\) such that for every \(t \in [0, \epsilon]\),

    \[ (1- \alpha) \nabla f(\bx)^T \bd + o(t \| \bd \| < 0. \]
  6. Thus, from the previous equation:

    \[ f(\bx) - f(\bx + t \bd) \geq - \alpha t \nabla f(\bx)^T \bd \]

    for every \(t \in [0, \epsilon]\).

10.5.3.2. Step Size Selection#

Following are common methods for step size selection. Each method has has advantages as well as drawbacks.

  1. Constant step size

  2. Exact line search

  3. Backtracking

Constant step size uses \(t_k = \bar{t}\) for every iteration.

  • A large step size might cause algorithm to take non-decreasing steps.

  • The algorithm may never converge with a large step size.

  • A small constant step size may lead to very slow convergence.

Exact line search solves the minimization problem

\[ \text{ minimize } f \text{ along the ray } \bx_k + t \bd_k. \]

The optimization variable is the step size parameter \(t \in \RR_+\). The solution is the value \(t_k \geq 0\).

  • Minimizing \(f\) along the ray may not be straight-forward.

  • Any closed form or algorithmic solution for the exact line search may not be available.

Backtracking is a compromise between these two approaches.

  1. Input parameters \(s > 0\), \(\alpha \in (0,1)\), \(\beta \in (0, 1)\).

  2. Initialize \(t_k = s\).

  3. While

    \[ f(\bx_k) - f(\bx_k + t_k \bd_k) < - \alpha t_k \nabla f(\bx_k)^T \bd \]
    1. Set \(t_k \leftarrow \beta t_k\).

    2. Continue.

  4. Return step size \(t_k\).

Essentially, we are reducing the step size by a factor \(\beta\) iteratively till \(f\) shows sufficient decrease as stipulated by Theorem 10.52.

  • There is no exact line search.

  • Does find a good enough step size satisfying the sufficient decrease condition.

  • Involves multiple evaluations of \(f\).

  • \(s\) should be chosen carefully.

10.5.4. Gradient Method#

The gradient method is a descent direction method in which the descent direction is always chosen to be the negative of the gradient at the current point (solution).

\[ \bd_k = - \nabla f(\bx_k). \]

Lemma 10.5 (Negative of gradient is a descent direction)

Let \(f\) be a continuously differentiable function over an open set \(S\). At any point \(\bx \in S\), the negative of the gradient \(\bd = - \nabla f(\bx)\) is a descent direction whenever \(\nabla f(\bx) \neq \bzero\).

Proof. We just need to compute the directional derivative.

\[ f'(\bx; \bd) = \langle \bd, \nabla f(\bx) \rangle = \langle -\nabla f(\bx), \nabla f(\bx) \rangle = - \| \nabla f(\bx) \|^2 < 0. \]

The last strict inequality is valid since \(\nabla f(\bx) \neq \bzero\).

10.5.5. Gradient Projection Method#

In this subsection, we present a method to solve the optimization problem

\[\begin{split} & \text{minimize } & & f(\bx) \\ & \text{subject to } & & \bx \in C \end{split}\]

where \(C\) is a convex set and \(f\) is differentiable over \(C\).

Recall from Theorem 10.46, that \(\ba\) is a stationary point for the problem of minimizing \(f\) over a convex set \(C\) if and only if

\[ \ba = P_C (\ba - s \nabla f(\ba)). \]

This stationarity condition is the basis for the gradient projection method presented below.

Algorithm 10.2 (The gradient projection method)

Inputs

  1. \(\epsilon > 0\) - tolerance parameter

Initialization

  1. Pick \(\bx_0 \in C\) arbitrarily.

General iteration: for \(k=0,1,2,\dots\), execute the following steps

  1. Pick a step size \(t_k\) by a line search procedure.

  2. Update: \(\bx_{k+1} \leftarrow P_C (\bx_k - t_k \nabla f(\bx_k))\).

  3. Check for convergence: If \(\| \bx_{k+1} - \bx_k \| \leq \epsilon\), then STOP.

Return \(\bx_{k+1}\) as the output.

In the case of unconstrained minimization:

  1. \(C = \RR^n\)

  2. \(P_C (\bx_k - t_k \nabla f(\bx_k)) = \bx_k - t_k \nabla f(\bx_k)\).

  3. \(\bx_{k+1} = \bx_k - t_k \nabla f(\bx_k)\).

  4. We see that gradient projection reduces to gradient descent.

Another way to look at the algorithm is:

  1. \(\by_{k+1} = \bx_k - t_k \nabla f(\bx_k)\) computes the next candidate solution assuming no constraints.

  2. \(\bx_{k+1} = P_C(\by_{k+1})\) step projects the next candidate solution back to the feasible set \(C\).

  3. Thus, we have a gradient step followed by a projection step.

10.5.5.1. Convergence#

If we can establish conditions under which each iteration of the gradient projection algorithm leads to sufficient decrease in the value of the objective function, then we can guarantee that the algorithm will converge in a finite number of steps.

Lemma 10.6 (Sufficient decrease lemma for constrained problems)

Consider the optimization problem:

\[\begin{split} & \text{minimize } & & f(\bx) \\ & \text{subject to } & & \bx \in C. \end{split}\]

Assume that \(C\) is a nonempty closed convex set, \(f\) is continuously differentiable over \(C\), and \(\nabla f\) is Lipschitz continuous with a constant \(L\) over \(C\), Then, for any \(\bx \in C\), and \(t \in (0, \frac{2}{L})\), the following inequality holds.

\[ f(\bx) - f(\by) \geq t \left ( 1 - \frac{L t}{2} \right ) \left \| \frac{1}{t}(\bx - \by) \right \|^2. \]

where \(\by = P_C(\bx - t \nabla f(\bx))\).

Proof.