10.8. Lagrange Multipliers#

Main references for this section are [6, 9, 17].

The material in this section builds on the material from previous sections. While the material in Constrained Optimization I and Constrained Optimization II doesn’t make specific assumptions on the structure of the constraint set (beyond, say, convexity and closedness), the material in Linear Constraints deals with a specific structure where the constraint set consists of a system of linear inequalities and equalities. This section focuses on the case where the constraint set consists of a system of smooth inequalities and equalities.

The necessary and/or sufficient conditions for the optimization problems presented in this section assert the existence of a set of nonnegative (for inequality constraints) and real (for equality constraints) scalars known as Lagrange multipliers which satisfy a specific system of equations.

  1. We generalize the linear inequality and equality constraints in Linear Constraints to allow for smooth inequality and equality constraints.

  2. We first consider problems involving minimization of a smooth function over a set of smooth inequalities.

    1. We describe the notion of feasible descent directions.

    2. We show that at local minimizers, there are no feasible descent directions.

    3. We then develop the necessary Fritz-John conditions for the existence of a local minimizer.

    4. We add further constraint qualifications to develop the necessary KKT conditions for the existence of a local minimizer.

  3. We then consider the problems involving minimization of a smooth function over a set of smooth inequalities and equalities. We present the KKT conditions for the existence of a local minimizer.

  4. We then add convexity into the mix for the cost function and the constraint functions.

10.8.1. Inequality Constrained Problems#

We start by developing the KKT conditions for the problem of minimizing a smooth function over a set of inequality constraints.

The problem is given as

(10.18)#\[\begin{split}& \text{minimize } & & f(\bx) & \\ & \text{subject to } & & g_i(\bx) \leq 0, i=1,\dots,m &\end{split}\]

where \(f, g_1, \dots, g_m : \VV \to \RR\) are continuously differentiable functions over \(\VV\).

The constraint set is given by

\[ C = \{\bx \in \VV \ST g_i(\bx) \leq 0, i=1,\dots,m \}. \]

10.8.1.1. Feasible Descent Directions#

Recall from Definition 10.22 that a descent direction is a direction along which the directional derivative of the cost function is negative; i.e., \(f'(\bx; \bd) < 0\). If \(f\) is continuously differentiable, this translates to

\[ f'(\bx; \bd) = \langle \bd, \nabla f(\bx) \rangle < 0. \]

Recall from Definition 10.23 that given a vector \(\bx \in C\), a direction \(\bd \in \VV\) is said to be a feasible direction of \(C\) at \(\bx\) if there exists a \(\overline{t} > 0\) such that

\[ \bx + t \bd \in C \text{ for every } t \in [0, \overline{t}]. \]

We now introduce the notion of a feasible descent direction.

Definition 10.28 (Feasible descent direction)

Consider the problem of minimizing a cost function \(f : \VV \to \RERL\) over a constraint set \(C \subseteq \dom f\). A nonzero vector \(\bd\) is called a feasible descent direction at \(\bx \in C\) if \(f'(\bx; \bd) < 0\) and there exists \(\overline{t} > 0\) such that

\[ \bx + t \bd \in C \text{ for every } t \in [0, \overline{t}]. \]

In other words, a feasible descent direction at \(\bx \in C\) is a feasible direction and a descent direction.

  1. If \(f\) is continuously differentiable then we must have \(\langle \bd, \nabla f(\bx) \rangle < 0\).

  2. If \(C\) is convex, then we just need to have \(\bx + \overline{t}\bd \in C\). By virtue of the convexity of \(C\), every \(\by \in [\bx, \bx + \overline{t}\bd ]\) belongs to \(C\).
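
A small numerical sketch can make this definition concrete. The snippet below (illustrative only; the cost function, the constraint set \(C\), and the tolerances are assumptions, not taken from the text) checks the two requirements: a negative inner product with the gradient and feasibility of small steps along the direction.

```python
# Illustrative sketch: check whether d is a feasible descent direction of
# f(x) = ||x||^2 at x for the set C = {x : x >= 0} (nonnegative orthant).
import numpy as np

def is_feasible_descent(grad_f, x, d, is_feasible, t_max=1e-3, num=10):
    """True if <d, grad_f(x)> < 0 and x + t*d stays in C for sampled t in (0, t_max]."""
    descent = np.dot(d, grad_f(x)) < 0
    feasible = all(is_feasible(x + t * d) for t in np.linspace(t_max / num, t_max, num))
    return descent and feasible

grad_f = lambda x: 2 * x                             # gradient of f(x) = ||x||^2
is_feasible = lambda x: bool(np.all(x >= -1e-12))    # nonnegative orthant (with tolerance)
x = np.array([1.0, 1.0])
print(is_feasible_descent(grad_f, x, np.array([-1.0, 0.0]), is_feasible))  # True
print(is_feasible_descent(grad_f, x, np.array([0.0, 1.0]), is_feasible))   # False: not a descent direction
```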

Lemma 10.7 (Local minimum and feasible descent directions)

Consider the problem of minimizing a cost function \(f : \VV \to \RERL\) over a constraint set \(C \subseteq \dom f\). If \(\bx^*\) is a local minimizer then there are no feasible descent directions at \(\bx^*\).

Proof. We prove this by contradiction.

  1. Let \(\bx^*\) be a local minimizer.

  2. Assume that \(\bd \neq \bzero\) is a feasible descent direction.

  3. Then there is an \(\epsilon_1 > 0\) such that \(\bx^* + t \bd \in C\) for every \(t \in [0, \epsilon_1]\).

  4. Also, \(f'(\bx^*; \bd) < 0\).

  5. By Definition 9.70, there exists \(\epsilon_2 > 0\) such that

    \[ \frac{f(\bx^* + t \bd) - f(\bx^*)}{t} < 0 \quad \Forall 0 < t < \epsilon_2. \]
  6. Equivalently, \(f(\bx^* + t \bd) < f(\bx^*)\) for all \(0 < t < \epsilon_2\).

  7. Let \(\epsilon = \min(\epsilon_1, \epsilon_2)\).

  8. Then for every \(t \in (0, \epsilon)\), we have \(\bx^* + t \bd \in C\) and \(f(\bx^* + t \bd) < f(\bx^*)\).

  9. This contradicts the hypothesis that \(\bx^*\) is a local minimizer.

  10. Hence, there are no feasible descent directions at \(\bx^*\).

10.8.1.2. Necessary Conditions for Local Optimality#

Revising the problem (10.18):

  1. A constraint \(g_i\) is called active at \(\bx\) if \(g_i(\bx) = 0\).

  2. A constraint \(g_i\) is called inactive at \(\bx\) if \(g_i(\bx) < 0\).

  3. The set of active constraints at a point \(\bx\) is denoted by

    \[ \AAA(\bx) = \{ i \in 1,\dots,m \ST g_i(\bx) = 0 \}. \]
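
Numerically, the active set is usually computed with a small tolerance, since \(g_i(\bx) = 0\) rarely holds exactly in floating point arithmetic. A brief illustrative sketch (the constraint functions below are assumed examples):

```python
# Illustrative sketch: compute the active set A(x) = {i : g_i(x) = 0}
# up to a numerical tolerance.
import numpy as np

def active_set(g_list, x, tol=1e-8):
    return [i for i, g in enumerate(g_list) if abs(g(x)) <= tol]

g_list = [lambda x: x[0] ** 2 + x[1] ** 2 - 1,   # g_1(x) = ||x||^2 - 1
          lambda x: -x[0]]                       # g_2(x) = -x_1
x = np.array([1.0, 0.0])
print(active_set(g_list, x))   # [0]: only g_1 is active at (1, 0)
```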

We first restate the Lemma 10.7 for the minimization with inequality constraints problem (10.18). The key idea is the lack of feasible descent directions at a local minimizer.

  1. \(f'(\bx; \bd) < 0\) indicates that \(\bd\) is a descent direction.

  2. If a constraint is inactive, then it remains valid in the neighborhood of the local minimizer due to continuity of \(g_i\).

  3. If a constraint is active, then moving in some directions will lead to invalidation of the constraint while moving in other directions will keep the constraint valid.

  4. In particular, if \(g_i'(\bx; \bd) < 0\), then moving along \(\bd\) keeps the \(i\)-th active constraint valid.

  5. Hence, a direction along which the directional derivatives of the cost function and all the active constraint functions are negative is a feasible descent direction.

Lemma 10.8 (Local minimum and feasible descent directions for inequality constraints)

Let \(\bx^*\) be a local minimizer of the optimization problem (10.18):

\[\begin{split} & \text{minimize } & & f(\bx) & \\ & \text{subject to } & & g_i(\bx) \leq 0, i=1,\dots,m & \end{split}\]

where \(f, g_1, \dots, g_m : \VV \to \RR\) are continuously differentiable functions over \(\VV\). Let \(\AAA(\bx^*)\) denote the set of active constraints:

\[ \AAA(\bx^*) = \{ i \ST g_i(\bx^*) = 0 \}. \]

Then there doesn’t exist a vector \(\bd \in \VV\) such that

\[\begin{split} & \langle \bd, \nabla f(\bx^*) \rangle = f'(\bx^*; \bd) < 0,\\ & \langle \bd, \nabla g_i(\bx^*) \rangle = g_i'(\bx^*; \bd) < 0, \quad i \in \AAA(\bx^*). \end{split}\]

This result states that the infeasibility of a certain system of strict inequalities is a necessary condition for local optimality.

Proof. We prove this by contradiction.

  1. Let \(\bx^*\) be a local minimizer.

  2. Assume that \(\bd \neq \bzero\) is a direction satisfying the constraints above.

  3. Then there exists an \(\epsilon_0 > 0\) such that \(f(\bx^* + t \bd) < f(\bx^*)\) for every \(t \in (0, \epsilon_0)\).

  4. Similarly, there exist \(\epsilon_i > 0\) such that \(g_i(\bx^* + t \bd) < g_i(\bx^*) = 0\) for every \(t \in (0, \epsilon_i)\) for every \(i \in \AAA(\bx^*)\).

  5. Let \(\epsilon = \min \{\epsilon_i \ST i \in \{ 0 \} \cup \AAA(\bx^*) \}\).

  6. Then for every \(t \in (0, \epsilon)\), we have \(f(\bx^* + t \bd) < f(\bx^*)\) and \(g_i (\bx^* + t \bd) < 0\) for every \(i \in \AAA(\bx^*)\).

  7. By the continuity of \(g_i\) for all \(i\), and the fact that \(g_i(\bx^*) < 0\) for every \(i \notin \AAA(\bx^*)\), there exists a \(\delta > 0\) such that for every \(t \in (0, \delta)\), \(g_i(\bx^* + t \bd) < 0\) for every \(i \notin \AAA(\bx^*)\).

  8. Hence, we conclude that for every \(t \in (0, \min(\epsilon, \delta))\), we have \(f(\bx^* + t \bd) < f(\bx^*)\) and \(g_i (\bx^* + t \bd) < 0\) for every \(i \in 1,\dots,m\).

  9. But this contradicts the local optimality of \(\bx^*\).

10.8.1.3. Fritz-John Conditions#

Recall that Farkas’ and Gordan’s theorems of the alternative present different pairs of systems where if one system is infeasible then the other must be feasible and vice versa. We can apply Gordan’s theorem to the infeasible system of strict inequalities in Lemma 10.8 to develop the so-called Fritz-John conditions.

Theorem 10.77 (Fritz-John conditions)

Let \(\bx^*\) be a local minimizer of the optimization problem

\[\begin{split} & \text{minimize } & & f(\bx) & \\ & \text{subject to } & & g_i(\bx) \leq 0, i=1,\dots,m & \end{split}\]

where \(f, g_1, \dots, g_m : \VV \to \RR\) are continuously differentiable functions over \(\VV\). Then there exist nonnegative scalar multipliers \(t_0, t_1, \dots, t_m \geq 0\) which are not all zero such that

\[\begin{split} & t_0 \nabla f(\bx^*) + \sum_{i=1}^m t_i \nabla g_i(\bx^*) = \bzero, \\ & t_i g_i(\bx^*) = 0, i=1, \dots, m. \end{split}\]

Proof. From Lemma 10.8, we have that the following system is infeasible.

\[\begin{split} & \langle \bd, \nabla f(\bx^*) \rangle < 0,\\ & \langle \bd, \nabla g_i(\bx^*) \rangle < 0, \quad i \in \AAA(\bx^*). \end{split}\]
  1. Let \(n = \dim \VV\).

  2. Let there be an isomorphic mapping between \(\VV\) and \(\RR^n\).

  3. For every \(\bx \in \VV\), let \(\bx\) also denote the corresponding vector in \(\RR^n\).

  4. Assume that there are \(k\) active constraints at \(\bx^*\).

  5. Construct a \((k+1) \times n\) matrix \(\bA\) as follows:

    \[\begin{split} \bA = \begin{bmatrix} \nabla f(\bx^*)^T \\ \nabla g_{i_1}(\bx^*)^T \\ \vdots \\ \nabla g_{i_k}(\bx^*)^T \end{bmatrix} \end{split}\]

    where \(i_1, \dots, i_k\) denote the indices of active constraints.

  6. Then the above system of strict inequalities can be stated as \(\bA \bd \prec \bzero\).

  7. This system of strict inequalities is infeasible.

  8. Then by Gordan's theorem, the system

    \[ \bt \neq \bzero, \bA^T \bt = \bzero, \bt \succeq \bzero \]

    where \(\bt \in \RR^{k+1}\) is feasible.

  9. We write \(\bt =(t_0, t_{i_1}, \dots, t_{i_k})\).

  10. The equation \(\bA^T \bt = \bzero\) expands to

    \[ t_0 \nabla f(\bx^*) + \sum_{i \in \AAA(\bx^*)} t_i \nabla g_i(\bx^*) = \bzero. \]
  11. \(\bt \neq \bzero\) means that at least one of \(t_0, t_{i_1}, \dots, t_{i_k} \neq 0\).

  12. \(\bt \succeq \bzero\) means that \(t_0, t_{i_1}, \dots, t_{i_k} \geq 0\).

  13. Now, let \(t_i = 0\) for all remaining \(i \notin \AAA(\bx^*)\).

  14. Then for active constraints, we have \(g_i(\bx^*) = 0\) and for inactive constraints, we have \(t_i = 0\).

  15. Hence for all constraints, we have \(t_i g_i(\bx^*) = 0\).

  16. Hence there exist nonnegative scalar multipliers \(t_0, t_1, \dots, t_m \geq 0\) which are not all zero such that

    \[\begin{split} & t_0 \nabla f(\bx^*) + \sum_{i=1}^m t_i \nabla g_i(\bx^*) = \bzero, \\ & t_i g_i(\bx^*) = 0, i=1, \dots, m. \end{split}\]

The key issue with the Fritz-John conditions is that they allow \(t_0 = 0\). The case \(t_0 = 0\) is not particularly useful since it leads to

\[ \sum_{i \in \AAA(\bx^*)} t_i \nabla g_i(\bx^*) = \bzero \]

with \(t_i \geq 0\) and not all \(t_i\) equal to zero. This means that the gradients of the active constraints are linearly dependent.

  1. The case of linearly dependent gradients has nothing to do with the objective function.

  2. A number of points might satisfy the Fritz-John conditions and yet not be local minimum points.

  3. We can modify the Fritz-John conditions and insist that the gradients of the active constraints be linearly independent.

  4. This leads to what are called the KKT conditions.

10.8.1.4. KKT Conditions#

Theorem 10.78 (KKT conditions)

Let \(\bx^*\) be a local minimizer of the optimization problem

\[\begin{split} & \text{minimize } & & f(\bx) & \\ & \text{subject to } & & g_i(\bx) \leq 0, i=1,\dots,m & \end{split}\]

where \(f, g_1, \dots, g_m : \VV \to \RR\) are continuously differentiable functions over \(\VV\). Let \(\AAA(\bx^*)\) denote the set of active constraints:

\[ \AAA(\bx^*) = \{ i \ST g_i(\bx^*) = 0 \}. \]

Assume that the gradients of the active constraints \(\{\nabla g_i(\bx^*) \}_{i \in \AAA(\bx^*)}\) are linearly independent.

Then there exist nonnegative scalar multipliers \(t_1, \dots, t_m \geq 0\) such that

\[\begin{split} & \nabla f(\bx^*) + \sum_{i=1}^m t_i \nabla g_i(\bx^*) = \bzero, \\ & t_i g_i(\bx^*) = 0, i=1, \dots, m. \end{split}\]

Proof. This is a simple extension of Theorem 10.77.

By Fritz-John conditions, there exist nonnegative scalar multipliers \(r_0, r_1, \dots, r_m \geq 0\) which are not all zero such that

\[\begin{split} & r_0 \nabla f(\bx^*) + \sum_{i=1}^m r_i \nabla g_i(\bx^*) = \bzero, \\ & r_i g_i(\bx^*) = 0, i=1, \dots, m. \end{split}\]
  1. If \(r_0 = 0\), then we would have \(\sum_{i \in \AAA(\bx^*)} r_i \nabla g_i(\bx^*) = \bzero\) where the \(r_i\) are nonnegative and not all zero (note that \(r_i = 0\) for every inactive constraint due to \(r_i g_i(\bx^*) = 0\)). This would make the gradients of the active constraints linearly dependent, contradicting our assumption.

  2. Hence, we must have \(r_0 > 0\).

  3. Let \(t_i = \frac{r_i}{r_0}\) for every \(i=1,\dots,m\).

  4. The result follows.

10.8.2. Inequality and Equality Constrained Problems#

We now generalize the KKT conditions to problems which include both inequality constraints and equality constraints

(10.19)#\[\begin{split}& \text{minimize } & & f(\bx) \\ & \text{subject to } & & g_i(\bx) \leq 0, & \quad i=1,\dots,m\\ & & & h_j(\bx) = 0, & \quad j=1,\dots,p\end{split}\]

where \(f, g_1, \dots, g_m, h_1, \dots, h_p : \VV \to \RR\) are continuously differentiable functions over \(\VV\).

The constraint set is given by

\[ C = \{\bx \in \VV \ST g_i(\bx) \leq 0, i=1,\dots,m \text{ and } h_j(\bx) = 0, j=1,\dots,p \}. \]
  1. An equality constraint must always be met at a feasible point. Hence there is no need to distinguish between active and inactive equality constraints. All equality constraints are active.

  2. A constraint of the form \(h_j(\bx) = 0\) can be converted into two inequality constraints \(h_j(\bx) \leq 0\) and \(-h_j(\bx)\leq 0\).

Theorem 10.79 (KKT conditions for problems with smooth inequality and equality constraints)

Let \(\bx^*\) be a local minimizer of the optimization problem (10.19)

\[\begin{split} & \text{minimize } & & f(\bx) \\ & \text{subject to } & & g_i(\bx) \leq 0, & \quad i=1,\dots,m\\ & & & h_j(\bx) = 0, & \quad j=1,\dots,p \end{split}\]

where \(f, g_1, \dots, g_m, h_1, \dots, h_p : \VV \to \RR\) are continuously differentiable functions over \(\VV\). Let \(\AAA(\bx^*)\) denote the set of active inequality constraints:

\[ \AAA(\bx^*) = \{ i \ST g_i(\bx^*) = 0 \}. \]

Assume that the gradients of the active inequality constraints \(\{\nabla g_i(\bx^*) \}_{i \in \AAA(\bx^*)}\) and all the equality constraints \(\{ \nabla h_j(\bx^*) \}_{j=1,\dots,p}\) are linearly independent.

Then there exist nonnegative scalar multipliers \(t_1, \dots, t_m \geq 0\) and real scalar multipliers \(r_1, \dots, r_p \in \RR\) such that

\[\begin{split} & \nabla f(\bx^*) + \sum_{i=1}^m t_i \nabla g_i(\bx^*) + \sum_{j=1}^p r_j \nabla h_j(\bx^*) = \bzero, \\ & t_i g_i(\bx^*) = 0, i=1, \dots, m. \end{split}\]

10.8.2.1. KKT Points and Regular Points#

All the results up to this point define a set of necessary conditions in the form of a system of equations on the constraint functions and their gradients which must be satisfied by every local minimizer of the optimization problem. Besides the local minimizers, other points may also satisfy this system of equations. We now introduce the notion of KKT points which satisfy these equations.

Definition 10.29 (KKT point)

Consider the optimization problem (10.19) where \(f, g_1, \dots, g_m, h_1, \dots, h_p : \VV \to \RR\) are continuously differentiable functions over \(\VV\).

A feasible point \(\bx^*\) is called a KKT point if there exist nonnegative scalar multipliers \(t_1, \dots, t_m \geq 0\) and real scalar multipliers \(r_1, \dots, r_p \in \RR\) such that

\[\begin{split} & \nabla f(\bx^*) + \sum_{i=1}^m t_i \nabla g_i(\bx^*) + \sum_{j=1}^p r_j \nabla h_j(\bx^*) = \bzero, \\ & t_i g_i(\bx^*) = 0, i=1, \dots, m. \end{split}\]

All the necessary KKT conditions so far can be simply restated as: a local minimizer must be a KKT point if the gradients of active inequality constraints and all equality constraints at the point are linearly independent. We introduce the notion of regularity to capture the linear independence aspect.

Definition 10.30 (Regularity)

Consider the optimization problem (10.19) where \(f, g_1, \dots, g_m, h_1, \dots, h_p : \VV \to \RR\) are continuously differentiable functions over \(\VV\).

A feasible point \(\bx^*\) is called regular if the gradients of the active inequality constraints \(\{\nabla g_i(\bx^*) \}_{i \in \AAA(\bx^*)}\) and all the equality constraints \(\{ \nabla h_j(\bx^*) \}_{j=1,\dots,p}\) are linearly independent.

With the terminology of these definitions, Theorem 10.79 reduces to: if a regular point is a local minimizer then it must be a KKT point.

The notion of regularity is a kind of constraint qualification.

Example 10.15 (Regular points, KKT points, Local minimizers)

Consider the problem:

\[\begin{split} & \text{minimize } & & x_1 + x_2 \\ & \text{subject to } & & x_1^2 + x_2^2 = 1. \end{split}\]

The problem structure

  1. The ambient space is \(\RR^2\).

  2. We have the cost function: \(f(\bx) = x_1 + x_2\).

  3. The cost function is smooth and convex, in fact linear.

  4. We don’t have any inequality constraints.

  5. We have one equality constraint.

  6. The equality constraint function is given by \(h(\bx) = x_1^2 + x_2^2 - 1\).

  7. The equality constraint is \(h(\bx) = 0\).

  8. \(h\) is a smooth function but it is not a convex function.

  9. The constraint set is a contour of \(h\).

  10. The set of feasible points is given by \(C = \{ \bx \ST h(\bx ) = 0 \}\).

  11. The constraint set is not convex.

  12. Hence, it is not a convex optimization problem.

  13. However, the constraint set is compact.

  14. Hence, due to Theorem 3.85, the function \(f\) indeed attains a minimum as well as a maximum value on \(C\).

Gradients

  1. We have \(\nabla f(\bx) = (1, 1)\).

  2. We have \(\nabla h(\bx) = (2 x_1, 2 x_2)\).

Irregular points

  1. The KKT conditions are applicable only on regular points.

  2. We first identify the points which are irregular.

  3. The irregular points are points at which the gradients of all active inequality constraints and equality constraints are linearly dependent.

  4. Since we have a single equality constraint, the irregular points are those points at which \(\nabla h(\bx) = \bzero\).

  5. This is given by a single point \((x_1, x_2) = (0, 0)\).

  6. But \((0, 0) \notin C\) since \(h((0,0)) = -1 \neq 0\).

  7. Hence, the constraint set \(C\) doesn’t contain any irregular points.

  8. In other words, every feasible point is a regular point.

  9. Hence the KKT conditions are necessary for local optimality.

  10. In other words, if a point is a local minimizer then it must be a KKT point.

KKT points

  1. To identify the KKT points, we form the Lagrangian

    \[ L(\bx, r) = f(\bx) + r h(\bx) = x_1 + x_2 + r (x_1^2 + x_2^2 -1). \]
  2. The KKT conditions are:

    \[\begin{split} & \nabla_x L(\bx, r) = \nabla f(\bx) + r \nabla h(\bx) = 0,\\ & h(\bx) = 0. \end{split}\]
  3. They expand to

    \[\begin{split} & 1 + 2 r x_1 = 0,\\ & 1 + 2 r x_2 = 0,\\ & x_1^2 + x_2^2 - 1 = 0. \end{split}\]
  4. From the first two equations, we have \(r \neq 0\) and \(x_1 = x_2 = \frac{-1}{2r}\).

  5. Plugging it into the third equation, we get

    \[ \left ( \frac{-1}{2r} \right )^2 + \left ( \frac{-1}{2r} \right )^2 = 1. \]
  6. This simplifies to \(r^2 = \frac{1}{2}\).

  7. Hence, we have \(r = \pm \frac{1}{\sqrt{2}}\).

  8. This gives us two different KKT points \((\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}})\) and \((-\frac{1}{\sqrt{2}}, -\frac{1}{\sqrt{2}})\).

The optimal solution

  1. By compactness, we know that the minimizer does exist.

  2. By regularity, we know that the minimizer must be a KKT point.

  3. We have two candidates available.

  4. We have \(f((-\frac{1}{\sqrt{2}}, -\frac{1}{\sqrt{2}})) = -\sqrt{2}\) and \(f((\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}})) = \sqrt{2}\).

  5. Hence the minimizer is given by

    \[ \bx^* = \left (-\frac{1}{\sqrt{2}}, -\frac{1}{\sqrt{2}} \right ) \]

    as it has the smaller value of \(f\).
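
As a quick numerical cross-check (an illustrative sketch, not part of the text), we can hand this problem to a generic solver; the snippet below assumes scipy is available and uses its SLSQP method, which supports smooth equality constraints.

```python
# Illustrative sketch: numerically solve
#   minimize x1 + x2  subject to  x1^2 + x2^2 = 1
# and compare against the KKT point found above.
import numpy as np
from scipy.optimize import minimize

f = lambda x: x[0] + x[1]
h = lambda x: x[0] ** 2 + x[1] ** 2 - 1        # equality constraint h(x) = 0

res = minimize(f, x0=np.array([0.5, -0.5]),
               constraints=[{"type": "eq", "fun": h}], method="SLSQP")
print(res.x)              # expected: approximately (-0.7071, -0.7071)
print(-1 / np.sqrt(2))    # -0.7071... for comparison
```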

10.8.3. The Convex Case#

We now restrict our attention to the case where the cost and constraint functions are convex. In this case, the KKT conditions are also sufficient.

Theorem 10.80 (Sufficient KKT conditions for convex problems (smooth and convex cost and inequality constraints, affine equality constraints))

Let \(\bx^*\) be a feasible solution of the optimization problem

\[\begin{split} & \text{minimize } & & f(\bx) \\ & \text{subject to } & & g_i(\bx) \leq 0, & \quad i=1,\dots,m\\ & & & h_j(\bx) = 0, & \quad j=1,\dots,p \end{split}\]

where \(f, g_1, \dots, g_m : \VV \to \RR\) are continuously differentiable convex functions over \(\VV\) and \(h_1, \dots, h_p : \VV \to \RR\) are affine functions.

Suppose that there exist nonnegative scalar multipliers \(t_1, \dots, t_m \geq 0\) and real scalar multipliers \(r_1, \dots, r_p \in \RR\) such that

\[\begin{split} & \nabla f(\bx^*) + \sum_{i=1}^m t_i \nabla g_i(\bx^*) + \sum_{j=1}^p r_j \nabla h_j(\bx^*) = \bzero, \\ & t_i g_i(\bx^*) = 0, i=1, \dots, m. \end{split}\]

Then \(\bx^*\) is an optimal solution of the minimization problem above.

Proof. We are given that \(\bx^*\) is a feasible point satisfying the KKT conditions.

  1. Define the function

    \[ s(\bx) = f(\bx) + \sum_{i=1}^m t_i g_i(\bx) + \sum_{j=1}^p r_j h_j(\bx). \]
  2. Since \(f\) and \(g_i\) are convex with \(t_i \geq 0\), and \(h_j\) are affine, hence \(s\) is convex.

  3. Since all of them are continuously differentiable, hence \(s\) is also continuously differentiable.

  4. We have

    \[ \nabla s(\bx) = \nabla f(\bx) + \sum_{i=1}^m t_i \nabla g_i(\bx) + \sum_{j=1}^p r_j \nabla h_j(\bx). \]
  5. We are given that \(\nabla s(\bx^*) = 0\).

  6. By Theorem 10.48, \(\bx^*\) is a minimizer of \(s\) over \(\VV\).

  7. Hence \(s(\bx^*) \leq s(\bx)\) for every \(\bx \in \VV\).

  8. By hypothesis \(t_i g_i(\bx^*) = 0\) for every \(i=1,\dots,m\).

  9. Hence \(\sum_{i=1}^m t_i g_i(\bx^*) = 0\).

  10. Since \(\bx^*\) is a feasible point, hence \(h_j(\bx^*) = 0\) for every \(j=1,\dots,p\).

  11. Hence \(\sum_{j=1}^p r_j h_j(\bx^*) = 0\).

  12. Hence

    \[\begin{split} f(\bx^*) &= f(\bx^*) + \sum_{i=1}^m t_i g_i(\bx^*) + \sum_{j=1}^p r_j h_j(\bx^*)\\ &= s(\bx^*) \\ &\leq s(\bx)\\ &= f(\bx) + \sum_{i=1}^m t_i g_i(\bx) + \sum_{j=1}^p r_j h_j(\bx)\\ &\leq f(\bx). \end{split}\]

    The last inequality comes from the fact that \(t_i \geq 0\), \(g_i(\bx) \leq 0\) and \(h_j(\bx) = 0\) for every feasible \(\bx\).

  13. Hence for every feasible \(\bx\), we have \(f(\bx^*) \leq f(\bx)\).

  14. Hence \(\bx^*\) is an optimal point.
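
A small numerical sketch can illustrate the sufficiency result (the problem and the candidate point below are assumptions made for illustration, not taken from the text). For a convex quadratic cost with a single affine inequality constraint, a feasible point satisfying the KKT conditions is, by the theorem above, optimal.

```python
# Illustrative sketch: for the convex problem
#   minimize (x1 - 1)^2 + (x2 - 2)^2  subject to  x1 + x2 <= 1,
# the point x* = (0, 1) with multiplier t = 2 satisfies the KKT conditions,
# hence it is optimal by the sufficiency theorem above.
import numpy as np

x_star = np.array([0.0, 1.0])
t = 2.0
grad_f = 2 * (x_star - np.array([1.0, 2.0]))   # gradient of the cost at x*
grad_g = np.array([1.0, 1.0])                  # gradient of g(x) = x1 + x2 - 1
g_val = x_star.sum() - 1.0                     # g(x*) = 0, so the constraint is active

print(grad_f + t * grad_g)   # [0. 0.]  (stationarity of the Lagrangian)
print(t * g_val)             # 0.0      (complementary slackness)
```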

10.8.3.1. Slater’s Conditions#

In Theorem 10.79, we saw that the KKT conditions are necessary for the local optimality of a feasible point under the assumption that the point is regular. Regularity was a constraint qualification for the nonconvex smooth optimization problem.

In the convex case, a different condition than regularity can guarantee the necessity of the KKT conditions. It is known as Slater’s condition.

Definition 10.31 (Slater’s conditions)

Let \(g_1, \dots, g_m : \VV \to \RR\) be convex. We say that Slater’s condition is satisfied for a set of convex inequalities

\[ g_i(\bx) \leq 0, \quad i=1,\dots,m \]

if there exists a point \(\widehat{\bx} \in \VV\) such that

\[ g_i(\widehat{\bx}) < 0, \quad i=1,\dots,m. \]

In other words, Slater’s condition requires the existence of a point which strictly satisfies all the convex inequality constraints.

Slater’s condition is much easier to check since it requires the existence of a single point which strictly satisfies all the convex inequalities.
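
One simple numerical way to look for a Slater point (an illustrative sketch; the constraint functions below are assumed examples) is to minimize the worst constraint value \(\max_i g_i(\bx)\); if the value found is negative, the corresponding point strictly satisfies all the inequalities.

```python
# Illustrative sketch: search for a point strictly satisfying all convex
# inequalities by minimizing the worst constraint value max_i g_i(x).
import numpy as np
from scipy.optimize import minimize

g_list = [lambda x: x[0] ** 2 + x[1] ** 2 - 1,   # unit disc
          lambda x: x[0] + x[1] - 1]             # half-plane
worst = lambda x: max(g(x) for g in g_list)

res = minimize(worst, x0=np.array([2.0, 2.0]), method="Nelder-Mead")
print(res.x, worst(res.x))   # a negative worst value => Slater's condition holds
```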

Theorem 10.81 (Necessity of KKT conditions under Slater’s condition)

Let \(\bx^*\) be an optimal solution of the optimization problem

\[\begin{split} & \text{minimize } & & f(\bx) \\ & \text{subject to } & & g_i(\bx) \leq 0, & \quad i=1,\dots,m \end{split}\]

where \(f, g_1, \dots, g_m : \VV \to \RR\) are continuously differentiable functions over \(\VV\). In addition, assume that \(g_1, \dots, g_m\) are convex. Suppose that there exists a point \(\widehat{\bx} \in \VV\) such that

\[ g_i(\widehat{\bx}) < 0, \quad i=1,\dots,m. \]

Then there exist nonnegative scalar multipliers \(t_1, \dots, t_m \geq 0\) such that

\[\begin{split} & \nabla f(\bx^*) + \sum_{i=1}^m t_i \nabla g_i(\bx^*) = \bzero, \\ & t_i g_i(\bx^*) = 0, i=1, \dots, m. \end{split}\]

Proof. This is also derived from the Fritz-John conditions (Theorem 10.77).

By Fritz-John conditions, there exist nonnegative scalar multipliers \(r_0, r_1, \dots, r_m \geq 0\) which are not all zero such that

\[\begin{split} & r_0 \nabla f(\bx^*) + \sum_{i=1}^m r_i \nabla g_i(\bx^*) = \bzero, \\ & r_i g_i(\bx^*) = 0, i=1, \dots, m. \end{split}\]

We need to show that \(r_0 > 0\). After that we can pick \(t_i = \frac{r_i}{r_0}\) for every \(i=1,\dots,m\) to get the desired result.

  1. For contradiction, assume that \(r_0 = 0\).

  2. Then we have

    \[ \sum_{i=1}^m r_i \nabla g_i(\bx^*) = \bzero. \]
  3. By the gradient inequality, we have

    \[ g_i(\bx) \geq g_i(\bx^*) + \langle \bx - \bx^*, \nabla g_i(\bx^*) \rangle, \quad i=1,\dots,m. \]
  4. Specifically, for the point \(\widehat{\bx}\), we have

    \[ 0 > g_i(\widehat{\bx}) \geq g_i(\bx^*) + \langle \widehat{\bx} - \bx^*, \nabla g_i(\bx^*) \rangle, \quad i=1,\dots,m. \]
  5. Multiplying the \(i\)-th inequality by \(r_i \geq 0\) and summing over \(i=1,\dots,m\), we get

    \[ 0 > \sum_{i=1}^m r_i g_i(\bx^*) + \langle \widehat{\bx} - \bx^*, \sum_{i=1}^m r_i \nabla g_i(\bx^*) \rangle. \]

    This inequality is strict since not all \(r_i\) are \(0\) and \(r_0 = 0\).

  6. Since \(\sum_{i=1}^m r_i \nabla g_i(\bx^*) = \bzero\), it reduces to

    \[ 0 > \sum_{i=1}^m r_i g_i(\bx^*). \]
  7. But \(r_i g_i(\bx^*) = 0\) for every \(i=1,\dots,m\). Hence we must have

    \[ \sum_{i=1}^m r_i g_i(\bx^*) = 0. \]
  8. A contradiction. Hence \(r_0 > 0\) must be true.

10.8.4. A Tangent Cones Perspective#

Consider an optimization problem of the form

\[\begin{split} & \text{minimize } & & f(\bx)\\ & \text{subject to } & & h_i(\bx) = 0, i=1,\dots,m. \end{split}\]

Assume that \(f: \VV \to \RR\) and \(h_i : \VV \to \RR\) for \(i=1,\dots,m\) are smooth functions.

The constraint set can be written as

\[ C = \{ \bx \ST h_i(\bx) = 0, i=1,\dots,m \}. \]
  1. Assume that \(\bx^*\) is a minimizer of this problem.

  2. By Theorem 10.72, we must have

    \[ - \nabla f(\bx^*) \in T_C(\bx^*)^{\circ}. \]

Let us now motivate the Lagrange multipliers using a simple problem with linear equality constraints.

  1. Consider the specific case where

    \[ h_i(\bx) = \langle \bx, \ba_i \rangle - b_i. \]
  2. Consider a matrix \(\bA\) which consists of \(\ba_1^T, \dots, \ba_m^T\) as rows.

  3. Put together \(b_1, \dots, b_m\) as a vector \(\bb\).

  4. Then the constraint set can be expressed as

    \[ C = \{ \bx \ST \bA \bx = \bb \}. \]
  5. Assume that \(\bx^* \in C\) is the local minimizer of \(f\).

  6. By Example 10.10,

    \[ T_C(\bx^*) = \{ \bx \ST \bA \bx = \bzero \} \]

    which is the nullspace of \(\bA\).

  7. By Example 9.18,

    \[ T_C(\bx^*)^{\circ} = T_C(\bx^*)^{\perp} = \range \bA^T. \]
  8. Hence, by the optimality condition, we have

    \[ - \nabla f(\bx^*) \in \range \bA^T. \]
  9. Hence, there exists \(\bt^* \in \RR^m\) such that

    \[ \nabla f(\bx^*) + \bA^T \bt^* = 0. \]
  10. This is equivalent to

    \[ \nabla f(\bx^*) + \sum_{i=1}^m t_i^* \ba_i = 0. \]
  11. Since \(\nabla h_i (\bx^*) = \ba_i\), hence this is equivalent to

    \[ \nabla f(\bx^*) + \sum_{i=1}^m t_i^* \nabla h_i (\bx^*) = 0. \]

This motivates us to define the Lagrangian of \(f\) as

\[ L(\bx, \bt) = f(\bx) + \sum_{i=1}^m t_i h_i(\bx). \]

The basic Lagrangian theorem states that under suitable conditions, if \(\bx^*\) is a local minimum of \(f\) under the constraint set \(C\) then there exist scalars \(t_1^*, \dots, t_m^*\) called Lagrangian multipliers such that

\[ \nabla f(\bx^*) + \sum_{i=1}^m t_i^* \nabla h_i (\bx^*) = 0. \]
  1. There are \(n\) unknowns in \(\bx^*\) and \(m\) unknowns in \(\bt^*\).

  2. Thus, we have a total of \(m + n\) unknowns.

  3. The relation above gives us a system of \(n\) equations.

  4. Together with the \(m\) equalities \(h_i(\bx^*) = 0\), we have a system of \(m + n\) equations with \(m + n\) unknowns.

  5. Thus, the constrained optimization problem is transformed into the problem of solving a system of nonlinear equations (a small numerical sketch follows this list).

  6. Now, suppose that the tangent cone at \(\bx^*\) can be written as

    \[ T_C(\bx^*) = \{\bx \ST \langle \bx, \nabla h_i(\bx^*) \rangle = 0, i=1,\dots,m \}. \]
  7. Letting \(\ba_i = \nabla h_i(\bx^*)\) and following the argument above, we must have

    \[ \nabla f(\bx^*) + \sum_{i=1}^m t_i^* \nabla h_i (\bx^*) = 0. \]
  8. Thus, if the tangent cone can be represented as above and \(\bx^*\) is a local minimizer, then the Lagrangian multipliers \(t_1^*, \dots, t_m^*\) must exist.

  9. The admittance of Lagrange multipliers at a given \(\bx^* \in C\) is a property of the constraint set \(C\). It is not a property of the optimization problem itself. If \(C\) admits Lagrange multipliers at \(\bx^*\), then there exists a Lagrange multiplier vector at \(\bx^*\) for every smooth cost function \(f\) for which \(\bx^*\) is a local minimizer.
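
To make the system of \(m + n\) equations described above concrete, here is a minimal numerical sketch (not part of the text). It assumes the quadratic cost \(f(\bx) = \frac{1}{2} \| \bx \|^2\) and linear equality constraints \(\bA \bx = \bb\); stationarity \(\bx + \bA^T \bt = \bzero\) together with feasibility \(\bA \bx = \bb\) is then a linear system in the \(n + m\) unknowns \((\bx, \bt)\).

```python
# Illustrative sketch: solve the Lagrangian system for
#   minimize 0.5 * ||x||^2  subject to  A x = b.
import numpy as np

A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])          # m = 2 constraints, n = 3 variables
b = np.array([1.0, 2.0])
m, n = A.shape

# Block system  [[I, A^T], [A, 0]] [x; t] = [0; b]
K = np.block([[np.eye(n), A.T],
              [A, np.zeros((m, m))]])
rhs = np.concatenate([np.zeros(n), b])
sol = np.linalg.solve(K, rhs)
x_star, t_star = sol[:n], sol[n:]
print(x_star)           # the minimizer, here (0, 1, 1)
print(A @ x_star - b)   # ~ 0, the constraints are satisfied
print(t_star)           # the Lagrange multipliers
```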

10.8.5. Enhanced Fritz-John Conditions#

We now introduce a more difficult optimization problem

(10.20)#\[\begin{split}& \text{minimize } & & f(\bx)\\ & \text{subject to } & & \bx \in C\end{split}\]

where the constraint set \(C\) consists of equality and inequality constraints as well as an additional abstract set constraint \(X\):

(10.21)#\[C = X \cap \{\bx \ST g_i(\bx) \leq 0, i=1,\dots,m\} \cap \{\bx \ST h_j(\bx) = 0, j=1,\dots,p\}.\]

We assume that \(f\), \(g_i\) and \(h_j\) are smooth functions from \(\VV\) to \(\RR\) and \(X\) is a nonempty closed set.

We also introduce \(g: \VV \to \RR^m\) as

\[ g(\bx) = (g_1(\bx), \dots, g_m(\bx)) \]

and \(h : \VV \to \RR^p\) as

\[ h(\bx) = (h_1(\bx), \dots, h_p(\bx)). \]

For every \(i=1,\dots,m\), we define

\[ g_i^+(\bx) = \max \{0, g_i(\bx) \}. \]

Correspondingly, we define

\[ g^+(\bx) = (g_1^+(\bx), \dots, g_m^+(\bx)). \]

Definition 10.32 (Lagrangian function)

For the optimization problem (10.20), the Lagrangian function is defined as

\[ L(\bx, \bt, \br) = f(\bx) + \sum_{i=1}^m t_i g_i(\bx) + \sum_{j=1}^p r_j h_j(\bx) \]

where \(\bt \in \RR^m\) and \(\br \in \RR^p\).

10.8.5.1. Lagrange Multiplier Vectors#

Definition 10.33 (Lagrange multiplier vectors)

We say that a constraint set \(C\) as defined in (10.21) admits Lagrange multipliers at a point \(\bx^* \in C\) if for every smooth cost function \(f\) for which \(\bx^*\) is a local minimum of the problem (10.20), there exist vectors \(\bt^* = (t_1^*, \dots, t_m^*)\) and \(\br^* = (r_1^*, \dots, r_p^*)\) that satisfy the following conditions:

(10.22)#\[\left \langle \by, \left ( \nabla f(\bx^*) + \sum_{i=1}^m t_i^* \nabla g_i(\bx^*) + \sum_{j=1}^p r_j^* \nabla h_j(\bx^*) \right ) \right \rangle \geq 0, \quad \Forall \by \in T_X(\bx^*),\]
(10.23)#\[t_i^* \geq 0, \Forall i=1,\dots,m,\]
(10.24)#\[t_i^* = 0, \Forall i \ST g_i(\bx^* ) < 0.\]

A pair \((\bt^*, \br^*)\) satisfying these conditions is called a Lagrange multiplier vector corresponding to \(f\) and \(\bx^*\).

  1. We also refer to a Lagrange multiplier vector simply as Lagrange multipliers.

  2. The condition (10.23) is the nonnegativity condition of the Lagrangian multipliers for the inequality constraints.

  3. The condition (10.24) is the complementary slackness condition.

  4. From (10.22), we can see that each \(\by \in T_X(\bx^*)\) restricts the Lagrange multiplier vectors corresponding to a given \(f\) and \(\bx^*\) to a closed half space.

  5. Hence, the set of Lagrange multiplier vectors corresponding to a given \(f\) and \(\bx^*\) is an intersection of closed half spaces.

  6. Hence the set of Lagrange multiplier vectors is closed and convex, although it may possibly be empty.

  7. The condition (10.22) is referred to as the Lagrangian stationarity condition. It states that the directional derivative of the Lagrangian function with respect to \(\bx\) is nonnegative along every tangent direction of \(X\) at the local minimizer.

  8. It can be viewed as the necessary condition for \(\bx^*\) to be a local minimizer of the function \(L(\bx, \bt^*, \br^*)\). See Theorem 10.72.

  9. When \(X = \VV\), then \(T_X(\bx^*) = \VV\), and this condition reduces to

    \[ \nabla f(\bx^*) + \sum_{i=1}^m t_i^* \nabla g_i(\bx^*) + \sum_{j=1}^p r_j^* \nabla h_j(\bx^*) = \bzero. \]
  10. When \(X\) is convex, then (10.22) reduces to

    \[ \left \langle \bx - \bx^*, \left ( \nabla f(\bx^*) + \sum_{i=1}^m t_i^* \nabla g_i(\bx^*) + \sum_{j=1}^p r_j^* \nabla h_j(\bx^*) \right ) \right \rangle \geq 0, \quad \Forall \bx \in X. \]
    1. \(X\) is convex. Hence \(\alpha (\bx - \bx^*)\) for \(\alpha > 0\) is a feasible direction of \(X\) at \(\bx^*\) for every \(\bx \in X\).

    2. Hence, if this inequality holds, then (10.22) holds for every \(\bd \in F_X(\bx^*)\).

    3. Since \(X\) is convex, hence \(\closure F_X(\bx^*) = T_X(\bx^*)\).

    4. If the inequality (10.22) holds for every \(\bd \in F_X(\bx^*)\), then it will also hold for every closure point of \(F_X(\bx^*)\).

    5. In other words, if \(\langle \bx, \ba \rangle \geq 0\) for every \(\bx \in A\), then for any convergent sequence \(\{ \bx_k \}\) of \(A\), we have \(\lim_{k \to \infty} \langle \bx_k, \ba \rangle \geq 0\). Hence the inequality holds for every closure point also.

  11. The Lagrangian stationary condition can also be equivalently written as

    \[ -\left ( \nabla f(\bx^*) + \sum_{i=1}^m t_i^* \nabla g_i(\bx^*) + \sum_{j=1}^p r_j^* \nabla h_j(\bx^*) \right ) \in T_X(\bx^*)^{\circ}. \]

    In other words, the negative gradient of the Lagrangian function must lie in the polar cone of the tangent cone of \(X\) at \(\bx^*\).

  12. Recall from Theorem 10.68 that

    \[ T_X(\bx^*)^{\circ} \subseteq \tilde{N}_X(\bx^*) \]

    where \(\tilde{N}_X(\bx^*)\) is the normal cone of \(X\) at \(\bx^*\) (in the sense of [9]).

  13. Hence the negative gradient of the Lagrangian function must be a normal direction (Definition 10.26) at \(\bx^*\).

  14. If \(X\) is regular at \(\bx^*\) (Theorem 10.70), then we also have

    \[ T_X(\bx^*)^{\circ} = \tilde{N}_X(\bx^*). \]

10.8.5.2. Enhanced Fritz-John Conditions#

We are now ready to present the Enhanced Fritz-John conditions as the necessary conditions for the existence of the local minimizer of the problem (10.20).

Theorem 10.82 (Enhanced Fritz-John conditions)

Let \(\bx^*\) be a local minimizer of the problem (10.20)-(10.21). Then there exist scalars \(t_0^*,t_1^*, \dots, t_m^*, r_1^*,\dots, r_p^*\) satisfying the following conditions.

  1. The gradients satisfy the relation:

    \[ -\left ( t_0^* \nabla f(\bx^*) + \sum_{i=1}^m t_i^* \nabla g_i(\bx^*) + \sum_{j=1}^p r_j^* \nabla h_j(\bx^*) \right ) \in \tilde{N}_X(\bx^*). \]
  2. Nonnegativity: \(t_i^* \geq 0\) for every \(i=0,\dots,m\).

  3. The scalars \(t_0^*,t_1^*, \dots, t_m^*, r_1^*,\dots, r_p^*\) are not all equal to \(0\).

  4. Complementary violation condition: If the index set \(I \cup J\) is not empty, where

    \[ I = \{i > 0 \ST t_i^* > 0 \}, \quad J = \{j \ST r_j^* \neq 0 \}, \]

    there exists a sequence \(\{ \bx_k \}\) of \(X\) that converges to \(\bx^*\) and is such that for all \(k\)

    \[\begin{split} & f(\bx_k) < f(\bx^*),\\ & t_i^* g_i (\bx_k) > 0 \Forall i \in I,\\ & r_j^* h_j(\bx_k) > 0 \Forall j \in J,\\ & g_i^+(\bx_k) = o(w(\bx_k)) \Forall i \notin I,\\ & |h_j(\bx_k) | = o(w(\bx_k)) \Forall j \notin J,\\ \end{split}\]

    where \(g_i^+(\bx) = \max \{ 0, g_i(\bx) \}\) and

    \[ w(\bx) = \min \left \{ \min_{i \in I} g_i^+(\bx), \min_{j \in J} |h_j(\bx) | \right \}. \]

Proof. The proof is based on a quadratic penalty function approach. For each \(k \in \Nat\), we define a penalty function as

\[ F^k(\bx) = f(\bx) + \frac{k}{2} \sum_{i=1}^m (g_i^+(\bx))^2 + \frac{k}{2} \sum_{j=1}^p (h_j(\bx))^2 + \frac{1}{2} \| \bx - \bx^* \|^2. \]
  1. At a feasible point \(g_i^+(\bx) = 0\) for every \(i=1,\dots,m\) since \(g_i(\bx) \leq 0\).

  2. At a feasible point \(h_j(\bx) = 0\) for every \(j=1,\dots,p\).

  3. Hence \(F^k(\bx) = f(\bx) + \frac{1}{2} \| \bx - \bx^* \|^2\) at every feasible point \(\bx \in C\).

  4. The term \(\frac{1}{2} \| \bx - \bx^* \|^2\) is a quadratic penalty term penalizing how far we are from the local minimum \(\bx^*\).

  5. The term \(\frac{k}{2} (g_i^+(\bx))^2\) is a penalty term denoting how strongly the \(i\)-th inequality constraint is violated.

  6. The term \(\frac{k}{2} (h_j(\bx))^2\) is a penalty term denoting how strongly the \(j\)-th equality constraint is violated.

  7. We have \(F^k(\bx) \geq f(\bx)\) for every \(\bx \in \VV\).

  8. At the local minimizer, we have

    \[ F^k(\bx^*) = f(\bx^*). \]
  9. \(F^k\) is a continuously differentiable function.

  10. We note that

    \[\begin{split} \nabla (g_i^+(\bx))^2 = 2 g_i^+(\bx) \nabla g_i(\bx);\\ \nabla (h_j(\bx))^2 = 2 h_j(\bx) \nabla h_j(\bx). \end{split}\]
  11. Hence

    \[ \nabla F^k(\bx) = \nabla f(\bx) + k \sum_{i=1}^m g_i^+(\bx) \nabla g_i(\bx) + k \sum_{j=1}^p h_j(\bx) \nabla h_j(\bx) + (\bx - \bx^*). \]

We now introduce the penalized problem

\[\begin{split} & \text{minimize } & & F^k(\bx)\\ & \text{subject to } & & \bx \in X \cap S \end{split}\]

where

\[ S = \{ \bx \ST \|\bx - \bx^* \| \leq \epsilon \} \]

and \(\epsilon > 0\) is a positive scalar such that \(f(\bx^*) \leq f(\bx)\) for all feasible \(\bx \in C\) with \(\bx \in S\). Such a positive scalar exists since \(\bx^*\) is a local minimizer of \(f\).

  1. The set \(S\) is compact and the set \(X\) is closed.

  2. Hence \(X \cap S\) is compact.

  3. Hence there exists an optimal minimizer of the above problem for every \(k\).

  4. Let \(\bx^k\) be a minimizer of \(F^k(\bx)\) over \(X \cap S\).

  5. Then we have \(F^k(\bx^k) \leq F^k(\bx^*)\) for every \(k\) since \(\bx^* \in X \cap S\).

  6. This is equivalent to

    \[ F^k(\bx^k) = f(\bx^k) + \frac{k}{2} \sum_{i=1}^m (g_i^+(\bx^k))^2 + \frac{k}{2} \sum_{j=1}^p (h_j(\bx^k))^2 + \frac{1}{2} \| \bx^k - \bx^* \|^2 \leq f(\bx^*) \]

    for every \(k\).

  7. Since \(f\) is continuous and \(X \cap S\) is compact, hence \(f(\bx^k)\) is bounded for every \(k\).

  8. It follows that

    \[\begin{split} & \lim_{k \to \infty} g_i^+(\bx^k) = 0, \quad i=1,\dots,m,\\ & \lim_{k \to \infty} |h_j(\bx^k)| = 0, \quad j=1,\dots,p, \end{split}\]

    otherwise the term on the L.H.S. of the previous inequality will become unbounded and tend to \(\infty\) as \(k \to \infty\).

  9. Hence every limit point \(\tilde{\bx}\) of the sequence \(\{ \bx^k \}\) is feasible; i.e., \(\tilde{\bx} \in C\).

  10. Also, since \(X \cap S\) is compact, hence every limit point \(\tilde{\bx}\) of the sequence \(\{ \bx^k \}\) belongs to \(X \cap S\).

  11. From the inequality \(F^k(\bx^k) \leq f(\bx^*)\), we can also see that

    \[ f(\bx^k) + \frac{1}{2} \| \bx^k - \bx^* \|^2 \leq f(\bx^*) \quad \Forall k. \]
  12. Taking the limit as \(k \to \infty\), we obtain

    \[ f(\tilde{\bx}) + \frac{1}{2} \| \tilde{\bx} - \bx^* \|^2 \leq f(\bx^*) \]

    for every limit point \(\tilde{\bx}\).

  13. Since \(\tilde{\bx} \in S\) (near local minimizer) and \(\tilde{\bx} \in C\) (feasible), we have

    \[ f(\bx^*) \leq f(\tilde{\bx}). \]
  14. Combining with the previous inequality, it gives us

    \[ \frac{1}{2} \| \tilde{\bx} - \bx^* \|^2 = 0. \]
  15. Hence, we must have \(\bx^* = \tilde{\bx}\). Hence, the sequence \(\{ \bx^k \}\) has only one limit point.

  16. Thus, the sequence \(\{ \bx^k \}\) converges to \(\bx^*\).

  17. By the definition of the closed ball \(S\), \(\bx^*\) is an interior point of \(S\).

  18. Since \(\lim \bx^k = \bx^*\) it follows that \(\bx^k\) is an interior point of \(S\) for every \(k\) greater than some \(k_0\).

  19. Hence, due to Theorem 10.72,

    \[ - \nabla F^k(\bx^k) \in T_X(\bx^k)^{\circ} \]

    holds true for every \(k > k_0\).

  20. We can write \(\nabla F^k (\bx^k)\) as

    \[ \nabla F^k(\bx^k) = \nabla f(\bx^k) + \sum_{i=1}^m \chi^k_i \nabla g_i(\bx^k) + \sum_{j=1}^p \xi^k_j \nabla h_j(\bx^k) + (\bx^k - \bx^*) \]

    where \(\chi^k_i = k g_i^+(\bx^k)\) and \(\xi^k_j = k h_j(\bx^k)\).

  21. Note that by definition \(\chi^k_i \geq 0\) for every \(i=1,\dots,m\).

  22. Accordingly, we have

    \[ - \left ( \nabla f(\bx^k) + \sum_{i=1}^m \chi^k_i \nabla g_i(\bx^k) + \sum_{j=1}^p \xi^k_j \nabla h_j(\bx^k) + (\bx^k - \bx^*) \right ) \in T_X(\bx^k)^{\circ} \]

    for every \(k > k_0\).

  23. We define

    \[ \delta^k = \sqrt{1 + \sum_{i=1}^m (\chi^k_i)^2 + \sum_{j=1}^p (\xi^k_j)^2 }. \]
  24. By definition \(\delta^k \geq 1\).

  25. We now introduce

    \[\begin{split} & t_0^k = \frac{1}{\delta^k},\\ & t_i^k = \frac{\chi^k_i}{\delta^k}, i=1,\dots,m,\\ & r_j^k = \frac{\xi^k_j}{\delta^k}, j=1,\dots,p. \end{split}\]
  26. By dividing by \(\delta^k\) in the previous relation, we obtain

    \[ \bz^k = - \left ( t_0^k \nabla f(\bx^k) + \sum_{i=1}^m t_i^k \nabla g_i(\bx^k) + \sum_{j=1}^p r_j^k \nabla h_j(\bx^k) + \frac{1}{\delta^k}(\bx^k - \bx^*) \right ) \in T_X(\bx^k)^{\circ} \]

    for every \(k > k_0\) since \(T_X(\bx^k)^{\circ}\) is a cone.

  27. Note that by construction, we have

    \[ (t_0^k)^2 + \sum_{i=1}^m (t_i^k)^2 + \sum_{j=1}^p (r_j^k)^2 = 1. \]
  28. Hence the sequence \(\{ (t_0^k, t_1^k, \dots, t_m^k, r_1^k, \dots, r_p^k) \}\) is a bounded sequence of \(\RR^{1 + m + p}\).

  29. Hence, it must have a subsequence that converges to some limit \((t_0^*, t_1^*, \dots, t_m^*, r_1^*, \dots, r_p^*)\).

  30. Let

    \[ \bz^* = - \left ( t_0^* \nabla f(\bx^*) + \sum_{i=1}^m t_i^* \nabla g_i(\bx^*) + \sum_{j=1}^p r_j^* \nabla h_j(\bx^*) \right ). \]
  31. Along this subsequence, we have \(\bx^k \to \bx^*\), \(\bz^k \to \bz^*\) and \(\bz^k \in T_X(\bx^k)^{\circ}\) for every \(k > k_0\).

  32. Hence, following the definition of the normal cone (Definition 10.27), after disregarding the first \(k_0\) terms of the sequences,

    \[ \bz^* \in \tilde{N}_X(\bx^*). \]

(2) Nonnegativity

  1. Since \(\delta^k \geq 1\), hence \(0 < t_0^k = \frac{1}{\delta^k} \leq 1\).

  2. Hence \(t_0^* \geq 0\).

  3. Since \(\chi^k_i \geq 0\) for every \(i=1,\dots,m\), hence \(t_i^k \geq 0\) for every \(i=1,\dots,m\).

  4. Hence \(t_i^* \geq 0\) for every \(i=1,\dots,m\).

(3) Not all zero

  1. We have established that for every \(k\),

    \[ (t_0^k)^2 + \sum_{i=1}^m (t_i^k)^2 + \sum_{j=1}^p (r_j^k)^2 = 1. \]
  2. Taking the limit over the subsequence, we have

    \[ (t_0^*)^2 + \sum_{i=1}^m (t_i^*)^2 + \sum_{j=1}^p (r_j^*)^2 = 1. \]
  3. Hence, all of them cannot be zero.

(4) Complementary violation conditions

  1. Assume that \(I \cup J\) is not empty.

  2. Let \(\bKKK = \{ k_1, k_2, \dots \}\) denote the index set of the convergent subsequence of \(\{ (t_0^k, t_1^k, \dots, t_m^k, r_1^k, \dots, r_p^k) \}\).

  3. Let \(i \in I\).

    1. We have \(t_i^* > 0\).

    2. Then for all sufficiently large \(k\) in \(\bKKK\), we have \(t_i^k t_i^* > 0\) as \(t_i^k\) should be sufficiently close to \(t_i^*\).

    3. From the definitions of \(t_i^k\) and \(\chi_i^k\), we must have \(t_i^k g_i^+(\bx^k) > 0\) for all sufficiently large \(k\) in \(\bKKK\).

    4. By definition, \(g_i^+(\bx^k) \geq 0\) and \(g_i^+(\bx^k) > 0\) when \(g_i(\bx^k) > 0\).

    5. Hence we must have \(t_i^k g_i(\bx^k) > 0\) for all sufficiently large \(k\) in \(\bKKK\).

  4. Let \(j \in J\).

    1. We have \(r_j^* \neq 0\).

    2. Then for all sufficiently large \(k\) in \(\bKKK\), we have \(r_j^k \approx r_j^*\). Hence \(r_j^k \neq 0\) and \(r_j^k\) has the same sign as \(r_j^*\).

    3. Hence for all sufficiently large \(k\) in \(\bKKK\), we have \(r_j^k r_j^* > 0\).

    4. From the definitions of \(r_j^k\) and \(\xi_j^k\), we see that for all sufficiently large \(k\) in \(\bKKK\), we must have \(r_j^k h_j(\bx^k) > 0\).

  5. Hence for all sufficiently large \(k\) in \(\bKKK\), we have

    \[ t_i^k g_i(\bx^k) > 0 \Forall i \in I \text{ and } r_j^k h_j(\bx^k) > 0 \Forall j \in J. \]
  6. This means that \( g_i^+(\bx^k) > 0\) for every \(i \in I\) and \(h_j(\bx^k) \neq 0\) for every \(j \in J\) for all sufficiently large \(k\) in \(\bKKK\).

  7. This establishes that for all sufficiently large \(k\) in \(\bKKK\), we have \(\bx^k \neq \bx^*\). Otherwise the inequality and equality constraints cannot be violated.

  8. Recall that we established that

    \[ F^k(\bx^k) = f(\bx^k) + \frac{k}{2} \sum_{i=1}^m (g_i^+(\bx^k))^2 + \frac{k}{2} \sum_{j=1}^p (h_j(\bx^k))^2 + \frac{1}{2} \| \bx^k - \bx^* \|^2 \leq f(\bx^*) \]

    for every \(k\).

  9. Hence for all sufficiently large \(k\) in \(\bKKK\), we must have

    \[ f(\bx^k) < f(\bx^*) \]

    since at least one of \((g_i^+(\bx^k))^2\) and \((h_j(\bx^k))^2\) is positive for every sufficiently large \(k\).

  10. We form the desired sequence \(\{ \bx_l \}\) satisfying all the necessary criteria by picking up all the entries corresponding to sufficiently large \(k\) in \(\bKKK\) from \(\{ \bx^k \}\).

  11. It remains to show the magnitude property of terms \(h_j(\bx^k)\) and \(g_i^+(\bx^k)\).

  12. For the remaining argument, without loss of generality, we shall assume that \(\{ \bx^k \}\) is the subsequence chosen above.

TBD the argument below is not complete.

  1. We see that

    \[\begin{split} (\delta^k)^2 &= 1 + \sum_{i=1}^m (\chi^k_i)^2 + \sum_{j=1}^p (\xi^k_j)^2 \\ &= 1 + \sum_{i=1}^m (k g_i^+(\bx^k))^2 + \sum_{j=1}^p (k h_j(\bx^k))^2 \\ & \geq 1 + k^2 w(\bx^k)^2. \end{split}\]
  2. For every \(i \notin I\), we have \(t_i^* = 0\).

  3. Hence \(\lim_{k \to \infty} t_i^k = 0\).

  4. Hence \(\lim_{k \to \infty} \frac{\chi^k_i}{\delta^k} = 0\).

  5. Hence

    \[ \lim_{k \to \infty} \frac{(\chi^k_i)^2}{1 + \sum_{i=1}^m (\chi^k_i)^2 + \sum_{j=1}^p (\xi^k_j)^2} = 0. \]
  6. Hence

    \[ \lim_{k \to \infty} \frac{(k g_i^+(\bx^k))^2}{1 + \sum_{i=1}^m (k g_i^+(\bx^k))^2 + \sum_{j=1}^p (k h_j(\bx^k))^2} = 0. \]
  7. We have

    \[ 0 \leq \frac{(k g_i^+(\bx^k))^2}{1 + \sum_{i=1}^m (k g_i^+(\bx^k))^2 + \sum_{j=1}^p (k h_j(\bx^k))^2} \leq \frac{(k g_i^+(\bx^k))^2}{1 + k^2 w(\bx^k)^2}. \]

The complementary violation condition (CV) in (4) is stronger than the traditional complementary slackness (CS) condition.

  1. If \(t_i^* > 0\) for some inequality constraint \(g_i\), then the inequality \(g_i(\bx) \leq 0\) must be violated arbitrarily close to \(\bx^*\).

  2. In contrast, the CS condition only states that \(g_i(\bx^*) = 0\). It doesn’t say anything about the violation of the constraint near the local optimum.

  3. CV states that there exists a sequence of \(X\) converging to \(\bx^*\) such that the equality and inequality constraints are violated at every point in the sequence corresponding to the nonzero Lagrangian multipliers.

  4. This sequence also has the property that the amount of violation in the constraints corresponding to zero Lagrange multipliers is negligible via the following conditions.

    \[\begin{split} & g_i^+(\bx_k) = o(w(\bx_k)) \Forall i \notin I,\\ & |h_j(\bx_k) | = o(w(\bx_k)) \Forall j \notin J. \end{split}\]

Corollary 10.6 (Enhanced Fritz-John conditions regular case)

In Theorem 10.82, if \(X\) is regular at \(\bx^*\), then

\[ -\left ( t_0^* \nabla f(\bx^*) + \sum_{i=1}^m t_i^* \nabla g_i(\bx^*) + \sum_{j=1}^p r_j^* \nabla h_j(\bx^*) \right ) \in T_X(\bx^*)^{\circ}. \]

Equivalently

\[ \left \langle \by, \left ( t_0^* \nabla f(\bx^*) + \sum_{i=1}^m t_i^* \nabla g_i(\bx^*) + \sum_{j=1}^p r_j^* \nabla h_j(\bx^*) \right) \right \rangle \geq 0 \Forall \by \in T_X(\bx^*). \]

10.8.5.3. Lagrangian Stationarity#

Observation 10.10 (Lagrangian stationarity)

In Theorem 10.82, assume that \(X\) is regular at \(\bx^*\). Further assume that \(t_0^*\) is positive.

By normalizing the scalars (dividing by \(t_0^*\)), the gradient condition reduces to

\[ -\left ( \nabla f(\bx^*) + \sum_{i=1}^m t_i^* \nabla g_i(\bx^*) + \sum_{j=1}^p r_j^* \nabla h_j(\bx^*) \right ) \in T_X(\bx^*)^{\circ}. \]

Equivalently

\[ \left \langle \by, \left ( \nabla f(\bx^*) + \sum_{i=1}^m t_i^* \nabla g_i(\bx^*) + \sum_{j=1}^p r_j^* \nabla h_j(\bx^*) \right ) \right \rangle \geq 0 \Forall \by \in T_X(\bx^*). \]

This is identical to the Lagrangian stationarity condition (1) in Definition 10.33.

Hence if \(X\) is regular at \(\bx^*\) and \(t_0^* = 1\), then the vector \((\bt^*, \br^*) = (t_1^*, \dots, t_m^*, r_1^*, \dots, r_p^*)\) is a Lagrangian multiplier vector that satisfies the CV condition.

10.8.5.4. LICQ#

Observation 10.11 (Linear independence constraint qualification)

In Theorem 10.82, assume that \(X = \VV\). Then \(X\) is regular at \(\bx^*\) and \(T_X(\bx^*) = \VV\).

  1. Condition (1) reduces to

    \[ t_0^* \nabla f(\bx^*) + \sum_{i=1}^m t_i^* \nabla g_i(\bx^*) + \sum_{j=1}^p r_j^* \nabla h_j(\bx^*) = \bzero. \]
  2. Further, add the linear independence constraint qualification (LICQ) assumption at \(\bx^*\); i.e., the gradients \(\nabla g_i(\bx^*), i \in \AAA(\bx^*)\) and \(\nabla h_j(\bx^*), j=1,\dots,p\) are linearly independent.

  3. Then \(t_0^* \neq 0\) must be true.

  4. Otherwise, the linear independence condition would be violated.

  5. It follows that Lagrangian stationarity is satisfied and there exists a Lagrangian multiplier vector.

  6. We thus obtain the classical Lagrangian multiplier theorem based on the KKT conditions (Theorem 10.79).

10.8.6. Informative Lagrange Multipliers#

Lagrange multiplier vectors may provide a significant amount of sensitivity information by indicating which constraints to violate in order to effect a reduction in the objective function value. In general, if the multipliers are informative, then the larger the multiplier of a constraint, the more sensitive the objective function is to violations of that constraint.

Definition 10.34 (Informative Lagrange multiplier vector)

A Lagrange multiplier vector (as defined in Definition 10.33 for the problem (10.20)) is called an informative multiplier vector if it satisfies the complementary violation condition of enhanced Fritz-John conditions Theorem 10.82.

Definition 10.35 (Minimal Lagrange multiplier vector)

Let \(\bx^*\) be a local minimum of the problem (10.20). Let \((\bt^*, \br^*)\) be a Lagrange multiplier vector (as defined in Definition 10.33). Let \(I = \{ i \ST t_i^* > 0 \}\) and \(J = \{ j \ST r_j^* \neq 0 \}\). Then \((\bt^*, \br^*)\) is called a minimal Lagrange multiplier vector if there is no other Lagrange multiplier vector with a support that is strictly contained in \(I \cup J\).

Definition 10.36 (Strong Lagrange multiplier vector)

Let \(\bx^*\) be a local minimum of the problem (10.20). Let \((\bt^*, \br^*)\) be a Lagrange multiplier vector (as defined in Definition 10.33). Let \(I = \{ i \ST t_i^* > 0 \}\) and \(J = \{ j \ST r_j^* \neq 0 \}\). Then \((\bt^*, \br^*)\) is called a strong Lagrange multiplier vector if it satisfies the following condition:

If the set \(I \cup J\) is nonempty, then there exists a sequence \(\{ \bx_k \}\) of \(X\) such that \(\bx_k \to \bx^*\) and for all \(k\) we have:

\[\begin{split} & f(\bx_k) < f(\bx^*); \\ & t_i^* g_i(\bx_k) > 0, \Forall i \in I; \\ & r_j^* h_j(\bx_k) > 0, \Forall j \in J. \end{split}\]

This condition is slightly weaker than the complementary violation condition. It makes no provisions for ensuring that the constraints corresponding to zero multipliers have negligibly small violation at each \(\bx_k\).

  1. Every informative multiplier vector is a strong multiplier vector.

  2. Every minimal multiplier vector is a strong multiplier vector.

  3. Minimal multipliers are not necessarily informative.

  4. Some multiplier vectors are both informative and minimal.

Theorem 10.83 (Existence of informative multipliers)

Let \(\bx^*\) be a local minimum of the problem (10.20). Assume that the tangent cone \(T_X(\bx^*)\) is convex and that the set of Lagrange multiplier vectors is nonempty. Then

  1. The set of informative Lagrange multiplier vectors is nonempty.

  2. The Lagrange multiplier vector which has the minimum norm is informative.

  3. Every minimal Lagrange multiplier vector is strong.

10.8.7. Pseudonormality#

The key issue with the enhanced Fritz-John conditions is that they allow \(t_0^*\) to be zero.

  1. We now introduce the notion of pseudonormal points in the constraint set \(C\).

  2. We show that if a local minimizer \(\bx^*\) is a pseudonormal point, then \(t_0^* \neq 0\).

  3. Hence, at pseudonormal points, the constraint set \(C\) admits Lagrange multiplier vectors.

Definition 10.37 (Pseudonormal vector)

We say that a feasible vector \(\bx^*\) for the problem (10.20) is pseudonormal if one cannot find scalars \(t_1, \dots, t_m, r_1, \dots, r_p\) and a sequence \(\{\bx_k \}\) of \(X\) such that

  1. The gradients of constraint functions at \(\bx^*\) satisfy:

    \[ -\left(\sum_{i=1}^m t_i \nabla g_i(\bx^*) + \sum_{j=1}^p r_j \nabla h_j(\bx^*) \right ) \in \tilde{N}_X(\bx^*). \]
  2. Inequality multipliers are nonnegative: \(t_i \geq 0\) for every \(i=1,\dots,m\).

  3. Multipliers for inactive inequality constraints are zero. \(t_i = 0\) for every \(i \notin \AAA(\bx^*)\) where \(\AAA(\bx^*)\) denotes the set of active inequality constraints at \(\bx^*\);

    \[ \AAA(\bx^*) = \{ i \in 1,\dots,m \ST g_i(\bx^*) = 0 \}. \]
  4. The sequence \(\{ \bx_k \}\) converges to \(\bx^*\) (i.e., \(\bx_k \to \bx^*\)) and

    \[ \sum_{i=1}^m t_i g_i(\bx_k) + \sum_{j=1}^p r_j h_j(\bx_k) > 0, \quad \Forall k. \]

Theorem 10.84 (Pseudonormality and EFJ scalars)

Let \(\bx^*\) be a local minimizer for the problem (10.20). Let \(t_0^*,t_1^*, \dots, t_m^*, r_1^*,\dots, r_p^*\) be the enhanced Fritz-John scalars satisfying the conditions in Theorem 10.82.

If \(\bx^*\) is a pseudonormal vector of the constraint set \(C\) defined in (10.21), then \(t_0^* \neq 0\).

Proof. We shall show this by contradiction.

  1. Assume that \(\bx^*\) is a local minimizer for the problem (10.20).

  2. Then there exist EFJ scalars \(t_0^*,t_1^*, \dots, t_m^*, r_1^*,\dots, r_p^*\) satisfying the conditions in Theorem 10.82.

  3. For contradiction assume that \(\bx^*\) is a pseudonormal vector and \(t_0^* = 0\).

  4. Since \(t_0^*,t_1^*, \dots, t_m^*, r_1^*,\dots, r_p^*\) are EFJ scalars, hence

    \[ -\left ( \sum_{i=1}^m t_i^* \nabla g_i(\bx^*) + \sum_{j=1}^p r_j^* \nabla h_j(\bx^*) \right ) \in \tilde{N}_X(\bx^*). \]
  5. \(t_i^* \geq 0\) for every \(i=1,\dots,m\).

  6. The scalars \(t_1^*, \dots, t_m^*, r_1^*,\dots, r_p^*\) are not equal to \(0\).

  7. Hence the index set \(I \cup J\) is not empty, where

    \[ I = \{i > 0 \ST t_i^* > 0 \}, \quad J = \{j \ST r_j^* \neq 0 \}. \]
  8. Hence there exists a sequence \(\{ \bx_k \}\) of \(X\) that converges to \(\bx^*\) and is such that for all \(k\)

    \[\begin{split} & f(\bx_k) < f(\bx^*),\\ & t_i^* g_i (\bx_k) > 0 \Forall i \in I,\\ & r_j^* h_j(\bx_k) > 0 \Forall j \in J,\\ & g_i^+(\bx_k) = o(w(\bx_k)) \Forall i \notin I,\\ & |h_j(\bx_k) | = o(w(\bx_k)) \Forall j \notin J,\\ \end{split}\]

    where \(g_i^+(\bx) = \max \{ 0, g_i(\bx) \}\) and

    \[ w(\bx) = \min \left \{ \min_{i \in I} g_i^+(\bx), \min_{j \in J} |h_j(\bx) | \right \}. \]
  9. Hence the sequence \(\{ \bx_k \}\) with \(\bx_k \to \bx^*\) satisfies

    \[ \sum_{i=1}^m t_i^* g_i (\bx_k) + \sum_{j=1}^p r_j^* h_j(\bx_k) > 0 \Forall k \]

    since at least one of these multipliers is nonzero.

  10. The multiplier of every inactive inequality constraint must be zero, since otherwise the complementary violation condition would be violated (for an inactive constraint, \(g_i(\bx_k) \to g_i(\bx^*) < 0\) by continuity, so \(t_i^* g_i(\bx_k) > 0\) cannot hold for all \(k\)).

  11. Hence \(t_1^*, \dots, t_m^*, r_1^*,\dots, r_p^*\) are the scalars and \(\{ \bx_k \}\) is a sequence of \(X\) satisfying all the conditions in Definition 10.37.

  12. Hence \(\bx^*\) cannot be a pseudonormal vector of \(C\). This contradicts our assumption, and hence we must have \(t_0^* \neq 0\).

10.8.8. Constraint Qualifications#

We now introduce different constraint qualifications which ensure that a point \(\bx^*\) is pseudonormal.

10.8.8.1. Linearly Independent Constraint Qualifications#

Criterion 10.1 (CQ1: linearly independent constraint qualifications)

\(X = \VV\) and \(\bx^*\) satisfies LICQ; i.e., the active inequality constraint gradients \(\nabla g_i (\bx^*)\) for all \(i \in \AAA(\bx^*)\) and the equality constraint gradients \(\nabla h_j(\bx^*)\) for all \(j=1,\dots,p\) are linearly independent.
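
Numerically, CQ1 can be checked by stacking the gradients of the active inequality constraints and all equality constraints as rows of a matrix and testing whether that matrix has full row rank. A brief illustrative sketch (the gradient values are assumed examples):

```python
# Illustrative sketch: test LICQ at x* via the rank of the stacked gradients.
import numpy as np

grad_g_active = [np.array([2.0, 0.0])]   # gradients of active inequality constraints at x*
grad_h = [np.array([1.0, 1.0])]          # gradients of equality constraints at x*

G = np.vstack(grad_g_active + grad_h)
licq_holds = np.linalg.matrix_rank(G) == G.shape[0]
print(licq_holds)   # True: the gradients are linearly independent
```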

10.8.8.2. AHU/MF Constraint Qualifications#

10.8.8.3. Affine Equality and Concave Inequality Constraints#