7.18. Smoothness#

Primary references for this section are [4].

7.18.1. L-Smooth Functions#

Definition 7.80 (L-smooth functions)

For some \(L \geq 0\), a function \(f : \VV \to \RERL\) is called \(L\)-smooth over a set \(D \subseteq \VV\) if it is differentiable over \(D\) and satisfies

\[ \| \nabla f(\bx) - \nabla f(\by)\|_* \leq L \| \bx - \by \| \Forall \bx, \by \in D. \]

The constant \(L\) is called the smoothness parameter.

  • Since \(f\) is differentiable over \(D\), hence \(D \subseteq \interior \dom f\).

  • If \(f\) is \(L\)-smooth over the entire \(\VV\), we simply say that \(f\) is \(L\)-smooth.

  • \(L\)-smooth functions are also known as functions with Lipschitz gradient with constant \(L\).

  • The class of functions which are \(L\)-smooth over a set \(D\) is denoted by \(C_L^{1,1}(D)\).

  • When \(D = \VV\), then the class is simply denoted as \(C_L^{1,1}\).

  • The class of functions which are \(L\)-smooth for some \(L \geq 0\) (but \(L\) may not be known), is denoted by \(C^{1,1}\).

  • By definition, if a function is \(L_1\)-smooth, then it is \(L_2\)-smooth for every \(L_2 \geq L_1\).

  • Thus, it is often useful to identify the smallest possible value of \(L\) for which a function is \(L\)-smooth.

Example 7.82 (Zero smoothness of affine functions)

Let \(\bb \in \VV^*\) and \(c \in \RR\). Let \(f : \VV \to \RR\) be given by:

\[ f(\bx) = \langle \bx, \bb \rangle + c. \]

Then, \(f\) is \(0\)-smooth.

To see this, we note that \(\nabla f(\bx) = \bb\). Consequently,

\[ \| \nabla f(\bx) - \nabla f(\by) \|_* = \| \bb - \bb \|_* = \| \bzero \|_* = 0 \leq 0 \| \bx - \by \|. \]

Theorem 7.263 (Smoothness of quadratic functions)

Let \(\bA \in \SS^n\), \(\bb \in \RR^n\) and \(c \in \RR\). Assume that \(\RR^n\) is endowed with \(\ell_p\)-norm for some \(1 \leq p \leq \infty\). Let \(f : \RR^n \to \RR\) be given by:

\[ f(\bx) = \frac{1}{2} \bx^T \bA \bx + \bb^T \bx + c. \]

Then, \(f\) is \(L\)-smooth with \(L = \| \bA \|_{p, q}\) where \(q \in [1,\infty]\) is the conjugate exponent satisfying

\[ \frac{1}{p} + \frac{1}{q} = 1 \]

and \(\| \bA \|_{p, q}\) is the induced norm given by

\[ \| \bA \|_{p, q} = \sup \{ \| \bA \bx \|_q \ST \| \bx \|_p \leq 1 \}. \]

Proof. We note that the dual norm of \(\ell_p\) norm is the \(\ell_q\) norm. Now,

\[\begin{split} & \| \nabla f(\bx) - \nabla f(\by)\|_* \\ &= \| \nabla f(\bx) - \nabla f(\by)\|_q \\ &= \| \bA \bx + \bb - \bA \by - \bb \|_q \\ &= \| \bA \bx - \bA \by \|_q \\ &\leq \| \bA \|_{p, q} \| \bx - \by \|_p. \end{split}\]

Thus, \(f\) is \(\| \bA \|_{p, q}\) smooth.

We next show that \(\| \bA \|_{p, q}\) is the smallest smoothness parameter for \(f\).

  1. Assume that \(f\) is \(L\) smooth for some \(L\).

  2. By definition of \(\| \bA \|_{p, q}\), there exists a vector \(\tilde{\bx}\) such that \(\| \tilde{\bx} \|_p = 1\) and

    \[ \| \bA \tilde{\bx} \|_q = \| \bA \|_{p, q} \| \tilde{\bx} \|_p = \| \bA \|_{p, q}. \]
  3. Then,

    \[\begin{split} \| \bA \|_{p, q} &= \| \bA \tilde{\bx} \|_q\\ &= \| \bA \tilde{\bx} + \bb - \bA \bzero - \bb \|_q\\ &= \| \nabla f(\tilde{\bx}) - \nabla f(\bzero) \|_q \\ &\leq L \| \tilde{\bx} - \bzero \|_p = L. \end{split}\]
  4. Thus, \(\| \bA \|_{p, q} \leq L\).

Thus, \(\| \bA \|_{p, q}\) is indeed the smallest smoothness parameter for \(L\).

7.18.1.1. Descent Lemma#

Theorem 7.264 (Descent lemma)

Let \(f : \VV \to \RERL\) be \(L\)-smooth for some \(L \geq 0\) over some convex set \(D\). Then for any \(\bx, \by \in D\),

\[ f(\by) \leq f(\bx) + \langle \by - \bx, \nabla f(\bx) \rangle + \frac{L}{2} \| \bx - \by \|^2. \]

Proof. we proceed as follows:

  1. By the fundamental theorem of calculus

    \[ f(\by) - f(\bx) = \int_0^1 \langle \by - \bx, \nabla f(\bx + t(\by - \bx)) \rangle dt. \]
  2. By adding and subtracting \(\langle \by - \bx, \nabla f(\bx) \rangle\), we get:

    \[ f(\by) - f(\bx) = \langle \by - \bx, \nabla f(\bx) \rangle + \int_0^1 \langle \by - \bx, \nabla f(\bx + t(\by - \bx)) - \nabla f(\bx) \rangle dt. \]
  3. This gives us

    \[\begin{split} & | f(\by) - f(\bx) - \langle \by - \bx, \nabla f(\bx) \rangle | \\ &= \left | \int_0^1 \langle \by - \bx, \nabla f(\bx + t(\by - \bx)) - \nabla f(\bx) \rangle dt \right | \\ &\leq \int_0^1 | \langle \by - \bx, \nabla f(\bx + t(\by - \bx)) - \nabla f(\bx) \rangle | dt \\ &\leq \int_0^1 \| \by - \bx \| \| \nabla f(\bx + t(\by - \bx)) - \nabla f(\bx) \|_* dt & \text{ (a) } \\ &\leq \int_0^1 \| \by - \bx \| tL \| \by - \bx \| dt & \text { (b) } \\ &= \int_0^1 tL \| \by - \bx \|^2 dt \\ &= L \| \by - \bx \|^2 \int_0^1 t dt \\ &= \frac{L}{2} \| \by - \bx \|^2. \end{split}\]
    • (a) is an application of Generalized Cauchy Schwartz inequality (Theorem 4.108}).

    • (b) is the application of \(L\)-smoothness of \(f\) (Definition 7.80).

  4. Thus,

    \[\begin{split} & | f(\by) - f(\bx) - \langle \by - \bx, \nabla f(\bx) \rangle | \leq \frac{L}{2} \| \by - \bx \|^2 \\ & \implies f(\by) - f(\bx) - \langle \by - \bx, \nabla f(\bx) \rangle \leq \frac{L}{2} \| \by - \bx \|^2 \\ & \implies f(\by) \leq f(\bx) + \langle \by - \bx, \nabla f(\bx) \rangle + \frac{L}{2} \| \by - \bx \|^2. \end{split}\]

7.18.1.2. Characterization of \(L\)-Smooth Functions#

Theorem 7.265 (Characterization of \(L\)-smooth functions)

Let \(f : \VV \to \RR\) be convex and differentiable over \(\VV\). Let \(L > 0\). The following claims are equivalent:

  1. \(f\) is \(L\)-smooth.

  2. \(f(\by) \leq f(\bx) + \langle \by - \bx, \nabla f(\bx) \rangle + \frac{L}{2} \| \bx - \by \|^2 \Forall \bx, \by \in \VV\).

  3. \(f(\by) \geq f(\bx) + \langle \by - \bx, \nabla f(\bx) \rangle + \frac{1}{2L} \| \nabla f (\bx) - \nabla f(\by) \|_*^2 \Forall \bx, \by \in \VV\).

  4. \(\langle \bx - \by, \nabla f (\bx) - \nabla f(\by) \rangle \geq \frac{1}{L} \| \nabla f (\bx) - \nabla f(\by) \|_*^2 \Forall \bx, \by \in \VV\).

  5. \(f(t \bx + (1-t) \by) \geq t f(\bx) + (1-t) f(\by) - \frac{L}{2} t (1 - t) \| \bx - \by \|^2 \Forall \bx, \by \in \VV, t \in [0, 1]\).

Proof. (1) \(\implies\) (2). This is a direct implication of the descent lemma (Theorem 7.264).

(2) \(\implies\) (3)

  1. We are given that (2) is satisfied.

  2. If \(\nabla f(\bx) = \nabla f(\by)\), then the inequality is trivial due to the convexity of \(f\). Hence, we consider the case where \(\nabla f(\bx) \neq \nabla f(\by)\).

  3. Fix a \(\bx \in \VV\).

  4. Consider a function \(g_{\bx} : \VV \to \RR\) given by

    \[ g_{\bx}(\by) = f(\by) - f(\bx) - \langle \by - \bx, \nabla f(\bx) \rangle. \]
  5. Then,

    \[ \nabla g_{\bx}(\by) = \nabla f(\by) - \nabla f(\bx). \]
  6. By hypothesis in property (2), for any \(\bz \in \VV\)

    \[ f(\bz) \leq f(\by) + \langle \bz - \by, \nabla f(\by) \rangle + \frac{L}{2} \| \bz - \by \|^2. \]
  7. Now,

    \[\begin{split} & g_{\bx}(\bz) \\ &= f(\bz) - f(\bx) - \langle \bz - \bx, \nabla f(\bx) \rangle \\ &\leq f(\by) + \langle \bz - \by, \nabla f(\by) \rangle + \frac{L}{2} \| \bz - \by \|^2 - f(\bx) - \langle \bz - \bx, \nabla f(\bx) \rangle\\ &= f(\by) - f(\bx) - \langle \bz - \bx, \nabla f(\bx) \rangle + \langle \bz - \by, \nabla f(\bx) \rangle - \langle \bz - \by, \nabla f(\bx) \rangle\\ &+ \langle \bz - \by, \nabla f(\by) \rangle + \frac{L}{2} \| \bz - \by \|^2\\ &= f(\by) - f(\bx) - \langle \by - \bx, \nabla f(\bx) \rangle + \langle \bz - \by, \nabla f(\by) - \nabla f(\bx) \rangle + \frac{L}{2} \| \bz - \by \|^2\\ &= g_{\bx}(\by) + \langle \bz - \by, \nabla g_{\bx}(\by) \rangle + \frac{L}{2} \| \bz - \by \|^2. \end{split}\]
  8. Thus, \(g_{\bx}\) also satisfies the inequality in property (2).

  9. We note in particular that \(\nabla g_{\bx} (\bx) = \nabla f(\bx) - \nabla f(\bx) = \bzero\).

  10. Since \(g_{\bx}\) is convex, hence \(\bx\) is the global minimizer of \(g_{\bx}\).

  11. In other words,

    \[ g_{\bx}(\bx) \leq g_{\bx}(\bz) \Forall \bz \in \VV. \]
  12. We can also see that \(g_{\bx}(\bx) = f(\bx) - f(\bx) - \langle \bx - \bx, \nabla f(\bx) \rangle = 0\).

  13. Let \(\by \in \VV\)

  14. Let \(\bv \in \VV\) be the unit norm vector satisfying \(\| \nabla g_{\bx}(\by) \|_* = \langle \bv , \nabla g_{\bx}(\by) \rangle\).

  15. Choose

    \[ \bz = \by - \frac{\| \nabla g_{\bx}(\by) \|_*}{L} \bv. \]
  16. Then,

    \[ 0 = g_{\bx}(\bx) \leq g_{\bx}(\bz) = g_{\bx}\left (\by - \frac{\| \nabla g_{\bx}(\by) \|_*}{L} \bv \right ). \]
  17. Using property (2) on \(g_{\bx}(\bz)\), we get

    \[\begin{split} 0 &\leq g_{\bx}(\bz) \\ &\leq g_{\bx}(\by) + \langle \bz - \by, \nabla g_{\bx}(\by) \rangle + \frac{L}{2} \| \bz - \by \|^2 \\ &= g_{\bx}(\by) - \frac{\| \nabla g_{\bx}(\by) \|_*}{L} \langle \bv, \nabla g_{\bx}(\by) \rangle + \frac{L}{2} \left \| \frac{\| \nabla g_{\bx}(\by) \|_*}{L} \bv \right \|^2 \\ &= g_{\bx}(\by) - \frac{\| \nabla g_{\bx}(\by) \|_*}{L} \| \nabla g_{\bx}(\by) \|_* + \frac{1}{2 L} \| \nabla g_{\bx}(\by) \|_*^2 \\ &= g_{\bx}(\by) - \frac{1}{2 L} \| \nabla g_{\bx}(\by) \|_*^2 \\ &= f(\by) - f(\bx) - \langle \by - \bx, \nabla f(\bx) \rangle - \frac{1}{2 L} \| \nabla f(\by) - \nabla f(\bx) \|_*^2. \end{split}\]
  18. Simplifying this, we get

    \[ f(\by) \geq f(\bx) + \langle \by - \bx, \nabla f(\bx) \rangle + \frac{1}{2 L} \| \nabla f(\by) - \nabla f(\bx) \|_*^2 \]

    as desired.

(3) \(\implies\) (4)

  1. For \(\bx, \by\), the property (3) gives us:

    \[ f(\by) \geq f(\bx) + \langle \by - \bx, \nabla f(\bx) \rangle + \frac{1}{2 L} \| \nabla f(\by) - \nabla f(\bx) \|_*^2. \]
  2. For \(\by, \bx\), the property (3) gives us:

    \[ f(\bx) \geq f(\by) + \langle \bx - \by, \nabla f(\by) \rangle + \frac{1}{2 L} \| \nabla f(\bx) - \nabla f(\by) \|_*^2. \]
  3. Adding the two inequalities and canceling the term \(f(\bx) + f(\by)\) gives us

    \[ 0 \geq \langle \bx - \by, \nabla f(\by) - f(\bx) \rangle + \frac{1}{L} \| \nabla f(\bx) - \nabla f(\by) \|_*^2. \]
  4. Rearranging, we get

    \[ \langle \bx - \by, \nabla f(\bx) - f(\by) \rangle \geq \frac{1}{L} \| \nabla f(\bx) - \nabla f(\by) \|_*^2 \]

    as desired.

(4) \(\implies\) (1)

  1. When \(\nabla f(\bx) = \nabla f(\by)\), then the Lipschitz condition in (1) is trivial. Hence, we consider the case where \(\nabla f(\bx) \neq \nabla f(\by)\).

  2. By generalized Cauchy Schwartz inequality (Theorem 4.108)

    \[ \langle \bx - \by, \nabla f(\bx) - f(\by) \rangle \leq \| \bx - \by \| \| f(\bx) - f(\by) \|_*. \]
  3. Thus, combining with hypothesis (4), we obtain

    \[ \frac{1}{L} \| \nabla f(\bx) - \nabla f(\by) \|_*^2 \leq \| \bx - \by \| \| f(\bx) - f(\by) \|_*. \]
  4. Since \(\nabla f(\bx) \neq \nabla f(\by)\), hence \(\| f(\bx) - f(\by) \|_* > 0\).

  5. Canceling it from both sides, we get

    \[ \| \nabla f(\bx) - \nabla f(\by) \|_* \leq L \| \bx - \by \| \]

    as desired.

We have shown so far that (1), (2), (3) and (4) are equivalent statements. We are left with showing that (5) is equivalent to the other statements.

(2) \(\implies\) (5)

  1. Pick \(\bx, \by \in \VV\) and \(t \in [0,1]\).

  2. Let \(\bz = t \bx + (1-t) \by\).

  3. By hypothesis (2),

    \[\begin{split} & f(\bx) \leq f(\bz) + \langle \bx - \bz, \nabla f(\bz) \rangle + \frac{L}{2} \| \bx - \bz \|^2;\\ & f(\by) \leq f(\bz) + \langle \by - \bz, \nabla f(\bz) \rangle + \frac{L}{2} \| \by - \bz \|^2. \end{split}\]
  4. Note that \(\bx - \bz = (1-t) (\bx - \by)\) and \(\by - \bz = t (\by - \bx)\).

  5. Thus, the previous two inequalities are same as

    \[\begin{split} & f(\bx) \leq f(\bz) + (1-t)\langle \bx - \by, \nabla f(\bz) \rangle + \frac{L (1-t)^2}{2} \| \bx - \by \|^2;\\ & f(\by) \leq f(\bz) + t\langle \by - \bx, \nabla f(\bz) \rangle + \frac{L t^2}{2} \| \bx - \by \|^2. \end{split}\]
  6. Multiplying the first inequality by \(t\), the second by \((1-t)\) and adding, we get

    \[ t f(\bx) + (1-t) f(\by) \leq f(\bz) + \frac{L t(1-t)}{2} \| \bx - \by \|^2. \]
  7. Rearranging, we get

    \[ f(t \bx + (1-t) \by) = f(\bz) \geq t f(\bx) + (1-t) f(\by) - \frac{L }{2} t(1-t) \| \bx - \by \|^2. \]

(5) \(\implies\) (2)

  1. Pick \(\bx, \by \in \VV\) and \(t \in (0,1)\).

  2. By hypothesis in inequality (5)

    \[ f(t \bx + (1-t) \by) \geq t f(\bx) + (1-t) f(\by) - \frac{L }{2} t(1-t) \| \bx - \by \|^2. \]
  3. Rearranging the terms, we obtain

    \[\begin{split} & (1-t) f(\by) \leq f(t \bx + (1-t) \by) - t f(\bx) + \frac{L }{2} t(1-t) \| \bx - \by \|^2 \\ &\iff (1-t) f(\by) \leq f(t \bx + (1-t) \by) - f(\bx) + (1 - t) f(\bx) + \frac{L }{2} t(1-t) \| \bx - \by \|^2 \\ &\iff f(\by) \leq f(\bx) + \frac{f(t \bx + (1-t) \by) - f(\bx)}{1-t} + \frac{L }{2} t \| \bx - \by \|^2. \end{split}\]

    Division by \((1-t)\) is fine since \((1-t) \in (0, 1)\).

  4. Recalling the definition of directional derivative (Definition 7.70):

    \[\begin{split} &\lim_{t \to 1^-} \frac{f(t \bx + (1-t) \by) - f(\bx)}{1-t} \\ &= \lim_{s \to 0^+} \frac{f( (1-s) \bx + s \by) - f(\bx)}{s} \\ &= \lim_{s \to 0^+} \frac{f( \bx + s (\by - \bx) ) - f(\bx)}{s}\\ &= f'(\bx; \by - \bx). \end{split}\]
  5. Since the previous inequality is valid for every \(t \in (0,1)\), taking the limit to \(t \to 1^-\) on the R.H.S., we obtain

    \[ f(\by) \leq f(\bx) + f'(\bx; \by - \bx) + \frac{L }{2} \| \bx - \by \|^2. \]
  6. Recall from Theorem 7.203 that \(f'(\bx; \by - \bx) = \langle \by - \bx, \nabla f(\bx) \rangle\).

  7. Thus, we get:

    \[ f(\by) \leq f(\bx) + \langle \by - \bx, \nabla f(\bx) \rangle + \frac{L }{2} \| \bx - \by \|^2. \]

    as desired.

7.18.1.3. Second Order Characterization#

We now restrict our attention to the vector space \(\RR^n\) equipped with an \(\ell_p\) norm with \(p \geq 1\).

Theorem 7.266 (\(L\)-smoothness and the boundedness of the Hessian)

Let \(f : \RR^n \to \RR\) be a twice continuously differentiable function over \(\RR^n\). Then, for any \(L \geq 0\), the following claims are equivalent:

  1. \(f\) is \(L\)-smooth w.r.t. the \(\ell_p\)-norm (\(p \in [1, \infty]\)).

  2. \(\| \nabla^2 f(\bx)\|_{p, q} \leq L\) for any \(\bx \in \RR^n\) where \(q \geq 1\) satisfies \(\frac{1}{p} + \frac{1}{q} = 1\).

Proof. (2) \(\implies\) (1)

  1. We are given that \(\| \nabla^2 f(\bx)\|_{p, q} \leq L\) for any \(\bx \in \RR^n\).

  2. By the fundamental theorem of calculus

    \[\begin{split} \nabla f(\by) - \nabla f(\bx) &= \int_0^1 \nabla^2 f(\bx + t(\by - \bx)) (\by - \bx) dt \\ &= \left (\int_0^1 \nabla^2 f(\bx + t(\by - \bx)) dt \right ) (\by - \bx). \end{split}\]
  3. Taking the (dual)-norm on both sides

    \[\begin{split} \| \nabla f(\by) - \nabla f(\bx) \|_q &= \left \| \left ( \int_0^1 \nabla^2 f(\bx + t(\by - \bx)) dt \right ) (\by - \bx) \right \|_q \\ &\leq \left \| \int_0^1 \nabla^2 f(\bx + t(\by - \bx)) dt \right \|_{p, q} \| \by - \bx \|_p \\ &\leq \left ( \int_0^1 \| \nabla^2 f(\bx + t(\by - \bx)) \|_{p, q} dt \right ) \| \by - \bx \|_p \\ &\leq \left ( \int_0^1 L dt \right ) \| \by - \bx \|_p \\ &= L \| \by - \bx \|_p. \end{split}\]
  4. Thus, \(\| \nabla f(\by) - \nabla f(\bx) \|_q \leq L \| \by - \bx \|_p\) as desired.

(1) \(\implies\) (2)

  1. We are given that \(f\) is \(L\) smooth with \(\ell_p\) norm.

  2. By fundamental theorem of calculus, for any \(\bd \in \RR^n\) and \(s > 0\),

    \[ \nabla f(\bx + s \bd) - \nabla f(\bx) = \int_0^s \nabla^2 f(\bx + t \bd) \bd dt. \]
  3. Taking \(q\) norm on both sides

    \[ \left \| \left ( \int_0^s \nabla^2 f(\bx + t \bd) dt \right ) \bd \right \|_q = \| \nabla f(\bx + s \bd) - \nabla f(\bx) \|_q \leq L \|\bx + s \bd - \bx \|_p = s L \| \bd \|_p. \]
  4. Dividing by \(s\) on both sides and taking the limit \(s \to 0^+\), we get

    \[ \| \nabla^2 f(\bx) \bd \|_q \leq L \| \bd \|_p. \]
  5. Since this is valid for every \(\bd \in \RR^n\), hence

    \[ \| \nabla^2 f(\bx) \|_{p,q} \leq L. \]

Corollary 7.27 (\(L\)-smoothness and largest eigenvalue of Hessian)

Let \(f : \RR^n \to \RR\) be a twice continuously differentiable convex function over \(\RR^n\). Then \(f\) is \(L\)-smooth w.r.t. \(\ell_2\)-norm if and only if

\[ \lambda_{\max}( \nabla^2 f(\bx)) \leq L \Forall \bx \in \RR^n. \]

Proof. Since \(f\) is convex, hence it follows that \(\nabla^2 f(\bx) \succeq \ZERO\) for every \(\bx\). Thus,

\[ \|\nabla^2 f(\bx) \|_{2,2} = \sqrt{\lambda_{\max}( \nabla^2 f(\bx)^2 )} = \lambda_{\max}( \nabla^2 f(\bx)). \]

From Theorem 7.266, \(f\) is \(L\)-smooth is equivalent to the condition that

\[ \lambda_{\max}( \nabla^2 f(\bx)) = \|\nabla^2 f(\bx) \|_{2,2} \leq L. \]

7.18.2. Strong Convexity#

Definition 7.81 (Strong convexity)

A function \(f : \VV \to \RERL\) is called \(\sigma\)-strongly convex for \(\sigma > 0\) if \(\dom f\) is convex and the following holds for any \(\bx, \by \in \dom f\) and \(t \in [0,1]\):

(7.10)#\[f(t \bx + (1 - t)\by) \leq t f(\bx) + (1-t)f(\by) - \frac{\sigma}{2} t (1 - t) \| \bx - \by \|^2. \]

7.18.2.1. Strong Convexity \(\implies\) Convexity#

Strongly convex functions are convex. In fact, we have a stronger result available.

Theorem 7.267 (Strong convexity and convexity)

Assume that the ambient space \(\VV\) is Euclidean.

A function \(f : \VV \to \RERL\) is \(\sigma\)-strongly convex if and only if the function \(f(\cdot) - \frac{\sigma}{2} \| \cdot \|^2\) is convex.

Proof. Let us define a function \(g : \VV \to \RERL\) as

\[ g(\bx) = f(\bx) = \frac{\sigma}{2} \| \bx \|^2. \]

We need to show that \(f\) is \(\sigma\)-strongly convex if and only if \(g\) is convex.

  1. We first note that \(\dom g = \dom f\).

  2. Thus, \(\dom g\) is convex if and only if \(\dom f\) is convex.

  3. Now, \(g\) is convex if and only if \(\dom g = \dom f\) is convex and for any \(\bx, \by \in \dom f\) and \(t \in (0, 1)\)

    \[ g(t \bx + (1-t) \by) \leq t g(\bx) + (1-t) g(\by). \]
  4. Now,

    \[\begin{split} & g(t \bx + (1-t) \by) \leq t g(\bx) + (1-t) g(\by) \\ \iff & f(t \bx + (1-t) \by) - \frac{\sigma}{2} \| t \bx + (1-t) \by \|^2 \\ & \leq t f(\bx) + (1-t) f(\by) - \frac{\sigma}{2} [t \| \bx \|^2 + (1-t) \| \by \|^2] \\ \iff & f(t \bx + (1-t) \by) \leq t f(\bx) + (1-t) f(\by) \\ & + \frac{\sigma}{2} [ \| t \bx + (1-t) \by \|^2 - t \| \bx \|^2 - (1-t) \| \by \|^2]. \end{split}\]
  5. Since the norm is Euclidean, hence

    \[\begin{split} & \| t \bx + (1-t) \by \|^2 - t \| \bx \|^2 - (1-t) \| \by \|^2 \\ &= \langle t \bx + (1-t) \by, t \bx + (1-t) \by \rangle - t \| \bx \|^2 - (1-t) \| \by \|^2 \\ &= t^2 \| \bx \|^2 + (1-t)^2 \| \by \|^2 + 2 t (1-t)\langle \bx, \by \rangle - t \| \bx \|^2 - (1-t) \| \by \|^2 \\ &= - t(1-t) \| \bx \|^2 - t(1-t) \| \by \|^2 + 2 t (1-t)\langle \bx, \by \rangle \\ &= - t(1-t) \left ( \| \bx \|^2 + \| \by \|^2 - 2 \langle \bx, \by \rangle \right ) \\ &= - t(1-t) \| \bx - \by \|^2. \end{split}\]
  6. Thus, the convexity inequality for \(g\) is equivalent to

    \[ f(t \bx + (1-t) \by) \leq t f(\bx) + (1-t) f(\by) - \frac{\sigma}{2}t(1-t) \| \bx - \by \|^2 \]

    which is nothing but the \(\sigma\)-strong convexity condition of \(f\).

7.18.2.2. Quadratic Functions#

Theorem 7.268 (Strong convexity of quadratic functions)

Let \(\bA \in \SS^n\), \(\bb \in \RR^n\) and \(c \in \RR\). Let \(f : \RR^n \to \RR\) be given by:

\[ f(\bx) = \frac{1}{2} \bx^T \bA \bx + \bb^T \bx + c. \]

Then \(f\) is \(\sigma\)-strongly convex if and only if \(\bA\) is positive definite and \(\sigma \leq \lambda_{\min}(\bA)\).

Proof. Due to Theorem 7.267, \(f\) is strongly convex with \(\sigma > 0\) if and only if \(g(\bx) = f(\bx) - \frac{\sigma}{2} \| \bx \|^2\) is convex.

  1. We note that

    \[\begin{split} g(\bx) &= \frac{1}{2} \bx^T \bA \bx + \bb^T \bx + c - \frac{\sigma}{2} \| \bx \|^2\\ &= \frac{1}{2} \bx^T (\bA - \sigma \bI) \bx + \bb^T \bx + c. \end{split}\]
  2. As shown in Example 7.40, \(g\) is convex if and only if \(\bA - \sigma \bI \succeq \ZERO\).

  3. This is equivalent to \(\sigma \leq \lambda_{\min}(\bA)\).

7.18.2.3. Coerciveness#

Theorem 7.269 (Strong convexity and coerciveness)

Assume that the ambient space \(\VV\) is Euclidean. Assume that \(f: \VV \to \RERL\) is a Fréchet-differentiable function. If \(f\) is \(\sigma\)-strongly convex, then it is coercive.

Proof. We proceed as follows.

  1. Define

    \[ g(\bx) = f(\bx) - \frac{\sigma}{2} \| \bx \|^2. \]
  2. By Theorem 7.267, \(g\) is convex.

  3. Since \(f\) is differentiable, hence \(g\) is also differentiable.

  4. Specifically, \(\nabla g(\bx) = \nabla f(\bx) - \sigma \bx\).

  5. Fix some \(\bx \in \interior \dom f\).

  6. Then \(\partial g(\bx) = \{ \nabla g(\bx) \}\).

  7. By subgradient inequality, for any \(\by \in \VV\),

    \[ g(\by) \geq g(\bx) + \langle \by - \bx, \nabla g(\bx) \rangle. \]
  8. Expanding \(g\) and \(\nabla g\):

    \[ f(\by) - \frac{\sigma}{2} \| \by \|^2 \geq f(\bx) - \frac{\sigma}{2} \| \bx \|^2 + \langle \by - \bx, \nabla f(\bx) - \sigma \bx \rangle. \]
  9. Let \(\bv = f(\bx) - \sigma \bx\).

  10. Rearranging terms

    \[ f(\by) \geq \frac{\sigma}{2} \| \by \|^2 + \langle \by, \bv \rangle + K_{\bx} \]

    where \(K_{\bx} = f(\bx) - \frac{\sigma}{2} \| \bx \|^2 - \langle \bx, \bv \rangle\).

  11. We note that the term \(K_{\bx}\) depends solely on \(\bx\) which is fixed. Hence \(K_{\bx}\) is a fixed quantity.

  12. By Cauchy-Schwarz inequality

    \[ \langle \by, \bv \rangle \geq - \| \bv \| \| \by \|. \]
  13. Hence

    \[ f(\by) \geq \frac{\sigma}{2} \| \by \|^2 - \| \bv \| \| \by \| + K_{\bx}. \]
  14. It is easy to see that, the R.H.S. goes to \(\infty\) as \(\| \by \| \to \infty\).

  15. Hence \(f\) is coercive.

7.18.2.4. Sum Rule#

Theorem 7.270 (Sum of strongly convex and convex functions)

Let \(f\) be \(\sigma\)-strongly convex and \(g\) be convex. Then \(f+g\) is \(\sigma\)-strongly convex.

Proof. Since both \(f\) and \(g\) are convex, hence their domains are convex. Hence, \(\dom (f + g) = \dom f \cap \dom g\) is also convex.

We further need to show that \(f+g\) satisfies (7.10).

  1. Let \(\bx, \by \in \VV\) and \(t \in (0,1)\).

  2. Since \(f\) is \(\sigma\)-strongly convex, hence

    \[ f(t \bx + (1 - t)\by) \leq t f(\bx) + (1-t)f(\by) - \frac{\sigma}{2} t (1 - t) \| \bx - \by \|^2. \]
  3. Since \(g\) is convex, hence

    \[ g(t \bx + (1 - t)\by) \leq t g(\bx) + (1-t)g(\by). \]
  4. Then,

    \[\begin{split} & (f + g)(t \bx + (1 - t)\by) \\ &= f(t \bx + (1 - t)\by) + g((t \bx + (1 - t)\by)) \\ &\leq f(t \bx + (1 - t)\by) \leq t f(\bx) + (1-t)f(\by) - \frac{\sigma}{2} t (1 - t) \| \bx - \by \|^2 + t g(\bx) + (1-t)g(\by)\\ &= t (f + g) (\bx) + (1-t) (f + g)(\by) - \frac{\sigma}{2} t (1 - t) \| \bx - \by \|^2. \end{split}\]
  5. Thus, \(f+g\) is also \(\sigma\)-strongly convex.

Example 7.83 (Strong convexity of \(\frac{1}{2}| \cdot |^2 + I_C\))

Let \(\VV\) be a Euclidean space.

  1. The function \(\frac{1}{2}\| \bx \|^2\) is 1-strongly convex due to Theorem 7.268.

  2. Let \(C\) be a convex set.

  3. Then, the indicator function \(I_C\) is convex.

  4. Due to Theorem 7.270, the function

    \[ g(\bx) = \frac{1}{2}\| \bx \|^2 + I_C(\bx) \]

    is also 1-strongly convex.

7.18.2.5. First Order Characterization#

Recall that \(\dom (\partial f)\) denotes the set of points at which \(f\) is subdifferentiable.

Theorem 7.271 (First order characterization of strong convexity)

Let \(f: \VV \to \RERL\) be a proper closed and convex function. For a given \(\sigma > 0\), the following statements are equivalent.

  1. \(f\) is \(\sigma\)-strongly convex.

  2. For every \(\bx \in \dom (\partial f)\), \(\by \in \dom f\) and \(\bg \in \partial f(\bx)\), the following holds true

    \[ f(\by) \geq f(\bx) + \langle \by - \bx, \bg \rangle + \frac{\sigma}{2} \| \by - \bx \|^2. \]
  3. For any \(\bx, \by \in \dom (\partial f)\) and \(\bg_{\bx} \in \partial f(\bx)\), \(\bg_{\by} \in \partial f(\bx)\), the following holds true:

    \[ \langle \bx - \by, \bg_{\bx} - \bg_{\by} \rangle \geq \sigma \| \bx - \by \|^2. \]

Proof. We shall prove the equivalence of these statements in the following order. \((2) \implies (1)\), \((1) \implies (3)\), \((3) \implies (2)\).

(2) \(\implies\) (1)

  1. We assume that (2) is true.

  2. Let \(\bx, \by \in \dom f\) and \(t \in (0,1)\).

  3. We need to show that (7.10) holds for \(f\).

  4. Since \(\dom f\) is convex, its relative interior is not empty (see Theorem 7.142).

  5. Let \(\bz \in \relint \dom f\).

  6. Choose some \(\alpha \in (0, 1]\).

  7. Let \(\tilde{\bx} = (1-\alpha) \bx + \alpha \bz\).

  8. By the line segment property (Theorem 7.143), \(\tilde{\bx} \in \relint \dom f\).

  9. Let \(\bx_t = t \tilde{\bx} + (1-t)\by\).

  10. Again, by the line segment property, \(\bx_t \in \relint \dom f\).

  11. Since \(f\) is a proper convex function, hence the subdifferential of \(f\) at relative interior points is nonempty (Theorem 7.217).

  12. Thus, \(\partial f(\bx_t) \neq \EmptySet\) and \(\bx_t \in \dom (\partial f)\).

  13. Take some \(\bg \in \partial f(\bx_t)\).

  14. By hypothesis (2)

    \[ f(\tilde{\bx}) \geq f(\bx_t) + \langle \tilde{\bx} - \bx_t, \bg \rangle + \frac{\sigma}{2} \| \tilde{\bx} - \bx_t \|^2. \]
  15. Substituting \(\bx_t = t \tilde{\bx} + (1-t)\by\), we have \(\tilde{\bx} - \bx_t = (1-t) (\tilde{\bx} - \by)\). Thus,

    \[ f(\tilde{\bx}) \geq f(\bx_t) + (1-t)\langle \tilde{\bx} - \by, \bg \rangle + \frac{\sigma (1-t)^2}{2} \| \tilde{\bx} - \by \|^2. \]
  16. Similarly, by hypothesis (2)

    \[ f(\by) \geq f(\bx_t) + \langle \by - \bx_t, \bg \rangle + \frac{\sigma}{2} \| \by - \bx_t \|^2. \]
  17. \(\by - \bx_t = \by - t \tilde{\bx} - (1-t)\by = t (\by - \tilde{\bx})\).

  18. This gives us,

    \[ f(\by) \geq f(\bx_t) + t \langle \by - \tilde{\bx}, \bg \rangle + \frac{\sigma t^2}{2} \| \by - \tilde{\bx} \|^2. \]
  19. Multiplying the first inequality by \(t\) and the second one by \((1-t)\) and adding them together, we get

    \[ t f(\tilde{\bx}) + (1-t)f(\by) \geq f(\bx_t) + \frac{\sigma t(1-t)}{2} \| \tilde{\bx} - \by \|^2. \]
  20. Thus,

    \[ f(t \tilde{\bx} + (1-t)\by) = f(\bx_t) \leq t f(\tilde{\bx}) + (1-t)f(\by) - \frac{\sigma t(1-t)}{2} \|\tilde{\bx} - \by \|^2. \]
  21. Expanding \(\tilde{\bx}\),

    \[\begin{split} t \tilde{\bx} + (1-t)\by &= t ((1-\alpha) \bx + \alpha \bz) + (1-t)\by\\ &= t(1-\alpha) \bx + (1-t) \by + t \alpha \bz. \end{split}\]
  22. Define \(g_1(\alpha) = f(t \tilde{\bx} + (1-t)\by) = f(t(1-\alpha) \bx + (1-t) \by + t \alpha \bz)\).

  23. Define \(g_2(\alpha) = f(\tilde{\bx}) = f((1-\alpha) \bx + \alpha \bz)\).

  24. Substituting these into the previous inequality, we obtain

    \[ g_1(\alpha) \leq t g_2(\alpha)+ (1-t)f(\by) - \frac{\sigma t(1-t)}{2} \|(1-\alpha) \bx + \alpha \bz - \by \|^2. \]
  25. The functions \(g_1\) and \(g_2\) are one dimensional, proper, closed and convex functions.

  26. By Theorem 7.173, both \(g_1\) and \(g_2\) are continuous on their domain.

  27. Therefore, taking the limit \(\alpha \to 0^+\), it follows that

    \[ g_1(0) \leq t g_2(0) + (1-t)f(\by) - \frac{\sigma t(1-t)}{2} \|\bx - \by \|^2. \]
  28. Now \(g_1(0) = f(t\bx + (1-t) \by)\) and \(g_2(0) = f(\bx)\).

  29. Thus,

    \[ f(t\bx + (1-t) \by) \leq t f(\bx) + (1-t)f(\by) - \frac{\sigma t(1-t)}{2} \|\bx - \by \|^2. \]
  30. This establishes that \(f\) is indeed \(\sigma\)-strongly convex.

(1) \(\implies\) (3)

  1. We are given that \(f\) is \(\sigma\)-strongly convex.

  2. Let \(\bx, \by \in \dom (\partial f)\).

  3. Pick any \(\bg_{\bx} \in \partial f(\bx)\) and \(\bg_{\by} \in \partial f(\by)\).

  4. Let \(t \in [0, 1)\) and denote \(\bx_t = t \bx + (1-t) \by\).

  5. By the hypothesis

    \[ f(\bx_t) \leq t f(\bx) + (1-t) f(\by) - \frac{\sigma t (1-t) }{2} \| \bx - \by \|^2. \]
  6. This is same as

    \[ f(\bx_t) - f(\bx) \leq (1-t) [f(\by) - f(\bx)] - \frac{\sigma t (1-t) }{2} \| \bx - \by \|^2. \]
  7. We can see that \((1-t) \in (0, 1]\).

  8. Dividing both sides of inequality by \((1-t)\), we obtain

    \[ \frac{f(\bx_t) - f(\bx)}{1-t} \leq f(\by) - f(\bx) - \frac{\sigma t }{2} \| \bx - \by \|^2. \]
  9. Since \(\bg_{\bx} \in \partial f(\bx)\), hence by subgradient inequality

    \[ f(\bx_t) \geq f(\bx) + \langle \bx_t - \bx, \bg_{\bx} \rangle. \]
  10. We can rewrite this as

    \[ \frac{f(\bx_t) - f(\bx)}{1-t} \geq \frac{\langle \bx_t - \bx, \bg_{\bx} \rangle}{1-t}. \]
  11. Note that \(\bx_t - \bx = (1-t)(\by - \bx)\).

  12. Thus,

    \[ \frac{f(\bx_t) - f(\bx)}{1-t} \geq \langle \by - \bx, \bg_{\bx} \rangle. \]
  13. Thus,

    \[ \langle \by - \bx, \bg_{\bx} \rangle \leq f(\by) - f(\bx) - \frac{\sigma t }{2} \| \bx - \by \|^2. \]
  14. This inequality holds for every \(t \in [0, 1)\).

  15. Taking the limit \(t \to 1^-\), we obtain

    \[ \langle \by - \bx, \bg_{\bx} \rangle \leq f(\by) - f(\bx) - \frac{\sigma}{2} \| \bx - \by \|^2. \]
  16. An identical reasoning by switching the roles of \(\bx\) and \(\by\), gives us

    \[ \langle \bx - \by, \bg_{\by} \rangle \leq f(\bx) - f(\by) - \frac{\sigma}{2} \| \by - \bx \|^2. \]
  17. Adding these two inequalities gives us

    \[ \langle \bx - \by, \bg_{\by} - \bg_{\bx} \rangle \leq - \sigma \| \bx - \by \|^2. \]
  18. Multiplying both sides by \(-1\) (and switching the inequality accordingly), we get

    \[ \langle \bx - \by, \bg_{\bx} - \bg_{\by} \rangle \geq \sigma \| \bx - \by \|^2 \]

    as desired.

(3) \(\implies\) (2)

  1. We are given that (3) is satisfied.

  2. Let \(\bx \in \dom (\partial f)\), \(\by \in \dom f\) and \(\bg \in \partial f(\bx)\).

  3. Pick any \(\bz \in \relint \dom f\).

  4. Pick some \(\alpha \in (0, 1)\).

  5. Define \(\tilde{\by} = (1 - \alpha) \by + \alpha \bz\).

  6. By line segment property \(\tilde{\by} \in \relint \dom f\).

  7. Define \(\bx_t = (1-t) \bx + t \tilde{\by}\).

  8. Consider the 1D function

    \[ \varphi(t) = f(\bx_t), \Forall t \in [0, 1]. \]
  9. Pick any \(t \in (0, 1)\).

  10. Then, by line segment principle \(\bx_t \in \relint \dom f\).

  11. Due to (Theorem 7.217), \(\partial f(\bx_t) \neq \EmptySet\) and \(\bx_t \in \dom (\partial f)\).

  12. Take some \(\bg_t \in \partial f(\bx_t)\).

  13. By subgradient inequality

    \[ f(\bz) \geq f(\bx_t) + \langle \bz - \bx_t, \bg_t \rangle \Forall \bz \in \VV. \]
  14. In particular, for \(\bx_s = (1-s) \bx + s \tilde{\by}\), we have

    \[\begin{split} & f(\bx_s) \geq f(\bx_t) + \langle (1-s) \bx + s \tilde{\by} - (1-t) \bx - t \tilde{\by}, \bg_t \rangle \\ & \implies \varphi(s) \geq \varphi(t) + \langle (s-t) (\tilde{\by} - \bx), \bg_t \rangle \\ &\implies \varphi(s) \geq \varphi(t) + (s-t) \langle \tilde{\by} - \bx, \bg_t \rangle. \end{split}\]
  15. Since this is valid for every \(s\), hence \(\langle \tilde{\by} - \bx, \bg_t \rangle \in \partial \varphi(t)\).

  16. Applying the mean value theorem (Theorem 7.234)

    \[ f(\tilde{\by}) - f(\bx) = \varphi(1) - \varphi(0) = \int_0^1 \langle \tilde{\by} - \bx, \bg_t \rangle dt. \]
  17. Since \(\bg \in \partial f(\bx)\) and \(\bg_t \in \partial f(\bx_t)\), hence applying the hypothesis (3), we get

    \[ \langle \bx_t - \bx, \bg_t - \bg \rangle \geq \sigma \| \bx_t - \bx \|^2. \]
  18. But \(\bx_t - \bx = t (\tilde{\by} - \bx)\).

  19. Hence

    \[ t \langle \tilde{\by} - \bx, \bg_t - \bg \rangle \geq \sigma t^2 \| \tilde{\by} - \bx \|^2. \]
  20. This simplifies to

    \[ \langle \tilde{\by} - \bx, \bg_t \rangle \geq \langle \tilde{\by} - \bx, \bg \rangle + \sigma t \| \tilde{\by} - \bx \|^2. \]

    Canceling \(t\) on both sides doesn’t change the sign of inequality since \(t > 0\).

  21. Applying the inequality to the integral above

    \[ f(\tilde{\by}) - f(\bx) \geq \int_0^1 \left [ \langle \tilde{\by} - \bx, \bg \rangle + \sigma t \| \tilde{\by} - \bx \|^2 \right ] d t. \]
  22. Integrating, we get

    \[ f(\tilde{\by}) - f(\bx) \geq \langle \tilde{\by} - \bx, \bg \rangle + \frac{\sigma}{2}\| \tilde{\by} - \bx \|^2. \]
  23. Expanding for \(\tilde{\by}\) for any \(\alpha \in (0,1)\), we have

    \[ f((1 - \alpha) \by + \alpha \bz) \geq f(\bx) + \langle (1 - \alpha) \by + \alpha \bz - \bx, \bg \rangle + \frac{\sigma}{2}\| (1 - \alpha) \by + \alpha \bz - \bx \|^2. \]
  24. The 1D function \(g(\alpha) = f((1 - \alpha) \by + \alpha \bz)\) is continuous again due to Theorem 7.173.

  25. Taking the limit \(\alpha \to 0^+\) on both sides, we obtain

    \[ f(\by) \geq f(\bx) + \langle \by - \bx, \bg \rangle + \frac{\sigma}{2}\| \by - \bx \|^2 \]

    which is the desired result.

7.18.2.6. Minimization#

Theorem 7.272 (Existence and uniqueness of a a minimizer of closed strongly convex function)

Let \(f: \VV \to \RERL\) be a proper, closed and \(\sigma\)-strongly convex function with \(\sigma > 0\). Then,

  1. \(f\) has a unique minimizer \(\ba \in \dom f\) such that \(f(\bx) > f(\ba)\) for every \(\bx \in \dom f\) and \(\bx \neq \ba\).

  2. The increase in the value of \(f\) w.r.t. its minimum satisfies

    \[ f(\bx) - f(\ba) \geq \frac{\sigma}{2} \| \bx - \ba \|^2 \]

    where \(\ba \in \dom f\) is the unique minimizer of \(f\).

Proof. (1) Existence of the minimizer

  1. Since \(f\) is proper and convex, hence \(\dom f\) is nonempty and convex.

  2. Since \(\dom f\) is nonempty and convex, hence its relative interior is nonempty (Theorem 7.142).

  3. Pick \(\by \in \relint \dom f\).

  4. By Theorem 7.214, \(\partial f(\by)\) is nonempty.

  5. Pick some \(\bg \in \partial f(\by)\).

  6. Then, by property 2 of Theorem 7.271,

    \[ f(\bx) \geq f(\by) + \langle \bx - \by, \bg \rangle + \frac{\sigma}{2} \| \bx - \by \|^2 \]

    holds true for every \(\bx \in \VV\).

  7. Let \(\| \cdot \|_2 \triangleq \sqrt{\langle \cdot, \cdot \rangle}\) denote the Euclidean norm associated with the inner product of the space \(\VV\). This might be different from the endowed norm \(\| \cdot \|\).

  8. Since all norms in a finite dimensional space are equivalent, hence, there exists a constant \(C > 0\) such that

    \[ \| \bz \| \geq \sqrt{C} \| \bz \|_2 \]

    for every \(\bz \in \VV\).

  9. Therefore,

    \[ f(\bx) \geq f(\by) + \langle \bx - \by, \bg \rangle + \frac{\sigma C}{2} \| \bx - \by \|_2^2 \Forall \bx \in \VV. \]
  10. This in turn is same as

    \[ f(\bx) \geq f(\by) - \frac{1}{2 C \sigma} \| \bg \|_2^2 + \frac{C \sigma}{2} \left \| \bx - \left (\by - \frac{1}{C \sigma} \bg \right ) \right \|_2^2 \Forall \bx \in \VV. \]
  11. Let \(S_t\) denote the sublevel set \(\{ \bx \ST f(\bx) \leq t \}\).

  12. Consider the sublevel set \(S_{f(\by)}\).

  13. Let \(\bx \in S_{f(\by)}\).

  14. Then, \(f(\bx) = f(\by) - r\) for some \(r \geq 0\).

  15. But then

    \[ f(\by) - r \geq f(\by) - \frac{1}{2 C \sigma} \| \bg \|_2^2 + \frac{C \sigma}{2} \left \| \bx - \left (\by - \frac{1}{C \sigma} \bg \right ) \right \|_2^2. \]
  16. This simplifies to

    \[ r \leq \frac{1}{2 C \sigma} \| \bg \|_2^2 - \frac{C \sigma}{2} \left \| \bx - \left (\by - \frac{1}{C \sigma} \bg \right ) \right \|_2^2. \]
  17. Since \(r\) must be nonnegative, hence the R.H.S. must be nonnegative also.

  18. Thus, we require that

    \[ \frac{1}{2 C \sigma} \| \bg \|_2^2 \geq \frac{C \sigma}{2} \left \| \bx - \left (\by - \frac{1}{C \sigma} \bg \right ) \right \|_2^2. \]
  19. This simplifies to

    \[ \left \| \bx - \left (\by - \frac{1}{C \sigma} \bg \right ) \right \|_2 \leq \frac{1}{C \sigma} \| \bg \|_2. \]
  20. In other words, \(\bx\) must belong to an \(\ell_2\) closed ball given by

    \[ B_{\| \cdot \|_2}\left [ \by - \frac{1}{C \sigma} \bg, \frac{1}{C \sigma} \| \bg \|_2 \right ]. \]
  21. Since this is valid for every \(\bx \in S_{f(\by)}\), hence

    \[ S_{f(\by)} \subseteq B_{\| \cdot \|_2}\left [ \by - \frac{1}{C \sigma} \bg, \frac{1}{C \sigma} \| \bg \|_2 \right ]. \]
  22. Since \(f\) is closed, hence all its sublevel sets are closed.

  23. since \(S_{f(\by)}\) is contained in a ball, hence \(S_{f(\by)}\) is bounded.

  24. Thus, \(S_{f(\by)}\) is closed and bounded.

  25. Since \(\VV\) is finite dimensional, hence \(S_{f(\by)}\) is compact.

  26. \(S_{f(\by)}\) is also nonempty since \(\by \in S_{f(\by)}\).

  27. Thus, the problem of minimizing \(f\) over \(\dom f\) reduces to the problem of minimizing \(f\) over the nonempty compact set \(S_{f(\by)}\).

  28. Since \(f\) is closed, it is also lower semicontinuous.

  29. By Theorem 3.121, \(f\) attains a minimum on \(S_{f(\by)}\) at some point \(\ba \in S_{f(\by)}\).

  30. Thus, we have established the existence of a minimizer of \(f\) at some \(\ba \in S_{f(\by)} \subseteq \dom f\).

(1) Uniqueness of the minimizer

  1. To show the uniqueness, for contradiction, assume that \(\bu\) and \(\bv\) are two different minimizers of \(f\) with \(f(\bu) = f(\bv) = p^*\), the optimal value.

  2. Let \(\bw = \frac{1}{2} \bu + \frac{1}{2} \bv\).

  3. We must have \(f(\bw) \geq p^*\).

  4. By strong convexity of \(f\),

    \[ f(\bw) \leq \frac{1}{2} f(\bu) + \frac{1}{2} f(\bv) - \frac{\sigma}{2}\frac{1}{2}\frac{1}{2} \| \bu - \bv \|^2 = p^* - \frac{\sigma}{8}\| \bu - \bv \|^2. \]
  5. If \(\bu \neq \bv\), then \(f(\bw) < p^*\); a contradiction.

  6. Hence, the minimizer must be unique.

(2) Increase in value of \(f\)

  1. Let \(\ba\) be the unique minimizer of \(f\).

  2. By Fermat’s optimality condition \(\bzero \in \partial f(\ba)\).

  3. Since \(f\) is \(\sigma\)-strongly convex, hence by property (2) in the Theorem 7.271,

    \[ f(\bx) - f(\ba) \geq \langle \bx - \ba, \bzero \rangle + \frac{\sigma}{2} \| \bx - \ba \|^2 = \frac{\sigma}{2} \| \bx - \ba \|^2 \]

    holds true for any \(\bx \in \dom f\).

7.18.3. Smoothness and Strong Convexity#

7.18.3.1. The Conjugate Correspondence Theorem#

The idea of smoothness and strong convexity is connected. Roughly speaking, a function is strongly convex if and only if its conjugate is smooth.

Theorem 7.273 (Conjugate correspondence theorem)

Let \(\sigma > 0\). Then

  1. If \(f : \VV \to \RR\) is a \(\frac{1}{\sigma}\)-smooth convex function, then \(f^*\) is \(\sigma\)-strongly convex w.r.t. the dual norm \(\| \cdot \|_*\).

  2. If \(f: \VV \to \RERL\) is a proper, closed \(\sigma\)-strongly convex function, then \(f^* : \VV^* \to \RR\) is \(\frac{1}{\sigma}\)-smooth.

Proof. (1) Smooth convex to strongly convex conjugate

  1. We are given that \(f: \VV \to \RR\) is a \(\frac{1}{\sigma}\)-smooth convex function.

  2. Due to Theorem 7.239, \(f^*\) is closed and convex.

  3. Since \(f\) is proper and convex, hence due to Theorem 7.240, \(f^*\) is proper.

  4. Thus \(f^*\) is a proper, closed and convex function.

  5. Pick any \(\by_1, \by_2 \in \dom (\partial f^*)\).

  6. Let \(\bv_1 \in \partial f^*(\by_1)\) and \(\bv_2 \in \partial f^*(\by_2)\).

  7. Since \(f\) is proper and convex, hence by conjugate subgradient theorem (Theorem 7.246)

    \[ \by_1 \in \partial f(\bv_1) \text{ and } \by_2 \in \partial f(\bv_2). \]
  8. Since \(f\) is smooth, hence it is differentiable. Hence due to Theorem 7.220,

    \[ \by_1 = \nabla f(\bv_1) \text{ and } \by_2 = \nabla f(\bv_2). \]
  9. Following characterization of smoothness (Theorem 7.265), by its property 4,

    \[ \langle \bv_1 - \bv_2, \by_1 - \by_2 \rangle \geq \sigma \| \by_1 - \by_2 \|^2_*. \]
  10. Since the last inequality holds for any \(\by_1, \by_2 \in \dom (\partial f^*)\) and any \(\bv_1 \in \partial f^*(\by_1), \bv_2 \in \partial f^*(\bv_2)\), hence following the first order characterization of strong convexity in Theorem 7.271, \(f^*\) is a \(\sigma\)-strongly convex function.

(2) Strongly convex to smooth conjugate

  1. We are given that \(f\) is proper, closed and \(\sigma\)-strongly convex.

  2. Pick any \(\by \in \VV^*\).

  3. The conjugate is given by

    \[ f^*(\by) = \sup_{\bx \in \VV} \{ \langle \bx, \by \rangle - f(\by) \}. \]
  4. Define \(g(\bx) = f(\bx) - \langle \bx, \by \rangle\).

  5. We can see that

    \[ f^*(\by) = - \inf_{\bx \in \VV} g(\bx). \]
  6. Due to the sum rule (Theorem 7.270), \(g\) is \(\sigma\)-strongly convex.

  7. Due to Theorem 7.272, \(g\) has a unique minimizer.

  8. Hence \(f^*(\by)\) is finite.

  9. Since this is valid for any \(\by \in \VV^*\), hence \(\dom f^* = \VV^*\).

  10. This justifies the signature for \(f^*\) as \(f^* : \VV^* \to \RR\) being real valued.

  11. Let’s continue with any \(\by\).

  12. Since \(\dom f^* = \VV^*\), hence \(\by \in \interior \dom f^*\).

  13. Now, by the second formulation of conjugate subgradient theorem (Corollary 7.26),

    \[ \partial f^*(\by) = \argmax_{\bx \in \VV} \{ \langle \bx, \by \rangle - f(\bx) \}. \]
  14. We can see that

    \[ \partial f^*(\by) = - \argmin_{\bx \in \VV} g(\bx). \]
  15. Since \(g\) has a unique minimizer, hence \(\partial f^*(\by)\) is a singleton.

  16. Due to Theorem 7.220, \(f^*\) is differentiable at \(\by\).

  17. Since \(\by\) is arbitrary, hence \(f^*\) is differentiable over entire \(\VV^*\).

  18. We now pickup two points \(\by_1, \by_2 \in \VV^*\) and denote \(\bv_1 = \nabla f^*(\by_1), \bv_2 = \nabla f^*(\by_2)\).

  19. By conjugate subgradient theorem (Theorem 7.246), this is equivalent to \(\by_1 \in \partial f(\bv_1)\) and \(\by_2 \in \partial f(\bv_2)\).

  20. Following the first order characterization of strong convexity in Theorem 7.271,

    \[ \langle \bv_1 - \bv_2, \by_1 - \by_2 \rangle \geq \sigma \| \bv_1 - \bv_2 \|^2. \]
  21. In other words

    \[ \langle \nabla f^*(\by_1) - \nabla f^*(\by_2), \by_1 - \by_2 \rangle \geq \sigma \| \nabla f^*(\by_1) - \nabla f^*(\by_2) \|^2. \]
  22. By generalized Cauchy Schwartz inequality (Theorem 4.108)

    \[ \langle \nabla f^*(\by_1) - \nabla f^*(\by_2), \by_1 - \by_2 \rangle \leq \| \nabla f^*(\by_1) - \nabla f^*(\by_2) \| \| \by_1 - \by_2 \|_*. \]
  23. Thus the previous inequality simplifies to

    \[ \| \nabla f^*(\by_1) - \nabla f^*(\by_2) \| \leq \frac{1}{\sigma}\| \by_1 - \by_2 \|_*. \]
  24. This establishes that \(f^*\) is \(\frac{1}{\sigma}\)-smooth.

7.18.4. Examples#

Example 7.84 (Smoothness of \(\sqrt{1 + | \cdot |_2^2}\))

Let \(f : \RR^n \to \RR\) be given by

\[ f(\bx) = \sqrt{1 + \| \bx \|_2^2}. \]

\(f\) is 1-smooth w.r.t. the \(\ell_2\)-norm.

  1. Note that for any \(\bx \in \RR^n\), the gradient is given by

    \[ \nabla f(\bx) = \frac{\bx}{ \sqrt{1 + \| \bx \|_2^2}}. \]
  2. The Hessian is given by

    \[ \nabla^f (\bx)= \frac{\bI}{ \sqrt{1 + \| \bx \|_2^2}} - \frac{\bx \bx^T}{ (1 + \| \bx \|_2^2)^{\frac{3}{2}}} \preceq \frac{\bI}{ \sqrt{1 + \| \bx \|_2^2}} \preceq \bI. \]
  3. Therefore, \(\lambda_{\max}( \nabla^2 f(\bx)) \leq 1\) for every \(\bx \in \RR^n\).

  4. Hence, by Corollary 7.27, \(f\) is 1-smooth w.r.t. the \(\ell_2\)-norm.

7.18.4.1. Log-Sum-Exp#

Example 7.85 (Smoothness of log-sum-exp)

Consider the log-sum-exp function \(f : \RR^n \to \RR\) given by

\[ f(\bx) = \ln \left ( \sum_{i=1}^n e^{x_i}\right ). \]

\(f\) is 1-smooth w.r.t. \(\ell_2\) and \(\ell_{\infty}\) norms.

Smoothness w.r.t. \(\ell_2\) norm

  1. The partial derivatives of \(f\) are

    \[ \frac{\partial f}{\partial x_i} (\bx) = \frac{e^{x_i}}{\sum_{k=1}^n e^{x_k} }. \]
  2. The second order partial derivatives are

    \[\begin{split} \frac{\partial^2 f}{\partial x_i \partial x_j} (\bx) = \begin{cases} - \frac{e^{x_i} e^{x_j}}{\left (\sum_{k=1}^n e^{x_k} \right )^2}, & i \neq j; \\ - \frac{e^{x_i} e^{x_i}}{\left (\sum_{k=1}^n e^{x_k} \right )^2} + \frac{e^{x_i}}{\sum_{k=1}^n e^{x_k}}, & i = j. \end{cases} \end{split}\]
  3. The Hessian can be written as

    \[ \nabla^2 f(\bx) = \diag (\bw) - \bw \bw^T \]

    where \(w_i = \frac{e^{x_i}}{\sum_{k=1}^n e^{x_k}}\).

  4. We can now see that

    \[ \nabla^2 f(\bx) = \diag (\bw) - \bw \bw^T \preceq \diag (\bw) \preceq \bI. \]
  5. Hence \(\lambda_{\max}( \nabla^2 f(\bx)) \leq 1\) for every \(\bx \in \RR^n\).

  6. Hence, by Corollary 7.27, \(f\) is 1-smooth w.r.t. the \(\ell_2\)-norm.

Smoothness w.r.t. \(\ell_{\infty}\) norm

  1. We first show that for any \(\bv \in \VV\)

    \[ \bv^T \nabla^2 f(\bx) \bv \leq \| \bv \|_{\infty}^2. \]
  2. To see this, we expand the L.H.S. as

    \[\begin{split} \bv^T \nabla^2 f(\bx) \bv &= \bv^T (\diag (\bw) - \bw \bw^T) \bv \\ &= \bv^T \diag (\bw) \bv - (\bw^T \bv)^2 \\ &\leq \bv^T \diag (\bw) \bv\\ &= \sum_{i=1}^n w_i v_i^2 \\ &\leq \| \bv \|_{\infty}^2 \sum_{i=1}^n w_i \\ &= \| \bv \|_{\infty}^2. \end{split}\]
  3. Since \(f\) is twice differentiable over \(\RR^n\), hence by linear approximation theorem (Theorem 5.5), for any \(\bx, \by \in \RR^n\), there exists \(\bz \in [\bx, \by]\) such that

    \[ f(\by) - f(\bx) = \nabla f(\bx)^T (\by - \bx) + \frac{1}{2} (\by - \bx)^T \nabla^2 f(\bz) (\by - \bx). \]
  4. Let \(\bv = \by - \bx\).

  5. Then from above,

    \[ (\by - \bx)^T \nabla^2 f(\bz) (\by - \bx) \leq \| \bv \|_{\infty}^2. \]
  6. Putting this back in the approximation, we have

    \[ f(\by) \leq f (\bx) + \nabla f(\bx)^T (\by - \bx) + \frac{1}{2} \| \by - \bx \|_{\infty}^2. \]
  7. Following characterization of smoothness (Theorem 7.265), \(f\) is indeed 1-smooth w.r.t. the \(\ell_{\infty}\)-norm.

7.18.4.2. Negative Entropy#

Example 7.86 (Strong convexity of negative entropy over the unit simplex)

Let \(f : \RR^n \to \RERL\) be given by:

\[\begin{split} f(\bx) \triangleq \begin{cases} \sum_{i=1}^n x_i \ln x_i & \bx \in \Delta_n\\ \infty & \text{ otherwise } \end{cases}. \end{split}\]

\(f\) is 1-strongly convex for both \(\ell_1\) and \(\ell_2\) norms.

  1. By Theorem 7.257, its conjugate is given by

    \[ f^*(\by) = \ln \left ( \sum_{j=1}^n e^{y_j} \right ) \]

    which is the log sum exp function.

  2. By Example 7.85, the log-sum-exp function is 1-smooth w.r.t. both \(\ell_2\) and \(\ell_{\infty}\) norms.

  3. Hence by conjugate correspondence theorem Theorem 7.273, \(f\) is 1-strongly convex for both \(\ell_1\) and \(\ell_2\) norms.