5.1. Differentiation#

We consider functions from \(\RR^n\) to \(\RR^m\).

5.1.1. Differentiability and Jacobian#

Definition 5.1 (Differentiability at a point)

Let \(f : \RR^n \to \RR^m\). Let \(\bx \in \interior \dom f\). The function \(f\) is differentiable at \(\bx\) if there exists a matrix \(Df(\bx) \in \RR^{m \times n}\) that satisfies

(5.1)#\[\underset{\bz \in \dom f, \bz \neq \bx, \bz \to \bx}{\lim} \frac{\| f(\bz) - f(\bx) - Df(\bx) (\bz - \bx) \|_2}{\| \bz - \bx \|_2} = 0.\]

Such a matrix \(Df(\bx)\) is called the derivative (or Jacobian) of \(f\) at \(\bx\).

There can be at most one \(Df(\bx)\) satisfying the limit in (5.1).

Observation 5.1

If we write \(\bz = \bx + \bh\) then an alternative form for (5.1) is given by:

\[ \underset{\bx + \bh \in \dom f, \bh \neq \bzero, \bh \to \bzero}{\lim} \frac{\| f(\bx + \bh) - f(\bx) - Df(\bx) \bh \|_2}{\| \bh \|_2} = 0. \]

The matrix \(Df(\bx)\) can be obtained from the partial derivatives:

\[ Df(\bx)_{ij} = \frac{\partial f_i(\bx)}{\partial x_j}, \quad i=1,\dots,m, \quad j=1,\dots,n. \]
\[\begin{split} Df(\bx) = \begin{bmatrix} \frac{\partial f_1(\bx)}{\partial x_1} & \frac{\partial f_1(\bx)}{\partial x_2} & \dots & \frac{\partial f_1(\bx)}{\partial x_n}\\ \frac{\partial f_2(\bx)}{\partial x_1} & \frac{\partial f_2(\bx)}{\partial x_2} & \dots & \frac{\partial f_2(\bx)}{\partial x_n}\\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m(\bx)}{\partial x_1} & \frac{\partial f_m(\bx)}{\partial x_2} & \dots & \frac{\partial f_m(\bx)}{\partial x_n} \end{bmatrix}. \end{split}\]
  1. The Jacobian \(Df(\bx)\) is an \(m \times n\) real matrix.

  2. Partial derivatives of each component of \(f\) (i.e., \(f_i\)) line up on the \(i\)-th row.

  3. Partial derivatives for one coordinate \(x_j\) line up on the \(j\)-th column.

  4. If \(f\) is real valued (i.e., \(m=1\)), then the Jacobian \(Df(\bx)\) is a row vector.
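
These partial derivatives lend themselves to a quick numerical sanity check. Below is a minimal Python sketch (not part of the original text) that approximates each Jacobian entry by central finite differences; the helper `numerical_jacobian` and the test function are illustrative choices, not definitions from this section.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Approximate the m x n Jacobian of f at x by central differences."""
    x = np.asarray(x, dtype=float)
    m = np.atleast_1d(f(x)).size
    n = x.size
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        # Column j collects the partials of every component f_i w.r.t. x_j.
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

# Test: f(x) = (x_1^2, x_1 x_2) has Jacobian [[2 x_1, 0], [x_2, x_1]].
f = lambda x: np.array([x[0] ** 2, x[0] * x[1]])
print(numerical_jacobian(f, [1.0, 2.0]))  # approx [[2, 0], [2, 1]]
```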

Example 5.1 (Jacobian of identity function)

Let \(f: \RR^n \to \RR^n\) be defined as:

\[ f(\bx) = \bx. \]

Then, \(f_i(\bx) = x_i\). Hence,

\[ \frac{\partial f_i(\bx)}{\partial x_j} = \delta_{i j} \]

where \(\delta_{i j}\) is the Kronecker delta. Thus

\[ D f(\bx) = \bI_n \]

the \(n\times n\) identity matrix.

Example 5.2 (Jacobian of linear transformation)

Let \(f: \RR^n \to \RR^m\) be defined as:

\[ f(\bx) = \bA \bx \]

where \(\bA = (a_{i j})\) is an \(m \times n\) real matrix.

Then, \(f_i(\bx) = \sum_{j=1}^n a_{i j} x_j\). Hence,

\[ \frac{\partial f_i(\bx)}{\partial x_j} = a_{i j}. \]

Thus

\[ D f(\bx) = \bA. \]

Example 5.3 (Jacobian of affine transformation)

Let \(f: \RR^n \to \RR^m\) be defined as:

\[ f(\bx) = \bA \bx + \bb \]

where \(\bA = (a_{i j}) \in \RR^{m \times n}\) and \(\bb \in \RR^m\).

Then, \(f_i(\bx) = \sum_{j=1}^n a_{i j} x_j + b_i\). Hence,

\[ \frac{\partial f_i(\bx)}{\partial x_j} = a_{i j}. \]

Thus

\[ D f(\bx) = \bA. \]

The vector \(\bb\) is a constant offset. It has no impact on the derivative.
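
As a sanity check on Example 5.2 and Example 5.3, the sketch below (an illustration with arbitrary randomly generated \(\bA\), \(\bb\) and \(\bx\), not part of the text) confirms numerically that the Jacobian of \(f(\bx) = \bA \bx + \bb\) equals \(\bA\), independent of \(\bb\) and of the point \(\bx\).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))   # arbitrary m x n matrix
b = rng.standard_normal(3)        # constant offset
x = rng.standard_normal(4)        # arbitrary evaluation point
eps = 1e-6

f = lambda v: A @ v + b
J = np.zeros((3, 4))
for j in range(4):
    e = np.zeros(4)
    e[j] = eps
    J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)

# The offset b cancels in the difference, leaving Df(x) = A.
print(np.allclose(J, A, atol=1e-6))  # True
```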

Definition 5.2 (Differentiable function)

A function \(f\) is called differentiable if its domain \(\dom f\) is open and it is differentiable at every point of \(\dom f\).

Definition 5.3 (First order approximation)

The affine function given by:

(5.2)#\[\hat{f} (\bx) = f(\ba) + Df(\ba)(\bx - \ba)\]

is called the first order approximation of \(f\) at \(\bx=\ba \in \interior \dom f\).

5.1.2. Real Valued Functions#

The rest of this section focuses mostly on real valued functions of type \(f : \RR^n \to \RR\).

  1. The first order derivative of a real valued function is called the gradient.

  2. The second order derivative of a real valued function is called the Hessian.

  3. We consider first order and second order approximations of real valued functions.

5.1.3. Gradient#

Definition 5.4 (Gradient)

When \(f : \RR^n \to \RR\) is a real valued function, then the derivative \(Df(\bx)\) is a \(1 \times n\) matrix. The gradient of a real valued function is defined as:

\[ \nabla f(\bx) = Df (\bx)^T \]

at \(\bx \in \interior \dom f\) if \(f\) is differentiable at \(\bx\).

For real valued functions, the derivative is a row vector but the gradient is a column vector.

The components of the gradient are given by the partial derivatives as:

\[ \nabla f(\bx)_i = \frac{\partial f(\bx)}{\partial x_i}, \quad i=1,\dots,n. \]

Example 5.4 (Gradient of linear functional)

Let \(f : \RR^n \to \RR\) be a linear functional given by:

\[ f(\bx) = \langle \bx, \ba \rangle = \ba^T \bx. \]

We can expand it as:

\[ f(\bx) = \sum_{j=1}^n a_j x_j. \]

Computing partial derivative with respect to \(x_i\), we get:

\[ \frac{\partial f(\bx)}{\partial x_i} = \frac{\partial }{\partial x_i}\left (\sum_{j=1}^n a_j x_j \right ) = a_i. \]

Putting the partial derivatives together, we get:

\[ \nabla f(\bx) = \ba. \]

Example 5.5 (Gradient of affine functional)

Let \(f : \RR^n \to \RR\) be an affine functional given by:

\[ f(\bx) = \ba^T \bx + b \]

where \(\ba \in \RR^n\) and \(b \in \RR\).

We can expand it as:

\[ f(\bx) = \sum_{j=1}^n a_j x_j + b. \]

Computing partial derivative with respect to \(x_i\), we get:

\[ \frac{\partial f(\bx)}{\partial x_i} = \frac{\partial }{\partial x_i}\left (\sum_{j=1}^n a_j x_j + b \right) = a_i. \]

Putting the partial derivatives together, we get:

\[ \nabla f(\bx) = \ba. \]

The intercept \(b\) is a constant term which doesn’t affect the gradient.

Example 5.6 (Gradient of quadratic form)

Let \(f : \RR^n \to \RR\) be a quadratic form given by:

\[ f(\bx) = \bx^T \bA \bx \]

where \(\bA \in \RR^{n \times n}\).

We can expand it as:

\[ f(\bx) = \sum_{i=1}^n \sum_{j=1}^n x_i a_{i j} x_j. \]

Note that the diagonal elements \(a_{ii}\) give us terms of the form \(a_{i i} x_i^2\). Let us split the expression into diagonal and non-diagonal terms:

\[\begin{split} f(\bx) = \sum_{i=1}^n a_{i i }x_i^2 + \sum_{\substack{i, j\\i \neq j}} x_i a_{i j} x_j. \end{split}\]

There are \(n\) terms in the first sum (the diagonal entries of \(\bA\)) and \(n^2 - n\) terms in the second sum (the non-diagonal entries of \(\bA\)).

Taking partial derivative w.r.t. \(x_k\), we obtain:

\[\begin{split} \frac{\partial f(\bx)}{\partial x_k} = 2 a_{k k} x_k + \sum_{\substack{i\\i \neq k}} x_i a_{ i k} + \sum_{\substack{j\\j \neq k}} a_{k j} x_j. \end{split}\]
  • The first term comes from the \(a_{k k} x_k^2\) term, which is quadratic in \(x_k\).

  • The first sum comes from the terms linear in \(x_k\) with \(j=k\) and \(i\) running over \(1,\dots,n\), \(i\neq k\).

  • The second sum comes from the terms linear in \(x_k\) with \(i=k\) and \(j\) running over \(1,\dots,n\), \(j\neq k\).

  • There are \(2n - 2\) terms across the two sums and two \(a_{k k} x_k\) terms.

  • We can move one \(a_{k k} x_k\) term into each sum to simplify the partial derivative as:

\[ \frac{\partial f(\bx)}{\partial x_k} = \sum_{i=1}^n x_i a_{i k} + \sum_{j = 1}^n a_{k j} x_j. \]

Note that the \(k\)-th component of the vector \(\bu = \bA \bx\) is \(\sum_{j=1}^n a_{k j} x_j\).

Similarly, the \(k\)-th component of the vector \(\bv = \bA^T \bx\) is \(\sum_{i=1}^n a_{i k} x_i\).

Thus,

\[ \frac{\partial f(\bx)}{\partial x_k} = v_k + u_k. \]

Putting together the partial derivatives, we obtain:

\[ \nabla f(\bx) = \bv + \bu = \bA^T \bx + \bA \bx = (\bA + \bA^T) \bx. \]

If \(\bA\) is symmetric then,

\[ \nabla f(\bx) = 2 \bA \bx. \]
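
The result of Example 5.6 is easy to verify numerically. The following is a minimal sketch (my illustration, with arbitrary random \(\bA\) and \(\bx\)) comparing central finite differences of \(\bx^T \bA \bx\) against \((\bA + \bA^T)\bx\), and against \(2 \bA \bx\) in the symmetric case.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))   # a general (non-symmetric) matrix
x = rng.standard_normal(5)
eps = 1e-6
I = np.eye(5)

f = lambda v: v @ A @ v
g = np.array([(f(x + eps * I[k]) - f(x - eps * I[k])) / (2 * eps)
              for k in range(5)])
print(np.allclose(g, (A + A.T) @ x, atol=1e-5))  # True

# Symmetric case: the gradient reduces to 2 A x.
As = (A + A.T) / 2
fs = lambda v: v @ As @ v
gs = np.array([(fs(x + eps * I[k]) - fs(x - eps * I[k])) / (2 * eps)
               for k in range(5)])
print(np.allclose(gs, 2 * As @ x, atol=1e-5))  # True
```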

Example 5.7 (Gradient of squared \(\ell_2\) norm)

Let \(f : \RR^n \to \RR\) be a quadratic form given by:

\[ f(\bx) = \| \bx \|_2^2 = \bx^T \bx. \]

We can write this as

\[ f(\bx) = \bx^T \bI \bx \]

where \(\bI\) is the identity matrix.

Following Example 5.6,

\[ \nabla f(\bx) = 2 \bI \bx = 2 \bx. \]

Example 5.8 (Gradient of quadratic functional)

Let \(\bP \in \SS^n\) be a symmetric matrix. Let \(\bq \in \RR^n\) and \(r \in \RR\). Consider the quadratic functional \(f: \RR^n \to \RR\) given as:

\[ f(\bx) = \frac{1}{2} \bx^T \bP \bx + \bq^T \bx + r. \]

We can compute the gradient as follows:

\[\begin{split} \nabla f(\bx) &= \nabla \left( \frac{1}{2} \bx^T \bP \bx + \bq^T \bx + r \right )\\ &= \frac{1}{2} \nabla (\bx^T \bP \bx) + \nabla (\bq^T \bx) + \nabla r \\ &= \frac{1}{2} (\bP + \bP^T) \bx + \bq \\ &= \frac{1}{2} (\bP + \bP) \bx + \bq\\ &= \bP \bx + \bq. \end{split}\]
  • We took advantage of the fact that the gradient operation commutes with scalar multiplication and distributes over vector addition.

  • Since \(r\) is a constant, it has no contribution to the derivative.

  • We reused results from previous examples.

  • We utilized the fact that \(\bP = \bP^T\) since \(\bP\) is symmetric.

In summary:

\[ \nabla f(\bx) = \bP \bx + \bq. \]

The derivative of \(f\) is then obtained by taking the transpose of the gradient:

\[ Df (\bx) = \bx^T \bP + \bq^T. \]
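
A numerical confirmation of Example 5.8 is sketched below (an illustration, not from the text; \(\bP\), \(\bq\), \(r\) and \(\bx\) are arbitrary, with \(\bP\) symmetrized by construction): the finite-difference gradient of \(\frac{1}{2} \bx^T \bP \bx + \bq^T \bx + r\) matches \(\bP \bx + \bq\), and the constant \(r\) drops out.

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((4, 4))
P = M + M.T                        # make P symmetric
q = rng.standard_normal(4)
r = 1.5                            # arbitrary constant
x = rng.standard_normal(4)
eps = 1e-6
I = np.eye(4)

f = lambda v: 0.5 * v @ P @ v + q @ v + r
g = np.array([(f(x + eps * I[i]) - f(x - eps * I[i])) / (2 * eps)
              for i in range(4)])
# The constant r cancels in the differences.
print(np.allclose(g, P @ x + q, atol=1e-5))  # True
```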

Definition 5.5 (Gradient mapping)

If a real valued function \(f: \RR^n \to \RR\) is differentiable, the gradient mapping of \(f\) is the function \(\nabla f : \RR^n \to \RR^n\) with \(\dom \nabla f = \dom f\), whose value at every \(\bx \in \dom f\) is \(\nabla f(\bx)\).

5.1.4. Continuous Differentiability#

Definition 5.6 (Continuously differentiable real valued function)

Let \(f: \RR^n \to \RR\) be a real valued function with \(S = \dom f\). Let \(U \subseteq S\) be an open set. If all the partial derivatives of \(f\) exist and are continuous at every \(\bx \in U\), then \(f\) is called continuously differentiable over \(U\).

If \(f\) is continuously differentiable over an open set \(U \subseteq S\), then it is continuously differentiable over every subset \(C \subseteq U\).

If \(S\) is open itself and \(f\) is continuously differentiable over \(S\), then \(f\) is called continuously differentiable.

5.1.5. First Order Approximation#

Definition 5.7 (First order approximation of real valued functions)

The affine function given by:

(5.3)#\[\hat{f} (\bx) = f(\ba) + \nabla f(\ba)^T(\bx - \ba)\]

is the first order approximation of a real valued function \(f\) at \(\bx=\ba \in \interior \dom f\).

Theorem 5.1 (First order approximation accuracy)

Let \(f : \RR^n \to \RR\) be defined on an open set \(S = \dom f\). Assume that \(f\) is continuously differentiable on \(S\). Then,

\[ \lim_{\bd \to \bzero} \frac{f(\bx + \bd) - f(\bx) - \nabla f(\bx)^T \bd}{\| \bd \|} = 0 \Forall \bx \in S. \]

Another way to write this result is:

\[ f(\bx) = f(\ba) + \nabla f(\ba)^T (\bx - \ba) + o (\| \bx - \ba \|) \]

where \(\ba \in S\) and \(o(\cdot) : \RR_+ \to \RR\) is a one dimensional function satisfying \(\frac{o(t)}{t} \to 0\) as \(t \to 0^+\).
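
Theorem 5.1 can be observed numerically. In the sketch below (my illustration; the smooth test function \(f(\bx) = \ln(1 + \|\bx\|^2)\) and its hand-computed gradient are arbitrary choices), the first order residual divided by \(\| \bd \|\) shrinks as the step shrinks.

```python
import numpy as np

# An arbitrary smooth test function and its hand-computed gradient.
f = lambda x: np.log1p(x @ x)            # f(x) = ln(1 + ||x||^2)
grad = lambda x: 2 * x / (1 + x @ x)

rng = np.random.default_rng(3)
x = rng.standard_normal(4)
d = rng.standard_normal(4)

for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    r = f(x + t * d) - f(x) - grad(x) @ (t * d)   # first order residual
    print(t, abs(r) / np.linalg.norm(t * d))      # ratio tends to 0
```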

5.1.6. Chain Rule#

Theorem 5.2 (Chain rule)

Suppose \(f : \RR^n \to \RR^m\) is differentiable at \(\bx \in \interior \dom f\) and \(g : \RR^m \to \RR^p\) is differentiable at \(f(\bx) \in \interior \dom g\). Define the composition \(h: \RR^n \to \RR^p\) as:

\[ h(\bx) = g(f(\bx)). \]

Then, \(h\) is differentiable at \(\bx\) with the derivative given by:

\[ Dh(\bx) = Dg(f(\bx)) Df(\bx). \]

Notice how the derivative lines up as a simple matrix multiplication.

Corollary 5.1 (Chain rule for real valued functions)

Suppose \(f : \RR^n \to \RR\) is differentiable at \(\bx \in \interior \dom f\) and \(g : \RR \to \RR\) is differentiable at \(f(\bx) \in \interior \dom g\). Define the composition \(h: \RR^n \to \RR\) as:

\[ h(\bx) = g(f(\bx)). \]

Then, \(h\) is differentiable at \(\bx\) with the gradient given by:

\[ \nabla h(\bx) = g'(f(\bx)) \nabla f(\bx). \]

Example 5.9 (Gradient of log-sum-exp)

Let \(h : \RR^n \to \RR\) be given by:

\[ h(\bx) = \ln \left ( \sum_{i=1}^n \exp x_i \right ) \]

with \(\dom h = \RR^n\).

Let \(g(y) = \ln y\) and

\[ f(\bx) = \sum_{i=1}^n \exp x_i. \]

Then, we can see that \(h(\bx) = g (f (\bx))\). Now \(g'(y) = \frac{1}{y}\) and

\[\begin{split} \nabla f(\bx) = \begin{bmatrix} \exp x_1 \\ \vdots \\ \exp x_n \end{bmatrix}. \end{split}\]

Thus,

\[\begin{split} \nabla h(\bx) = \frac{1}{\sum_{i=1}^n \exp x_i} \begin{bmatrix} \exp x_1 \\ \vdots \\ \exp x_n \end{bmatrix}. \end{split}\]

Now, if we define

\[\begin{split} \bz = \begin{bmatrix} \exp x_1 \\ \vdots \\ \exp x_n \end{bmatrix} \end{split}\]

then, we see that:

\[ \bone^T \bz = \sum_{i=1}^n \exp x_i. \]

Using this notation:

\[ \nabla h(\bx) = \frac{1}{\bone^T \bz} \bz. \]
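
The gradient just derived is the softmax vector, which is easy to check by finite differences. A minimal sketch (my illustration; the test point is arbitrary and numerical stability of the exponentials is ignored):

```python
import numpy as np

def lse(x):
    """Log-sum-exp (stability safeguards omitted in this sketch)."""
    return np.log(np.sum(np.exp(x)))

def softmax(x):
    z = np.exp(x)
    return z / z.sum()      # z / (1^T z), the gradient derived above

x = np.array([0.5, -1.0, 2.0])
eps = 1e-6
I = np.eye(3)
g = np.array([(lse(x + eps * I[i]) - lse(x - eps * I[i])) / (2 * eps)
              for i in range(3)])
print(np.allclose(g, softmax(x), atol=1e-6))  # True
```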

Example 5.10 (Gradient of \(\ell_2\) norm at nonzero vectors)

Let \(h : \RR^n \to \RR\) be given by:

\[ h(\bx) = \| \bx \|_2 = \sqrt{ \langle \bx, \bx \rangle} \]

with \(\dom h = \RR^n\).

Let \(g : \RR \to \RR\) with \(\dom g = \RR_+\) be given by \(g(y) = \sqrt{y}\).

Let \(f : \RR^n \to \RR\) with \(\dom f = \RR^n\) be given by

\[ f(\bx) = \langle \bx, \bx \rangle = \sum_{i=1}^n x_i^2 = \| \bx \|_2^2. \]

Then, we can see that \(h(\bx) = g (f (\bx))\) or \(h = g \circ f\).

\(g\) is differentiable on the open set \(\RR_{++}\). For every \(y \in \RR_{++}\),

\[ g'(y) = \frac{1}{2 \sqrt{y}} \]

and (from Example 5.7)

\[ \nabla f(\bx) = 2 \bx. \]

Thus, for every \(\bx \neq \bzero\), following Corollary 5.1,

\[ \nabla h(\bx) = g'(f(\bx)) \nabla f(\bx) = \frac{1}{2 \sqrt{\| \bx \|_2^2}} 2 \bx = \frac{\bx}{\| \bx \|_2}. \]

The gradient of \(\ell_2\) norm at \(\bzero\) doesn’t exist. However, subgradients can be computed. See Example 9.71 and Example 9.72.

Corollary 5.2 (Chain rule for composition with affine function)

Suppose \(f : \RR^n \to \RR^m\) is differentiable. Let \(\bA \in \RR^{n \times p}\) and \(\bb \in \RR^n\). Define \(g : \RR^p \to \RR^m\) as:

\[ g(\bx) = f(\bA \bx + \bb) \]

with \(\dom g = \{ \bx \ST \bA \bx + \bb \in \dom f \}\).

The derivative of \(g\) at \(\bx \in \interior \dom g\) is given by:

\[ Dg(\bx) = Df(\bA \bx + \bb) \bA. \]

If \(f\) is real valued (i.e. \(m=1\)), then the gradient of a composition of a function with an affine function is given by:

\[ \nabla g(\bx) = \bA^T \nabla f(\bA \bx + \bb). \]

Example 5.11 (Chain rule for restriction on a line)

Let \(f : \RR^n \to \RR\) be a real valued differentiable function. Consider the restriction of \(f\) on a line in its domain

\[ g(t) = f(\bx + t \bv) \]

where \(\bx \in \dom f\) and \(\bv \in \RR^n\) with the domain

\[ \dom g = \{t \ST \bx + t \bv \in \dom f\}. \]

If we define \(h : \RR \to \RR^n\) as:

\[ h(t) = \bx + t \bv, \]

we can see that:

\[ g(t) = f(h(t)). \]

By chain rule:

\[ g'(t) = Df(h(t)) Dh(t) = \nabla f(h(t))^T \bv = \nabla f(\bx + t \bv)^T \bv. \]

In particular, if \(\bv = \by - \bx\), with \(\by \in \dom f\),

\[ g'(t) = \nabla f(\bx + t (\by -\bx) )^T (\by - \bx) = \nabla f(t \by + (1-t) \bx)^T (\by - \bx). \]
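
The restriction formula can be checked numerically. Below is a small sketch (my illustration) using \(f(\bx) = \|\bx\|_2^2\) from Example 5.7, whose gradient \(2\bx\) we already know; the point \(\bx\), direction \(\bv\) and value of \(t\) are arbitrary.

```python
import numpy as np

# f(x) = ||x||_2^2 with gradient 2x (Example 5.7).
f = lambda v: v @ v
grad = lambda v: 2 * v

rng = np.random.default_rng(4)
x = rng.standard_normal(4)
v = rng.standard_normal(4)

g = lambda t: f(x + t * v)   # restriction of f on a line

t, eps = 0.7, 1e-6
num = (g(t + eps) - g(t - eps)) / (2 * eps)   # numerical g'(t)
ana = grad(x + t * v) @ v                     # chain rule formula
print(np.isclose(num, ana))                   # True
```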

5.1.7. Hessian#

In this section, we review the second derivative of a real valued function \(f: \RR^n \to \RR\).

Definition 5.8 (Hessian)

The second derivative or Hessian matrix of \(f\) at \(\bx \in \interior \dom f\), denoted by \(\nabla^2 f(\bx)\), is given by:

\[\nabla^2 f(\bx)_{i j} = \frac{\partial^2 f(\bx)}{\partial x_i \partial x_j}, \quad i=1,\dots,n, \quad j=1,\dots,n\]

provided \(f\) is twice differentiable at \(\bx\).

Example 5.12 (Hessian of linear functional)

Let \(f : \RR^n \to \RR\) be a linear functional given by:

\[ f(\bx) = \langle \bx, \ba \rangle = \ba^T \bx. \]

We can expand it as:

\[ f(\bx) = \sum_{j=1}^n a_j x_j. \]

Computing partial derivative with respect to \(x_i\), we get:

\[ \frac{\partial f(\bx)}{\partial x_i} = \frac{\partial }{\partial x_i}\left (\sum_{j=1}^n a_j x_j \right ) = a_i. \]

If we further compute the partial derivative w.r.t. \(x_j\), we get:

\[ \frac{\partial^2 f(\bx)}{\partial x_i \partial x_j} = \frac{\partial a_i}{\partial x_j} = 0. \]

Thus, the Hessian is the \(n \times n\) zero matrix:

\[ \nabla^2 f(\bx) = \ZERO_n. \]

Theorem 5.3

The Hessian is the derivative of the gradient mapping:

\[ D \nabla f(\bx) = \nabla^2 f(\bx). \]

Example 5.13 (Hessian of quadratic form)

Let \(f : \RR^n \to \RR\) be a quadratic form given by:

\[ f(\bx) = \bx^T \bA \bx \]

where \(\bA \in \RR^{n \times n}\).

Recall from Example 5.6 that:

\[ \nabla f(\bx) = (\bA^T + \bA) \bx. \]

Also recall from Example 5.2 that

\[ D (\bC \bx) = \bC \]

for all \(\bC \in \RR^{m \times n}\).

Thus, using Theorem 5.3

\[ \nabla^2 f(\bx) = D \nabla f(\bx) = D ((\bA^T + \bA) \bx) = \bA^T + \bA. \]

If \(\bA\) is symmetric then

\[ \nabla^2 f(\bx) = 2 \bA. \]

Example 5.14 (Hessian of log-sum-exp)

Let \(f : \RR^n \to \RR\) be given by:

\[ f(\bx) = \ln \left ( \sum_{i=1}^n e^{x_i} \right ) \]

with \(\dom f = \RR^n\).

Define

\[\begin{split} \bz = \begin{bmatrix} e^{x_1} \\ \vdots \\ e^{x_n} \end{bmatrix} \end{split}\]

then, we see that:

\[ \bone^T \bz = \sum_{i=1}^n e^{x_i}. \]

Using this notation:

\[ f(\bx) = \ln \left (\bone^T \bz \right). \]

We have:

\[ \frac{\partial z_i}{\partial x_i} = \frac{\partial}{\partial x_i} e^{x_i} = e^{x_i} = z_i. \]

\(\frac{\partial z_j}{\partial x_i} = 0\) for \(i \neq j\). Now,

\[\begin{split} \frac{\partial }{\partial x_i} f(\bx) &= \frac{\partial}{\partial z_i} \ln \left (\bone^T \bz \right) \cdot \frac{\partial z_i}{\partial x_i} \\ &= \frac{1}{\bone^T \bz}\frac{\partial}{\partial z_i} \bone^T \bz \cdot z_i \\ &= \frac{1}{\bone^T \bz} z_i. \end{split}\]

Proceeding to compute the second derivatives:

\[\begin{split} \frac{\partial^2 }{\partial x_i \partial x_j} f(\bx) &= \frac{\partial }{\partial x_i} \left (\frac{1}{\bone^T \bz} z_j \right )\\ &= \frac{\partial }{\partial z_i} \left (\frac{1}{\bone^T \bz} z_j \right ) \cdot \frac{\partial z_i}{\partial x_i} \\ &= \frac{\bone^T \bz \delta_{i j} - z_j}{(\bone^T \bz)^2} \cdot z_i\\ &= \frac{\bone^T \bz \delta_{i j} z_i - z_i z_j}{(\bone^T \bz)^2}\\ &=\frac{\delta_{i j} z_i}{\bone^T \bz} - \frac{z_i z_j}{(\bone^T \bz)^2}. \end{split}\]

Now, note that \((\bz \bz^T)_{i j} = z_i z_j\) and \((\Diag (\bz))_{i j} = \delta_{i j} z_i\).

Thus,

\[ \nabla^2 f(\bx) = \frac{1}{\bone^T \bz} \Diag (\bz) - \frac{1}{(\bone^T \bz)^2} \bz \bz^T. \]

Alternatively,

\[ \nabla^2 f(\bx) = \frac{1}{(\bone^T \bz)^2} \left ((\bone^T \bz) \Diag (\bz) - \bz \bz^T \right ). \]
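A verification sketch of this Hessian formula (my addition, at an arbitrary test point) compares it against finite differences of the gradient, which by Theorem 5.3 is the softmax vector:

```python
import numpy as np

x = np.array([0.1, 1.5, -0.7])
z = np.exp(x)
s = z.sum()                       # 1^T z

# Closed form Hessian from above.
H = np.diag(z) / s - np.outer(z, z) / s**2

# Finite differences of the gradient (the softmax vector).
softmax = lambda v: np.exp(v) / np.exp(v).sum()
eps = 1e-6
I = np.eye(3)
H_num = np.column_stack([
    (softmax(x + eps * I[j]) - softmax(x - eps * I[j])) / (2 * eps)
    for j in range(3)
])
print(np.allclose(H, H_num, atol=1e-6))  # True
```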

Example 5.15 (Derivatives for least squares cost function)

Let \(\bA \in \RR^{m \times n}\). Let \(\bb \in \RR^m\). Consider the least squares cost function:

\[ f(\bx) = \frac{1}{2} \| \bA \bx - \bb \|_2^2. \]

Expanding it, we get:

\[ f(\bx) = \frac{1}{2} \bx^T \bA^T \bA \bx - \bb^T \bA \bx + \frac{1}{2} \bb^T \bb. \]

Note that \(\bA^T \bA\) is symmetric. Using previous results, we obtain the gradient:

\[ \nabla f(\bx) = \bA^T \bA \bx - \bA^T \bb. \]

And the Hessian is:

\[ \nabla^2 f(\bx) = D \nabla f (\bx) = \bA^T \bA. \]
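
Both derivatives admit a quick numerical check. The sketch below (my illustration, with arbitrary random \(\bA\), \(\bb\) and \(\bx\)) verifies the gradient by finite differences of \(f\), and the constant Hessian \(\bA^T \bA\) by finite differences of the gradient.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 3))
b = rng.standard_normal(6)
x = rng.standard_normal(3)
eps = 1e-6
I = np.eye(3)

f = lambda v: 0.5 * np.sum((A @ v - b) ** 2)
grad = lambda v: A.T @ (A @ v - b)   # equals A^T A v - A^T b

g_num = np.array([(f(x + eps * I[i]) - f(x - eps * I[i])) / (2 * eps)
                  for i in range(3)])
print(np.allclose(g_num, grad(x), atol=1e-5))  # True

H_num = np.column_stack([
    (grad(x + eps * I[j]) - grad(x - eps * I[j])) / (2 * eps)
    for j in range(3)
])
print(np.allclose(H_num, A.T @ A, atol=1e-5))  # True: constant in x
```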

Example 5.16 (Derivatives for quadratic over linear function)

Let \(f : \RR \times \RR \to \RR\) be given by:

\[ f(x, y) = \frac{x^2}{y} \]

with \(\dom f = \{ (x, y) \ST y > 0\}\).

The gradient is obtained by computing the partial derivatives w.r.t. \(x\) and \(y\):

\[\begin{split} \nabla f(x,y) = \begin{bmatrix} \frac{2x}{y}\\ \frac{-x^2}{y^2} \end{bmatrix}. \end{split}\]

The Hessian is obtained by computing second order partial derivatives:

\[\begin{split} \nabla^2 f(x, y) = \begin{bmatrix} \frac{2}{y} & \frac{-2 x}{y^2}\\ \frac{-2 x}{y^2} & \frac{2 x^2}{y^3} \end{bmatrix} = \frac{2}{y^3} \begin{bmatrix} y^2 & - x y\\ - x y & x^2 \end{bmatrix}. \end{split}\]

5.1.8. Twice Continuous Differentiability#

Definition 5.9 (Twice continuously differentiable real valued function)

Let \(f: \RR^n \to \RR\) be a real valued function with \(S = \dom f\). Let \(U \subseteq S\) be an open set. If all the second order partial derivatives of \(f\) exist and are continuous at every \(\bx \in U\), then \(f\) is called twice continuously differentiable over \(U\).

If \(f\) is twice continuously differentiable over an open set \(U \subseteq S\), then it is twice continuously differentiable over every subset \(C \subseteq U\).

If \(S\) is open itself and \(f\) is twice continuously differentiable over \(S\), then \(f\) is called twice continuously differentiable.

Theorem 5.4 (Symmetry of Hessian)

If \(f : \RR^n \to \RR\) with \(S = \dom f\) is twice continuously differentiable over a set \(U \subseteq S\), then its Hessian matrix \(\nabla^2 f(\bx)\) is symmetric at every \(\bx \in U\).

5.1.9. Second Order Approximation#

Theorem 5.5 (Linear approximation theorem)

Let \(f : \RR^n \to \RR\) with \(S = \dom f\) be twice continuously differentiable over an open set \(U \subseteq S\). Let \(\bx \in U\). Let \(r > 0\) be such that \(B(\bx, r) \subseteq U\). Then, for any \(\by \in B(\bx, r)\), there exists \(\bz \in [\bx, \by]\) such that

\[ f(\by) - f(\bx) = \nabla f(\bx)^T (\by - \bx) + \frac{1}{2} (\by - \bx)^T \nabla^2 f(\bz) (\by - \bx). \]

Theorem 5.6 (Quadratic approximation theorem)

Let \(f : \RR^n \to \RR\) with \(S = \dom f\) be twice continuously differentiable over an open set \(U \subseteq S\). Let \(\bx \in U\). Let \(r > 0\) be such that \(B(\bx, r) \subseteq U\). Then, for any \(\by \in B(\bx, r)\),

\[ f(\by) = f(\bx) + \nabla f(\bx)^T (\by - \bx) + \frac{1}{2} (\by - \bx)^T \nabla^2 f(\bx) (\by - \bx) + o(\| \by - \bx \|^2). \]

Definition 5.10 (Second order approximation)

The second order approximation of \(f\) at or near \(\bx=\ba\) is the quadratic function defined by:

\[\hat{f} (\bx) = f(\ba) + \nabla f(\ba)^T (\bx - \ba) + \frac{1}{2} (\bx - \ba)^T \nabla^2 f(\ba) (\bx - \ba).\]
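
An illustration of the second order approximation (my sketch, not from the text) uses \(f(x, y) = x^2/y\) from Example 5.16 with its gradient and Hessian: the error of the quadratic model, scaled by the squared step length, still tends to zero as the step shrinks, consistent with the \(o(\| \by - \bx \|^2)\) term in Theorem 5.6. The point and direction below are arbitrary choices keeping \(y > 0\).

```python
import numpy as np

# f(x, y) = x^2 / y with gradient and Hessian from Example 5.16.
f = lambda p: p[0] ** 2 / p[1]
grad = lambda p: np.array([2 * p[0] / p[1], -p[0] ** 2 / p[1] ** 2])
hess = lambda p: (2 / p[1] ** 3) * np.array([[p[1] ** 2, -p[0] * p[1]],
                                             [-p[0] * p[1], p[0] ** 2]])

a = np.array([1.0, 2.0])     # a point with y > 0
d = np.array([0.3, -0.1])    # an arbitrary direction

for t in [1.0, 0.1, 0.01]:
    s = t * d
    f_hat = f(a) + grad(a) @ s + 0.5 * s @ hess(a) @ s
    # Quadratic model error scaled by ||s||^2; tends to 0.
    print(t, abs(f(a + s) - f_hat) / (np.linalg.norm(s) ** 2))
```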

5.1.10. Smoothness#

5.1.10.1. Real Functions#

Definition 5.11 (Class of continuous functions)

The class of continuous real functions, denoted by \(C\), is the set of functions of type \(f: \RR \to \RR\) which are continuous over their domain \(\dom f\).

Definition 5.12 (Differentiability class \(C^k\))

Let \(f: \RR \to \RR\) be a real function with \(S = \dom f\).

Then, we say that \(f\) belongs to the differentiability class \(C^k\) on \(S\) if and only if

\[ \frac{d^k}{d x^k} f(x) \in C. \]

In other words, the \(k\)-th derivative of \(f\) exists and is continuous.

  1. \(C^0\) is the class of continuous real functions.

  2. \(C^1\) is the class of continuously differentiable functions.

  3. \(C^{\infty}\) is the class of smooth functions, which are infinitely differentiable.

5.1.10.2. Real Valued Functions on Euclidean Space#

Definition 5.13 (Differentiability class \(C^k\))

A function \(f: \RR^n \to \RR\) with \(S = \dom f\) where \(S\) is an open subset of \(\RR^n\) is said to be of class \(C^k\) on \(S\), for a positive integer \(k\), if all the partial derivatives of \(f\)

\[ \frac{\partial^m f}{\partial x_1^{m_1} \partial x_2^{m_2} \dots \partial x_n^{m_n}} (\bx) \]

exist and are continuous for every \(m_1,m_2,\dots,m_n \geq 0\) and \(m = m_1 + m_2 + \dots + m_n \leq k\).

  1. If \(f\) is continuous, it is said to belong to \(C\) or \(C^0\).

  2. If \(f\) is continuously differentiable, it is said to belong to \(C^1\).

  3. If \(f\) is twice continuously differentiable, it is said to belong to \(C^2\).

  4. If \(f\) is infinitely differentiable, it is said to belong to \(C^{\infty}\).