# 5.1. Differentiation#

We consider functions from $$\RR^n$$ to $$\RR^m$$.

## 5.1.1. Differentiability and Jacobian#

Definition 5.1 (Differentiability at a point)

Let $$f : \RR^n \to \RR^m$$. Let $$\bx \in \interior \dom f$$. The function $$f$$ is differentiable at $$\bx$$ if there exists a matrix $$Df(\bx) \in \RR^{m \times n}$$ that satisfies

(5.1)#$\underset{\bz \in \dom f, \bz \neq \bx, \bz \to \bx}{\lim} \frac{\| f(\bz) - f(\bx) - Df(\bx) (\bz - \bx) \|_2}{\| \bz - \bx \|_2} = 0.$

Such a matrix $$Df(\bx)$$ is called the derivative (or Jacobian) of $$f$$ at $$\bx$$.

There can be at most one $$Df(\bx)$$ satisfying the limit in (5.1).

Observation 5.1

If we write $$\bz = \bx + \bh$$ then an alternative form for (5.1) is given by:

$\underset{\bx + \bh \in \dom f, \bh \neq \bzero, \bh \to \bzero}{\lim} \frac{\| f(\bx + \bh) - f(\bx) - Df(\bx) \bh \|_2}{\| \bh \|_2} = 0.$

The matrix $$Df(\bx)$$ can be obtained from the partial derivatives:

$Df(\bx)_{ij} = \frac{\partial f_i(\bx)}{\partial x_j}, \quad i=1,\dots,m, \quad j=1,\dots,n.$
$\begin{split} Df(\bx) = \begin{bmatrix} \frac{\partial f_1(\bx)}{\partial x_1} & \frac{\partial f_1(\bx)}{\partial x_2} & \dots & \frac{\partial f_1(\bx)}{\partial x_n}\\ \frac{\partial f_2(\bx)}{\partial x_1} & \frac{\partial f_2(\bx)}{\partial x_2} & \dots & \frac{\partial f_2(\bx)}{\partial x_n}\\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m(\bx)}{\partial x_1} & \frac{\partial f_m(\bx)}{\partial x_2} & \dots & \frac{\partial f_m(\bx)}{\partial x_n} \end{bmatrix}. \end{split}$
1. The Jacobian $$Df(\bx)$$ is an $$m \times n$$ real matrix.

2. Partial derivatives of each component of $$f$$ (i.e., $$f_i$$) line up on the $$i$$-th row.

3. Partial derivatives for one coordinate $$x_j$$ line up on the $$j$$-th column.

4. If $$f$$ is real valued (i.e., $$m = 1$$), then the Jacobian $$Df(\bx)$$ is a row vector.
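The partial-derivative recipe above suggests a simple numerical check. The sketch below (assuming NumPy is available; `numerical_jacobian` is a hypothetical helper, not part of the text) estimates the Jacobian by central differences, one column per coordinate:

```python
import numpy as np

def numerical_jacobian(f, x, h=1e-6):
    """Estimate the m x n Jacobian Df(x) by central differences."""
    x = np.asarray(x, dtype=float)
    fx = np.atleast_1d(f(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        # column j of Df(x): difference along the j-th coordinate
        J[:, j] = (np.atleast_1d(f(x + e)) - np.atleast_1d(f(x - e))) / (2 * h)
    return J

# f(x1, x2) = (x1 * x2, x1 + x2) has Jacobian [[x2, x1], [1, 1]]
f = lambda v: np.array([v[0] * v[1], v[0] + v[1]])
J = numerical_jacobian(f, [2.0, 3.0])
```

At $$(2, 3)$$ this returns a matrix close to $$\begin{bmatrix} 3 & 2 \\ 1 & 1\end{bmatrix}$$, with partial derivatives of $$f_i$$ on the $$i$$-th row, as stated above.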

Example 5.1 (Jacobian of identity function)

Let $$f: \RR^n \to \RR^n$$ be defined as:

$f(\bx) = \bx.$

Then, $$f_i(\bx) = x_i$$. Hence,

$\frac{\partial f_i(\bx)}{\partial x_j} = \delta(i, j).$

Thus

$D f(\bx) = \bI_n$

the $$n\times n$$ identity matrix.

Example 5.2 (Jacobian of linear transformation)

Let $$f: \RR^n \to \RR^m$$ be defined as:

$f(\bx) = \bA \bx$

where $$\bA = (a_{i j})$$ is an $$m \times n$$ real matrix.

Then, $$f_i(\bx) = \sum_{j=1}^n a_{i j} x_j$$. Hence,

$\frac{\partial f_i(\bx)}{\partial x_j} = a_{i j}.$

Thus

$D f(\bx) = \bA.$

Example 5.3 (Jacobian of affine transformation)

Let $$f: \RR^n \to \RR^m$$ be defined as:

$f(\bx) = \bA \bx + \bb$

where $$\bA = (a_{i j}) \in \RR^{m \times n}$$ and $$\bb \in \RR^m$$.

Then, $$f_i(\bx) = \sum_{j=1}^n a_{i j} x_j + b_i$$. Hence,

$\frac{\partial f_i(\bx)}{\partial x_j} = a_{i j}.$

Thus

$D f(\bx) = \bA.$

The vector $$\bb$$ is a constant offset. It has no impact on the derivative.
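The cancellation of the offset is easy to observe numerically. A minimal sketch (assuming NumPy; the matrix and vectors are arbitrary test data) recovers $$\bA$$ by central differences even though $$f$$ includes the offset $$\bb$$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
b = rng.standard_normal(3)
x = rng.standard_normal(4)

f = lambda v: A @ v + b      # affine map f(x) = A x + b
h = 1e-6
# finite-difference Jacobian, one column per coordinate of x;
# the constant offset b cancels in each difference
J = np.column_stack([
    (f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(4)
])
```

`J` matches `A` to finite-difference accuracy, independent of the point `x`.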

Definition 5.2 (Differentiable function)

A function $$f$$ is called differentiable if its domain $$\dom f$$ is open and it is differentiable at every point of $$\dom f$$.

Definition 5.3 (First order approximation)

The affine function given by:

(5.2)#$\hat{f} (\bx) = f(\ba) + Df(\ba)(\bx - \ba)$

is called the first order approximation of $$f$$ at $$\bx=\ba \in \interior \dom f$$.
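The defining limit (5.1) says that the error of the first order approximation vanishes faster than $$\|\bx - \ba\|$$. A small numerical sketch (assuming NumPy; the test function and direction are our choices, not from the text) shows the error ratio shrinking along a fixed direction:

```python
import numpy as np

# f(x1, x2) = sin(x1) + x1 * x2, with Df(a) = [cos(a1) + a2, a1]
f = lambda v: np.sin(v[0]) + v[0] * v[1]
a = np.array([1.0, 2.0])
Df = np.array([np.cos(1.0) + 2.0, 1.0])

ratios = []
for t in [1e-1, 1e-2, 1e-3]:
    d = t * np.array([0.6, -0.8])          # shrink a fixed unit direction
    err = abs(f(a + d) - f(a) - Df @ d)    # |f(z) - f_hat(z)|
    ratios.append(err / np.linalg.norm(d))
# ratios -> 0 as d -> 0, as the limit in (5.1) requires
```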

## 5.1.2. Real Valued Functions#

The rest of this section focuses mostly on real valued functions of the type $$f : \RR^n \to \RR$$.

1. First order derivative of a real valued function is called a gradient.

2. Second order derivative of a real valued function is called a Hessian.

3. We consider first order and second order approximations of a real valued function.

## 5.1.3. Gradient#

Definition 5.4 (Gradient)

When $$f : \RR^n \to \RR$$ is a real valued function, the derivative $$Df(\bx)$$ is a $$1 \times n$$ matrix. The gradient of a real valued function is defined as:

$\nabla f(\bx) = Df (\bx)^T$

at $$\bx \in \interior \dom f$$ if $$f$$ is differentiable at $$\bx$$.

For real valued functions, the derivative is a row vector but the gradient is a column vector.

The components of the gradient are given by the partial derivatives as:

$\nabla f(\bx)_i = \frac{\partial f(\bx)}{\partial x_i}, \quad i=1,\dots,n.$

Example 5.4 (Gradient of linear functional)

Let $$f : \RR^n \to \RR$$ be a linear functional given by:

$f(\bx) = \langle \bx, \ba \rangle = \ba^T \bx.$

We can expand it as:

$f(\bx) = \sum_{j=1}^n a_j x_j.$

Computing partial derivative with respect to $$x_i$$, we get:

$\frac{\partial f(\bx)}{\partial x_i} = \frac{\partial }{\partial x_i}\left (\sum_{j=1}^n a_j x_j \right ) = a_i.$

Putting the partial derivatives together, we get:

$\nabla f(\bx) = \ba.$

Example 5.5 (Gradient of affine functional)

Let $$f : \RR^n \to \RR$$ be an affine functional given by:

$f(\bx) = \ba^T \bx + b$

where $$\ba \in \RR^n$$ and $$b \in \RR$$.

We can expand it as:

$f(\bx) = \sum_{j=1}^n a_j x_j + b.$

Computing partial derivative with respect to $$x_i$$, we get:

$\frac{\partial f(\bx)}{\partial x_i} = \frac{\partial }{\partial x_i}\left (\sum_{j=1}^n a_j x_j + b \right) = a_i.$

Putting the partial derivatives together, we get:

$\nabla f(\bx) = \ba.$

The intercept $$b$$ is a constant term which doesn’t affect the gradient.

Example 5.6 (Gradient of quadratic form)

Let $$f : \RR^n \to \RR$$ be a quadratic form given by:

$f(\bx) = \bx^T \bA \bx$

where $$\bA \in \RR^{n \times n}$$.

We can expand it as:

$f(\bx) = \sum_{i=1}^n \sum_{j=1}^n x_i a_{i j} x_j.$

Note that the diagonal elements $$a_{ii}$$ give us terms of the form $$a_{i i} x_i^2$$. Let us split the expression into diagonal and non-diagonal terms:

$\begin{split} f(\bx) = \sum_{i=1}^n a_{i i }x_i^2 + \sum_{\substack{i, j\\i \neq j}} x_i a_{i j} x_j. \end{split}$

There are $$n$$ terms in the first sum (the diagonal entries of $$\bA$$) and $$n^2 - n$$ terms in the second sum (the non-diagonal entries of $$\bA$$).

Taking partial derivative w.r.t. $$x_k$$, we obtain:

$\begin{split} \frac{\partial f(\bx)}{\partial x_k} = 2 a_{k k} x_k + \sum_{\substack{i\\i \neq k}} x_i a_{ i k} + \sum_{\substack{j\\j \neq k}} a_{k j} x_j. \end{split}$
• The first term comes from $$a_{k k}$$ term that is quadratic in $$x_k$$.

• The first sum comes from linear terms where $$k=j$$ and $$i=1,\dots,n$$ except $$i\neq k$$.

• The second sum comes from linear terms where $$k=i$$ and $$j=1,\dots,n$$ except $$j\neq k$$.

• There are $$2n - 2$$ terms in the two sums, while the leading term contributes two copies of $$a_{k k} x_k$$.

• We can absorb one copy of $$a_{k k} x_k$$ into each sum to simplify the partial derivative as:

$\frac{\partial f(\bx)}{\partial x_k} = \sum_{i=1}^n x_i a_{i k} + \sum_{j = 1}^n a_{k j} x_j.$

Note that the $$k$$-th component of the vector $$\bu = \bA \bx$$ is $$\sum_{j=1}^n a_{k j} x_j$$.

Similarly, the $$k$$-th component of the vector $$\bv = \bA^T \bx$$ is $$\sum_{i=1}^n a_{i k} x_i$$.

Thus,

$\frac{\partial f(\bx)}{\partial x_k} = v_k + u_k.$

Putting together the partial derivatives, we obtain:

$\nabla f(\bx) = \bv + \bu = \bA^T \bx + \bA \bx = (\bA^T + \bA) \bx = (\bA + \bA^T) \bx.$

If $$\bA$$ is symmetric then,

$\nabla f(\bx) = 2 \bA \bx.$
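The identity $$\nabla f(\bx) = (\bA + \bA^T)\bx$$ can be checked numerically. A minimal sketch (assuming NumPy; the matrix is arbitrary test data and deliberately not symmetric) compares the formula against central differences:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))        # arbitrary, not necessarily symmetric
x = rng.standard_normal(4)
f = lambda v: v @ A @ v                # quadratic form x^T A x

h = 1e-6
grad_fd = np.array([
    (f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(4)
])
grad_formula = (A + A.T) @ x           # the result derived above
```

For a symmetric `A` the formula reduces to `2 * A @ x`.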

Example 5.7 (Gradient of squared $$\ell_2$$ norm)

Let $$f : \RR^n \to \RR$$ be a quadratic form given by:

$f(\bx) = \| \bx \|_2^2 = \bx^T \bx.$

We can write this as

$f(\bx) = \bx^T \bI \bx$

where $$\bI$$ is the identity matrix.

Following Example 5.6,

$\nabla f(\bx) = 2 \bI \bx = 2 \bx.$

Example 5.8 (Gradient of quadratic functional)

Let $$\bP \in \SS^n$$ be a symmetric matrix. Let $$\bq \in \RR^n$$ and $$r \in \RR$$. Consider the quadratic functional $$f: \RR^n \to \RR$$ given as:

$f(\bx) = \frac{1}{2} \bx^T \bP \bx + \bq^T \bx + r.$

We can compute the gradient as follows:

$\begin{split} \nabla f(\bx) &= \nabla \left( \frac{1}{2} \bx^T \bP \bx + \bq^T \bx + r \right )\\ &= \frac{1}{2} \nabla (\bx^T \bP \bx) + \nabla (\bq^T \bx) + \nabla r \\ &= \frac{1}{2} (\bP + \bP^T) \bx + \bq \\ &= \frac{1}{2} (\bP + \bP) \bx + \bq\\ &= \bP \bx + \bq. \end{split}$
• We took advantage of the fact that the gradient operation commutes with scalar multiplication and distributes over vector addition.

• Since $$r$$ is a constant, it has no contribution to the derivative.

• We reused results from previous examples.

• We utilized the fact that $$\bP = \bP^T$$ since $$\bP$$ is symmetric.

In summary:

$\nabla f(\bx) = \bP \bx + \bq.$

The derivative of $$f$$ is then obtained by taking the transpose of the gradient:

$Df (\bx) = \bx^T \bP + \bq^T.$

Definition 5.5 (Gradient mapping)

If a real valued function $$f: \RR^n \to \RR$$ is differentiable, the gradient mapping of $$f$$ is the function $$\nabla f : \RR^n \to \RR^n$$ with $$\dom \nabla f = \dom f$$, with the value $$\nabla f(\bx)$$ at every $$\bx \in \dom f$$.

## 5.1.4. Continuous Differentiability#

Definition 5.6 (Continuously differentiable real valued function)

Let $$f: \RR^n \to \RR$$ be a real valued function with $$S = \dom f$$. Let $$U \subseteq S$$ be an open set. If all the partial derivatives of $$f$$ exist and are continuous at every $$\bx \in U$$, then $$f$$ is called continuously differentiable over $$U$$.

If $$f$$ is continuously differentiable over an open set $$U \subseteq S$$, then it is continuously differentiable over every subset $$C \subseteq U$$.

If $$S$$ is open itself and $$f$$ is continuously differentiable over $$S$$, then $$f$$ is called continuously differentiable.

## 5.1.5. First Order Approximation#

Definition 5.7 (First order approximation of real valued functions)

The affine function given by:

(5.3)#$\hat{f} (\bx) = f(\ba) + \nabla f(\ba)^T(\bx - \ba)$

is the first order approximation of a real valued function $$f$$ at $$\bx=\ba \in \interior \dom f$$.

Theorem 5.1 (First order approximation accuracy)

Let $$f : \RR^n \to \RR$$ be defined on an open set $$S = \dom f$$. Assume that $$f$$ is continuously differentiable on $$S$$. Then,

$\lim_{\bd \to \bzero} \frac{f(\bx + \bd) - f(\bx) - \nabla f(\bx)^T \bd}{\| \bd \|} = 0 \Forall \bx \in S.$

Another way to write this result is:

$f(\bx) = f(\ba) + \nabla f(\ba)^T (\bx - \ba) + o (\| \bx - \ba \|)$

where $$\ba \in S$$ and $$o(\cdot) : \RR_+ \to \RR$$ is a one dimensional function satisfying $$\frac{o(t)}{t} \to 0$$ as $$t \to 0^+$$.

## 5.1.6. Chain Rule#

Theorem 5.2 (Chain rule)

Suppose $$f : \RR^n \to \RR^m$$ is differentiable at $$\bx \in \interior \dom f$$ and $$g : \RR^m \to \RR^p$$ is differentiable at $$f(\bx) \in \interior \dom g$$. Define the composition $$h: \RR^n \to \RR^p$$ as:

$h(\bx) = g(f(\bx)).$

Then, $$h$$ is differentiable at $$\bx$$ with the derivative given by:

$Dh(\bx) = Dg(f(\bx)) Df(\bx).$

Notice how the derivative lines up as a simple matrix multiplication.

Corollary 5.1 (Chain rule for real valued functions)

Suppose $$f : \RR^n \to \RR$$ is differentiable at $$\bx \in \interior \dom f$$ and $$g : \RR \to \RR$$ is differentiable at $$f(\bx) \in \interior \dom g$$. Define the composition $$h: \RR^n \to \RR$$ as:

$h(\bx) = g(f(\bx)).$

Then, $$h$$ is differentiable at $$\bx$$ with the gradient given by:

$\nabla h(\bx) = g'(f(\bx)) \nabla f(\bx).$

Example 5.9 (Gradient of log-sum-exp)

Let $$h : \RR^n \to \RR$$ be given by:

$h(\bx) = \ln \left ( \sum_{i=1}^n \exp x_i \right )$

with $$\dom h = \RR^n$$.

Let $$g(y) = \ln y$$ and

$f(\bx) = \sum_{i=1}^n \exp x_i$

Then, we can see that $$h(\bx) = g (f (\bx))$$. Now $$g'(y) = \frac{1}{y}$$ and

$\begin{split} \nabla f(\bx) = \begin{bmatrix} \exp x_1 \\ \vdots \\ \exp x_n \end{bmatrix}. \end{split}$

Thus,

$\begin{split} \nabla h(\bx) = \frac{1}{\sum_{i=1}^n \exp x_i} \begin{bmatrix} \exp x_1 \\ \vdots \\ \exp x_n \end{bmatrix}. \end{split}$

Now, if we define

$\begin{split} \bz = \begin{bmatrix} \exp x_1 \\ \vdots \\ \exp x_n \end{bmatrix} \end{split}$

then, we see that:

$\bone^T \bz = \sum_{i=1}^n \exp x_i.$

Using this notation:

$\nabla h(\bx) = \frac{1}{\bone^T \bz} \bz.$
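As a numerical sanity check (a sketch assuming NumPy; the sample point is our choice), the formula $$\nabla h(\bx) = \bz / (\bone^T \bz)$$ agrees with finite differences, and its entries sum to one since $$\bone^T \bz / \bone^T \bz = 1$$:

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])
z = np.exp(x)                           # z_i = exp(x_i)
grad_formula = z / z.sum()              # (1 / 1^T z) z

# central-difference check of each partial derivative
h = 1e-6
lse = lambda v: np.log(np.exp(v).sum())
grad_fd = np.array([
    (lse(x + h * e) - lse(x - h * e)) / (2 * h) for e in np.eye(3)
])
```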

Example 5.10 (Gradient of $$\ell_2$$ norm at nonzero vectors)

Let $$h : \RR^n \to \RR$$ be given by:

$h(\bx) = \| \bx \|_2 = \sqrt{ \langle \bx, \bx \rangle}$

with $$\dom h = \RR^n$$.

Let $$g : \RR \to \RR$$ with $$\dom g = \RR_+$$ be given by $$g(y) = \sqrt{y}$$.

Let $$f : \RR^n \to \RR$$ with $$\dom f = \RR^n$$ be given by

$f(\bx) = \langle \bx, \bx \rangle = \sum_{i=1}^n x_i^2 = \| \bx \|_2^2.$

Then, we can see that $$h(\bx) = g (f (\bx))$$ or $$h = g \circ f$$.

$$g$$ is differentiable on the open set $$\RR_{++}$$. For every $$y \in \RR_{++}$$,

$g'(y) = \frac{1}{2 \sqrt{y}}$

and (from Example 5.7)

$\nabla f(\bx) = 2 \bx.$

Thus, for every $$\bx \neq \bzero$$, following Corollary 5.1,

$\nabla h(\bx) = g'(f(\bx)) \nabla f(\bx) = \frac{1}{2 \sqrt{\| \bx \|_2^2}} 2 \bx = \frac{\bx}{\| \bx \|_2}.$

The gradient of $$\ell_2$$ norm at $$\bzero$$ doesn’t exist. However, subgradients can be computed. See Example 9.71 and Example 9.72.
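At a nonzero point the formula $$\nabla h(\bx) = \bx / \|\bx\|_2$$ is easy to verify numerically. A minimal sketch (assuming NumPy; the point $$(3, -4)$$ is chosen so the answer is exactly $$(0.6, -0.8)$$):

```python
import numpy as np

x = np.array([3.0, -4.0])               # a nonzero vector, ||x||_2 = 5
grad_formula = x / np.linalg.norm(x)    # x / ||x||_2

# central-difference check of each partial derivative
h = 1e-6
norm = lambda v: np.linalg.norm(v)
grad_fd = np.array([
    (norm(x + h * e) - norm(x - h * e)) / (2 * h) for e in np.eye(2)
])
```

Note that the gradient is always a unit vector: the $$\ell_2$$ norm grows at unit rate in its own direction.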

Corollary 5.2 (Chain rule for composition with affine function)

Suppose $$f : \RR^n \to \RR^m$$ is differentiable. Let $$\bA \in \RR^{n \times p}$$ and $$\bb \in \RR^n$$. Define $$g : \RR^p \to \RR^m$$ as:

$g(\bx) = f(\bA \bx + \bb)$

with $$\dom g = \{ \bx \ST \bA \bx + \bb \in \dom f \}$$.

The derivative of $$g$$ at $$\bx \in \interior \dom g$$ is given by:

$Dg(\bx) = Df(\bA \bx + \bb) \bA.$

If $$f$$ is real valued (i.e. $$m=1$$), then the gradient of a composition of a function with an affine function is given by:

$\nabla g(\bx) = \bA^T \nabla f(\bA \bx + \bb).$

Example 5.11 (Chain rule for restriction on a line)

Let $$f : \RR^n \to \RR$$ be a real valued differentiable function. Consider the restriction of $$f$$ on a line in its domain

$g(t) = f(\bx + t \bv)$

where $$\bx \in \dom f$$ and $$\bv \in \RR^n$$ with the domain

$\dom g = \{t \ST \bx + t \bv \in \dom f\}.$

If we define $$h : \RR \to \RR^n$$ as:

$h(t) = \bx + t \bv;$

we can see that:

$g(t) = f(h(t))$

By chain rule:

$g'(t) = Df(h(t)) Dh(t) = \nabla f(h(t))^T \bv = \nabla f(\bx + t \bv)^T \bv.$

In particular, if $$\bv = \by - \bx$$, with $$\by \in \dom f$$,

$g'(t) = \nabla f(\bx + t (\by -\bx) )^T (\by - \bx) = \nabla f(t \by + (1-t) \bx)^T (\by - \bx).$
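The restriction-to-a-line formula can be exercised numerically. A sketch (assuming NumPy; it reuses $$f(\bx) = \|\bx\|_2^2$$ with $$\nabla f(\bx) = 2\bx$$ from Example 5.7, and the point, direction, and $$t_0$$ are our choices):

```python
import numpy as np

f = lambda v: v @ v                     # f(x) = ||x||_2^2
grad_f = lambda v: 2 * v                # its gradient (Example 5.7)

x = np.array([1.0, 2.0])
v = np.array([0.5, -1.0])
g = lambda t: f(x + t * v)              # restriction of f on a line

t0 = 0.3
slope_formula = grad_f(x + t0 * v) @ v  # chain rule: g'(t) = grad f(x+tv)^T v

h = 1e-6
slope_fd = (g(t0 + h) - g(t0 - h)) / (2 * h)
```

Both values agree (here $$g'(0.3) = -2.25$$): the derivative of the scalar function $$g$$ is the directional derivative of $$f$$ along $$\bv$$.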

## 5.1.7. Hessian#

In this section, we review the second derivative of a real valued function $$f: \RR^n \to \RR$$.

Definition 5.8 (Hessian)

The second derivative or Hessian matrix of $$f$$ at $$\bx \in \interior \dom f$$, denoted by $$\nabla^2 f$$, is given by:

$\nabla^2 f(\bx)_{i j} = \frac{\partial^2 f(\bx)}{\partial x_i \partial x_j}, \quad i=1,\dots,n, \quad j=1,\dots,n$

provided $$f$$ is twice differentiable at $$\bx$$.

Example 5.12 (Hessian of linear functional)

Let $$f : \RR^n \to \RR$$ be a linear functional given by:

$f(\bx) = \langle \bx, \ba \rangle = \ba^T \bx.$

We can expand it as:

$f(\bx) = \sum_{j=1}^n a_j x_j.$

Computing partial derivative with respect to $$x_i$$, we get:

$\frac{\partial f(\bx)}{\partial x_i} = \frac{\partial }{\partial x_i}\left (\sum_{j=1}^n a_j x_j \right ) = a_i.$

If we further compute the partial derivative w.r.t. $$x_j$$, we get:

$\frac{\partial^2 f(\bx)}{\partial x_i \partial x_j} = \frac{\partial a_i}{\partial x_j} = 0.$

Thus, the Hessian is the $$n \times n$$ zero matrix:

$\nabla^2 f(\bx) = \ZERO_n.$

Theorem 5.3

The Hessian is the derivative of the gradient mapping:

$D \nabla f(\bx) = \nabla^2 f(\bx).$

Example 5.13 (Hessian of quadratic form)

Let $$f : \RR^n \to \RR$$ be a quadratic form given by:

$f(\bx) = \bx^T \bA \bx$

where $$\bA \in \RR^{n \times n}$$.

Recall from Example 5.6 that:

$\nabla f(\bx) = (\bA^T + \bA) \bx.$

Also recall from Example 5.2 that

$D (\bC \bx) = \bC$

for all $$\bC \in \RR^{m \times n}$$.

Thus, using Theorem 5.3

$\nabla^2 f(\bx) = D \nabla f(\bx) = D ((\bA^T + \bA) \bx) = \bA^T + \bA.$

If $$\bA$$ is symmetric then

$\nabla^2 f(\bx) = 2 \bA.$
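Theorem 5.3 gives a practical recipe: differentiate the gradient mapping numerically and compare with $$\bA^T + \bA$$. A minimal sketch (assuming NumPy; the matrix is arbitrary test data):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))
grad = lambda v: (A.T + A) @ v          # gradient of x^T A x (Example 5.6)

# Hessian = Jacobian of the gradient mapping (Theorem 5.3),
# estimated column by column with central differences
x = rng.standard_normal(3)
h = 1e-6
H = np.column_stack([
    (grad(x + h * e) - grad(x - h * e)) / (2 * h) for e in np.eye(3)
])
```

`H` matches `A.T + A` at every point `x`, since the gradient mapping is linear.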

Example 5.14 (Hessian of log-sum-exp)

Let $$f : \RR^n \to \RR$$ be given by:

$f(\bx) = \ln \left ( \sum_{i=1}^n e^{x_i} \right )$

with $$\dom f = \RR^n$$.

Define

$\begin{split} \bz = \begin{bmatrix} e^{x_1} \\ \vdots \\ e^{x_n} \end{bmatrix} \end{split}$

then, we see that:

$\bone^T \bz = \sum_{i=1}^n e^{x_i}.$

Using this notation:

$f(\bx) = \ln \left (\bone^T \bz \right).$

We have:

$\frac{\partial z_i}{\partial x_i} = \frac{\partial}{\partial x_i} e^{x_i} = e^{x_i} = z_i.$

$$\frac{\partial z_j}{\partial x_i} = 0$$ for $$i \neq j$$. Now,

$\begin{split} \frac{\partial }{\partial x_i} f(\bx) &= \frac{\partial}{\partial z_i} \ln \left (\bone^T \bz \right) \cdot \frac{\partial z_i}{\partial x_i} \\ &= \frac{1}{\bone^T \bz}\frac{\partial}{\partial z_i} \bone^T \bz \cdot z_i \\ &= \frac{1}{\bone^T \bz} z_i. \end{split}$

Proceeding to compute the second derivatives:

$\begin{split} \frac{\partial^2 }{\partial x_i \partial x_j} f(\bx) &= \frac{\partial }{\partial x_i} \left (\frac{1}{\bone^T \bz} z_j \right )\\ &= \frac{\partial }{\partial z_i} \left (\frac{1}{\bone^T \bz} z_j \right ) \cdot \frac{\partial z_i}{\partial x_i} \\ &= \frac{\bone^T \bz \delta_{i j} - z_j}{(\bone^T \bz)^2} \cdot z_i\\ &= \frac{\bone^T \bz \delta_{i j} z_i - z_i z_j}{(\bone^T \bz)^2}\\ &=\frac{\delta_{i j} z_i}{\bone^T \bz} - \frac{z_i z_j}{(\bone^T \bz)^2}. \end{split}$

Now, note that $$(\bz \bz^T)_{i j} = z_i z_j$$. And, $$(\Diag (\bz))_{i j} = \delta_{ i j} z_i$$.

Thus,

$\nabla^2 f(\bx) = \frac{1}{\bone^T \bz} \Diag (\bz) - \frac{1}{(\bone^T \bz)^2} \bz \bz^T.$

Alternatively,

$\nabla^2 f(\bx) = \frac{1}{(\bone^T \bz)^2} \left ((\bone^T \bz) \Diag (\bz) - \bz \bz^T \right ).$
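The Hessian formula just derived can be cross-checked against second-order central differences. A sketch (assuming NumPy; the sample point and step size are our choices):

```python
import numpy as np

x = np.array([0.1, -0.7, 1.3])
z = np.exp(x)
s = z.sum()                              # 1^T z
H_formula = np.diag(z) / s - np.outer(z, z) / s**2

# second-order central differences of f(x) = ln(1^T exp(x))
f = lambda v: np.log(np.exp(v).sum())
h = 1e-4
n = x.size
H_fd = np.zeros((n, n))
for i, ei in enumerate(np.eye(n)):
    for j, ej in enumerate(np.eye(n)):
        H_fd[i, j] = (f(x + h*ei + h*ej) - f(x + h*ei - h*ej)
                      - f(x - h*ei + h*ej) + f(x - h*ei - h*ej)) / (4 * h**2)
```

Note that `H_formula` is symmetric by construction, consistent with Theorem 5.4 below.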

Example 5.15 (Derivatives for least squares cost function)

Let $$\bA \in \RR^{m \times n}$$. Let $$\bb \in \RR^m$$. Consider the least squares cost function:

$f(\bx) = \frac{1}{2} \| \bA \bx - \bb \|_2^2.$

Expanding it, we get:

$f(\bx) = \frac{1}{2} \bx^T \bA^T \bA \bx - \bb^T \bA \bx + \frac{1}{2} \bb^T \bb.$

Note that $$\bA^T \bA$$ is symmetric. Using previous results, we obtain the gradient:

$\nabla f(\bx) = \bA^T \bA \bx - \bA^T \bb.$

And the Hessian is:

$\nabla^2 f(\bx) = D \nabla f (\bx) = \bA^T \bA.$
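The least squares gradient can be verified against finite differences. A minimal sketch (assuming NumPy; the problem data are random test values):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
x = rng.standard_normal(3)

f = lambda v: 0.5 * np.linalg.norm(A @ v - b) ** 2
grad_formula = A.T @ A @ x - A.T @ b    # = A^T (A x - b), as derived above

h = 1e-6
grad_fd = np.array([
    (f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(3)
])
```

Setting `grad_formula` to zero recovers the familiar normal equations $$\bA^T \bA \bx = \bA^T \bb$$.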

Example 5.16 (Derivatives for quadratic over linear function)

Let $$f : \RR \times \RR \to \RR$$ be given by:

$f(x, y) = \frac{x^2}{y}$

with $$\dom f = \{ (x, y) \ST y > 0\}$$.

The gradient is obtained by computing the partial derivatives w.r.t. $$x$$ and $$y$$:

$\begin{split} \nabla f(x,y) = \begin{bmatrix} \frac{2x}{y}\\ \frac{-x^2}{y^2} \end{bmatrix}. \end{split}$

The Hessian is obtained by computing second order partial derivatives:

$\begin{split} \nabla^2 f(x, y) = \begin{bmatrix} \frac{2}{y} & \frac{-2 x}{y^2}\\ \frac{-2 x}{y^2} & \frac{2 x^2}{y^3} \end{bmatrix} = \frac{2}{y^3} \begin{bmatrix} y^2 & - x y\\ - x y & x^2 \end{bmatrix}. \end{split}$
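The factored form of this Hessian is $$\frac{2}{y^3}$$ times the rank-one matrix $$(y, -x)(y, -x)^T$$, which is why it is positive semidefinite for $$y > 0$$. A small sketch (assuming NumPy; the sample point is our choice) checks one entry against a second difference:

```python
import numpy as np

x, y = 1.5, 0.8                                     # a point with y > 0
H = (2 / y**3) * np.array([[y**2, -x*y],
                           [-x*y,  x**2]])          # the factored Hessian

# cross-check the (1,1) entry, d^2 f / dx^2 = 2/y, by a second difference
f = lambda u, v: u**2 / v
h = 1e-5
d2_dx2 = (f(x + h, y) - 2*f(x, y) + f(x - h, y)) / h**2
```

Here $$2/y = 2.5$$, and the second difference reproduces it to finite-difference accuracy.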

## 5.1.8. Twice Continuous Differentiability#

Definition 5.9 (Twice continuously differentiable real valued function)

Let $$f: \RR^n \to \RR$$ be a real valued function with $$S = \dom f$$. Let $$U \subseteq S$$ be an open set. If all the second order partial derivatives of $$f$$ exist and are continuous at every $$\bx \in U$$, then $$f$$ is called twice continuously differentiable over $$U$$.

If $$f$$ is twice continuously differentiable over an open set $$U \subseteq S$$, then it is twice continuously differentiable over every subset $$C \subseteq U$$.

If $$S$$ is open itself and $$f$$ is twice continuously differentiable over $$S$$, then $$f$$ is called twice continuously differentiable.

Theorem 5.4 (Symmetry of Hessian)

If $$f : \RR^n \to \RR$$ with $$S = \dom f$$ is twice continuously differentiable over a set $$U \subseteq S$$, then its Hessian matrix $$\nabla^2 f(\bx)$$ is symmetric at every $$\bx \in U$$.

## 5.1.9. Second Order Approximation#

Theorem 5.5 (Linear approximation theorem)

Let $$f : \RR^n \to \RR$$ with $$S = \dom f$$ be twice continuously differentiable over an open set $$U \subseteq S$$. Let $$\bx \in U$$. Let $$r > 0$$ be such that $$B(\bx, r) \subseteq U$$. Then, for any $$\by \in B(\bx, r)$$, there exists $$\bz \in [\bx, \by]$$ such that

$f(\by) - f(\bx) = \nabla f(\bx)^T (\by - \bx) + \frac{1}{2} (\by - \bx)^T \nabla^2 f(\bz) (\by - \bx).$

Theorem 5.6 (Quadratic approximation theorem)

Let $$f : \RR^n \to \RR$$ with $$S = \dom f$$ be twice continuously differentiable over an open set $$U \subseteq S$$. Let $$\bx \in U$$. Let $$r > 0$$ be such that $$B(\bx, r) \subseteq U$$. Then, for any $$\by \in B(\bx, r)$$,

$f(\by) = f(\bx) + \nabla f(\bx)^T (\by - \bx) + \frac{1}{2} (\by - \bx)^T \nabla^2 f(\bx) (\by - \bx) + o(\| \by - \bx \|^2).$

Definition 5.10 (Second order approximation)

The second order approximation of $$f$$ at or near $$\bx=\ba$$ is the quadratic function defined by:

$\hat{f} (\bx) = f(\ba) + \nabla f(\ba)^T (\bx - \ba) + \frac{1}{2} (\bx - \ba)^T \nabla^2 f(\ba) (\bx - \ba).$
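The quadratic approximation theorem says the error of this approximation is $$o(\|\bx - \ba\|^2)$$. A numerical sketch (assuming NumPy; it reuses the log-sum-exp gradient and Hessian from Examples 5.9 and 5.14, with a sample point and direction of our choosing):

```python
import numpy as np

f = lambda v: np.log(np.exp(v).sum())    # log-sum-exp
a = np.array([0.2, -0.4])
z = np.exp(a); s = z.sum()
g = z / s                                # gradient at a (Example 5.9)
H = np.diag(z) / s - np.outer(z, z) / s**2   # Hessian at a (Example 5.14)

def f_hat(v):
    """Second order approximation of f at a."""
    d = v - a
    return f(a) + g @ d + 0.5 * d @ H @ d

errs = []
for t in [1e-1, 1e-2]:
    d = t * np.array([1.0, 0.0])         # shrink a fixed direction
    errs.append(abs(f(a + d) - f_hat(a + d)) / t**2)
# err / t^2 still shrinks with t: the error is o(||x - a||^2)
```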

## 5.1.10. Smoothness#

### 5.1.10.1. Real Functions#

Definition 5.11 (Class of continuous functions)

The class of continuous real functions, denoted by $$C$$, is the set of functions of type $$f: \RR \to \RR$$ which are continuous over their domain $$\dom f$$.

Definition 5.12 (Differentiability class $$C^k$$)

Let $$f: \RR \to \RR$$ be a real function with $$S = \dom f$$.

Then, we say that $$f$$ belongs to the differentiability class $$C^k$$ on $$S$$ if and only if

$\frac{d^k}{d x^k} f(x) \in C.$

In other words, the $$k$$-th derivative of $$f$$ exists and is continuous.

1. $$C^0$$ is the class of continuous real functions.

2. $$C^1$$ is the class of continuously differentiable functions.

3. $$C^{\infty}$$ is the class of smooth functions, which are infinitely differentiable.

### 5.1.10.2. Real Valued Functions on Euclidean Space#

Definition 5.13 (Differentiability class $$C^k$$)

A function $$f: \RR^n \to \RR$$ with $$S = \dom f$$ where $$S$$ is an open subset of $$\RR^n$$ is said to be of class $$C^k$$ on $$S$$, for a positive integer $$k$$, if all the partial derivatives of $$f$$

$\frac{\partial^m f}{\partial x_1^{m_1} \partial x_2^{m_2} \dots \partial x_n^{m_n}} (\bx)$

exist and are continuous for every $$m_1,m_2,\dots,m_n \geq 0$$ and $$m = m_1 + m_2 + \dots + m_n \leq k$$.

1. If $$f$$ is continuous, it is said to belong to $$C$$ or $$C^0$$.

2. If $$f$$ is continuously differentiable, it is said to belong to $$C^1$$.

3. If $$f$$ is twice continuously differentiable, it is said to belong to $$C^2$$.

4. If $$f$$ is infinitely differentiable, it is said to belong to $$C^{\infty}$$.