5.1. Differentiation

We consider functions from $\mathbb{R}^n$ to $\mathbb{R}^m$.

5.1.1. Differentiability and Jacobian

Definition 5.1 (Differentiability at a point)

Let $f : \mathbb{R}^n \to \mathbb{R}^m$. Let $x \in \operatorname{int} \operatorname{dom} f$. The function $f$ is differentiable at $x$ if there exists a matrix $Df(x) \in \mathbb{R}^{m \times n}$ that satisfies

$$\lim_{\substack{z \in \operatorname{dom} f \\ z \neq x,\; z \to x}} \frac{\| f(z) - f(x) - Df(x)(z - x) \|_2}{\| z - x \|_2} = 0. \tag{5.1}$$

Such a matrix $Df(x)$ is called the derivative (or Jacobian) of $f$ at $x$.

There can be at most one $Df(x)$ satisfying the limit in (5.1).

Observation 5.1

If we write $z = x + h$, then an alternative form of (5.1) is given by:

$$\lim_{\substack{x + h \in \operatorname{dom} f \\ h \neq 0,\; h \to 0}} \frac{\| f(x + h) - f(x) - Df(x) h \|_2}{\| h \|_2} = 0.$$

The matrix $Df(x)$ can be obtained from the partial derivatives:

$$Df(x)_{ij} = \frac{\partial f_i(x)}{\partial x_j}, \quad i = 1, \dots, m, \; j = 1, \dots, n.$$

$$Df(x) = \begin{bmatrix}
\frac{\partial f_1(x)}{\partial x_1} & \frac{\partial f_1(x)}{\partial x_2} & \dots & \frac{\partial f_1(x)}{\partial x_n} \\
\frac{\partial f_2(x)}{\partial x_1} & \frac{\partial f_2(x)}{\partial x_2} & \dots & \frac{\partial f_2(x)}{\partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f_m(x)}{\partial x_1} & \frac{\partial f_m(x)}{\partial x_2} & \dots & \frac{\partial f_m(x)}{\partial x_n}
\end{bmatrix}.$$

  1. The Jacobian $Df(x)$ is an $m \times n$ real matrix.

  2. The partial derivatives of the $i$-th component of $f$ (i.e., $f_i$) line up on the $i$-th row.

  3. The partial derivatives for one coordinate $x_j$ line up on the $j$-th column.

  4. If $f$ is real valued (i.e., $m = 1$), then the Jacobian $Df(x)$ is a row vector.
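The defining limit suggests a direct numerical sanity check: each column of the Jacobian can be approximated by a forward difference quotient. A minimal sketch, in which the test function, evaluation point, and step size are illustrative choices, not part of the text:

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Approximate the m x n Jacobian Df(x) by forward differences.

    Column j is approximated by (f(x + eps*e_j) - f(x)) / eps.
    """
    x = np.asarray(x, dtype=float)
    fx = np.atleast_1d(f(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        J[:, j] = (np.atleast_1d(f(x + step)) - fx) / eps
    return J

# Example map f(x) = (x1*x2, x1^2); its Jacobian is [[x2, x1], [2*x1, 0]].
f = lambda x: np.array([x[0] * x[1], x[0] ** 2])
x0 = np.array([2.0, 3.0])
J = numerical_jacobian(f, x0)
```

A smaller step reduces truncation error but amplifies rounding error; `1e-6` is a common compromise in double precision.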

Example 5.1 (Jacobian of identity function)

Let $f : \mathbb{R}^n \to \mathbb{R}^n$ be defined as:

$$f(x) = x.$$

Then, $f_i(x) = x_i$. Hence,

$$\frac{\partial f_i(x)}{\partial x_j} = \delta_{ij},$$

where $\delta_{ij}$ is the Kronecker delta. Thus,

$$Df(x) = I_n,$$

the $n \times n$ identity matrix.

Example 5.2 (Jacobian of linear transformation)

Let $f : \mathbb{R}^n \to \mathbb{R}^m$ be defined as:

$$f(x) = A x$$

where $A = (a_{ij})$ is an $m \times n$ real matrix.

Then, $f_i(x) = \sum_{j=1}^n a_{ij} x_j$. Hence,

$$\frac{\partial f_i(x)}{\partial x_j} = a_{ij}.$$

Thus,

$$Df(x) = A.$$

Example 5.3 (Jacobian of affine transformation)

Let $f : \mathbb{R}^n \to \mathbb{R}^m$ be defined as:

$$f(x) = A x + b$$

where $A = (a_{ij}) \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$.

Then, $f_i(x) = \sum_{j=1}^n a_{ij} x_j + b_i$. Hence,

$$\frac{\partial f_i(x)}{\partial x_j} = a_{ij}.$$

Thus,

$$Df(x) = A.$$

The vector $b$ is a constant offset. It has no impact on the derivative.
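The conclusion of Examples 5.2 and 5.3 is easy to confirm numerically: for an affine map, a difference quotient recovers $A$ (up to rounding) regardless of $b$ and of the base point. A small sketch; the particular $A$, $b$, and points below are arbitrary illustrations:

```python
import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 3.0]])   # a 2x3 matrix
b = np.array([5.0, -7.0])          # constant offset; no effect on Df

f = lambda x: A @ x + b

def jacobian_fd(f, x, eps=1e-6):
    # Forward-difference approximation of Df(x), one column per coordinate.
    # For an affine map the difference quotient is exact in exact arithmetic.
    fx = f(x)
    cols = []
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        cols.append((f(x + e) - fx) / eps)
    return np.column_stack(cols)

x0 = np.array([0.5, -1.5, 2.0])
Df = jacobian_fd(f, x0)            # should match A for any x0
```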

Definition 5.2 (Differentiable function)

A function $f$ is called differentiable if its domain $\operatorname{dom} f$ is open and it is differentiable at every point of $\operatorname{dom} f$.

Definition 5.3 (First order approximation)

The affine function given by:

$$\hat{f}(x) = f(a) + Df(a)(x - a) \tag{5.2}$$

is called the first order approximation of $f$ at $x = a \in \operatorname{int} \operatorname{dom} f$.

5.1.2. Real Valued Functions

The rest of this section focuses mostly on real valued functions of type $f : \mathbb{R}^n \to \mathbb{R}$.

  1. The first order derivative of a real valued function is called the gradient.

  2. The second order derivative of a real valued function is called the Hessian.

  3. We consider first order and second order approximations of a real valued function.

5.1.3. Gradient

Definition 5.4 (Gradient)

When $f : \mathbb{R}^n \to \mathbb{R}$ is a real valued function, the derivative $Df(x)$ is a $1 \times n$ matrix. The gradient of a real valued function is defined as:

$$\nabla f(x) = Df(x)^T$$

at $x \in \operatorname{int} \operatorname{dom} f$ if $f$ is differentiable at $x$.

For real valued functions, the derivative is a row vector but the gradient is a column vector.

The components of the gradient are given by the partial derivatives:

$$\nabla f(x)_i = \frac{\partial f(x)}{\partial x_i}, \quad i = 1, \dots, n.$$

Example 5.4 (Gradient of linear functional)

Let $f : \mathbb{R}^n \to \mathbb{R}$ be a linear functional given by:

$$f(x) = \langle x, a \rangle = a^T x.$$

We can expand it as:

$$f(x) = \sum_{j=1}^n a_j x_j.$$

Computing the partial derivative with respect to $x_i$, we get:

$$\frac{\partial f(x)}{\partial x_i} = \frac{\partial}{\partial x_i} \left( \sum_{j=1}^n a_j x_j \right) = a_i.$$

Putting the partial derivatives together, we get:

$$\nabla f(x) = a.$$

Example 5.5 (Gradient of affine functional)

Let $f : \mathbb{R}^n \to \mathbb{R}$ be an affine functional given by:

$$f(x) = a^T x + b$$

where $a \in \mathbb{R}^n$ and $b \in \mathbb{R}$.

We can expand it as:

$$f(x) = \sum_{j=1}^n a_j x_j + b.$$

Computing the partial derivative with respect to $x_i$, we get:

$$\frac{\partial f(x)}{\partial x_i} = \frac{\partial}{\partial x_i} \left( \sum_{j=1}^n a_j x_j + b \right) = a_i.$$

Putting the partial derivatives together, we get:

$$\nabla f(x) = a.$$

The intercept $b$ is a constant term which doesn't affect the gradient.

Example 5.6 (Gradient of quadratic form)

Let $f : \mathbb{R}^n \to \mathbb{R}$ be a quadratic form given by:

$$f(x) = x^T A x$$

where $A \in \mathbb{R}^{n \times n}$.

We can expand it as:

$$f(x) = \sum_{i=1}^n \sum_{j=1}^n x_i a_{ij} x_j.$$

Note that the diagonal elements $a_{ii}$ give us terms of the form $a_{ii} x_i^2$. Let us split the expression into diagonal and non-diagonal terms:

$$f(x) = \sum_{i=1}^n a_{ii} x_i^2 + \sum_{\substack{i, j \\ i \neq j}} x_i a_{ij} x_j.$$

There are $n$ terms in the first sum (the diagonal entries of $A$) and $n^2 - n$ terms in the second sum (the non-diagonal entries of $A$).

Taking the partial derivative w.r.t. $x_k$, we obtain:

$$\frac{\partial f(x)}{\partial x_k} = 2 a_{kk} x_k + \sum_{\substack{i \\ i \neq k}} x_i a_{ik} + \sum_{\substack{j \\ j \neq k}} a_{kj} x_j.$$

  • The first term comes from the $a_{kk}$ term, which is quadratic in $x_k$.

  • The first sum comes from the linear terms where $j = k$ and $i = 1, \dots, n$ with $i \neq k$.

  • The second sum comes from the linear terms where $i = k$ and $j = 1, \dots, n$ with $j \neq k$.

  • Together the two sums have $2n - 2$ terms, and the term $2 a_{kk} x_k$ splits into two copies of $a_{kk} x_k$.

  • We can move one $a_{kk} x_k$ into each sum to simplify the partial derivative as:

$$\frac{\partial f(x)}{\partial x_k} = \sum_{i=1}^n x_i a_{ik} + \sum_{j=1}^n a_{kj} x_j.$$

Note that the $k$-th component of the vector $u = A x$ is $\sum_{j=1}^n a_{kj} x_j$.

Similarly, the $k$-th component of the vector $v = A^T x$ is $\sum_{i=1}^n a_{ik} x_i$.

Thus,

$$\frac{\partial f(x)}{\partial x_k} = v_k + u_k.$$

Putting together the partial derivatives, we obtain:

$$\nabla f(x) = v + u = A^T x + A x = (A^T + A) x = (A + A^T) x.$$

If $A$ is symmetric, then

$$\nabla f(x) = 2 A x.$$
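The identity $\nabla f(x) = (A + A^T) x$ can be checked against central differences, which are exact for quadratics up to rounding. A short sketch; the randomly generated (non-symmetric) $A$ and the point $x$ are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))    # generic, not necessarily symmetric
x = rng.standard_normal(4)

f = lambda v: v @ A @ v            # quadratic form x^T A x

grad = (A + A.T) @ x               # closed form from the example

# Central-difference approximation of the gradient for comparison.
eps = 1e-6
num = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                for e in np.eye(4)])
```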

Example 5.7 (Gradient of squared $\ell_2$ norm)

Let $f : \mathbb{R}^n \to \mathbb{R}$ be a quadratic form given by:

$$f(x) = \| x \|_2^2 = x^T x.$$

We can write this as

$$f(x) = x^T I x$$

where $I$ is the identity matrix.

Following Example 5.6,

$$\nabla f(x) = 2 I x = 2 x.$$

Example 5.8 (Gradient of quadratic functional)

Let $P \in \mathbb{S}^n$ be a symmetric matrix. Let $q \in \mathbb{R}^n$ and $r \in \mathbb{R}$. Consider the quadratic functional $f : \mathbb{R}^n \to \mathbb{R}$ given as:

$$f(x) = \frac{1}{2} x^T P x + q^T x + r.$$

We can compute the gradient as follows:

$$\begin{aligned}
\nabla f(x) &= \nabla \left( \frac{1}{2} x^T P x + q^T x + r \right) \\
&= \frac{1}{2} \nabla (x^T P x) + \nabla (q^T x) + \nabla r \\
&= \frac{1}{2} (P + P^T) x + q \\
&= \frac{1}{2} (P + P) x + q \\
&= P x + q.
\end{aligned}$$

  • We took advantage of the fact that the gradient operation commutes with scalar multiplication and distributes over vector addition.

  • Since $r$ is a constant, it has no contribution to the derivative.

  • We reused results from previous examples.

  • We utilized the fact that $P = P^T$ since $P$ is symmetric.

In summary:

$$\nabla f(x) = P x + q.$$

The derivative of $f$ is then obtained by taking the transpose of the gradient:

$$Df(x) = x^T P + q^T.$$
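A quick numerical check of $\nabla f(x) = P x + q$; the symmetric $P$, the vector $q$, and the offset $r$ below are randomly generated illustrations:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((3, 3))
P = M + M.T                         # symmetrize to get P in S^n
q = rng.standard_normal(3)
r = 2.5

f = lambda v: 0.5 * v @ P @ v + q @ v + r

x = rng.standard_normal(3)
grad = P @ x + q                    # closed form: grad f(x) = P x + q

# Central-difference gradient; the constant r cancels in the differences.
eps = 1e-6
num = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                for e in np.eye(3)])
```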

Definition 5.5 (Gradient mapping)

If a real valued function $f : \mathbb{R}^n \to \mathbb{R}$ is differentiable, the gradient mapping of $f$ is the function $\nabla f : \mathbb{R}^n \to \mathbb{R}^n$ with $\operatorname{dom} \nabla f = \operatorname{dom} f$, taking the value $\nabla f(x)$ at every $x \in \operatorname{dom} f$.

5.1.4. Continuous Differentiability

Definition 5.6 (Continuously differentiable real valued function)

Let $f : \mathbb{R}^n \to \mathbb{R}$ be a real valued function with $S = \operatorname{dom} f$. Let $U \subseteq S$ be an open set. If all the partial derivatives of $f$ exist and are continuous at every $x \in U$, then $f$ is called continuously differentiable over $U$.

If $f$ is continuously differentiable over an open set $U \subseteq S$, then it is continuously differentiable over every subset $C \subseteq U$.

If $S$ is open itself and $f$ is continuously differentiable over $S$, then $f$ is called continuously differentiable.

5.1.5. First Order Approximation

Definition 5.7 (First order approximation of real valued functions)

The affine function given by:

$$\hat{f}(x) = f(a) + \nabla f(a)^T (x - a) \tag{5.3}$$

is the first order approximation of a real valued function $f$ at $x = a \in \operatorname{int} \operatorname{dom} f$.

Theorem 5.1 (First order approximation accuracy)

Let $f : \mathbb{R}^n \to \mathbb{R}$ be defined on an open set $S = \operatorname{dom} f$. Assume that $f$ is continuously differentiable on $S$. Then,

$$\lim_{d \to 0} \frac{f(x + d) - f(x) - \nabla f(x)^T d}{\| d \|} = 0 \quad \forall x \in S.$$

Another way to write this result is:

$$f(x) = f(a) + \nabla f(a)^T (x - a) + o(\| x - a \|)$$

where $a \in S$ and $o(\cdot) : \mathbb{R}_+ \to \mathbb{R}$ is a one dimensional function satisfying $\frac{o(t)}{t} \to 0$ as $t \to 0^+$.

5.1.6. Chain Rule

Theorem 5.2 (Chain rule)

Suppose $f : \mathbb{R}^n \to \mathbb{R}^m$ is differentiable at $x \in \operatorname{int} \operatorname{dom} f$ and $g : \mathbb{R}^m \to \mathbb{R}^p$ is differentiable at $f(x) \in \operatorname{int} \operatorname{dom} g$. Define the composition $h : \mathbb{R}^n \to \mathbb{R}^p$ as:

$$h(x) = g(f(x)).$$

Then, $h$ is differentiable at $x$ with the derivative given by:

$$Dh(x) = Dg(f(x)) Df(x).$$

Notice how the derivative lines up as a simple matrix multiplication.
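The matrix form of the chain rule can be verified on a small example. The maps $f : \mathbb{R}^2 \to \mathbb{R}^3$ and $g : \mathbb{R}^3 \to \mathbb{R}^2$ below are hypothetical choices with hand-computed Jacobians; the product $Dg(f(x))\,Df(x)$ is compared against a finite-difference Jacobian of the composition:

```python
import numpy as np

f = lambda x: np.array([x[0] ** 2, x[0] * x[1], x[1]])
Df = lambda x: np.array([[2 * x[0], 0.0],
                         [x[1], x[0]],
                         [0.0, 1.0]])

g = lambda y: np.array([y[0] + y[1], y[1] * y[2]])
Dg = lambda y: np.array([[1.0, 1.0, 0.0],
                         [0.0, y[2], y[1]]])

h = lambda x: g(f(x))              # composition h = g o f

x0 = np.array([1.5, -0.5])
Dh_chain = Dg(f(x0)) @ Df(x0)      # chain rule: Dh(x) = Dg(f(x)) Df(x)

# Central-difference Jacobian of h for comparison.
eps = 1e-6
Dh_num = np.column_stack([(h(x0 + eps * e) - h(x0 - eps * e)) / (2 * eps)
                          for e in np.eye(2)])
```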

Corollary 5.1 (Chain rule for real valued functions)

Suppose $f : \mathbb{R}^n \to \mathbb{R}$ is differentiable at $x \in \operatorname{int} \operatorname{dom} f$ and $g : \mathbb{R} \to \mathbb{R}$ is differentiable at $f(x) \in \operatorname{int} \operatorname{dom} g$. Define the composition $h : \mathbb{R}^n \to \mathbb{R}$ as:

$$h(x) = g(f(x)).$$

Then, $h$ is differentiable at $x$ with the gradient given by:

$$\nabla h(x) = g'(f(x)) \nabla f(x).$$

Example 5.9 (Gradient of log-sum-exp)

Let $h : \mathbb{R}^n \to \mathbb{R}$ be given by:

$$h(x) = \ln \left( \sum_{i=1}^n \exp x_i \right)$$

with $\operatorname{dom} h = \mathbb{R}^n$.

Let $g(y) = \ln y$ and

$$f(x) = \sum_{i=1}^n \exp x_i.$$

Then, we can see that $h(x) = g(f(x))$. Now $g'(y) = \frac{1}{y}$ and

$$\nabla f(x) = \begin{bmatrix} \exp x_1 \\ \vdots \\ \exp x_n \end{bmatrix}.$$

Thus,

$$\nabla h(x) = \frac{1}{\sum_{i=1}^n \exp x_i} \begin{bmatrix} \exp x_1 \\ \vdots \\ \exp x_n \end{bmatrix}.$$

Now, if we define

$$z = \begin{bmatrix} \exp x_1 \\ \vdots \\ \exp x_n \end{bmatrix},$$

then we see that:

$$\mathbf{1}^T z = \sum_{i=1}^n \exp x_i.$$

Using this notation:

$$\nabla h(x) = \frac{1}{\mathbf{1}^T z} z.$$
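In this form the gradient of log-sum-exp is the softmax of $x$: its entries are positive and sum to one. A minimal sketch with an arbitrary test point, checked against central differences:

```python
import numpy as np

def grad_logsumexp(x):
    # grad h(x) = z / (1^T z) with z_i = exp(x_i): the softmax of x.
    z = np.exp(x)
    return z / z.sum()

x = np.array([0.1, -1.2, 2.0, 0.5])
g = grad_logsumexp(x)

# Central-difference check of the gradient.
h = lambda v: np.log(np.exp(v).sum())
eps = 1e-6
num = np.array([(h(x + eps * e) - h(x - eps * e)) / (2 * eps)
                for e in np.eye(x.size)])
```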

Example 5.10 (Gradient of $\ell_2$ norm at nonzero vectors)

Let $h : \mathbb{R}^n \to \mathbb{R}$ be given by:

$$h(x) = \| x \|_2 = \sqrt{\langle x, x \rangle}$$

with $\operatorname{dom} h = \mathbb{R}^n$.

Let $g : \mathbb{R} \to \mathbb{R}$ with $\operatorname{dom} g = \mathbb{R}_+$ be given by $g(y) = \sqrt{y}$.

Let $f : \mathbb{R}^n \to \mathbb{R}$ with $\operatorname{dom} f = \mathbb{R}^n$ be given by

$$f(x) = \langle x, x \rangle = \sum_{i=1}^n x_i^2 = \| x \|_2^2.$$

Then, we can see that $h(x) = g(f(x))$, or $h = g \circ f$.

$g$ is differentiable on the open set $\mathbb{R}_{++}$. For every $y \in \mathbb{R}_{++}$,

$$g'(y) = \frac{1}{2 \sqrt{y}}$$

and (from Example 5.7)

$$\nabla f(x) = 2 x.$$

Thus, for every $x \neq 0$, following Corollary 5.1,

$$\nabla h(x) = g'(f(x)) \nabla f(x) = \frac{1}{2 \sqrt{\| x \|_2^2}} 2 x = \frac{x}{\| x \|_2}.$$

The gradient of the $\ell_2$ norm at $0$ doesn't exist. However, subgradients can be computed. See Example 9.71 and Example 9.72.
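For nonzero $x$, the formula $\nabla h(x) = x / \| x \|_2$ gives a unit vector in the direction of $x$. A small check at an illustrative point:

```python
import numpy as np

x = np.array([3.0, 0.0, -4.0])        # nonzero vector with ||x||_2 = 5

grad = x / np.linalg.norm(x)          # closed form: x / ||x||_2

# Central-difference check of the gradient of the norm.
norm = lambda v: np.linalg.norm(v)
eps = 1e-6
num = np.array([(norm(x + eps * e) - norm(x - eps * e)) / (2 * eps)
                for e in np.eye(3)])
```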

Corollary 5.2 (Chain rule for composition with affine function)

Suppose $f : \mathbb{R}^n \to \mathbb{R}^m$ is differentiable. Let $A \in \mathbb{R}^{n \times p}$ and $b \in \mathbb{R}^n$. Define $g : \mathbb{R}^p \to \mathbb{R}^m$ as:

$$g(x) = f(A x + b)$$

with $\operatorname{dom} g = \{ x \mid A x + b \in \operatorname{dom} f \}$.

The derivative of $g$ at $x \in \operatorname{int} \operatorname{dom} g$ is given by:

$$Dg(x) = Df(A x + b) A.$$

If $f$ is real valued (i.e., $m = 1$), then the gradient of a composition of a function with an affine function is given by:

$$\nabla g(x) = A^T \nabla f(A x + b).$$

Example 5.11 (Chain rule for restriction on a line)

Let $f : \mathbb{R}^n \to \mathbb{R}$ be a real valued differentiable function. Consider the restriction of $f$ to a line in its domain

$$g(t) = f(x + t v)$$

where $x \in \operatorname{dom} f$ and $v \in \mathbb{R}^n$, with the domain

$$\operatorname{dom} g = \{ t \mid x + t v \in \operatorname{dom} f \}.$$

If we define $h : \mathbb{R} \to \mathbb{R}^n$ as:

$$h(t) = x + t v,$$

we can see that:

$$g(t) = f(h(t)).$$

By the chain rule:

$$g'(t) = Df(h(t)) Dh(t) = \nabla f(h(t))^T v = \nabla f(x + t v)^T v.$$

In particular, if $v = y - x$ with $y \in \operatorname{dom} f$,

$$g'(t) = \nabla f(x + t(y - x))^T (y - x) = \nabla f(t y + (1 - t) x)^T (y - x).$$
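A sketch of the line-restriction formula $g'(t) = \nabla f(x + t v)^T v$, using the hypothetical choice $f(v) = v^T v$, whose gradient $2v$ is known from Example 5.7; the point, direction, and parameter value are arbitrary:

```python
import numpy as np

# A differentiable f with a known gradient: f(v) = v^T v, grad f(v) = 2v.
f = lambda v: v @ v
grad_f = lambda v: 2 * v

x = np.array([1.0, 2.0])
v = np.array([0.5, -1.0])

g = lambda t: f(x + t * v)            # restriction of f to a line

t0 = 0.7
g_prime = grad_f(x + t0 * v) @ v      # g'(t) = grad f(x + t v)^T v

# Central-difference check of the scalar derivative.
eps = 1e-6
num = (g(t0 + eps) - g(t0 - eps)) / (2 * eps)
```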

5.1.7. Hessian

In this section, we review the second derivative of a real valued function $f : \mathbb{R}^n \to \mathbb{R}$.

Definition 5.8 (Hessian)

The second derivative or Hessian matrix of $f$ at $x \in \operatorname{int} \operatorname{dom} f$, denoted by $\nabla^2 f(x)$, is given by:

$$\nabla^2 f(x)_{ij} = \frac{\partial^2 f(x)}{\partial x_i \partial x_j}, \quad i = 1, \dots, n, \; j = 1, \dots, n,$$

provided $f$ is twice differentiable at $x$.

Example 5.12 (Hessian of linear functional)

Let $f : \mathbb{R}^n \to \mathbb{R}$ be a linear functional given by:

$$f(x) = \langle x, a \rangle = a^T x.$$

We can expand it as:

$$f(x) = \sum_{j=1}^n a_j x_j.$$

Computing the partial derivative with respect to $x_i$, we get:

$$\frac{\partial f(x)}{\partial x_i} = \frac{\partial}{\partial x_i} \left( \sum_{j=1}^n a_j x_j \right) = a_i.$$

If we further compute the partial derivative w.r.t. $x_j$, we get:

$$\frac{\partial^2 f(x)}{\partial x_j \partial x_i} = \frac{\partial a_i}{\partial x_j} = 0.$$

Thus, the Hessian is the $n \times n$ zero matrix:

$$\nabla^2 f(x) = O_n.$$

Theorem 5.3

The Hessian is the derivative of the gradient mapping:

$$D \nabla f(x) = \nabla^2 f(x).$$

Example 5.13 (Hessian of quadratic form)

Let $f : \mathbb{R}^n \to \mathbb{R}$ be a quadratic form given by:

$$f(x) = x^T A x$$

where $A \in \mathbb{R}^{n \times n}$.

Recall from Example 5.6 that:

$$\nabla f(x) = (A^T + A) x.$$

Also recall from Example 5.2 that

$$D(C x) = C$$

for all $C \in \mathbb{R}^{m \times n}$.

Thus, using Theorem 5.3,

$$\nabla^2 f(x) = D \nabla f(x) = D((A^T + A) x) = A^T + A.$$

If $A$ is symmetric, then

$$\nabla^2 f(x) = 2 A.$$
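The constant Hessian $A^T + A$ can be recovered from second-order finite differences of $f(x) = x^T A x$, which are exact for quadratics up to rounding. A sketch with an illustrative random $A$:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))

f = lambda v: v @ A @ v

H = A.T + A                           # closed-form Hessian, constant in x

# Second-order finite-difference Hessian at an arbitrary point.
x = rng.standard_normal(3)
eps = 1e-4
n = x.size
H_num = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        ei = np.zeros(n); ei[i] = eps
        ej = np.zeros(n); ej[j] = eps
        # Mixed second difference approximates the (i, j) entry.
        H_num[i, j] = (f(x + ei + ej) - f(x + ei)
                       - f(x + ej) + f(x)) / eps**2
```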

Example 5.14 (Hessian of log-sum-exp)

Let $f : \mathbb{R}^n \to \mathbb{R}$ be given by:

$$f(x) = \ln \left( \sum_{i=1}^n e^{x_i} \right)$$

with $\operatorname{dom} f = \mathbb{R}^n$.

Define

$$z = \begin{bmatrix} e^{x_1} \\ \vdots \\ e^{x_n} \end{bmatrix};$$

then we see that:

$$\mathbf{1}^T z = \sum_{i=1}^n e^{x_i}.$$

Using this notation:

$$f(x) = \ln(\mathbf{1}^T z).$$

We have:

$$\frac{\partial z_i}{\partial x_i} = \frac{\partial e^{x_i}}{\partial x_i} = e^{x_i} = z_i,$$

and $\frac{\partial z_j}{\partial x_i} = 0$ for $i \neq j$. Now,

$$\frac{\partial f(x)}{\partial x_i} = \frac{\partial \ln(\mathbf{1}^T z)}{\partial z_i} \frac{\partial z_i}{\partial x_i} = \frac{1}{\mathbf{1}^T z} z_i.$$

Proceeding to compute the second derivatives:

$$\frac{\partial^2 f(x)}{\partial x_i \partial x_j} = \frac{\partial}{\partial x_i} \left( \frac{z_j}{\mathbf{1}^T z} \right) = \frac{\partial}{\partial z_i} \left( \frac{z_j}{\mathbf{1}^T z} \right) \frac{\partial z_i}{\partial x_i} = \frac{(\mathbf{1}^T z) \delta_{ij} - z_j}{(\mathbf{1}^T z)^2} z_i = \frac{(\mathbf{1}^T z) \delta_{ij} z_i - z_i z_j}{(\mathbf{1}^T z)^2} = \frac{\delta_{ij} z_i}{\mathbf{1}^T z} - \frac{z_i z_j}{(\mathbf{1}^T z)^2}.$$

Now, note that $(z z^T)_{ij} = z_i z_j$ and $(\operatorname{diag}(z))_{ij} = \delta_{ij} z_i$.

Thus,

$$\nabla^2 f(x) = \frac{1}{\mathbf{1}^T z} \operatorname{diag}(z) - \frac{1}{(\mathbf{1}^T z)^2} z z^T.$$

Alternatively,

$$\nabla^2 f(x) = \frac{1}{(\mathbf{1}^T z)^2} \left( (\mathbf{1}^T z) \operatorname{diag}(z) - z z^T \right).$$
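A sketch of the Hessian formula, with two sanity checks that follow from it: the matrix is symmetric, and its rows sum to zero, since $\nabla^2 f(x)\,\mathbf{1} = \frac{z}{\mathbf{1}^T z} - \frac{z (\mathbf{1}^T z)}{(\mathbf{1}^T z)^2} = 0$. A positive-semidefiniteness check is added as well, reflecting the known convexity of log-sum-exp. The test point is illustrative:

```python
import numpy as np

def hessian_logsumexp(x):
    # H = diag(z)/(1^T z) - z z^T / (1^T z)^2 with z_i = exp(x_i).
    z = np.exp(x)
    s = z.sum()
    return np.diag(z) / s - np.outer(z, z) / s**2

x = np.array([0.2, -0.4, 1.1])
H = hessian_logsumexp(x)
```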

Example 5.15 (Derivatives for least squares cost function)

Let $A \in \mathbb{R}^{m \times n}$. Let $b \in \mathbb{R}^m$. Consider the least squares cost function:

$$f(x) = \frac{1}{2} \| A x - b \|_2^2.$$

Expanding it, we get:

$$f(x) = \frac{1}{2} x^T A^T A x - b^T A x + \frac{1}{2} b^T b.$$

Note that $A^T A$ is symmetric. Using previous results, we obtain the gradient:

$$\nabla f(x) = A^T A x - A^T b.$$

And the Hessian is:

$$\nabla^2 f(x) = D \nabla f(x) = A^T A.$$
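The least squares gradient and Hessian can be validated numerically on randomly generated data; the sizes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)

f = lambda v: 0.5 * np.linalg.norm(A @ v - b) ** 2

x = rng.standard_normal(3)
grad = A.T @ (A @ x - b)             # grad f(x) = A^T A x - A^T b
H = A.T @ A                          # Hessian, constant in x

# Central-difference check of the gradient.
eps = 1e-6
num = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                for e in np.eye(3)])
```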

Example 5.16 (Derivatives for quadratic over linear function)

Let $f : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ be given by:

$$f(x, y) = \frac{x^2}{y}$$

with $\operatorname{dom} f = \{ (x, y) \mid y > 0 \}$.

The gradient is obtained by computing the partial derivatives w.r.t. $x$ and $y$:

$$\nabla f(x, y) = \begin{bmatrix} \frac{2x}{y} \\ -\frac{x^2}{y^2} \end{bmatrix}.$$

The Hessian is obtained by computing the second order partial derivatives:

$$\nabla^2 f(x, y) = \begin{bmatrix} \frac{2}{y} & -\frac{2x}{y^2} \\ -\frac{2x}{y^2} & \frac{2 x^2}{y^3} \end{bmatrix} = \frac{2}{y^3} \begin{bmatrix} y^2 & -x y \\ -x y & x^2 \end{bmatrix}.$$
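A sketch evaluating the gradient and Hessian at an illustrative point of the domain ($y > 0$), with a central-difference check of the gradient. One remark worth testing: the bracketed matrix in the factored form is the outer product of $(y, -x)$ with itself, so the Hessian is positive semidefinite wherever $y > 0$:

```python
import numpy as np

f = lambda x, y: x**2 / y             # dom f = {(x, y) : y > 0}

def grad_f(x, y):
    return np.array([2 * x / y, -x**2 / y**2])

def hess_f(x, y):
    # Factored form: (2 / y^3) * outer((y, -x), (y, -x)).
    return (2 / y**3) * np.array([[y**2, -x * y],
                                  [-x * y, x**2]])

x0, y0 = 1.5, 2.0
g = grad_f(x0, y0)
H = hess_f(x0, y0)

# Central-difference check of the gradient.
eps = 1e-6
num = np.array([
    (f(x0 + eps, y0) - f(x0 - eps, y0)) / (2 * eps),
    (f(x0, y0 + eps) - f(x0, y0 - eps)) / (2 * eps),
])
```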

5.1.8. Twice Continuous Differentiability

Definition 5.9 (Twice continuously differentiable real valued function)

Let $f : \mathbb{R}^n \to \mathbb{R}$ be a real valued function with $S = \operatorname{dom} f$. Let $U \subseteq S$ be an open set. If all the second order partial derivatives of $f$ exist and are continuous at every $x \in U$, then $f$ is called twice continuously differentiable over $U$.

If $f$ is twice continuously differentiable over an open set $U \subseteq S$, then it is twice continuously differentiable over every subset $C \subseteq U$.

If $S$ is open itself and $f$ is twice continuously differentiable over $S$, then $f$ is called twice continuously differentiable.

Theorem 5.4 (Symmetry of Hessian)

If $f : \mathbb{R}^n \to \mathbb{R}$ with $S = \operatorname{dom} f$ is twice continuously differentiable over a set $U \subseteq S$, then its Hessian matrix $\nabla^2 f(x)$ is symmetric at every $x \in U$.

5.1.9. Second Order Approximation

Theorem 5.5 (Linear approximation theorem)

Let $f : \mathbb{R}^n \to \mathbb{R}$ with $S = \operatorname{dom} f$ be twice continuously differentiable over an open set $U \subseteq S$. Let $x \in U$. Let $r > 0$ be such that $B(x, r) \subseteq U$. Then, for any $y \in B(x, r)$, there exists $z \in [x, y]$ such that

$$f(y) - f(x) = \nabla f(x)^T (y - x) + \frac{1}{2} (y - x)^T \nabla^2 f(z) (y - x).$$

Theorem 5.6 (Quadratic approximation theorem)

Let $f : \mathbb{R}^n \to \mathbb{R}$ with $S = \operatorname{dom} f$ be twice continuously differentiable over an open set $U \subseteq S$. Let $x \in U$. Let $r > 0$ be such that $B(x, r) \subseteq U$. Then, for any $y \in B(x, r)$,

$$f(y) = f(x) + \nabla f(x)^T (y - x) + \frac{1}{2} (y - x)^T \nabla^2 f(x) (y - x) + o(\| y - x \|^2).$$

Definition 5.10 (Second order approximation)

The second order approximation of $f$ at or near $x = a$ is the quadratic function defined by:

$$\hat{f}(x) = f(a) + \nabla f(a)^T (x - a) + \frac{1}{2} (x - a)^T \nabla^2 f(a) (x - a).$$
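The $o(\| x - a \|^2)$ accuracy of this quadratic model can be observed numerically: the model error divided by $\| x - a \|^2$ should shrink as $x \to a$. A sketch with a hypothetical smooth function whose gradient and Hessian are computed by hand:

```python
import numpy as np

# Hypothetical smooth test function: f(x) = exp(x1) + x1 * x2^2.
f = lambda v: np.exp(v[0]) + v[0] * v[1] ** 2
grad = lambda v: np.array([np.exp(v[0]) + v[1] ** 2, 2 * v[0] * v[1]])
hess = lambda v: np.array([[np.exp(v[0]), 2 * v[1]],
                           [2 * v[1], 2 * v[0]]])

a = np.array([0.3, -0.7])

def fhat(x):
    # Second order approximation of f at a.
    d = x - a
    return f(a) + grad(a) @ d + 0.5 * d @ hess(a) @ d

# The scaled error |f(x) - fhat(x)| / t^2 should decay as t -> 0.
errs = []
for t in [1e-1, 1e-2, 1e-3]:
    x = a + t * np.array([1.0, 1.0])
    errs.append(abs(f(x) - fhat(x)) / t**2)
```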

5.1.10. Smoothness

5.1.10.1. Real Functions

Definition 5.11 (Class of continuous functions)

The class of continuous real functions, denoted by $C$, is the set of functions of type $f : \mathbb{R} \to \mathbb{R}$ which are continuous over their domain $\operatorname{dom} f$.

Definition 5.12 (Differentiability class Ck)

Let $f : \mathbb{R} \to \mathbb{R}$ be a real function with $S = \operatorname{dom} f$.

Then, we say that $f$ belongs to the differentiability class $C^k$ on $S$ if and only if

$$\frac{d^k}{d x^k} f(x) \in C.$$

In other words, the $k$-th derivative of $f$ exists and is continuous.

  1. $C^0$ consists of the class of continuous real functions.

  2. $C^1$ consists of the class of continuously differentiable functions.

  3. $C^\infty$ consists of the class of smooth functions, which are infinitely differentiable.

5.1.10.2. Real Valued Functions on Euclidean Space

Definition 5.13 (Differentiability class Ck)

A function $f : \mathbb{R}^n \to \mathbb{R}$ with $S = \operatorname{dom} f$, where $S$ is an open subset of $\mathbb{R}^n$, is said to be of class $C^k$ on $S$, for a positive integer $k$, if all the partial derivatives of $f$

$$\frac{\partial^m f}{\partial x_1^{m_1} \partial x_2^{m_2} \cdots \partial x_n^{m_n}}(x)$$

exist and are continuous for every $m_1, m_2, \dots, m_n \geq 0$ with $m = m_1 + m_2 + \dots + m_n \leq k$.

  1. If $f$ is continuous, it is said to belong to $C$ or $C^0$.

  2. If $f$ is continuously differentiable, it is said to belong to $C^1$.

  3. If $f$ is twice continuously differentiable, it is said to belong to $C^2$.

  4. If $f$ is infinitely differentiable, it is said to belong to $C^\infty$.