Lecture 9

Thursday, October 3rd, 2019

Automatic Differentiation: The Motivation

In [1]:
# Load libraries that we'll need
import matplotlib.pyplot as plt
import numpy as np
%matplotlib notebook
# or inline

Introduction and Motivation

Differentiation is one of the most important operations in science. Finding extrema of functions and determining zeros of functions are central to optimization. Numerically solving differential equations forms a cornerstone of modern science and engineering and is intimately linked with predictive science.

A task that arises very frequently in science is finding the zeros of a function $f\left(x\right)$. The input to the function is an $m$-dimensional vector and the function returns an $n$-dimensional vector. We denote this mathematically as $f\left(x\right): \mathbb{R}^{m} \mapsto \mathbb{R}^{n}$. This expression is read: the function $f\left(x\right)$ maps $\mathbb{R}^{m}$ to $\mathbb{R}^{n}$.

Example 1

The system of nonlinear equations \begin{align} x_{1}x_{2}^{3} + \ln\left(x_{3}^{2}\right) = \sin\left(x_{1}x_{2}x_{3}\right) \\ x_{1} + x_{2} + \tan\left(x_{3}\right) = \frac{1}{x_{1}x_{2}x_{3}} \end{align} has \begin{align} x = \begin{bmatrix} x_{1} \\ x_{2} \\ x_{3} \end{bmatrix} \end{align} and we say $x\in\mathbb{R}^{3}$. Thus, following the notation above, we say $m = 3$. The function of interest is \begin{align} \displaystyle f\left(x\right) = \begin{bmatrix} \displaystyle x_{1}x_{2}^{3} + \ln\left(x_{3}^{2}\right) - \sin\left(x_{1}x_{2}x_{3}\right) \\ \displaystyle x_{1} + x_{2} + \tan\left(x_{3}\right) - \dfrac{1}{x_{1}x_{2}x_{3}} \end{bmatrix} . \end{align} Thus $f\left(x\right)$ maps $\mathbb{R}^{3}$ to $\mathbb{R}^{2}$ and we write $f\left(x\right): \mathbb{R}^{3} \mapsto \mathbb{R}^{2}$.
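As a purely illustrative sketch (using the numpy import from the first cell; the name f_example1 and the evaluation point are arbitrary choices and are not used later in the lecture), this function could be coded directly:

def f_example1(x):
    # x is a length-3 array [x1, x2, x3]; the output is a length-2 array
    x1, x2, x3 = x
    return np.array([x1 * x2**3 + np.log(x3**2) - np.sin(x1 * x2 * x3),
                     x1 + x2 + np.tan(x3) - 1.0 / (x1 * x2 * x3)])

f_example1(np.array([1.0, 2.0, 3.0]))  # Arbitrary point in R^3; the result is a vector in R^2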

Newton's Method

We may have cause to find an $x$ that renders $f\left(x\right)=0$. This is not so difficult for a linear system, but for a nonlinear system it can be a major challenge. Newton's method is an algorithm with excellent convergence properties that allows us to find the roots of a nonlinear function.

Derivation of Newton's Method

We want to find $x\in\mathbb{R}^{m}$ such that $f\left(x\right) = 0$ for $f\left(x\right) \in \mathbb{R}^{n}$. Here are the basic steps:

Step 1

Start with an initial guess $x_{k}$. It is very likely that $f\left(x_{k}\right) \neq 0$.

Step 2

Look at a point just a little beyond $x_{k}$. That is $x_{k+1} = x_{k} + \Delta x_{k}$ where $\Delta x_{k} = x_{k+1} - x_{k}$.

Step 3

Since $\Delta x_{k}$ is "small", and we are still in the local neighborhood of $x_{k}$, we ought to expand $f\left(x\right)$ about $x_{k}$ in a Taylor series. $$f\left(x_{k} + \Delta x_{k}\right) = f\left(x_{k}\right) + \displaystyle\left.\dfrac{\partial f}{\partial x}\right|_{x = x_{k}} \Delta x_{k} + \cdots.$$

Step 4

Working with higher-order terms can be messy, so we'll just keep the linear terms and write $$f\left(x_{k} + \Delta x_{k}\right) \approx f\left(x_{k}\right) + \displaystyle\left.\dfrac{\partial f}{\partial x}\right|_{x = x_{k}} \Delta x_{k}.$$

Step 5

We want $f\left(x_{k+1}\right) = 0$. So we write $$f\left(x_{k}\right) + \displaystyle\left.\dfrac{\partial f}{\partial x}\right|_{x = x_{k}} \Delta x_{k} = 0.$$ It is understood that this relationship is approximate and that we're looking for an approximate correction $\Delta x_{k}$ that will give $f\left(x_{k+1}\right) = 0$.

Step 6

We write a linear system at iteration $k$ for $\Delta x_{k}$. $$\displaystyle\left.\dfrac{\partial f}{\partial x}\right|_{x = x_{k}} \Delta x_{k} = -f\left(x_{k}\right).$$ In linear algebra terms we write $$J \Delta x_{k} = -f\left(x_{k}\right)$$ where $$J = \displaystyle\left.\dfrac{\partial f}{\partial x}\right|_{x = x_{k}}$$ is called the Jacobian matrix evaluated at $x_{k}$.

Step 7

The new guess for the root is $x_{k+1} = x_{k} + \Delta x_{k}$. This process is repeated until $f\left(x_{k+1}\right) \approx 0$ to within some acceptable tolerance.

Note: There are many ways of implementing Newton's method with varying sophistication. We won't cover those ideas at the moment.

At the heart of Newton's method lies the Jacobian matrix, which is a matrix of partial derivatives. Accurate calculation of the Jacobian is key to realizing the good convergence properties of Newton's method.
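To see how Steps 1 through 7 fit together in the multivariate case, here is a minimal sketch of the iteration. It assumes the caller supplies callables f and J for $f\left(x\right)$ and its Jacobian, and that $m = n$ so the linear system in Step 6 is square; the function name, tolerance, and iteration cap are illustrative choices rather than a prescribed interface.

def newton_system(f, J, x0, tol=1.0e-8, max_it=100):
    # Newton's method for f(x) = 0 with x in R^m and f(x) in R^m
    xk = np.array(x0, dtype=float)
    for k in range(max_it):
        delta_xk = np.linalg.solve(J(xk), -f(xk))  # Solve J(x_k) Delta x_k = -f(x_k)
        xk = xk + delta_xk                         # Update the guess: x_{k+1} = x_k + Delta x_k
        if np.linalg.norm(delta_xk) <= tol:        # Stop when the correction is tiny
            return xk
    raise RuntimeError("No root found after {} iterations".format(max_it))

The one-dimensional example below follows exactly the same pattern, with the linear solve reducing to a division by $df/dx$.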

Example 2

We want to find where the functions $y_{1} = x$ and $y_{2} = \exp\left(-2\sin^{2}\left(4x\right)\right)$ intersect. That is, for what values of $x$ is $y_{1} = y_{2}$? This question is equivalent to the problem of finding the zeros of $$f\left(x\right) = x - \exp\left(-2\sin^{2}\left(4x\right)\right).$$

Before doing anything, it's a good idea to visualize what we're up against.

In [2]:
# Define the domain and evaluate y = exp(-2 sin^2(4x))
x = np.linspace(0.0, 2.0*np.pi, 1000)
y = np.exp(-2.0 * np.sin(4.0*x)*np.sin(4.0*x))
In [3]:
# Plot the two functions to see where they intersect
fig, ax = plt.subplots(1,1, figsize=(10,6))
ax.plot(x, y, lw=3, label=r'$y_{2} = \exp\left(-2\sin^{2}\left(4x\right)\right)$')
ax.plot(x, x, ls='--', lw=3, label=r'$y_{1} = x$')

ax.set_xlim(0, 2.0)
ax.set_ylim(0, 2.0)
ax.set_xlabel(r'$x$', fontsize=24)
ax.set_ylabel(r'$y$', fontsize=24)
ax.set_title(r'Intersections of $y_{1}$ and $y_{2}$', fontsize=24)
ax.tick_params(labelsize=24)
ax.legend(fontsize=24)
plt.tight_layout()
In [4]:
# Plot f(x) = x - y to see where the zeros are
fig, ax = plt.subplots(1,1, figsize=(10,6))
ax.axhline(0, color='k', ls='--', lw=6)
ax.plot(x, x-y, lw=6, label=r'$x-\exp\left(-2\sin^{2}\left(4x\right)\right)$')

ax.set_xlim(0, 1.0)
ax.set_ylim(-1.0, 1.0)

ax.set_xlabel(r'$x$', fontsize=24)
ax.set_ylabel(r'$f$', fontsize=24)
ax.set_title(r'Zeros of $f\left(x\right) = x-\exp\left(-2\sin^{2}\left(4x\right)\right)$', fontsize=24)
ax.tick_params(labelsize=24)

plt.tight_layout()

It's pretty messy. There are three zeros and we can't solve this monster by hand. Let's try Newton's method. Like any good developer, we begin by sketching out our implementation.

xk # Set initial guess
tol # Set tolerance
max_it # Maximum iterations
for k in range(max_it):
    delta_xk = -f(xk) / dfdx(xk) # Evaluate Delta x
    if (abs(delta_xk) <= tol): # Stop iteration if a solution is found
        root = xk + delta_xk
        break
    xk += delta_xk # Update xk

Looks good. Let's start implementing. Clearly we need functions for evaluating $f\left(x\right)$ and the Jacobian. Note that the symbolic derivative of $f\left(x\right)$ is $$\dfrac{df}{dx} = 1 + 16\exp\left(-2\sin^{2}\left(4x\right)\right)\sin\left(4x\right)\cos\left(4x\right).$$

In [5]:
def f(x):
    # Hard-coded f(x)
    return x - np.exp(-2.0 * np.sin(4.0*x) * np.sin(4.0*x))

def dfdx(x):
    # Hard-coded "Jacobian matrix" of f(x)
    return 1.0 + 16.0 * np.exp(-2.0 * np.sin(4.0*x) * np.sin(4.0*x)) * np.sin(4.0*x) * np.cos(4.0*x)

# Start Newton algorithm
xk = 0.1 # Initial guess
tol = 1.0e-08 # Some tolerance
max_it = 100 # Just stop if a root isn't found after 100 iterations

root = None # Initialize root
for k in range(max_it):
    delta_xk = -f(xk) / dfdx(xk) # Update Delta x_{k}
    if (abs(delta_xk) <= tol): # Stop iteration if solution found
        root = xk + delta_xk
        print("Found root at x = {0:17.16f} after {1} iteratons.".format(root, k+1))
        break
    print("At iteration {0}, Delta x = {1:17.16f}".format(k+1, delta_xk))
    xk += delta_xk # Update xk
At iteration 1, Delta x = 0.1218876773122230
At iteration 2, Delta x = 0.0233959919379915
At iteration 3, Delta x = 0.0020665479886371
At iteration 4, Delta x = 0.0000150007973689
Found root at x = 0.2473652188201050 after 5 iterations.

It appears to be working okay. Let's check to see what the function's value is at our supposed root.

In [6]:
print("f({0:17.16f}) = {1:17.16e}".format(root, f(root)))
f(0.2473652188201050) = 5.5511151231257827e-17

Alright, I'm convinced.

(However, if you play with this implementation a bit you'll see some strange behavior. We can discuss offline if you're curious.)

Warning! This little sketch is not the optimal way of implementing Newton's method. It just gives you the basic ideas!

Summary

Derivatives are everywhere in science and engineering. We illustrated a situation from optimization where we try to find the roots of a complicated, high-dimensional nonlinear function. The algorithm that we used was Newton's method, which requires evaluations of the Jacobian.

We did a nice little example to find the roots of a nonlinear function in one dimension. We hard-coded the function and its derivative. What if we can't easily compute the derivative by hand? Or what if we're just lazy, and we don't want to compute the derivative by hand?

The Finite Difference

Suppose we want to avoid relying on the symbolic computation of the derivative. An obvious and very convenient way to do so is to use a finite difference. For a single-variable function, we just write $$\dfrac{\partial f}{\partial x} \approx \dfrac{f\left(x+\epsilon\right) - f\left(x\right)}{\epsilon}$$ for some "small" $\epsilon$. Let's do a little demo to see how things turn out.

We'll compute the derivative of $$f\left(x\right) = x - \exp\left(-2\sin^{2}\left(4x\right)\right).$$ We've already defined Python functions for $f\left(x\right)$ and its derivative. Let's define a function for the first-order finite difference.

In [7]:
def dfdx_h(x, epsilon):
    return (f(x + epsilon) - f(x)) / epsilon
In [8]:
x = np.linspace(0.0, 2.0, 1000) # Define domain
eps = np.logspace(-13, 1, 1000) # Define \epsilon domain

err = np.zeros(len(eps))
# Loop over epsilons
for idx, e in enumerate(eps):
    df_err = dfdx_h(x,e) - dfdx(x) # Compute error between FD and analytical at each point
    err[idx] = np.linalg.norm(df_err) # Store error as L2 norm
In [9]:
# Make a log-log plot of the error
fig, ax = plt.subplots(1,1, figsize=(10,6))
ax.plot(eps, err, lw=3)
ax.set_xscale('log')
ax.set_yscale('log')

ax.set_xlabel(r'$\epsilon$', fontsize=24)
ax.set_ylabel(r'$\| f^{\prime}_{FD} - f^{\prime}_{exact}\|_{L_{2}}$', fontsize=24)
ax.tick_params(labelsize=24)

plt.tight_layout()
In [10]:
# Print out minimum error
print(r"The minimum error of {0:17.16e} was obtained at epsilon = {1:17.16e}.".format(err.min(), eps[err==err.min()][0]))
The minimum error of 1.4490391090618430e-06 was obtained at epsilon = 2.8605955351757505e-09.

Very interesting! First of all, the smallest error we could get was around $10^{-6}$, which is several orders of magnitude above machine precision. Secondly, the minimum error was not obtained at $\epsilon \approx$ machine precision. The reason is that an $\epsilon$ that is too small amplifies floating point round-off errors, while an $\epsilon$ that is too large gives a poor approximation to the derivative.

In fact, it's not clear how to choose $\epsilon$ in general. Results from numerical analysis indicate that, as a rule of thumb for a first-order method, $\epsilon$ should be around $\sqrt{\epsilon_{\text{machine}}}$.

In [11]:
eps_mach = np.finfo(float).eps

print("Machine precision: {0}".format(eps_mach))
print("Square root of machine precision: {0}".format(np.sqrt(eps_mach)))
Machine precision: 2.220446049250313e-16
Square root of machine precision: 1.4901161193847656e-08
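As a quick sanity check (a hypothetical extra cell reusing dfdx_h, dfdx, and the x grid defined above), we could evaluate the finite difference error at this rule-of-thumb step size and compare it with the minimum found in the sweep:

eps_rule = np.sqrt(np.finfo(float).eps) # Rule-of-thumb step size for a first-order method
err_rule = np.linalg.norm(dfdx_h(x, eps_rule) - dfdx(x)) # L2 error over the x grid
print("Error at epsilon = sqrt(eps_machine): {0:17.16e}".format(err_rule))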

A few more notes on finite difference approaches

  • We could get a higher-order approximation by using the second-order finite difference formula $$\dfrac{\partial f}{\partial x} \approx \dfrac{f\left(x+\epsilon\right) - f\left(x - \epsilon\right)}{2\epsilon}.$$ This formula suffers from the same difficulties as the first-order approximation.
  • To calculate the derivative of a multivariate function we can write $$\dfrac{\partial f}{\partial x_{j}} \approx \dfrac{f\left(x+\epsilon e_{j}\right) - f\left(x\right)}{\epsilon}$$ where $e_{j}$ is the unit vector in the direction of $x_{j}$ (see the sketch below).
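Following the second bullet, a generic forward-difference Jacobian could look like the sketch below; the name fd_jacobian and the default step size are illustrative, and in practice $\epsilon$ would be chosen with the trade-off discussed above in mind.

def fd_jacobian(f, x, epsilon=1.0e-7):
    # Forward-difference approximation of the Jacobian of f: R^m -> R^n at x
    x = np.asarray(x, dtype=float)
    fx = np.asarray(f(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e_j = np.zeros(x.size)
        e_j[j] = 1.0  # Unit vector in the direction of x_j
        J[:, j] = (np.asarray(f(x + epsilon * e_j)) - fx) / epsilon
    return J

Applied to the function from Example 1, this would return a $2\times 3$ matrix of approximate partial derivatives.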

Towards Automatic Differentiation

In the introduction, we motivated the need for computational techniques to compute derivatives. The focus in the introduction was on the finite difference method, but we also computed a symbolic derivative. The finite difference approach is nice because it is quick and easy. However, it suffers from accuracy and stability problems. On the other hand, symbolic derivatives can be evaluated to machine precision, but they can be costly to evaluate. We'll have more to say about the cost of symbolic differentiation later.

Automatic differentiation (AD) overcomes both of these deficiencies: it is less costly than symbolic differentiation while still evaluating derivatives to machine precision. There are two modes of automatic differentiation: forward and reverse. This course will be primarily concerned with the forward mode. Time permitting, we will give an introduction to the reverse mode. In fact, the famous backpropagation algorithm from machine learning is a special case of the reverse mode of automatic differentiation.