Key Word(s): Automatic differentiation
Lecture 9¶
Tuesday, October 6th, 2020¶
Automatic Differentiation: The Motivation¶
References¶
- A Hitchhiker’s Guide to Automatic Differentiation
- You can access this through Harvard. No need to pay for it.
- Griewank, A. and Walther, A., 2008. Evaluating derivatives: principles and techniques of algorithmic differentiation (Vol. 105). Siam.
# Load libraries that we'll need
import matplotlib.pyplot as plt
import numpy as np
%matplotlib notebook
# or inline
Introduction and Motivation¶
Differentiation is one of the most important operations in science. Finding extrema of functions and determining zeros of functions are central to optimization. Numerically solving differential equations forms a cornerstone of modern science and engineering and is intimately linked with predictive science.
A frequent task in science is to find the zeros of a function $f\left(x\right)$. In general, the input to the function is an $m$-dimensional vector and the function returns an $n$-dimensional vector. We denote this mathematically as $f\left(x\right): \mathbb{R}^{m} \mapsto \mathbb{R}^{n}$. This expression is read: the function $f\left(x\right)$ maps $\mathbb{R}^{m}$ to $\mathbb{R}^{n}$.
Example 1¶
The system of nonlinear equations \begin{align} x_{1}x_{2}^{3} + \ln\left(x_{3}^{2}\right) = \sin\left(x_{1}x_{2}x_{3}\right) \\ x_{1} + x_{2} + \tan\left(x_{3}\right) = \frac{1}{x_{1}x_{2}x_{3}} \end{align} has \begin{align} x = \begin{bmatrix} x_{1} \\ x_{2} \\ x_{3} \end{bmatrix} \end{align} and we say $x\in\mathbb{R}^{3}$. Thus, following the notation above, we say $m = 3$. The function of interest is \begin{align} \displaystyle f\left(x\right) = \begin{bmatrix} \displaystyle x_{1}x_{2}^{3} + \ln\left(x_{3}^{2}\right) - \sin\left(x_{1}x_{2}x_{3}\right) \\ \displaystyle x_{1} + x_{2} + \tan\left(x_{3}\right) - \dfrac{1}{x_{1}x_{2}x_{3}} \end{bmatrix} . \end{align} Thus $f\left(x\right)$ maps $\mathbb{R}^{3}$ to $\mathbb{R}^{2}$ and we write $f\left(x\right): \mathbb{R}^{3} \mapsto \mathbb{R}^{2}$.
Example 1 Continued (with numbers!)¶
Say $x_{1} = 1, x_{2} = 2, x_{3} = 1$. Then, \begin{align} x = \begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix} \end{align} and \begin{align} f = \begin{bmatrix} 7.09 \\ 4.06 \end{bmatrix}. \end{align}
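As a quick check, here is a minimal sketch (the name f_example1 is ours) that encodes the Example 1 function with NumPy and evaluates it at this point:
def f_example1(x):
    # f maps R^3 to R^2; x is an array [x1, x2, x3]
    x1, x2, x3 = x
    return np.array([x1 * x2**3 + np.log(x3**2) - np.sin(x1 * x2 * x3),
                     x1 + x2 + np.tan(x3) - 1.0 / (x1 * x2 * x3)])

print(f_example1(np.array([1.0, 2.0, 1.0])))   # approximately [7.09, 4.06]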
Newton's Method¶
We may have cause to find an $x$ that renders $f\left(x\right)=0$. This is not so difficult for a linear system, but for a nonlinear system it can be a major challenge. Newton's method is an algorithm with excellent convergence properties that allows us to find the roots of a nonlinear function.
Breakout Room (10 minutes)¶
- Discuss / understand the mathematical terminology just introduced.
- Root-finding isn't just some academic exercise. Discuss some applications of root-finding in the real world.
Derivation of Newton's Method¶
We want to find $x\in\mathbb{R}^{m}$ such that $f\left(x\right) = 0$ for $f\left(x\right) \in \mathbb{R}^{n}$. Here are the basic steps:
Step 1¶
- Start with an initial guess $x^{\left(k\right)}$. It is very likely that $f\left(x^{\left(k\right)}\right) \neq 0$.
Step 2¶
- Look at a point just a little beyond $x^{\left(k\right)}$. That is $x^{\left(k+1\right)} = x^{\left(k\right)} + \Delta x^{\left(k\right)}$ where $\Delta x^{\left(k\right)} = x^{\left(k+1\right)} - x^{\left(k\right)}$.
Step 3¶
- Since $\Delta x^{\left(k\right)}$ is "small", and we are still in the local neighborhood of $x^{\left(k\right)}$, we ought to expand $f\left(x\right)$ about $x^{\left(k\right)}$ in a Taylor series. $$f\left(x^{\left(k\right)} + \Delta x^{\left(k\right)}\right) = f\left(x^{\left(k\right)}\right) + \displaystyle\left.\dfrac{\partial f}{\partial x}\right|_{x = x^{\left(k\right)}} \Delta x^{\left(k\right)} + \cdots.$$
Step 4¶
- Working with higher-order terms can be messy, so we'll just keep the linear terms and write $$f\underbrace{\left(x^{\left(k\right)} + \Delta x^{\left(k\right)}\right)}_{x^{\left(k+1\right)}} \approx f\left(x^{\left(k\right)}\right) + \displaystyle\left.\dfrac{\partial f}{\partial x}\right|_{x = x^{\left(k\right)}} \Delta x^{\left(k\right)}.$$
Step 5¶
- We want $f\left(x^{\left(k+1\right)}\right) = 0$. So we write $$f\left(x^{\left(k\right)}\right) + \displaystyle\left.\dfrac{\partial f}{\partial x}\right|_{x = x^{\left(k\right)}} \Delta x^{\left(k\right)} = 0.$$ It is understood that this relationship is approximate and that we're looking for an approximate correction $\Delta x^{\left(k\right)}$ that will give $f\left(x^{\left(k+1\right)}\right) = 0$.
Step 6¶
- We write a linear system at iteration $k$ for $\Delta x^{\left(k\right)}$. $$\displaystyle\left.\dfrac{\partial f}{\partial x}\right|_{x = x^{\left(k\right)}} \Delta x^{\left(k\right)} = -f\left(x^{\left(k\right)}\right).$$ In linear algebra terms we write $$J \Delta x^{\left(k\right)} = -f\left(x^{\left(k\right)}\right)$$ where $$J = \displaystyle\left.\dfrac{\partial f}{\partial x}\right|_{x = x^{\left(k\right)}}$$ is called the Jacobian matrix evaluated at $x^{\left(k\right)}$.
Step 7¶
- The new guess for the root is $x^{\left(k+1\right)} = x^{\left(k\right)} + \Delta x^{\left(k\right)}$. This process is repeated until $f\left(x^{\left(k+1\right)}\right) \approx 0$ to within some acceptable tolerance.
Note: There are many ways of implementing Newton's method with varying sophistication. We won't cover those ideas at the moment.
At the heart of Newton's method lies the Jacobian matrix, which is a matrix of partial derivatives. Accurate calculation of the Jacobian is key to realizing the good convergence properties of Newton's method.
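To make Step 6 concrete, here is a minimal sketch of a multivariate Newton iteration for a square system ($m = n$); the name newton_system is ours, and the user supplies callables for $f$ and its Jacobian:
def newton_system(f, jac, x0, tol=1.0e-08, max_it=100):
    # Sketch of Newton's method for f(x) = 0 with a user-supplied Jacobian (square system)
    xk = np.array(x0, dtype=float)
    for k in range(max_it):
        delta_xk = np.linalg.solve(jac(xk), -f(xk))   # Solve J * Delta x = -f
        xk = xk + delta_xk                            # Update the guess
        if np.linalg.norm(delta_xk) <= tol:           # Converged to within tolerance
            return xk, k + 1
    raise RuntimeError("No root found after {0} iterations.".format(max_it))
In one dimension ($m = n = 1$) the linear solve reduces to $\Delta x^{\left(k\right)} = -f\left(x^{\left(k\right)}\right) / f'\left(x^{\left(k\right)}\right)$, which is exactly what we use in the example below.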
Example 2¶
We want to find where the functions $y_{1} = x$ and $y_{2} = \exp\left(-2\sin^{2}\left(4x\right)\right)$ intersect. That is, for what values of $x$ is $y_{1} = y_{2}$? This question is equivalent to the problem of finding the zeros of $$f\left(x\right) = x - \exp\left(-2\sin^{2}\left(4x\right)\right).$$
Before doing anything, it's a good idea to visualize what we're up against.
# Define the x grid and evaluate y_2 = exp(-2 sin^2(4x)) on it
x = np.linspace(0.0, 2.0*np.pi, 1000)
y = np.exp(-2.0 * np.sin(4.0*x)*np.sin(4.0*x))
# Plot the two functions to see where they intersect
fig, ax = plt.subplots(1,1, figsize=(10,6), constrained_layout=True)
ax.plot(x, y, lw=3, label=r'$y_{2} = \exp\left(-2\sin^{2}\left(4x\right)\right)$')
ax.plot(x, x, ls='--', lw=3, label=r'$y_{1} = x$')
ax.set_xlim(0, 2.0)
ax.set_ylim(0, 2.0)
ax.set_xlabel(r'$x$', fontsize=24)
ax.set_ylabel(r'$y$', fontsize=24)
ax.set_title(r'Intersections of $y_{1}$ and $y_{2}$', fontsize=24)
ax.tick_params(labelsize=24)
ax.legend(fontsize=24);
# Plot f(x) = x - y to see where the zeros are
fig, ax = plt.subplots(1,1, figsize=(10,6), constrained_layout=True)
ax.axhline(0, color='k', ls='--', lw=6)
ax.plot(x, x-y, lw=6, label=r'$x-\exp\left(-2\sin^{2}\left(4x\right)\right)$')
ax.set_xlim(0, 1.0)
ax.set_ylim(-1.0, 1.0)
ax.set_xlabel(r'$x$', fontsize=24)
ax.set_ylabel(r'$f$', fontsize=24)
ax.set_title(r'Zeros of $f\left(x\right) = x-\exp\left(-2\sin^{2}\left(4x\right)\right)$', fontsize=24)
ax.tick_params(labelsize=24);
It's pretty messy. There are three zeros and we can't solve this monster by hand. Let's try Newton's method. Like any good developer, we begin by sketching out our implementation.
xk      # Set initial guess
tol     # Set tolerance
max_it  # Maximum iterations
for k in range(max_it):
    delta_xk = -f(xk) / dfdx(xk)   # Evaluate Delta x
    if (abs(delta_xk) <= tol):     # Stop iteration if solution found
        root = xk + delta_xk
        break
    xk += delta_xk                 # Update xk
Looks good. Let's start implementing. Clearly we need functions for evaluating $f\left(x\right)$ and the Jacobian. Note that the symbolic derivative of $f\left(x\right)$ is $$\dfrac{df}{dx} = 1 + 16\exp\left(-2\sin^{2}\left(4x\right)\right)\sin\left(4x\right)\cos\left(4x\right).$$
def f(x):
    # Hard-coded f(x)
    return x - np.exp(-2.0 * np.sin(4.0*x) * np.sin(4.0*x))

def dfdx(x):
    # Hard-coded "Jacobian matrix" of f(x)
    return 1.0 + 16.0 * np.exp(-2.0 * np.sin(4.0*x) * np.sin(4.0*x)) * np.sin(4.0*x) * np.cos(4.0*x)
# Start Newton algorithm
xk = 0.1         # Initial guess
tol = 1.0e-08    # Some tolerance
max_it = 100     # Just stop if a root isn't found after 100 iterations
root = None      # Initialize root
for k in range(max_it):
    delta_xk = -f(xk) / dfdx(xk)   # Update Delta x_{k}
    if (abs(delta_xk) <= tol):     # Stop iteration if solution found
        root = xk + delta_xk
        print("Found root at x = {0:17.16f} after {1} iterations.".format(root, k+1))
        break
    print("At iteration {0}, Delta x = {1:17.16f}".format(k+1, delta_xk))
    xk += delta_xk                 # Update xk
At iteration 1, Delta x = 0.1218876773122230
At iteration 2, Delta x = 0.0233959919379915
At iteration 3, Delta x = 0.0020665479886371
At iteration 4, Delta x = 0.0000150007973689
Found root at x = 0.2473652188201050 after 5 iterations.
It appears to be working okay. Let's check to see what the function's value is at our supposed root.
print("f({0:17.16f}) = {1:17.16e}".format(root, f(root)))
f(0.2473652188201050) = 5.5511151231257827e-17
Alright, I'm convinced.
(However, if you play with this implementation a bit you'll see some strange behavior. We can discuss offline if you're curious.)
Warning! This little sketch is not the optimal way of implementing Newton's method. It just gives you the basic ideas!
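As a sanity check (this is not part of the lecture's implementation), a library root-finder such as scipy.optimize.newton should agree with our result, assuming SciPy is installed:
from scipy.optimize import newton as scipy_newton

# Cross-check against a library implementation; should agree with the root found above
scipy_root = scipy_newton(f, x0=0.1, fprime=dfdx, tol=1.0e-08)
print("SciPy root: {0:17.16f}".format(scipy_root))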
Breakout Room (10 minutes)¶
- Discuss problems from your own research areas that require derivatives. How do you calculate them?
Summary¶
Derivatives are everywhere in science and engineering. We illustrated a situation from optimization where we try to find the roots of a complicated, high-dimensional nonlinear function. The algorithm that we used was Newton's method, which requires evaluations of the Jacobian.
We did a nice little example to find the roots of a nonlinear function in one dimension. We hard-coded the function and its derivative. What if we can't easily compute the derivative by hand? Or what if we're just lazy, and we don't want to compute the derivative by hand?
The Finite Difference¶
Suppose we want to avoid relying on the symbolic computation of the derivative. An obvious and very convenient way to do so is to use a finite difference. For a single-variable function, we just write $$\dfrac{\partial f}{\partial x} \approx \dfrac{f\left(x+\epsilon\right) - f\left(x\right)}{\epsilon}$$ for some "small" $\epsilon$. Let's do a little demo to see how things turn out.
We'll compute the derivative of $$f\left(x\right) = x - \exp\left(-2\sin^{2}\left(4x\right)\right).$$ We've already defined Python functions for $f\left(x\right)$ and its derivative. Let's define a function for the first-order finite difference.
def dfdx_h(x, epsilon):
    # First-order (forward) finite difference approximation of df/dx
    return (f(x + epsilon) - f(x)) / epsilon
x = np.linspace(0.0, 2.0, 1000)   # Define domain
eps = np.logspace(-13, 1, 1000)   # Define \epsilon domain
err = np.zeros(len(eps))
# Loop over epsilons
for idx, e in enumerate(eps):
    df_err = dfdx_h(x, e) - dfdx(x)     # Compute error between FD and analytical at each point
    err[idx] = np.linalg.norm(df_err)   # Store error as L2 norm
# Make a log-log plot of the error
fig, ax = plt.subplots(1,1, figsize=(10,6), constrained_layout=True)
ax.plot(eps, err, lw=3)
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel(r'$\epsilon$', fontsize=24)
ax.set_ylabel(r'$\| f^{\prime}_{FD} - f^{\prime}_{exact}\|_{L_{2}}$', fontsize=24)
ax.tick_params(labelsize=24);
# Print out minimum error
print(r"The minimum error of {0:17.16e} was obtained at epsilon = {1:17.16e}.".format(err.min(), eps[err==err.min()][0]))
The minimum error of 1.4490391090618430e-06 was obtained at epsilon = 2.8605955351757505e-09.
Very interesting! First of all, the smallest error we could get was around $10^{-6}$, which is several orders of magnitude above machine precision. Secondly, the minimum error was not obtained at $\epsilon \approx \epsilon_{\text{machine}}$. The explanation is that when $\epsilon$ is too small, floating point round-off errors are amplified, while when $\epsilon$ is too large, the finite difference is no longer a good approximation to the derivative.
In fact, it's not clear how to choose $\epsilon$ in general. Results from numerical analysis suggest that, as a rule of thumb for a first-order method, $\epsilon$ should be around $\sqrt{\epsilon_{\text{machine}}}$.
eps_mach = np.finfo(float).eps
print("Machine precision: {0}".format(eps_mach))
print("Square root of machine precision: {0}".format(np.sqrt(eps_mach)))
Machine precision: 2.220446049250313e-16
Square root of machine precision: 1.4901161193847656e-08
A few more notes on finite difference approaches¶
- We could get a higher-order approximation by using the second-order finite difference formula $$\dfrac{\partial f}{\partial x} \approx \dfrac{f\left(x+\epsilon\right) - f\left(x - \epsilon\right)}{2\epsilon}.$$ This formula suffers from the same difficulties as the first-order approximation.
- To calculate the derivative of a multivariate function we can write $$\dfrac{\partial f}{\partial x_{j}} \approx \dfrac{f\left(x+\epsilon e_{j}\right) - f\left(x\right)}{\epsilon}$$ where $e_{j}$ is the unit vector in the direction of $x_{j}$. A sketch of both formulas is given below.
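Here is a minimal sketch covering both bullets (the names dfdx_central and fd_jacobian are ours); for the multivariate case we assume $f$ takes and returns NumPy arrays:
def dfdx_central(x, epsilon):
    # Second-order (central) finite difference of the scalar f defined earlier
    return (f(x + epsilon) - f(x - epsilon)) / (2.0 * epsilon)

def fd_jacobian(f, x, epsilon=np.sqrt(np.finfo(float).eps)):
    # First-order (forward) finite difference approximation of the Jacobian of f at x
    fx = f(x)                # f maps R^m to R^n, so fx has length n
    m = len(x)
    J = np.zeros((len(fx), m))
    for j in range(m):
        e_j = np.zeros(m)
        e_j[j] = 1.0         # Unit vector in the direction of x_j
        J[:, j] = (f(x + epsilon * e_j) - fx) / epsilon
    return J
Applied to the Example 1 function ($m = 3$, $n = 2$), fd_jacobian returns a $2 \times 3$ matrix of approximate partial derivatives.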
Breakout Room (10-15 minutes)¶
- Make sure you understand the following:
- Machine precision
- Finite difference of multivariate function
- What are the benefits of the finite difference method over symbolic derivatives?
- What are the benefits of symbolic derivatives over the finite difference method?
Towards Automatic Differentiation¶
In the introduction, we motivated the need for computational techniques to compute derivatives. The focus in the introduction was on the finite difference method, but we also computed a symbolic derivative. The finite difference approach is nice because it is quick and easy, but it suffers from accuracy and stability problems. On the other hand, symbolic derivatives can be evaluated to machine precision, but they can be costly to compute. We'll have more to say about the cost of symbolic differentiation later.
Automatic differentiation (AD) overcomes both of these deficiencies. It is less costly than symbolic differentiation while still evaluating derivatives to machine precision. There are two modes of automatic differentiation: forward and reverse. The famous backpropagation algorithm from machine learning is a special case of the reverse mode of automatic differentiation.
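To preview the forward mode (details in a later lecture), here is a minimal hand-written sketch (the name forward_mode_f is ours) that propagates (value, derivative) pairs through the $f\left(x\right)$ from Example 2; an AD tool applies these same rules automatically, for instance via operator overloading:
def forward_mode_f(x_val, x_der=1.0):
    # Propagate (value, derivative) pairs through f(x) = x - exp(-2 sin^2(4x)).
    # The derivative rules are written out by hand here purely for illustration.
    s, s_der = np.sin(4.0 * x_val), 4.0 * np.cos(4.0 * x_val) * x_der   # sin(4x)
    u, u_der = -2.0 * s * s, -4.0 * s * s_der                           # -2 sin^2(4x)
    e, e_der = np.exp(u), np.exp(u) * u_der                             # exp(u)
    return x_val - e, x_der - e_der                                     # f and df/dx

val, der = forward_mode_f(0.25)                    # an arbitrary evaluation point
print(abs(val - f(0.25)), abs(der - dfdx(0.25)))   # both differences should be ~0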