Key Word(s): Automatic differentiation, Forward mode, Dual numbers, Reverse mode
References¶
- A Hitchhiker’s Guide to Automatic Differentiation
- Griewank, A. and Walther, A., 2008. Evaluating derivatives: principles and techniques of algorithmic differentiation (Vol. 105). Siam.
- Nocedal, J. and Wright, S., 2006. Numerical Optimization, Second Edition. Springer.
# Load libraries that we'll need
import matplotlib.pyplot as plt
import numpy as np
%matplotlib notebook
# or inline
def f(x):
    # Hard-coded f(x)
    return x - np.exp(-2.0 * np.sin(4.0*x) * np.sin(4.0*x))

def dfdx(x):
    # Hard-coded "Jacobian matrix" of f(x)
    return 1.0 + 16.0 * np.exp(-2.0 * np.sin(4.0*x) * np.sin(4.0*x)) * np.sin(4.0*x) * np.cos(4.0*x)

def dfdx_h(x, epsilon):
    return (f(x + epsilon) - f(x)) / epsilon
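As a quick sanity check (not part of the original notebook), we can compare the hard-coded derivative against the finite-difference approximation at an arbitrary point for a few step sizes:
# Sanity check: compare the analytical derivative to the finite-difference
# approximation at an arbitrary point x0 for a few step sizes.
x0 = 1.0
for eps in [1e-1, 1e-4, 1e-8]:
    print(f"eps = {eps:.0e}: FD = {dfdx_h(x0, eps):.10f}, "
          f"exact = {dfdx(x0):.10f}, error = {abs(dfdx_h(x0, eps) - dfdx(x0)):.2e}")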
Today¶
- The Jacobian in forward mode
- What the forward mode actually computes
- Dual numbers
- Towards implementation
Auto-eD: An AD Tool¶
A former student, Lindsey Brown, developed an AD tool out of her final project in CS207.
You are welcome / encouraged to give it a try:
- Clone the repo and launch locally
- Clone: https://github.com/lindseysbrown/Auto-eD
- Read the Docs: https://auto-ed.readthedocs.io/en/latest/
- Launch locally:
python ADappHeroku.py
- Go to the public server: https://autoed.herokuapp.com/
- WARNING: Traffic to this server is limited. It will be very slow if too many people try to access it at once. Try to run locally if you can.
Note: At this time, Auto-eD only works on Chrome and Microsoft Edge. We're working on Safari and Firefox.
Towards the Jacobian Matrix¶
Time to take a step back and recap. At the very beginning of the AD lectures, we motivated the calculation of derivatives with Newton's method for solving a system of nonlinear equations. One of the key elements in that method is the Jacobian matrix $J = \dfrac{\partial f_{i}}{\partial x_{j}}$. So far, you have learned how to manually perform the forward mode of automatic differentiation both for scalar functions of a single variable and for scalar functions of multiple variables. The solution of systems of equations requires differentiation of a vector function of multiple variables.
What Forward AD Actually Computes¶
The forward mode of automatic differentiation actually computes the product $\nabla f \cdot p$.
Some Notation and the Seed Vector¶
Note that every time we computed the derivative using the forward mode, we "seeded" the derivative with a value.
We chose this value to be $1$.
Let's look at a very simple function $f\left(x,y\right) = xy$.
Clearly $\nabla f = \begin{bmatrix} y \\ x \end{bmatrix}$ for $f\left(x,y\right) = xy$.
The evaluation trace consists of $f = x_{3} = x_{1}x_{2}$.
Let's introduce a seed vector $p$. The definition of the directional derivative is $$D_{p}x_{3} = \nabla x_{3} \cdot p = \sum_{j=1}^{2}{\dfrac{\partial x_{3}}{\partial x_{j}}p_{j}}.$$
This tells us the derivative of $x_{3}$ ($f$) in the direction of the vector $p$.
Expanding the sum gives \begin{align} D_{p}x_{3} &= \dfrac{\partial x_{3}}{\partial x_{1}}p_{1} + \dfrac{\partial x_{3}}{\partial x_{2}}p_{2} \\ &= x_{2}p_{1} + x_{1}p_{2}. \end{align}
Let's consider two choices for $p$:
- $p = \left(1,0\right)$ gives us $\dfrac{\partial f}{\partial x}$
- $p = \left(0,1\right)$ gives us $\dfrac{\partial f}{\partial y}$
Do you see why?
When $p = \left(1, 0\right)$, our directional derivative is \begin{align} D_{p}x_{3} &= \dfrac{\partial x_{3}}{\partial x_{1}}p_{1} + \dfrac{\partial x_{3}}{\partial x_{2}}p_{2} \\ &= x_{2}p_{1} + x_{1}p_{2} \\ &= x_{2}. \end{align}
But remember that $\nabla f = \begin{bmatrix} \partial f / \partial x \\ \partial f / \partial y \end{bmatrix} = \begin{bmatrix} y \\ x \end{bmatrix} = \begin{bmatrix} x_{2} \\ x_{1} \end{bmatrix}$.
Of course, we aren't required to choose the seed vectors to be unit vectors. However, the utility should be apparent.
Note that this could be written as $$D_{p}x_{3} = \nabla x_{3}\cdot p.$$ So, the forward mode of AD is really computing the product of the gradient of our function with the seed vector!
If $f$ is a vector, then the forward mode actually computes $Jp$ where $J = \dfrac{\partial f_{i}}{\partial x_{j}}, \quad i = 1,\ldots, n, \quad j = 1,\ldots,m$ is the Jacobian matrix.
In many applications, we really only want the "action" of the Jacobian on a vector: that is, we just want to compute the matrix-vector product $Jp$ for some vector $p$, without ever forming $J$ itself.
Of course, if we really want to form the entire Jacobian, this is readily performed by forming a set of $m$ unit seed vectors $\left\{p_{1},\ldots p_{m}\right\}$ where $p_{k}\in \mathbb{R}^{m}$ consists of all zeros except for a single $1$ in position $k$.
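To make this concrete, here is a minimal sketch (purely illustrative, not a real AD implementation) of the forward mode for $f\left(x,y\right) = xy$: each quantity is carried as a (value, directional derivative) pair, and each unit seed vector recovers one component of the gradient. The evaluation point $(3, 2)$ is an arbitrary choice.
# Illustrative forward-mode sketch for f(x, y) = x * y.
# Each quantity is a (value, directional derivative) pair.
def forward_mode_xy(x, y, p):
    x1 = (x, p[0])   # input x seeded with p1
    x2 = (y, p[1])   # input y seeded with p2
    # x3 = x1 * x2: value by ordinary multiplication,
    # directional derivative by the product rule
    x3 = (x1[0] * x2[0], x1[0] * x2[1] + x1[1] * x2[0])
    return x3

a, b = 3.0, 2.0
print(forward_mode_xy(a, b, (1.0, 0.0)))  # seed (1, 0): df/dx = y = 2
print(forward_mode_xy(a, b, (0.0, 1.0)))  # seed (0, 1): df/dy = x = 3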
Short Demo¶
That was a lot of jargon. Let's do a little example to give the basic ideas. Let \begin{align} f\left(x,y\right) = \begin{bmatrix} xy + \sin\left(x\right) \\ x + y + \sin\left(xy\right) \end{bmatrix}. \end{align}
The computational graph will have two inputs, $x$ and $y$, and two outputs, this time $f_{1}$ and $f_{2}$.
In the notation below $v_{5}$ corresponds to $f_{1}$ and $v_{6}$ corresponds to $f_{2}$.
Let's start by computing the gradient of $f_{1}$. Using the directional derivative we have \begin{align} D_{p}v_{5} = \left(\cos\left(x_{1}\right) + x_{2}\right)p_{1} + x_{1}p_{2}. \end{align} Note: You should fill in the details!
- Draw the graph
- Write out the trace
- You can do this by evaluating the function at the point $\left(a,b\right)$, although this isn't strictly necessary for this problem.
Notice that \begin{align*} \dfrac{\partial f_{1}}{\partial x} &= y + \cos\left(x\right) \\ \dfrac{\partial f_{1}}{\partial y} &= x \end{align*}
$D_{p}v_{5} = \left(\cos\left(x_{1}\right) + x_{2}\right)p_{1} + x_{1}p_{2}$
Choosing $p = (1,0)$ gives $\dfrac{\partial f_{1}}{\partial x}$ and choosing $p = (0,1)$ gives $\dfrac{\partial f_{1}}{\partial y}$.
These form the first row of the Jacobian! The second row of the Jacobian can be computed similarly by working with $v_{6}$.
The take-home message is that we can form the full Jacobian by using $m$ seed vectors where $m$ is the number of independent variables.
This is independent of the number of functions!
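Here is one way to carry that out in code for the demo function (again just a sketch; the evaluation point $(1, 2)$ and the helper function are arbitrary choices for illustration). Each forward pass with a unit seed vector produces one column of the Jacobian.
import numpy as np

# Illustrative sketch: build the full Jacobian of
# f(x, y) = [x*y + sin(x), x + y + sin(x*y)]
# by running the forward mode once per unit seed vector.
def forward_mode_demo(x, y, p):
    # Each quantity is a (value, directional derivative) pair.
    v1 = (x, p[0])
    v2 = (y, p[1])
    v3 = (v1[0] * v2[0], v1[0] * v2[1] + v1[1] * v2[0])   # x * y
    v4 = (np.sin(v1[0]), np.cos(v1[0]) * v1[1])           # sin(x)
    v5 = (v3[0] + v4[0], v3[1] + v4[1])                   # f1
    v6 = (v1[0] + v2[0] + np.sin(v3[0]),
          v1[1] + v2[1] + np.cos(v3[0]) * v3[1])          # f2
    return np.array([v5[1], v6[1]])  # directional derivatives of (f1, f2)

a, b = 1.0, 2.0
seeds = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
J = np.column_stack([forward_mode_demo(a, b, p) for p in seeds])
print(J)  # compare with [[b + cos(a), a], [1 + b*cos(a*b), 1 + a*cos(a*b)]]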
Thoughts on Implementation¶
There are different ways of implementing automatic differentiation. Two predominant approaches are:
- code translation through pre-processor directives and
- operator-overloading.
We won't have much to say about code translation. It can be quite complicated, even going so far as to require writing a compiler. The high-level idea is that the compiler (or preprocessor) emits additional code to compute derivatives of functions. The benefit is that the result can be very efficient and fast.
Our approach will use operator overloading.
Operator Overloading¶
The Wikipedia article on operator overloading states:
...operator overloading, sometimes termed operator ad hoc polymorphism, is a specific case of polymorphism, where different operators have different implementations depending on their arguments.
You have already seen the basics of this, although we didn't call it operator overloading. Python allows operator overloading via the special methods.
For example, suppose you have a Python class representing a circle. You could define the __add__ method (and __radd__ for commutativity) to add two circle objects. In this way, you would overload the addition operator (+).
Example¶
Let's create a class to do complex numbers again.
# Complex class again
class Complex():
    def __init__(self, a, b):
        # Constructor to set up real and imag parts
        self.a = a
        self.b = b

# Create some complex numbers
z1 = Complex(0, 1)
z2 = Complex(2, 0)

# And add them
z = z1 + z2
print(z.a, z.b)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-5e457e1e9eab> in <module>
     11
     12 # And add them
---> 13 z = z1 + z2
     14 print(z.a, z.b)

TypeError: unsupported operand type(s) for +: 'Complex' and 'Complex'
That didn't work. Python doesn't know how to apply the addition operator (+) to Complex objects! Let's remedy that.
# Complex class again
class Complex():
    def __init__(self, a, b):
        # Constructor to set up real and imag parts
        self.a = a
        self.b = b

    def __add__(self, other):
        # Implementing complex addition
        return Complex(self.a + other.a, self.b + other.b)

z1 = Complex(0, 1)
z2 = Complex(2, 0)
z = z1 + z2
print(z.a, z.b)
2 1
Okay, much better. Of course, there are still many problems here:
- No convenient way to print out a complex number (where's __str__?)
- What happens when you try to add a complex number to a real number (e.g. 2 + z)?
  - Try it!
  - Then check out __radd__. This is how you get commutativity.
- Many operations are missing (e.g. where's multiplication?).
Nevertheless, the addition operator has been (partly) successfully overloaded (we didn't get commutativity yet).
Comments on __add__¶
- Python calls __add__ to add two object instances a and b.
- When you write a + b, Python really does a.__add__(b).
- So if you define __add__, you can add things to an object.
- But, you can't add the object to things!
More Comments on __add__¶
- Suppose you have object1 and object2. These both have __add__ defined.
- So there's no problem for Python to call object1.__add__(object2).
- But if you tried to do object1 + 1.0 you would (probably) get an error because the float type probably doesn't have the same kind of addition as object1.
- So you would have to enhance __add__ with duck-typing via try-except blocks.
  - Homework!
# Complex class again
class Complex():
    def __init__(self, a, b):
        # Constructor to set up real and imag parts
        self.a = a
        self.b = b

    def __add__(self, other):
        # Implementing complex addition
        return Complex(self.a + other.a, self.b + other.b)

z1 = Complex(0.0, 1.0)
z = z1 + 1.0  # Fails: a float has no .a attribute, so __add__ raises an AttributeError
print(z.a, z.b)
Comments on __radd__¶
- But wait, there's more! Suppose you successfully support both object1 + object2 and object1 + 1.0.
- Many things in life are commutative, so you'd really like to be able to do 1.0 + object1.
- But Python first calls (1.0).__add__(object1), which doesn't know about your object, and you get a TypeError!
- To solve this, you must use __radd__.
  - Homework!
- First Python tries (1.0).__add__(object1) and when that fails it calls object1.__radd__(1.0).
- If __radd__ just calls __add__, then all your bases are covered (a toy sketch follows below).
  - Hint: self.__add__(other)
See Emulating Numeric Types for more.
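Here is a toy sketch of that dispatch (deliberately not the Complex class, so the homework stays yours); the class name and the duck-typing fallback are just one possible design:
# Toy example of __radd__ delegating to __add__ (not the homework solution).
class Thing():
    def __init__(self, val):
        self.val = val

    def __add__(self, other):
        # Duck typing: try to use the other object's attribute,
        # fall back to treating it as a plain number.
        try:
            return Thing(self.val + other.val)
        except AttributeError:
            return Thing(self.val + other)

    def __radd__(self, other):
        # Called for 1.0 + Thing(...) after float addition gives up.
        return self.__add__(other)

t = Thing(2.0)
print((t + 1.0).val, (1.0 + t).val)  # 3.0 3.0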
Towards Implementation¶
This section will give you a hint on how you may go about implementing the forward mode. I will be deliberately terse. In fact, I will not even define a class or use any operator overloading. What I'm about to show you is almost purely conceptual. You will do the hard work of coming up with a user-friendly implementation for your project.
Let's start with a baby example. Suppose at some point in our evaluation trace we have $$v_{3} = v_{1}v_{2}.$$ We want to evaluate $v_{3}$ and the derivative. Following our usual conventions we have $$\dot{v}_{3} = v_{1}\dot{v}_{2} + \dot{v}_{1}v_{2}.$$ For our purposes here, suppose we have that $v_{1} = 2$ and $v_{2} = 3$ as well as $\dot{v}_{1} = 1$ and $\dot{v}_{2} = 4$. It doesn't matter where these values came from at the moment.
Let's define an object that has two components. The first component will hold the value of the function and the second component will hold the value of the derivative.
# v1 v1 dot
v1 = [2, 1] # value and derivative
v2 = [3, 4]
It's clear how to compute the value of the function at the point $a$. Simply use the first component of our object to compute the function value.
f = [0, 0] # Allocate the values of the function and its derivative
# f(a) = v1 * v2
f[0] = v1[0] * v2[0]
print(f)
[6, 0]
But to calculate the derivative, we need to overload multiplication: whenever we multiply two of these objects, the second slot (the derivative) must be updated using the product rule rather than ordinary multiplication.
#f'(a) = v1 * v2dot + v1dot * v2
f[1] = v1[0] * v2[1] + v1[1] * v2[0]
print(f)
[6, 11]
Let's recap. We'll use OOP vernacular.
- We instantiated two gradient objects $v_{1}$ and $v_{2}$ (here just lists).
- These objects have two named attributes.
- The first attribute is the function value.
- The second attribute is the derivative.
- When performing a multiplication, the usual multiplication operator can be used for the function value part of the object.
- However, for the derivative part, one must overload the multiplication operator.
- Note that this could be accomplished in a class by implementing the __mul__ and __rmul__ special methods and returning the gradient object (like we did for __add__ in the Complex class example); a minimal sketch follows below.
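Here is a minimal sketch of that last point (the class and attribute names are placeholders, not a prescribed design for your project): multiplication by another gradient object uses the product rule, and multiplication by a plain constant simply scales both slots.
# Minimal sketch of a forward-mode gradient object with overloaded multiplication.
class Forward():
    def __init__(self, val, der=0.0):
        self.val = val   # function value
        self.der = der   # derivative value

    def __mul__(self, other):
        try:
            # Product rule for two Forward objects
            return Forward(self.val * other.val,
                           self.val * other.der + self.der * other.val)
        except AttributeError:
            # other is a plain constant
            return Forward(self.val * other, self.der * other)

    __rmul__ = __mul__  # multiplication is commutative here

v1 = Forward(2, 1)   # value 2, derivative 1
v2 = Forward(3, 4)   # value 3, derivative 4
f = v1 * v2
print(f.val, f.der)  # 6 11 -- matches the list-based sketch above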
Automatic Differentiation: The Reverse Mode¶
References¶
- A Hitchhiker’s Guide to Automatic Differentiation
- Griewank, A. and Walther, A., 2008. Evaluating derivatives: principles and techniques of algorithmic differentiation (Vol. 105). Siam.
- Nocedal, J. and Wright, S., 2006. Numerical Optimization, Second Edition. Springer.
- Automatic Differentiation in Machine Learning: A Survey
Towards the Reverse Mode¶
The reverse mode is also extremely popular and useful (e.g. in scenarios that have $f:\mathbb{R}^{m}\mapsto\mathbb{R}$).
Here we will outline the mechanics of the reverse mode, show a little example, and survey the basic equations.
A Sketch of the Reverse Mode¶
- Create evaluation graph
- Forward pass does function evaluations
- Forward pass also does partial derivatives of elementary functions
- It does NOT do the chain rule!
- Just stores the partial derivatives
- For example:
- If $v_{3} = v_{1}v_{2}$ is a node, we store $\dfrac{\partial v_{3}}{\partial v_{1}}$ and $\dfrac{\partial v_{3}}{\partial v_{2}}$. That's it.
- If $v_{3} = \sin\left(v_{2}\right)$ is a node, we store $\cos\left(v_{2}\right)$. Notice that there's no $\dot{v}_{2}$.
- Reverse pass starts with $\overline{v}_{N} = \dfrac{\partial f}{\partial v_{N}} = 1$ (since $f$ is $v_{N}$)
- Then it gets $\overline{v}_{N-1} = \dfrac{\partial f}{\partial v_{N}}\dfrac{\partial v_{N}}{\partial v_{N-1}}$
- Note: $\dfrac{\partial v_{N}}{\partial v_{N-1}}$ is already stored from the forward pass
- The only trick occurs when we get to a branch in the graph. That is, when the node we're on has more than one child. In that case, we sum the two paths. For example, if $v_{3}$ has $v_{4}$ and $v_{5}$ as children, then we do $$\overline{v}_{3} = \dfrac{\partial f}{\partial v_{3}} = \dfrac{\partial f}{\partial v_{4}}\dfrac{\partial v_{4}}{\partial v_{3}} + \dfrac{\partial f}{\partial v_{5}}\dfrac{\partial v_{5}}{\partial v_{3}}.$$
- Note: This summation is a manifestation of the chain rule.
The reverse mode cannot be interpreted in the context of dual numbers like the forward mode was.
The little implementation sketch that we did for the forward mode will need to be generalized for the reverse mode.
The Basic Equations¶
These equations are modified from Nocedal and Wright (page 180).
The partial derivative of $f$ with respect to $u_{i}$ can be written as \begin{align} \dfrac{\partial f}{\partial u_{i}} = \sum_{\text{j a child of i}}{\dfrac{\partial f}{\partial u_{j}}\dfrac{\partial u_{j}}{\partial u_{i}}}. \end{align} At each node $i$ we compute \begin{align} \overline{u}_{i} += \dfrac{\partial f}{\partial u_{j}}\dfrac{\partial u_{j}}{\partial u_{i}}. \end{align} The $\overline{u}_{i}$ variable stores the current value of the partial derivative at node $i$. It is sometimes called the adjoint variable.
In our notation, $u_{i}$ could be an input variable ($x_{i}$) or an intermediate variable ($v_{i}$).
An Example for Intuition¶
Let's try to evaluate the function $$f\left(x,y\right) = xy + \exp\left(xy\right)$$ and its gradient at the point $a = (1,2)$. We'll use the reverse mode this time.
Clearly we have $$\nabla f = \begin{bmatrix} y + y\exp\left(xy\right) \\ x + x\exp\left(xy\right) \end{bmatrix}.$$ Hence \begin{align} f\left(a\right) &= 2 + e^{2} \\ \nabla f &= \begin{bmatrix} 2 + 2e^{2} \\ 1 + e^{2} \end{bmatrix}. \end{align}
Here's a visualization of the computational graph:
Let's use the reverse mode to calculate the gradient of $f$.
- Generate the forward trace and calculate the partial derivatives of a node wrt its children.
- Notice that this time we need to save the graph.
- Start at $v_{3}$ and begin applying the chain rule (a worked sketch in code follows below).
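Here is one way to carry those steps out by hand (purely illustrative; a real implementation would build and traverse the stored graph programmatically), using the labeling $v_{1} = xy$, $v_{2} = \exp(v_{1})$, $v_{3} = v_{1} + v_{2} = f$ so that the reverse pass starts at $v_{3}$:
import numpy as np

# Reverse mode by hand for f(x, y) = x*y + exp(x*y) at a = (1, 2).
x, y = 1.0, 2.0

# Forward pass: store values and the local partial derivatives of each
# node with respect to its parents (no chain rule yet).
v1 = x * y          # dv1/dx = y,  dv1/dy = x
v2 = np.exp(v1)     # dv2/dv1 = exp(v1)
v3 = v1 + v2        # dv3/dv1 = 1, dv3/dv2 = 1   (v3 is f)

# Reverse pass: accumulate adjoints, summing over children.
v3_bar = 1.0                                  # df/dv3
v2_bar = v3_bar * 1.0                         # df/dv2, via v3
v1_bar = v3_bar * 1.0 + v2_bar * np.exp(v1)   # v1 has two children: v3 and v2
x_bar = v1_bar * y                            # df/dx
y_bar = v1_bar * x                            # df/dy

print(v3, x_bar, y_bar)  # 2 + e^2, 2 + 2e^2, 1 + e^2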
Exercise 1¶
Consider the function $$f\left(w_{1}, w_{2}, w_{3}, w_{4}, w_{5}\right) = w_{1}w_{2}w_{3}w_{4}w_{5}$$ evaluated at the point $a = \left(2, 1, 1, 1, 1\right)$.
- Calculate the gradient using the reverse mode.
- You may want to start by drawing the graph to help you visualize.
- Set up an evaluation table. Note that this table can have the same columns as the example in class.
- Write out the reverse mode based on the evaluation table.
- Calculate the gradient using forward mode.
- You can use the same graph as in the reverse mode.
- Set up a forward evaluation table. Note that this table will have more columns than the one you created for the reverse mode.
- For both forward and reverse mode, calculate the number of operations.
- Hint: You may count only the floating point operations (e.g. addition and multiplication). You are not required to count memory access steps, retrievals, etc.
Observations¶
- The forward and reverse modes give the same answer.
- The forward mode calculation depends on the number of independent variables
- Note: In many applications, a function will depend on some independent variables and be parameterized by some other variables (e.g. $L\left(x;\theta\right)$).
- In many optimization problems, we treat $\theta$ as the independent variable. It is a matter of perspective.
- The reverse mode calculation does not depend on the number of independent variables.
What Reverse Mode Actually Computes¶
- Recall that the forward mode actually computes the Jacobian-vector product $$Jp.$$
- We noted that the full Jacobian could be computed by choosing $m$ seed vectors where $m$ is the number of independent variables, independent of $n$, the number of functions.
What Reverse Mode Actually Computes¶
- The reverse mode computes $$J^{T}p$$ where $J^{T}$ is the transpose of the Jacobian.
- If you want the full $J^{T}$ (equivalently, the full Jacobian), it can be formed with $n$ seed vectors, one per output function, independent of the number of independent variables (see the small example below).
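For intuition, using the demo function $f\left(x,y\right) = \begin{bmatrix} xy + \sin\left(x\right) \\ x + y + \sin\left(xy\right) \end{bmatrix}$ from the forward-mode section, seeding the reverse mode with $p = (1, 0)$ (a weight of one on the first output) recovers the gradient of $f_{1}$, i.e., the first column of $J^{T}$: $$J^{T}p = \begin{bmatrix} y + \cos\left(x\right) & 1 + y\cos\left(xy\right) \\ x & 1 + x\cos\left(xy\right) \end{bmatrix}\begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} y + \cos\left(x\right) \\ x \end{bmatrix} = \nabla f_{1}.$$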
Connection to Backpropagation¶
- Backpropagation is a special case of the reverse mode of automatic differentiation.
- The special case is:
- The objective function is a scalar function.
- The objective function represents an error between the output and a true value.
Some Take-Aways¶
- Automatic differentiation can be used to compute derivatives to machine precision of functions $f:\mathbb{R}^{m} \to \mathbb{R}^{n}$.
Some Take-Aways¶
- Forward mode is more efficient when $n\gg m$.
- This corresponds to the case where the number of functions to evaluate is much greater than the number of inputs.
- Actually computes the Jacobian-vector product $Jp$.
Some Take-Aways¶
- Reverse mode is more efficient when $n\ll m$.
- This corresponds to the case where the number of inputs is much greater than the number of functions.
- Actually computes the Jacobian-transpose-vector product $J^{T}p$.
Some Take-Aways¶
- Backpropagation is a special case of the reverse mode applied to scalar objective functions.
Applications and Extensions¶
So far, you've seen the mechanics of AD and the key concepts.
But there are many extensions and applications.
We will discuss just a few.
Extensions¶
- Higher order derivatives and mixed derivatives.
- e.g. $\nabla\nabla f$, $\dfrac{\partial^{2} f}{\partial x^{2}}$, $\dfrac{\partial^{2} f}{\partial x \partial y}$, etc.
- Hessians and beyond
- Efficient computation
- Smart graph storage (if possible)
- Writing parts of the graph to disk and holding others in memory
- Combining forward and reverse mode
- Exploiting sparsity in the Jacobian and/or Hessian
- Non-differentiable functions
- Differentiable programming
Applications¶
There are a huge number of applications for AD. Here is a sampling of a few.
Numerical Solution of Ordinary Differential Equations¶
- Numerical integration of "stiff" differential equations.
- This is usually accomplished using implicit discretization methods (e.g. Backward Euler).
- Implicit methods require the solution of a nonlinear system of equations.
- This system can be solved with Newton's method.
- Newton's method requires Jacobian-vector products (a small sketch follows below).
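As a small illustration of that last point (a sketch only; the starting guess and tolerance are arbitrary choices), here is a basic Newton iteration for the scalar function f defined at the top of the notebook. In higher dimensions the division by dfdx(x) becomes a linear solve with the Jacobian.
# Scalar Newton iteration using f and dfdx from the top of the notebook.
x = 0.1  # arbitrary starting guess
for k in range(50):
    step = f(x) / dfdx(x)
    x -= step
    if abs(step) < 1.0e-12:
        break
print(f"Approximate root: x = {x:.8f}, f(x) = {f(x):.2e}, iterations = {k + 1}")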
Optimization of Objective Functions¶
- Optimize an objective (a.k.a. loss, a.k.a. cost) function.
- This means to tune a set of parameters.
- Algorithms to accomplish this require derivatives of the objective wrt the parameters.
- There are many types of objective functions out there.
Solution of Linear Systems¶
- Many problems reduce to the solution of a linear system $Ax = b$.
- Iterative methods are powerful algorithms for solving such problems.
- Some iterative methods require derivative information (e.g. steepest descent, conjugate gradient, biconjugate gradient, etc.)