Much of this content is based on lecture slides from Professor David Barber at University College London; related resources can be found at www.cs.ucl.ac.uk/staff/D.Barber/brml
What is Autodiff?
Autodiff, or Automatic Differentiation, is a method of determining the exact derivative of a function with respect to its inputs. It is widely used in machine learning; in this post I will give an overview of what autodiff is and why it is a useful tool.
The above is not a very helpful definition on its own, so before going into how autodiff works we can first compare it to symbolic differentiation and numerical approximation.
Symbolic differentiation is what we do when we calculate derivatives by hand, i.e. given a function $f(x)$, we find a new function $f'(x)$. This is really useful when we want to know how a function behaves across all inputs. For example, if we had $f(x) = x^2 + 3x + 1$ we can find the derivative as $f'(x) = 2x + 3$, and then we can find the derivative of the function for all values of $x$. For example, for $x = 2$ we have $f'(2) = 7$.
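As a concrete illustration, here is a minimal sketch of symbolic differentiation using the SymPy library (my own choice of tool for the example, not something from the original discussion):

```python
# Symbolic differentiation with SymPy: we get back a new expression for f'(x).
import sympy as sp

x = sp.symbols('x')
f = x**2 + 3*x + 1

f_prime = sp.diff(f, x)      # symbolic derivative
print(f_prime)               # -> 2*x + 3
print(f_prime.subs(x, 2))    # evaluate the derivative at x = 2 -> 7
```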
Numerical approximations can come from the definition of the derivative: normally we take the limit as $\delta \to 0$ for the strict definition; for the numerical approximation we just take a small $\delta$ and calculate the approximation as follows:
$f'(x) \approx \frac{f(x + \delta) - f(x)}{\delta}$.
Using the same example as above at $x = 2$ with $\delta = 0.01$, we have $f'(2) \approx \frac{11.0701-11}{0.01} = 7.01$. This is pretty close to the exact value of $7$!
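The same calculation can be written as a short Python sketch (the helper name `numerical_derivative` is just for illustration):

```python
# One-sided finite-difference approximation of the derivative, as in the example above.
def f(x):
    return x**2 + 3*x + 1

def numerical_derivative(f, x, delta=0.01):
    # (f(x + delta) - f(x)) / delta
    return (f(x + delta) - f(x)) / delta

print(numerical_derivative(f, 2))   # approximately 7.01, close to the exact value 7
```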
We use autodiff instead of these two options because:
- It is computationally more efficient than symbolic differentiation
- Often we only need the derivative at a point rather than an expression that gives the derivative for the whole function
- It returns an exact value for the derivative
- Sometimes we will be looking at functions that include computational structures such as 'for' loops and 'if' statements. This makes getting a symbolic representation of the function as a whole a lot harder; breaking the function into steps allows for easier computation (see the sketch below).
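To make the last point concrete, here is a hedged sketch using PyTorch's autograd (one of the packages mentioned in the Applications section) on a toy function of my own with a loop and an if statement; autodiff only needs to differentiate the operations that were actually executed:

```python
import torch

def f(x):
    # A function with control flow: a single symbolic expression depends on
    # which branches run, but autodiff just differentiates the executed steps.
    total = x
    for _ in range(3):
        if total > 1:
            total = total * x
        else:
            total = total + x
    return total

x = torch.tensor(2.0, requires_grad=True)
y = f(x)
y.backward()        # autodiff through the operations that actually ran
print(x.grad)       # for x = 2 the branches taken give y = x**4, so dy/dx = 4*x**3 = 32
```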
Forward and reverse mode autodiff
For functions with many inputs and a single output (the typical machine learning setting), forward mode autodiff is a less computationally efficient method of derivative calculation than reverse mode autodiff. We will consider both briefly below.
What will become apparent, looking at the calculations, is that doing differentiation in this way will get you the same answers as a direct application of the Chain Rule. The reason for building a computational tree and for following the schedules outlined below is that we want to get the same answer as the Chain Rule but in a computationally efficient way.
What I want to do here is to give an example of a computational tree, which is an idea that we will leverage for the autodiff calculations. We will use this tree as a very simple example to demonstrate:
- how to do forward and reverse mode autodiff
- why reverse mode autodiff is typically faster
The above computational tree represents a nested function of three inputs. Let us define it as follows: the inputs $x_1$, $x_2$ and $x_3$ feed into an intermediate node $g = g(x_1, x_2, x_3)$, which feeds into $h = h(g)$, which produces the output $f = f(h)$, so that $f = f(h(g(x_1, x_2, x_3)))$.
For forward mode autodiff: if we want to find $\frac{\partial f}{\partial x_1}$, the derivative of $f$ with respect to $x_1$, we start at $x_1$ and step through the tree as follows:
$\frac{\partial g}{\partial x_1} \;\rightarrow\; \frac{\partial h}{\partial x_1} = \frac{\partial h}{\partial g}\frac{\partial g}{\partial x_1} \;\rightarrow\; \frac{\partial f}{\partial x_1} = \frac{\partial f}{\partial h}\frac{\partial h}{\partial x_1}$,
and then we are done. This is clearly the same answer as given by the Chain Rule. To find the values of $\frac{\partial f}{\partial x_2}$ and $\frac{\partial f}{\partial x_3}$ we would do the corresponding calculations for those two inputs.
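As a sketch, take the toy composition $g = x_1 x_2 + x_3$, $h = g^2$ and $f = 3h$ (stand-in choices of my own, not the functions from the tree above); a minimal dual-number class then carries a value and its derivative with respect to $x_1$ forward through the computation:

```python
# A minimal forward mode autodiff sketch using dual numbers.
class Dual:
    """A value together with its derivative with respect to one chosen input."""
    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule for the derivative part
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def g(x1, x2, x3):   # toy inner function: x1*x2 + x3
    return x1 * x2 + x3

def h(u):            # toy middle function: u**2
    return u * u

def f(v):            # toy outer function: 3*v
    return v * 3.0

# derivative of f(h(g(x1, x2, x3))) with respect to x1 at (1, 2, 3)
x1 = Dual(1.0, 1.0)  # seed: d x1 / d x1 = 1
x2 = Dual(2.0, 0.0)
x3 = Dual(3.0, 0.0)

out = f(h(g(x1, x2, x3)))
print(out.value, out.deriv)  # value = 3*(1*2+3)**2 = 75, derivative = 6*(x1*x2+x3)*x2 = 60
```

Every arithmetic operation applies the chain (and product) rule as it happens. To get the derivative with respect to $x_2$ or $x_3$ we would re-run the whole computation with the seed of 1.0 moved to that input, which is exactly the repeated work that reverse mode avoids.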
In the case where an input, say $x_1$, feeds into several functions $g_1, g_2, \ldots$ that are not nested (as shown below), we sum the contributions from each of these functions: $\frac{\partial f}{\partial x_1} = \sum_i \frac{\partial f}{\partial g_i}\frac{\partial g_i}{\partial x_1}$.
Reverse mode differentiation follows the exact same process, except backwards. The reason why this is a good idea can be seen when we look at the first computation graph. If we want the derivative of $f$ with respect to each of $x_1$, $x_2$ and $x_3$, starting from the top of the graph each time would result in some redundant calculations. The reason for this is that in all three cases we would need $\frac{\partial f}{\partial h}$ and $\frac{\partial h}{\partial g}$, therefore using forward mode autodiff would result in calculating these two partial derivatives three times each, instead of once each for reverse mode autodiff. In this case it is 4 extra calculations, which does not seem like a big deal, but if we had, say, 500 inputs into $g$, it might slow things down considerably.
So starting from the bottom of the graph and working upwards, we can get all three derivatives while doing fewer calculations. We can find the derivatives as follows:
$\frac{\partial f}{\partial h} \;\rightarrow\; \frac{\partial f}{\partial g} = \frac{\partial f}{\partial h}\frac{\partial h}{\partial g} \;\rightarrow\; \frac{\partial f}{\partial x_i} = \frac{\partial f}{\partial g}\frac{\partial g}{\partial x_i} \quad \text{for } i = 1, 2, 3$,
with the shared terms $\frac{\partial f}{\partial h}$ and $\frac{\partial h}{\partial g}$ each computed only once.
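Using the same toy composition as in the forward mode sketch above (again a stand-in of my own, not the function from the original tree), the reverse pass below computes the shared partial derivatives once and reuses them for all three inputs:

```python
# A minimal reverse mode sketch for the toy composition
# g = x1*x2 + x3, h = g**2, f = 3*h.
x1, x2, x3 = 1.0, 2.0, 3.0

# forward pass: evaluate and store the intermediate values
g = x1 * x2 + x3
h = g ** 2
f = 3 * h

# reverse pass: local partial derivatives, each computed exactly once
df_dh = 3.0              # d(3*h)/dh
dh_dg = 2 * g            # d(g**2)/dg
df_dg = df_dh * dh_dg    # shared by every input

# the shared factor df_dg is reused for all three inputs,
# rather than being recomputed once per input as in forward mode
df_dx1 = df_dg * x2      # dg/dx1 = x2
df_dx2 = df_dg * x1      # dg/dx2 = x1
df_dx3 = df_dg * 1.0     # dg/dx3 = 1

print(df_dx1, df_dx2, df_dx3)   # 60.0, 30.0, 30.0
```

Note that $\frac{\partial f}{\partial h}$ and $\frac{\partial h}{\partial g}$ are each evaluated once, no matter how many inputs feed into $g$.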
At the cost of some additional memory, reverse mode autodiff gives a faster, more efficient way of calculating exact derivatives of more complex computational routines.
Applications
For non-convex optimisation problems, a common method of optimisation is through gradient-descent-based algorithms. These rely on calculating gradients of exactly the kind of functions discussed above. For example, 'backpropagation', which is used in training deep neural networks, is a kind of reverse mode autodiff. Packages such as TensorFlow and PyTorch implement these kinds of calculations in order to train models and find optimal (in some sense) parameter values.
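As a hedged sketch of how this fits together, the snippet below uses PyTorch's reverse mode autodiff inside a hand-written gradient descent loop; the quadratic loss and the learning rate of 0.1 are arbitrary choices for illustration:

```python
import torch

# Toy gradient descent driven by reverse mode autodiff (backpropagation).
w = torch.tensor([2.0, -1.0], requires_grad=True)

for step in range(100):
    loss = ((w - torch.tensor([1.0, 3.0])) ** 2).sum()   # minimised at w = (1, 3)
    loss.backward()                                       # reverse mode autodiff
    with torch.no_grad():
        w -= 0.1 * w.grad                                 # gradient descent update
        w.grad.zero_()                                    # reset gradients for the next step

print(w)   # should be close to tensor([1., 3.])
```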