
Automatic Differentiation from scratch

Abhijit Mondal


In the previous part of this series, we looked at the L-BFGS optimization algorithm and implemented it from scratch in Python.

Implement L-BFGS optimization from scratch | by Abhijit Mondal | Jul, 2024 | Medium

One issue we found in that implementation was that we needed to explicitly define the gradient of the objective function. While this works for simple objective functions, for complicated functions computing the gradient by hand can be tedious and error-prone.

For dynamic pricing, the objective function can be quite complicated, with millions of parameters interacting with one another. In this post, we take a detour from L-BFGS and look at an automatic method for computing gradients without explicitly working out the derivatives.

Recall that, from the 1st order Taylor series approximation, we can write the following:

f(x + h) ≈ f(x) + h·f'(x)

The above approximation holds when h is very small, i.e. close to 0.

Then we can compute the derivative f' at some point x, using the above formula, as:

f'(x) ≈ (f(x + h) - f(x)) / h

Thus, we can use this strategy to find the derivative of the objective function f at some point x. But this approach has a few problems:

  1. It holds only in the limit h -> 0. For a finite h there is a truncation error from the dropped Taylor terms, and if h is made very small, floating point cancellation takes over: f(x+h) - f(x) can round to 0, making the computed f'(x) equal to 0 (see the sketch after this list).
  2. If the function f is expensive and requires lots of CPU cycles, we are evaluating f twice per derivative. If x is a vector of n parameters, computing the gradient for every parameter requires 2n function evaluations.
  3. Even if the error from a single evaluation is small, the errors can propagate and blow up. For example, in neural networks with thousands of layers, the error eventually propagates to all weights and all layers, leading to significant inaccuracies.
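
To see problem 1 in action, here is a small illustrative sketch (the helper forward_diff is just for this demo, not part of the implementation below): as h shrinks, the forward-difference estimate of the derivative of sin at x = 1 first improves, then degrades once cancellation in f(x+h) - f(x) dominates.

import math

def forward_diff(f, x, h):
    # finite-difference approximation of f'(x)
    return (f(x + h) - f(x))/h

exact = math.cos(1.0)
for h in [1e-2, 1e-6, 1e-10, 1e-14]:
    approx = forward_diff(math.sin, 1.0, h)
    print(f"h={h:.0e}  error={abs(approx - exact):.2e}")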

To overcome the above issues, researchers have come up with the “automatic differentiation” approach. There are 2 different types of automatic differentiation strategies:

Forward mode and Reverse mode.

The forward mode is the most intuitive and works as follows:

Consider a function that is a composition of multiple functions, for example f(g1(h1(x,y), h2(y)), g2(h3(x,y), h4(x))). Then, applying the chain rule, the partial derivatives of f w.r.t. x and y are:

df/dx = (df/dg1)·(dg1/dh1)·(dh1/dx) + (df/dg2)·(dg2/dh3)·(dh3/dx) + (df/dg2)·(dg2/dh4)·(dh4/dx)
df/dy = (df/dg1)·(dg1/dh1)·(dh1/dy) + (df/dg1)·(dg1/dh2)·(dh2/dy) + (df/dg2)·(dg2/dh3)·(dh3/dy)

If we draw the computation graph for the above function:

Then we can obtain the partial derivatives of f w.r.t. x and y by backtracking from f in the above diagram to the corresponding nodes x and y. Note that we have to backtrack across all paths that lead to either x or y.

The backtracking paths are shown in red in the below diagram for partial derivative w.r.t. x.

For each red edge from a node a to a node b, take the derivative da/db; multiply these derivatives along each path, and then sum over all paths. This yields exactly the partial derivative formulae described above.

Thus, we can either use recursion or a topological sort of the computational graph to calculate the partial derivatives.

For example, using recursion we can write G(f,x) = (df/dg1)·G(g1,x) + (df/dg2)·G(g2,x), where G(f,x) is the partial derivative of f w.r.t. x, or in general:

G(f,x) = sum over the direct children g of f of (df/dg)·G(g,x), with G(x,x) = 1 and G(u,x) = 0 if u does not depend on x.

Another way to represent the above partial derivatives is through Jacobian matrices. Define the Jacobian of each layer of the composition:

Jh = ∂(h1,h2,h3,h4)/∂(x,y), Jg = ∂(g1,g2)/∂(h1,h2,h3,h4), Jf = ∂f/∂(g1,g2)

Then we can obtain the partial derivatives df/dx and df/dy simply by multiplying the above Jacobians: [df/dx df/dy] = Jf·Jg·Jh.

Let’s look at the time complexity.

Let us assume that Jh is a matrix of size m1 x n (where n is the number of inputs; here n = 2, i.e. x and y).

Let the dimension of Jg be m2 x m1 (the number of columns is m1 because g is a function of h, and h has m1 components).

Similarly, Jf has m2 columns but only 1 row, so its dimension is 1 x m2.

In forward mode automatic differentiation, the Jacobian Jh is calculated first (due to the recursion in the computational graph), then Jg and then Jf, so the multiplication effectively happens from right to left. The time complexity is therefore:

O(m2*m1*n + 1*m2*n) = O(n³), assuming m1 = m2 = O(n).

If instead we had done the matrix multiplications from left to right, the time complexity would have been:

O(1*m2*m1 + 1*m1*n) = O(n²)

This is only possible if we compute the Jacobian Jf first, then Jg and lastly Jh, which requires us to carry out the differentiation in the reverse order. This is known as the backward or reverse mode of automatic differentiation.
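
A quick way to convince yourself of the cost difference is to multiply random Jacobians of the assumed shapes in both orders with numpy (the shapes below are made up for illustration). Both orders give the same 1 x n gradient, but the left-to-right (reverse mode) order never materialises the large m2 x n intermediate product.

import numpy as np

n, m1, m2 = 500, 500, 500
Jh = np.random.rand(m1, n)    # Jacobian of (h1..h4) w.r.t. the inputs
Jg = np.random.rand(m2, m1)   # Jacobian of (g1, g2) w.r.t. h
Jf = np.random.rand(1, m2)    # Jacobian of f w.r.t. g

forward_order = Jf @ (Jg @ Jh)   # right to left: O(m2*m1*n) + O(m2*n)
reverse_order = (Jf @ Jg) @ Jh   # left to right: O(m2*m1) + O(m1*n)

print(np.allclose(forward_order, reverse_order))   # True: same gradient, different cost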

An efficient way to calculate the Jacobians in reverse mode is to first evaluate each function for the given inputs, cache the results, and then reuse them during differentiation. For example, given x and y as inputs in the function above, calculate the following in order:

  1. h1, h2, h3 and h4
  2. g1 and g2
  3. f

Let us call it the forward pass.

Once we cache the results of these function calls, we can use them directly in the partial derivatives. For example, df/dg1 is simply the derivative of f w.r.t. g1, evaluated at the value of g1 computed in the forward pass.

Something similar happens in neural networks, where you can think of (x, y) as the input layer, (h1, h2, h3, h4) as the 1st hidden layer, (g1, g2) as the next hidden layer and f as the output unit. We first make a forward pass with x and y and then do backpropagation in the reverse order.

Thus the reverse mode of automatic differentiation is essentially the backpropagation algorithm.
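
To make the connection concrete, here is a minimal reverse-mode sketch, separate from the Node class used in the rest of this post (names like RNode and backward are mine, chosen for illustration): the forward pass caches each node's value together with the local derivatives w.r.t. its parents, and a backward sweep multiplies local derivatives along every path from the output and sums them, exactly as in the path-sum description above.

import math

class RNode:
    def __init__(self, value, parents=()):
        self.value = value        # cached result of the forward pass
        self.parents = parents    # list of (parent_node, local_derivative) pairs
        self.adjoint = 0.0        # accumulates d(output)/d(this node)

def rlog(x):    return RNode(math.log(x.value), [(x, 1.0/x.value)])
def rsin(x):    return RNode(math.sin(x.value), [(x, math.cos(x.value))])
def rmul(x, y): return RNode(x.value*y.value, [(x, y.value), (y, x.value)])

def backward(output):
    # propagate contributions edge by edge: product of local derivatives
    # along each path to the output, summed over all paths
    stack = [(output, 1.0)]
    while stack:
        node, upstream = stack.pop()
        node.adjoint += upstream
        for parent, local_grad in node.parents:
            stack.append((parent, upstream*local_grad))

x = RNode(0.5)
f = rmul(x, rsin(rlog(x)))    # f(x) = x*sin(ln x)
backward(f)
print(x.adjoint)              # f'(x) = sin(ln x) + cos(ln x)

This path-by-path propagation is the simplest thing that works; production implementations instead traverse the graph in reverse topological order so that each node is processed exactly once.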

First, let's start by implementing the forward mode of automatic differentiation. Then we will gradually work our way towards the reverse mode.

But the question is how to form the computational graph given an arbitrary objective function. Let's take an example:

f(x) = exp(1 + x⁴·(cos(ln x) + sin(ln x)))

Here we have only a single input variable x. One way to break this complicated-looking function into simpler intermediate functions is:

f1 = ln(x), f2 = cos(f1), f3 = sin(f1), f4 = f2 + f3, f5 = x⁴, f6 = f5·f4, f7 = exp(1 + f6)

The computational graph will look as follows:

Note that after breaking the function f into the above computational graph, we end up with much simpler functions such as sin, cos, ln, exp and x^k, apart from addition and multiplication.

Computing the derivatives of these simpler functions is straightforward; moreover, we can explicitly define the derivatives for them, since only a handful of basic operations are needed:

sin, cos, tan, exp, pow, log should cover most useful objective functions.
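
For reference, these are the standard derivative rules the implementation below encodes, where u and v are arbitrary sub-expressions and u', v' their derivatives w.r.t. the input:

(sin u)' = cos(u)·u'
(cos u)' = -sin(u)·u'
(ln u)' = u'/u
(e^u)' = e^u·u'
(u^k)' = k·u^(k-1)·u' (for a constant exponent k, which is how pow is handled below)
(u + v)' = u' + v', (u - v)' = u' - v'
(u·v)' = u·v' + v·u'
(u/v)' = u'/v - u·v'/v²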

Here is a simple Python implementation of automatic differentiation for a single variable from scratch:

from functools import wraps
import math

# Automatic differentiation of exp(1 + x^4*(cos(log(x)) + sin(log(x))))

class Node:
    def __init__(self):
        self.function = None
        self.d_function = None

    def __add__(self, other):
        return add(self, other)

    def __sub__(self, other):
        return sub(self, other)

    def __mul__(self, other):
        return mul(self, other)

    def __truediv__(self, other):
        return div(self, other)

    def __call__(self, *args):
        return self.function(*args)

class Variable:
    def __init__(self):
        pass

    def __new__(cls):
        cls.node = Node()
        cls.node.function = lambda z: z
        cls.node.d_function = lambda: Constant(1.0)
        return cls.node

class Constant:
    def __init__(self, val):
        pass

    def __new__(cls, val):
        cls.node = Node()
        cls.node.function = lambda z: val
        cls.node.d_function = lambda: Constant(0.0)
        return cls.node

def my_decorator(f):
    @wraps(f)
    def wrapper(*args, **kwds):
        node = Node()
        new_args = []

        for x in args:
            if isinstance(x, Node):
                new_args += [x]
            else:
                raise Exception("Invalid input parameter !!!")

        evaluate, derive = f(*new_args, **kwds)

        node.function = evaluate
        node.d_function = derive

        return node

    return wrapper

@my_decorator
def cos(x):
    def evaluate(p):
        return math.cos(x.function(p))
    def derive():
        return Constant(-1.0)*sin(x)*grad(x)
    return evaluate, derive

@my_decorator
def sin(x):
    def evaluate(p):
        return math.sin(x.function(p))
    def derive():
        return cos(x)*grad(x)
    return evaluate, derive

@my_decorator
def log(x):
    def evaluate(p):
        return math.log(x.function(p))
    def derive():
        return (Constant(1.0)/x)*grad(x)
    return evaluate, derive

@my_decorator
def exp(x):
    def evaluate(p):
        return math.exp(x.function(p))
    def derive():
        return exp(x)*grad(x)
    return evaluate, derive

@my_decorator
def pow(x, y):
    def evaluate(p):
        return math.pow(x.function(p), y.function(p))
    def derive():
        # assumes the exponent y is a constant
        return y*pow(x, y-Constant(1.0))*grad(x)
    return evaluate, derive

@my_decorator
def add(x, y):
    def evaluate(p):
        return x.function(p) + y.function(p)
    def derive():
        return grad(x)+grad(y)
    return evaluate, derive

@my_decorator
def sub(x, y):
    def evaluate(p):
        return x.function(p) - y.function(p)
    def derive():
        return grad(x)-grad(y)
    return evaluate, derive

@my_decorator
def mul(x, y):
    def evaluate(p):
        return x.function(p) * y.function(p)
    def derive():
        return x*grad(y) + y*grad(x)
    return evaluate, derive

@my_decorator
def div(x, y):
    def evaluate(p):
        return x.function(p) / y.function(p)
    def derive():
        # quotient rule: (x/y)' = x'/y - x*y'/y^2
        return Constant(-1.0)*x*pow(y, Constant(-2.0))*grad(y) + pow(y, Constant(-1.0))*grad(x)
    return evaluate, derive

@my_decorator
def grad(f):
    def evaluate(p):
        return f.d_function().function(p)
    def derive():
        return f.d_function().d_function()
    return evaluate, derive

def standard_fn(x):
    f1 = math.log(x)
    f2 = math.cos(f1)
    f3 = math.sin(f1)
    f4 = f2 + f3
    f5 = math.pow(x, 4.0)
    f6 = f5*f4
    f7 = math.exp(1.0+f6)
    return f7

def grad_numerical(f, x, h=1e-6):
    return (f(x+h)-f(x))/h

def grad_ad(x):
    y = Variable()
    f = exp(Constant(1.0) + pow(y, Constant(4.0))*(cos(log(y)) + sin(log(y))))
    return grad(f)(x)

if __name__ == '__main__':
    y = Variable()
    z = pow(y, Constant(5.0))
    assert z(2) == 32
    assert grad(z)(3) == 405
    assert grad(grad(z))(4) == 1280
    assert grad(grad(grad(z)))(5) == 1500

    x = 0.679
    ad = grad_ad(x)
    print(f"Autograd = {ad}")
    nm = grad_numerical(standard_fn, x)
    print(f"Numerical grad = {nm}")
    ex = math.exp(1.0 + (x**4)*(math.cos(math.log(x)) + math.sin(math.log(x))))*(4*(x**3)*(math.cos(math.log(x)) + math.sin(math.log(x))) + ((x**4)*(-math.sin(math.log(x))/x + math.cos(math.log(x))/x)))
    print(f"Exact grad = {ex}")

The idea is as follows:

We wrap the conventional Python math functions inside Node objects. Each Node object has a “function” attribute and a “d_function” attribute. The “function” callable evaluates the corresponding math function at some value p, while “d_function” returns a Node object corresponding to the derivative.

Taking as an example the function f(x) = sin(e^x + x³) + x²·cos(1 + ln(e^x + x³)), the corresponding representation in our implementation would be:

x = Variable()
f = sin(exp(x) + pow(x, Constant(3.0))) + \
    pow(x, Constant(2.0))*cos(Constant(1.0) + log(exp(x) + pow(x, Constant(3.0))))

Calling one of these functions, such as sin(f(x)), cos(f(x)), pow(f(x), g(x)) or log(f(x)), first invokes the ‘my_decorator’ decorator, which creates a Node object and checks whether the input arguments are Node objects. Each argument must be a Node object, whether it is another function, a variable or a constant.

After checking that all arguments are Node objects, the decorator runs the body of the called function, takes the returned “evaluate” and “derive” callables, and attaches them to the new Node object as its “function” and “d_function”.

The above implementation for the given function may not be the most efficient: note that the term (e^x + x³) occurs twice in the function f(x).

Instead of creating 2 different Node objects for (e^x + x³), we could have created a single Node object and reused it, just as in the computational graph where x3 is used by both x4 and x6 (see the small sketch below). Moreover, we could cache the results of the “function” call at each Node.
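
For example, with the same API, the shared sub-expression could be built once and reused by both branches (a small sketch; the result is unchanged, we simply avoid duplicating the (e^x + x³) subgraph):

x = Variable()
t = exp(x) + pow(x, Constant(3.0))    # build (e^x + x^3) once
f = sin(t) + pow(x, Constant(2.0))*cos(Constant(1.0) + log(t))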

Now let’s update the code to handle a vector of parameters and compute partial derivatives w.r.t. one of the parameters at a time. Also, adding caching of function results.

from functools import wraps
import math, numpy as np

class Node:
    def __init__(self):
        self.function = None
        self.d_function = None
        self.id = None
        self.cached_function = {}

    def __add__(self, other):
        return add(self, other)

    def __sub__(self, other):
        return sub(self, other)

    def __mul__(self, other):
        return mul(self, other)

    def __truediv__(self, other):
        return div(self, other)

    def __call__(self, *args):
        return self.function(*args)

class Variable:
    def __init__(self, id):
        pass

    def __new__(cls, id):
        cls.node = Node()
        cls.node.id = id
        cls.node.function = lambda z: z[id,0] if id < len(z) else 0
        cls.node.d_function = lambda j: Constant(1.0) if j == id else Constant(0.0)
        return cls.node

class Constant:
    def __init__(self, val):
        pass

    def __new__(cls, val):
        cls.node = Node()
        cls.node.function = lambda z: val
        cls.node.d_function = lambda j: Constant(0.0)
        return cls.node

def my_decorator(f):
    @wraps(f)
    def wrapper(*args, **kwds):
        node = Node()
        new_args = []

        for i in range(len(args)):
            # the last argument of grad is the parameter index, not a Node
            if i == len(args)-1 and f.__name__ == 'grad':
                new_args += [args[i]]
            else:
                if isinstance(args[i], Node):
                    new_args += [args[i]]
                else:
                    raise Exception("Invalid input parameter !!!")

        evaluate, derive = f(*new_args, **kwds)

        node.function = evaluate
        node.d_function = derive

        return node

    return wrapper

def get_cached_function(x, p):
    q = tuple(p.reshape((1, p.shape[0]))[0].tolist())

    if q not in x.cached_function:
        x.cached_function[q] = x.function(p)
    return x.cached_function[q]

@my_decorator
def cos(x):
    def evaluate(p):
        z = get_cached_function(x, p)
        return math.cos(z)
    def derive(j):
        return Constant(-1.0)*sin(x)*grad(x,j)

    return evaluate, derive

@my_decorator
def sin(x):
    def evaluate(p):
        z = get_cached_function(x, p)
        return math.sin(z)
    def derive(j):
        return cos(x)*grad(x,j)

    return evaluate, derive

@my_decorator
def log(x):
    def evaluate(p):
        z = get_cached_function(x, p)
        return math.log(z)
    def derive(j):
        return (Constant(1.0)/x)*grad(x,j)

    return evaluate, derive

@my_decorator
def exp(x):
    def evaluate(p):
        z = get_cached_function(x, p)
        return math.exp(z)
    def derive(j):
        return exp(x)*grad(x,j)

    return evaluate, derive

@my_decorator
def pow(x, y):
    def evaluate(p):
        z = get_cached_function(x, p)
        w = get_cached_function(y, p)
        return math.pow(z, w)
    def derive(j):
        # assumes the exponent y is a constant
        return y*pow(x, y-Constant(1.0))*grad(x,j)

    return evaluate, derive

@my_decorator
def add(x, y):
    def evaluate(p):
        z = get_cached_function(x, p)
        w = get_cached_function(y, p)
        return z + w
    def derive(j):
        return grad(x,j)+grad(y,j)

    return evaluate, derive

@my_decorator
def sub(x, y):
    def evaluate(p):
        z = get_cached_function(x, p)
        w = get_cached_function(y, p)
        return z - w
    def derive(j):
        return grad(x,j)-grad(y,j)

    return evaluate, derive

@my_decorator
def mul(x, y):
    def evaluate(p):
        z = get_cached_function(x, p)
        w = get_cached_function(y, p)
        return z * w
    def derive(j):
        return x*grad(y,j) + y*grad(x,j)

    return evaluate, derive

@my_decorator
def div(x, y):
    def evaluate(p):
        z = get_cached_function(x, p)
        w = get_cached_function(y, p)
        return z / w
    def derive(j):
        # quotient rule: (x/y)' = x'/y - x*y'/y^2
        return Constant(-1.0)*x*pow(y, Constant(-2.0))*grad(y,j) + pow(y, Constant(-1.0))*grad(x,j)

    return evaluate, derive

@my_decorator
def grad(f,j):
    def evaluate(p):
        return f.d_function(j).function(p)
    def derive(k):
        return f.d_function(j).d_function(k)
    return evaluate, derive

if __name__ == '__main__':
    x = np.array([[Variable(0)], [Variable(1)]])
    f = cos(pow(x[0,0],Constant(3.0)) + pow(x[1,0],Constant(2.0)) + Constant(2.0)*x[0,0]*x[1,0])
    inp = np.array([[0.543], [0.678]])

    # forward pass to cache the function values
    f(inp)
    assert grad(grad(f, 1), 0)(inp) == -3.1197960007398184

grad(grad(f, 1), 0) first computes the partial derivative w.r.t. y (parameter index 1) and then w.r.t. x (index 0), i.e. it evaluates

d²f/dxdy = -cos(x³ + y² + 2xy)·(3x² + 2y)·(2y + 2x) - 2·sin(x³ + y² + 2xy)

at the point [0.543, 0.678].

When we compare the gradients computed using automatic differentiation against the numerical method described earlier, the result from automatic differentiation is much closer to the true value; its error is essentially zero, limited only by floating point round-off.
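
For example, reusing the ad, nm and ex values printed by the single-variable script above, the absolute errors can be compared directly:

print(f"AD error        = {abs(ad - ex):.2e}")   # limited only by floating point round-off
print(f"Numerical error = {abs(nm - ex):.2e}")   # limited by the choice of h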

Here is the updated L-BFGS code (introduced in the previous post of this series), using automatic differentiation in place of the explicit gfn() method described earlier:

import numpy as np
from collections import deque

# Uses the Variable, Constant, grad, pow etc. primitives defined above.

def p_grad(f, n):
    def eval(x0):
        out = []
        for i in range(n):
            out += [[grad(f, i)(x0)]]

        return np.array(out)
    return eval

def fn(x):
    # Compute the optimization function
    return pow(Constant(1.0)-x[0,0], Constant(2.0)) + Constant(100.0)*pow((x[1,0]-pow(x[0,0], Constant(2.0))), Constant(2.0))

def line_search(x, g, d, fnode, gnode, a0=1e-5, e=1e-6):
    # Perform line (binary) search to find optimal value of a
    left, right = 0.0, 1.0
    a, c1, c2 = a0, 1e-4, 0.9

    while left < right:
        m = (left + right)/2

        x_new = x - m*d
        f_new = fnode(x_new)
        g_new = gnode(x_new)

        # sufficient decrease condition
        if f_new-fnode(x) < -c1*m*np.dot(g.T, d):
            # curvature condition
            if abs(np.dot(d.T, g_new)) < c2*abs(np.dot(d.T, g)):
                return m

            a = m
            left = m + e
        else:
            right = m - e

    return a

def optimize(x0, k=10, epsilon=1e-7, max_iter=1e5):
    # l-bfgs optimization algorithm
    m = len(x0)
    v = np.array([[Variable(i)] for i in range(m)])

    fnode = fn(v)
    gnode = p_grad(fnode, m)

    x = x0
    f = fnode(x)
    g = gnode(x)

    # Store the last k values of y and s vectors
    prev_values = deque([(x, g)])

    while max_iter > 0:
        print(f"Current objective value = {f}")
        max_iter -= 1

        prev_x = x
        prev_g = g

        q = g
        for i in range(len(prev_values)-1, -1, -1):
            y, s = prev_values[i]
            r = 1.0/np.dot(y.T, s)
            q = q - r*np.dot(y.T, q)*s

        for i in range(len(prev_values)):
            y, s = prev_values[i]
            r = 1.0/np.dot(y.T, s)
            q = q - r*np.dot(s.T, q)*y

        p = 0
        for i in range(len(prev_values)-1, -1, -1):
            w = g

            for j in range(len(prev_values)-1, -1, -1):
                if j > i:
                    y, s = prev_values[j]
                    r = 1.0/np.dot(y.T, s)
                    w = w - r*np.dot(y.T, w)*s
                else:
                    break

            y, s = prev_values[i]
            r = 1.0/np.dot(y.T, s)
            w = r*np.dot(s.T, w)*s

            for j in range(len(prev_values)):
                if j < len(prev_values)-1-i:
                    y, s = prev_values[j]
                    r = 1.0/np.dot(y.T, s)
                    w = w - r*np.dot(s.T, w)*y
                else:
                    break
            p += w

        # Descent direction
        d = p+q

        # Perform line search
        a = line_search(x, g, d, fnode, gnode)

        x = x - a*d
        f = fnode(x)
        g = gnode(x)

        y = x-prev_x
        s = g-prev_g

        prev_values.append((y, s))

        # Maintain only latest k values
        if len(prev_values) > k:
            prev_values.popleft()

        # Exit condition
        if np.all(np.abs(y) < epsilon):
            break

    return x

if __name__ == '__main__':
    x0 = np.array([[0.003], [0.006]])
    x = optimize(x0, k=10, epsilon=1e-7)
    print(x)

Two popular libraries for automatic differentiation are autograd and JAX. I would encourage the reader to try out JAX for the above L-BFGS implementation.
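
As a starting point, a minimal JAX sketch for the gradient of the same Rosenbrock objective used in fn() above could look like this (jax.grad performs reverse-mode automatic differentiation, so no hand-written gradient is needed):

import jax
import jax.numpy as jnp

def rosenbrock(x):
    return (1.0 - x[0])**2 + 100.0*(x[1] - x[0]**2)**2

grad_fn = jax.grad(rosenbrock)
print(grad_fn(jnp.array([0.003, 0.006])))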

In the next post, we will dive deeper into the L-BFGS algorithm for problems involving constraints and look at L-BFGS-B, i.e. L-BFGS with bound constraints.

References:

  1. Automatic differentiation (mblondel.org)
  2. HIPS/autograd: Efficiently computes derivatives of numpy code. (github.com)
  3. Quickstart — JAX documentation
  4. 52_griewank-andreas-b.pdf (gwdg.de)
  5. 17-468.pdf (jmlr.org)
  6. CSC321 Lecture 10: Automatic Differentiation (toronto.edu)
