In this blog post, I'll be revisiting Chain Rule as an essential concept behind Machine Learning and will explain how it works for both single and multivariable functions. Let's get started...

Fundamentals: functions, differentiability and compostion of functions.

Function : Think of a function $f$ as a mapping between the domain of a set, say $X$, to the codomain $Y$ of another set. Such a function is represented as $f : X \rightarrow Y$.

If with this $f$, it's given that $g : Y \rightarrow Z$, then $g \circ f : X \rightarrow Y$ isn't just gof, rather it's known as the composition of $g$ and $f$. This composition is defined $\forall x \in X$ as :

$$ (g \circ f)(x)=g(f(x)) $$

In the case of Single variable calculus, $f$ and $g$ are real valued functions with their Domains and Co-domains both belonging to $\mathbb{R}$.

Example :

  1. Let, $g(x) = x^{3}$ and $f(x) = x - 3$, then the composition gof is : $$(g \circ f)(x)=g(f(x))=g(x-3)=(x-3)^{3}$$

Note: Compositions work the other way around too, like you can also find $f \circ g$ but we shouldn't assume compositions to be commutative. Compositions are in fact associative.

Differntiability : I'll not go into much detail about this here. If you feel like refreshing this concept, head out to Khan Academy here to brush up your basics.

A function $f$ is differentiable at a point, say $a$ if the following limit exists

$$\lim _{h \rightarrow 0} \frac{f\left(a+h\right)-f\left(a\right)}{h}$$

This particular limit is also an expression for the derivative of function $f$ at point $a$. Mostly, instead of writing this lenghty limit, we express this derivative as $f^{'}(a)$ or $\frac{d}{d x} f\left(a\right)$. This particular point can be any point on the real line and the value of the derivative is also a real number.

Chain Rule for Single Variable Functions

It states that if $g$ is differentiable at $a$ and $f$ is differentiable at $g(a)$, then the composition $f \circ g$ is differentiable at $a$ and the value of its derivative is $$(f \circ g)^{\prime}\left(a\right)=f^{\prime}\left(g\left(a\right)\right) g^{\prime}\left(a\right)$$

It's proof is best left to the people at Khan Academy. However, there is one small error I noticed in a majority of the proofs online. All of them contain division by a quantity which might be zero. This is often overlooked and doesn't really make a difference, but for the sake of mathematical rigor if you wish to see the complicated version of the proof, please refer to Thomas Calculus'.

Examples :

  1. Take $f(x) = (x-2)^{2}$. Now try to think (and split) this function as a composition of two or more functions. Here, we can take $g(x) = x-2$ and $h(x) = x^{2}$. This way, $f(x)$ can be represented as $h\circ g$ and the derivative of this composition is $h^{\prime}(g(x)) g^{\prime}(x), \text { or } 2(x-2) \text { since } g^{\prime}(x)=1$. The derivative of $w$ exists at all points.

  2. Take $f(x)=\cos \left[(x-2)^{3}\right]$. After analysing, we observe that $f$ is a composition of 3 other functions: $$\begin{aligned} g(x) &=x-2 \\ h(x) &=x^{3} \\ i(x) &=\cos (x) \end{aligned}$$ As discussed, compositon of functions is associative. $f$ can be expressed in one of two associative ways: $i\circ (h\circ g)$ or $(i\circ h)\circ g$. Going with the first one, we have $h\circ g$ from the previous example, so :

$$ \begin{aligned} \frac{d f(x)}{d x}=\frac{d i(h(g(x)))}{d x} &=i^{\prime}(h(g(x))) h(g(x))^{\prime}(x) \\ &=-\sin (w(g(x))) 3(x-2) \\ &=-3 \sin \left[(x-2)^{3}\right](x-2)^{2} \end{aligned} $$

The Chain Rule in Programming

As we saw earlier, the chain rule can be used several times in a single calculation and it's this perk which makes chain rule such a powerful methjod for computing derivative of even every complex functions through a computational method. A function can simply be broken into simpler compositions and the chain rule be applied ever until we reach the base case of 1.

To understand, let's take the softmax function as an example. Softmax is a crucial function in Machine Learning and finding it's derivative is a step in backpropagation in Neural Networks where it is mostly used as the activation function of the final layer to get the output of the model as a probability distribution. If you have appropriate background, I suggest looking at Softmax Regression (C2W3L08) by Andrew Ng.


Here, we'll start drawing an equivalence between the mathematical part and the computational part to get a better intution.

import numpy as np
def sigmoid(z):
    """the sigmoid activation function"""
    return 1/(1+np.exp(-1*x))

This might seem extremely similar to the mathematical function defined but this is all due to pseudocode-like syntax of Python. When interpreted, this computation is done step-by-step. Assuming we can only do one computation operation at every step, the code becomes :

import numpy as np
def sigmoid(z):
    a = -x
    b = np.exp(a)
    c = 1 + b
    d = 1 / c
    return v

This can be equivalently written in this notation $S^{\prime}=d^{\prime}(c) c^{\prime}(b) b^{\prime}(a) a^{\prime}(x)$. Calculating this derivative is same as calculating derivative of the composition $d \circ(c \circ(b \circ a))$ and is really simple since the constituent functions' derivative is trivial.

$$ \begin{aligned} S^{\prime} &=d^{\prime}(c) c^{\prime}(b) b^{\prime}(a)(-1) \\ &=d^{\prime}(c) c^{\prime}(b) e^{-x}(-1) \\ &=d^{\prime}(c)(1) e^{-x}(-1) \\ &=\frac{-1}{\left(1+e^{-x}\right)^{2}} e^{-x}(-1) \\ &=\frac{e^{-x}}{\left(1+e^{-x}\right)^{2}} \end{aligned} $$

Now, the exciting part and the intution strikes! If every function (even a complex one) can be broken down into trivial functions, then, using the chain rule we can find the derivative of any function computable by a program!!! Isn't this awesome.

And this, my friends, is the basic of a technique known as Automatic Differentiation on which all of scientific computing and deep learning is based. As most of you would've heard about Backpropagation: The miraculous algorithm which makes the computer learn, understand and predict stuff; is the most notable use case of Automatic Differentiation and form the foundation of modern machine learning.

Fun Fact: The modern Deep Learning Derby b/w PyTorch and Tensorflow essentially comes down to the different ways in which both these pacakges do their automatic differentiation on a computational graph.

Tip: To learn more, read this from CSC321@UToronto.

The Multivariate Chain Rule

So far we've dealt with functions that map from $n$ to $m$ dimensions: $f: \mathbb{R}^{n} \rightarrow \mathbb{R}^{m}$. Since every output of $f$ can be considered a separate function dependent on $n$ variables, we can also think about using matrices and vectors.

Multivariate Notation

  • Consider outputs of a function $f$ to be numbered from 1 to m as $f_{1}, f_{2} \ldots f_{m}$. For all these outputs, we can calculate the partial derivative by any of n inputs as : $$D_{j} f_{i}(a)=\frac{\partial f_{i}}{\partial a_{j}}(a)$$ where $j$ goes from 1 to $n$, $a$ is a vector with $n$ elems. When $f$ is differentiable at $a$, then the derivative is expressed as the Jacobian Matrix: a matrix of all first-order partial derivatives of the function.
$$ D f(a)=\left[\begin{array}{ccc} \frac{\partial f_{1}}{\partial a_{1}}(a) & \cdots & \frac{\partial f_{1}}{\partial a_{n}}(a) \\ \vdots & & \vdots \\ \frac{\partial f_{m}}{\partial a_{1}}(a) & \cdots &\frac{\partial f_{m}}{\partial a_{n}}(a) \end{array}\right] $$

The Rule

The Multivariate Chain Rule states that: given a function $f: \mathbb{R}^{m} \rightarrow \mathbb{R}^{p}$ and $g: \mathbb{R}^{n} \rightarrow \mathbb{R}^{m}$ along with an arbitrary point $a$ in belonging to Real Numbers, if $g$ is differentiable at $a$ and $f$ is differentiable at $g(a)$;

Then $f \circ g$ is differentiable at $a$ and the derivative is $$D(f \circ g)(a)=D f(g(a)) \cdot D g(a)$$

This is also the product of matrix multiplication of $D f(g(a))$ and $D g(a)$ (try verifying the dimensions to cross-check the validity of the multiplication).

Intuitively, the multivariate chain rule is identical to the single variable one (the latter being nothing more than a special case of the former) with derivatives replaced by derivative matrices. We know From Linear Algebr how to represent Linear Transformations by Matrices, and how the composition of two linear transformations is the product of their matrices.

Thus, since derivative matrices - like derivatives in one dimension - are a linear approximation to the function, the chain rule makes complete sense. This is an amazing connection between Linear Algebra and Calculus. Again, the proof is best left to the experts at MIT (below) & this document from LSU.

Applying Multivariate Chain Rule


Let's take function $f$ as a scalar function $\mathbb{R}^{3} \rightarrow \mathbb{R}$ representing $f(x, y, z)$ giving the weather at some point in a 3D grid. Obviously, such a function would be a very complex function to write (or even think about) but just assume it as a simple function for the sake of this thought experiement.

Now, imagine moving through this 3D space on a curve defined by say, $g: \mathbb{R} \rightarrow \mathbb{R}^{3}$, which takes time and gives your spatial coordinates $x(t), y(t), z(t)$ at a particular point in the 3D space.

Now we want to know what the weather is like at that particular point and how it changes as a funtion of $t$. Since, here weather isn't a direct dependence on $t$, but a function of location and in turn location is a function of time. And just like that, we get a composition $f\circ g$ as the solution to our thought experiment.

Single Variate Approach:

$$g(t)=\left(\begin{array}{c} t \\ t^{2} \\ t^{3} \end{array}\right)$$

and say $f$ is : $$f\left(\begin{array}{l} x \\ y \\ z \end{array}\right)=x^{2}+5y+10z+x y z$$

Now, if we rewrite x, y, z as function of $t$, we get : $$f(x(t), y(t), z(t))=x(t)^{2}+x(t) y(t) z(t)+5 y(t) + 10 z(t)$$

Composing $f$ with $g$ gives us, $$(f \circ g)(t)=f(g(t))=f\left(t, t^{2}, t^{3}\right)=t^{2}+t^{6}+5 t^{2}+10 t^{3}=6 t^{2}+t^{6}+10 t^{3}$$ and the derivative is easily found out to be $(f \circ g)^{\prime}(t)=12 t+6 t^{5}+30 t^{2}$

Multi-variate Approach:

Let's try the same thing now but using Multivariate Chain Rule. Thus to compute $D(f \circ g)(t)$ we need $D f(g(t))$ and $D g(t)$. Starting with $D f(g(t))$, let's find $D f(x, y, z)$. Since it's a mapping from $\mathbb{R}^{3} \rightarrow \mathbb{R}$, the Jacobian Matrix is $1x3$ or simply, a row vector as :

$$D f(x, y, z)=\left[\begin{array}{lll} 2 x+y z & x z+5 & x y + 10 \end{array}\right]$$

But for applying the chain rule we need $D f(g(t))$ (see $g(t)$ from the single variate example) $$ D f(g(t))=\left[\begin{array}{lll} 2 t+t^{5} & t^{4}+5 & t^{3} + 10 \end{array}\right] $$

Now we need to find $D g(t) \cdot g(t)$ maps $\mathbb{R} \rightarrow \mathbb{R}^{3},$ so its Jacobian is a 3x1 matrix, or column vector: $$ D g(t)=\left[\begin{array}{c} 1 \\ 2 t \\ 3 t^{2} \end{array}\right] $$

Finally, multiplying $D f(g(t))$ by $D g(t),$ we get: $$ \begin{aligned} D(f \circ g)(t)=D f(g(t)) \cdot D g(t) &=\left[2 t+t^{5} \quad t^{4}+5 \quad t^{3}+10\right] \cdot\left[\begin{array}{c} 1 \\ 2 t \\ 3 t^{2} \end{array}\right] \\ &=2 t+t^{5}+2 t^{6}+10 t+3 t^{5} +30 t^{2}\\ &=12 t+6 t^{5}+30 t^{2} \end{aligned} $$

Now that we've explored both the ways of finding the derivative,we can also interpret this result for the case where $f: \mathbb{R}^{3} \rightarrow \mathbb{R}$ and $g: \mathbb{R} \rightarrow \mathbb{R}^{3}$ is to recall that the directional derivative of $f$ in the direction of some vector $\vec{v}$ is: $$ D_{\vec{v}} f=(\nabla f) \cdot \vec{v} $$

If you don't understand what I'm talking about, give this video a watch...

Coming back to our case, $(\nabla f)$ is the Jacobian of $f$ (because of its dimensionality). So if we take $\vec{v}$ to be the vector $D g(t),$ and evaluate the gradient at $g(t)$ we get: $$ D_{D \vec{g}(t)} f(t)=(\nabla f(g(t))) \cdot D g(t) $$ This gives us some additional intuition for the weather experiment. The change in weather as a function of time is the directional derivative of $f$ in the direction of the change in location $(D g(t))$

The 'Link'

So how does it all come together ?

Given function $f(x): \mathbb{R} \rightarrow \mathbb{R},$ the Jacobian matrix contains a single entry. $$ D f(a)=\left[D_{x} f(a)\right]=\left[\frac{d f}{d x}(a)\right] $$ Therefore, given two functions mapping $\mathbb{R} \rightarrow \mathbb{R}$, the derivative of their composition using the multivariate chain rule is: $$ D(f \circ g)(a)=D f(g(a)) \cdot D g(a)=f^{\prime}(g(a)) g^{\prime}(a) $$ Which is just the single-variable chain rule.

This results from matrix multiplication between two $1 \times 1$ matrices, which ends up being just the product of their single entries and Voila! We're Done.

$Q. E. D$

So, why this Madhav ?

I don't know truly what prompted me to write this. Inspiration appeared from various sources with my occasional flipping through calc notes and a strong urge to write something math-y with LaTeX being among the major ones.