A step-by-step guide to understanding backpropagation

Feedforward and backpropagation are the two key components for training neural networks. While the feedforward computation is intuitively easy to understand, backpropagation is somewhat harder to grasp. The UFLDL tutorial directly presents the formula for the error term $\delta$ of each neuron unit, without much theoretical or detailed reasoning about how the derivative of the final loss function w.r.t. each weight $W$ or bias $b$ is computed from a mathematical perspective. This can keep learners from fully comprehending backpropagation. In this article, I give a step-by-step guide to understanding backpropagation both mathematically and intuitively.

Notation

[Figure: neural network notation]

I follow the UFLDL tutorial to define the neural network. That is, let $z_i^{(l)}$ denote the total weighted sum of inputs to unit $i$ in layer $l$, let $a_i^{(l)}$ denote the activation of unit $i$ in layer $l$, which takes $z_i^{(l)}$ as input, and let $W_{i,j}^{(l)}$ denote the weight parameter between unit $j$ in layer $l$ and unit $i$ in layer $l+1$. We also let $f_i^{(l)}$ denote the activation function of unit $i$ in layer $l$.
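To make the notation concrete, here is a minimal NumPy sketch of the feedforward computation. The sigmoid activation and the toy layer sizes are my own assumptions for illustration; any differentiable activation works the same way.

```python
import numpy as np

def sigmoid(z):
    """An example activation function f; any differentiable f works."""
    return 1.0 / (1.0 + np.exp(-z))

# Toy layer sizes (assumed for illustration): 3 inputs -> 4 hidden -> 2 outputs
sizes = [3, 4, 2]
rng = np.random.default_rng(0)
W = [rng.standard_normal((sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]
b = [np.zeros(sizes[l + 1]) for l in range(len(sizes) - 1)]

def feedforward(x):
    """Return the per-layer weighted inputs z^{(l)} and activations a^{(l)}."""
    a = [x]                      # a^{(1)} is the input itself
    zs = []
    for Wl, bl in zip(W, b):
        z = Wl @ a[-1] + bl      # z_i^{(l+1)} = sum_j W_{i,j}^{(l)} a_j^{(l)} + b_i^{(l)}
        zs.append(z)
        a.append(sigmoid(z))     # a_i^{(l+1)} = f_i^{(l+1)}(z_i^{(l+1)})
    return zs, a

zs, a = feedforward(np.array([1.0, 0.5, -0.2]))
```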

A Simple Example

[Figure: a simple four-layer network with one unit per layer]

From a mathematical perspective, backpropagation builds on the chain rule of differentiation. In a deep neural network, backpropagation begins at the final output layer and propagates the error term backward, layer by layer, from the output layer to the input layer. Backpropagation dispatches the error term of the output layer to all weights and biases, and decides the contribution of each weight or bias to the final error by taking the local derivatives along the way into consideration. To update a weight or a bias value, we rely on the following two equations:
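namely the gradient descent updates (written here with learning rate $\alpha$, as in UFLDL):

$$W_{i,j}^{(l)} = W_{i,j}^{(l)} - \alpha \frac{\partial J(W,b)}{\partial W_{i,j}^{(l)}}, \qquad b_i^{(l)} = b_i^{(l)} - \alpha \frac{\partial J(W,b)}{\partial b_i^{(l)}}$$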

To compute $\frac{\partial J(W,b)}{\partial W_{i,j}^{(l)}}$ or $\frac{\partial J(W,b)}{\partial b_i^{(l)}}$, we rely on the chain rule. Taking the simple layer structure in the above image as an example:
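For the first-layer weight $W_{1,1}^{(1)}$, for instance, the chain rule gives

$$\frac{\partial J(W,b)}{\partial W_{1,1}^{(1)}} = \frac{\partial J(W,b)}{\partial z_1^{(4)}} \cdot \frac{\partial z_1^{(4)}}{\partial z_1^{(3)}} \cdot \frac{\partial z_1^{(3)}}{\partial z_1^{(2)}} \cdot \frac{\partial z_1^{(2)}}{\partial W_{1,1}^{(1)}}$$

and the derivative w.r.t. $b^{(1)}$ expands in the same way, with $\frac{\partial z_1^{(2)}}{\partial b^{(1)}} = 1$ as the last factor.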

Note that $z_1^{(4)} = W_{1,1}^{(3)} \cdot f_1^{(3)}(z_1^{(3)}) + b^{(3)}$, so $\frac{\partial z_1^{(4)}}{\partial z_1^{(3)}} = W_{1,1}^{(3)} \cdot f_1^{(3)'}(z_1^{(3)})$. By following the chain rule in this way, we easily obtain the equivalent equation
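Writing out all of the factors for this simple chain (with the squared-error loss $J(W,b) = \frac{1}{2}(y - h_{W,b}(x))^2$), we get

$$\frac{\partial J(W,b)}{\partial W_{1,1}^{(1)}} = -(y - h_{W,b}(x)) \cdot f_1^{(4)'}(z_1^{(4)}) \cdot W_{1,1}^{(3)} f_1^{(3)'}(z_1^{(3)}) \cdot W_{1,1}^{(2)} f_1^{(2)'}(z_1^{(2)}) \cdot f_1^{(1)}(x)$$

$$\frac{\partial J(W,b)}{\partial b^{(1)}} = -(y - h_{W,b}(x)) \cdot f_1^{(4)'}(z_1^{(4)}) \cdot W_{1,1}^{(3)} f_1^{(3)'}(z_1^{(3)}) \cdot W_{1,1}^{(2)} f_1^{(2)'}(z_1^{(2)})$$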

Note that the difference between $b^{(1)}$ and $W_{1,1}^{(1)}$ lies in the fact that $W_{1,1}^{(1)}$ takes the value $f_1^{(1)}(x)$ feeding into it into account, while $b^{(1)}$ ignores it. UFLDL defines $-(y - h_{W,b}(x)) \cdot f_1^{(4)'}(z_1^{(4)})$ as the error term of the output layer. The output layer error term backpropagates through the layer structure, dispatching the error to each neuron unit. Note that neuron unit $u_i$ contributes the factor $W_{1,1}^{(i)} f_1^{(i)'}(z_1^{(i)})$ to the final error term (the error term in the output layer); this "contribution" consists of the derivative of its activation function w.r.t. its input, multiplied by the weight connecting it to the next adjacent neuron unit. The derivatives w.r.t. both $W_{1,1}^{(1)}$ and $b^{(1)}$ depend on the "contributions" of all the neurons traversed to reach the output neuron.
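As a quick sanity check of this expansion, the following sketch (my own illustration, assuming sigmoid activations for every unit and the squared-error loss) computes the gradient of $W_{1,1}^{(1)}$ for such a four-layer chain with the expanded chain rule, and compares it against a finite-difference estimate:

```python
import numpy as np

def f(z):          # sigmoid, assumed as every unit's activation function
    return 1.0 / (1.0 + np.exp(-z))

def f_prime(z):    # derivative of the sigmoid
    s = f(z)
    return s * (1.0 - s)

# A four-layer chain with one unit per layer: x -> u1 -> u2 -> u3 -> u4
x, y = 0.5, 1.0
W1, W2, W3 = 0.1, -0.3, 0.8          # W_{1,1}^{(1)}, W_{1,1}^{(2)}, W_{1,1}^{(3)}
b1, b2, b3 = 0.05, -0.1, 0.2         # b^{(1)}, b^{(2)}, b^{(3)}

def forward(w1):
    a1 = f(x)                         # f_1^{(1)}(x)
    z2 = w1 * a1 + b1
    z3 = W2 * f(z2) + b2
    z4 = W3 * f(z3) + b3
    return z2, z3, z4, a1, f(z4)      # f(z4) is h_{W,b}(x)

z2, z3, z4, a1, h = forward(W1)

# The expanded chain-rule gradient from the equation above
grad_W1 = -(y - h) * f_prime(z4) * W3 * f_prime(z3) * W2 * f_prime(z2) * a1

# Finite-difference check of dJ/dW_{1,1}^{(1)} with J = 0.5 * (y - h)^2
eps = 1e-6
J = lambda w1: 0.5 * (y - forward(w1)[-1]) ** 2
grad_numeric = (J(W1 + eps) - J(W1 - eps)) / (2 * eps)

print(grad_W1, grad_numeric)          # the two values should agree closely
```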

A Multiple Input and Output Example

[Figure: a network with multiple inputs and outputs]

Based on the simple example above, we now have a clear and intuitive understanding of how backpropagation works and how each parameter actually updates its value. This allows us to move on to a more complex neural network with multiple inputs and outputs. What happens in such a network?

In fact, the transition is easy to understand. Looking at the figure above, the update of the weight connecting two neurons between the first and second layer (the red connection) depends on all the red connections in the whole neural network. That is, to update a given weight, the neural network backpropagates the error term through all connections that contribute to that weight. The detailed steps work as follows (a code sketch implementing them appears after the list):

  • For each unit $i$ in the output layer $n_l$, compute the output error term: $\delta_i^{(n_l)} = -(y_i - a_i^{(n_l)}) \cdot f_i^{(n_l)'}(z_i^{(n_l)})$.
  • For each unit $i$ in layer $l$, going backward through $l = n_l - 1, n_l - 2, \cdots, 2$, compute the unit's own error term from the error terms of the layer above: $\delta_i^{(l)} = \left( \sum_{j=1}^{s_{l+1}} W_{j,i}^{(l)} \delta_j^{(l+1)} \right) f_i^{(l)'}(z_i^{(l)})$, where $s_{l+1}$ is the number of units in layer $l+1$.
  • Compute the desired partial derivatives for each neuron from its error term and activation value: $\frac{\partial J(W,b)}{\partial W_{i,j}^{(l)}} = a_j^{(l)} \delta_i^{(l+1)}$ and $\frac{\partial J(W,b)}{\partial b_i^{(l)}} = \delta_i^{(l+1)}$.
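To make the three steps concrete, here is a minimal NumPy sketch. It is my own illustration under the same assumptions as before: sigmoid activations shared by all units, the squared-error loss $J = \frac{1}{2}\|y - h_{W,b}(x)\|^2$, and made-up layer sizes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(W, b, x, y):
    """Gradients of J = 0.5 * ||y - h_{W,b}(x)||^2 for one training example."""
    # Feedforward pass: store z^{(l)} and a^{(l)} for every layer
    a, zs = [x], []
    for Wl, bl in zip(W, b):
        z = Wl @ a[-1] + bl
        zs.append(z)
        a.append(sigmoid(z))

    dW, db = [None] * len(W), [None] * len(b)

    # Step 1: error term of the output layer n_l
    delta = -(y - a[-1]) * sigmoid_prime(zs[-1])
    dW[-1] = np.outer(delta, a[-2])            # dJ/dW^{(l)} = delta^{(l+1)} (a^{(l)})^T
    db[-1] = delta                             # dJ/db^{(l)} = delta^{(l+1)}

    # Step 2: propagate error terms backward through layers l = n_l - 1, ..., 2
    for l in range(len(W) - 2, -1, -1):
        delta = (W[l + 1].T @ delta) * sigmoid_prime(zs[l])
        # Step 3: partial derivatives from error terms and activations
        dW[l] = np.outer(delta, a[l])
        db[l] = delta
    return dW, db

# Toy usage with assumed layer sizes 3 -> 4 -> 2
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
W = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(len(sizes) - 1)]
b = [np.zeros(sizes[i + 1]) for i in range(len(sizes) - 1)]
dW, db = backprop(W, b, rng.standard_normal(3), np.array([0.0, 1.0]))
```

Each entry of `dW` and `db` has the same shape as the corresponding $W^{(l)}$ or $b^{(l)}$, so the gradients can be plugged directly into the update equations given earlier.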

Summary

While forward propagation is easy and intuitive to understand, backpropagation is more obscure to grasp. Even so, if we dissect it and follow the process step by step, we can soon gain a better understanding of it. In essence, backpropagation transfers and dispatches the error term to each neuron unit. Each weight or bias term then updates its value by considering the error term of the neuron unit it connects to and the activation value feeding into it.

Hope it helps!