Revisiting Backpropagation!




Introduction

Backpropagation is one of the most widely used algorithms in the deep learning community. At its essence, it is just the chain rule applied intelligently. The algorithm finds the derivative of the cost function with respect to the weights of the network.

It is very easy to get stumped while learning it for the first time, but carefully going through the derivation by manually applying it to a dummy network can do the trick.

Here we will apply the algorithm to two of the most common neural network architectures.

  1. Neural Networks with Cross Entropy Error and Logistic Activation
  2. Neural Networks with Cross Entropy Error and Softmax Activation

So, let's get our hands dirty!

1. Neural Networks with Cross Entropy Error and Logistic Activation



In a classification task with two classes, it is standard practice to use a neural network with a single logistic output unit and the cross-entropy error. The output of such a network is always between 0 and 1 and is interpreted as a probability. This architecture can be extended to tasks involving multiple, independent two-class classifications by including multiple logistic output units (one logistic output unit per two-class classification task).

To keep the derivation simple, we will consider a network with two hidden layers of logistic units and multiple logistic output units, where the cross-entropy error is summed over the output units.

The cross-entropy error for a single example with ntarget independent targets is given by the sum

$$E = -\sum_{i=1}^{ntarget} (t_ilog(y_i) + (1-t_i)log(1-y_i))$$

where t is the target vector and y is the output vector.
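
As a quick sanity check, here is a minimal NumPy sketch of this error for a single example (the values of t and y below are made up for illustration):

```python
import numpy as np

# Hypothetical target and output vectors for a single example
t = np.array([1.0, 0.0, 1.0])   # independent binary targets
y = np.array([0.9, 0.2, 0.7])   # logistic outputs, each in (0, 1)

# Cross-entropy error summed over the output units
E = -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))
print(E)
```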

The outputs y of the network are calculated by applying the logistic function to the weighted sums of the hidden layer activations (h):

$$y_i = \frac{1}{1+e^{-z_i}}$$

$$z_i = \sum_{j=1} h_jw_{ij}$$
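
In code, this forward pass through the output layer is just a matrix-vector product followed by the logistic function. A minimal sketch with made-up sizes (the bias term, which appears later in the derivation, is included here as well):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
h = rng.standard_normal(4)        # hidden layer activations (4 units, illustrative)
W = rng.standard_normal((3, 4))   # W[i, j] is the weight from hidden unit j to output unit i
b = np.zeros(3)                   # output biases

z = W @ h + b                     # z_i = sum_j w_ij * h_j + b_i
y = logistic(z)                   # y_i = 1 / (1 + exp(-z_i))
print(y)
```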

The derivative of the error with respect to each weight connecting hidden units to the output units is given by:

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial y_i}\frac{\partial y_i}{\partial z_i}\frac{\partial z_i}{\partial w_{ij}}$$

Let's calculate each term separately.

$$\frac{\partial E}{\partial y_i} = \frac{\partial }{\partial y_i}(-\sum_{i=1}^{ntarget} (t_ilog(y_i) + (1-t_i)log(1-y_i))) \\ =-\frac{ t_i}{ y_i} + \frac{ 1-t_i}{ 1-y_i} \\=\frac{ y_i-t_i}{ y_i(1-y_i)} $$

$$\frac{\partial y_i}{\partial z_i} = y_i(1-y_i)$$
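
This standard identity follows directly from differentiating the logistic function:

$$\frac{\partial y_i}{\partial z_i} = \frac{\partial }{\partial z_i}\Big(\frac{1}{1+e^{-z_i}}\Big) = \frac{e^{-z_i}}{(1+e^{-z_i})^2} = \frac{1}{1+e^{-z_i}}\cdot\frac{e^{-z_i}}{1+e^{-z_i}} = y_i(1-y_i)$$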

$$\frac{\partial z_i}{\partial w_{ij}} = \frac{\partial {}}{\partial w_{ij}}(\sum_{j=1}^{} h_jw_{ij})\\= h_j$$

Combining all these equations gives: $$\boxed{\frac{\partial E}{\partial w_{ij}}=(y_i-t_i)h_j}$$
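
For the whole output layer at once, this boxed result is simply the outer product of the error vector (y - t) with the hidden activations. A minimal NumPy sketch with made-up values:

```python
import numpy as np

y = np.array([0.9, 0.2, 0.7])        # outputs
t = np.array([1.0, 0.0, 1.0])        # targets
h = np.array([0.5, -0.3, 0.8, 0.1])  # hidden layer activations

# dE/dw_ij = (y_i - t_i) * h_j, i.e. the gradient matrix is an outer product
dE_dW = np.outer(y - t, h)
print(dE_dW.shape)                   # (3, 4), one entry per weight w_ij
```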

Now, for the partial derivative of the error with respect to each weight connecting the first hidden layer to the second, rather than writing the explicit chain-rule formula first (as done previously), it is more intuitive to keep expanding the terms produced by the chain rule until we reach the final result.

Let's begin:

$$\frac{\partial E}{\partial w_{jk}} = \frac{\partial }{\partial w_{jk}}(-\sum_{i=1}^{ntarget} (t_ilog(y_i) + (1-t_i)log(1-y_i))) \\ =\sum_{i=1}^{ntarget}-\frac{ t_i}{ y_i}\frac{\partial y_i}{\partial w_{jk}} + \frac{ 1-t_i}{ 1-y_i} \frac{\partial y_i}{\partial w_{jk}} \\=\sum_{i=1}^{ntarget}(-\frac{ t_i}{ y_i} + \frac{ 1-t_i}{ 1-y_i}) \frac{\partial y_i}{\partial w_{jk}} =\sum_{i=1}^{ntarget}[\frac{ y_i-t_i}{ y_i(1-y_i)}]\frac{\partial y_i}{\partial w_{jk}} \\ = \sum_{i=1}^{ntarget}[\frac{ y_i-t_i}{ y_i(1-y_i)}]\frac{\partial }{\partial w_{jk}}\sigma(z_i) \\ = \sum_{i=1}^{ntarget}[\frac{ y_i-t_i}{ y_i(1-y_i)}]\sigma(z_i)(1-\sigma(z_i))\frac{\partial }{\partial w_{jk}}z_i \\ = \sum_{i=1}^{ntarget}({ y_i-t_i})\frac{\partial }{\partial w_{jk}}z_i \\ = \sum_{i=1}^{ntarget}({ y_i-t_i})\frac{\partial z_i }{\partial h_{j}}\frac{\partial h_j}{\partial w_{jk}} \\ = \sum_{i=1}^{ntarget}({ y_i-t_i})\frac{\partial }{\partial h_{j}}(\sum_{j=1}^{} h_jw_{ij} + b_i)\frac{\partial }{\partial w_{jk}}(\sigma(z_j)) \\ = \sum_{i=1}^{ntarget}({ y_i-t_i})w_{ij}\sigma(z_j)(1-\sigma(z_j))\frac{\partial z_j}{\partial w_{jk}} \\ = \sum_{i=1}^{ntarget}({ y_i-t_i})w_{ij}\sigma(z_j)(1-\sigma(z_j))\frac{\partial }{\partial w_{jk}}(\sum_{k=1}^{} h_kw_{jk} + b_j) \\ \boxed{\frac{\partial E}{\partial w_{jk}} = \sum_{i=1}^{ntarget}({ y_i-t_i})w_{ij}\sigma(z_j)(1-\sigma(z_j))h_k}$$
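
One way to gain confidence in this boxed result is a numerical gradient check on a tiny network. The sketch below (with made-up sizes and values, and biases omitted) compares the formula against a finite-difference estimate:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, t):
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

rng = np.random.default_rng(1)
h_k = rng.uniform(0.1, 0.9, size=5)        # activations of the layer below (kept fixed)
W_jk = rng.standard_normal((4, 5)) * 0.5   # weights into the hidden layer j
W_ij = rng.standard_normal((3, 4)) * 0.5   # weights into the output layer i
t = np.array([1.0, 0.0, 1.0])              # targets

def forward(W_jk):
    h_j = logistic(W_jk @ h_k)             # hidden activations sigma(z_j)
    y = logistic(W_ij @ h_j)               # outputs y_i
    return y, h_j

# Gradient from the boxed formula: sum_i (y_i - t_i) w_ij sigma(z_j)(1 - sigma(z_j)) h_k
y, h_j = forward(W_jk)
grad_analytic = np.outer((W_ij.T @ (y - t)) * h_j * (1 - h_j), h_k)

# Finite-difference estimate for one particular weight w_jk
eps, j, k = 1e-6, 2, 3
W_plus, W_minus = W_jk.copy(), W_jk.copy()
W_plus[j, k] += eps
W_minus[j, k] -= eps
grad_numeric = (cross_entropy(forward(W_plus)[0], t) -
                cross_entropy(forward(W_minus)[0], t)) / (2 * eps)

print(grad_analytic[j, k], grad_numeric)   # the two numbers should agree closely
```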

Fine, that was a neat application of the chain rule, but if the network is several layers deep, doing the derivation in a similar fashion becomes really complicated! It's time to write an explicit formula from which the above equation can be derived.

Using chain rule:

$$\\\frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial y_i}\frac{\partial y_i}{\partial z_i}\frac{\partial z_i}{\partial h_j}\frac{\partial h_j}{\partial z_j}\frac{\partial z_j}{\partial w_{jk}}$$

This equation can also be written as:

$$\\\frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial z_j}\frac{\partial z_j}{\partial w_{jk}}$$

This equation clearly shows that by computing the gradient of the error with respect to the weighted input (z) of each neuron in a layer, we can compute the gradients for all the weights in that layer.

Simplifying further:$$\\\frac{\partial E}{\partial z_{j}} = \delta_{j}$$

$$\frac{\partial z_j}{\partial w_{jk}} = \frac{\partial }{\partial w_{jk}}(\sum_{k=1}^{} h_kw_{jk}+b_j) = h_k$$

Finally:$$\\\boxed {\frac{\partial E}{\partial w_{jk}} =\delta_{j}h_k}$$

Similarly, we can write:$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial y_i}\frac{\partial y_i}{\partial z_i}\frac{\partial z_i}{\partial w_{ij}} = \frac{\partial E}{\partial z_i}\frac{\partial z_i}{\partial w_{ij}} = \delta_{i}h_j \\ \boxed{\frac{\partial E}{\partial w_{ij}} = \delta_{i}h_j}$$

This equation looks beautiful, but how does it simplify the calculation? Expanding the delta of a layer using the chain rule gives:

$$\delta_j = \sum_{i=1}^{} w_{ij}\delta_i\sigma(z_j)(1-\sigma(z_j))$$

This equation can be applied recursively to calculate the error (delta) in the other layers. For example: $$\delta_k = \sum_{j=1}^{} w_{jk}\delta_j\sigma(z_k)(1-\sigma(z_k))$$ and so on.

Calculating the error (delta) for one layer and using it recursively to calculate the errors in the other layers makes it much easier to compute the gradient of the cost function with respect to the weights.
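
Put together, the backward pass over an arbitrary stack of logistic layers is just this recursion applied layer by layer. A minimal sketch (layer sizes and data are made up; biases are left out for brevity):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
sizes = [6, 5, 4, 3]                      # input, two hidden layers, output
weights = [rng.standard_normal((m, n)) * 0.5 for n, m in zip(sizes[:-1], sizes[1:])]
x = rng.uniform(size=sizes[0])
t = np.array([0.0, 1.0, 0.0])

# Forward pass: keep every layer's activations for the backward pass
activations = [x]
for W in weights:
    activations.append(logistic(W @ activations[-1]))
y = activations[-1]

# Backward pass: the output delta is (y - t); earlier deltas come from the recursion
delta = y - t
grads = []
for W, h_below in zip(weights[::-1], activations[-2::-1]):
    grads.append(np.outer(delta, h_below))            # dE/dW = outer(delta, h) for this layer
    delta = (W.T @ delta) * h_below * (1 - h_below)   # delta for the layer below
grads = grads[::-1]                                   # reorder to match `weights`
# (the last delta, computed for the input layer, is simply not used)

print([g.shape for g in grads])                       # one gradient matrix per weight matrix
```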

2. Neural Networks with Cross Entropy Error and Softmax Activation


When a classification task has more than two classes, it is standard to use a softmax output layer. The softmax function provides a way of predicting a discrete probability distribution over the classes. We again use the cross-entropy error function, but it takes a slightly different form. We will use the same figure for reference as in the first part.

The softmax activation of the ith output unit is

$$y_i = \frac{ e^{z_i}}{ \sum_{c=1}^{nclasses} e^{z_c}}$$

and the cross entropy error function for multi-class output is

$$E = -\sum_{i=1}^{nclasses} t_ilog(y_i)$$
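
The derivation proceeds along the same lines as in part 1. As a quick numerical check that the output-layer error again takes the (y - t) form for a one-hot target, the sketch below compares a finite-difference gradient of E with respect to z against y - t (all values are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtracting the max improves numerical stability
    return e / np.sum(e)

def cross_entropy(y, t):
    return -np.sum(t * np.log(y))

z = np.array([2.0, 1.0, 0.1])   # pre-softmax scores (illustrative)
t = np.array([0.0, 1.0, 0.0])   # one-hot target
y = softmax(z)

# Finite-difference gradient of E with respect to each z_i
eps = 1e-6
grad_numeric = np.array([
    (cross_entropy(softmax(z + eps * np.eye(3)[i]), t) -
     cross_entropy(softmax(z - eps * np.eye(3)[i]), t)) / (2 * eps)
    for i in range(3)
])

print(grad_numeric)   # numerically very close to ...
print(y - t)          # ... the familiar (y - t) error term
```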

Summing up, the basic idea of backpropagation is to calculate the error (delta) for a layer, multiply it by the activations of the layer below to get the gradients with respect to the weights, and propagate the error backwards. Once we have the gradient of the cost function with respect to the weights of the network, we can update the weights using gradient descent or any other available optimization algorithm.
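
To close the loop, here is a tiny self-contained sketch of the whole procedure (forward pass, backward pass, gradient-descent update) on made-up data, using a single hidden layer of logistic units and a single logistic output. The weights are stored transposed relative to the derivation so that a whole batch of examples can be processed at once:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(Y, T):
    return -np.mean(T * np.log(Y) + (1 - T) * np.log(1 - Y))

rng = np.random.default_rng(3)
X = rng.uniform(size=(8, 4))                            # 8 toy examples with 4 inputs each
T = (X.sum(axis=1, keepdims=True) > 2.0).astype(float)  # a simple binary target

W1 = rng.standard_normal((4, 5)) * 0.5                  # input  -> hidden weights
W2 = rng.standard_normal((5, 1)) * 0.5                  # hidden -> output weights
lr = 1.0                                                # illustrative learning rate

print("initial loss:", loss(logistic(logistic(X @ W1) @ W2), T))

for step in range(3000):
    # Forward pass
    H = logistic(X @ W1)            # hidden activations
    Y = logistic(H @ W2)            # outputs

    # Backward pass: output error (Y - T), then the recursion for the hidden layer
    dZ2 = (Y - T) / len(X)          # average the error over the examples
    dW2 = H.T @ dZ2
    dZ1 = (dZ2 @ W2.T) * H * (1 - H)
    dW1 = X.T @ dZ1

    # Gradient-descent update
    W2 -= lr * dW2
    W1 -= lr * dW1

print("final loss:", loss(logistic(logistic(X @ W1) @ W2), T))   # should be clearly lower
```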