==Derivation==
The gradient descent method involves calculating the derivative of the loss function with respect to the weights of the network. This is normally done using backpropagation. Assuming one output neuron,{{efn|There can be multiple output neurons, in which case the error is the squared norm of the difference vector.}} the error function is

:<math>E = L(t, y)</math>

where
:<math>L</math> is the loss for the output <math>y</math> and target value <math>t</math>,
:<math>t</math> is the target output for a training sample, and
:<math>y</math> is the actual output of the output neuron.

For each neuron <math>j</math>, its output <math>o_j</math> is defined as
:<math>o_j = \varphi(\text{net}_j) = \varphi\left(\sum_{k=1}^n w_{kj}x_k\right),</math>
where the [[activation function]] <math>\varphi</math> is [[non-linear]] and [[Differentiable function|differentiable]] over the activation region (even if the ReLU is not differentiable at one point).

A historically used activation function is the [[logistic function]]:
:<math> \varphi(z) = \frac 1 {1+e^{-z}}</math>
which has a [[Logistic function#Mathematical_properties|convenient]] derivative of:
:<math> \frac {d \varphi}{d z} = \varphi(z)(1-\varphi(z)) </math>

The input <math>\text{net}_j</math> to a neuron is the weighted sum of outputs <math>o_k</math> of previous neurons. If the neuron is in the first layer after the input layer, the <math>o_k</math> of the input layer are simply the inputs <math>x_k</math> to the network. The number of input units to the neuron is <math>n</math>. The variable <math>w_{kj}</math> denotes the weight between neuron <math>k</math> of the previous layer and neuron <math>j</math> of the current layer.
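The following is a minimal numerical sketch (in Python with NumPy, not part of the derivation itself) of a single neuron's output under the logistic activation. The weight and input values are purely illustrative, and the identity <math>\varphi'(z) = \varphi(z)(1-\varphi(z))</math> is checked against a finite difference.

<syntaxhighlight lang="python">
import numpy as np

def logistic(z):
    """Logistic activation phi(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(w, x):
    """o_j = phi(net_j) with net_j = sum_k w_kj * x_k."""
    net = np.dot(w, x)
    return logistic(net)

# Check the convenient derivative phi'(z) = phi(z) * (1 - phi(z))
# against a central finite difference.
z = 0.7
eps = 1e-6
analytic = logistic(z) * (1.0 - logistic(z))
numeric = (logistic(z + eps) - logistic(z - eps)) / (2 * eps)
print(analytic, numeric)          # the two values agree closely

# Output of a single neuron with illustrative weights and inputs.
w = np.array([0.2, -0.5, 0.1])    # w_kj for k = 1..3 (hypothetical values)
x = np.array([1.0, 2.0, 3.0])     # inputs x_k
print(neuron_output(w, x))
</syntaxhighlight>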
===Finding the derivative of the error===
[[Image:ArtificialNeuronModel english.png|thumb|400px|Diagram of an artificial neural network to illustrate the notation used here]]

Calculating the [[partial derivative]] of the error with respect to a weight <math>w_{ij}</math> is done using the [[chain rule]] twice:

{{NumBlk|:|<math>\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_{j}} \frac{\partial o_{j}}{\partial w_{ij}} = \frac{\partial E}{\partial o_j} \frac{\partial o_j}{\partial\text{net}_j} \frac{\partial \text{net}_j}{\partial w_{ij}}</math>|{{EquationRef|Eq. 1}}}}

In the last factor of the right-hand side of the above, only one term in the sum <math>\text{net}_j</math> depends on <math>w_{ij}</math>, so that

{{NumBlk|:|<math>\frac{\partial \text{net}_j}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \left(\sum_{k=1}^n w_{kj} o_k\right) = \frac{\partial}{\partial w_{ij}} w_{ij} o_i= o_i.</math>|{{EquationRef|Eq. 2}}}}

If the neuron is in the first layer after the input layer, <math>o_i</math> is just <math>x_i</math>.

The derivative of the output of neuron <math>j</math> with respect to its input is simply the partial derivative of the activation function:

{{NumBlk|:|<math>\frac{\partial o_j}{\partial\text{net}_j} = \frac {\partial \varphi(\text{net}_j)}{\partial \text{net}_j}</math>|{{EquationRef|Eq. 3}}}}

which for the [[logistic function|logistic activation function]] is

:<math>\frac{\partial o_j}{\partial\text{net}_j} = \frac {\partial}{\partial \text{net}_j} \varphi(\text{net}_j) = \varphi(\text{net}_j)(1-\varphi(\text{net}_j)) = o_j(1-o_j)</math>

This is the reason why backpropagation requires that the activation function be [[Differentiable function|differentiable]]. (Nevertheless, the [[ReLU]] activation function, which is non-differentiable at 0, has become quite popular, e.g. in [[AlexNet]].)

The first factor is straightforward to evaluate if the neuron is in the output layer, because then <math>o_j = y</math> and

{{NumBlk|:|<math>\frac{\partial E}{\partial o_j} = \frac{\partial E}{\partial y} </math>|{{EquationRef|Eq. 4}}}}

If half of the squared error is used as the loss function, we can rewrite it as

: <math>\frac{\partial E}{\partial o_j} = \frac{\partial E}{\partial y} = \frac{\partial}{\partial y} \frac{1}{2}(t - y)^2 = y - t </math>

However, if <math>j</math> is in an arbitrary inner layer of the network, finding the derivative of <math>E</math> with respect to <math>o_j</math> is less obvious. Considering <math>E</math> as a function with the inputs being all neurons <math>L = \{u, v, \dots, w\}</math> receiving input from neuron <math>j</math>,

: <math>\frac{\partial E(o_j)}{\partial o_j} = \frac{\partial E(\mathrm{net}_u, \text{net}_v, \dots, \mathrm{net}_w)}{\partial o_j}</math>

and taking the [[total derivative]] with respect to <math>o_j</math>, a recursive expression for the derivative is obtained:

{{NumBlk|:|<math>\frac{\partial E}{\partial o_j} = \sum_{\ell \in L} \left(\frac{\partial E}{\partial \text{net}_\ell}\frac{\partial \text{net}_\ell}{\partial o_j}\right) = \sum_{\ell \in L} \left(\frac{\partial E}{\partial o_\ell}\frac{\partial o_\ell}{\partial \text{net}_\ell}\frac{\partial \text{net}_\ell}{\partial o_j}\right) = \sum_{\ell \in L} \left(\frac{\partial E}{\partial o_\ell}\frac{\partial o_\ell}{\partial \text{net}_\ell}w_{j \ell}\right)</math>|{{EquationRef|Eq. 5}}}}

Therefore, the derivative with respect to <math>o_j</math> can be calculated if all the derivatives with respect to the outputs <math>o_\ell</math> of the next layer (the ones closer to the output neuron) are known. [Note that if any of the neurons in set <math>L</math> were not connected to neuron <math>j</math>, they would be independent of <math>w_{ij}</math> and the corresponding partial derivative under the summation would vanish to 0.]

Substituting {{EquationNote|Eq. 2}}, {{EquationNote|Eq. 3}}, {{EquationNote|Eq. 4}} and {{EquationNote|Eq. 5}} in {{EquationNote|Eq. 1}}, we obtain:

: <math>\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_{j}} \frac{\partial o_{j}}{\partial \text{net}_{j}} \frac{\partial \text{net}_{j}}{\partial w_{ij}} = \frac{\partial E}{\partial o_{j}} \frac{\partial o_{j}}{\partial \text{net}_{j}} o_i</math>

: <math> \frac{\partial E}{\partial w_{ij}} = o_i \delta_j</math>

with

:<math>\delta_j = \frac{\partial E}{\partial o_j} \frac{\partial o_j}{\partial\text{net}_j} = \begin{cases} \frac{\partial L(t, o_j)}{\partial o_j} \frac {d \varphi(\text{net}_j)}{d \text{net}_j} & \text{if } j \text{ is an output neuron,}\\ (\sum_{\ell\in L} w_{j \ell} \delta_\ell)\frac {d \varphi(\text{net}_j)}{d \text{net}_j} & \text{if } j \text{ is an inner neuron.} \end{cases}</math>

If <math>\varphi</math> is the logistic function and the error is the square error:

: <math>\delta_j = \frac{\partial E}{\partial o_j} \frac{\partial o_j}{\partial\text{net}_j} = \begin{cases} (o_j-t_j)o_j(1-o_{j}) & \text{if } j \text{ is an output neuron,}\\ (\sum_{\ell\in L} w_{j \ell} \delta_\ell)o_j(1-o_j) & \text{if } j \text{ is an inner neuron.} \end{cases}</math>
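As a sanity check on this recursion, the following sketch (Python with NumPy; the two-input, two-hidden-neuron network and all weight values are purely illustrative) computes <math>\delta_j</math> for the output neuron and for an inner neuron of a tiny logistic network with squared-error loss, then compares <math>o_i \delta_j</math> to a finite-difference estimate of <math>\frac{\partial E}{\partial w_{ij}}</math>.

<syntaxhighlight lang="python">
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny illustrative network: 2 inputs -> 2 logistic hidden neurons -> 1 logistic output.
x = np.array([0.5, -1.0])              # inputs
W1 = np.array([[0.1, 0.4],             # W1[j, i]: weight from input i to hidden neuron j
               [-0.3, 0.2]])
w2 = np.array([0.7, -0.6])             # w2[j]: weight from hidden neuron j to the output
t = 1.0                                # target

def forward(W1, w2, x):
    o_hidden = logistic(W1 @ x)        # hidden outputs o_j
    y = logistic(w2 @ o_hidden)        # output neuron's output y
    return o_hidden, y

def loss(W1, w2, x, t):
    _, y = forward(W1, w2, x)
    return 0.5 * (t - y) ** 2          # E = 1/2 (t - y)^2

o_hidden, y = forward(W1, w2, x)

# delta for the output neuron: (y - t) * y * (1 - y)
delta_out = (y - t) * y * (1.0 - y)
# delta for each inner (hidden) neuron: (sum_l w_jl * delta_l) * o_j * (1 - o_j)
delta_hidden = (w2 * delta_out) * o_hidden * (1.0 - o_hidden)

# dE/dw_ij = o_i * delta_j, e.g. for the weight from input 0 to hidden neuron 1:
grad_backprop = x[0] * delta_hidden[1]

# Finite-difference check of the same partial derivative.
eps = 1e-6
W1p, W1m = W1.copy(), W1.copy()
W1p[1, 0] += eps
W1m[1, 0] -= eps
grad_numeric = (loss(W1p, w2, x, t) - loss(W1m, w2, x, t)) / (2 * eps)

print(grad_backprop, grad_numeric)     # the two values agree closely
</syntaxhighlight>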
To update the weight <math>w_{ij}</math> using gradient descent, one must choose a learning rate, <math>\eta >0</math>. The change in weight needs to reflect the impact on <math>E</math> of an increase or decrease in <math>w_{ij}</math>. If <math>\frac{\partial E}{\partial w_{ij}} > 0</math>, an increase in <math>w_{ij}</math> increases <math>E</math>; conversely, if <math>\frac{\partial E}{\partial w_{ij}} < 0</math>, an increase in <math>w_{ij}</math> decreases <math>E</math>. The new <math>\Delta w_{ij}</math>, the product of the learning rate and the gradient multiplied by <math>-1</math>, is added to the old weight; this guarantees that <math>w_{ij}</math> changes in a way that always decreases <math>E</math>. In other words, in the equation immediately below, <math>- \eta \frac{\partial E}{\partial w_{ij}}</math> always changes <math>w_{ij}</math> in such a way that <math>E</math> is decreased:

: <math> \Delta w_{ij} = - \eta \frac{\partial E}{\partial w_{ij}} = - \eta o_i \delta_j</math>
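The sketch below (Python with NumPy; the same illustrative toy network as above, with an arbitrary learning rate value) applies the update <math>\Delta w_{ij} = -\eta o_i \delta_j</math> for a few gradient-descent steps and prints <math>E</math>, which should decrease over the steps.

<syntaxhighlight lang="python">
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative toy network: 2 inputs -> 2 logistic hidden neurons -> 1 logistic output.
x = np.array([0.5, -1.0])
W1 = np.array([[0.1, 0.4], [-0.3, 0.2]])
w2 = np.array([0.7, -0.6])
t = 1.0
eta = 0.5                              # learning rate eta > 0 (illustrative value)

for step in range(5):
    # Forward pass.
    o_hidden = logistic(W1 @ x)
    y = logistic(w2 @ o_hidden)
    E = 0.5 * (t - y) ** 2
    print(f"step {step}: E = {E:.6f}")

    # Backward pass: deltas as derived above.
    delta_out = (y - t) * y * (1.0 - y)
    delta_hidden = (w2 * delta_out) * o_hidden * (1.0 - o_hidden)

    # Weight update: Delta w_ij = -eta * o_i * delta_j, added to the old weight.
    w2 += -eta * o_hidden * delta_out
    W1 += -eta * np.outer(delta_hidden, x)
</syntaxhighlight>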