==Derivation==
The gradient descent method involves calculating the derivative of the loss function with respect to the weights of the network. This is normally done using backpropagation. Assuming one output neuron,{{efn|There can be multiple output neurons, in which case the error is the squared norm of the difference vector.}} the error function is

:<math>E = L(t, y)</math>

where
:<math>L</math> is the loss for the output <math>y</math> and target value <math>t</math>,
:<math>t</math> is the target output for a training sample, and
:<math>y</math> is the actual output of the output neuron.

For each neuron <math>j</math>, its output <math>o_j</math> is defined as
:<math>o_j = \varphi(\text{net}_j) = \varphi\left(\sum_{k=1}^n w_{kj}x_k\right),</math>
where the [[activation function]] <math>\varphi</math> is [[non-linear]] and [[Differentiable function|differentiable]] over the activation region (even if the ReLU is not differentiable at one point).

A historically used activation function is the [[logistic function]]:
:<math> \varphi(z) = \frac 1 {1+e^{-z}}</math>
which has a [[Logistic function#Mathematical_properties|convenient]] derivative of:
:<math> \frac {d \varphi}{d z} = \varphi(z)(1-\varphi(z)) </math>

The input <math>\text{net}_j</math> to a neuron is the weighted sum of outputs <math>o_k</math> of previous neurons. If the neuron is in the first layer after the input layer, the <math>o_k</math> of the input layer are simply the inputs <math>x_k</math> to the network. The number of input units to the neuron is <math>n</math>. The variable <math>w_{kj}</math> denotes the weight between neuron <math>k</math> of the previous layer and neuron <math>j</math> of the current layer.
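The following is a minimal numerical sketch (in Python with NumPy, not part of the derivation itself) of a single neuron's output under the logistic activation. The weight and input values are purely illustrative, and the identity <math>\varphi'(z) = \varphi(z)(1-\varphi(z))</math> is checked against a finite difference.

<syntaxhighlight lang="python">
import numpy as np

def logistic(z):
    """Logistic activation phi(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(w, x):
    """o_j = phi(net_j) with net_j = sum_k w_kj * x_k."""
    net = np.dot(w, x)
    return logistic(net)

# Check the convenient derivative phi'(z) = phi(z) * (1 - phi(z))
# against a central finite difference.
z = 0.7
eps = 1e-6
analytic = logistic(z) * (1.0 - logistic(z))
numeric = (logistic(z + eps) - logistic(z - eps)) / (2 * eps)
print(analytic, numeric)          # the two values agree closely

# Output of a single neuron with illustrative weights and inputs.
w = np.array([0.2, -0.5, 0.1])    # w_kj for k = 1..3 (hypothetical values)
x = np.array([1.0, 2.0, 3.0])     # inputs x_k
print(neuron_output(w, x))
</syntaxhighlight>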
===Finding the derivative of the error===
[[Image:ArtificialNeuronModel english.png|thumb|400px|Diagram of an artificial neural network to illustrate the notation used here]]

Calculating the [[partial derivative]] of the error with respect to a weight <math>w_{ij}</math> is done using the [[chain rule]] twice:

{{NumBlk|:|<math>\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_{j}} \frac{\partial o_{j}}{\partial w_{ij}} = \frac{\partial E}{\partial o_j} \frac{\partial o_j}{\partial\text{net}_j} \frac{\partial \text{net}_j}{\partial w_{ij}}</math>|{{EquationRef|Eq. 1}}}}

In the last factor of the right-hand side of the above, only one term in the sum <math>\text{net}_j</math> depends on <math>w_{ij}</math>, so that

{{NumBlk|:|<math>\frac{\partial \text{net}_j}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \left(\sum_{k=1}^n w_{kj} o_k\right) = \frac{\partial}{\partial w_{ij}} w_{ij} o_i= o_i.</math>|{{EquationRef|Eq. 2}}}}

If the neuron is in the first layer after the input layer, <math>o_i</math> is just <math>x_i</math>.

The derivative of the output of neuron <math>j</math> with respect to its input is simply the partial derivative of the activation function:

{{NumBlk|:|<math>\frac{\partial o_j}{\partial\text{net}_j} = \frac {\partial \varphi(\text{net}_j)}{\partial \text{net}_j}</math>|{{EquationRef|Eq. 3}}}}

which for the [[logistic function|logistic activation function]] is

:<math>\frac{\partial o_j}{\partial\text{net}_j} = \frac {\partial}{\partial \text{net}_j} \varphi(\text{net}_j) = \varphi(\text{net}_j)(1-\varphi(\text{net}_j)) = o_j(1-o_j)</math>

This is the reason why backpropagation requires that the activation function be [[Differentiable function|differentiable]]. (Nevertheless, the [[ReLU]] activation function, which is non-differentiable at 0, has become quite popular, e.g. in [[AlexNet]].)

The first factor is straightforward to evaluate if the neuron is in the output layer, because then <math>o_j = y</math> and

{{NumBlk|:|<math>\frac{\partial E}{\partial o_j} = \frac{\partial E}{\partial y} </math>|{{EquationRef|Eq. 4}}}}

If half of the squared error is used as the loss function, we can rewrite it as

: <math>\frac{\partial E}{\partial o_j} = \frac{\partial E}{\partial y} = \frac{\partial}{\partial y} \frac{1}{2}(t - y)^2 = y - t </math>

However, if <math>j</math> is in an arbitrary inner layer of the network, finding the derivative of <math>E</math> with respect to <math>o_j</math> is less obvious. Considering <math>E</math> as a function with the inputs being all neurons <math>L = \{u, v, \dots, w\}</math> receiving input from neuron <math>j</math>,

: <math>\frac{\partial E(o_j)}{\partial o_j} = \frac{\partial E(\mathrm{net}_u, \text{net}_v, \dots, \mathrm{net}_w)}{\partial o_j}</math>

and taking the [[total derivative]] with respect to <math>o_j</math>, a recursive expression for the derivative is obtained:

{{NumBlk|:|<math>\frac{\partial E}{\partial o_j} = \sum_{\ell \in L} \left(\frac{\partial E}{\partial \text{net}_\ell}\frac{\partial \text{net}_\ell}{\partial o_j}\right) = \sum_{\ell \in L} \left(\frac{\partial E}{\partial o_\ell}\frac{\partial o_\ell}{\partial \text{net}_\ell}\frac{\partial \text{net}_\ell}{\partial o_j}\right) = \sum_{\ell \in L} \left(\frac{\partial E}{\partial o_\ell}\frac{\partial o_\ell}{\partial \text{net}_\ell}w_{j \ell}\right)</math>|{{EquationRef|Eq. 5}}}}

Therefore, the derivative with respect to <math>o_j</math> can be calculated if all the derivatives with respect to the outputs <math>o_\ell</math> of the next layer (the ones closer to the output neuron) are known. [Note that if any of the neurons in set <math>L</math> were not connected to neuron <math>j</math>, they would be independent of <math>w_{ij}</math> and the corresponding partial derivative under the summation would vanish to 0.]

Substituting {{EquationNote|Eq. 2}}, {{EquationNote|Eq. 3}}, {{EquationNote|Eq. 4}} and {{EquationNote|Eq. 5}} in {{EquationNote|Eq. 1}}, we obtain:

: <math>\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_{j}} \frac{\partial o_{j}}{\partial \text{net}_{j}} \frac{\partial \text{net}_{j}}{\partial w_{ij}} = \frac{\partial E}{\partial o_{j}} \frac{\partial o_{j}}{\partial \text{net}_{j}} o_i</math>

: <math> \frac{\partial E}{\partial w_{ij}} = o_i \delta_j</math>

with

:<math>\delta_j = \frac{\partial E}{\partial o_j} \frac{\partial o_j}{\partial\text{net}_j} = \begin{cases} \frac{\partial L(t, o_j)}{\partial o_j} \frac {d \varphi(\text{net}_j)}{d \text{net}_j} & \text{if } j \text{ is an output neuron,}\\ (\sum_{\ell\in L} w_{j \ell} \delta_\ell)\frac {d \varphi(\text{net}_j)}{d \text{net}_j} & \text{if } j \text{ is an inner neuron.} \end{cases}</math>

If <math>\varphi</math> is the logistic function and the error is the square error:

: <math>\delta_j = \frac{\partial E}{\partial o_j} \frac{\partial o_j}{\partial\text{net}_j} = \begin{cases} (o_j-t_j)o_j(1-o_{j}) & \text{if } j \text{ is an output neuron,}\\ (\sum_{\ell\in L} w_{j \ell} \delta_\ell)o_j(1-o_j) & \text{if } j \text{ is an inner neuron.} \end{cases}</math>
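As a sanity check on this recursion, the following sketch (Python with NumPy; the two-input, two-hidden-neuron network and all weight values are purely illustrative) computes <math>\delta_j</math> for the output neuron and for an inner neuron of a tiny logistic network with squared-error loss, then compares <math>o_i \delta_j</math> to a finite-difference estimate of <math>\frac{\partial E}{\partial w_{ij}}</math>.

<syntaxhighlight lang="python">
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny illustrative network: 2 inputs -> 2 logistic hidden neurons -> 1 logistic output.
x = np.array([0.5, -1.0])              # inputs
W1 = np.array([[0.1, 0.4],             # W1[j, i]: weight from input i to hidden neuron j
               [-0.3, 0.2]])
w2 = np.array([0.7, -0.6])             # w2[j]: weight from hidden neuron j to the output
t = 1.0                                # target

def forward(W1, w2, x):
    o_hidden = logistic(W1 @ x)        # hidden outputs o_j
    y = logistic(w2 @ o_hidden)        # output neuron's output y
    return o_hidden, y

def loss(W1, w2, x, t):
    _, y = forward(W1, w2, x)
    return 0.5 * (t - y) ** 2          # E = 1/2 (t - y)^2

o_hidden, y = forward(W1, w2, x)

# delta for the output neuron: (y - t) * y * (1 - y)
delta_out = (y - t) * y * (1.0 - y)
# delta for each inner (hidden) neuron: (sum_l w_jl * delta_l) * o_j * (1 - o_j)
delta_hidden = (w2 * delta_out) * o_hidden * (1.0 - o_hidden)

# dE/dw_ij = o_i * delta_j, e.g. for the weight from input 0 to hidden neuron 1:
grad_backprop = x[0] * delta_hidden[1]

# Finite-difference check of the same partial derivative.
eps = 1e-6
W1p, W1m = W1.copy(), W1.copy()
W1p[1, 0] += eps
W1m[1, 0] -= eps
grad_numeric = (loss(W1p, w2, x, t) - loss(W1m, w2, x, t)) / (2 * eps)

print(grad_backprop, grad_numeric)     # the two values agree closely
</syntaxhighlight>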
To update the weight <math>w_{ij}</math> using gradient descent, one must choose a learning rate, <math>\eta >0</math>. The change in weight needs to reflect the impact on <math>E</math> of an increase or decrease in <math>w_{ij}</math>. If <math>\frac{\partial E}{\partial w_{ij}} > 0</math>, an increase in <math>w_{ij}</math> increases <math>E</math>; conversely, if <math>\frac{\partial E}{\partial w_{ij}} < 0</math>, an increase in <math>w_{ij}</math> decreases <math>E</math>. The new <math>\Delta w_{ij}</math>, the product of the learning rate and the gradient multiplied by <math>-1</math>, is added to the old weight; this guarantees that <math>w_{ij}</math> changes in a way that always decreases <math>E</math>. In other words, in the equation immediately below, <math>- \eta \frac{\partial E}{\partial w_{ij}}</math> always changes <math>w_{ij}</math> in such a way that <math>E</math> is decreased:

: <math> \Delta w_{ij} = - \eta \frac{\partial E}{\partial w_{ij}} = - \eta o_i \delta_j</math>
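The sketch below (Python with NumPy; the same illustrative toy network as above, with an arbitrary learning rate value) applies the update <math>\Delta w_{ij} = -\eta o_i \delta_j</math> for a few gradient-descent steps and prints <math>E</math>, which should decrease over the steps.

<syntaxhighlight lang="python">
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative toy network: 2 inputs -> 2 logistic hidden neurons -> 1 logistic output.
x = np.array([0.5, -1.0])
W1 = np.array([[0.1, 0.4], [-0.3, 0.2]])
w2 = np.array([0.7, -0.6])
t = 1.0
eta = 0.5                              # learning rate eta > 0 (illustrative value)

for step in range(5):
    # Forward pass.
    o_hidden = logistic(W1 @ x)
    y = logistic(w2 @ o_hidden)
    E = 0.5 * (t - y) ** 2
    print(f"step {step}: E = {E:.6f}")

    # Backward pass: deltas as derived above.
    delta_out = (y - t) * y * (1.0 - y)
    delta_hidden = (w2 * delta_out) * o_hidden * (1.0 - o_hidden)

    # Weight update: Delta w_ij = -eta * o_i * delta_j, added to the old weight.
    w2 += -eta * o_hidden * delta_out
    W1 += -eta * np.outer(delta_hidden, x)
</syntaxhighlight>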