==Overview==
Backpropagation computes the gradient in [[parameter space|weight space]] of a feedforward neural network, with respect to a [[loss function]]. Denote:
* <math>x</math>: input (vector of features)
* <math>y</math>: target output
*: For classification, output will be a vector of class probabilities (e.g., <math>(0.1, 0.7, 0.2)</math>), and target output is a specific class, encoded by the [[one-hot]]/[[Dummy variable (statistics)|dummy variable]] (e.g., <math>(0, 1, 0)</math>).
* <math>C</math>: [[loss function]] or "cost function"{{efn|Use <math>C</math> for the loss function to allow <math>L</math> to be used for the number of layers}}
*: For classification, this is usually [[cross-entropy]] (XC, [[log loss]]), while for regression it is usually [[squared error loss]] (SEL).
* <math>L</math>: the number of layers
* <math>W^l = (w^l_{jk})</math>: the weights between layer <math>l - 1</math> and <math>l</math>, where <math>w^l_{jk}</math> is the weight between the <math>k</math>-th node in layer <math>l - 1</math> and the <math>j</math>-th node in layer <math>l</math>{{efn|This follows {{harvtxt|Nielsen|2015}}, and means (left) multiplication by the matrix <math>W^l</math> corresponds to converting output values of layer <math>l - 1</math> to input values of layer <math>l</math>: columns correspond to input coordinates, rows correspond to output coordinates.}}
* <math>f^l</math>: [[activation function]]s at layer <math>l</math>
*: For classification the last layer is usually the [[logistic function]] for binary classification, and [[softmax function|softmax]] (softargmax) for multi-class classification, while for the hidden layers this was traditionally a [[sigmoid function]] (logistic function or others) on each node (coordinate), but today is more varied, with the [[Rectifier (neural networks)|rectifier]] ([[ramp function|ramp]], [[ReLU]]) being common.
* <math>a^l_j</math>: activation of the <math>j</math>-th node in layer <math>l</math>.

In the derivation of backpropagation, other intermediate quantities are used; they are introduced as needed below. Bias terms are not treated specially since they correspond to a weight with a fixed input of 1. For backpropagation the specific loss function and activation functions do not matter as long as they and their derivatives can be evaluated efficiently. Traditional activation functions include the sigmoid, [[tanh]], and [[Rectifier (neural networks)|ReLU]]. [[Swish function|Swish]],<ref>{{cite arXiv|last1=Ramachandran|first1=Prajit|last2=Zoph|first2=Barret|last3=Le|first3=Quoc V.|date=2017-10-27|title=Searching for Activation Functions|class=cs.NE|eprint=1710.05941}}</ref> [[Rectifier (neural networks)#Mish|mish]],<ref>{{cite arXiv|last=Misra|first=Diganta|date=2019-08-23|title=Mish: A Self Regularized Non-Monotonic Activation Function|class=cs.LG|eprint=1908.08681|language=en}}</ref> and other activation functions have since been proposed as well.

The overall network is a combination of [[function composition]] and [[matrix multiplication]]:
:<math>g(x) := f^L(W^L f^{L-1}(W^{L-1} \cdots f^1(W^1 x)\cdots))</math>

For a training set there will be a set of input–output pairs, <math>\left\{(x_i, y_i)\right\}</math>.
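As an illustration, the forward pass <math>g(x)</math> is just alternating matrix multiplications and activation functions. The following is a minimal sketch assuming NumPy; the layer sizes, activation choices, and names (<code>forward</code>, <code>relu</code>, <code>softmax</code>) are illustrative only, not prescribed by backpropagation itself.

<syntaxhighlight lang="python">
import numpy as np

def forward(x, weights, activations):
    """Evaluate g(x) = f^L(W^L f^{L-1}(... f^1(W^1 x) ...)).

    weights[l-1] is the matrix W^l; activations[l-1] is the function f^l.
    """
    a = x
    for W, f in zip(weights, activations):
        a = f(W @ a)  # weighted input W^l a^{l-1}, then activation f^l
    return a

# Example: two layers, ReLU hidden units, softmax output.
relu = lambda z: np.maximum(z, 0.0)
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(3, 4))
x = np.array([0.5, -1.0, 2.0])
print(forward(x, [W1, W2], [relu, softmax]))  # a vector of 4 class probabilities summing to 1
</syntaxhighlight>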
For each input–output pair <math>(x_i, y_i)</math> in the training set, the loss of the model on that pair is the cost of the difference between the predicted output <math>g(x_i)</math> and the target output <math>y_i</math>:
:<math>C(y_i, g(x_i))</math>

Note the distinction: during model evaluation the weights are fixed while the inputs vary (and the target output may be unknown), and the network ends with the output layer (it does not include the loss function). During model training the input–output pair is fixed while the weights vary, and the network ends with the loss function.

Backpropagation computes the gradient for a ''fixed'' input–output pair <math>(x_i, y_i)</math>, where the weights <math>w^l_{jk}</math> can vary. Each individual component of the gradient, <math>\partial C/\partial w^l_{jk},</math> can be computed by the chain rule; but doing this separately for each weight is inefficient. Backpropagation efficiently computes the gradient by avoiding duplicate calculations and not computing unnecessary intermediate values, by computing the gradient of each layer – specifically the gradient of the weighted ''input'' of each layer, denoted by <math>\delta^l</math> – from back to front.

Informally, the key point is that since the only way a weight in <math>W^l</math> affects the loss is through its effect on the ''next'' layer, and it does so ''linearly'', <math>\delta^l</math> is the only data needed to compute the gradients of the weights at layer <math>l</math>; the gradients of the weights of the previous layer can then be computed from <math>\delta^{l-1}</math>, and so on recursively. This avoids inefficiency in two ways. First, it avoids duplication because when computing the gradient at layer <math>l</math>, it is unnecessary to recompute all the derivatives on later layers <math>l+1, l+2, \ldots</math> each time. Second, it avoids unnecessary intermediate calculations, because at each stage it directly computes the gradient of the weights with respect to the ultimate output (the loss), rather than unnecessarily computing the derivatives of the values of hidden layers with respect to changes in weights <math>\partial a^{l'}_{j'}/\partial w^l_{jk}</math>.

Backpropagation can be expressed for simple feedforward networks in terms of [[#Matrix multiplication|matrix multiplication]], or more generally in terms of the [[#Adjoint graph|adjoint graph]].
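A minimal sketch of this back-to-front computation, assuming NumPy, sigmoid activations on every layer, and squared-error loss (these choices and the name <code>backprop</code> are illustrative only):

<syntaxhighlight lang="python">
import numpy as np

def backprop(x, y, weights):
    """Return dC/dW^l for every layer, for C = 0.5 * ||a^L - y||^2 and
    sigmoid activations, computing delta^l from back to front."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    # Forward pass, storing every activation a^0 = x, a^1, ..., a^L.
    a, activations = x, [x]
    for W in weights:
        a = sigmoid(W @ a)
        activations.append(a)

    # delta^L = dC/dz^L = (a^L - y) * sigma'(z^L), using sigma'(z) = a (1 - a).
    delta = (activations[-1] - y) * activations[-1] * (1 - activations[-1])

    grads = [None] * len(weights)
    for l in reversed(range(len(weights))):
        # Gradient for this layer's weights: outer product of delta with the
        # previous layer's activation, i.e. dC/dw_{jk} = delta_j * a_k.
        grads[l] = np.outer(delta, activations[l])
        if l > 0:
            # Recursion: propagate the error one layer back,
            # delta <- (W^T delta) * sigma'(z) of the previous layer.
            delta = (weights[l].T @ delta) * activations[l] * (1 - activations[l])
    return grads
</syntaxhighlight>

Each <code>delta</code> is reused for the layer below it, so no derivative on a later layer is recomputed, and no intermediate quantity <math>\partial a^{l'}_{j'}/\partial w^l_{jk}</math> is ever formed.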