==Matrix multiplication==
For the basic case of a feedforward network, where nodes in each layer are connected only to nodes in the immediate next layer (without skipping any layers), and there is a loss function that computes a scalar loss for the final output, backpropagation can be understood simply by matrix multiplication.{{efn|This section largely follows and summarizes {{harvtxt|Nielsen|2015}}.}} Essentially, backpropagation evaluates the expression for the derivative of the cost function as a product of derivatives between each layer ''from right to left'' – "backwards" – with the gradient of the weights between each layer being a simple modification of the partial products (the "backwards propagated error").

Given an input–output pair <math>(x, y)</math>, the loss is:
:<math>C(y, f^L(W^L f^{L-1}(W^{L-1} \cdots f^2(W^2 f^1(W^1 x))\cdots)))</math>

To compute this, one starts with the input <math>x</math> and works forward; denote the weighted input of each hidden layer as <math>z^l</math> and the output of hidden layer <math>l</math> as the activation <math>a^l</math>. For backpropagation, the activation <math>a^l</math> as well as the derivatives <math>(f^l)'</math> (evaluated at <math>z^l</math>) must be cached for use during the backwards pass.

The derivative of the loss in terms of the inputs is given by the chain rule; note that each term is a [[total derivative]], evaluated at the value of the network (at each node) on the input <math>x</math>:
:<math>\frac{d C}{d a^L}\cdot \frac{d a^L}{d z^L} \cdot \frac{d z^L}{d a^{L-1}} \cdot \frac{d a^{L-1}}{d z^{L-1}}\cdot \frac{d z^{L-1}}{d a^{L-2}} \cdot \ldots \cdot \frac{d a^1}{d z^1} \cdot \frac{\partial z^1}{\partial x},</math>
where <math>\frac{d a^L}{d z^L}</math> is a [[diagonal matrix]].

These terms are: the derivative of the loss function;{{efn|The derivative of the loss function is a [[covector]], since the loss function is a [[scalar-valued function]] of several variables.}} the derivatives of the activation functions;{{efn|The activation function is applied to each node separately, so the derivative is just the diagonal matrix of the derivative on each node. This is often represented as the [[Hadamard product (matrices)|Hadamard product]] with the vector of derivatives, denoted by <math>(f^l)'\odot</math>, which is mathematically identical but better matches the internal representation of the derivatives as a vector, rather than a diagonal matrix.}} and the matrices of weights:{{efn|Since matrix multiplication is linear, the derivative of multiplying by a matrix is just the matrix: <math>(Wx)' = W</math>.}}
:<math>\frac{d C}{d a^L}\circ (f^L)' \cdot W^L \circ (f^{L-1})' \cdot W^{L-1} \circ \cdots \circ (f^1)' \cdot W^1.</math>

The gradient <math>\nabla</math> is the [[transpose]] of the derivative of the output in terms of the input, so the matrices are transposed and the order of multiplication is reversed, but the entries are the same:
:<math>\nabla_x C = (W^1)^T \cdot (f^1)' \circ \ldots \circ (W^{L-1})^T \cdot (f^{L-1})' \circ (W^L)^T \cdot (f^L)' \circ \nabla_{a^L} C.</math>

Backpropagation then consists essentially of evaluating this expression from right to left (equivalently, multiplying the previous expression for the derivative from left to right), computing the gradient at each layer on the way; there is an added step, because the gradient of the weights is not just a subexpression: there is an extra multiplication.
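The forward caching and the right-to-left evaluation of <math>\nabla_x C</math> can be illustrated with a minimal sketch in Python/NumPy. This fragment is not from {{harvtxt|Nielsen|2015}}; the layer sizes, the choice of sigmoid activations and squared-error loss, and all variable names are assumptions made only for concreteness.

<syntaxhighlight lang="python">
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Illustrative network: sizes, sigmoid activations and squared-error loss are
# arbitrary choices for this sketch, not part of the article.
rng = np.random.default_rng(0)
sizes = [3, 4, 4, 2]                      # input layer plus three weight layers W^1..W^3
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]

x = rng.standard_normal(sizes[0])
y = rng.standard_normal(sizes[-1])

# Forward pass: cache the weighted inputs z^l and the activations a^l.
zs, activations = [], [x]
for W in Ws:
    zs.append(W @ activations[-1])
    activations.append(sigmoid(zs[-1]))

# For C = 1/2 ||a^L - y||^2 the derivative dC/da^L is a^L - y (a covector, stored as a vector).
grad = activations[-1] - y

# Evaluate nabla_x C = (W^1)^T (f^1)' ... (W^L)^T (f^L)' nabla_{a^L} C from right to left;
# (f^l)' acts as a diagonal matrix, i.e. an elementwise (Hadamard) product.
for W, z in zip(reversed(Ws), reversed(zs)):
    grad = W.T @ (sigmoid_prime(z) * grad)

print(grad)   # gradient of the loss with respect to the input x
</syntaxhighlight>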
Introducing the auxiliary quantity <math>\delta^l</math> for the partial products (multiplying from right to left), interpreted as the "error at level <math>l</math>" and defined as the gradient of the input values at level <math>l</math>:
:<math>\delta^l := (f^l)' \circ (W^{l+1})^T\cdot(f^{l+1})' \circ \cdots \circ (W^{L-1})^T \cdot (f^{L-1})' \circ (W^L)^T \cdot (f^L)' \circ \nabla_{a^L} C.</math>
Note that <math>\delta^l</math> is a vector, of length equal to the number of nodes in level <math>l</math>; each component is interpreted as the "cost attributable to (the value of) that node".

The gradient of the weights in layer <math>l</math> is then:
:<math>\nabla_{W^l} C = \delta^l(a^{l-1})^T.</math>
The factor of <math>a^{l-1}</math> is because the weights <math>W^l</math> between level <math>l - 1</math> and <math>l</math> affect level <math>l</math> proportionally to the inputs (activations): the inputs are fixed, the weights vary.

The <math>\delta^l</math> can easily be computed recursively, going from right to left, as:
:<math>\delta^{l-1} := (f^{l-1})' \circ (W^l)^T \cdot \delta^l.</math>
The gradients of the weights can thus be computed using a few matrix multiplications for each level; this is backpropagation.

Compared with naively computing forwards (using the <math>\delta^l</math> for illustration):
:<math>\begin{align}
\delta^1 &= (f^1)' \circ (W^2)^T \cdot (f^2)' \circ \cdots \circ (W^{L-1})^T \cdot (f^{L-1})' \circ (W^L)^T \cdot (f^L)' \circ \nabla_{a^L} C\\
\delta^2 &= (f^2)' \circ \cdots \circ (W^{L-1})^T \cdot (f^{L-1})' \circ (W^L)^T \cdot (f^L)' \circ \nabla_{a^L} C\\
&\vdots\\
\delta^{L-1} &= (f^{L-1})' \circ (W^L)^T \cdot (f^L)' \circ \nabla_{a^L} C\\
\delta^L &= (f^L)' \circ \nabla_{a^L} C,
\end{align}</math>
there are two key differences with backpropagation:
# Computing <math>\delta^{l-1}</math> in terms of <math>\delta^l</math> avoids the obvious duplicate multiplication of layers <math>l</math> and beyond.
# Multiplying starting from <math>\nabla_{a^L} C</math> – propagating the error ''backwards'' – means that each step simply multiplies a vector (<math>\delta^l</math>) by the matrices of weights <math>(W^l)^T</math> and derivatives of activations <math>(f^{l-1})'</math>. By contrast, multiplying forwards, starting from the changes at an earlier layer, means that each multiplication multiplies a ''matrix'' by a ''matrix''. This is much more expensive, and corresponds to tracking every possible path of a change in one layer <math>l</math> forward to changes in the layer <math>l+2</math> (for multiplying <math>W^{l+1}</math> by <math>W^{l+2}</math>, with additional multiplications for the derivatives of the activations), which unnecessarily computes the intermediate quantities of how weight changes affect the values of hidden nodes.
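The recursion for <math>\delta^l</math> and the outer products <math>\delta^l (a^{l-1})^T</math> can likewise be written out as a short sketch, under the same illustrative assumptions as the previous fragment (sigmoid activations, squared-error loss, arbitrary layer sizes and names; the function <code>backprop</code> below is a hypothetical helper, not an established API).

<syntaxhighlight lang="python">
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(Ws, x, y):
    """Return [nabla_{W^1} C, ..., nabla_{W^L} C] for C = 1/2 ||a^L - y||^2 with sigmoid layers."""
    # Forward pass: cache weighted inputs z^l and activations a^l (with a^0 = x).
    zs, activations = [], [x]
    for W in Ws:
        zs.append(W @ activations[-1])
        activations.append(sigmoid(zs[-1]))

    # delta^L = (f^L)'(z^L) * dC/da^L; (f^l)' acts elementwise, like a diagonal matrix.
    delta = sigmoid_prime(zs[-1]) * (activations[-1] - y)
    grads = [np.outer(delta, activations[-2])]           # nabla_{W^L} C = delta^L (a^{L-1})^T

    # Recursion delta^{l-1} = (f^{l-1})'(z^{l-1}) * (W^l)^T delta^l, from right to left;
    # each step is one matrix-vector product plus one outer product.
    for l in range(len(Ws) - 1, 0, -1):
        delta = sigmoid_prime(zs[l - 1]) * (Ws[l].T @ delta)
        grads.append(np.outer(delta, activations[l - 1]))
    return grads[::-1]                                    # ordered nabla_{W^1} C ... nabla_{W^L} C

rng = np.random.default_rng(0)
Ws = [rng.standard_normal(s) for s in [(4, 3), (4, 4), (2, 4)]]
x, y = rng.standard_normal(3), rng.standard_normal(2)
for l, g in enumerate(backprop(Ws, x, y), start=1):
    print(f"grad of W^{l} has shape {g.shape}")           # matches the shape of W^l
</syntaxhighlight>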