===Learning as an optimization problem===
To understand the mathematical derivation of the backpropagation algorithm, it helps to first develop some intuition about the relationship between the actual output of a neuron and the correct output for a particular training example. Consider a simple neural network with two input units, one output unit and no hidden units, in which each neuron uses a [[Artificial neuron#Linear combination|linear output]] (unlike most work on neural networks, in which the mapping from inputs to outputs is non-linear){{efn|One may notice that multi-layer neural networks use non-linear activation functions, so an example with linear neurons may seem obscure. However, even though the error surface of a multi-layer network is much more complicated, locally it can be approximated by a paraboloid. Linear neurons are therefore used for simplicity and easier understanding.}} that is the weighted sum of its inputs.

[[File:A simple neural network with two input units and one output unit.png|thumb|250px|A simple neural network with two input units (each with a single input) and one output unit (with two inputs)]]

Initially, before training, the weights are set randomly. The neuron then learns from [[Training set|training examples]], which in this case consist of a set of [[tuple]]s <math>(x_1, x_2, t)</math> where <math>x_1</math> and <math>x_2</math> are the inputs to the network and {{mvar|t}} is the correct output (the output the network should produce given those inputs, once it has been trained). The initial network, given <math>x_1</math> and <math>x_2</math>, will compute an output {{mvar|y}} that likely differs from {{mvar|t}} (since the weights are random). A [[loss function]] <math>L(t, y)</math> measures the discrepancy between the target output {{mvar|t}} and the computed output {{mvar|y}}.
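The linear neuron and loss just described can be sketched in a few lines of code (a minimal illustration; the function names <code>neuron_output</code> and <code>squared_error</code> are made up here, not taken from the article):

```python
def neuron_output(x1, x2, w1, w2):
    # Linear output: the weighted sum of the inputs, with no non-linear activation.
    return x1 * w1 + x2 * w2

def squared_error(t, y):
    # Loss L(t, y) = (t - y)^2: the discrepancy between target t and output y.
    return (t - y) ** 2
```

With randomly initialized weights, <code>neuron_output</code> will generally not match the target, and <code>squared_error</code> quantifies by how much.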
For [[regression analysis]] problems the squared error can be used as a loss function; for [[Statistical classification|classification]] the [[cross-entropy|categorical cross-entropy]] can be used.

As an example, consider a regression problem using the squared error as a loss:

:<math>L(t, y)= (t-y)^2 = E,</math>

where {{mvar|E}} is the discrepancy or error.

Consider the network on a single training case: <math>(1, 1, 0)</math>. Thus, the inputs <math>x_1</math> and <math>x_2</math> are both 1 and the correct output {{mvar|t}} is 0. Now if the relation between the network's output {{mvar|y}} (on the horizontal axis) and the error {{mvar|E}} (on the vertical axis) is plotted, the result is a parabola. The [[Maxima and minima|minimum]] of the [[parabola]] corresponds to the output {{mvar|y}} which minimizes the error {{mvar|E}}. For a single training case, the minimum also touches the horizontal axis, which means the error will be zero and the network can produce an output {{mvar|y}} that exactly matches the target output {{mvar|t}}. Therefore, the problem of mapping inputs to outputs can be reduced to an [[optimization problem]] of finding a function that will produce the minimal error.

[[File:Error surface of a linear neuron for a single training case.png|right|thumb|250px|Error surface of a linear neuron for a single training case]]

However, the output of a neuron depends on the weighted sum of all its inputs:

:<math>y=x_1w_1 + x_2w_2,</math>

where <math>w_1</math> and <math>w_2</math> are the weights on the connections from the input units to the output unit. Therefore, the error also depends on the incoming weights to the neuron, which is ultimately what needs to be changed in the network to enable learning.
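The parabola described above can be checked numerically for the training case <math>(1, 1, 0)</math>, where {{mvar|t}} = 0 and therefore <math>E = y^2</math>, by sweeping candidate outputs {{mvar|y}} (an illustrative sketch, not from the article):

```python
t = 0                                      # target output for the case (1, 1, 0)
ys = [i / 10 for i in range(-20, 21)]      # candidate network outputs y in [-2, 2]
errors = [(t - y) ** 2 for y in ys]        # the parabola E = (t - y)^2 = y^2
best_y = ys[errors.index(min(errors))]     # minimum of the parabola, at y = 0
```

The minimum sits at {{mvar|y}} = 0 with zero error, touching the horizontal axis as the text describes.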
In this example, upon injecting the training data <math>(1, 1, 0)</math>, the loss function becomes

:<math>E = (t-y)^2 = y^2 = (x_1w_1 + x_2w_2)^2 = (w_1 + w_2)^2.</math>

The loss function <math>E</math> then takes the form of a parabolic cylinder with its base directed along the line <math>w_1 = -w_2</math>. Since all sets of weights that satisfy <math>w_1 = -w_2</math> minimize the loss function, in this case additional constraints are required to converge to a unique solution. Additional constraints could be generated either by setting specific conditions on the weights or by injecting additional training data.

One commonly used algorithm to find the set of weights that minimizes the error is [[gradient descent]]. Backpropagation computes the direction of steepest descent of the loss function with respect to the current synaptic weights; the weights can then be updated a small step along that direction, efficiently reducing the error.
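The gradient-descent step just described can be sketched numerically for <math>E = (w_1 + w_2)^2</math>. Since <math>\partial E/\partial w_1 = \partial E/\partial w_2 = 2(w_1 + w_2)</math>, both weights move by the same amount at each step (the initial weights and learning rate below are illustrative choices, not from the article):

```python
w1, w2 = 0.6, 0.9        # arbitrary initial weights
lr = 0.1                 # learning rate (illustrative value)
for _ in range(100):
    grad = 2 * (w1 + w2)     # shared partial derivative dE/dw1 = dE/dw2
    w1 -= lr * grad          # step along the steepest-descent direction
    w2 -= lr * grad
error = (w1 + w2) ** 2       # E is driven to (numerically) zero
```

Note that the difference <code>w1 - w2</code> never changes, so the run converges to a single point on the minimizing line <math>w_1 = -w_2</math> determined by the initialization, which illustrates why additional constraints are needed for a unique solution.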
Edit summary
(Briefly describe your changes)
By publishing changes, you agree to the
Terms of Use
, and you irrevocably agree to release your contribution under the
CC BY-SA 4.0 License
and the
GFDL
. You agree that a hyperlink or URL is sufficient attribution under the Creative Commons license.
Cancel
Editing help
(opens in new window)