== Background ==

The following situation is a general setting of many [[supervised learning]] problems. There are two spaces of objects <math>X</math> and <math>Y</math>, and we would like to learn a function <math>\ h: X \to Y</math> (often called a ''hypothesis'') which outputs an object <math>y \in Y</math> given <math>x \in X</math>. To do so, there is a ''training set'' of <math>n</math> examples <math>\ (x_1, y_1), \ldots, (x_n, y_n)</math>, where <math>x_i \in X</math> is an input and <math>y_i \in Y</math> is the corresponding response that is desired from <math>h(x_i)</math>.

To put it more formally, we assume that there is a [[joint probability distribution]] <math>P(x, y)</math> over <math>X</math> and <math>Y</math>, and that the training set consists of <math>n</math> instances <math>\ (x_1, y_1), \ldots, (x_n, y_n)</math> drawn [[i.i.d.]] from <math>P(x, y)</math>. The assumption of a joint probability distribution allows for the modelling of uncertainty in predictions (e.g. from noise in data), because <math>y</math> is not a deterministic function of {{nowrap|<math>x</math>,}} but rather a [[random variable]] with [[conditional distribution]] <math>P(y | x)</math> for a fixed <math>x</math>.

It is also assumed that there is a non-negative real-valued [[loss function]] <math>L(\hat{y}, y)</math> which measures how different the prediction <math>\hat{y}</math> of a hypothesis is from the true outcome <math>y</math>. For classification tasks, these loss functions can be [[scoring rule]]s. The [[Risk (statistics)|risk]] associated with hypothesis <math>h(x)</math> is then defined as the [[Expected value|expectation]] of the loss function:

: <math>R(h) = \mathbf{E}[L(h(x), y)] = \int L(h(x), y)\,dP(x, y).</math>

A loss function commonly used in theory is the [[0-1 loss function]]:

: <math>L(\hat{y}, y) = \begin{cases} 1 & \mbox{if }\quad \hat{y} \ne y \\ 0 & \mbox{if }\quad \hat{y} = y \end{cases}</math>

The ultimate goal of a learning algorithm is to find a hypothesis <math>h^*</math> among a fixed class of functions <math>\mathcal{H}</math> for which the risk <math>R(h)</math> is minimal:

: <math>h^* = \underset{h \in \mathcal{H}}{\operatorname{arg\,min}}\, R(h).</math>

For classification problems, the [[Bayes classifier]] is defined to be the classifier minimizing the risk defined with the 0–1 loss function.
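The true risk <math>R(h)</math> cannot be computed without knowing <math>P(x, y)</math>; in practice it is estimated by averaging the loss over the training sample, which is the empirical risk that gives the subject its name. The following is a minimal sketch in Python of that estimate under the 0–1 loss; the function names and the toy threshold classifier are illustrative assumptions, not part of any standard library.

<syntaxhighlight lang="python">
def zero_one_loss(y_pred, y_true):
    """0-1 loss: 1 if the prediction differs from the true label, else 0."""
    return float(y_pred != y_true)

def empirical_risk(h, xs, ys, loss=zero_one_loss):
    """Average loss of hypothesis h over the sample (x_1, y_1), ..., (x_n, y_n)."""
    return sum(loss(h(x), y) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical example: a threshold classifier on one-dimensional inputs.
h = lambda x: 1 if x > 0.5 else 0
xs = [0.1, 0.4, 0.6, 0.9]
ys = [0, 1, 1, 1]
print(empirical_risk(h, xs, ys))  # 0.25: one of the four examples is misclassified
</syntaxhighlight>

By the law of large numbers, this sample average converges to <math>R(h)</math> as <math>n</math> grows, which is what justifies minimizing it in place of the unknown true risk.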