Statistical learning theory
==Formal description==

Take <math>X</math> to be the [[vector space]] of all possible inputs, and <math>Y</math> to be the vector space of all possible outputs. Statistical learning theory takes the perspective that there is some unknown [[probability distribution]] over the product space <math>Z = X \times Y</math>; that is, there exists some unknown <math>p(z) = p(\mathbf{x},y)</math>. The training set is made up of <math>n</math> samples drawn from this probability distribution, and is denoted
<math display="block">S = \{(\mathbf{x}_1,y_1), \dots ,(\mathbf{x}_n,y_n)\} = \{\mathbf{z}_1, \dots ,\mathbf{z}_n\}</math>
Every <math>\mathbf{x}_i</math> is an input vector from the training data, and <math>y_i</math> is the output that corresponds to it.

In this formalism, the inference problem consists of finding a function <math>f: X \to Y</math> such that <math>f(\mathbf{x}) \sim y</math>. Let <math>\mathcal{H}</math> be the space of functions <math>f: X \to Y</math>, called the hypothesis space, through which the algorithm will search. Let <math>V(f(\mathbf{x}),y)</math> be the [[loss function]], a measure of the difference between the predicted value <math>f(\mathbf{x})</math> and the actual value <math>y</math>. The [[expected risk]] is defined as
<math display="block">I[f] = \int_{X \times Y} V(f(\mathbf{x}),y)\, p(\mathbf{x},y) \,d\mathbf{x} \,dy</math>
The target function, the best possible function <math>f</math> that can be chosen, is the one that satisfies
<math display="block">f = \mathop{\operatorname{argmin}}_{h \in \mathcal{H}} I[h]</math>

Because the probability distribution <math>p(\mathbf{x},y)</math> is unknown, a proxy measure for the expected risk must be used. This measure is based on the training set, a sample from the unknown distribution, and is called the [[empirical risk]]:
<math display="block">I_S[f] = \frac{1}{n} \sum_{i=1}^n V( f(\mathbf{x}_i),y_i)</math>
A learning algorithm that chooses the function <math>f_S</math> minimizing the empirical risk is said to perform [[empirical risk minimization]].
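The definitions above can be made concrete with a minimal numerical sketch of empirical risk minimization. Here the setup is entirely illustrative and not part of the theory: squared loss <math>V(f(\mathbf{x}),y) = (f(\mathbf{x})-y)^2</math>, a hypothesis space of linear functions <math>f_w(x) = wx</math> restricted to a finite grid of slopes, and a synthetic data-generating distribution with true slope 2.

```python
import numpy as np

# Illustrative setup: X = R, Y = R, squared loss V(f(x), y) = (f(x) - y)^2.
# Hypothesis space H: linear functions f_w(x) = w*x over a grid of slopes w.

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(-1, 1, size=n)
y = 2.0 * x + rng.normal(scale=0.1, size=n)  # n samples from an "unknown" p(x, y)

def empirical_risk(w, x, y):
    """I_S[f_w] = (1/n) * sum_i V(f_w(x_i), y_i) with squared loss."""
    return np.mean((w * x - y) ** 2)

# Empirical risk minimization: pick the hypothesis minimizing I_S over H.
candidates = np.linspace(-5, 5, 1001)          # finite hypothesis space
risks = [empirical_risk(w, x, y) for w in candidates]
w_star = candidates[int(np.argmin(risks))]      # the f_S chosen by ERM
```

With enough samples, the minimizer of the empirical risk lands close to the slope that minimizes the expected risk under the true distribution; how reliably this happens is exactly what generalization bounds in statistical learning theory quantify.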