===Structural risk minimization===

[[Structural risk minimization]] seeks to prevent overfitting by incorporating a [[Regularization (mathematics)|regularization penalty]] into the optimization. The regularization penalty can be viewed as implementing a form of [[Occam's razor]] that prefers simpler functions over more complex ones.

A wide variety of penalties have been employed that correspond to different definitions of complexity. For example, consider the case where the function <math>g</math> is a linear function of the form

:<math> g(x) = \sum_{j=1}^d \beta_j x_j</math>.

A popular regularization penalty is <math>\sum_j \beta_j^2</math>, which is the squared [[Euclidean norm]] (the <math>L_2</math> norm) of the weights. Other penalties include the <math>L_1</math> norm, <math>\sum_j |\beta_j|</math>, and the [[L0 "norm"|<math>L_0</math> "norm"]], which is the number of non-zero coefficients <math>\beta_j</math>. The penalty will be denoted by <math>C(g)</math>.
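
As an illustration, the three penalties can be computed for a given weight vector. The following is a minimal sketch using NumPy; the function names and the example vector are assumptions made for this illustration and do not come from any particular library.

```python
import numpy as np

def l2_penalty(beta):
    """Squared Euclidean (L2) norm: sum of squared weights."""
    return float(np.sum(beta ** 2))

def l1_penalty(beta):
    """L1 norm: sum of absolute weights."""
    return float(np.sum(np.abs(beta)))

def l0_penalty(beta):
    """L0 "norm": number of non-zero weights."""
    return int(np.count_nonzero(beta))

# Hypothetical weight vector for a linear function g with d = 3.
beta = np.array([3.0, 0.0, -4.0])

print(l2_penalty(beta))  # 25.0
print(l1_penalty(beta))  # 7.0
print(l0_penalty(beta))  # 2
```

For this vector the squared <math>L_2</math> penalty is dominated by the large weights, while the <math>L_0</math> "norm" only counts how many weights are used.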

The supervised learning optimization problem is to find the function <math>g</math> that minimizes

:<math> J(g) = R_{emp}(g) + \lambda C(g).</math>

The parameter <math>\lambda</math> controls the bias-variance tradeoff. When <math>\lambda = 0</math>, the objective reduces to empirical risk minimization, which has low bias and high variance. When <math>\lambda</math> is large, the learning algorithm will have high bias and low variance. The value of <math>\lambda</math> can be chosen empirically via [[cross-validation (statistics)|cross-validation]].
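
The effect of <math>\lambda</math> can be sketched for one specific case: a linear <math>g</math>, squared-error empirical risk, and the squared <math>L_2</math> penalty (ridge regression), for which the minimizer of <math>J(g)</math> has a closed form. The synthetic data, the grid of <math>\lambda</math> values, and the helper names below are assumptions chosen for illustration only.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Minimize (1/n) * ||X b - y||^2 + lam * ||b||^2 in closed form.

    Setting the gradient to zero gives (X^T X + n * lam * I) b = X^T y.
    """
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

def empirical_risk(X, y, beta):
    """Mean squared error on the training set."""
    return float(np.mean((X @ beta - y) ** 2))

# Synthetic data (assumed): 30 examples, 10 features, 3 of them relevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))
true_beta = np.zeros(10)
true_beta[:3] = [2.0, -1.0, 0.5]
y = X @ true_beta + rng.normal(scale=0.5, size=30)

lambdas = [0.0, 0.01, 0.1, 1.0, 10.0]
fits = [fit_ridge(X, y, lam) for lam in lambdas]
penalties = [float(np.sum(b ** 2)) for b in fits]     # C(g) for each fit
risks = [empirical_risk(X, y, b) for b in fits]       # R_emp(g) for each fit

for lam, c, r in zip(lambdas, penalties, risks):
    print(f"lambda={lam:<5} C(g)={c:.4f} R_emp(g)={r:.4f}")
```

As <math>\lambda</math> grows, the fitted weights shrink (smaller <math>C(g)</math>) while the training error rises, which is the mechanism by which variance is traded for bias. In practice the training error would not be used to pick <math>\lambda</math>; held-out data or cross-validation would be.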

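The complexity penalty has a Bayesian interpretation as the negative log prior probability of <math>g</math>, <math>-\log P(g)</math>. If the empirical risk is likewise taken to be the negative log likelihood of the training data, then <math>J(g)</math> equals the negative log [[posterior probability]] of <math>g</math> up to an additive constant, so minimizing <math>J(g)</math> corresponds to [[maximum a posteriori estimation]].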
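
A short derivation sketch of this reading, under two assumptions not stated above: the empirical risk is the negative log likelihood of the training data <math>D</math>, <math>R_{emp}(g) = -\log P(D \mid g)</math>, and the weighted penalty is the negative log prior, <math>\lambda C(g) = -\log P(g)</math>. By [[Bayes' theorem]],

:<math> J(g) = -\log P(D \mid g) - \log P(g) = -\log P(g \mid D) - \log P(D).</math>

Since <math>\log P(D)</math> does not depend on <math>g</math>, the function that minimizes <math>J(g)</math> is the one that maximizes the posterior probability <math>P(g \mid D)</math>. For example, a zero-mean Gaussian prior on the weights <math>\beta_j</math> yields a penalty proportional to <math>\sum_j \beta_j^2</math>, and a Laplace prior yields one proportional to <math>\sum_j |\beta_j|</math>.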