==Loss functions==
The choice of loss function is a determining factor for the function <math>f_S</math> that will be chosen by the learning algorithm. The loss function also affects the convergence rate of an algorithm. It is important for the loss function to be [[Convex function|convex]].<ref>{{cite journal |last1=Rosasco |first1=Lorenzo |last2=De Vito |first2=Ernesto |last3=Caponnetto |first3=Andrea |last4=Piana |first4=Michele |last5=Verri |first5=Alessandro |date=2004-05-01 |title=Are Loss Functions All the Same? |url=https://direct.mit.edu/neco/article/16/5/1063-1076/6828 |journal=Neural Computation |language=en |volume=16 |issue=5 |pages=1063–1076 |doi=10.1162/089976604773135104 |pmid=15070510 |issn=0899-7667}}</ref>

Different loss functions are used depending on whether the problem is one of regression or one of classification.

===Regression===
The most common loss function for regression is the square loss function (also known as the [[L2-norm]]). This familiar loss function is used in [[Ordinary least squares regression|ordinary least squares regression]]. The form is:
<math display="block">V(f(\mathbf{x}),y) = (y - f(\mathbf{x}))^2</math>

The absolute value loss (also known as the [[L1-norm]]) is also sometimes used:
<math display="block">V(f(\mathbf{x}),y) = |y - f(\mathbf{x})|</math>

===Classification===
{{main|Statistical classification}}
In some sense the 0-1 [[indicator function]] is the most natural loss function for classification. It takes the value 0 if the predicted output is the same as the actual output, and the value 1 if the predicted output is different from the actual output. For binary classification with <math>Y = \{-1, 1\}</math>, this is:
<math display="block">V(f(\mathbf{x}),y) = \theta(- y f(\mathbf{x}))</math>
where <math>\theta</math> is the [[Heaviside step function]].
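As an illustration (a minimal sketch, not drawn from the article's sources), the three losses above can be written directly in Python with NumPy; the function names here are chosen for this sketch rather than taken from any library:

<syntaxhighlight lang="python">
import numpy as np

# Square loss (L2): V(f(x), y) = (y - f(x))^2
def square_loss(y, f_x):
    return (y - f_x) ** 2

# Absolute value loss (L1): V(f(x), y) = |y - f(x)|
def absolute_loss(y, f_x):
    return np.abs(y - f_x)

# 0-1 loss for binary labels y in {-1, 1}: V(f(x), y) = theta(-y * f(x)),
# where theta is the Heaviside step function. The loss is 0 when f(x) has
# the same sign as y, and 1 otherwise; theta(0) is set to 0 here by convention.
def zero_one_loss(y, f_x):
    return np.heaviside(-y * f_x, 0.0)

# Example: evaluate all three losses on a few raw scores f(x).
y_true = np.array([1.0, -1.0, 1.0])
f_x    = np.array([0.8, 0.5, -0.2])
print(square_loss(y_true, f_x))    # [0.04 2.25 1.44]
print(absolute_loss(y_true, f_x))  # [0.2  1.5  1.2 ]
print(zero_one_loss(y_true, f_x))  # [0.   1.   1.  ]
</syntaxhighlight>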
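Note that the square and absolute value losses are convex in <math>f(\mathbf{x})</math>, whereas the 0-1 loss is not, which connects to the remark above on the importance of convexity; the value of the 0-1 loss at the decision boundary <math>f(\mathbf{x}) = 0</math> depends on the convention chosen for <math>\theta(0)</math> (taken to be 0 in the sketch).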