Empirical risk minimization
== Properties ==
{{Expand section|date=February 2023}}

Guarantees for the performance of empirical risk minimization depend strongly on the function class selected as well as the distributional assumptions made.<ref name=lg>{{Cite book |last1=Györfi |first1=László |title=A Distribution-Free Theory of Nonparametric Regression |last2=Kohler |first2=Michael |last3=Krzyzak |first3=Adam |last4=Walk |first4=Harro |date=2010-12-01 |publisher=Springer |isbn=978-1-4419-2998-3 |edition=Softcover reprint of the original 1st |location=New York |language=en}}</ref> In general, distribution-free methods are too coarse and do not lead to practical bounds. However, they are still useful in deriving asymptotic properties of learning algorithms, such as [[Consistency (statistics)|consistency]]. In particular, distribution-free bounds on the performance of empirical risk minimization over a fixed function class can be derived using bounds on the [[Vapnik–Chervonenkis dimension|VC complexity]] of that class.

For simplicity, consider the case of binary classification. It is possible to bound the probability that the selected classifier <math>\phi_n</math> is much worse than the best possible classifier <math>\phi^*</math>. Consider the risk <math>L</math> defined over the hypothesis class <math>\mathcal C</math> with [[growth function]] <math>\mathcal S(\mathcal C, n)</math>, given a dataset of size <math>n</math>. Then, for every <math>\epsilon > 0</math>:<ref name='patt'>Devroye, L., Györfi, L. & Lugosi, G. ''A Probabilistic Theory of Pattern Recognition''. Springer, New York (1996).</ref>

<math display='block'> \mathbb P \left (L(\phi_n) - L(\phi^*) > \epsilon \right ) \leq 8 \mathcal S(\mathcal C, n) \exp\{-n\epsilon^2 / 32\} </math>

Similar results hold for regression tasks.<ref name=lg/> These results are often based on [[uniform law of large numbers|uniform laws of large numbers]], which control the deviation of the empirical risk from the true risk uniformly over the hypothesis class.<ref name='patt'/>

=== Impossibility results ===
It is also possible to show lower bounds on algorithm performance when no distributional assumptions are made.<ref>{{Cite journal |last1=Devroye |first1=Luc |last2=Györfi |first2=László |last3=Lugosi |first3=Gábor |date=1996 |title=A Probabilistic Theory of Pattern Recognition |url=https://link.springer.com/book/10.1007/978-1-4612-0711-5 |journal=Stochastic Modelling and Applied Probability |volume=31 |language=en |doi=10.1007/978-1-4612-0711-5 |isbn=978-1-4612-6877-2 |issn=0172-4568|url-access=subscription }}</ref> This is sometimes referred to as the ''[[No free lunch theorem]]''. Even though a specific learning algorithm may provide asymptotically optimal performance for any distribution, its finite-sample performance is always poor for at least one data distribution. This means that no classifier can improve on the error of every other classifier, for a given sample size, across all distributions.<ref name='patt'/> Specifically, let <math>\epsilon > 0</math>. Then, for every sample size <math>n</math> and classification rule <math>\phi_n</math>, there exists a distribution of <math>(X, Y)</math> with risk <math>L^* = 0</math> (meaning that perfect prediction is possible) such that:<ref name='patt'/>

<math display='block'>\mathbb E L_n \geq 1/2 - \epsilon,</math>

where <math>L_n = L(\phi_n)</math> denotes the risk of the rule trained on a sample of size <math>n</math>. It is further possible to show that the convergence rate of a learning algorithm is poor for some distributions.
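The distribution-free VC bound above is straightforward to evaluate numerically. The sketch below (in Python; the function name <code>vc_bound</code> is illustrative, and the use of Sauer's lemma, <math>\mathcal S(\mathcal C, n) \leq (en/d)^d</math> for a class of VC dimension <math>d</math> and <math>n \geq d</math>, is an assumption not stated in this section) shows why such bounds are called coarse: they are vacuous for moderate sample sizes and only become meaningful for very large <math>n</math>.

```python
import math

def vc_bound(n: int, d: int, eps: float) -> float:
    """Upper bound on P(L(phi_n) - L(phi*) > eps) for a hypothesis
    class of VC dimension d, using Sauer's lemma to bound the
    growth function: S(C, n) <= (e*n/d)**d for n >= d."""
    growth = (math.e * n / d) ** d
    return 8 * growth * math.exp(-n * eps**2 / 32)

# The bound exceeds 1 (vacuous) for small n and shrinks as n grows:
for n in (10_000, 100_000, 1_000_000):
    print(n, vc_bound(n, d=3, eps=0.1))
```

With <math>d = 3</math> and <math>\epsilon = 0.1</math>, the bound is astronomically larger than 1 at <math>n = 10{,}000</math> and only becomes tiny around <math>n = 10^6</math>, illustrating why these guarantees are mainly of asymptotic interest.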
Specifically, given a sequence of decreasing positive numbers <math>a_n</math> converging to zero, it is possible to find a distribution such that

<math display='block'> \mathbb E L_n \geq a_n</math>

for all <math>n</math>. This result shows that universally good classification rules do not exist, in the sense that every rule must perform poorly on at least one distribution.<ref name='patt'/>

=== Computational complexity ===
Empirical risk minimization for a classification problem with a [[0-1 loss function]] is known to be an [[NP-hard]] problem even for a relatively simple class of functions such as [[linear classifier]]s.<ref>V. Feldman, V. Guruswami, P. Raghavendra and Yi Wu (2009). [https://arxiv.org/abs/1012.0729 ''Agnostic Learning of Monomials by Halfspaces is Hard.''] (See the paper and references therein)</ref> Nevertheless, it can be solved efficiently when the minimal empirical risk is zero, i.e., when the data are [[linearly separable]].{{Cn|date=December 2023}}

In practice, machine learning algorithms cope with this issue either by employing a [[Convex optimization|convex approximation]] to the 0–1 loss function (such as the [[hinge loss]] used by [[Support vector machine|SVMs]]), which is easier to optimize, or by imposing assumptions on the distribution <math>P(x, y)</math> (and thus ceasing to be agnostic learning algorithms, to which the above hardness result applies). In the case of convexification, Zhang's lemma bounds the excess risk of the original problem in terms of the excess risk of the convexified problem.<ref>{{Cite web |title=Mathematics of Machine Learning Lecture 9 Notes {{!}} Mathematics of Machine Learning {{!}} Mathematics |url=https://ocw.mit.edu/courses/18-657-mathematics-of-machine-learning-fall-2015/resources/mit18_657f15_l9/ |access-date=2023-10-28 |website=MIT OpenCourseWare |language=en}}</ref> Minimizing the latter by convex optimization therefore also controls the former.
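The convex-surrogate workaround can be sketched in a few lines of Python with NumPy. In the example below (the dataset, step size, and iteration count are arbitrary illustrative choices, not taken from the sources above), the intractable 0–1 empirical risk is left as a diagnostic only, while training minimizes the convex hinge loss by subgradient descent; on separable data this also drives the 0–1 risk down.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linearly separable data: labels come from a ground-truth linear rule.
X = rng.normal(size=(200, 2))
y = np.sign(X @ np.array([1.0, 0.5]))

def zero_one_risk(w):
    """Empirical 0-1 risk of sign(<w, x>): NP-hard to minimize directly."""
    return float(np.mean(np.sign(X @ w) != y))

def hinge_risk(w):
    """Convex surrogate: average hinge loss max(0, 1 - y*<w, x>)."""
    return float(np.mean(np.maximum(0.0, 1.0 - y * (X @ w))))

# Subgradient descent on the convex surrogate (not on the 0-1 loss).
w = np.zeros(2)
for _ in range(500):
    margins = y * (X @ w)
    active = margins < 1.0  # points whose hinge term has nonzero subgradient
    grad = -(y[active, None] * X[active]).sum(axis=0) / len(X)
    w -= 0.1 * grad

print("hinge risk:", hinge_risk(w))
print("0-1 risk:", zero_one_risk(w))
```

This is the design choice the paragraph above describes: the surrogate is convex, so a simple first-order method provably makes progress on it, and by Zhang's lemma small excess surrogate risk keeps the excess 0–1 risk under control.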