== Background ==

The following situation is a general setting of many [[supervised learning]] problems. There are two spaces of objects <math>X</math> and <math>Y</math>, and we would like to learn a function <math>\ h: X \to Y</math> (often called a ''hypothesis'') which outputs an object <math>y \in Y</math> given <math>x \in X</math>. To do so, there is a ''training set'' of <math>n</math> examples <math>\ (x_1, y_1), \ldots, (x_n, y_n)</math>, where <math>x_i \in X</math> is an input and <math>y_i \in Y</math> is the corresponding response that is desired from <math>h(x_i)</math>.

To put it more formally, we assume that there is a [[joint probability distribution]] <math>P(x, y)</math> over <math>X</math> and <math>Y</math>, and that the training set consists of <math>n</math> instances <math>\ (x_1, y_1), \ldots, (x_n, y_n)</math> drawn [[i.i.d.]] from <math>P(x, y)</math>. The assumption of a joint probability distribution allows for the modelling of uncertainty in predictions (e.g. from noise in data), because <math>y</math> is not a deterministic function of {{nowrap|<math>x</math>,}} but rather a [[random variable]] with [[conditional distribution]] <math>P(y | x)</math> for a fixed <math>x</math>.

It is also assumed that there is a non-negative real-valued [[loss function]] <math>L(\hat{y}, y)</math> which measures how different the prediction <math>\hat{y}</math> of a hypothesis is from the true outcome <math>y</math>. For classification tasks, these loss functions can be [[scoring rule]]s. The [[Risk (statistics)|risk]] associated with hypothesis <math>h(x)</math> is then defined as the [[Expected value|expectation]] of the loss function:

: <math>R(h) = \mathbf{E}[L(h(x), y)] = \int L(h(x), y)\,dP(x, y).</math>

A loss function commonly used in theory is the [[0-1 loss function]]:

: <math>L(\hat{y}, y) = \begin{cases} 1 & \mbox{if }\quad \hat{y} \ne y \\ 0 & \mbox{if }\quad \hat{y} = y \end{cases}</math>

The ultimate goal of a learning algorithm is to find a hypothesis <math>h^*</math> among a fixed class of functions <math>\mathcal{H}</math> for which the risk <math>R(h)</math> is minimal:

: <math>h^* = \underset{h \in \mathcal{H}}{\operatorname{arg\,min}}\, R(h).</math>

For classification problems, the [[Bayes classifier]] is defined to be the classifier minimizing the risk defined with the 0–1 loss function.
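The true risk <math>R(h)</math> cannot be computed without knowing <math>P(x, y)</math>; in practice it is estimated by averaging the loss over the training sample, which is the empirical risk that gives the subject its name. The following is a minimal sketch in Python of that estimate under the 0–1 loss; the function names and the toy threshold classifier are illustrative assumptions, not part of any standard library.

<syntaxhighlight lang="python">
def zero_one_loss(y_pred, y_true):
    """0-1 loss: 1 if the prediction differs from the true label, else 0."""
    return float(y_pred != y_true)

def empirical_risk(h, xs, ys, loss=zero_one_loss):
    """Average loss of hypothesis h over the sample (x_1, y_1), ..., (x_n, y_n)."""
    return sum(loss(h(x), y) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical example: a threshold classifier on one-dimensional inputs.
h = lambda x: 1 if x > 0.5 else 0
xs = [0.1, 0.4, 0.6, 0.9]
ys = [0, 1, 1, 1]
print(empirical_risk(h, xs, ys))  # 0.25: one of the four examples is misclassified
</syntaxhighlight>

By the law of large numbers, this sample average converges to <math>R(h)</math> as <math>n</math> grows, which is what justifies minimizing it in place of the unknown true risk.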