===Fit===

The usual measure of [[goodness of fit]] for a logistic regression uses [[logistic loss]] (or [[log loss]]), the negative [[log-likelihood]]. For a given ''x<sub>k</sub>'' and ''y<sub>k</sub>'', write <math>p_k=p(x_k)</math>. The {{tmath|p_k}} are the probabilities that the corresponding {{tmath|y_k}} will equal one and {{tmath|1-p_k}} are the probabilities that they will be zero (see [[Bernoulli distribution]]). We wish to find the values of {{tmath|\beta_0}} and {{tmath|\beta_1}} which give the "best fit" to the data. In the case of linear regression, the sum of the squared deviations of the fit from the data points (''y<sub>k</sub>''), the [[squared error loss]], is taken as a measure of the goodness of fit, and the best fit is obtained when that function is ''minimized''.

The log loss for the ''k''-th point, {{tmath|\ell_k}}, is:

:<math>\ell_k = \begin{cases} -\ln p_k & \text{ if } y_k = 1, \\ -\ln (1 - p_k) & \text{ if } y_k = 0. \end{cases}</math>

The log loss can be interpreted as the "[[surprisal]]" of the actual outcome {{tmath|y_k}} relative to the prediction {{tmath|p_k}}, and is a measure of [[information content]]. Log loss is always greater than or equal to 0, equals 0 only in the case of a perfect prediction (i.e., when <math>p_k = 1</math> and <math>y_k = 1</math>, or <math>p_k = 0</math> and <math>y_k = 0</math>), and approaches infinity as the prediction gets worse (i.e., when <math>y_k = 1</math> and <math>p_k \to 0</math> or <math>y_k = 0</math> and <math>p_k \to 1</math>), meaning the actual outcome is "more surprising". Since the value of the logistic function is always strictly between zero and one, the log loss is always greater than zero and less than infinity. Unlike in a linear regression, where the model can have zero loss at a point by passing through a data point (and zero loss overall if all points are on a line), in a logistic regression it is not possible to have zero loss at any point, since {{tmath|y_k}} is either 0 or 1 but {{tmath|0 < p_k < 1}}.

These can be combined into a single expression:

:<math>\ell_k = -y_k\ln p_k - (1 - y_k)\ln (1 - p_k).</math>

This expression is more formally known as the [[cross-entropy]] of the predicted distribution <math>\big(p_k, (1-p_k)\big)</math> from the actual distribution <math>\big(y_k, (1-y_k)\big)</math>, as probability distributions on the two-element space of (pass, fail).

The sum of these, the total loss, is the overall negative log-likelihood {{tmath|-\ell}}, and the best fit is obtained for those choices of {{tmath|\beta_0}} and {{tmath|\beta_1}} for which {{tmath|-\ell}} is ''minimized''.

Alternatively, instead of ''minimizing'' the loss, one can ''maximize'' its negative, the (positive) log-likelihood:

:<math>\ell = \sum_{k:y_k=1}\ln(p_k) + \sum_{k:y_k=0}\ln(1-p_k) = \sum_{k=1}^K \left(\,y_k \ln(p_k)+(1-y_k)\ln(1-p_k)\right)</math>

or equivalently maximize the [[likelihood function]] itself, which is the probability that the given data set is produced by a particular logistic function:

:<math>L = \prod_{k:y_k=1}p_k\,\prod_{k:y_k=0}(1-p_k)</math>

This method is known as [[maximum likelihood estimation]].
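
For illustration, the following is a minimal numerical sketch of this fitting procedure, not a prescribed implementation: the data values, variable names, and the use of SciPy's general-purpose <code>minimize</code> routine (rather than a dedicated logistic-regression solver) are all assumptions made for the example. It evaluates the total log loss <math>-\ell</math> for candidate values of {{tmath|\beta_0}} and {{tmath|\beta_1}} and returns the pair that minimizes it, i.e. the maximum likelihood estimate.

<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import minimize

# Hypothetical data for illustration only: x_k is a predictor (e.g. hours studied),
# y_k in {0, 1} is the observed pass/fail outcome.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
y = np.array([0,   0,   0,   0,   1,   0,   1,   1,   1,   1])

def p(beta, x):
    """Predicted probability p_k = 1 / (1 + exp(-(beta0 + beta1 * x_k)))."""
    beta0, beta1 = beta
    return 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))

def neg_log_likelihood(beta, x, y):
    """Total log loss -l: sum over k of -y_k ln(p_k) - (1 - y_k) ln(1 - p_k)."""
    pk = p(beta, x)
    return -np.sum(y * np.log(pk) + (1 - y) * np.log(1 - pk))

# Maximum likelihood estimate: minimize the negative log-likelihood over (beta0, beta1).
result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), args=(x, y))
beta0_hat, beta1_hat = result.x
print(beta0_hat, beta1_hat)
</syntaxhighlight>

Minimizing <math>-\ell</math> and maximizing <math>\ell</math> (or the likelihood <math>L</math>) give the same estimates; the negative form is used above simply because general-purpose optimizers conventionally minimize.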