==Example==

===Problem===
As a simple example, we can use a logistic regression with one explanatory variable and two categories to answer the following question:

<blockquote>
A group of 20 students spends between 0 and 6 hours studying for an exam. How does the number of hours spent studying affect the probability of the student passing the exam?
</blockquote>

The reason for using logistic regression for this problem is that the values of the dependent variable, pass and fail, while represented by "1" and "0", are not [[cardinal number]]s. If the problem were changed so that pass/fail was replaced with the grade 0–100 (cardinal numbers), then simple [[regression analysis]] could be used.

The table shows the number of hours each student spent studying, and whether they passed (1) or failed (0).

{| class="wikitable"
|-
! Hours (''x<sub>k</sub>'')
| 0.50|| 0.75|| 1.00|| 1.25|| 1.50|| 1.75|| 1.75|| 2.00|| 2.25|| 2.50|| 2.75|| 3.00|| 3.25|| 3.50|| 4.00|| 4.25|| 4.50|| 4.75|| 5.00|| 5.50
|-
! Pass (''y<sub>k</sub>'')
| 0|| 0|| 0|| 0|| 0|| 0|| 1|| 0|| 1|| 0|| 1|| 0|| 1|| 0|| 1|| 1|| 1|| 1|| 1|| 1
|}

We wish to fit a logistic function to the data consisting of the hours studied (''x<sub>k</sub>'') and the outcome of the test (''y<sub>k</sub>'' = 1 for pass, 0 for fail). The data points are indexed by the subscript ''k'', which runs from <math>k=1</math> to <math>k=K=20</math>. The ''x'' variable is called the "[[explanatory variable]]", and the ''y'' variable is called the "[[categorical variable]]" consisting of two categories: "pass" or "fail", corresponding to the categorical values 1 and 0 respectively.

===Model===
[[File:Exam pass logistic curve.svg|thumb|400px|Graph of a logistic regression curve fitted to the (''x<sub>k</sub>'',''y<sub>k</sub>'') data. The curve shows the probability of passing an exam versus hours studying.]]

The [[logistic function]] is of the form:

:<math>p(x)=\frac{1}{1+e^{-(x-\mu)/s}}</math>

where ''μ'' is a [[location parameter]] (the midpoint of the curve, where <math>p(\mu)=1/2</math>) and ''s'' is a [[scale parameter]]. This expression may be rewritten as:

:<math>p(x)=\frac{1}{1+e^{-(\beta_0+\beta_1 x)}}</math>

where <math>\beta_0 = -\mu/s</math> is known as the [[vertical intercept|intercept]] (it is the ''vertical'' intercept or ''y''-intercept of the line <math>y = \beta_0+\beta_1 x</math>), and <math>\beta_1= 1/s</math> (inverse scale parameter or [[rate parameter]]): these are the ''y''-intercept and slope of the log-odds as a function of ''x''. Conversely, <math>\mu=-\beta_0/\beta_1</math> and <math>s=1/\beta_1</math>. Note that this model is an oversimplification, since it assumes everybody will eventually pass if they study long enough (the limit of <math>p(x)</math> is 1).

===Fit===
The usual measure of [[goodness of fit]] for a logistic regression uses [[logistic loss]] (or [[log loss]]), the negative [[log-likelihood]]. For a given ''x<sub>k</sub>'' and ''y<sub>k</sub>'', write <math>p_k=p(x_k)</math>. The {{tmath|p_k}} are the probabilities that the corresponding {{tmath|y_k}} will equal one and {{tmath|1-p_k}} are the probabilities that they will be zero (see [[Bernoulli distribution]]). We wish to find the values of {{tmath|\beta_0}} and {{tmath|\beta_1}} which give the "best fit" to the data. In the case of linear regression, the sum of the squared deviations of the fit from the data points (''y<sub>k</sub>''), the [[squared error loss]], is taken as a measure of the goodness of fit, and the best fit is obtained when that function is ''minimized''.
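For concreteness, the data and the model of the preceding subsections can be written out in a few lines of code. The following minimal sketch (Python is assumed here purely for illustration; it is not part of the original example) evaluates the logistic curve at each ''x<sub>k</sub>'' for an arbitrary trial pair of coefficients; how good such a trial fit is will be measured by the log loss defined next.

<syntaxhighlight lang="python">
# Minimal sketch (standard library only) of the example data and the model
# p(x) = 1 / (1 + exp(-(beta0 + beta1 * x))).  The trial coefficients below
# are arbitrary placeholders, not the fitted values derived later on.
import math

hours  = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
          2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]
passed = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
          1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

def p(x, beta0, beta1):
    """Modelled probability of passing after x hours of study."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

trial_beta0, trial_beta1 = -1.0, 0.5          # placeholder trial values
probabilities = [p(x, trial_beta0, trial_beta1) for x in hours]
</syntaxhighlight>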
The log loss for the ''k''-th point {{tmath|\ell_k}} is:

:<math>\ell_k = \begin{cases} -\ln p_k & \text{ if } y_k = 1, \\ -\ln (1 - p_k) & \text{ if } y_k = 0. \end{cases}</math>

The log loss can be interpreted as the "[[surprisal]]" of the actual outcome {{tmath|y_k}} relative to the prediction {{tmath|p_k}}, and is a measure of [[information content]]. Log loss is always greater than or equal to 0, equals 0 only in the case of a perfect prediction (i.e., when <math>p_k = 1</math> and <math>y_k = 1</math>, or <math>p_k = 0</math> and <math>y_k = 0</math>), and approaches infinity as the prediction gets worse (i.e., when <math>y_k = 1</math> and <math>p_k \to 0</math> or <math>y_k = 0</math> and <math>p_k \to 1</math>), meaning the actual outcome is "more surprising". Since the value of the logistic function is always strictly between zero and one, the log loss is always greater than zero and less than infinity. Unlike in a linear regression, where the model can have zero loss at a point by passing through a data point (and zero loss overall if all points are on a line), in a logistic regression it is not possible to have zero loss at any point, since {{tmath|y_k}} is either 0 or 1 while {{tmath|0 < p_k < 1}}.

These can be combined into a single expression:

:<math>\ell_k = -y_k\ln p_k - (1 - y_k)\ln (1 - p_k).</math>

This expression is more formally known as the [[cross-entropy]] of the predicted distribution <math>\big(p_k, (1-p_k)\big)</math> from the actual distribution <math>\big(y_k, (1-y_k)\big)</math>, as probability distributions on the two-element space of (pass, fail).

The sum of these, the total loss, is the overall negative log-likelihood {{tmath|-\ell}}, and the best fit is obtained for those choices of {{tmath|\beta_0}} and {{tmath|\beta_1}} for which {{tmath|-\ell}} is ''minimized''. Alternatively, instead of ''minimizing'' the loss, one can ''maximize'' its negation, the (positive) log-likelihood:

:<math>\ell = \sum_{k:y_k=1}\ln(p_k) + \sum_{k:y_k=0}\ln(1-p_k) = \sum_{k=1}^K \left(\,y_k \ln(p_k)+(1-y_k)\ln(1-p_k)\right)</math>

or equivalently maximize the [[likelihood function]] itself, which is the probability that the given data set is produced by a particular logistic function:

:<math>L = \prod_{k:y_k=1}p_k\,\prod_{k:y_k=0}(1-p_k)</math>

This method is known as [[maximum likelihood estimation]].

===Parameter estimation===
Since {{tmath|\ell}} is nonlinear in {{tmath|\beta_0}} and {{tmath|\beta_1}}, determining their optimum values will require numerical methods. One method of maximizing {{tmath|\ell}} is to require the derivatives of {{tmath|\ell}} with respect to {{tmath|\beta_0}} and {{tmath|\beta_1}} to be zero:

:<math>0 = \frac{\partial \ell}{\partial \beta_0} = \sum_{k=1}^K(y_k-p_k)</math>
:<math>0 = \frac{\partial \ell}{\partial \beta_1} = \sum_{k=1}^K(y_k-p_k)x_k</math>

and the maximization procedure can be accomplished by solving the above two equations for {{tmath|\beta_0}} and {{tmath|\beta_1}}, which, again, will generally require the use of numerical methods. The values of {{tmath|\beta_0}} and {{tmath|\beta_1}} which maximize {{tmath|\ell}} and ''L'' using the above data are found to be:

:<math>\beta_0 \approx -4.1</math>
:<math>\beta_1 \approx 1.5</math>

which yields a value for ''μ'' and ''s'' of:

:<math>\mu = -\beta_0/\beta_1 \approx 2.7</math>
:<math>s = 1/\beta_1 \approx 0.67</math>

===Predictions===
The {{tmath|\beta_0}} and {{tmath|\beta_1}} coefficients may be entered into the logistic regression equation to estimate the probability of passing the exam.
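A short numerical sketch of the estimation step described above is given below. It uses Newton–Raphson iteration on the two score equations, one common way of solving them, and is shown only as an illustration (Python assumed), not as the method used to produce the figures in this section. The printed coefficients can then be substituted into the prediction formula exactly as in the worked examples that follow.

<syntaxhighlight lang="python">
# Illustrative Newton-Raphson solution of the two score equations
# d(log-likelihood)/d(beta0) = 0 and d(log-likelihood)/d(beta1) = 0.
import math

hours  = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
          2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]
passed = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
          1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

b0, b1 = 0.0, 0.0
for _ in range(25):                     # far more iterations than needed here
    p = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in hours]
    # Score vector: the two derivatives that must vanish at the maximum.
    g0 = sum(y - pk for y, pk in zip(passed, p))
    g1 = sum((y - pk) * x for y, pk, x in zip(passed, p, hours))
    # Observed information matrix (negative Hessian of the log-likelihood).
    w = [pk * (1.0 - pk) for pk in p]
    i00 = sum(w)
    i01 = sum(wk * x for wk, x in zip(w, hours))
    i11 = sum(wk * x * x for wk, x in zip(w, hours))
    det = i00 * i11 - i01 * i01
    # Newton step: beta <- beta + I^{-1} g  (2x2 inverse written out directly).
    b0 += ( i11 * g0 - i01 * g1) / det
    b1 += (-i01 * g0 + i00 * g1) / det

print(b0, b1)   # converges to approximately -4.1 and 1.5 for these data
</syntaxhighlight>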
For example, for a student who studies 2 hours, entering the value <math>x = 2</math> into the equation gives the estimated probability of passing the exam of 0.25:

:<math>t = \beta_0+2\beta_1 \approx -4.1 + 2 \cdot 1.5 = -1.1</math>
:<math>p = \frac{1}{1 + e^{-t}} \approx 0.25 = \text{Probability of passing exam}</math>

Similarly, for a student who studies 4 hours, the estimated probability of passing the exam is 0.87:

:<math>t = \beta_0+4\beta_1 \approx -4.1 + 4 \cdot 1.5 = 1.9</math>
:<math>p = \frac{1}{1 + e^{-t}} \approx 0.87 = \text{Probability of passing exam}</math>

This table shows the estimated probability of passing the exam for several values of hours studying.

{| class="wikitable"
|-
! rowspan="2" | Hours<br />of study<br />(''x'')
! colspan="3" | Passing exam
|-
! Log-odds (''t'') !! Odds (''e<sup>t</sup>'') !! Probability (''p'')
|- style="text-align: right;"
| 1 || −2.57 || 0.076 ≈ 1:13.1 || 0.07
|- style="text-align: right;"
| 2 || −1.07 || 0.34 ≈ 1:2.91 || 0.26
|- style="text-align: right;"
| {{tmath|\mu \approx 2.7}} || 0 || 1 || {{sfrac|1|2}} = 0.50
|- style="text-align: right;"
| 3 || 0.44 || 1.55 || 0.61
|- style="text-align: right;"
| 4 || 1.94 || 6.96 || 0.87
|- style="text-align: right;"
| 5 || 3.45 || 31.4 || 0.97
|}

===Model evaluation===
The logistic regression analysis gives the following output.

{| class="wikitable"
|-
! !! Coefficient !! Std. Error !! ''z''-value !! ''p''-value (Wald)
|- style="text-align:right;"
! Intercept (''β''<sub>0</sub>)
| −4.1 || 1.8 || −2.3 || 0.021
|- style="text-align:right;"
! Hours (''β''<sub>1</sub>)
| 1.5 || 0.9 || 1.7 || 0.017
|}

By the [[Wald test]], the output indicates that hours studying is significantly associated with the probability of passing the exam (<math>p = 0.017</math>). Rather than the Wald method, the recommended method<ref name="NeymanPearson1933">{{citation | last1 = Neyman | first1 = J. | author-link1 = Jerzy Neyman | last2 = Pearson | first2 = E. S. | author-link2 = Egon Pearson | doi = 10.1098/rsta.1933.0009 | title = On the problem of the most efficient tests of statistical hypotheses | journal = [[Philosophical Transactions of the Royal Society of London A]] | volume = 231 | issue = 694–706 | pages = 289–337 | year = 1933 | jstor = 91247 | bibcode = 1933RSPTA.231..289N | url = http://www.stats.org.uk/statistical-inference/NeymanPearson1933.pdf | doi-access = free}}</ref> to calculate the ''p''-value for logistic regression is the [[likelihood-ratio test]] (LRT), which for these data gives <math>p \approx 0.00064</math> (see {{slink||Deviance and likelihood ratio tests}} below).

===Generalizations===
This simple model is an example of binary logistic regression: it has one explanatory variable and a binary categorical variable which can assume one of two categorical values. [[Multinomial logistic regression]] is the generalization of binary logistic regression to include any number of explanatory variables and any number of categories.
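To complement the model-evaluation figures quoted above, the following minimal sketch (Python assumed; the coefficients are the rounded estimates from this section, so the result is only approximate) computes the likelihood-ratio statistic against the intercept-only model and converts it to a ''p''-value using the chi-squared distribution with one degree of freedom.

<syntaxhighlight lang="python">
# Illustrative likelihood-ratio test: fitted model versus the intercept-only
# (null) model.  Uses the rounded estimates beta0 = -4.1, beta1 = 1.5 from the
# text, so the statistic and p-value are approximate.
import math

hours  = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
          2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]
passed = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
          1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

def log_likelihood(probs):
    """Log-likelihood of the observed pass/fail outcomes under given probabilities."""
    return sum(y * math.log(pk) + (1 - y) * math.log(1 - pk)
               for y, pk in zip(passed, probs))

# Fitted model, using the rounded coefficients quoted in this section.
p_fit = [1.0 / (1.0 + math.exp(-(-4.1 + 1.5 * x))) for x in hours]
# Null model: hours have no effect, so every student gets the overall pass rate.
p_bar = sum(passed) / len(passed)
p_null = [p_bar] * len(passed)

lrt = 2.0 * (log_likelihood(p_fit) - log_likelihood(p_null))
# Upper-tail probability of chi-squared with 1 degree of freedom, via erfc.
p_value = math.erfc(math.sqrt(lrt / 2.0))
print(lrt, p_value)   # roughly 11.7 and 0.0006, in line with the value quoted above
</syntaxhighlight>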