==Definition==
A dataset contains ''N'' points. Each point ''i'' consists of a set of ''m'' input variables ''x''<sub>1,''i''</sub> ... ''x''<sub>''m,i''</sub> (also called [[independent variable]]s, explanatory variables, predictor variables, features, or attributes), and a [[binary variable|binary]] outcome variable ''Y''<sub>''i''</sub> (also known as a [[dependent variable]], response variable, output variable, or class), i.e. it can assume only the two possible values 0 (often meaning "no" or "failure") or 1 (often meaning "yes" or "success"). The goal of logistic regression is to use the dataset to create a predictive model of the outcome variable. As in linear regression, the outcome variables ''Y''<sub>''i''</sub> are assumed to depend on the explanatory variables ''x''<sub>1,''i''</sub> ... ''x''<sub>''m,i''</sub>.

; Explanatory variables
The explanatory variables may be of any [[statistical data type|type]]: [[real-valued]], [[binary variable|binary]], [[categorical variable|categorical]], etc. The main distinction is between [[continuous variable]]s and [[discrete variable]]s. (Discrete variables referring to more than two possible choices are typically coded using [[Dummy variable (statistics)|dummy variables]] (or [[indicator variable]]s); that is, separate explanatory variables taking the value 0 or 1 are created for each possible value of the discrete variable, with a 1 meaning "variable does have the given value" and a 0 meaning "variable does not have that value".)
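As a brief illustration of this dummy-variable coding, the following sketch uses the pandas library (one common choice, not the only one); the data and column names are hypothetical:

<syntaxhighlight lang="python">
import pandas as pd

# Hypothetical dataset: one categorical explanatory variable with three levels.
data = pd.DataFrame({"beverage": ["coffee", "soda", "energy drink", "coffee"],
                     "y": [1, 0, 1, 0]})

# One 0/1 indicator column is created for each possible value of the discrete variable:
# beverage_coffee, beverage_energy drink, beverage_soda.
indicators = pd.get_dummies(data["beverage"], prefix="beverage", dtype=int)
print(indicators)
</syntaxhighlight>

When an intercept is included in the model, one of the indicator columns is often dropped (for example with <code>drop_first=True</code>) to avoid perfect collinearity among the explanatory variables.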
; Outcome variables
Formally, the outcomes ''Y''<sub>''i''</sub> are described as being [[Bernoulli distribution|Bernoulli-distributed]] data, where each outcome is determined by an unobserved probability ''p''<sub>''i''</sub> that is specific to the outcome at hand, but related to the explanatory variables. This can be expressed in any of the following equivalent forms:

::<math>
\begin{align}
Y_i\mid x_{1,i},\ldots,x_{m,i} \ & \sim \operatorname{Bernoulli}(p_i) \\[5pt]
\operatorname{\mathbb E}[Y_i\mid x_{1,i},\ldots,x_{m,i}] &= p_i \\[5pt]
\Pr(Y_i=y\mid x_{1,i},\ldots,x_{m,i}) &= \begin{cases} p_i & \text{if }y=1 \\ 1-p_i & \text{if }y=0 \end{cases} \\[5pt]
\Pr(Y_i=y\mid x_{1,i},\ldots,x_{m,i}) &= p_i^y (1-p_i)^{(1-y)}
\end{align}
</math>

The meanings of these four lines are:
# The first line expresses the [[probability distribution]] of each ''Y''<sub>''i''</sub>: conditioned on the explanatory variables, it follows a [[Bernoulli distribution]] with parameter ''p''<sub>''i''</sub>, the probability of the outcome of 1 for trial ''i''. As noted above, each separate trial has its own probability of success, just as each trial has its own explanatory variables. The probability of success ''p''<sub>''i''</sub> is not observed, only the outcome of an individual Bernoulli trial using that probability.
# The second line expresses the fact that the [[expected value]] of each ''Y''<sub>''i''</sub> is equal to the probability of success ''p''<sub>''i''</sub>, which is a general property of the Bernoulli distribution. In other words, if we run a large number of Bernoulli trials using the same probability of success ''p''<sub>''i''</sub> and then take the average of all the 1 and 0 outcomes, the result will be close to ''p''<sub>''i''</sub>. This is because such an average simply computes the proportion of successes seen, which we expect to converge to the underlying probability of success.
# The third line writes out the [[probability mass function]] of the Bernoulli distribution, specifying the probability of seeing each of the two possible outcomes.
# The fourth line is another way of writing the probability mass function, which avoids having to write separate cases and is more convenient for certain types of calculations. This relies on the fact that ''Y''<sub>''i''</sub> can take only the value 0 or 1. In each case, one of the exponents will be 1, "choosing" the value under it, while the other is 0, "canceling out" the value under it. Hence, the outcome is either ''p''<sub>''i''</sub> or 1 − ''p''<sub>''i''</sub>, as in the previous line.

; Linear predictor function
The basic idea of logistic regression is to use the mechanism already developed for [[linear regression]] by modeling the probability ''p''<sub>''i''</sub> using a [[linear predictor function]], i.e. a [[linear combination]] of the explanatory variables and a set of [[regression coefficient]]s that are specific to the model at hand but the same for all trials. The linear predictor function <math>f(i)</math> for a particular data point ''i'' is written as:
:<math>f(i) = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_m x_{m,i},</math>
where <math>\beta_0, \ldots, \beta_m</math> are [[regression coefficient]]s indicating the relative effect of a particular explanatory variable on the outcome.

The model is usually put into a more compact form as follows:
* The regression coefficients ''β''<sub>0</sub>, ''β''<sub>1</sub>, ..., ''β''<sub>''m''</sub> are grouped into a single vector '''''β''''' of size ''m'' + 1.
* For each data point ''i'', an additional explanatory pseudo-variable ''x''<sub>0,''i''</sub> is added, with a fixed value of 1, corresponding to the [[Y-intercept|intercept]] coefficient ''β''<sub>0</sub>.
* The resulting explanatory variables ''x''<sub>0,''i''</sub>, ''x''<sub>1,''i''</sub>, ..., ''x''<sub>''m,i''</sub> are then grouped into a single vector '''''X<sub>i</sub>''''' of size ''m'' + 1.

This makes it possible to write the linear predictor function as follows:
:<math>f(i)= \boldsymbol\beta \cdot \mathbf{X}_i,</math>
using the notation for a [[dot product]] between two vectors.

[[File:Logistic Regression in SPSS.png|thumb|356x356px|This is an example of an SPSS output for a logistic regression model using three explanatory variables (coffee use per week, energy drink use per week, and soda use per week) and two categories (male and female).]]

=== Many explanatory variables, two categories ===
The above example of binary logistic regression on one explanatory variable can be generalized to binary logistic regression on any number of explanatory variables ''x''<sub>1</sub>, ''x''<sub>2</sub>, ... and any number of categorical values <math>y=0,1,2,\dots</math>.

To begin with, we may consider a logistic model with ''M'' explanatory variables, ''x''<sub>1</sub>, ''x''<sub>2</sub>, ..., ''x<sub>M</sub>'' and, as in the example above, two categorical values (''y'' = 0 and 1). For the simple binary logistic regression model, we assumed a [[linear model|linear relationship]] between the predictor variable and the log-odds (also called [[logit]]) of the event that <math>y=1</math>. This linear relationship may be extended to the case of ''M'' explanatory variables:
:<math>t = \log_b \frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2+ \cdots +\beta_M x_M </math>
where ''t'' is the log-odds and <math>\beta_i</math> are parameters of the model.
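For illustration, a minimal sketch of this relationship in Python; the coefficient values and inputs below are hypothetical (not a fitted model), and the base ''b'' defaults to ''e'':

<syntaxhighlight lang="python">
import math

def log_odds(beta, x):
    """Log-odds t = beta_0*x_0 + beta_1*x_1 + ... + beta_M*x_M, with x_0 = 1 for the intercept."""
    return sum(bm * xm for bm, xm in zip(beta, x))

def probability(beta, x, b=math.e):
    """Probability that y = 1, i.e. the base-b sigmoid of the log-odds."""
    return 1.0 / (1.0 + b ** (-log_odds(beta, x)))

# Hypothetical example: M = 2 explanatory variables.
beta = [-1.5, 0.8, 2.0]       # beta_0 (intercept), beta_1, beta_2
x = [1.0, 0.5, 1.2]           # x_0 = 1 (intercept pseudo-variable), x_1, x_2
print(log_odds(beta, x))      # log-odds t for this observation
print(probability(beta, x))   # probability that y = 1 for this observation
</syntaxhighlight>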
An additional generalization has been introduced in which the base of the model (''b'') is not restricted to [[Euler's number]] ''e''. In most applications, the base <math>b</math> of the logarithm is taken to be ''[[E (mathematical constant)|e]]''. However, in some cases it can be easier to communicate results by working in base 2 or base 10.

For a more compact notation, we will specify the explanatory variables and the ''β'' coefficients as {{tmath|(M+1)}}-dimensional vectors:
:<math>\boldsymbol{x}=\{x_0,x_1,x_2,\dots,x_M\}</math>
:<math>\boldsymbol{\beta}=\{\beta_0,\beta_1,\beta_2,\dots,\beta_M\}</math>
with an added explanatory variable ''x''<sub>0</sub> = 1. The logit may now be written as:
:<math>t =\sum_{m=0}^{M} \beta_m x_m = \boldsymbol{\beta} \cdot \boldsymbol{x}</math>

Solving for the probability ''p'' that <math>y=1</math> yields:
:<math>p(\boldsymbol{x}) = \frac{b^{\boldsymbol{\beta} \cdot \boldsymbol{x}}}{1+b^{\boldsymbol{\beta} \cdot \boldsymbol{x}}}= \frac{1}{1+b^{-\boldsymbol{\beta} \cdot \boldsymbol{x}}}=S_b(t)</math>,
where <math>S_b</math> is the [[sigmoid function]] with base <math>b</math>. The above formula shows that once the <math>\beta_m</math> are fixed, we can easily compute either the log-odds that <math>y=1</math> for a given observation, or the probability that <math>y=1</math> for a given observation. The main use-case of a logistic model is to be given an observation <math>\boldsymbol{x}</math> and to estimate the probability <math>p(\boldsymbol{x})</math> that <math>y=1</math>.

The optimum beta coefficients may again be found by maximizing the log-likelihood. For ''K'' measurements, defining <math>\boldsymbol{x}_k</math> as the explanatory vector of the ''k''-th measurement, and <math>y_k</math> as the categorical outcome of that measurement, the log-likelihood may be written in a form very similar to the simple <math>M=1</math> case above:
:<math>\ell = \sum_{k=1}^K y_k \log_b(p(\boldsymbol{x}_k))+\sum_{k=1}^K (1-y_k) \log_b(1-p(\boldsymbol{x}_k))</math>

As in the simple example above, finding the optimum ''β'' parameters will require numerical methods. One useful technique is to equate the derivatives of the log-likelihood with respect to each of the ''β'' parameters to zero, yielding a set of equations which will hold at the maximum of the log-likelihood:
:<math>\frac{\partial \ell}{\partial \beta_m} = 0 = \sum_{k=1}^K y_k x_{mk} - \sum_{k=1}^K p(\boldsymbol{x}_k)x_{mk}</math>
where ''x<sub>mk</sub>'' is the value of the ''x<sub>m</sub>'' explanatory variable from the ''k''-th measurement.
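As an illustration of this estimation step, the sketch below maximizes the log-likelihood by plain gradient ascent, using the base ''b'' = ''e'' so that the gradient reduces to the sums shown above. The data, step size, and iteration count are made up for the example; statistical software typically uses more robust methods such as Newton–Raphson (iteratively reweighted least squares).

<syntaxhighlight lang="python">
import numpy as np

# Made-up data: K = 6 measurements, M = 2 explanatory variables, plus x_0 = 1 for the
# intercept.  Rows 3 and 4 share the same x but have different outcomes, so the data
# are not perfectly separable and the maximum-likelihood estimate is finite.
X = np.array([[1.0, 0.2, 0.1],
              [1.0, 0.5, 0.4],
              [1.0, 1.0, 1.0],
              [1.0, 1.0, 1.0],
              [1.0, 1.5, 1.2],
              [1.0, 2.0, 1.8]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

def p(beta, X):
    """p(x_k) = 1 / (1 + e^(-beta . x_k)) for every row x_k of X (base b = e)."""
    return 1.0 / (1.0 + np.exp(-(X @ beta)))

beta = np.zeros(X.shape[1])
for _ in range(20000):
    # Gradient of the log-likelihood, zero at the maximum:
    # d ell / d beta_m = sum_k y_k x_mk - sum_k p(x_k) x_mk
    beta += 0.1 * (X.T @ (y - p(beta, X)))

print(beta)   # estimated coefficients beta_0, beta_1, beta_2
</syntaxhighlight>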
Consider an example with <math>M=2</math> explanatory variables, <math>b=10</math>, and coefficients <math>\beta_0=-3</math>, <math>\beta_1=1</math>, and <math>\beta_2=2</math> which have been determined by the above method. To be concrete, the model is:
:<math>t=\log_{10}\frac{p}{1 - p} = -3 + x_1 + 2 x_2</math>
:<math>p = \frac{b^{\boldsymbol{\beta} \cdot \boldsymbol{x}}}{1+b^{\boldsymbol{\beta} \cdot \boldsymbol{x}}} = \frac{b^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}}{1+b^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}} = \frac{1}{1 + b^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}}</math>,
where ''p'' is the probability of the event that <math>y=1</math>. This can be interpreted as follows:
* <math>\beta_0 = -3</math> is the [[y-intercept|''y''-intercept]]. It is the log-odds of the event that <math>y=1</math> when the predictors <math>x_1=x_2=0</math>. By exponentiating, we can see that when <math>x_1=x_2=0</math> the odds of the event that <math>y=1</math> are 1-to-1000, or <math>10^{-3}</math>. Similarly, the probability of the event that <math>y=1</math> when <math>x_1=x_2=0</math> can be computed as <math>1/(1000 + 1) = 1/1001.</math>
* <math>\beta_1 = 1</math> means that increasing <math>x_1</math> by 1 increases the log-odds by <math>1</math>. So if <math>x_1</math> increases by 1, the odds that <math>y=1</math> increase by a factor of <math>10^1</math>. The '''probability''' of <math>y=1</math> also increases, but not by as much as the odds increase.
* <math>\beta_2 = 2</math> means that increasing <math>x_2</math> by 1 increases the log-odds by <math>2</math>. So if <math>x_2</math> increases by 1, the odds that <math>y=1</math> increase by a factor of <math>10^2.</math> Note how the effect of <math>x_2</math> on the log-odds is twice as great as the effect of <math>x_1</math>, but the effect on the odds is 10 times greater. The effect on the '''probability''' of <math>y=1</math>, however, is not 10 times greater; only the effect on the odds is 10 times greater.

=== Multinomial logistic regression: Many explanatory variables and many categories ===
{{main|Multinomial logistic regression}}

In the above cases of two categories (binomial logistic regression), the categories were indexed by "0" and "1", and we had two probabilities: the probability that the outcome was in category 1 was given by <math>p(\boldsymbol{x})</math> and the probability that the outcome was in category 0 was given by <math>1-p(\boldsymbol{x})</math>. The sum of these probabilities equals 1, which must be true, since "0" and "1" are the only possible categories in this setup.

In general, if we have {{tmath|M+1}} explanatory variables (including ''x''<sub>0</sub>) and {{tmath|N+1}} categories, we will need {{tmath|N+1}} separate probabilities, one for each category, indexed by ''n'', which describe the probability that the categorical outcome ''y'' will be in category ''y=n'', conditional on the vector of covariates '''x'''. The sum of these probabilities over all categories must equal 1. Using the mathematically convenient base ''e'', these probabilities are:
:<math>p_n(\boldsymbol{x}) = \frac{e^{\boldsymbol{\beta}_n\cdot \boldsymbol{x}}}{1+\sum_{u=1}^N e^{\boldsymbol{\beta}_u\cdot \boldsymbol{x}}}</math> for <math>n=1,2,\dots,N</math>
:<math>p_0(\boldsymbol{x}) = 1-\sum_{n=1}^N p_n(\boldsymbol{x})=\frac{1}{1+\sum_{u=1}^N e^{\boldsymbol{\beta}_u\cdot \boldsymbol{x}}}</math>

Each of the probabilities except <math>p_0(\boldsymbol{x})</math> will have its own set of regression coefficients <math>\boldsymbol{\beta}_n</math>. It can be seen that, as required, the sum of the <math>p_n(\boldsymbol{x})</math> over all categories ''n'' is 1. The selection of <math>p_0(\boldsymbol{x})</math> to be defined in terms of the other probabilities is artificial; any of the probabilities could have been selected to be so defined. This special value of ''n'' is termed the "pivot index", and the log-odds (''t<sub>n</sub>'') are expressed in terms of the pivot probability and are again expressed as a linear combination of the explanatory variables:
:<math>t_n = \ln\left(\frac{p_n(\boldsymbol{x})}{p_0(\boldsymbol{x})}\right) = \boldsymbol{\beta}_n \cdot \boldsymbol{x}</math>

Note also that for the simple case of <math>N=1</math>, the two-category case is recovered, with <math>p(\boldsymbol{x})=p_1(\boldsymbol{x})</math> and <math>p_0(\boldsymbol{x})=1-p_1(\boldsymbol{x})</math>.
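The following sketch computes these pivot-based category probabilities for hypothetical coefficient vectors (base ''e'', matching the formulas above) and checks that they sum to 1:

<syntaxhighlight lang="python">
import numpy as np

def category_probabilities(betas, x):
    """Pivot-based multinomial probabilities.

    betas: array of shape (N, M+1), one coefficient vector per non-pivot category 1..N.
    x:     explanatory vector of length M+1, with x[0] = 1 for the intercept.
    Returns the probabilities [p_0, p_1, ..., p_N].
    """
    scores = np.exp(betas @ x)                     # e^(beta_n . x) for n = 1..N
    denom = 1.0 + scores.sum()                     # 1 + sum_u e^(beta_u . x)
    return np.concatenate(([1.0], scores)) / denom # p_0 = 1/denom, p_n = e^(beta_n . x)/denom

# Hypothetical example: M = 2 explanatory variables, N + 1 = 3 categories.
betas = np.array([[0.5, -1.0, 2.0],    # beta_1 (category 1 versus the pivot category 0)
                  [-0.3, 0.7, 0.1]])   # beta_2 (category 2 versus the pivot category 0)
x = np.array([1.0, 0.4, 1.5])          # x_0 = 1, x_1, x_2
p = category_probabilities(betas, x)
print(p, p.sum())                      # probabilities for categories 0, 1, 2; the sum is 1
</syntaxhighlight>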
The log-likelihood that a particular set of ''K'' measurements or data points will be generated by the above probabilities can now be calculated. Indexing each measurement by ''k'', let the ''k''-th set of measured explanatory variables be denoted by <math>\boldsymbol{x}_k</math> and their categorical outcomes be denoted by <math>y_k</math>, which can be equal to any integer in [0,N]. The log-likelihood is then:
:<math>\ell = \sum_{k=1}^K \sum_{n=0}^N \Delta(n,y_k)\,\ln(p_n(\boldsymbol{x}_k))</math>
where <math>\Delta(n,y_k)</math> is an [[indicator function]] which equals 1 if ''y<sub>k</sub>'' = ''n'' and zero otherwise. In the two-category case above, this indicator function was defined as ''y<sub>k</sub>'' when ''n'' = 1 and 1 − ''y<sub>k</sub>'' when ''n'' = 0. This was convenient, but not necessary.<ref>For example, the indicator function in this case could be defined as <math>\Delta(n,y)=1-(y-n)^2</math></ref>

Again, the optimum beta coefficients may be found by maximizing the log-likelihood function, generally using numerical methods. A possible method of solution is to set the derivatives of the log-likelihood with respect to each beta coefficient equal to zero and solve for the beta coefficients:
:<math>\frac{\partial \ell}{\partial \beta_{nm}} = 0 = \sum_{k=1}^K \Delta(n,y_k)x_{mk} - \sum_{k=1}^K p_n(\boldsymbol{x}_k)x_{mk}</math>
where <math>\beta_{nm}</math> is the ''m''-th coefficient of the <math>\boldsymbol{\beta}_n</math> vector and <math>x_{mk}</math> is the ''m''-th explanatory variable of the ''k''-th measurement. Once the beta coefficients have been estimated from the data, we will be able to estimate the probability that any subsequent set of explanatory variables will result in any of the possible outcome categories.
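Continuing the sketch above (and reusing the <code>category_probabilities</code> function defined there), the multinomial log-likelihood and its gradient can be evaluated as follows; the data are made up, and the zero of the gradient would in practice be found with a numerical optimizer:

<syntaxhighlight lang="python">
import numpy as np

def log_likelihood(betas, X, y):
    """ell = sum_k ln p_{y_k}(x_k), i.e. the double sum with the indicator Delta(n, y_k)."""
    return sum(np.log(category_probabilities(betas, x_k)[y_k]) for x_k, y_k in zip(X, y))

def gradient(betas, X, y):
    """d ell / d beta_nm = sum_k Delta(n, y_k) x_mk - sum_k p_n(x_k) x_mk, for n = 1..N."""
    grad = np.zeros_like(betas)
    for x_k, y_k in zip(X, y):
        p = category_probabilities(betas, x_k)
        for n in range(betas.shape[0]):            # row n holds beta_{n+1} (non-pivot categories)
            delta = 1.0 if y_k == n + 1 else 0.0
            grad[n] += (delta - p[n + 1]) * x_k
    return grad

# Made-up data: K = 4 measurements, outcomes in {0, 1, 2}, M = 2 explanatory variables.
X = np.array([[1.0, 0.2, 1.1], [1.0, 1.5, 0.3], [1.0, 0.8, 0.8], [1.0, 2.0, 1.7]])
y = np.array([0, 1, 2, 1])
betas = np.zeros((2, 3))                           # beta_1 and beta_2, each of length M + 1 = 3
print(log_likelihood(betas, X, y))
print(gradient(betas, X, y))
</syntaxhighlight>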