=== Many explanatory variables, two categories ===

The above example of binary logistic regression on one explanatory variable can be generalized to binary logistic regression on any number of explanatory variables ''x''<sub>1</sub>, ''x''<sub>2</sub>, ... and any number of categorical values <math>y=0,1,2,\dots</math>.

To begin with, we may consider a logistic model with ''M'' explanatory variables, ''x''<sub>1</sub>, ''x''<sub>2</sub>, ..., ''x''<sub>''M''</sub>, and, as in the example above, two categorical values (''y'' = 0 and 1). For the simple binary logistic regression model, we assumed a [[linear model|linear relationship]] between the predictor variable and the log-odds (also called [[logit]]) of the event that <math>y=1</math>. This linear relationship may be extended to the case of ''M'' explanatory variables:

:<math>t = \log_b \frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_M x_M,</math>

where ''t'' is the log-odds and <math>\beta_i</math> are parameters of the model. An additional generalization has been introduced in which the base of the model (''b'') is not restricted to [[Euler's number]] ''e''. The base <math>b</math> of the logarithm is usually taken to be ''[[E (mathematical constant)|e]]'', but in some cases it can be easier to communicate results by working in base 2 or base 10.

For a more compact notation, we will specify the explanatory variables and the ''β'' coefficients as {{tmath|(M+1)}}-dimensional vectors:

:<math>\boldsymbol{x}=\{x_0,x_1,x_2,\dots,x_M\}</math>
:<math>\boldsymbol{\beta}=\{\beta_0,\beta_1,\beta_2,\dots,\beta_M\},</math>

with an added explanatory variable ''x''<sub>0</sub> = 1. The logit may now be written as:

:<math>t =\sum_{m=0}^{M} \beta_m x_m = \boldsymbol{\beta} \cdot \boldsymbol{x}.</math>

Solving for the probability ''p'' that <math>y=1</math> yields:

:<math>p(\boldsymbol{x}) = \frac{b^{\boldsymbol{\beta} \cdot \boldsymbol{x}}}{1+b^{\boldsymbol{\beta} \cdot \boldsymbol{x}}}= \frac{1}{1+b^{-\boldsymbol{\beta} \cdot \boldsymbol{x}}}=S_b(t),</math>

where <math>S_b</math> is the [[sigmoid function]] with base <math>b</math>. The above formula shows that once the <math>\beta_m</math> are fixed, we can easily compute either the log-odds or the probability that <math>y=1</math> for a given observation. The main use case of a logistic model is to be given an observation <math>\boldsymbol{x}</math> and estimate the probability <math>p(\boldsymbol{x})</math> that <math>y=1</math>.

The optimum ''β'' coefficients may again be found by maximizing the log-likelihood. For ''K'' measurements, defining <math>\boldsymbol{x}_k</math> as the explanatory vector of the ''k''-th measurement and <math>y_k</math> as the categorical outcome of that measurement, the log-likelihood may be written in a form very similar to the simple <math>M=1</math> case above:

:<math>\ell = \sum_{k=1}^K y_k \log_b(p(\boldsymbol{x}_k))+\sum_{k=1}^K (1-y_k) \log_b(1-p(\boldsymbol{x}_k)).</math>

As in the simple example above, finding the optimum ''β'' parameters will require numerical methods.
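The following is a minimal numerical sketch of these formulas (assuming [[Python (programming language)|Python]] with [[NumPy]]; the function names are illustrative, not part of any standard library):

<syntaxhighlight lang="python">
import numpy as np

def sigmoid_base(t, b=np.e):
    """Sigmoid function with base b: S_b(t) = 1 / (1 + b**(-t))."""
    return 1.0 / (1.0 + b ** (-t))

def predict_proba(X, beta, b=np.e):
    """Probability that y = 1 for each of the K rows of X.

    X    : (K, M) array of explanatory variables (without the x_0 = 1 column)
    beta : (M + 1,) array of coefficients (beta_0, beta_1, ..., beta_M)
    """
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend x_0 = 1 to each row
    t = X1 @ beta                               # log-odds t = beta . x
    return sigmoid_base(t, b)

def log_likelihood(X, y, beta, b=np.e):
    """Log-likelihood (taken in base b) of binary outcomes y under the model."""
    p = predict_proba(X, beta, b)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)) / np.log(b)
</syntaxhighlight>

Maximizing <code>log_likelihood</code> over <code>beta</code> with any general-purpose numerical optimizer recovers the fitted coefficients.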
One useful technique is to equate the derivatives of the log-likelihood with respect to each of the ''β'' parameters to zero, yielding a set of equations which will hold at the maximum of the log-likelihood:

:<math>\frac{\partial \ell}{\partial \beta_m} = 0 = \sum_{k=1}^K y_k x_{mk} - \sum_{k=1}^K p(\boldsymbol{x}_k)x_{mk},</math>

where ''x<sub>mk</sub>'' is the value of the ''x<sub>m</sub>'' explanatory variable from the ''k''-th measurement.

Consider an example with <math>M=2</math> explanatory variables, <math>b=10</math>, and coefficients <math>\beta_0=-3</math>, <math>\beta_1=1</math>, and <math>\beta_2=2</math>, which have been determined by the above method. To be concrete, the model is:

:<math>t=\log_{10}\frac{p}{1 - p} = -3 + x_1 + 2 x_2</math>
:<math>p = \frac{b^{\boldsymbol{\beta} \cdot \boldsymbol{x}}}{1+b^{\boldsymbol{\beta} \cdot \boldsymbol{x}}} = \frac{b^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}}{1+b^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}} = \frac{1}{1 + b^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}},</math>

where ''p'' is the probability of the event that <math>y=1</math>. This can be interpreted as follows:
* <math>\beta_0 = -3</math> is the [[y-intercept|''y''-intercept]]. It is the log-odds of the event that <math>y=1</math> when the predictors <math>x_1=x_2=0</math>. By exponentiating, we can see that when <math>x_1=x_2=0</math> the odds of the event that <math>y=1</math> are 1-to-1000, or <math>10^{-3}</math>. Correspondingly, the probability of the event that <math>y=1</math> when <math>x_1=x_2=0</math> is <math>1/(1000 + 1) = 1/1001</math>.
* <math>\beta_1 = 1</math> means that increasing <math>x_1</math> by 1 increases the log-odds by <math>1</math>, so the odds that <math>y=1</math> increase by a factor of <math>10^1</math>. The '''probability''' of <math>y=1</math> also increases, but by less than the odds do.
* <math>\beta_2 = 2</math> means that increasing <math>x_2</math> by 1 increases the log-odds by <math>2</math>, so the odds that <math>y=1</math> increase by a factor of <math>10^2</math>. Note that the effect of <math>x_2</math> on the log-odds is twice as great as the effect of <math>x_1</math>, while its effect on the odds is 10 times greater. The effect on the '''probability''' of <math>y=1</math>, however, is not 10 times greater; only the effect on the odds is.
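These interpretations can be checked numerically (again a sketch in Python with NumPy; the variable names are illustrative):

<syntaxhighlight lang="python">
import numpy as np

b = 10.0
beta = np.array([-3.0, 1.0, 2.0])   # (beta_0, beta_1, beta_2)

def log_odds(x1, x2):
    return beta[0] + beta[1] * x1 + beta[2] * x2

def prob(x1, x2):
    return 1.0 / (1.0 + b ** (-log_odds(x1, x2)))

# Intercept: at x1 = x2 = 0 the odds are 10**-3 (1-to-1000),
# so the probability is 1/1001.
print(b ** log_odds(0, 0))          # 0.001
print(prob(0, 0))                   # 0.000999... = 1/1001

# Increasing x1 by 1 multiplies the odds by 10**1 = 10:
print(b ** log_odds(1, 0) / b ** log_odds(0, 0))   # 10.0

# Increasing x2 by 1 multiplies the odds by 10**2 = 100,
# but the probability rises from 1/1001 to only about 1/11:
print(b ** log_odds(0, 1) / b ** log_odds(0, 0))   # 100.0
print(prob(0, 1))                   # 0.0909... = 1/11
</syntaxhighlight>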