===As a latent-variable model===
The logistic model has an equivalent formulation as a [[latent-variable model]].  This formulation is common in the theory of [[discrete choice]] models and makes it easier to extend to certain more complicated models with multiple, correlated choices, as well as to compare logistic regression to the closely related [[probit model]].

Imagine that, for each trial ''i'', there is a continuous [[latent variable]] ''Y''<sub>''i''</sub><sup>*</sup> (i.e. an unobserved [[random variable]]) that is distributed as follows:

: <math> Y_i^\ast = \boldsymbol\beta \cdot \mathbf{X}_i + \varepsilon_i \, </math>
where
: <math>\varepsilon_i \sim \operatorname{Logistic}(0,1) \, </math>
i.e. the latent variable can be written directly in terms of the linear predictor function and an additive random [[error variable]] that is distributed according to a standard [[logistic distribution]].

Then ''Y''<sub>''i''</sub> can be viewed as an indicator for whether this latent variable is positive:
: <math> Y_i = \begin{cases} 1 & \text{if }Y_i^\ast > 0 \ \text{ i.e. } {- \varepsilon_i} < \boldsymbol\beta \cdot \mathbf{X}_i, \\
0 &\text{otherwise.} \end{cases} </math>
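
The definition above can be simulated directly. The following Python sketch is illustrative only: the function names, the example coefficients and the use of inverse-CDF sampling are assumptions made here, not part of the model's definition. It draws a standard logistic error, forms the latent variable, and records whether the result is positive. Over many trials the fraction of ones should be close to the logistic function evaluated at '''''β''''' · '''X'''<sub>''i''</sub>.

```python
import math
import random


def logistic(x):
    """Standard logistic CDF, i.e. the inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-x))


def sample_standard_logistic(rng):
    """Draw from Logistic(0, 1) by inverse-CDF sampling."""
    u = rng.random()
    while u == 0.0:  # random() lies in [0, 1); avoid log(0)
        u = rng.random()
    return math.log(u / (1.0 - u))


def simulate_outcome(beta, x, rng):
    """Return 1 if the latent value beta . x + eps is positive, else 0."""
    linear_predictor = sum(b * xi for b, xi in zip(beta, x))
    latent = linear_predictor + sample_standard_logistic(rng)
    return 1 if latent > 0 else 0


def empirical_frequency(beta, x, n, seed=0):
    """Fraction of n simulated trials in which the outcome is 1."""
    rng = random.Random(seed)
    return sum(simulate_outcome(beta, x, rng) for _ in range(n)) / n
```

For example, with the hypothetical values <code>beta = [0.5, -1.0]</code> and <code>x = [1.0, 0.3]</code> the linear predictor is 0.2, so <code>empirical_frequency(beta, x, 200000)</code> should come out near <code>logistic(0.2)</code>.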

Modeling the error variable with a standard logistic distribution, rather than a general logistic distribution with arbitrary location and scale, seems restrictive, but in fact it is not.  The regression coefficients are free parameters, and they can usually be chosen to offset changes in the parameters of the error variable's distribution.  For example, a logistic error-variable distribution with a non-zero location parameter ''μ'' (which sets the mean) is equivalent to a distribution with a zero location parameter, where ''μ'' has been added to the intercept coefficient.  Both situations produce the same value for ''Y''<sub>''i''</sub><sup>*</sup> regardless of the settings of the explanatory variables.  Similarly, an arbitrary scale parameter ''s'' is equivalent to setting the scale parameter to 1 and then dividing all regression coefficients by ''s''.  In the latter case, the resulting value of ''Y''<sub>''i''</sub><sup>*</sup> will be smaller by a factor of ''s'' than in the former case, for all sets of explanatory variables. Critically, because ''s'' is positive, it will always remain on the same side of 0, and hence lead to the same ''Y''<sub>''i''</sub> choice.

(This suggests that the irrelevance of the scale parameter may not carry over to more complex models where more than two choices are available.)
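
The absorption of the location and scale parameters into the coefficients can be checked numerically. The Python sketch below is a minimal illustration under stated assumptions: the first entry of <code>beta</code> is taken to be the intercept (so the first explanatory value is always 1), ''s'' is positive, and all function names and test values are hypothetical. It compares the outcome produced by an error with location ''μ'' and scale ''s'' against the outcome produced by a standard logistic error with adjusted coefficients.

```python
import math
import random


def outcome(beta, x, eps):
    """Return 1 if the latent value beta . x + eps is positive, else 0."""
    return 1 if sum(b * xi for b, xi in zip(beta, x)) + eps > 0 else 0


def absorb_location_scale(beta, mu, s):
    """Fold location mu and scale s > 0 into the coefficients.

    Assumes beta[0] is the intercept: mu is added to it, and then
    every coefficient is divided by s.
    """
    adjusted = list(beta)
    adjusted[0] += mu
    return [b / s for b in adjusted]


def standard_logistic(rng):
    """Draw from Logistic(0, 1) by inverse-CDF sampling."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return math.log(u / (1.0 - u))


def count_disagreements(beta, mu, s, n=10000, seed=0):
    """Count trials where the two parameterisations give different outcomes."""
    rng = random.Random(seed)
    adjusted = absorb_location_scale(beta, mu, s)
    disagreements = 0
    for _ in range(n):
        x = [1.0, rng.uniform(-3.0, 3.0), rng.uniform(-3.0, 3.0)]
        eps = standard_logistic(rng)
        y_general = outcome(beta, x, mu + s * eps)   # Logistic(mu, s) error
        y_standard = outcome(adjusted, x, eps)       # Logistic(0, 1) error
        disagreements += int(y_general != y_standard)
    return disagreements
```

Apart from floating-point rounding in the rare case of a latent value extremely close to zero, <code>count_disagreements</code> is expected to return 0 for any positive ''s''.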

It turns out that this formulation is exactly equivalent to the preceding one, phrased in terms of the [[generalized linear model]] and without any [[latent variable]]s.  This can be shown as follows, using the fact that the [[cumulative distribution function]] (CDF) of the standard [[logistic distribution]] is the [[logistic function]], which is the inverse of the [[logit function]], i.e.

:<math>\Pr(\varepsilon_i < x) = \operatorname{logit}^{-1}(x)</math>

Then:

:<math>
\begin{align}
\Pr(Y_i=1\mid\mathbf{X}_i) &= \Pr(Y_i^\ast > 0\mid\mathbf{X}_i) \\[5pt]
&= \Pr(\boldsymbol\beta \cdot \mathbf{X}_i + \varepsilon_i > 0) \\[5pt]
&= \Pr(\varepsilon_i > -\boldsymbol\beta \cdot \mathbf{X}_i) \\[5pt]
&= \Pr(\varepsilon_i < \boldsymbol\beta \cdot \mathbf{X}_i) & & \text{(because the logistic distribution is symmetric)} \\[5pt]
&= \operatorname{logit}^{-1}(\boldsymbol\beta \cdot \mathbf{X}_i) & \\[5pt]
&= p_i & & \text{(see above)}
\end{align}
</math>
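
The symmetry step in the derivation can be verified numerically. The following Python sketch, with hypothetical function names and example values, computes the probability as 1 minus the CDF at the negated linear predictor, as in the third line of the derivation, and compares it with the inverse logit of the linear predictor from the fifth line.

```python
import math


def logistic_cdf(x):
    """CDF of the standard logistic distribution (the inverse logit)."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def prob_positive_latent(beta, x):
    """Pr(eps > -beta . x), computed as 1 - F(-beta . x)."""
    eta = sum(b * xi for b, xi in zip(beta, x))
    return 1.0 - logistic_cdf(-eta)


def prob_inverse_logit(beta, x):
    """logit^{-1}(beta . x), the generalized-linear-model form."""
    eta = sum(b * xi for b, xi in zip(beta, x))
    return logistic_cdf(eta)
```

The two functions should agree up to rounding error for any coefficients and explanatory values, and applying <code>logit</code> to either result recovers the linear predictor.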

This formulation, which is standard in [[discrete choice]] models, makes clear the relationship between logistic regression (the "logit model") and the [[probit model]], which uses an error variable distributed according to a standard [[normal distribution]] instead of a standard logistic distribution.  Both the logistic and normal distributions are symmetric with a basic unimodal, "bell curve" shape.  The main difference in shape is that the logistic distribution has somewhat [[heavy-tailed distribution|heavier tails]], which means that it is less sensitive to outlying data (and hence somewhat more [[robust statistics|robust]] to model mis-specifications or erroneous data).
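
As a rough illustration of the heavier tails, the Python sketch below compares upper-tail probabilities of the two error distributions. To make them comparable, the logistic distribution is rescaled to unit variance (scale √3/π); this rescaling and the function names are assumptions made for the example, not part of either model.

```python
import math

# Scale giving a logistic distribution variance 1 (variance = s^2 * pi^2 / 3).
UNIT_VARIANCE_SCALE = math.sqrt(3.0) / math.pi


def logistic_upper_tail(t, scale=UNIT_VARIANCE_SCALE):
    """Pr(eps > t) for a zero-location logistic error with the given scale."""
    return 1.0 / (1.0 + math.exp(t / scale))


def normal_upper_tail(t):
    """Pr(Z > t) for a standard normal variable, via the complementary error function."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def tail_ratio(t):
    """How many times more likely the logistic error is to exceed t."""
    return logistic_upper_tail(t) / normal_upper_tail(t)
```

Under this rescaling, <code>tail_ratio(t)</code> is greater than 1 a few standard deviations from the centre and grows as <code>t</code> increases, which is the sense in which the logistic tails are heavier.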