==={{anchor|log-linear model}}As a "log-linear" model===
Yet another formulation combines the two-way latent variable formulation above with the original formulation higher up without latent variables, and in the process provides a link to one of the standard formulations of the [[multinomial logit]].

Here, instead of writing the [[logit]] of the probabilities ''p''<sub>''i''</sub> as a linear predictor, we separate the linear predictor into two, one for each of the two outcomes:

:<math>\begin{align}
\ln \Pr(Y_i=0) &= \boldsymbol\beta_0 \cdot \mathbf{X}_i - \ln Z \\
\ln \Pr(Y_i=1) &= \boldsymbol\beta_1 \cdot \mathbf{X}_i - \ln Z
\end{align}</math>

Two separate sets of regression coefficients have been introduced, just as in the two-way latent variable model, and the two equations appear in a form that writes the [[logarithm]] of the associated probability as a linear predictor, with an extra term <math>-\ln Z</math> at the end. This term, as it turns out, serves as the [[normalizing factor]] ensuring that the result is a distribution. This can be seen by exponentiating both sides:

:<math>\begin{align}
\Pr(Y_i=0) &= \frac{1}{Z} e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} \\[5pt]
\Pr(Y_i=1) &= \frac{1}{Z} e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}
\end{align}</math>

In this form it is clear that the purpose of ''Z'' is to ensure that the resulting distribution over ''Y''<sub>''i''</sub> is in fact a [[probability distribution]], i.e. that it sums to 1. This means that ''Z'' is simply the sum of all un-normalized probabilities, and by dividing each probability by ''Z'', the probabilities become "[[normalizing constant|normalized]]". That is:

:<math>Z = e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}</math>

and the resulting equations are

:<math>\begin{align}
\Pr(Y_i=0) &= \frac{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}} \\[5pt]
\Pr(Y_i=1) &= \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}.
\end{align}</math>

Or, more generally:

:<math>\Pr(Y_i=c) = \frac{e^{\boldsymbol\beta_c \cdot \mathbf{X}_i}}{\sum_h e^{\boldsymbol\beta_h \cdot \mathbf{X}_i}}</math>

This shows clearly how to generalize this formulation to more than two outcomes, as in [[multinomial logit]]. This general formulation is exactly the [[softmax function]], as in

:<math>\Pr(Y_i=c) = \operatorname{softmax}(c, \boldsymbol\beta_0 \cdot \mathbf{X}_i, \boldsymbol\beta_1 \cdot \mathbf{X}_i, \dots).</math>
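The following is a minimal numerical sketch of this normalization; the coefficient vectors and feature values are hypothetical, chosen purely for illustration:

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical coefficient vectors and feature vector (illustration only);
# the first feature is a constant 1 acting as the intercept.
beta_0 = np.array([0.2, -0.5, 1.0])
beta_1 = np.array([1.5, 0.3, -0.7])
x_i = np.array([1.0, 2.0, 0.5])

# Un-normalized probabilities exp(beta_c . X_i), one per outcome
scores = np.exp([beta_0 @ x_i, beta_1 @ x_i])

# Z is the sum of the un-normalized probabilities; dividing by it
# normalizes them (this is exactly the two-outcome softmax)
Z = scores.sum()
probs = scores / Z

print(probs)        # Pr(Y_i = 0) and Pr(Y_i = 1)
print(probs.sum())  # 1.0, i.e. a valid probability distribution
</syntaxhighlight>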
To prove that this is equivalent to the previous model, note that the above model is overspecified, in that <math>\Pr(Y_i=0)</math> and <math>\Pr(Y_i=1)</math> cannot be independently specified: rather, <math>\Pr(Y_i=0) + \Pr(Y_i=1) = 1</math>, so knowing one automatically determines the other. As a result, the model is [[nonidentifiable]], in that multiple combinations of '''''β'''''<sub>0</sub> and '''''β'''''<sub>1</sub> will produce the same probabilities for all possible explanatory variables. In fact, it can be seen that adding any constant vector <math>\mathbf{C}</math> to both of them will produce the same probabilities:

:<math>\begin{align}
\Pr(Y_i=1) &= \frac{e^{(\boldsymbol\beta_1 + \mathbf{C}) \cdot \mathbf{X}_i}}{e^{(\boldsymbol\beta_0 + \mathbf{C}) \cdot \mathbf{X}_i} + e^{(\boldsymbol\beta_1 + \mathbf{C}) \cdot \mathbf{X}_i}} \\[5pt]
&= \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i}} \\[5pt]
&= \frac{e^{\mathbf{C} \cdot \mathbf{X}_i} e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{e^{\mathbf{C} \cdot \mathbf{X}_i}(e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i})} \\[5pt]
&= \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}.
\end{align}</math>

As a result, we can simplify matters, and restore identifiability, by picking an arbitrary value for one of the two vectors. We choose to set <math>\boldsymbol\beta_0 = \mathbf{0}.</math> Then

:<math>e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} = e^{\mathbf{0} \cdot \mathbf{X}_i} = 1</math>

and so

:<math>\Pr(Y_i=1) = \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{1 + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}} = \frac{1}{1 + e^{-\boldsymbol\beta_1 \cdot \mathbf{X}_i}} = p_i,</math>

which shows that this formulation is indeed equivalent to the previous formulation. (As in the two-way latent variable formulation, any settings where <math>\boldsymbol\beta = \boldsymbol\beta_1 - \boldsymbol\beta_0</math> will produce equivalent results; both of these facts are checked numerically in the sketch at the end of this section.)

Most treatments of the [[multinomial logit]] model start out either by extending the "log-linear" formulation presented here or the two-way latent variable formulation presented above, since both clearly show the way that the model could be extended to multi-way outcomes. In general, the presentation with latent variables is more common in [[econometrics]] and [[political science]], where [[discrete choice]] models and [[utility theory]] reign, while the "log-linear" formulation here is more common in [[computer science]], e.g. [[machine learning]] and [[natural language processing]].
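Both facts used in the identifiability argument above, invariance under a shift by a constant vector <math>\mathbf{C}</math> and recovery of the logistic form when <math>\boldsymbol\beta_0 = \mathbf{0}</math>, can be verified with a short sketch; the coefficient values are again hypothetical:

<syntaxhighlight lang="python">
import numpy as np

def pr_y1(b0, b1, x):
    """Pr(Y_i = 1) under the log-linear (softmax) formulation."""
    scores = np.exp([b0 @ x, b1 @ x])
    return scores[1] / scores.sum()

# Hypothetical values, as in the previous sketch
beta_0 = np.array([0.2, -0.5, 1.0])
beta_1 = np.array([1.5, 0.3, -0.7])
x_i = np.array([1.0, 2.0, 0.5])
C = np.array([10.0, -3.0, 2.5])  # an arbitrary constant vector

# Adding C to both coefficient vectors leaves the probability unchanged
print(np.isclose(pr_y1(beta_0, beta_1, x_i),
                 pr_y1(beta_0 + C, beta_1 + C, x_i)))  # True

# Setting beta_0 = 0 with beta = beta_1 - beta_0 recovers the sigmoid
beta = beta_1 - beta_0
p_softmax = pr_y1(np.zeros_like(beta), beta, x_i)
p_sigmoid = 1.0 / (1.0 + np.exp(-(beta @ x_i)))
print(np.isclose(p_softmax, p_sigmoid))  # True
</syntaxhighlight>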