=== Proof ===

In order to show this, we use the method of [[Lagrange multipliers]]. The Lagrangian is equal to the entropy plus the sum of the products of Lagrange multipliers times various constraint expressions. The general multinomial case will be considered, since the proof is not made that much simpler by considering simpler cases. Equating the derivative of the Lagrangian with respect to the various probabilities to zero yields a functional form for those probabilities which corresponds to those used in logistic regression.<ref name="Mount2011"/>

As in the above section on [[#Multinomial logistic regression : Many explanatory variable and many categories|multinomial logistic regression]], we will consider {{tmath|M+1}} explanatory variables denoted {{tmath|x_m}}, which include <math>x_0=1</math>. There will be a total of ''K'' data points, indexed by <math>k=\{1,2,\dots,K\}</math>, and the data points are given by <math>x_{mk}</math> and {{tmath|y_k}}. The ''x<sub>mk</sub>'' will also be represented as an {{tmath|(M+1)}}-dimensional vector <math>\boldsymbol{x}_k = \{x_{0k},x_{1k},\dots,x_{Mk}\}</math>. There will be {{tmath|N+1}} possible values of the categorical variable ''y'', ranging from 0 to ''N''.

Let ''p<sub>n</sub>('''x''')'' be the probability, given the explanatory variable vector '''x''', that the outcome will be <math>y=n</math>. Define <math>p_{nk}=p_n(\boldsymbol{x}_k)</math>, the probability that the categorical outcome of the ''k''-th measurement is ''n''. The Lagrangian will be expressed as a function of the probabilities ''p<sub>nk</sub>'' and will be maximized by equating its derivatives with respect to these probabilities to zero. An important point is that the probabilities are treated equally, and the fact that they sum to 1 is part of the Lagrangian formulation rather than being assumed from the beginning.

The first contribution to the Lagrangian is the [[Entropy (information theory)|entropy]]:

:<math>\mathcal{L}_{ent}=-\sum_{k=1}^K\sum_{n=0}^N p_{nk}\ln(p_{nk})</math>

The log-likelihood is:

:<math>\ell=\sum_{k=1}^K\sum_{n=0}^N \Delta(n,y_k)\ln(p_{nk})</math>

Assuming the multinomial logistic function, the derivative of the log-likelihood with respect to the beta coefficients was found to be:

:<math>\frac{\partial \ell}{\partial \beta_{nm}}=\sum_{k=1}^K ( p_{nk}x_{mk}-\Delta(n,y_k)x_{mk})</math>

A very important point here is that this expression is (remarkably) not an explicit function of the beta coefficients. It is only a function of the probabilities ''p<sub>nk</sub>'' and the data. Rather than being specific to the assumed multinomial logistic case, it is taken to be a general statement of the condition at which the log-likelihood is maximized, and it makes no reference to the functional form of ''p<sub>nk</sub>''. There are then (''M''+1)(''N''+1) fitting constraints, and the fitting-constraint term in the Lagrangian is:

:<math>\mathcal{L}_{fit}=\sum_{n=0}^N\sum_{m=0}^M \lambda_{nm}\sum_{k=1}^K (p_{nk}x_{mk}-\Delta(n,y_k)x_{mk})</math>

where the ''λ<sub>nm</sub>'' are the appropriate Lagrange multipliers. There are ''K'' normalization constraints, which may be written:

:<math>\sum_{n=0}^N p_{nk}=1</math>

so that the normalization term in the Lagrangian is:

:<math>\mathcal{L}_{norm}=\sum_{k=1}^K \alpha_k \left(1-\sum_{n=0}^N p_{nk}\right) </math>

where the ''α<sub>k</sub>'' are the appropriate Lagrange multipliers.
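The structure of these three terms can be checked numerically. The following is a minimal sketch, not part of the cited derivation, that evaluates <math>\mathcal{L}_{ent}</math>, <math>\mathcal{L}_{fit}</math> and <math>\mathcal{L}_{norm}</math> on randomly generated toy data; all array names and dimensions are illustrative assumptions.

<syntaxhighlight lang="python">
# Illustrative sketch (assumed toy data): evaluate the three Lagrangian terms.
import numpy as np

rng = np.random.default_rng(0)
K, M, N = 6, 2, 3                                            # K data points, M+1 variables, N+1 categories
X = np.hstack([np.ones((K, 1)), rng.normal(size=(K, M))])    # x_{mk}, with x_0 = 1
y = rng.integers(0, N + 1, size=K)                           # categorical outcomes y_k in {0, ..., N}
P = rng.dirichlet(np.ones(N + 1), size=K)                    # candidate probabilities p_{nk}
lam = rng.normal(size=(N + 1, M + 1))                        # Lagrange multipliers lambda_{nm}
alpha = rng.normal(size=K)                                   # Lagrange multipliers alpha_k

# Delta(n, y_k): 1 when the k-th outcome equals category n, else 0.
Delta = (np.arange(N + 1)[None, :] == y[:, None]).astype(float)

L_ent = -np.sum(P * np.log(P))                               # entropy term
L_fit = np.sum(lam * np.einsum('kn,km->nm', P - Delta, X))   # fitting-constraint term
L_norm = np.sum(alpha * (1.0 - P.sum(axis=1)))               # normalization term
L = L_ent + L_fit + L_norm
</syntaxhighlight>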
The Lagrangian is then the sum of the above three terms:

:<math>\mathcal{L}=\mathcal{L}_{ent} + \mathcal{L}_{fit} + \mathcal{L}_{norm}</math>

Setting the derivative of the Lagrangian with respect to one of the probabilities to zero yields:

:<math>\frac{\partial \mathcal{L}}{\partial p_{n'k'}}=0=-\ln(p_{n'k'})-1+\sum_{m=0}^M (\lambda_{n'm}x_{mk'})-\alpha_{k'}</math>

Using the more condensed vector notation:

:<math>\sum_{m=0}^M \lambda_{nm}x_{mk} = \boldsymbol{\lambda}_n\cdot\boldsymbol{x}_k</math>

dropping the primes on the ''n'' and ''k'' indices, and then solving for <math>p_{nk}</math> yields:

:<math>p_{nk}=e^{\boldsymbol{\lambda}_n\cdot\boldsymbol{x}_k}/Z_k</math>

where:

:<math>Z_k=e^{1+\alpha_k}</math>

Imposing the normalization constraint, we can solve for the ''Z<sub>k</sub>'' and write the probabilities as:

:<math>p_{nk}=\frac{e^{\boldsymbol{\lambda}_n\cdot\boldsymbol{x}_k}}{\sum_{u=0}^N e^{\boldsymbol{\lambda}_u\cdot\boldsymbol{x}_k}}</math>

The <math>\boldsymbol{\lambda}_n</math> are not all independent. We can add any constant {{tmath|(M+1)}}-dimensional vector to each of the <math>\boldsymbol{\lambda}_n</math> without changing the value of the <math>p_{nk}</math> probabilities, so that there are only ''N'' rather than {{tmath|N+1}} independent <math>\boldsymbol{\lambda}_n</math>. In the [[#Multinomial logistic regression : Many explanatory variable and many categories|multinomial logistic regression]] section above, <math>\boldsymbol{\lambda}_0</math> was subtracted from each <math>\boldsymbol{\lambda}_n</math>, which set the exponential term involving <math>\boldsymbol{\lambda}_0</math> to 1, and the beta coefficients were given by <math>\boldsymbol{\beta}_n=\boldsymbol{\lambda}_n-\boldsymbol{\lambda}_0</math>.
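The resulting functional form is the softmax (multinomial logistic) function, and the shift invariance of the <math>\boldsymbol{\lambda}_n</math> described above can be verified directly. The following is a minimal sketch on assumed toy data; the helper <code>probs</code> and the array names are illustrative, not taken from the cited source.

<syntaxhighlight lang="python">
# Illustrative sketch (assumed toy data): softmax form and shift invariance of lambda_n.
import numpy as np

rng = np.random.default_rng(1)
K, M, N = 5, 2, 3
X = np.hstack([np.ones((K, 1)), rng.normal(size=(K, M))])    # x_k, with x_0 = 1
lam = rng.normal(size=(N + 1, M + 1))                        # the lambda_n vectors

def probs(lam, X):
    """p_{nk} = exp(lambda_n . x_k) / sum_u exp(lambda_u . x_k)."""
    scores = X @ lam.T                                        # lambda_n . x_k, shape (K, N+1)
    e = np.exp(scores - scores.max(axis=1, keepdims=True))    # numerically stabilized exponentials
    return e / e.sum(axis=1, keepdims=True)

P = probs(lam, X)

# Adding the same constant vector to every lambda_n leaves the probabilities unchanged,
# which is why only N of the N+1 vectors are independent.
shift = rng.normal(size=M + 1)
assert np.allclose(P, probs(lam + shift, X))

# Subtracting lambda_0 from each lambda_n gives the beta coefficients of the multinomial
# logistic model, beta_n = lambda_n - lambda_0, with beta_0 = 0 and the same probabilities.
beta = lam - lam[0]
assert np.allclose(P, probs(beta, X))
</syntaxhighlight>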