==Interpretations==
There are various equivalent specifications and interpretations of logistic regression, which fit into different types of more general models, and allow different generalizations.

===As a generalized linear model===
The particular model used by logistic regression, which distinguishes it from standard [[linear regression]] and from other types of [[regression analysis]] used for [[binary-valued]] outcomes, is the way the probability of a particular outcome is linked to the linear predictor function:

:<math>\operatorname{logit}(\operatorname{\mathbb E}[Y_i\mid x_{1,i},\ldots,x_{m,i}]) = \operatorname{logit}(p_i) = \ln \left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_m x_{m,i}</math>

Written using the more compact notation described above, this is:

:<math>\operatorname{logit}(\operatorname{\mathbb E}[Y_i\mid \mathbf{X}_i]) = \operatorname{logit}(p_i)=\ln\left(\frac{p_i}{1-p_i}\right) = \boldsymbol\beta \cdot \mathbf{X}_i</math>

This formulation expresses logistic regression as a type of [[generalized linear model]], which predicts variables with various types of [[probability distribution]]s by fitting a linear predictor function of the above form to some sort of arbitrary transformation of the expected value of the variable. The intuition for transforming using the logit function (the natural log of the odds) was explained above{{Clarify|reason=What exactly was explained?|date=February 2023}}. It also has the practical effect of converting the probability (which is bounded to be between 0 and 1) to a variable that ranges over <math>(-\infty,+\infty)</math> — thereby matching the potential range of the linear prediction function on the right side of the equation.

Both the probabilities ''p''<sub>''i''</sub> and the regression coefficients are unobserved, and the means of determining them is not part of the model itself. They are typically determined by some sort of optimization procedure, e.g. [[maximum likelihood estimation]], that finds values that best fit the observed data (i.e. that give the most accurate predictions for the data already observed), usually subject to [[regularization (mathematics)|regularization]] conditions that seek to exclude unlikely values, e.g. extremely large values for any of the regression coefficients. The use of a regularization condition is equivalent to doing [[maximum a posteriori]] (MAP) estimation, an extension of maximum likelihood. (Regularization is most commonly done using [[Ridge regression|a squared regularizing function]], which is equivalent to placing a zero-mean [[Gaussian distribution|Gaussian]] [[prior distribution]] on the coefficients, but other regularizers are also possible.) Whether or not regularization is used, it is usually not possible to find a closed-form solution; instead, an iterative numerical method must be used, such as [[iteratively reweighted least squares]] (IRLS) or, more commonly these days, a [[quasi-Newton method]] such as the [[L-BFGS|L-BFGS method]].<ref>{{cite conference |url=https://dl.acm.org/citation.cfm?id=1118871 |title=A comparison of algorithms for maximum entropy parameter estimation |last1=Malouf |first1=Robert |date=2002 |book-title=Proceedings of the Sixth Conference on Natural Language Learning (CoNLL-2002) |pages=49–55 |doi=10.3115/1118853.1118871 |doi-access=free}}</ref>
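As an illustration of the fitting procedure described above, the following sketch runs a few Newton/IRLS iterations on synthetic data using only NumPy; the data, the true coefficients, and the stopping rule are illustrative assumptions rather than part of any particular library's implementation.

<syntaxhighlight lang="python">
# A minimal sketch of fitting logistic regression by iteratively reweighted
# least squares (IRLS) on synthetic data; names and values are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])  # intercept + m covariates
beta_true = np.array([-0.5, 1.0, 2.0])
p = 1.0 / (1.0 + np.exp(-X @ beta_true))
y = rng.binomial(1, p)                                       # observed 0/1 outcomes

beta = np.zeros(X.shape[1])
for _ in range(25):                                           # Newton/IRLS iterations
    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))                    # fitted probabilities
    W = mu * (1.0 - mu)                                       # weights = Var(Y_i) at current fit
    grad = X.T @ (y - mu)                                      # gradient of the log-likelihood
    hess = X.T @ (X * W[:, None])                              # negative Hessian of the log-likelihood
    step = np.linalg.solve(hess, grad)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:
        break

print(beta)          # should be close to beta_true
print(np.exp(beta))  # exp(beta_j): multiplicative effect on the odds per unit change
</syntaxhighlight>

In practice a library routine (for instance a quasi-Newton optimizer, as noted above) would replace this hand-rolled loop, but the update it performs is essentially the one shown.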
The interpretation of the ''β''<sub>''j''</sub> parameter estimates is as the additive effect on the log of the [[odds]] for a unit change in the ''j''th explanatory variable. In the case of a dichotomous explanatory variable, for instance gender, <math>e^\beta</math> is the estimate of the odds of having the outcome for, say, males compared with females.

An equivalent formula uses the inverse of the logit function, which is the [[logistic function]], i.e.:

:<math>\operatorname{\mathbb E}[Y_i\mid \mathbf{X}_i] = p_i = \operatorname{logit}^{-1}(\boldsymbol\beta \cdot \mathbf{X}_i) = \frac{1}{1+e^{-\boldsymbol\beta \cdot \mathbf{X}_i}}</math>

The formula can also be written as a [[probability distribution]] (specifically, using a [[probability mass function]]):

:<math>\Pr(Y_i=y\mid \mathbf{X}_i) = {p_i}^y(1-p_i)^{1-y} =\left(\frac{e^{\boldsymbol\beta \cdot \mathbf{X}_i}}{1+e^{\boldsymbol\beta \cdot \mathbf{X}_i}}\right)^{y} \left(1-\frac{e^{\boldsymbol\beta \cdot \mathbf{X}_i}}{1+e^{\boldsymbol\beta \cdot \mathbf{X}_i}}\right)^{1-y} = \frac{e^{\boldsymbol\beta \cdot \mathbf{X}_i \cdot y} }{1+e^{\boldsymbol\beta \cdot \mathbf{X}_i}}</math>

===As a latent-variable model===
The logistic model has an equivalent formulation as a [[latent-variable model]]. This formulation is common in the theory of [[discrete choice]] models and makes it easier to extend to certain more complicated models with multiple, correlated choices, as well as to compare logistic regression to the closely related [[probit model]].

Imagine that, for each trial ''i'', there is a continuous [[latent variable]] ''Y''<sub>''i''</sub><sup>*</sup> (i.e. an unobserved [[random variable]]) that is distributed as follows:

:<math> Y_i^\ast = \boldsymbol\beta \cdot \mathbf{X}_i + \varepsilon_i \, </math>

where

:<math>\varepsilon_i \sim \operatorname{Logistic}(0,1) \, </math>

i.e. the latent variable can be written directly in terms of the linear predictor function and an additive random [[error variable]] that is distributed according to a standard [[logistic distribution]]. Then ''Y''<sub>''i''</sub> can be viewed as an indicator for whether this latent variable is positive:

:<math> Y_i = \begin{cases} 1 & \text{if }Y_i^\ast > 0 \ \text{ i.e. } {- \varepsilon_i} < \boldsymbol\beta \cdot \mathbf{X}_i, \\ 0 &\text{otherwise.} \end{cases} </math>

The choice of modeling the error variable specifically with a standard logistic distribution, rather than a general logistic distribution with the location and scale set to arbitrary values, seems restrictive, but in fact, it is not. It must be kept in mind that we can choose the regression coefficients ourselves, and very often can use them to offset changes in the parameters of the error variable's distribution. For example, a logistic error-variable distribution with a non-zero location parameter ''μ'' (which sets the mean) is equivalent to a distribution with a zero location parameter, where ''μ'' has been added to the intercept coefficient. Both situations produce the same value for ''Y''<sub>''i''</sub><sup>*</sup> regardless of settings of explanatory variables. Similarly, an arbitrary scale parameter ''s'' is equivalent to setting the scale parameter to 1 and then dividing all regression coefficients by ''s''. In the latter case, the resulting value of ''Y''<sub>''i''</sub><sup>''*''</sup> will be smaller by a factor of ''s'' than in the former case, for all sets of explanatory variables — but critically, it will always remain on the same side of 0, and hence lead to the same ''Y''<sub>''i''</sub> choice. (Note, however, that the irrelevancy of the scale parameter may not carry over into more complex models where more than two choices are available.)
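This latent-variable formulation can be checked numerically. The following sketch (with an arbitrary illustrative value of the linear predictor, not part of the model specification) draws standard-logistic noise, thresholds the latent variable at zero, and compares the empirical frequency of ''Y''&nbsp;=&nbsp;1 with the logistic function of the linear predictor.

<syntaxhighlight lang="python">
# Illustrative check that thresholding a latent variable with standard-logistic
# noise at 0 reproduces the logistic probability.
import numpy as np

rng = np.random.default_rng(1)
beta_dot_x = 0.7                                    # arbitrary value of the linear predictor
eps = rng.logistic(loc=0.0, scale=1.0, size=1_000_000)
y = (beta_dot_x + eps > 0).astype(int)              # Y = 1 iff Y* = beta.X + eps > 0

print(y.mean())                                     # empirical Pr(Y = 1)
print(1.0 / (1.0 + np.exp(-beta_dot_x)))            # logit^{-1}(beta.X), should match closely
</syntaxhighlight>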
It turns out that this formulation is exactly equivalent to the preceding one, phrased in terms of the [[generalized linear model]] and without any [[latent variable]]s. This can be shown as follows, using the fact that the [[cumulative distribution function]] (CDF) of the standard [[logistic distribution]] is the [[logistic function]], which is the inverse of the [[logit function]], i.e.

:<math>\Pr(\varepsilon_i < x) = \operatorname{logit}^{-1}(x)</math>

Then:

:<math> \begin{align} \Pr(Y_i=1\mid\mathbf{X}_i) &= \Pr(Y_i^\ast > 0\mid\mathbf{X}_i) \\[5pt] &= \Pr(\boldsymbol\beta \cdot \mathbf{X}_i + \varepsilon_i > 0) \\[5pt] &= \Pr(\varepsilon_i > -\boldsymbol\beta \cdot \mathbf{X}_i) \\[5pt] &= \Pr(\varepsilon_i < \boldsymbol\beta \cdot \mathbf{X}_i) & & \text{(because the logistic distribution is symmetric)} \\[5pt] &= \operatorname{logit}^{-1}(\boldsymbol\beta \cdot \mathbf{X}_i) & \\[5pt] &= p_i & & \text{(see above)} \end{align} </math>

This formulation—which is standard in [[discrete choice]] models—makes clear the relationship between logistic regression (the "logit model") and the [[probit model]], which uses an error variable distributed according to a standard [[normal distribution]] instead of a standard logistic distribution. Both the logistic and normal distributions are symmetric with a basic unimodal, "bell curve" shape. The only difference is that the logistic distribution has somewhat [[heavy-tailed distribution|heavier tails]], which means that it is less sensitive to outlying data (and hence somewhat more [[robust statistics|robust]] to model mis-specifications or erroneous data).

===Two-way latent-variable model===
Yet another formulation uses two separate latent variables:

:<math> \begin{align} Y_i^{0\ast} &= \boldsymbol\beta_0 \cdot \mathbf{X}_i + \varepsilon_0 \, \\ Y_i^{1\ast} &= \boldsymbol\beta_1 \cdot \mathbf{X}_i + \varepsilon_1 \, \end{align} </math>

where

:<math> \begin{align} \varepsilon_0 & \sim \operatorname{EV}_1(0,1) \\ \varepsilon_1 & \sim \operatorname{EV}_1(0,1) \end{align} </math>

where ''EV''<sub>1</sub>(0,1) is a standard type-1 [[extreme value distribution]]: i.e.

:<math>\Pr(\varepsilon_0=x) = \Pr(\varepsilon_1=x) = e^{-x} e^{-e^{-x}}</math>

Then

:<math> Y_i = \begin{cases} 1 & \text{if }Y_i^{1\ast} > Y_i^{0\ast}, \\ 0 &\text{otherwise.} \end{cases} </math>

This model has a separate latent variable and a separate set of regression coefficients for each possible outcome of the dependent variable. The reason for this separation is that it makes it easy to extend logistic regression to multi-outcome categorical variables, as in the [[multinomial logit]] model. In such a model, it is natural to model each possible outcome using a different set of regression coefficients. It is also possible to motivate each of the separate latent variables as the theoretical [[utility]] associated with making the associated choice, and thus motivate logistic regression in terms of [[utility theory]]. (In terms of utility theory, a rational actor always chooses the choice with the greatest associated utility.) This is the approach taken by economists when formulating [[discrete choice]] models, because it both provides a theoretically strong foundation and facilitates intuitions about the model, which in turn makes it easy to consider various sorts of extensions. (See the example below.)
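The claim that choosing the outcome with the larger latent utility reproduces the logistic choice probabilities (demonstrated algebraically in the next subsection) can also be checked by simulation. In the following sketch the two linear-predictor values are arbitrary illustrative numbers, and the errors are standard Gumbel (type-1 extreme value) draws.

<syntaxhighlight lang="python">
# Illustrative simulation: with standard type-1 extreme value (Gumbel) errors on
# each utility, choosing the larger utility yields logistic choice probabilities.
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
u0 = 0.3 + rng.gumbel(size=n)     # illustrative beta_0 . X_i = 0.3, plus EV_1(0,1) error
u1 = 1.1 + rng.gumbel(size=n)     # illustrative beta_1 . X_i = 1.1, plus EV_1(0,1) error
y = (u1 > u0).astype(int)         # Y_i = 1 iff Y_i^{1*} > Y_i^{0*}

print(y.mean())                                  # empirical Pr(Y = 1)
print(1.0 / (1.0 + np.exp(-(1.1 - 0.3))))        # logit^{-1}((beta_1 - beta_0) . X_i)
</syntaxhighlight>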
The choice of the type-1 [[extreme value distribution]] seems fairly arbitrary, but it makes the mathematics work out, and it may be possible to justify its use through [[rational choice theory]].

It turns out that this model is equivalent to the previous model, although this seems non-obvious, since there are now two sets of regression coefficients and error variables, and the error variables have a different distribution. In fact, this model reduces directly to the previous one with the following substitutions:

:<math>\boldsymbol\beta = \boldsymbol\beta_1 - \boldsymbol\beta_0</math>
:<math>\varepsilon = \varepsilon_1 - \varepsilon_0</math>

An intuition for this comes from the fact that, since we choose based on the maximum of two values, only their difference matters, not the exact values — and this effectively removes one [[Degrees of freedom (statistics)|degree of freedom]]. Another critical fact is that the difference of two type-1 extreme-value-distributed variables is a logistic distribution, i.e. <math>\varepsilon = \varepsilon_1 - \varepsilon_0 \sim \operatorname{Logistic}(0,1) .</math> We can demonstrate the equivalence as follows:

:<math>\begin{align} \Pr(Y_i=1\mid\mathbf{X}_i) = {} & \Pr \left (Y_i^{1\ast} > Y_i^{0\ast}\mid\mathbf{X}_i \right ) & \\[5pt] = {} & \Pr \left (Y_i^{1\ast} - Y_i^{0\ast} > 0\mid\mathbf{X}_i \right ) & \\[5pt] = {} & \Pr \left (\boldsymbol\beta_1 \cdot \mathbf{X}_i + \varepsilon_1 - \left (\boldsymbol\beta_0 \cdot \mathbf{X}_i + \varepsilon_0 \right ) > 0 \right ) & \\[5pt] = {} & \Pr \left ((\boldsymbol\beta_1 \cdot \mathbf{X}_i - \boldsymbol\beta_0 \cdot \mathbf{X}_i) + (\varepsilon_1 - \varepsilon_0) > 0 \right ) & \\[5pt] = {} & \Pr((\boldsymbol\beta_1 - \boldsymbol\beta_0) \cdot \mathbf{X}_i + (\varepsilon_1 - \varepsilon_0) > 0) & \\[5pt] = {} & \Pr((\boldsymbol\beta_1 - \boldsymbol\beta_0) \cdot \mathbf{X}_i + \varepsilon > 0) & & \text{(substitute } \varepsilon\text{ as above)} \\[5pt] = {} & \Pr(\boldsymbol\beta \cdot \mathbf{X}_i + \varepsilon > 0) & & \text{(substitute }\boldsymbol\beta\text{ as above)} \\[5pt] = {} & \Pr(\varepsilon > -\boldsymbol\beta \cdot \mathbf{X}_i) & & \text{(now, same as above model)}\\[5pt] = {} & \Pr(\varepsilon < \boldsymbol\beta \cdot \mathbf{X}_i) & \\[5pt] = {} & \operatorname{logit}^{-1}(\boldsymbol\beta \cdot \mathbf{X}_i) \\[5pt] = {} & p_i \end{align}</math>

====Example====
:{{Original research|example|discuss=Talk:Logistic_regression#Utility_theory_/_Elections_example_is_irrelevant|date=May 2022}}
As an example, consider a province-level election where the choice is between a right-of-center party, a left-of-center party, and a secessionist party (e.g. the [[Parti Québécois]], which wants [[Quebec]] to secede from [[Canada]]). We would then use three latent variables, one for each choice. Then, in accordance with [[utility theory]], we can interpret the latent variables as expressing the [[utility]] that results from making each of the choices. We can also interpret the regression coefficients as indicating the strength that the associated factor (i.e. explanatory variable) has in contributing to the utility — or more correctly, the amount by which a unit change in an explanatory variable changes the utility of a given choice. A voter might expect that the right-of-center party would lower taxes, especially on rich people. This would give low-income people no benefit, i.e. no change in utility (since they usually don't pay taxes); would cause moderate benefit (i.e.
somewhat more money, or moderate utility increase) for middle-income people; and would cause significant benefits for high-income people. On the other hand, the left-of-center party might be expected to raise taxes and offset it with increased welfare and other assistance for the lower and middle classes. This would cause significant positive benefit to low-income people, perhaps a weak benefit to middle-income people, and significant negative benefit to high-income people. Finally, the secessionist party would take no direct actions on the economy, but simply secede. A low-income or middle-income voter might expect basically no clear utility gain or loss from this, but a high-income voter might expect negative utility, since they are likely to own companies, which will have a harder time doing business in such an environment and probably lose money. These intuitions can be expressed as follows:

{{table alignment}}
{| class="wikitable col2right col3left"
|+ Estimated strength of regression coefficient for different outcomes (party choices) and different values of explanatory variables
|-
! !! Center-right !! Center-left !! Secessionist
|-
! High-income
| strong + || strong − || strong −
|-
! Middle-income
| moderate + || weak + || {{CNone|none}}
|-
! Low-income
| {{CNone|none|style=text-align:right;}} || strong + || {{CNone|none}}
|-
|}

This clearly shows that
# Separate sets of regression coefficients need to exist for each choice. When phrased in terms of utility, this can be seen very easily. Different choices have different effects on net utility; furthermore, the effects vary in complex ways that depend on the characteristics of each individual, so there need to be separate sets of coefficients for each characteristic, not simply a single extra per-choice characteristic.
# Even though income is a continuous variable, its effect on utility is too complex for it to be treated as a single variable. Either it needs to be directly split up into ranges, or higher powers of income need to be added so that [[polynomial regression]] on income is effectively done.

==={{anchor|log-linear model}}As a "log-linear" model===
Yet another formulation combines the two-way latent variable formulation above with the original formulation higher up without latent variables, and in the process provides a link to one of the standard formulations of the [[multinomial logit]].

Here, instead of writing the [[logit]] of the probabilities ''p''<sub>''i''</sub> as a linear predictor, we separate the linear predictor into two, one for each of the two outcomes:

:<math> \begin{align} \ln \Pr(Y_i=0) &= \boldsymbol\beta_0 \cdot \mathbf{X}_i - \ln Z \\ \ln \Pr(Y_i=1) &= \boldsymbol\beta_1 \cdot \mathbf{X}_i - \ln Z \end{align} </math>

Two separate sets of regression coefficients have been introduced, just as in the two-way latent variable model, and the two equations appear in a form that writes the [[logarithm]] of the associated probability as a linear predictor, with an extra term <math>- \ln Z</math> at the end. This term, as it turns out, serves as the [[normalizing factor]] ensuring that the result is a distribution. This can be seen by exponentiating both sides:

:<math> \begin{align} \Pr(Y_i=0) &= \frac{1}{Z} e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} \\[5pt] \Pr(Y_i=1) &= \frac{1}{Z} e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i} \end{align} </math>

In this form it is clear that the purpose of ''Z'' is to ensure that the resulting distribution over ''Y''<sub>''i''</sub> is in fact a [[probability distribution]], i.e.
it sums to 1. This means that ''Z'' is simply the sum of all un-normalized probabilities, and by dividing each probability by ''Z'', the probabilities become "[[normalizing constant|normalized]]". That is:

:<math> Z = e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}</math>

and the resulting equations are

:<math> \begin{align} \Pr(Y_i=0) &= \frac{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}} \\[5pt] \Pr(Y_i=1) &= \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}. \end{align} </math>

Or generally:

:<math>\Pr(Y_i=c) = \frac{e^{\boldsymbol\beta_c \cdot \mathbf{X}_i}}{\sum_h e^{\boldsymbol\beta_h \cdot \mathbf{X}_i}}</math>

This shows clearly how to generalize this formulation to more than two outcomes, as in [[multinomial logit]]. This general formulation is exactly the [[softmax function]] as in

:<math>\Pr(Y_i=c) = \operatorname{softmax}(c, \boldsymbol\beta_0 \cdot \mathbf{X}_i, \boldsymbol\beta_1 \cdot \mathbf{X}_i, \dots) .</math>

To see that this is equivalent to the previous model, note that the above model is overspecified, in that <math>\Pr(Y_i=0)</math> and <math>\Pr(Y_i=1)</math> cannot be independently specified: rather <math>\Pr(Y_i=0) + \Pr(Y_i=1) = 1</math>, so knowing one automatically determines the other. As a result, the model is [[nonidentifiable]], in that multiple combinations of '''''β'''''<sub>0</sub> and '''''β'''''<sub>1</sub> will produce the same probabilities for all possible explanatory variables. In fact, it can be seen that adding any constant vector to both of them will produce the same probabilities:

:<math> \begin{align} \Pr(Y_i=1) &= \frac{e^{(\boldsymbol\beta_1 +\mathbf{C}) \cdot \mathbf{X}_i}}{e^{(\boldsymbol\beta_0 +\mathbf{C})\cdot \mathbf{X}_i} + e^{(\boldsymbol\beta_1 +\mathbf{C}) \cdot \mathbf{X}_i}} \\[5pt] &= \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i}} \\[5pt] &= \frac{e^{\mathbf{C} \cdot \mathbf{X}_i}e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{e^{\mathbf{C} \cdot \mathbf{X}_i}(e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i})} \\[5pt] &= \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}. \end{align} </math>

As a result, we can simplify matters, and restore identifiability, by picking an arbitrary value for one of the two vectors. We choose to set <math>\boldsymbol\beta_0 = \mathbf{0} .</math> Then,

:<math>e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} = e^{\mathbf{0} \cdot \mathbf{X}_i} = 1</math>

and so

:<math> \Pr(Y_i=1) = \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{1 + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}} = \frac{1}{1+e^{-\boldsymbol\beta_1 \cdot \mathbf{X}_i}} = p_i</math>

which shows that this formulation is indeed equivalent to the previous formulation. (As in the two-way latent variable formulation, any settings where <math>\boldsymbol\beta = \boldsymbol\beta_1 - \boldsymbol\beta_0</math> will produce equivalent results.)
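The non-identifiability of the "log-linear" form, and the effect of fixing '''''β'''''<sub>0</sub> = '''0''', can be illustrated numerically. In the following sketch the coefficient vectors, the covariate vector, and the constant shift are arbitrary illustrative values.

<syntaxhighlight lang="python">
# Illustration of the softmax form: shifting both coefficient vectors by the same
# constant leaves the probabilities unchanged, and fixing beta_0 = 0 recovers the
# ordinary logistic (sigmoid) form.
import numpy as np

def softmax(scores):
    z = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return z / z.sum()

x = np.array([1.0, 2.0, -1.0])            # illustrative covariate vector X_i
beta0 = np.array([0.2, -0.3, 0.5])        # illustrative coefficients for outcome 0
beta1 = np.array([0.7, 0.1, -0.4])        # illustrative coefficients for outcome 1
c = np.array([10.0, -5.0, 3.0])           # arbitrary constant vector C

p = softmax(np.array([beta0 @ x, beta1 @ x]))
p_shifted = softmax(np.array([(beta0 + c) @ x, (beta1 + c) @ x]))
print(p, p_shifted)                       # identical: the parametrization is not identified

# With beta_0 fixed at the zero vector, Pr(Y = 1) is the logistic function of beta . x
beta = beta1 - beta0
p1 = softmax(np.array([0.0, beta @ x]))[1]
print(p1, 1.0 / (1.0 + np.exp(-beta @ x)))   # the two agree
</syntaxhighlight>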
Most treatments of the [[multinomial logit]] model start out either by extending the "log-linear" formulation presented here or the two-way latent variable formulation presented above, since both clearly show the way that the model could be extended to multi-way outcomes. In general, the presentation with latent variables is more common in [[econometrics]] and [[political science]], where [[discrete choice]] models and [[utility theory]] reign, while the "log-linear" formulation here is more common in [[computer science]], e.g. [[machine learning]] and [[natural language processing]].

===As a single-layer perceptron===
The model has an equivalent formulation

:<math>p_i = \frac{1}{1+e^{-(\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}}. \, </math>

This functional form is commonly called a single-layer [[perceptron]] or single-layer [[artificial neural network]]. A single-layer neural network computes a continuous output instead of a [[step function]]. The derivative of ''p<sub>i</sub>'' with respect to ''X'' = (''x''<sub>1</sub>, ..., ''x''<sub>''k''</sub>) is computed from the general form:

:<math>y = \frac{1}{1+e^{-f(X)}}</math>

where ''f''(''X'') is an [[analytic function]] in ''X''. With this choice, the single-layer neural network is identical to the logistic regression model. This function has a continuous derivative, which allows it to be used in [[backpropagation]]. This function is also preferred because its derivative is easily calculated:

:<math>\frac{\mathrm{d}y}{\mathrm{d}X} = y(1-y)\frac{\mathrm{d}f}{\mathrm{d}X}. \, </math>

===In terms of binomial data===
A closely related model assumes that each ''i'' is associated not with a single Bernoulli trial but with ''n''<sub>''i''</sub> [[independent identically distributed]] trials, where the observation ''Y''<sub>''i''</sub> is the number of successes observed (the sum of the individual Bernoulli-distributed random variables), and hence follows a [[binomial distribution]]:

:<math>Y_i \,\sim \operatorname{Bin}(n_i,p_i),\text{ for }i = 1, \dots , n</math>

An example of this distribution is the fraction of seeds (''p''<sub>''i''</sub>) that germinate after ''n''<sub>''i''</sub> are planted. In terms of [[expected value]]s, this model is expressed as follows:

:<math>p_i = \operatorname{\mathbb E}\left[\left.\frac{Y_i}{n_{i}}\,\right|\,\mathbf{X}_i \right]\,, </math>

so that

:<math>\operatorname{logit}\left(\operatorname{\mathbb E}\left[\left.\frac{Y_i}{n_i}\,\right|\,\mathbf{X}_i \right]\right) = \operatorname{logit}(p_i) = \ln \left(\frac{p_i}{1-p_i}\right) = \boldsymbol\beta \cdot \mathbf{X}_i\,,</math>

or equivalently:

:<math>\Pr(Y_i=y\mid \mathbf{X}_i) = {n_i \choose y} p_i^y(1-p_i)^{n_i-y} ={n_i \choose y} \left(\frac{1}{1+e^{-\boldsymbol\beta \cdot \mathbf{X}_i}}\right)^y \left(1-\frac{1}{1+e^{-\boldsymbol\beta \cdot \mathbf{X}_i}}\right)^{n_i-y}\,.</math>

This model can be fit using the same sorts of methods as the more basic model above.
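The derivative identity quoted in the single-layer perceptron formulation above can be verified with a quick finite-difference check; the input value and step size in the following sketch are arbitrary.

<syntaxhighlight lang="python">
# Numerical check of the identity used in backpropagation:
# d/df [1/(1+e^{-f})] = y(1-y), where y is the sigmoid output.
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

f = 0.8            # arbitrary input value
h = 1e-6           # finite-difference step
numeric = (sigmoid(f + h) - sigmoid(f - h)) / (2 * h)   # central finite difference
y = sigmoid(f)
print(numeric, y * (1 - y))                              # the two should agree closely
</syntaxhighlight>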