===As a generalized linear model===
The particular model used by logistic regression, which distinguishes it from standard [[linear regression]] and from other types of [[regression analysis]] used for [[binary-valued]] outcomes, is the way the probability of a particular outcome is linked to the linear predictor function:

:<math>\operatorname{logit}(\operatorname{\mathbb E}[Y_i\mid x_{1,i},\ldots,x_{m,i}]) = \operatorname{logit}(p_i) = \ln \left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_m x_{m,i}</math>

Written using the more compact notation described above, this is:

:<math>\operatorname{logit}(\operatorname{\mathbb E}[Y_i\mid \mathbf{X}_i]) = \operatorname{logit}(p_i)=\ln\left(\frac{p_i}{1-p_i}\right) = \boldsymbol\beta \cdot \mathbf{X}_i</math>

This formulation expresses logistic regression as a type of [[generalized linear model]], which predicts variables with various types of [[probability distribution]]s by fitting a linear predictor function of the above form to some transformation of the expected value of the variable. The intuition for transforming using the logit function (the natural log of the odds) was explained above. The transformation also has the practical effect of converting the probability (which is bounded between 0 and 1) to a variable that ranges over <math>(-\infty,+\infty)</math>, thereby matching the potential range of the linear prediction function on the right side of the equation.

Both the probabilities ''p''<sub>''i''</sub> and the regression coefficients are unobserved, and the means of determining them is not part of the model itself. They are typically determined by some sort of optimization procedure, e.g. [[maximum likelihood estimation]], that finds values that best fit the observed data (i.e. that give the most accurate predictions for the data already observed), usually subject to [[regularization (mathematics)|regularization]] conditions that seek to exclude unlikely values, e.g. extremely large values for any of the regression coefficients. The use of a regularization condition is equivalent to doing [[maximum a posteriori]] (MAP) estimation, an extension of maximum likelihood. (Regularization is most commonly done using [[Ridge regression|a squared regularizing function]], which is equivalent to placing a zero-mean [[Gaussian distribution|Gaussian]] [[prior distribution]] on the coefficients, but other regularizers are also possible.) Whether or not regularization is used, it is usually not possible to find a closed-form solution; instead, an iterative numerical method must be used, such as [[iteratively reweighted least squares]] (IRLS) or, more commonly, a [[quasi-Newton method]] such as the [[L-BFGS|L-BFGS method]].<ref>{{cite conference |url=https://dl.acm.org/citation.cfm?id=1118871 |title=A comparison of algorithms for maximum entropy parameter estimation |last1=Malouf |first1=Robert |date=2002 |book-title=Proceedings of the Sixth Conference on Natural Language Learning (CoNLL-2002) |pages=49–55 |doi=10.3115/1118853.1118871 |doi-access=free}}</ref>

Each parameter estimate ''β''<sub>''j''</sub> is interpreted as the additive effect on the log of the [[odds]] of the outcome for a unit change in the ''j''th explanatory variable. In the case of a dichotomous explanatory variable such as gender, <math>e^\beta</math> is the estimate of the odds of having the outcome for, say, males compared with females, i.e. the [[odds ratio]].
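As an illustration of this fitting procedure (a minimal sketch, not part of the model itself), the following code maximizes the L2-regularized Bernoulli log-likelihood with SciPy's L-BFGS-B optimizer on synthetic data, then reports <math>e^{\beta_j}</math> as odds ratios. The data, the penalty weight <code>lam</code>, and all variable names are illustrative assumptions.

<syntaxhighlight lang="python">
# Sketch: maximum-likelihood fitting of logistic regression with an L2 (ridge)
# penalty, using SciPy's general-purpose L-BFGS-B optimizer. The synthetic data
# and the penalty weight lam are illustrative, not canonical choices.
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(beta, X, y, lam):
    """Negative penalized Bernoulli log-likelihood for coefficients beta."""
    z = X @ beta                               # linear predictor beta . X_i
    # log-likelihood: sum_i [ y_i z_i - log(1 + e^{z_i}) ], computed stably
    log_lik = y @ z - np.logaddexp(0.0, z).sum()
    # L2 penalty, equivalent to a zero-mean Gaussian prior on the coefficients
    return -log_lik + lam * (beta @ beta)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])  # intercept + 2 covariates
true_beta = np.array([-0.5, 1.0, -2.0])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)

result = minimize(neg_log_likelihood, x0=np.zeros(3),
                  args=(X, y, 0.1), method="L-BFGS-B")
beta_hat = result.x
print("coefficients:", beta_hat)
print("odds ratios: ", np.exp(beta_hat))  # e^beta_j: multiplicative effect on the odds
</syntaxhighlight>

Here the penalty term plays the role of the MAP prior described above; setting <code>lam</code> to zero recovers unregularized maximum likelihood.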
An equivalent formula uses the inverse of the logit function, which is the [[logistic function]]:

:<math>\operatorname{\mathbb E}[Y_i\mid \mathbf{X}_i] = p_i = \operatorname{logit}^{-1}(\boldsymbol\beta \cdot \mathbf{X}_i) = \frac{1}{1+e^{-\boldsymbol\beta \cdot \mathbf{X}_i}}</math>

The formula can also be written as a [[probability distribution]] (specifically, using a [[probability mass function]]), for <math>y \in \{0,1\}</math>:

:<math>\Pr(Y_i=y\mid \mathbf{X}_i) = {p_i}^y(1-p_i)^{1-y} = \left(\frac{e^{\boldsymbol\beta \cdot \mathbf{X}_i}}{1+e^{\boldsymbol\beta \cdot \mathbf{X}_i}}\right)^{y} \left(1-\frac{e^{\boldsymbol\beta \cdot \mathbf{X}_i}}{1+e^{\boldsymbol\beta \cdot \mathbf{X}_i}}\right)^{1-y} = \frac{e^{y\,\boldsymbol\beta \cdot \mathbf{X}_i}}{1+e^{\boldsymbol\beta \cdot \mathbf{X}_i}}</math>
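As a small illustration (with hypothetical coefficient and observation values), the sketch below evaluates the inverse-logit mapping and the resulting Bernoulli probability mass function:

<syntaxhighlight lang="python">
# Sketch: the inverse-logit (logistic) mapping and the Bernoulli PMF it induces.
# beta and x are hypothetical values, not fitted estimates.
import numpy as np

def inv_logit(z):
    """Logistic function: maps the linear predictor to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

beta = np.array([-0.5, 1.0, -2.0])   # hypothetical coefficients
x = np.array([1.0, 0.3, -1.2])       # one observation (leading 1 = intercept)
p = inv_logit(beta @ x)              # p_i = Pr(Y_i = 1 | X_i)

def bernoulli_pmf(y, p):
    """p**y * (1 - p)**(1 - y), for y in {0, 1}."""
    return p**y * (1.0 - p)**(1.0 - y)

print(p, bernoulli_pmf(1, p), bernoulli_pmf(0, p))  # the two PMF values sum to 1
</syntaxhighlight>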