==Model fitting==

===Maximum likelihood estimation (MLE)===
The regression coefficients are usually estimated using [[maximum likelihood estimation]].<ref name=Menard/><ref>{{cite journal |first1=Christian |last1=Gourieroux |first2=Alain |last2=Monfort |title=Asymptotic Properties of the Maximum Likelihood Estimator in Dichotomous Logit Models |journal=Journal of Econometrics |volume=17 |issue=1 |year=1981 |pages=83–97 |doi=10.1016/0304-4076(81)90060-9 }}</ref> Unlike linear regression with normally distributed residuals, it is not possible to find a closed-form expression for the coefficient values that maximize the likelihood function, so an iterative process must be used instead; for example [[Newton's method]]. This process begins with a tentative solution, revises it slightly to see if it can be improved, and repeats this revision until no more improvement is made, at which point the process is said to have converged.<ref name="Menard" />

In some instances, the model may not reach convergence. Non-convergence of a model indicates that the coefficients are not meaningful because the iterative process was unable to find appropriate solutions. A failure to converge may occur for a number of reasons: having a large ratio of predictors to cases, [[multicollinearity]], [[sparse matrix|sparseness]], or complete [[Separation (statistics)|separation]].

* Having a large ratio of variables to cases results in an overly conservative Wald statistic (discussed below) and can lead to non-convergence. [[Regularization (mathematics)|Regularized]] logistic regression is specifically intended to be used in this situation.
* Multicollinearity refers to unacceptably high correlations between predictors. As multicollinearity increases, coefficients remain unbiased but standard errors increase and the likelihood of model convergence decreases.<ref name=Menard/> To detect multicollinearity amongst the predictors, one can conduct a linear regression analysis with the predictors of interest for the sole purpose of examining the tolerance statistic<ref name=Menard/> used to assess whether multicollinearity is unacceptably high.
* Sparseness in the data refers to having a large proportion of empty cells (cells with zero counts). Zero cell counts are particularly problematic with categorical predictors. With continuous predictors, the model can infer values for the zero cell counts, but this is not the case with categorical predictors. The model will not converge with zero cell counts for categorical predictors because the natural logarithm of zero is undefined, so the final solution to the model cannot be reached. To remedy this problem, researchers may collapse categories in a theoretically meaningful way or add a constant to all cells.<ref name=Menard/>
* Another numerical problem that may lead to a lack of convergence is complete separation, which refers to the instance in which the predictors perfectly predict the criterion: all cases are accurately classified and the likelihood is maximized with infinite coefficients. In such instances, one should re-examine the data, as there may be some kind of error.<ref name=Hosmer/>{{explain|date=May 2017|reason= Why is there likely some kind of error? How can this be remedied?}}
* One can also take semi-parametric or non-parametric approaches, e.g., via local-likelihood or nonparametric quasi-likelihood methods, which avoid assumptions of a parametric form for the index function and are robust to the choice of the link function (e.g., probit or logit).<ref name="sciencedirect.com">{{cite journal| doi=10.1016/j.csda.2016.10.024 | volume=108 | title=Nonparametric estimation of dynamic discrete choice models for time series data | year=2017 | journal=Computational Statistics & Data Analysis | pages=97–120 | last1 = Park | first1 = Byeong U. | last2 = Simar | first2 = Léopold | last3 = Zelenyuk | first3 = Valentin| url=https://espace.library.uq.edu.au/view/UQ:415620/UQ415620_OA.pdf }}</ref>

=== Iteratively reweighted least squares (IRLS) ===
Binary logistic regression (<math>y=0</math> or <math> y=1</math>) can, for example, be calculated using ''iteratively reweighted least squares'' (IRLS), which is equivalent to maximizing the [[log-likelihood]] of a [[Bernoulli distribution|Bernoulli distributed]] process using [[Newton's method]]. If the problem is written in vector matrix form, with parameters <math>\mathbf{w}^T=[\beta_0,\beta_1,\beta_2, \ldots]</math>, explanatory variables <math>\mathbf{x}(i)=[1, x_1(i), x_2(i), \ldots]^T</math> and expected value of the Bernoulli distribution <math>\mu(i)=\frac{1}{1+e^{-\mathbf{w}^T\mathbf{x}(i)}}</math>, the parameters <math>\mathbf{w}</math> can be found using the following iterative algorithm:

:<math>\mathbf{w}_{k+1} = \left(\mathbf{X}^T\mathbf{S}_k\mathbf{X}\right)^{-1}\mathbf{X}^T \left(\mathbf{S}_k \mathbf{X} \mathbf{w}_k + \mathbf{y} - \boldsymbol\mu_k\right)</math>

where <math>\mathbf{S}=\operatorname{diag}(\mu(i)(1-\mu(i)))</math> is a diagonal weighting matrix, <math>\boldsymbol\mu=[\mu(1), \mu(2),\ldots]</math> the vector of expected values,

:<math>\mathbf{X}=\begin{bmatrix} 1 & x_1(1) & x_2(1) & \ldots\\ 1 & x_1(2) & x_2(2) & \ldots\\ \vdots & \vdots & \vdots \end{bmatrix}</math>

the regressor matrix, and <math>\mathbf{y}=[y(1),y(2),\ldots]^T</math> the vector of response variables. More details can be found in the literature.<ref>{{cite book|last1=Murphy|first1=Kevin P.|title=Machine Learning – A Probabilistic Perspective|publisher=The MIT Press|date=2012|page=245|isbn=978-0-262-01802-9}}</ref>
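For illustration, the update above might be implemented in Python/NumPy as in the following sketch. The function name <code>fit_logistic_irls</code>, the stopping tolerance, and the simulated data are assumptions made for the example, not taken from the cited references.

<syntaxhighlight lang="python">
import numpy as np

def fit_logistic_irls(X, y, n_iter=25, tol=1e-8):
    """Fit binary logistic regression by IRLS, i.e. Newton's method on the
    Bernoulli log-likelihood.

    X : (n, p) design matrix whose first column is all ones (intercept).
    y : (n,) array of 0/1 responses.
    Returns the estimated coefficient vector w of length p.
    """
    _, p = X.shape
    w = np.zeros(p)                          # tentative starting solution
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ w))    # expected values mu(i)
        s = mu * (1.0 - mu)                  # diagonal of the weighting matrix S_k
        # Update: w_{k+1} = (X^T S_k X)^{-1} X^T (S_k X w_k + y - mu_k)
        XtSX = X.T @ (s[:, None] * X)
        rhs = X.T @ (s * (X @ w) + y - mu)
        w_new = np.linalg.solve(XtSX, rhs)
        if np.max(np.abs(w_new - w)) < tol:  # no further improvement: converged
            return w_new
        w = w_new
    return w

# Illustrative use on simulated data with true coefficients (-0.5, 2.0):
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
X = np.column_stack([np.ones(500), x1])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-0.5 + 2.0 * x1))))
print(fit_logistic_irls(X, y))               # estimates of [beta_0, beta_1]
</syntaxhighlight>

Under complete separation, as discussed above, the weighted matrix <math>\mathbf{X}^T\mathbf{S}_k\mathbf{X}</math> becomes ill-conditioned and the iterates grow without bound, which is how the non-convergence described earlier manifests in such a procedure.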
===Bayesian===
[[File:Logistic-sigmoid-vs-scaled-probit.svg|right|300px|thumb|Comparison of [[logistic function]] with a scaled inverse [[probit function]] (i.e. the [[cumulative distribution function|CDF]] of the [[normal distribution]]), comparing <math>\sigma(x)</math> vs. <math display="inline">\Phi(\sqrt{\frac{\pi}{8}}x)</math>, which makes the slopes the same at the origin. This shows the [[heavy-tailed distribution|heavier tails]] of the logistic distribution.]]
In a [[Bayesian statistics]] context, [[prior distribution]]s are normally placed on the regression coefficients, for example in the form of [[Gaussian distribution]]s. There is no [[conjugate prior]] of the [[likelihood function]] in logistic regression. When Bayesian inference was performed analytically, this made the [[posterior distribution]] difficult to calculate except in very low dimensions. Now, though, automatic software such as [[OpenBUGS]], [[Just another Gibbs sampler|JAGS]], [[PyMC]], [[Stan (software)|Stan]] or [[Turing.jl]] allows these posteriors to be computed using simulation, so lack of conjugacy is not a concern. However, when the sample size or the number of parameters is large, full Bayesian simulation can be slow, and people often use approximate methods such as [[variational Bayesian methods]] and [[expectation propagation]].
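As a minimal sketch of such a simulation-based fit, the example below uses [[PyMC]], one of the packages mentioned above. The Gaussian priors with standard deviation 10, the simulated data, and the sampler settings are arbitrary choices made for the illustration.

<syntaxhighlight lang="python">
import numpy as np
import pymc as pm

# Simulated data for illustration: one predictor, 200 observations.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.3 + 1.5 * x))))

with pm.Model():
    # Gaussian priors on the intercept and slope, as described above.
    beta0 = pm.Normal("beta0", mu=0.0, sigma=10.0)
    beta1 = pm.Normal("beta1", mu=0.0, sigma=10.0)
    # Bernoulli likelihood with a logit link.
    pm.Bernoulli("obs", logit_p=beta0 + beta1 * x, observed=y)
    # Posterior simulation; no conjugate prior is required.
    idata = pm.sample(1000, tune=1000, chains=2)

# Posterior means of the coefficients.
print(idata.posterior["beta0"].mean().item(),
      idata.posterior["beta1"].mean().item())
</syntaxhighlight>

Models in Stan, JAGS or Turing.jl follow the same structure: priors on the regression coefficients combined with a Bernoulli likelihood under a logit link.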
==="Rule of ten"===
{{main|One in ten rule}}
The widely used "[[one in ten rule]]" states that logistic regression models give stable values for the explanatory variables if based on a minimum of about 10 events per explanatory variable (EPV), where ''event'' denotes the cases belonging to the less frequent category in the dependent variable. Thus a study designed to use <math>k</math> explanatory variables for an event (e.g. [[myocardial infarction]]) expected to occur in a proportion <math>p</math> of participants in the study will require a total of <math>10k/p</math> participants. However, there is considerable debate about the reliability of this rule, which is based on simulation studies and lacks a secure theoretical underpinning.<ref>{{cite journal|pmid=27881078|pmc=5122171|year=2016|last1=Van Smeden|first1=M.|title=No rationale for 1 variable per 10 events criterion for binary logistic regression analysis|journal=BMC Medical Research Methodology|volume=16|issue=1|page=163|last2=De Groot|first2=J. A.|last3=Moons|first3=K. G.|last4=Collins|first4=G. S.|last5=Altman|first5=D. G.|last6=Eijkemans|first6=M. J.|last7=Reitsma|first7=J. B.|doi=10.1186/s12874-016-0267-3 |doi-access=free }}</ref> According to some authors<ref>{{cite journal|last=Peduzzi|first=P|author2=Concato, J |author3=Kemper, E |author4=Holford, TR |author5=Feinstein, AR |title=A simulation study of the number of events per variable in logistic regression analysis|journal=[[Journal of Clinical Epidemiology]]|date=December 1996|volume=49|issue=12|pages=1373–9|pmid=8970487|doi=10.1016/s0895-4356(96)00236-3|doi-access=free}}</ref> the rule is overly conservative in some circumstances, with the authors stating, "If we (somewhat subjectively) regard confidence interval coverage less than 93 percent, type I error greater than 7 percent, or relative bias greater than 15 percent as problematic, our results indicate that problems are fairly frequent with 2–4 EPV, uncommon with 5–9 EPV, and still observed with 10–16 EPV. The worst instances of each problem were not severe with 5–9 EPV and usually comparable to those with 10–16 EPV".<ref>{{cite journal|last1=Vittinghoff|first1=E.|last2=McCulloch|first2=C. E.|title=Relaxing the Rule of Ten Events per Variable in Logistic and Cox Regression|journal=American Journal of Epidemiology|date=12 January 2007|volume=165|issue=6|pages=710–718|doi=10.1093/aje/kwk052|pmid=17182981|doi-access=free}}</ref>

Others have found results that are not consistent with the above, using different criteria. A useful criterion is whether the fitted model will be expected to achieve the same predictive discrimination in a new sample as it appeared to achieve in the model development sample. For that criterion, 20 events per candidate variable may be required.<ref name=plo14mod/> Also, one can argue that 96 observations are needed only to estimate the model's intercept precisely enough that the margin of error in predicted probabilities is ±0.1 with a 0.95 confidence level.<ref name=rms/>
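As a quick arithmetic illustration of the <math>10k/p</math> guideline, the short sketch below computes the implied sample size; the function name and the numbers are invented for the example.

<syntaxhighlight lang="python">
def events_per_variable_n(k, p, epv=10):
    """Participants needed so that the expected number of events
    (a proportion p of the sample) gives `epv` events per explanatory variable."""
    return epv * k / p

# Illustrative numbers: 5 explanatory variables, event expected in 20% of participants.
print(events_per_variable_n(5, 0.20))           # 10*5/0.20 = 250 participants
print(events_per_variable_n(5, 0.20, epv=20))   # 500 under the stricter 20-EPV criterion
</syntaxhighlight>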