===Goodness of fit summary===

[[Goodness of fit]] in linear regression models is generally measured using [[R square|R<sup>2</sup>]]. Since this has no direct analog in logistic regression, various methods<ref name=Greene>{{cite book |last=Greene |first=William H. |title=Econometric Analysis |edition=Fifth |publisher=Prentice-Hall |year=2003 |isbn=978-0-13-066189-0 }}</ref>{{rp|ch.21}} including the following can be used instead.

====Deviance and likelihood ratio tests====

In linear regression analysis, one is concerned with partitioning variance via [[Partition of sums of squares|sum of squares]] calculations – variance in the criterion is essentially divided into variance accounted for by the predictors and residual variance. In logistic regression analysis, [[Deviance (statistics)|deviance]] is used in lieu of sum of squares calculations.<ref name=Cohen/> Deviance is analogous to the sum of squares calculations in linear regression<ref name=Hosmer/> and is a measure of the lack of fit of a logistic regression model to the data.<ref name=Cohen/> When a "saturated" model is available (a model with a theoretically perfect fit), deviance is calculated by comparing a given model with the saturated model.<ref name=Hosmer/> This computation gives the [[likelihood-ratio test]]:<ref name=Hosmer/>

:<math> D = -2\ln \frac{\text{likelihood of the fitted model}} {\text{likelihood of the saturated model}}.</math>

In the above equation, {{mvar|D}} represents the deviance and ln represents the natural logarithm. The log of this likelihood ratio (the ratio of the fitted model to the saturated model) will produce a negative value, hence the need for a negative sign. {{mvar|D}} can be shown to follow an approximate [[chi-squared distribution]].<ref name=Hosmer/> Smaller values indicate better fit, as the fitted model deviates less from the saturated model. When assessed against a chi-square distribution, nonsignificant chi-square values indicate very little unexplained variance and thus good model fit. Conversely, a significant chi-square value indicates that a significant amount of the variance is unexplained.

When the saturated model is not available (a common case), deviance is calculated simply as −2·(log likelihood of the fitted model), and the reference to the saturated model's log likelihood can be removed from all that follows without harm.

Two measures of deviance are particularly important in logistic regression: null deviance and model deviance. The null deviance represents the difference between a model with only the intercept (which means "no predictors") and the saturated model. The model deviance represents the difference between a model with at least one predictor and the saturated model.<ref name=Cohen/> In this respect, the null model provides a baseline upon which to compare predictor models. Given that deviance is a measure of the difference between a given model and the saturated model, smaller values indicate better fit. Thus, to assess the contribution of a predictor or set of predictors, one can subtract the model deviance from the null deviance and assess the difference on a <math>\chi^2_{s-p}</math> chi-square distribution, with [[Degrees of freedom (statistics)|degrees of freedom]]<ref name=Hosmer/> equal to the difference in the number of parameters estimated.

Let

:<math>\begin{align}
D_{\text{null}} &=-2\ln \frac{\text{likelihood of null model}} {\text{likelihood of the saturated model}}\\[6pt]
D_{\text{fitted}} &=-2\ln \frac{\text{likelihood of fitted model}} {\text{likelihood of the saturated model}}.
\end{align}</math>

Their difference is then

:<math>\begin{align}
D_\text{null}- D_\text{fitted} &= -2 \left(\ln \frac{\text{likelihood of null model}} {\text{likelihood of the saturated model}}-\ln \frac{\text{likelihood of fitted model}} {\text{likelihood of the saturated model}}\right)\\[6pt]
&= -2 \ln \frac{ \left( \dfrac{\text{likelihood of null model}}{\text{likelihood of the saturated model}}\right)}{\left(\dfrac{\text{likelihood of fitted model}}{\text{likelihood of the saturated model}}\right)}\\[6pt]
&= -2 \ln \frac{\text{likelihood of the null model}}{\text{likelihood of the fitted model}}.
\end{align}</math>

If the model deviance is significantly smaller than the null deviance, then one can conclude that the predictor or set of predictors significantly improves the model's fit. This is analogous to the {{mvar|F}}-test used in linear regression analysis to assess the significance of prediction.<ref name=Cohen/>
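As a rough illustration (not taken from the cited sources), the following Python sketch computes these quantities on a small synthetic data set: it forms the null and model deviances under the usual convention that the saturated log-likelihood is zero for ungrouped binary data, and refers their difference to a chi-square distribution. It assumes scikit-learn and SciPy are available; the data and all variable names are purely illustrative.

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import chi2
from sklearn.linear_model import LogisticRegression

# Synthetic data: two predictors, binary outcome generated from a logistic model.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
p_true = 1 / (1 + np.exp(-(0.5 + 1.0 * X[:, 0])))
y = rng.binomial(1, p_true)

# Fitted model: unpenalized logistic regression on both predictors.
# (penalty=None requires scikit-learn >= 1.2; older versions use penalty='none'.)
fitted = LogisticRegression(penalty=None).fit(X, y)
p_hat = fitted.predict_proba(X)[:, 1]
ll_fitted = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# Null model: intercept only, so every predicted probability is the sample mean of y.
p_null = np.full_like(p_hat, y.mean())
ll_null = np.sum(y * np.log(p_null) + (1 - y) * np.log(1 - p_null))

# For ungrouped binary data the saturated log-likelihood is 0, so D = -2 * log-likelihood.
D_fitted = -2 * ll_fitted
D_null = -2 * ll_null

# Likelihood-ratio test: the drop in deviance is referred to a chi-square
# distribution with degrees of freedom equal to the number of added parameters.
lr_stat = D_null - D_fitted
df = X.shape[1]
p_value = chi2.sf(lr_stat, df)
print(f"null deviance {D_null:.1f}, model deviance {D_fitted:.1f}, "
      f"LR statistic {lr_stat:.1f} on {df} df, p = {p_value:.4g}")
</syntaxhighlight>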
====Pseudo-R-squared====
{{Main|Pseudo-R-squared}}

In linear regression the squared multiple correlation, {{mvar|R}}<sup>2</sup>, is used to assess goodness of fit as it represents the proportion of variance in the criterion that is explained by the predictors.<ref name=Cohen/> In logistic regression analysis, there is no agreed-upon analogous measure, but there are several competing measures, each with limitations.<ref name=Cohen/><ref name=":0">{{cite web |url=https://support.sas.com/resources/papers/proceedings14/1485-2014.pdf |title=Measures of fit for logistic regression |last=Allison |first=Paul D. |publisher=Statistical Horizons LLC and the University of Pennsylvania}}</ref>

Four of the most commonly used indices and one less commonly used one are examined on this page:
* Likelihood ratio {{mvar|R}}<sup>2</sup>{{sub|L}}
* Cox and Snell {{mvar|R}}<sup>2</sup>{{sub|CS}}
* Nagelkerke {{mvar|R}}<sup>2</sup>{{sub|N}}
* McFadden {{mvar|R}}<sup>2</sup>{{sub|McF}}
* Tjur {{mvar|R}}<sup>2</sup>{{sub|T}}

====Hosmer–Lemeshow test====

The [[Hosmer–Lemeshow test]] uses a test statistic that asymptotically follows a [[chi-squared distribution|<math>\chi^2</math> distribution]] to assess whether the observed event rates match the expected event rates in subgroups of the model population. This test is considered obsolete by some statisticians because of its dependence on arbitrary binning of predicted probabilities and its relatively low power.<ref>{{cite journal |last1=Hosmer |first1=D.W. |title=A comparison of goodness-of-fit tests for the logistic regression model |journal=Stat Med |date=1997 |volume=16 |issue=9 |pages=965–980 |doi=10.1002/(sici)1097-0258(19970515)16:9<965::aid-sim509>3.3.co;2-f |pmid=9160492}}</ref>
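As a further illustrative sketch (again not drawn from the cited sources), the snippet below computes several of the pseudo-R-squared indices listed above and a Hosmer–Lemeshow statistic, reusing <code>y</code>, <code>p_hat</code>, <code>ll_null</code>, <code>ll_fitted</code>, <code>D_null</code> and <code>D_fitted</code> from the earlier example. The formulas are the standard definitions; the equal-size decile binning is only one common convention, and real implementations differ in how ties and group boundaries are handled.

<syntaxhighlight lang="python">
# Pseudo-R-squared measures (standard definitions), reusing quantities from the sketch above.
n = len(y)
r2_lr = (D_null - D_fitted) / D_null              # likelihood ratio R^2_L
r2_mcfadden = 1 - ll_fitted / ll_null             # equals R^2_L here, since the saturated log-likelihood is 0
r2_coxsnell = 1 - np.exp(2 * (ll_null - ll_fitted) / n)
r2_nagelkerke = r2_coxsnell / (1 - np.exp(2 * ll_null / n))
r2_tjur = p_hat[y == 1].mean() - p_hat[y == 0].mean()

def hosmer_lemeshow(y, p_hat, groups=10):
    """Hosmer-Lemeshow statistic and p-value from outcomes y and predicted probabilities p_hat."""
    order = np.argsort(p_hat)
    y_sorted, p_sorted = y[order], p_hat[order]
    stat = 0.0
    for idx in np.array_split(np.arange(len(y_sorted)), groups):  # equal-size "deciles of risk"
        obs1 = y_sorted[idx].sum()      # observed events in the group
        exp1 = p_sorted[idx].sum()      # expected events in the group
        obs0, exp0 = len(idx) - obs1, len(idx) - exp1
        stat += (obs1 - exp1) ** 2 / exp1 + (obs0 - exp0) ** 2 / exp0
    return stat, chi2.sf(stat, groups - 2)          # df = groups - 2

hl_stat, hl_p = hosmer_lemeshow(y, p_hat)
print(f"McFadden {r2_mcfadden:.3f}, Cox-Snell {r2_coxsnell:.3f}, "
      f"Nagelkerke {r2_nagelkerke:.3f}, Tjur {r2_tjur:.3f}, HL p = {hl_p:.3f}")
</syntaxhighlight>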