== Error and significance of fit ==

=== Deviance and likelihood ratio test – a simple case ===

In any fitting procedure, the addition of another fitting parameter to a model (e.g. the beta parameters in a logistic regression model) will almost always improve the ability of the model to predict the measured outcomes. This will be true even if the additional term has no predictive value, since the model will simply be "[[overfitting]]" to the noise in the data. The question arises as to whether the improvement gained by the addition of another fitting parameter is significant enough to recommend the inclusion of the additional term, or whether the improvement is simply that which may be expected from overfitting.

In short, for logistic regression, a statistic known as the [[deviance (statistics)|deviance]] is defined which is a measure of the error between the logistic model fit and the outcome data. In the limit of a large number of data points, the deviance is [[Chi-squared distribution|chi-squared]] distributed, which allows a [[chi-squared test]] to be implemented in order to determine the significance of the explanatory variables.

Linear regression and logistic regression have many similarities. For example, in simple linear regression, a set of ''K'' data points (''x<sub>k</sub>'', ''y<sub>k</sub>'') are fitted to a proposed model function of the form <math>y=b_0+b_1 x</math>. The fit is obtained by choosing the ''b'' parameters which minimize the sum of the squares of the residuals (the squared error term) for each data point:

:<math>\varepsilon^2=\sum_{k=1}^K (b_0+b_1 x_k-y_k)^2.</math>

The minimum value which constitutes the fit will be denoted by <math>\hat{\varepsilon}^2</math>.

The idea of a [[null model]] may be introduced, in which it is assumed that the ''x'' variable is of no use in predicting the ''y<sub>k</sub>'' outcomes: the data points are fitted to a null model function of the form ''y'' = ''b''<sub>0</sub> with a squared error term:

:<math>\varepsilon^2=\sum_{k=1}^K (b_0-y_k)^2.</math>

The fitting process consists of choosing a value of ''b''<sub>0</sub> which minimizes <math>\varepsilon^2</math>; the minimized value for the null model is denoted by <math>\hat{\varepsilon}_\varphi^2</math>, where the <math>\varphi</math> subscript denotes the null model. It is seen that the null model is optimized by <math>b_0=\overline{y}</math>, where <math>\overline{y}</math> is the mean of the ''y<sub>k</sub>'' values, and the optimized <math>\hat{\varepsilon}_\varphi^2</math> is:

:<math>\hat{\varepsilon}_\varphi^2=\sum_{k=1}^K (\overline{y}-y_k)^2</math>

which is proportional to the square of the (uncorrected) sample standard deviation of the ''y<sub>k</sub>'' data points.

We can imagine a case where the ''y<sub>k</sub>'' data points are randomly assigned to the various ''x<sub>k</sub>'', and then fitted using the proposed model. Specifically, we can consider the fits of the proposed model to every permutation of the ''y<sub>k</sub>'' outcomes. It can be shown that the optimized error of any of these fits will never be greater than the optimum error of the null model, and that the difference between these minimum errors will follow a [[chi-squared distribution]], with degrees of freedom equal to those of the proposed model minus those of the null model which, in this case, will be <math>2-1=1</math>. Using the [[chi-squared test]], we may then estimate how many of these permuted sets of ''y<sub>k</sub>'' will yield a minimum error less than or equal to the minimum error using the original ''y<sub>k</sub>'', and so we can estimate how significant an improvement is given by the inclusion of the ''x'' variable in the proposed model.
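The comparison between the proposed and null fits can be made concrete with a short numerical sketch. The following Python snippet uses NumPy with a small invented dataset (the data values and variable names are illustrative assumptions, not taken from this article) to compute the two minimized squared errors and confirm that the fitted model's error never exceeds that of the null model.

<syntaxhighlight lang="python">
import numpy as np

# Illustrative data, assumed purely for demonstration.
x = np.array([0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 2.5, 2.75])
y = np.array([0.9, 1.3, 1.1, 1.8, 2.1, 1.9, 2.6, 2.4, 3.0, 3.2])

# Proposed model y = b0 + b1*x: least squares minimizes the squared error term.
b1, b0 = np.polyfit(x, y, 1)
sse_fit = np.sum((b0 + b1 * x - y) ** 2)   # minimized epsilon^2 for the proposed model

# Null model y = b0: the optimum is b0 = mean(y).
sse_null = np.sum((y.mean() - y) ** 2)     # minimized epsilon_phi^2 for the null model

# Adding the b1*x term can only reduce the minimized squared error.
assert sse_fit <= sse_null
print(sse_fit, sse_null, sse_null - sse_fit)
</syntaxhighlight>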
For logistic regression, the measure of goodness-of-fit is the likelihood function ''L'', or its logarithm, the log-likelihood ''ℓ''. The likelihood function ''L'' is analogous to the <math>\varepsilon^2</math> in the linear regression case, except that the likelihood is maximized rather than minimized. Denote the maximized log-likelihood of the proposed model by <math>\hat{\ell}</math>.

In the case of simple binary logistic regression, the set of ''K'' data points are fitted in a probabilistic sense to a function of the form:

:<math>p(x)=\frac{1}{1+e^{-t}}</math>

where {{tmath|p(x)}} is the probability that <math>y=1</math>. The log-odds are given by:

:<math>t=\beta_0+\beta_1 x</math>

and the log-likelihood is:

:<math>\ell=\sum_{k=1}^K \left( y_k \ln(p(x_k))+(1-y_k) \ln(1-p(x_k))\right)</math>

For the null model, the probability that <math>y=1</math> is given by:

:<math>p_\varphi(x)=\frac{1}{1+e^{-t_\varphi}}</math>

The log-odds for the null model are given by:

:<math>t_\varphi=\beta_0</math>

and the log-likelihood is:

:<math>\ell_\varphi=\sum_{k=1}^K \left( y_k \ln(p_\varphi)+(1-y_k) \ln(1-p_\varphi)\right)</math>

Since we have <math>p_\varphi=\overline{y}</math> at the maximum of ''L'', the maximum log-likelihood for the null model is

:<math>\hat{\ell}_\varphi=K(\,\overline{y} \ln(\overline{y}) + (1-\overline{y})\ln(1-\overline{y}))</math>

The optimum <math>\beta_0</math> is:

:<math>\beta_0=\ln\left(\frac{\overline{y}}{1-\overline{y}}\right)</math>

where <math>\overline{y}</math> is again the mean of the ''y<sub>k</sub>'' values. Again, we can conceptually consider the fit of the proposed model to every permutation of the ''y<sub>k</sub>'', and it can be shown that the maximum log-likelihood of these permutation fits will never be smaller than that of the null model:

:<math> \hat{\ell} \ge \hat{\ell}_\varphi</math>

Also, as an analog to the error of the linear regression case, we may define the [[deviance (statistics)|deviance]] of a logistic regression fit as:

:<math>D=\ln\left(\frac{\hat{L}^2}{\hat{L}_\varphi^2}\right) = 2(\hat{\ell}-\hat{\ell}_\varphi)</math>

which will always be positive or zero. The reason for this choice is that not only is the deviance a good measure of the goodness of fit, it is also approximately chi-squared distributed, with the approximation improving as the number of data points (''K'') increases, becoming exactly chi-squared distributed in the limit of an infinite number of data points. As in the case of linear regression, we may use this fact to estimate the probability that a random set of data points will give a better fit than the fit obtained by the proposed model, and so estimate how significantly the model is improved by including the ''x<sub>k</sub>'' data points in the proposed model.
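As a rough illustration of the quantities just defined, the sketch below (assuming NumPy and SciPy, with an invented binary dataset rather than this article's example) maximizes the log-likelihood of the proposed model numerically, evaluates the null model's log-likelihood from the formula above, and forms the deviance together with its chi-squared tail probability.

<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

# Illustrative binary data (assumed for demonstration only).
x = np.array([0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 2.5, 2.75])
y = np.array([0,   0,    0,   0,    1,   0,    1,   1,    1,   1])

def neg_log_likelihood(beta):
    t = beta[0] + beta[1] * x             # log-odds t = beta0 + beta1*x
    p = 1.0 / (1.0 + np.exp(-t))          # p(x) = 1/(1 + e^-t)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Maximize the log-likelihood of the proposed model (minimize its negative).
fit = minimize(neg_log_likelihood, x0=np.zeros(2))
ell_hat = -fit.fun

# Null model: p_phi = mean(y), so ell_phi = K*(ybar*ln(ybar) + (1-ybar)*ln(1-ybar)).
ybar = y.mean()
ell_phi = len(y) * (ybar * np.log(ybar) + (1 - ybar) * np.log(1 - ybar))

D = 2 * (ell_hat - ell_phi)               # deviance of the fit
p_value = chi2.sf(D, df=1)                # chi-squared test with one degree of freedom
print(D, p_value)
</syntaxhighlight>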
For the simple model of student test scores described above, the maximum value of the log-likelihood of the null model is <math>\hat{\ell}_\varphi= -13.8629\ldots</math> The maximum value of the log-likelihood for the simple model is <math>\hat{\ell}=-8.02988\ldots</math> so that the deviance is <math>D = 2(\hat{\ell}-\hat{\ell}_\varphi)=11.6661\ldots</math>

Using the [[chi-squared test]] of significance, the integral of the [[chi-squared distribution]] with one degree of freedom from 11.6661... to infinity is equal to 0.00063649... This effectively means that about 6 out of 10,000 fits to random ''y<sub>k</sub>'' can be expected to have a better fit (smaller deviance) than the given ''y<sub>k</sub>'', and so we can conclude that the inclusion of the ''x'' variable and data in the proposed model is a very significant improvement over the null model. In other words, we reject the [[null hypothesis]] with <math>1-p\approx 99.94 \%</math> confidence, where ''p'' = 0.00063649... is the tail probability just computed.

===Goodness of fit summary===

[[Goodness of fit]] in linear regression models is generally measured using [[R square|R<sup>2</sup>]]. Since this has no direct analog in logistic regression, various methods<ref name=Greene>{{cite book |last=Greene |first=William N. |title=Econometric Analysis |edition=Fifth |publisher=Prentice-Hall |year=2003 |isbn=978-0-13-066189-0 }}</ref>{{rp|ch.21}} including the following can be used instead.

====Deviance and likelihood ratio tests====

In linear regression analysis, one is concerned with partitioning variance via the [[Partition of sums of squares|sum of squares]] calculations – variance in the criterion is essentially divided into variance accounted for by the predictors and residual variance. In logistic regression analysis, [[Deviance (statistics)|deviance]] is used in lieu of sum of squares calculations.<ref name=Cohen/> Deviance is analogous to the sum of squares calculations in linear regression<ref name=Hosmer/> and is a measure of the lack of fit to the data in a logistic regression model.<ref name=Cohen/> When a "saturated" model is available (a model with a theoretically perfect fit), deviance is calculated by comparing a given model with the saturated model.<ref name=Hosmer/> This computation gives the [[likelihood-ratio test]]:<ref name=Hosmer/>

:<math> D = -2\ln \frac{\text{likelihood of the fitted model}} {\text{likelihood of the saturated model}}.</math>

In the above equation, {{mvar|D}} represents the deviance and ln represents the natural logarithm. The log of this likelihood ratio (the ratio of the fitted model to the saturated model) will produce a negative value, hence the need for a negative sign. {{mvar|D}} can be shown to follow an approximate [[chi-squared distribution]].<ref name=Hosmer/> Smaller values indicate better fit as the fitted model deviates less from the saturated model. When assessed upon a chi-squared distribution, nonsignificant chi-squared values indicate very little unexplained variance and thus good model fit. Conversely, a significant chi-squared value indicates that a significant amount of the variance is unexplained.

When the saturated model is not available (a common case), deviance is calculated simply as −2·(log likelihood of the fitted model), and the reference to the saturated model's log likelihood can be removed from all that follows without harm.
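For ungrouped binary outcomes, the saturated model reproduces every observation exactly, so its log-likelihood is zero and the deviance reduces to −2 times the fitted model's log-likelihood. A minimal sketch of this computation follows; the function name and the numeric inputs are illustrative assumptions, not part of this article.

<syntaxhighlight lang="python">
import numpy as np

def binomial_deviance(y, p_hat):
    """Deviance of a fitted binary logistic model against the saturated model.

    For ungrouped binary data the saturated model predicts each outcome exactly,
    so its log-likelihood is 0 and D reduces to -2 * (log-likelihood of the fit).
    """
    ell_fitted = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
    ell_saturated = 0.0
    return -2.0 * (ell_fitted - ell_saturated)

# Example: y holds 0/1 outcomes and p_hat the model's fitted probabilities.
y = np.array([0, 0, 1, 1, 1])
p_hat = np.array([0.2, 0.4, 0.5, 0.7, 0.9])
print(binomial_deviance(y, p_hat))   # smaller values indicate a fit closer to saturated
</syntaxhighlight>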
Two measures of deviance are particularly important in logistic regression: null deviance and model deviance. The null deviance represents the difference between a model with only the intercept (which means "no predictors") and the saturated model. The model deviance represents the difference between a model with at least one predictor and the saturated model.<ref name=Cohen/> In this respect, the null model provides a baseline upon which to compare predictor models. Given that deviance is a measure of the difference between a given model and the saturated model, smaller values indicate better fit. Thus, to assess the contribution of a predictor or set of predictors, one can subtract the model deviance from the null deviance and assess the difference on a chi-squared distribution, <math>\chi^2_{s-p}</math>, with [[Degrees of freedom (statistics)|degrees of freedom]]<ref name=Hosmer/> equal to the difference in the number of parameters estimated.

Let

:<math>\begin{align} D_{\text{null}} &=-2\ln \frac{\text{likelihood of null model}} {\text{likelihood of the saturated model}}\\[6pt] D_{\text{fitted}} &=-2\ln \frac{\text{likelihood of fitted model}} {\text{likelihood of the saturated model}}. \end{align} </math>

Then the difference of both is:

:<math>\begin{align} D_\text{null}- D_\text{fitted} &= -2 \left(\ln \frac{\text{likelihood of null model}} {\text{likelihood of the saturated model}}-\ln \frac{\text{likelihood of fitted model}} {\text{likelihood of the saturated model}}\right)\\[6pt] &= -2 \ln \frac{ \left( \dfrac{\text{likelihood of null model}}{\text{likelihood of the saturated model}}\right)}{\left(\dfrac{\text{likelihood of fitted model}}{\text{likelihood of the saturated model}}\right)}\\[6pt] &= -2 \ln \frac{\text{likelihood of the null model}}{\text{likelihood of fitted model}}. \end{align}</math>

If the model deviance is significantly smaller than the null deviance then one can conclude that the predictor or set of predictors significantly improve the model's fit. This is analogous to the {{mvar|F}}-test used in linear regression analysis to assess the significance of prediction.<ref name=Cohen/>

====Pseudo-R-squared====
{{main article|Pseudo-R-squared}}

In linear regression the squared multiple correlation, {{mvar|R}}<sup>2</sup>, is used to assess goodness of fit as it represents the proportion of variance in the criterion that is explained by the predictors.<ref name=Cohen/> In logistic regression analysis, there is no agreed upon analogous measure, but there are several competing measures, each with limitations.<ref name=Cohen/><ref name=":0">{{cite web |url=https://support.sas.com/resources/papers/proceedings14/1485-2014.pdf |title=Measures of fit for logistic regression |last=Allison |first=Paul D. |publisher=Statistical Horizons LLC and the University of Pennsylvania}}</ref>

Four of the most commonly used indices and one less commonly used one are examined on this page:
* Likelihood ratio {{mvar|R}}<sup>2</sup>{{sub|L}}
* Cox and Snell {{mvar|R}}<sup>2</sup>{{sub|CS}}
* Nagelkerke {{mvar|R}}<sup>2</sup>{{sub|N}}
* McFadden {{mvar|R}}<sup>2</sup>{{sub|McF}}
* Tjur {{mvar|R}}<sup>2</sup>{{sub|T}}

====Hosmer–Lemeshow test====

The [[Hosmer–Lemeshow test]] uses a test statistic that asymptotically follows a [[chi-squared distribution|<math>\chi^2</math> distribution]] to assess whether or not the observed event rates match expected event rates in subgroups of the model population.
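A minimal sketch of the Hosmer–Lemeshow computation is given below, assuming SciPy and the conventional grouping of observations into roughly equal-sized bins of sorted predicted probabilities; the function name and the choice of ''g'' = 10 groups are illustrative assumptions.

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p_hat, g=10):
    """Hosmer-Lemeshow statistic: compare observed and expected event counts
    within g groups formed from the sorted predicted probabilities."""
    order = np.argsort(p_hat)
    groups = np.array_split(order, g)                 # roughly equal-sized bins
    H = 0.0
    for idx in groups:
        n = len(idx)
        observed = y[idx].sum()                       # observed events in the bin
        expected = p_hat[idx].sum()                   # expected events in the bin
        p_bar = expected / n                          # mean predicted probability
        H += (observed - expected) ** 2 / (n * p_bar * (1 - p_bar))
    return H, chi2.sf(H, df=g - 2)                    # compared to chi-squared with g-2 df
</syntaxhighlight>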
This test is considered to be obsolete by some statisticians because of its dependence on arbitrary binning of predicted probabilities and relatively low power.<ref>{{cite journal|last1=Hosmer|first1=D.W.|title=A comparison of goodness-of-fit tests for the logistic regression model|journal=Stat Med|date=1997|volume=16|issue=9|pages=965–980|doi=10.1002/(sici)1097-0258(19970515)16:9<965::aid-sim509>3.3.co;2-f|pmid=9160492}}</ref>

===Coefficient significance===

After fitting the model, it is likely that researchers will want to examine the contribution of individual predictors. To do so, they will want to examine the regression coefficients. In linear regression, the regression coefficients represent the change in the criterion for each unit change in the predictor.<ref name=Cohen/> In logistic regression, however, the regression coefficients represent the change in the logit for each unit change in the predictor. Given that the logit is not intuitive, researchers are likely to focus on a predictor's effect on the exponential function of the regression coefficient – the odds ratio (see [[#Logistic function, odds, odds ratio, and logit|definition]]). In linear regression, the significance of a regression coefficient is assessed by computing a ''t''-test. In logistic regression, there are several different tests designed to assess the significance of an individual predictor, most notably the likelihood ratio test and the Wald statistic.

====Likelihood ratio test====

The [[likelihood-ratio test]] discussed above to assess model fit is also the recommended procedure to assess the contribution of individual "predictors" to a given model.<ref name=Hosmer/><ref name=Menard/><ref name=Cohen/> In the case of a single predictor model, one simply compares the deviance of the predictor model with that of the null model on a chi-squared distribution with a single degree of freedom. If the predictor model has significantly smaller deviance (cf. chi-squared using the difference in degrees of freedom of the two models), then one can conclude that there is a significant association between the "predictor" and the outcome. Although some common statistical packages (e.g. SPSS) do provide likelihood ratio test statistics, without this computationally intensive test it would be more difficult to assess the contribution of individual predictors in the multiple logistic regression case.{{Citation needed|date=October 2019}} To assess the contribution of individual predictors one can enter the predictors hierarchically, comparing each new model with the previous to determine the contribution of each predictor.<ref name=Cohen/> There is some debate among statisticians about the appropriateness of so-called "stepwise" procedures.{{weasel inline|date=October 2019}} The fear is that they may not preserve nominal statistical properties and may become misleading.<ref>{{cite book |first=Frank E. |last=Harrell |title=Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis |location=New York |publisher=Springer |year=2010 |isbn=978-1-4419-2918-1 }}{{page needed|date=October 2019}}</ref>
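The mechanics of this comparison are simple once the two nested models have been fitted. The sketch below (assuming SciPy, with placeholder log-likelihood values rather than a real fit) shows the test for a single added predictor.

<syntaxhighlight lang="python">
from scipy.stats import chi2

def likelihood_ratio_test(ell_reduced, ell_full, df_diff):
    """G statistic for nested models; df_diff is the number of added parameters."""
    G = 2.0 * (ell_full - ell_reduced)      # equals D_reduced - D_full
    return G, chi2.sf(G, df=df_diff)

# Placeholder maximized log-likelihoods (assumed values, not from this article):
ell_without = -115.3    # model without the predictor of interest
ell_with = -109.1       # model with the predictor added

G, p_value = likelihood_ratio_test(ell_without, ell_with, df_diff=1)
print(G, p_value)       # a small p_value suggests the predictor significantly improves the fit
</syntaxhighlight>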
====Wald statistic====

Alternatively, when assessing the contribution of individual predictors in a given model, one may examine the significance of the [[Wald test|Wald statistic]]. The Wald statistic, analogous to the ''t''-test in linear regression, is used to assess the significance of coefficients. The Wald statistic is the ratio of the square of the regression coefficient to the square of the standard error of the coefficient and is asymptotically distributed as a chi-squared distribution.<ref name=Menard/>

: <math>W_j = \frac{\beta^2_j} {SE^2_{\beta_j}}</math>

Although several statistical packages (e.g., SPSS, SAS) report the Wald statistic to assess the contribution of individual predictors, the Wald statistic has limitations. When the regression coefficient is large, the standard error of the regression coefficient also tends to be larger, increasing the probability of [[Type I and Type II errors|Type-II error]]. The Wald statistic also tends to be biased when data are sparse.<ref name=Cohen/>

====Case-control sampling====

Suppose cases are rare. Then we might wish to sample them more frequently than their prevalence in the population. For example, suppose there is a disease that affects 1 person in 10,000 and to collect our data we need to do a complete physical. It may be too expensive to do thousands of physicals of healthy people in order to obtain data for only a few diseased individuals. Thus, we may evaluate more diseased individuals, perhaps all of the rare outcomes. This is also called retrospective sampling, or equivalently, unbalanced data. As a rule of thumb, sampling controls at a rate of five times the number of cases will produce sufficient control data.<ref name="islr">https://class.stanford.edu/c4x/HumanitiesScience/StatLearning/asset/classification.pdf slide 16</ref>

Logistic regression is unique in that it may be estimated on unbalanced data, rather than randomly sampled data, and still yield correct coefficient estimates of the effects of each independent variable on the outcome. That is to say, if we form a logistic model from such data, and if the model is correct in the general population, the <math>\beta_j</math> parameters are all correct except for <math>\beta_0</math>. We can correct <math>\beta_0</math> if we know the true prevalence as follows:<ref name="islr"/>

: <math>\widehat{\beta}_0^* = \widehat{\beta}_0+\log \frac \pi {1 - \pi} - \log{ \tilde{\pi} \over {1 - \tilde{\pi}} } </math>

where <math>\pi</math> is the true prevalence and <math>\tilde{\pi}</math> is the prevalence in the sample.
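As a brief closing sketch (assuming SciPy; all numeric values are placeholders rather than estimates from this article), the Wald statistic for a single coefficient and the case-control intercept correction can be evaluated directly from the formulas above.

<syntaxhighlight lang="python">
import math
from scipy.stats import chi2

# Wald statistic for one coefficient (placeholder estimate and standard error).
beta_j, se_j = 1.20, 0.45
W_j = beta_j ** 2 / se_j ** 2            # asymptotically chi-squared with 1 df
wald_p = chi2.sf(W_j, df=1)

# Intercept correction for case-control (retrospective) sampling.
pi_true = 1.0 / 10000                    # assumed true prevalence in the population
pi_sample = 0.5                          # prevalence of cases in the oversampled study data
beta_0 = -0.8                            # intercept fitted on the case-control sample (placeholder)
beta_0_star = beta_0 + math.log(pi_true / (1 - pi_true)) - math.log(pi_sample / (1 - pi_sample))

print(W_j, wald_p, beta_0_star)
</syntaxhighlight>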