Logistic regression (section)
===Coefficient significance===

After fitting the model, researchers will often want to examine the contribution of individual predictors by inspecting the regression coefficients. In linear regression, a regression coefficient represents the change in the criterion for each unit change in the predictor.<ref name=Cohen/> In logistic regression, however, a regression coefficient represents the change in the logit for each unit change in the predictor. Because the logit is not intuitive, researchers are likely to focus instead on the exponential of the regression coefficient β, the odds ratio (see [[#Logistic function, odds, odds ratio, and logit|definition]]). In linear regression, the significance of a regression coefficient is assessed by computing a ''t''-test. In logistic regression, there are several different tests designed to assess the significance of an individual predictor, most notably the likelihood-ratio test and the Wald statistic.

====Likelihood ratio test====

The [[likelihood-ratio test]] discussed above for assessing model fit is also the recommended procedure for assessing the contribution of individual "predictors" to a given model.<ref name=Hosmer/><ref name=Menard/><ref name=Cohen/> In the case of a single-predictor model, one simply compares the deviance of the predictor model with that of the null model against a chi-squared distribution with a single degree of freedom. If the predictor model has significantly smaller deviance (cf. a chi-squared test using the difference in degrees of freedom of the two models), then one can conclude that there is a significant association between the "predictor" and the outcome. Although some common statistical packages (e.g. SPSS) do provide likelihood-ratio test statistics, without this computationally intensive test it would be more difficult to assess the contribution of individual predictors in the multiple logistic regression case.{{Citation needed|date=October 2019}} To assess the contribution of individual predictors one can enter the predictors hierarchically, comparing each new model with the previous one to determine the contribution of each predictor.<ref name=Cohen/> There is some debate among statisticians about the appropriateness of such "stepwise" procedures,{{weasel inline|date=October 2019}} the concern being that they may not preserve nominal statistical properties and may become misleading.<ref>{{cite book |first=Frank E. |last=Harrell |title=Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis |location=New York |publisher=Springer |year=2010 |isbn=978-1-4419-2918-1 }}{{page needed|date=October 2019}}</ref>

====Wald statistic====

Alternatively, when assessing the contribution of individual predictors in a given model, one may examine the significance of the [[Wald test|Wald statistic]]. The Wald statistic, analogous to the ''t''-test in linear regression, is used to assess the significance of coefficients. It is the ratio of the square of the regression coefficient to the square of the standard error of the coefficient, and is asymptotically distributed as a chi-squared distribution:<ref name=Menard/>

: <math>W_j = \frac{\beta^2_j}{SE^2_{\beta_j}}</math>

Although several statistical packages (e.g., SPSS, SAS) report the Wald statistic to assess the contribution of individual predictors, the Wald statistic has limitations. When the regression coefficient is large, the standard error of the regression coefficient also tends to be larger, increasing the probability of [[Type I and Type II errors|Type II error]].
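As a minimal sketch, the Wald statistic for a single coefficient can be computed from the fitted coefficient and its standard error, and referred to a chi-squared distribution with one degree of freedom (the numeric values here are hypothetical, not from any particular fitted model):

```python
from scipy.stats import chi2


def wald_statistic(beta, se):
    """Wald statistic W = beta^2 / SE^2, asymptotically chi-squared with 1 df."""
    w = (beta / se) ** 2
    p_value = chi2.sf(w, df=1)  # upper-tail probability under chi2(1)
    return w, p_value


# Hypothetical fitted coefficient and standard error
w, p = wald_statistic(beta=0.75, se=0.30)
```

A small p-value here would suggest the predictor contributes significantly to the model, subject to the limitations of the Wald statistic noted in the text.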
The Wald statistic also tends to be biased when data are sparse.<ref name=Cohen/>

====Case-control sampling====

Suppose cases are rare. Then we might wish to sample them more frequently than their prevalence in the population. For example, suppose a disease affects 1 person in 10,000, and collecting our data requires a complete physical examination. It may be too expensive to perform thousands of physicals on healthy people in order to obtain data for only a few diseased individuals. Thus, we may evaluate more diseased individuals, perhaps all of the rare outcomes. This is also called retrospective sampling, or equivalently, unbalanced data. As a rule of thumb, sampling controls at a rate of five times the number of cases will produce sufficient control data.<ref name="islr">https://class.stanford.edu/c4x/HumanitiesScience/StatLearning/asset/classification.pdf slide 16</ref>

Logistic regression is unique in that it may be estimated on unbalanced data, rather than randomly sampled data, and still yield correct coefficient estimates of the effects of each independent variable on the outcome. That is to say, if we form a logistic model from such data, and if the model is correct in the general population, then the <math>\beta_j</math> parameters are all correct except for <math>\beta_0</math>. We can correct <math>\beta_0</math> if we know the true prevalence as follows:<ref name="islr"/>

: <math>\widehat{\beta}_0^* = \widehat{\beta}_0 + \log \frac{\pi}{1 - \pi} - \log \frac{\tilde{\pi}}{1 - \tilde{\pi}}</math>

where <math>\pi</math> is the true prevalence and <math>\tilde{\pi}</math> is the prevalence in the sample.
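The intercept correction above can be sketched as follows; the prevalence values are hypothetical, chosen to mimic a case-control design where a rare disease (1 in 10,000) was oversampled to make up half the training data:

```python
import math


def corrected_intercept(beta0_hat, true_prevalence, sample_prevalence):
    """Adjust a fitted intercept for case-control (unbalanced) sampling.

    Implements beta0* = beta0 + log(pi/(1-pi)) - log(pi~/(1-pi~)),
    where pi is the true prevalence and pi~ the sample prevalence.
    """
    def logit(p):
        return math.log(p / (1.0 - p))

    return beta0_hat + logit(true_prevalence) - logit(sample_prevalence)


# Hypothetical values: fitted intercept -0.2 on a balanced case-control
# sample, true population prevalence 1 in 10,000.
beta0_star = corrected_intercept(beta0_hat=-0.2,
                                 true_prevalence=1e-4,
                                 sample_prevalence=0.5)
```

Because cases were heavily oversampled, the corrected intercept is much lower than the fitted one, shifting predicted probabilities back down toward the true population prevalence; the slope coefficients need no adjustment.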