=== Deviance and likelihood ratio test – a simple case ===

In any fitting procedure, the addition of another fitting parameter to a model (e.g. the beta parameters in a logistic regression model) will almost always improve the ability of the model to predict the measured outcomes. This will be true even if the additional term has no predictive value, since the model will simply be "[[overfitting]]" to the noise in the data. The question arises as to whether the improvement gained by the addition of another fitting parameter is significant enough to recommend the inclusion of the additional term, or whether the improvement is merely that which may be expected from overfitting.

In short, for logistic regression, a statistic known as the [[deviance (statistics)|deviance]] is defined which measures the error between the logistic model fit and the outcome data. In the limit of a large number of data points, the deviance is [[Chi-squared distribution|chi-squared]] distributed, which allows a [[chi-squared test]] to be implemented in order to determine the significance of the explanatory variables.

Linear regression and logistic regression have many similarities. For example, in simple linear regression, a set of ''K'' data points (''x<sub>k</sub>'', ''y<sub>k</sub>'') is fitted to a proposed model function of the form <math>y=b_0+b_1 x</math>. The fit is obtained by choosing the ''b'' parameters which minimize the sum of the squares of the residuals (the squared error term) over the data points:

:<math>\varepsilon^2=\sum_{k=1}^K (b_0+b_1 x_k-y_k)^2.</math>

The minimum value which constitutes the fit will be denoted by <math>\hat{\varepsilon}^2</math>.

The idea of a [[null model]] may be introduced, in which it is assumed that the ''x'' variable is of no use in predicting the ''y<sub>k</sub>'' outcomes: the data points are fitted to a null model function of the form ''y'' = ''b''<sub>0</sub> with a squared error term:

:<math>\varepsilon^2=\sum_{k=1}^K (b_0-y_k)^2.</math>

The fitting process consists of choosing a value of ''b''<sub>0</sub> which minimizes <math>\varepsilon^2</math>; the minimum for the null model is denoted by <math>\hat{\varepsilon}_\varphi^2</math>, where the <math>\varphi</math> subscript denotes the null model. The null model is optimized by <math>b_0=\overline{y}</math>, where <math>\overline{y}</math> is the mean of the ''y<sub>k</sub>'' values, and the optimized <math>\hat{\varepsilon}_\varphi^2</math> is:

:<math>\hat{\varepsilon}_\varphi^2=\sum_{k=1}^K (\overline{y}-y_k)^2</math>

which is proportional to the square of the (uncorrected) sample standard deviation of the ''y<sub>k</sub>'' data points.

We can imagine a case where the ''y<sub>k</sub>'' data points are randomly assigned to the various ''x<sub>k</sub>'' and then fitted using the proposed model. Specifically, we can consider the fits of the proposed model to every permutation of the ''y<sub>k</sub>'' outcomes. It can be shown that the optimized error of any of these fits will never be less than the optimum error of the null model, and that the difference between these minimum errors will follow a [[chi-squared distribution]], with degrees of freedom equal to those of the proposed model minus those of the null model, which in this case is <math>2-1=1</math>.
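The inequality between the two optimized errors can be checked numerically. Below is a minimal Python sketch using a small synthetic data set (the values, seed, and trend are illustrative assumptions, not taken from the article); it fits both the proposed model and the null model by least squares and confirms that the proposed model's optimized error never exceeds that of the null model:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: K = 20 points with a weak linear trend plus noise.
x = np.linspace(0.0, 5.0, 20)
y = 0.3 * x + rng.normal(size=x.size)

# Proposed model y = b0 + b1*x, fitted by least squares.
b1, b0 = np.polyfit(x, y, deg=1)          # polyfit returns [slope, intercept]
eps2_hat = np.sum((b0 + b1 * x - y) ** 2)

# Null model y = b0, optimized by b0 = mean(y).
eps2_null = np.sum((y.mean() - y) ** 2)

print(eps2_hat <= eps2_null)              # True: the extra parameter never hurts
</syntaxhighlight>

The same comparison could be repeated over random permutations of the ''y'' values (e.g. with <code>rng.permutation(y)</code>) to observe the chi-squared behaviour of the error difference described above.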
Using the [[chi-squared test]], we may then estimate how many of these permuted sets of ''y<sub>k</sub>'' will yield a minimum error less than or equal to the minimum error using the original ''y<sub>k</sub>'', and so we can estimate how significant an improvement is given by the inclusion of the ''x'' variable in the proposed model.

For logistic regression, the measure of goodness of fit is the likelihood function ''L'', or its logarithm, the log-likelihood <math>\ell</math>. The likelihood function ''L'' is analogous to the <math>\varepsilon^2</math> of the linear regression case, except that the likelihood is maximized rather than minimized. Denote the maximized log-likelihood of the proposed model by <math>\hat{\ell}</math>.

In the case of simple binary logistic regression, the set of ''K'' data points is fitted in a probabilistic sense to a function of the form:

:<math>p(x)=\frac{1}{1+e^{-t}}</math>

where {{tmath|p(x)}} is the probability that <math>y=1</math>. The log-odds are given by:

:<math>t=\beta_0+\beta_1 x</math>

and the log-likelihood is:

:<math>\ell=\sum_{k=1}^K \left( y_k \ln(p(x_k))+(1-y_k) \ln(1-p(x_k))\right)</math>

For the null model, the probability that <math>y=1</math> is given by:

:<math>p_\varphi(x)=\frac{1}{1+e^{-t_\varphi}}</math>

The log-odds for the null model are given by:

:<math>t_\varphi=\beta_0</math>

and the log-likelihood is:

:<math>\ell_\varphi=\sum_{k=1}^K \left( y_k \ln(p_\varphi)+(1-y_k) \ln(1-p_\varphi)\right)</math>

Since we have <math>p_\varphi=\overline{y}</math> at the maximum of ''L'', the maximum log-likelihood for the null model is:

:<math>\hat{\ell}_\varphi=K(\,\overline{y} \ln(\overline{y}) + (1-\overline{y})\ln(1-\overline{y}))</math>

The optimum <math>\beta_0</math> is:

:<math>\beta_0=\ln\left(\frac{\overline{y}}{1-\overline{y}}\right)</math>

where <math>\overline{y}</math> is again the mean of the ''y<sub>k</sub>'' values.

Again, we can conceptually consider the fit of the proposed model to every permutation of the ''y<sub>k</sub>'', and it can be shown that the maximum log-likelihood of these permutation fits will never be smaller than that of the null model:

:<math> \hat{\ell} \ge \hat{\ell}_\varphi</math>

Also, as an analog to the error of the linear regression case, we may define the [[deviance (statistics)|deviance]] of a logistic regression fit as:

:<math>D=\ln\left(\frac{\hat{L}^2}{\hat{L}_\varphi^2}\right) = 2(\hat{\ell}-\hat{\ell}_\varphi)</math>

which will always be positive or zero. The reason for this choice is that not only is the deviance a good measure of the goodness of fit, it is also approximately chi-squared distributed, with the approximation improving as the number of data points (''K'') increases, becoming exactly chi-squared distributed in the limit of an infinite number of data points. As in the case of linear regression, we may use this fact to estimate the probability that a random set of data points will give a better fit than the fit obtained by the proposed model, and so obtain an estimate of how significantly the model is improved by including the ''x<sub>k</sub>'' data points in the proposed model.
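As a concrete sketch, the maximized log-likelihoods and the deviance can be computed directly from the formulas above using SciPy's general-purpose optimizer. The data values below are assumed to be the hours-studied/pass-fail points from the worked example earlier in the article (reproduced here as an assumption, since that table is not part of this section):

<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import minimize

# Assumed data: hours studied x_k and pass (1) / fail (0) outcomes y_k.
x = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
              2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50])
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
              1, 0, 1, 0, 1, 1, 1, 1, 1, 1])

def log_likelihood(beta):
    """ell = sum_k [ y_k ln p(x_k) + (1 - y_k) ln(1 - p(x_k)) ]."""
    t = beta[0] + beta[1] * x           # log-odds t = beta_0 + beta_1 * x
    p = 1.0 / (1.0 + np.exp(-t))        # logistic function p(x)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Maximize ell for the proposed model by minimizing its negative.
ell_hat = -minimize(lambda b: -log_likelihood(b), x0=[0.0, 0.0]).fun

# Null model: p_phi = mean(y), so the maximum has a closed form.
ybar = y.mean()
ell_phi = len(y) * (ybar * np.log(ybar) + (1 - ybar) * np.log(1 - ybar))

D = 2 * (ell_hat - ell_phi)             # deviance
print(ell_hat, ell_phi, D)              # ~ -8.02988, -13.8629, 11.6661
</syntaxhighlight>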
For the simple model of student test scores described above, the maximum value of the log-likelihood of the null model is <math>\hat{\ell}_\varphi= -13.8629\ldots</math>, while the maximum value of the log-likelihood for the simple model is <math>\hat{\ell}=-8.02988\ldots</math>, so that the deviance is

:<math>D = 2(\hat{\ell}-\hat{\ell}_\varphi)=11.6661\ldots</math>

Using the [[chi-squared test]] of significance, the integral of the [[chi-squared distribution]] with one degree of freedom from 11.6661... to infinity is equal to <math>p=0.00063649\ldots</math> This effectively means that only about 6 out of 10,000 fits to randomly permuted ''y<sub>k</sub>'' can be expected to yield a deviance at least as large as that obtained with the given ''y<sub>k</sub>'', and so we can conclude that the inclusion of the ''x'' variable and data in the proposed model is a very significant improvement over the null model. In other words, we reject the [[null hypothesis]] with <math>1-p\approx 99.94\%</math> confidence.
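The tail probability and the resulting confidence level can be reproduced with the chi-squared survival function; a minimal sketch:

<syntaxhighlight lang="python">
from scipy.stats import chi2

D = 11.6661
p_value = chi2.sf(D, df=1)   # integral of chi-squared (1 dof) from D to infinity
print(p_value)               # ~ 0.00063649
print(1.0 - p_value)         # ~ 0.99936, i.e. roughly 99.94% confidence
</syntaxhighlight>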