== Extensions ==

=== Adjusted ''R''<sup>2</sup> ===
{{See also|Effect size#Omega-squared (ω2){{!}}Omega-squared (ω<sup>2</sup>)}}

The use of an adjusted ''R''<sup>2</sup> (one common notation is <math>\bar R^2</math>, pronounced "R bar squared"; another is <math>R^2_{\text{a}}</math> or <math>R^2_{\text{adj}}</math>) is an attempt to account for the phenomenon of the ''R''<sup>2</sup> automatically increasing when extra explanatory variables are added to the model. There are many different ways of adjusting.<ref name="raju">{{Cite journal |last1=Raju |first1=Nambury S. |last2=Bilgic |first2=Reyhan |last3=Edwards |first3=Jack E. |last4=Fleer |first4=Paul F. |date=1997 |title=Methodology review: Estimation of population validity and cross-validity, and the use of equal weights in prediction |url=https://doi.org/10.1177/01466216970214001 |journal=Applied Psychological Measurement |volume=21 |issue=4 |pages=291–305 |doi=10.1177/01466216970214001 |issn=0146-6216 |s2cid=122308344}}</ref> By far the most used one, to the point that it is typically just referred to as adjusted ''R''<sup>2</sup>, is the correction proposed by [[Mordecai Ezekiel]].<ref name="raju"/><ref><!-- Ezekiel (1930, 1947) Methods Of Correlation Analysis-->{{cite Q|Q120123877}}, pp. 208–211.</ref><ref>{{Cite journal| doi = 10.1080/00220970109600656| issn = 0022-0973 | volume = 69| issue = 2| pages = 203–224| last1 = Yin| first1 = Ping| last2 = Fan| first2 = Xitao| title = Estimating ''R''<sup>2</sup> Shrinkage in Multiple Regression: A Comparison of Different Analytical Methods| journal = The Journal of Experimental Education| date = January 2001| s2cid = 121614674| url = https://digitalcommons.usu.edu/context/etd/article/7222/viewcontent/1999_Yin_Ping.pdf }}</ref> The adjusted ''R''<sup>2</sup> is defined as

: <math>\bar R^2 = {1-{SS_\text{res}/\text{df}_\text{res} \over SS_\text{tot}/\text{df}_\text{tot}}}</math>

where df<sub>''res''</sub> is the [[Degrees of freedom (statistics)|degrees of freedom]] of the estimate of the population variance around the model, and df<sub>''tot''</sub> is the degrees of freedom of the estimate of the population variance around the mean. df<sub>''res''</sub> is given in terms of the sample size ''n'' and the number of variables ''p'' in the model, {{nowrap|1=df<sub>''res''</sub> = ''n'' − ''p'' − 1}}. df<sub>''tot''</sub> is given in the same way, but with ''p'' being zero for the mean, i.e. {{nowrap|1=df<sub>''tot''</sub> = ''n'' − 1}}.

Inserting the degrees of freedom and using the definition of ''R''<sup>2</sup>, it can be rewritten as:

: <math>\bar R^2 = 1-(1-R^2){n-1 \over n-p-1}</math>

where ''p'' is the total number of explanatory variables in the model (excluding the intercept), and ''n'' is the sample size.

The adjusted ''R''<sup>2</sup> can be negative, and its value will always be less than or equal to that of ''R''<sup>2</sup>. Unlike ''R''<sup>2</sup>, the adjusted ''R''<sup>2</sup> increases only when the increase in ''R''<sup>2</sup> (due to the inclusion of a new explanatory variable) is more than one would expect to see by chance. If a set of explanatory variables with a predetermined hierarchy of importance is introduced into a regression one at a time, with the adjusted ''R''<sup>2</sup> computed each time, the level at which the adjusted ''R''<sup>2</sup> reaches a maximum and decreases afterward would be the regression with the ideal combination of having the best fit without excess/unnecessary terms.
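The two expressions above are algebraically equivalent. As an informal illustration (a minimal sketch in Python with NumPy, on hypothetical simulated data; the data and variable names are illustrative only and not taken from the sources cited here), both forms can be computed from an ordinary least squares fit and compared:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n observations, p explanatory variables (plus an intercept).
n, p = 50, 3
X = rng.normal(size=(n, p))
y = 2.0 + X @ np.array([1.5, 0.0, -0.7]) + rng.normal(size=n)

# Ordinary least squares with an explicit intercept column.
X1 = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
residuals = y - X1 @ beta

ss_res = np.sum(residuals ** 2)        # SS_res
ss_tot = np.sum((y - y.mean()) ** 2)   # SS_tot
r2 = 1 - ss_res / ss_tot

# Adjusted R^2 via the degrees-of-freedom form ...
df_res, df_tot = n - p - 1, n - 1
adj_r2_df = 1 - (ss_res / df_res) / (ss_tot / df_tot)

# ... and via the equivalent rewritten form.
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(r2, adj_r2_df, adj_r2)  # the two adjusted values coincide and never exceed R^2
</syntaxhighlight>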
[[File:Bias and variance contributing to total error.svg|thumb|640x403px|right|Schematic of the bias and variance contribution into the total error]]
The adjusted ''R''<sup>2</sup> can be interpreted as an instance of the [[bias-variance tradeoff]]. When we consider the performance of a model, a lower error represents a better performance. When the model becomes more complex, the variance will increase whereas the square of the bias will decrease, and these two quantities add up to the total error. Combining these two trends, the bias-variance tradeoff describes a relationship between the performance of the model and its complexity, which is shown as a U-shaped curve on the right. For the adjusted ''R''<sup>2</sup> specifically, the model complexity (i.e., the number of parameters) affects both the ''R''<sup>2</sup> term and the ratio <math>(n-1)/(n-p-1)</math>, so that {{overline|''R''}}<sup>2</sup> captures the attributes of both in the overall performance of the model.

''R''<sup>2</sup> can be interpreted as the variance of the model, which is influenced by the model complexity. A high ''R''<sup>2</sup> indicates a lower bias error, because the model can better explain the variation of ''y'' with the predictors. For this reason, we make fewer (erroneous) assumptions, and this results in a lower bias error. Meanwhile, to accommodate fewer assumptions, the model tends to be more complex. Based on the bias-variance tradeoff, a higher complexity will lead to a decrease in bias and a better performance (below the optimal line). In {{overline|''R''}}<sup>2</sup>, the term ({{nowrap|1=1 − ''R''<sup>2</sup>}}) will be lower with high complexity, resulting in a higher {{overline|''R''}}<sup>2</sup>, consistently indicating a better performance. On the other hand, the ratio <math>(n-1)/(n-p-1)</math> is affected by the model complexity in the opposite direction: it increases when regressors are added (i.e., with increased model complexity) and leads to a worse performance. Based on the bias-variance tradeoff, a higher model complexity (beyond the optimal line) leads to increasing errors and a worse performance.

Considering the calculation of {{overline|''R''}}<sup>2</sup>, more parameters will increase the ''R''<sup>2</sup> and lead to an increase in {{overline|''R''}}<sup>2</sup>. Nevertheless, adding more parameters will also increase the ratio <math>(n-1)/(n-p-1)</math> and thus decrease {{overline|''R''}}<sup>2</sup>. These two trends construct an inverted-U-shaped relationship between model complexity and {{overline|''R''}}<sup>2</sup>, which is consistent with the U-shaped trend of total error versus model complexity shown in the schematic. Unlike ''R''<sup>2</sup>, which will always increase when model complexity increases, {{overline|''R''}}<sup>2</sup> will increase only when the bias eliminated by the added regressor is greater than the variance introduced simultaneously. Using {{overline|''R''}}<sup>2</sup> instead of ''R''<sup>2</sup> could thereby prevent overfitting.
Following the same logic, adjusted ''R''<sup>2</sup> can be interpreted as a less biased estimator of the population ''R''<sup>2</sup>, whereas the observed sample ''R''<sup>2</sup> is a positively biased estimate of the population value.<ref name=":0">{{Cite journal|last=Shieh|first=Gwowen|date=2008-04-01|title=Improved shrinkage estimation of squared multiple correlation coefficient and squared cross-validity coefficient|journal=Organizational Research Methods|volume=11|issue=2|pages=387–407|doi=10.1177/1094428106292901|s2cid=55098407|issn=1094-4281}}</ref> Adjusted ''R''<sup>2</sup> is more appropriate when evaluating model fit (the variance in the dependent variable accounted for by the independent variables) and in comparing alternative models in the [[feature selection]] stage of model building.<ref name=":0" />

The principle behind the adjusted ''R''<sup>2</sup> statistic can be seen by rewriting the ordinary ''R''<sup>2</sup> as

: <math>R^2 = {1-{\text{VAR}_\text{res} \over \text{VAR}_\text{tot}}}</math>

where <math>\text{VAR}_\text{res} = SS_\text{res}/n</math> and <math>\text{VAR}_\text{tot} = SS_\text{tot}/n</math> are the sample variances of the estimated residuals and the dependent variable respectively, which can be seen as biased estimates of the population variances of the errors and of the dependent variable. These estimates are replaced by statistically [[Bias of an estimator#Sample variance|unbiased]] versions: <math>\text{VAR}_\text{res} = SS_\text{res}/(n-p-1)</math> and <math>\text{VAR}_\text{tot} = SS_\text{tot}/(n-1)</math>.

Despite using unbiased estimators for the population variances of the error and the dependent variable, adjusted ''R''<sup>2</sup> is not an unbiased estimator of the population ''R''<sup>2</sup>,<ref name=":0"/> which results from using the population variances of the errors and the dependent variable instead of estimating them. [[Ingram Olkin]] and [[John W. Pratt]] derived the [[minimum-variance unbiased estimator]] for the population ''R''<sup>2</sup>,<ref>{{Cite journal| doi = 10.1214/aoms/1177706717| issn = 0003-4851 | volume = 29| issue = 1| pages = 201–211| last1 = Olkin| first1 = Ingram| last2 = Pratt| first2 = John W.| title = Unbiased estimation of certain correlation coefficients| journal = The Annals of Mathematical Statistics| date = March 1958| url = https://projecteuclid.org/euclid.aoms/1177706717| doi-access = free}}</ref> which is known as the Olkin–Pratt estimator. Comparisons of different approaches for adjusting ''R''<sup>2</sup> concluded that in most situations either an approximate version of the Olkin–Pratt estimator<ref name=":0"/> or the exact Olkin–Pratt estimator<ref>{{Cite journal| doi = 10.1525/collabra.343| issn = 2474-7394| volume = 6| issue = 45| last = Karch| first = Julian| title = Improving on Adjusted R-Squared| journal = Collabra: Psychology| date = 2020-09-29| doi-access = free| hdl = 1887/3161248| hdl-access = free}}</ref> should be preferred over the (Ezekiel) adjusted ''R''<sup>2</sup>.

=== Coefficient of partial determination ===
{{See also|Partial correlation}}

The coefficient of partial determination can be defined as the proportion of variation that cannot be explained in a reduced model, but can be explained by the predictors specified in a full(er) model.<ref>Richard Anderson-Sprecher, "[http://www.tandfonline.com/doi/abs/10.1080/00031305.1994.10476036 Model Comparisons and ''R''<sup>2</sup>]", ''[[The American Statistician]]'', Volume 48, Issue 2, 1994, pp.
113–117.</ref><ref name="Nagelkerke 1991" /><ref>{{Cite web|url=https://stats.stackexchange.com/q/7775 |title=regression – R implementation of coefficient of partial determination|website=Cross Validated}}</ref> This coefficient is used to provide insight into whether one or more additional predictors may be useful in a more fully specified regression model.

The calculation for the partial ''R''<sup>2</sup> is relatively straightforward after estimating two models and generating the [[ANOVA]] tables for them. The partial ''R''<sup>2</sup> is

: <math>\frac{SS_\text{ res, reduced} - SS_\text{ res, full}}{SS_\text{ res, reduced}},</math>

which is analogous to the usual coefficient of determination:

: <math>\frac{SS_\text{tot} - SS_\text{res}}{SS_\text{tot}}.</math>

=== Generalizing and decomposing ''R''<sup>2</sup> ===

As explained above, model selection heuristics such as the adjusted ''R''<sup>2</sup> criterion and the [[F-test]] examine whether the total ''R''<sup>2</sup> sufficiently increases to determine if a new regressor should be added to the model. If a regressor is added to the model that is highly correlated with other regressors which have already been included, then the total ''R''<sup>2</sup> will hardly increase, even if the new regressor is of relevance. As a result, the above-mentioned heuristics will ignore relevant regressors when cross-correlations are high.<ref name="Hoornweg2018SUS">{{cite book |last1=Hoornweg |first1=Victor |title=Science: Under Submission | chapter=Part II: On Keeping Parameters Fixed | date=2018 |publisher=Hoornweg Press |isbn=978-90-829188-0-9 |chapter-url=http://www.victorhoornweg.com}}</ref>

[[File:Geometric R squared .svg|thumb|Geometric representation of ''r''<sup>2</sup>.]]

Alternatively, one can decompose a generalized version of ''R''<sup>2</sup> to quantify the relevance of deviating from a hypothesis.<ref name = "Hoornweg2018SUS" /> As Hoornweg (2018) shows, several [[Shrinkage (statistics)|shrinkage estimators]] – such as [[Bayesian linear regression]], [[ridge regression]], and the (adaptive) [[Lasso (statistics)#Lasso method|lasso]] – make use of this decomposition of ''R''<sup>2</sup> when they gradually shrink parameters from the unrestricted OLS solutions towards the hypothesized values. Let us first define the linear regression model as

: <math>y=X\beta+\varepsilon.</math>

It is assumed that the matrix ''X'' is standardized with Z-scores and that the column vector <math>y</math> is centered to have a mean of zero. Let the column vector <math>\beta_0</math> refer to the hypothesized regression parameters and let the column vector <math>b</math> denote the estimated parameters. We can then define

: <math>R^2=1-\frac{(y-Xb)'(y-Xb)}{(y-X\beta_0)'(y-X\beta_0)}.</math>

An ''R''<sup>2</sup> of 75% means that the in-sample accuracy improves by 75% if the data-optimized ''b'' solutions are used instead of the hypothesized <math>\beta_0</math> values. In the special case that <math>\beta_0</math> is a vector of zeros, we obtain the traditional ''R''<sup>2</sup> again.

The individual effect on ''R''<sup>2</sup> of deviating from a hypothesis can be computed with <math>R^\otimes</math> ('R-outer'). This <math>p \times p</math> matrix is given by

: <math>R^{\otimes}=(X'\tilde y_0)(X'\tilde y_0)' (X'X)^{-1}(\tilde y_0'\tilde y_0)^{-1},</math>

where <math>\tilde y_0=y-X\beta_0</math>. The diagonal elements of <math>R^\otimes</math> exactly add up to ''R''<sup>2</sup>.
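A minimal numerical sketch of this decomposition (in Python with NumPy, on hypothetical simulated data, taking <math>\beta_0</math> to be a vector of zeros; the data and variable names are illustrative assumptions, not part of the definition above) computes <math>R^\otimes</math> and checks that its trace equals the generalized ''R''<sup>2</sup>:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: X standardized with Z-scores, y centered, as assumed above.
n, p = 200, 3
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)
y = y - y.mean()

beta0 = np.zeros(p)                        # hypothesized parameters (here: zeros)
b, *_ = np.linalg.lstsq(X, y, rcond=None)  # unrestricted OLS estimates

y0 = y - X @ beta0                         # \tilde y_0
r2_gen = 1 - (y - X @ b) @ (y - X @ b) / (y0 @ y0)

# R-outer: a p-by-p matrix whose diagonal decomposes the generalized R^2.
Xy0 = (X.T @ y0).reshape(-1, 1)
R_outer = Xy0 @ Xy0.T @ np.linalg.inv(X.T @ X) / (y0 @ y0)

print(r2_gen, np.trace(R_outer))           # the diagonal of R_outer sums to R^2
</syntaxhighlight>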
If regressors are uncorrelated and <math>\beta_0</math> is a vector of zeros, then the <math>j^\text{th}</math> diagonal element of <math>R^{\otimes}</math> simply corresponds to the ''r''<sup>2</sup> value between <math>x_j</math> and <math>y</math>. When regressors <math>x_i</math> and <math>x_j</math> are correlated, <math>R^\otimes_{ii}</math> might increase at the cost of a decrease in <math>R^{\otimes}_{jj}</math>. As a result, the diagonal elements of <math>R^{\otimes}</math> may be smaller than 0 and, in more exceptional cases, larger than 1. To deal with such uncertainties, several shrinkage estimators implicitly take a weighted average of the diagonal elements of <math>R^{\otimes}</math> to quantify the relevance of deviating from a hypothesized value.<ref name = "Hoornweg2018SUS" /> See the [[Lasso (statistics)#Interpretations of lasso|lasso]] for an example.

=== ''R''<sup>2</sup> in logistic regression ===

In the case of [[logistic regression]], usually fit by [[maximum likelihood]], there are several choices of [[Logistic regression#Pseudo-R-squared|pseudo-''R''<sup>2</sup>]]. One is the generalized ''R''<sup>2</sup> originally proposed by Cox & Snell,<ref>{{cite book |last1=Cox|first1=D. R.|last2=Snell|first2=E. J.|author2-link= Joyce Snell |year=1989|title=The Analysis of Binary Data|edition=2nd|publisher=Chapman and Hall}}</ref> and independently by Magee:<ref>{{cite journal|last=Magee|first=L.|year=1990|title=''R''<sup>2</sup> measures based on Wald and likelihood ratio joint significance tests|journal=The American Statistician|volume=44|issue=3 |pages=250–3|doi=10.1080/00031305.1990.10475731}}</ref>

: <math>R^2 = 1 - \left({ \mathcal{L}(0) \over \mathcal{L}(\widehat{\theta}) }\right)^{2/n}</math>

where <math>\mathcal{L}(0)</math> is the likelihood of the model with only the intercept, <math>{\mathcal{L}(\widehat{\theta})}</math> is the likelihood of the estimated model (i.e., the model with a given set of parameter estimates) and ''n'' is the sample size. It is easily rewritten to:

: <math>R^2 = 1 - e^{\frac{2}{n} (\ln(\mathcal{L}(0)) - \ln(\mathcal{L}(\widehat{\theta})))} = 1 - e^{-D/n}</math>

where ''D'' is the test statistic of the [[likelihood ratio test]].

[[Nico Nagelkerke]] noted that it had the following properties:<ref>{{cite book |last=Nagelkerke |first=Nico J. D. |year=1992 |title=Maximum Likelihood Estimation of Functional Relationships, Pays-Bas |series=Lecture Notes in Statistics |volume=69 |isbn=978-0-387-97721-8}}</ref><ref name="Nagelkerke 1991">{{cite journal |doi=10.1093/biomet/78.3.691 |title=A Note on a General Definition of the Coefficient of Determination |date=September 1991 |last1=Nagelkerke |first1=N. J. D. |journal=Biometrika |volume=78 |issue=3 |pages=691–692 |jstor=2337038 |url= http://www.cesarzamudio.com/uploads/1/7/9/1/17916581/nagelkerke_n.j.d._1991_-_a_note_on_a_general_definition_of_the_coefficient_of_determination.pdf}}</ref>

# It is consistent with the classical coefficient of determination when both can be computed;
# Its value is maximised by the maximum likelihood estimation of a model;
# It is asymptotically independent of the sample size;
# The interpretation is the proportion of the variation explained by the model;
# The values are between 0 and 1, with 0 denoting that the model does not explain any variation and 1 denoting that it perfectly explains the observed variation;
# It does not have any unit.
However, in the case of a logistic model, where <math>\mathcal{L}(\widehat{\theta})</math> cannot be greater than 1, ''R''<sup>2</sup> is between 0 and <math> R^2_\max = 1- (\mathcal{L}(0))^{2/n} </math>; thus, Nagelkerke suggested defining a scaled ''R''<sup>2</sup> as ''R''<sup>2</sup>/''R''<sup>2</sup><sub>max</sub>.<ref name="Nagelkerke 1991" />
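As an informal illustration (a minimal sketch in Python using the statsmodels library on hypothetical simulated data; the library choice, data, and variable names are assumptions for illustration rather than part of the definitions above), the Cox–Snell ''R''<sup>2</sup>, its maximum, and the Nagelkerke-scaled version can be computed directly from the two log-likelihoods:

<syntaxhighlight lang="python">
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)

# Hypothetical binary data for a logistic regression.
n = 300
X = rng.normal(size=(n, 2))
logits = 0.5 + 1.2 * X[:, 0] - 0.8 * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-logits)))

fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
ll_full = fit.llf     # ln L(theta_hat), the fitted model
ll_null = fit.llnull  # ln L(0), the intercept-only model

# Cox & Snell / Magee generalized R^2, its maximum, and Nagelkerke's scaled version.
r2_cs = 1 - np.exp((2 / n) * (ll_null - ll_full))
r2_max = 1 - np.exp((2 / n) * ll_null)
r2_nagelkerke = r2_cs / r2_max

print(r2_cs, r2_max, r2_nagelkerke)
</syntaxhighlight>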