== Interpretation ==
''R''<sup>2</sup> is a measure of the [[goodness of fit]] of a model.<ref>{{cite book |last1=Casella |first1=George |title=Statistical Inference |date=2002 |publisher=Duxbury/Thomson Learning |location=Pacific Grove, Calif. |isbn=9788131503942 |page=556 |edition=Second}}</ref> In regression, the ''R''<sup>2</sup> coefficient of determination is a statistical measure of how well the regression predictions approximate the real data points. An ''R''<sup>2</sup> of 1 indicates that the regression predictions perfectly fit the data.

Values of ''R''<sup>2</sup> outside the range 0 to 1 occur when the model fits the data worse than a horizontal hyperplane at a height equal to the mean of the observed data (the baseline [[least-squares]] predictor). This occurs when a wrong model was chosen, or nonsensical constraints were applied by mistake. If equation 1 of Kvålseth<ref>{{Cite journal |last=Kvalseth |first=Tarald O. |date=1985 |title=Cautionary Note about R2 |journal=The American Statistician |volume=39 |issue=4 |pages=279–285 |doi=10.2307/2683704 |jstor=2683704}}</ref> is used (this is the equation used most often), ''R''<sup>2</sup> can be less than zero. If equation 2 of Kvålseth is used, ''R''<sup>2</sup> can be greater than one.

When the predictors are calculated by ordinary least-squares regression, that is, by minimizing ''SS''<sub>res</sub>, ''R''<sup>2</sup> increases as the number of variables in the model is increased (''R''<sup>2</sup> is [[Monotonic function|monotone increasing]] with the number of variables included; it will never decrease). This illustrates a drawback to one possible use of ''R''<sup>2</sup>, where one might keep adding variables ([[kitchen sink regression]]) to increase the ''R''<sup>2</sup> value. For example, if one is trying to predict the sales of a model of car from the car's gas mileage, price, and engine power, one can include probably irrelevant factors such as the first letter of the model's name or the height of the lead engineer designing the car, because the ''R''<sup>2</sup> will never decrease as variables are added and will likely experience an increase due to chance alone. This leads to the alternative approach of looking at the [[#Adjusted R2|adjusted ''R''<sup>2</sup>]]. The explanation of this statistic is almost the same as ''R''<sup>2</sup>, but it penalizes the statistic as extra variables are included in the model.

For cases other than fitting by ordinary least squares, the ''R''<sup>2</sup> statistic can be calculated as above and may still be a useful measure. If fitting is by [[weighted least squares]] or [[generalized least squares]], alternative versions of ''R''<sup>2</sup> can be calculated appropriate to those statistical frameworks, while the "raw" ''R''<sup>2</sup> may still be useful if it is more easily interpreted. Values for ''R''<sup>2</sup> can be calculated for any type of predictive model, which need not have a statistical basis.
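For illustration, the following is a minimal sketch of computing ''R''<sup>2</sup> directly from observed values and the predictions of an arbitrary model, using the common definition ''R''<sup>2</sup> = 1 − ''SS''<sub>res</sub>/''SS''<sub>tot</sub>. It assumes the NumPy library is available; the function name and data are illustrative only and not part of any standard API.

<syntaxhighlight lang="python">
import numpy as np

def r_squared(y, y_pred):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y - y_pred) ** 2)        # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)    # total sum of squares
    return 1.0 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0, 5.0]
print(r_squared(y, y))                            # 1.0: predictions fit the data perfectly
print(r_squared(y, [np.mean(y)] * 5))             # 0.0: predicting the mean explains nothing
print(r_squared(y, [5.0, 4.0, 3.0, 2.0, 1.0]))    # negative: worse than predicting the mean
</syntaxhighlight>

The last call shows how ''R''<sup>2</sup> falls below zero when the predictions fit worse than the mean of the observed data.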
=== In a multiple linear model ===
Consider a linear model with [[multiple regression|more than a single explanatory variable]], of the form

: <math>Y_i = \beta_0 + \sum_{j=1}^p \beta_j X_{i,j} + \varepsilon_i,</math>

where, for the ''i''th case, <math>{Y_i}</math> is the response variable, <math>X_{i,1},\dots,X_{i,p}</math> are ''p'' regressors, and <math>\varepsilon_i</math> is a mean zero [[errors and residuals in statistics|error]] term. The quantities <math>\beta_0,\dots,\beta_p</math> are unknown coefficients, whose values are estimated by [[least squares]]. The coefficient of determination ''R''<sup>2</sup> is a measure of the global fit of the model. Specifically, ''R''<sup>2</sup> is an element of [0, 1] and represents the proportion of variability in ''Y''<sub>''i''</sub> that may be attributed to some linear combination of the regressors ([[explanatory variable]]s) in ''X''.<ref>{{Cite web |url=https://www.mathworks.com/help/matlab/data_analysis/linear-regression.html#bswinlz |title=Linear Regression – MATLAB & Simulink |website=www.mathworks.com}}</ref>

''R''<sup>2</sup> is often interpreted as the proportion of response variation "explained" by the regressors in the model. Thus, ''R''<sup>2</sup> = 1 indicates that the fitted model explains all variability in <math>y</math>, while ''R''<sup>2</sup> = 0 indicates no 'linear' relationship between the response variable and the regressors (for straight-line regression, this means that the fitted model is a constant line with slope = 0 and intercept = <math>\bar{y}</math>). An interior value such as ''R''<sup>2</sup> = 0.7 may be interpreted as follows: "Seventy percent of the variance in the response variable can be explained by the explanatory variables. The remaining thirty percent can be attributed to unknown, [[lurking variable]]s or inherent variability."

A caution that applies to ''R''<sup>2</sup>, as to other statistical descriptions of [[correlation]] and association, is that "[[correlation does not imply causation]]". In other words, while correlations may sometimes provide valuable clues in uncovering causal relationships among variables, a non-zero estimated correlation between two variables is not, on its own, evidence that changing the value of one variable would result in changes in the values of other variables. For example, the practice of carrying matches (or a lighter) is correlated with incidence of lung cancer, but carrying matches does not cause cancer (in the standard sense of "cause").

In the case of a single regressor, fitted by least squares, ''R''<sup>2</sup> is the square of the [[Pearson product-moment correlation coefficient]] relating the regressor and the response variable. More generally, ''R''<sup>2</sup> is the square of the correlation between the constructed predictor and the response variable. With more than one regressor, ''R''<sup>2</sup> can be referred to as the [[coefficient of multiple determination]].

=== Inflation of ''R''<sup>2</sup> ===
In [[least squares]] regression using typical data, ''R''<sup>2</sup> is at least weakly increasing with an increase in the number of regressors in the model. Because increases in the number of regressors increase the value of ''R''<sup>2</sup>, ''R''<sup>2</sup> alone cannot be used as a meaningful comparison of models with very different numbers of independent variables. For a meaningful comparison between two models, an [[F-test]] can be performed on the [[residual sum of squares]]{{Citation needed|date=October 2021}}, similar to the F-tests in [[Granger causality]], though this is not always appropriate{{Explain|date=October 2021}}. As a reminder of this, some authors denote ''R''<sup>2</sup> by ''R''<sub>''q''</sub><sup>2</sup>, where ''q'' is the number of columns in ''X'' (the number of explanators including the constant).
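This inflation can be checked numerically. The sketch below is a hedged illustration assuming NumPy; the simulated data and the helper function are illustrative only. It fits the same response once with the relevant regressor alone and once with several irrelevant regressors added; the larger model's ''R''<sup>2</sup> is never smaller.

<syntaxhighlight lang="python">
import numpy as np

def ols_r_squared(X, y):
    """Fit y on X by ordinary least squares (intercept added) and return R^2."""
    X = np.column_stack([np.ones(len(y)), X])      # prepend the constant term
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares coefficients
    ss_res = np.sum((y - X @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
y = 2.0 + 3.0 * x1 + rng.normal(size=n)    # the response depends on x1 only
junk = rng.normal(size=(n, 5))             # five irrelevant regressors

r2_small = ols_r_squared(x1.reshape(-1, 1), y)
r2_big = ols_r_squared(np.column_stack([x1, junk]), y)
print(r2_small <= r2_big)   # True: adding regressors cannot lower R^2 (up to rounding)
</syntaxhighlight>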
To demonstrate this property analytically, first recall that the objective of least squares linear regression is

: <math>\min_b SS_\text{res}(b) \Rightarrow \min_b \sum_i (y_i - X_ib)^2\,</math>

where ''X<sub>i</sub>'' is a row vector of values of explanatory variables for case ''i'' and ''b'' is a column vector of coefficients of the respective elements of ''X<sub>i</sub>''.

The optimal value of the objective is weakly smaller as additional columns of <math>X</math> (the explanatory data matrix whose ''i''th row is ''X<sub>i</sub>'') are added, because less constrained minimization leads to an optimal cost that is weakly smaller than more constrained minimization does. Given this conclusion, and noting that <math>SS_\text{tot}</math> depends only on ''y'', the non-decreasing property of ''R''<sup>2</sup> follows directly from the definition above.

The intuitive reason that using an additional explanatory variable cannot lower ''R''<sup>2</sup> is this: minimizing <math>SS_\text{res}</math> is equivalent to maximizing ''R''<sup>2</sup>. When the extra variable is included, the data always have the option of giving it an estimated coefficient of zero, leaving the predicted values and the ''R''<sup>2</sup> unchanged. The only way that the optimization problem will give a non-zero coefficient is if doing so improves the ''R''<sup>2</sup>.

The above gives an analytical explanation of the inflation of ''R''<sup>2</sup>. Next, an example based on ordinary least squares from a geometric perspective is shown below.<ref>{{cite book |last1=Faraway |first1=Julian James |title=Linear Models with R |date=2005 |publisher=Chapman & Hall/CRC |isbn=9781584884255 |url=https://www.utstat.toronto.edu/~brunner/books/LinearModelsWithR.pdf}}</ref>

[[File:Screen shot proj fig.jpg|thumb|400x266px|right|This is an example of residuals of regression models in smaller and larger spaces based on ordinary least squares regression.]]

A simple case to be considered first:

: <math>Y=\beta_0+\beta_1\cdot X_1+\varepsilon\,</math>

This equation describes the [[ordinary least squares regression]] model with one regressor. The prediction is shown as the red vector in the figure on the right. Geometrically, it is the projection of the vector of observed values onto a one-dimensional model space (without intercept). The residual is shown as the red line.

: <math>Y=\beta_0+\beta_1\cdot X_1+\beta_2\cdot X_2 + \varepsilon\,</math>

This equation corresponds to the ordinary least squares regression model with two regressors. The prediction is shown as the blue vector in the figure on the right. Geometrically, it is the projection of the vector of observed values onto a larger, two-dimensional model space (without intercept). Noticeably, the values of <math>\beta_0</math> and <math>\beta_1</math> are not the same as in the equation for the smaller model space as long as <math>X_1</math> and <math>X_2</math> are not zero vectors. Therefore, the equations are expected to yield different predictions (i.e., the blue vector is expected to be different from the red vector).

The least squares regression criterion ensures that the residual is minimized. In the figure, the blue line representing the residual is orthogonal to the larger model space, giving the minimal distance from that space. The smaller model space is a subspace of the larger one, and thereby the residual of the smaller model is guaranteed to be at least as large. Comparing the red and blue lines in the figure, the blue line is orthogonal to the larger space, and any other residual from that space would be longer than the blue one. Considering the calculation for ''R''<sup>2</sup>, a smaller value of <math>SS_\text{res}</math> leads to a larger value of ''R''<sup>2</sup>, meaning that adding regressors will result in inflation of ''R''<sup>2</sup>.
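The same subspace argument can be checked numerically. The following sketch (NumPy assumed; the simulated data and helper name are illustrative only) projects the response onto a nested pair of model spaces by least squares and compares the residual lengths.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
n = 30
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

def residual_norm(columns, y):
    """Length of the residual after projecting y onto the span of the given columns."""
    X = np.column_stack(columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.linalg.norm(y - X @ beta)

ones = np.ones(n)
small = residual_norm([ones, x1], y)        # smaller model space: span{1, x1}
large = residual_norm([ones, x1, x2], y)    # larger model space:  span{1, x1, x2}
print(small >= large)   # True: the residual from the nested model space is at least as long
</syntaxhighlight>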
=== Caveats ===
''R''<sup>2</sup> does not indicate whether:
* the independent variables are a cause of the changes in the [[dependent variable]];
* [[omitted-variable bias]] exists;
* the correct [[regression analysis|regression]] was used;
* the most appropriate set of independent variables has been chosen;
* there is [[Multicollinearity|collinearity]] present in the data on the explanatory variables;
* the model might be improved by using transformed versions of the existing set of independent variables;
* there are enough data points to make a solid conclusion;
* there are a few [[outlier]]s in an otherwise good sample.

[[File:Thiel-Sen estimator.svg|thumb|Comparison of the [[Theil–Sen estimator]] (black) and [[simple linear regression]] (blue) for a set of points with [[outlier]]s. Because of the many outliers, neither of the regression lines fits the data well, as measured by the fact that neither gives a very high ''R''<sup>2</sup>.]]
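As a concrete illustration that a fairly high ''R''<sup>2</sup> does not by itself show that the correct functional form was chosen, the short sketch below (NumPy assumed; the noise-free data are arbitrary) fits a straight line to exactly quadratic data. Since there is a single regressor, ''R''<sup>2</sup> equals the squared Pearson correlation, as noted above.

<syntaxhighlight lang="python">
import numpy as np

x = np.arange(1.0, 11.0)
y = x ** 2                        # an exactly quadratic relationship, no noise

# With a single regressor, R^2 equals the squared Pearson correlation coefficient.
r2_linear = np.corrcoef(x, y)[0, 1] ** 2            # straight-line fit: about 0.95
r2_transformed = np.corrcoef(x ** 2, y)[0, 1] ** 2  # regress on x^2 instead: exactly 1.0
print(r2_linear, r2_transformed)
</syntaxhighlight>

The straight-line fit reports a high ''R''<sup>2</sup> even though the transformed regressor fits the data exactly.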