==In least squares regression analysis==
{{For|more general, non-linear dependency|Coefficient of determination#In a multiple linear model}}

The square of the sample correlation coefficient is typically denoted ''r''<sup>2</sup> and is a special case of the [[coefficient of determination]]. In this case, it estimates the fraction of the variance in ''Y'' that is explained by ''X'' in a [[simple linear regression]]. So if we have the observed dataset <math>Y_1, \dots , Y_n</math> and the fitted dataset <math>\hat Y_1, \dots , \hat Y_n</math>, then as a starting point the total variation in the ''Y''<sub>''i''</sub> around their average value can be decomposed as follows:

:<math>\sum_i (Y_i - \bar{Y})^2 = \sum_i (Y_i-\hat{Y}_i)^2 + \sum_i (\hat{Y}_i-\bar{Y})^2,</math>

where the <math>\hat{Y}_i</math> are the fitted values from the regression analysis. This can be rearranged to give

:<math>1 = \frac{\sum_i (Y_i-\hat{Y}_i)^2}{\sum_i (Y_i - \bar{Y})^2} + \frac{\sum_i (\hat{Y}_i-\bar{Y})^2}{\sum_i (Y_i - \bar{Y})^2}.</math>

The two summands above are the fraction of variance in ''Y'' that is unexplained by ''X'' (the first term) and the fraction that is explained by ''X'' (the second term). Next, we apply a property of [[least squares]] regression models: the sample covariance between <math>\hat{Y}_i</math> and <math>Y_i-\hat{Y}_i</math> is zero. Thus, the sample correlation coefficient between the observed and fitted response values in the regression can be written

:<math>
\begin{align}
r(Y,\hat{Y}) &= \frac{\sum_i(Y_i-\bar{Y})(\hat{Y}_i-\bar{Y})}{\sqrt{\sum_i(Y_i-\bar{Y})^2\cdot \sum_i(\hat{Y}_i-\bar{Y})^2}}\\[6pt]
&= \frac{\sum_i(Y_i-\hat{Y}_i+\hat{Y}_i-\bar{Y})(\hat{Y}_i-\bar{Y})}{\sqrt{\sum_i(Y_i-\bar{Y})^2\cdot \sum_i(\hat{Y}_i-\bar{Y})^2}}\\[6pt]
&= \frac{ \sum_i [(Y_i-\hat{Y}_i)(\hat{Y}_i-\bar{Y}) +(\hat{Y}_i-\bar{Y})^2 ]}{\sqrt{\sum_i(Y_i-\bar{Y})^2\cdot \sum_i(\hat{Y}_i-\bar{Y})^2}}\\[6pt]
&= \frac{ \sum_i (\hat{Y}_i-\bar{Y})^2 }{\sqrt{\sum_i(Y_i-\bar{Y})^2\cdot \sum_i(\hat{Y}_i-\bar{Y})^2}}\\[6pt]
&= \sqrt{\frac{\sum_i(\hat{Y}_i-\bar{Y})^2}{\sum_i(Y_i-\bar{Y})^2}}.
\end{align}
</math>

Thus

:<math>r(Y,\hat{Y})^2 = \frac{\sum_i(\hat{Y}_i-\bar{Y})^2}{\sum_i(Y_i-\bar{Y})^2},</math>

where <math>r(Y,\hat{Y})^2</math> is the proportion of variance in ''Y'' explained by a linear function of ''X''. In the derivation above, the fact that

:<math>\sum_i (Y_i-\hat{Y}_i)(\hat{Y}_i-\bar{Y}) = 0</math>

can be proved by noticing that the partial derivatives of the [[residual sum of squares]] ({{math|RSS}}) with respect to ''β''<sub>0</sub> and ''β''<sub>1</sub> are equal to 0 in the least squares model, where

:<math>\text{RSS} = \sum_i (Y_i - \hat{Y}_i)^2.</math>

In the end, the equation can be written as

:<math>r(Y,\hat{Y})^2 = \frac{\text{SS}_\text{reg}}{\text{SS}_\text{tot}},</math>

where
*<math>\text{SS}_\text{reg} = \sum_i (\hat{Y}_i-\bar{Y})^2</math>
*<math>\text{SS}_\text{tot} = \sum_i (Y_i-\bar{Y})^2</math>.

The quantity <math>\text{SS}_\text{reg}</math> is called the regression sum of squares, also known as the [[explained sum of squares]], and <math>\text{SS}_\text{tot}</math> is the [[total sum of squares]] (proportional to the [[variance]] of the data).
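The identity can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the data and regression coefficients are arbitrary values chosen only for illustration. It fits a simple linear regression by ordinary least squares and compares the squared sample correlation between the observed and fitted responses with the ratio <math>\text{SS}_\text{reg}/\text{SS}_\text{tot}</math>.

<syntaxhighlight lang="python">
import numpy as np

# Illustrative data: arbitrary linear trend plus noise (values chosen for demonstration only).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)

# Ordinary least squares fit of y = beta0 + beta1 * x.
beta1, beta0 = np.polyfit(x, y, deg=1)
y_hat = beta0 + beta1 * x

# Squared sample correlation between observed and fitted values.
r_squared = np.corrcoef(y, y_hat)[0, 1] ** 2

# Coefficient of determination from the sums of squares.
ss_reg = np.sum((y_hat - y.mean()) ** 2)   # regression (explained) sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)       # total sum of squares

# The two printed values agree up to floating-point rounding.
print(r_squared)
print(ss_reg / ss_tot)
</syntaxhighlight>

Because the fitted values come from a least squares fit with an intercept, the residuals are uncorrelated with the fitted values, which is exactly the property used in the derivation above.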