==Definition==
Pearson's correlation coefficient is the [[covariance]] of the two variables divided by the product of their standard deviations. The form of the definition involves a "product moment", that is, the mean (the first [[Moment (mathematics)|moment]] about the origin) of the product of the mean-adjusted random variables; hence the modifier ''product-moment'' in the name.{{verify source|date=February 2024}}

===For a population===
Pearson's correlation coefficient, when applied to a [[statistical population|population]], is commonly represented by the Greek letter ''ρ'' (rho) and may be referred to as the ''population correlation coefficient'' or the ''population Pearson correlation coefficient''. Given a pair of random variables <math>(X,Y)</math> (for example, Height and Weight), the formula for ''ρ''<ref name="RealCorBasic">Real Statistics Using Excel, "[http://www.real-statistics.com/correlation/basic-concepts-correlation/ Basic Concepts of Correlation]", retrieved 22 February 2015.</ref> is<ref>{{Cite web|last=Weisstein|first=Eric W.|title=Statistical Correlation|url=https://mathworld.wolfram.com/StatisticalCorrelation.html|access-date=2020-08-22|website=Wolfram MathWorld|language=en}}</ref>

<math display=block> \rho_{X,Y}= \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y}</math>

where
*<math> \operatorname{cov} </math> is the [[covariance]]
*<math> \sigma_X </math> is the [[standard deviation]] of  <math> X </math>
*<math> \sigma_Y </math> is the standard deviation of  <math> Y </math>.

The formula for <math>\operatorname{cov}(X,Y)</math> can be expressed in terms of [[mean]] and [[Expected Value|expectation]]. Since<ref name="RealCorBasic"/>

:<math>\operatorname{cov}(X,Y) = \operatorname\mathbb{E}[(X-\mu_X)(Y-\mu_Y)],</math>

the formula for <math>\rho</math> can also be written as

<math display=block> \rho_{X,Y} = \frac{\operatorname\mathbb{E}[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X\sigma_Y}</math>

where
*<math> \sigma_Y </math> and <math> \sigma_X </math> are defined as above
*<math> \mu_X </math> is the mean of <math> X </math>
*<math> \mu_Y </math> is the mean of <math> Y </math>
*<math> \operatorname\mathbb{E} </math> is the expectation.

The formula for <math>\rho</math> can be expressed in terms of uncentered moments.  Since

:<math>\begin{align}
       \mu_X ={} &\operatorname\mathbb{E}[X] \\
       \mu_Y ={} &\operatorname\mathbb{E}[Y] \\
  \sigma_X^2 ={} &\operatorname\mathbb{E}\left[\left(X - \operatorname\mathbb{E}[X]\right)^2\right] = \operatorname\mathbb{E}\left[X^2\right] - \left(\operatorname\mathbb{E}[X]\right)^2 \\
   \sigma_Y^2 ={} &\operatorname\mathbb{E}\left[\left(Y - \operatorname\mathbb{E}[Y]\right)^2\right] = \operatorname\mathbb{E}\left[Y^2\right] - \left(\operatorname\mathbb{E}[Y]\right)^2 \\
\operatorname{cov}(X,Y) ={} &\operatorname\mathbb{E}[\left(X - \mu_X\right)\left(Y - \mu_Y\right)] = \operatorname\mathbb{E}[\left(X - \operatorname\mathbb{E}[X]\right)\left(Y - \operatorname\mathbb{E}[Y]\right)] = \operatorname\mathbb{E}[XY] - \operatorname\mathbb{E}[X]\operatorname\mathbb{E}[Y] ,
\end{align}</math>

the formula for <math>\rho</math> can also be written as
<math display="block">\rho_{X,Y} =
  \frac{\operatorname\mathbb{E}[XY] - \operatorname\mathbb{E}[X]\operatorname\mathbb{E}[Y]}{\sqrt{\operatorname\mathbb{E}\left[X^2\right] - \left(\operatorname\mathbb{E}[X] \right)^2} ~ \sqrt{\operatorname\mathbb{E}\left[Y^2\right] - \left(\operatorname\mathbb{E}[Y] \right)^2}}.</math>
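
As an illustration of the uncentered-moment form, the following minimal Python sketch evaluates <math>\rho_{X,Y}</math> for a small discrete joint distribution. The probability table <code>joint_pmf</code> and the helper names are hypothetical and chosen only for this example.

```python
from math import sqrt

# Hypothetical joint probability mass function: {(x, y): probability}.
joint_pmf = {
    (0, 0): 0.4,
    (0, 1): 0.1,
    (1, 0): 0.1,
    (1, 1): 0.4,
}


def expectation(pmf, g):
    """Return E[g(X, Y)] for a discrete joint distribution."""
    return sum(p * g(x, y) for (x, y), p in pmf.items())


def population_correlation(pmf):
    """Compute rho from the uncentered moments E[X], E[Y], E[X^2], E[Y^2], E[XY]."""
    e_x = expectation(pmf, lambda x, y: x)
    e_y = expectation(pmf, lambda x, y: y)
    e_xx = expectation(pmf, lambda x, y: x * x)
    e_yy = expectation(pmf, lambda x, y: y * y)
    e_xy = expectation(pmf, lambda x, y: x * y)
    covariance = e_xy - e_x * e_y
    return covariance / (sqrt(e_xx - e_x ** 2) * sqrt(e_yy - e_y ** 2))


rho = population_correlation(joint_pmf)
print(rho)
```

For this table, <math>\operatorname\mathbb{E}[XY] = 0.4</math>, both means equal 0.5 and both variances equal 0.25, so the result is approximately 0.6.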

===For a sample===
Pearson's correlation coefficient, when applied to a [[sample (statistics)|sample]], is commonly represented by <math>r_{xy}</math> and may be referred to as the ''sample correlation coefficient'' or the ''sample Pearson correlation coefficient''. A formula for <math>r_{xy}</math> is obtained by substituting sample estimates of the covariance and variances into the population formula above. Given paired data <math>\left\{ (x_1,y_1),\ldots,(x_n,y_n) \right\}</math> consisting of <math>n</math> pairs, <math>r_{xy}</math> is defined as

<math display=block>r_{xy} =\frac{\sum ^n _{i=1}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum ^n _{i=1}(x_i - \bar{x})^2} \sqrt{\sum ^n _{i=1}(y_i - \bar{y})^2}}</math>

where
*<math>n</math> is the sample size
*<math>x_i, y_i</math> are the individual sample points indexed with ''i''
*<math display="inline">\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i</math> (the sample mean); and analogously for <math>\bar{y}</math>.
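
The definition can be sketched directly in Python as a two-pass computation (means first, then deviations). The function name <code>pearson_r</code> and the data are illustrative assumptions, not part of any particular library.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample Pearson correlation computed from deviations about the sample means."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("at least two pairs are required")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dev_x = [x - mean_x for x in xs]
    dev_y = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dev_x, dev_y))
    denominator = sqrt(sum(a * a for a in dev_x)) * sqrt(sum(b * b for b in dev_y))
    if denominator == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return numerator / denominator


# Hypothetical paired data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
r = pearson_r(xs, ys)
print(r)
```

Here the sum of products of deviations is 6 and the sums of squared deviations are 10 and 6, so <math>r_{xy} = 6/\sqrt{60} \approx 0.775</math>.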

Rearranging gives the following formula<ref name="RealCorBasic"/> for <math>r_{xy}</math>:

:<math>r_{xy} = \frac{\sum_i x_i y_i-n\bar{x}\bar{y}}
{\sqrt{\sum_i x_i^2-n\bar{x}^2}~\sqrt{\sum_i y_i^2-n\bar{y}^2}},</math>

where <math>n, x_i, y_i, \bar{x}, \bar{y}</math> are defined as above.

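Multiplying the numerator and denominator by <math>n</math>, and using <math display="inline">n\bar{x} = \sum x_i</math> and <math display="inline">n\bar{y} = \sum y_i</math>, gives this formula for <math>r_{xy}</math>: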

:<math>r_{xy} = \frac{n\sum x_i y_i - \sum x_i\sum y_i}
{\sqrt{n\sum x_i^2-\left(\sum x_i\right)^2}~\sqrt{n\sum y_i^2-\left(\sum y_i\right)^2}},</math>

where <math>n, x_i, y_i</math> are defined as above.

This formula suggests a convenient single-pass algorithm for calculating sample correlations, though depending on the numbers involved, it can sometimes be [[numerical stability|numerically unstable]].
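
A minimal Python sketch of such a single-pass computation is shown below. It accumulates the five running sums while reading each pair once; the function name <code>pearson_r_single_pass</code> and the data are assumptions made for illustration.

```python
from math import sqrt


def pearson_r_single_pass(pairs):
    """Sample Pearson correlation from running sums, reading each pair once."""
    n = 0
    sum_x = sum_y = sum_xx = sum_yy = sum_xy = 0.0
    for x, y in pairs:
        n += 1
        sum_x += x
        sum_y += y
        sum_xx += x * x
        sum_yy += y * y
        sum_xy += x * y
    numerator = n * sum_xy - sum_x * sum_y
    # These differences can lose precision when the means are large
    # relative to the spread of the data.
    denominator = sqrt(n * sum_xx - sum_x ** 2) * sqrt(n * sum_yy - sum_y ** 2)
    return numerator / denominator


# Hypothetical paired data, supplied as an iterator to emphasise the single pass.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
r_single = pearson_r_single_pass(zip(xs, ys))
print(r_single)
```

The instability arises because each term such as <math display="inline">n\sum x_i^2-\left(\sum x_i\right)^2</math> subtracts two quantities that may be nearly equal. When precision matters, computing deviations from the means in a separate pass is generally the safer choice.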

An equivalent expression gives <math>r_{xy}</math> as the average of the products of the [[standard score]]s, with <math>n-1</math> rather than <math>n</math> as the divisor:

:<math>r_{xy} = \frac{1}{n-1} \sum ^n _{i=1} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)</math>

where
*<math>n, x_i, y_i, \bar{x}, \bar{y}</math> are defined as above, and <math>s_x, s_y</math> are defined below
*<math display="inline">\left( \frac{x_i - \bar{x}}{s_x} \right)</math> is the standard score (and analogously for the standard score of <math>y</math>).

Other equivalent formulae for <math>r_{xy}</math> exist, for example:

:<math>r_{xy} =\frac{\sum x_iy_i-n \bar{x} \bar{y}}{(n-1) s_x s_y}</math>
where
*<math>n, x_i, y_i, \bar{x}, \bar{y}</math> are defined as above and:
*<math display="inline">s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^n(x_i-\bar{x})^2}</math> (the [[sample standard deviation]]); and analogously for <math>s_y</math>.
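
The two expressions involving <math>s_x</math> and <math>s_y</math> can be checked against each other with a short Python sketch. It relies on <code>statistics.stdev</code> from the standard library, which uses the <math>n-1</math> divisor; the function names and data are illustrative assumptions.

```python
from statistics import mean, stdev


def pearson_r_standard_scores(xs, ys):
    """r as the sum of products of standard scores, divided by n - 1."""
    n = len(xs)
    mean_x, mean_y = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    z_x = [(x - mean_x) / s_x for x in xs]
    z_y = [(y - mean_y) / s_y for y in ys]
    return sum(a * b for a, b in zip(z_x, z_y)) / (n - 1)


def pearson_r_alternative(xs, ys):
    """r as (sum of x_i * y_i - n * mean_x * mean_y) / ((n - 1) * s_x * s_y)."""
    n = len(xs)
    mean_x, mean_y = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return (sum_xy - n * mean_x * mean_y) / ((n - 1) * s_x * s_y)


# Hypothetical paired data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
r_scores = pearson_r_standard_scores(xs, ys)
r_alt = pearson_r_alternative(xs, ys)
print(r_scores, r_alt)
```

Both functions should agree up to floating-point rounding, since the two formulae are algebraically equivalent.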

===For jointly Gaussian distributions===
If <math>(X, Y)</math> is [[Joint probability distribution|jointly]] [[Gaussian distribution|Gaussian]], with mean zero and [[covariance matrix]] <math>\Sigma</math>, then <math>\Sigma = \begin{bmatrix}
\sigma_X^2 & \rho_{X,Y}\sigma_X\sigma_Y \\
\rho_{X,Y}\sigma_X\sigma_Y & \sigma_Y^2 \\
\end{bmatrix}</math>.
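
The relationship between <math>\Sigma</math> and <math>\rho_{X,Y}</math> can be illustrated with a small simulation. The Python sketch below builds the matrix from assumed values of <math>\sigma_X</math>, <math>\sigma_Y</math> and <math>\rho_{X,Y}</math>, then draws zero-mean pairs by mixing two independent standard normal variates (one common construction, used here as an assumption) and estimates the sample correlation. All names and parameter values are hypothetical.

```python
import random
from math import sqrt


def covariance_matrix(sigma_x, sigma_y, rho):
    """Build the 2x2 matrix Sigma from the standard deviations and rho."""
    off_diagonal = rho * sigma_x * sigma_y
    return [[sigma_x ** 2, off_diagonal],
            [off_diagonal, sigma_y ** 2]]


def sample_bivariate_gaussian(sigma_x, sigma_y, rho, n, seed=0):
    """Draw n zero-mean pairs whose covariance matrix is Sigma."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        x = sigma_x * z1
        # Mixing z1 and z2 gives Var(y) = sigma_y^2 and Cov(x, y) = rho * sigma_x * sigma_y.
        y = sigma_y * (rho * z1 + sqrt(1.0 - rho * rho) * z2)
        pairs.append((x, y))
    return pairs


def sample_correlation(pairs):
    """Sample Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy)


sigma = covariance_matrix(2.0, 3.0, 0.5)
pairs = sample_bivariate_gaussian(2.0, 3.0, 0.5, n=20000)
r = sample_correlation(pairs)
print(sigma, r)
```

With these assumed parameters, the off-diagonal entries of <math>\Sigma</math> equal <math>0.5 \times 2 \times 3 = 3</math>, and the sample correlation should lie close to 0.5 for a sample of this size.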

===Practical issues===
Under heavy noise, estimating the correlation coefficient between two sets of [[Random variables|stochastic variables]] is nontrivial, particularly when [[Canonical Correlation Analysis]] reports degraded correlation values because of the noise contributions. A generalization of the approach is given by Moriya.<ref>{{cite book |first= N. |last=Moriya |year=2008 |contribution=Noise-related multivariate optimal joint-analysis in longitudinal stochastic processes  |pages=[https://books.google.com/books?id=4XvRgF0QfqkC&pg=PA223 223–260] |editor=Yang, Fengshan  |title=[[Progress in Applied Mathematical Modeling]] |publisher=[[Nova Science Publishers, Inc.]] |isbn=978-1-60021-976-4 }}</ref>

For data with missing values, Garren derived the [[maximum likelihood]] estimator of the correlation coefficient in a bivariate normal model.<ref>{{cite journal |last=Garren |first=Steven T. |date=15 June 1998 |title=Maximum likelihood estimation of the correlation coefficient in a bivariate normal model, with missing data |journal=Statistics & Probability Letters |volume=38 |issue=3 |pages=281–288 |doi=10.1016/S0167-7152(98)00035-2 }}</ref>

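Some distributions (e.g., [[stable distribution]]s other than a [[normal distribution]]) do not have a defined variance, so the Pearson correlation coefficient, which divides by the standard deviations, is not defined for them.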