==Inference==
Statistical inference based on Pearson's correlation coefficient often focuses on one of the following two aims:
* One aim is to test the [[null hypothesis]] that the true correlation coefficient ''ρ'' is equal to 0, based on the value of the sample correlation coefficient ''r''.
* The other aim is to derive a [[confidence interval]] that, on repeated sampling, has a given probability of containing ''ρ''.
Methods of achieving one or both of these aims are discussed below.

===Using a permutation test===
[[Permutation test]]s provide a direct approach to performing hypothesis tests and constructing confidence intervals. A permutation test for Pearson's correlation coefficient involves the following two steps:
# Using the original paired data (''x''<sub>''i''</sub>, ''y''<sub>''i''</sub>), randomly redefine the pairs to create a new data set (''x''<sub>''i''</sub>, ''y''<sub>''{{prime|i}}''</sub>), where the ''{{prime|i}}'' are a [[permutation]] of the set {1,...,''n''}. The permutation ''{{prime|i}}'' is selected randomly, with equal probabilities placed on all ''n''! possible permutations. This is equivalent to drawing the ''{{prime|i}}'' randomly without replacement from the set {1, ..., ''n''}. In [[Bootstrapping (statistics)|bootstrapping]], a closely related approach, the ''i'' and the ''{{prime|i}}'' are equal and drawn with replacement from {1, ..., ''n''}.
# Construct a correlation coefficient ''r'' from the randomized data.
To perform the permutation test, repeat steps (1) and (2) a large number of times. The [[p-value]] for the permutation test is the proportion of the ''r'' values generated in step (2) that are larger than the Pearson correlation coefficient that was calculated from the original data. Here "larger" can mean either that the value is larger in magnitude, or larger in signed value, depending on whether a [[two-tailed test|two-sided]] or [[one-sided test|one-sided]] test is desired.
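The two steps above can be sketched in Python (a minimal illustration with arbitrary variable names, not a reference implementation):

```python
import numpy as np

def permutation_pvalue(x, y, n_resamples=10_000, seed=0):
    """Two-sided permutation p-value for Pearson's r.

    Step 1: randomly permute the y-values (equivalent to drawing
    indices without replacement from {1, ..., n}); step 2: recompute r.
    The p-value is the proportion of permuted |r| values at least
    as large as the observed |r|.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r_obs = np.corrcoef(x, y)[0, 1]
    count = 0
    for _ in range(n_resamples):
        r_perm = np.corrcoef(x, rng.permutation(y))[0, 1]
        if abs(r_perm) >= abs(r_obs):
            count += 1
    return count / n_resamples
```

For strongly correlated data the observed |''r''| exceeds almost every permuted value, so the returned proportion is near zero; a one-sided version would compare signed values instead of magnitudes.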
===Using a bootstrap===
The [[bootstrapping (statistics)|bootstrap]] can be used to construct confidence intervals for Pearson's correlation coefficient. In the "non-parametric" bootstrap, ''n'' pairs (''x''<sub>''i''</sub>, ''y''<sub>''i''</sub>) are resampled "with replacement" from the observed set of ''n'' pairs, and the correlation coefficient ''r'' is calculated based on the resampled data. This process is repeated a large number of times, and the empirical distribution of the resampled ''r'' values is used to approximate the [[sampling distribution]] of the statistic. A 95% [[confidence interval]] for ''ρ'' can be defined as the interval spanning from the 2.5th to the 97.5th [[percentile]] of the resampled ''r'' values.

===Standard error===
If <math>x</math> and <math>y</math> are random variables with a simple linear relationship between them and an additive normal noise (i.e., ''y'' = ''a'' + ''bx'' + ''e''), then a [[standard error]] associated with the correlation is
:<math>\sigma_r = \sqrt{\frac{1-r^2}{n-2}}</math>
where <math>r</math> is the correlation and <math>n</math> the sample size.<ref>{{Cite journal|last=Bowley|first=A. L.|date=1928|title=The Standard Deviation of the Correlation Coefficient|url=https://www.jstor.org/stable/2277400|journal=Journal of the American Statistical Association|volume=23|issue=161|pages=31–34|doi=10.2307/2277400|jstor=2277400|issn=0162-1459}}</ref><ref>{{Cite web|title=Derivation of the standard error for Pearson's correlation coefficient|url=https://stats.stackexchange.com/q/226380 |access-date=2021-07-30|website=Cross Validated}}</ref>

===Testing using Student's ''t''-distribution===
[[File:Critical correlation vs. sample size.svg|thumb|324x324px|Critical values of Pearson's correlation coefficient that must be exceeded to be considered significantly nonzero at the 0.05 level]]
For pairs from an uncorrelated [[bivariate normal distribution]], the [[sampling distribution]] of the [[studentized]] Pearson's correlation coefficient follows [[Student's t-distribution|Student's ''t''-distribution]] with degrees of freedom ''n'' − 2. Specifically, if the underlying variables have a bivariate normal distribution, the variable
:<math>t = \frac{r}{\sigma_r} = r\sqrt{\frac{n-2}{1 - r^2}}</math>
has a Student's ''t''-distribution in the null case (zero correlation).<ref>Rahman, N. A. (1968) ''A Course in Theoretical Statistics'', Charles Griffin and Company, 1968</ref> This holds approximately for non-normal observed values if sample sizes are large enough.<ref>Kendall, M. G., Stuart, A. (1973) ''The Advanced Theory of Statistics, Volume 2: Inference and Relationship'', Griffin. {{isbn|0-85264-215-6}} (Section 31.19)</ref> For determining the critical values for ''r'' the inverse function is needed:
:<math>r = \frac{t}{\sqrt{n - 2 + t^2}}.</math>
Alternatively, large-sample asymptotic approaches can be used. Another early paper<ref>{{cite journal |last1=Soper |first1=H.E. |author-link=H. E. Soper |last2=Young |first2=A.W. |last3=Cave |first3=B.M. |last4=Lee |first4=A. |last5=Pearson |first5=K. |year=1917 |title=On the distribution of the correlation coefficient in small samples. Appendix II to the papers of "Student" and R.A. Fisher. A co-operative study |url=https://zenodo.org/record/1431587 |journal=[[Biometrika]] |volume=11 |issue=4 |pages=328–413 |doi=10.1093/biomet/11.4.328}}</ref> provides graphs and tables for general values of ''ρ'', for small sample sizes, and discusses computational approaches.
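The studentization and its inverse can be checked numerically with a stdlib-only sketch (the critical ''t'' itself would come from a ''t''-table; the value 2.228 used below is the standard two-sided 0.05 critical value for 10 degrees of freedom):

```python
import math

def t_from_r(r, n):
    """Studentized correlation: t = r * sqrt((n - 2) / (1 - r^2))."""
    return r * math.sqrt((n - 2) / (1 - r * r))

def r_from_t(t, n):
    """Inverse map, giving the critical r for a critical t:
    r = t / sqrt(n - 2 + t^2)."""
    return t / math.sqrt(n - 2 + t * t)
```

For example, with ''n'' = 12 (so 10 degrees of freedom) and critical ''t'' = 2.228, the correlation must exceed about 0.576 in magnitude to be declared significantly nonzero at the 0.05 level.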
In the case where the underlying variables are not normal, the sampling distribution of Pearson's correlation coefficient follows a Student's ''t''-distribution, but the degrees of freedom are reduced.<ref>{{cite journal |last1=Davey |first1=Catherine E. |last2=Grayden |first2=David B. |last3=Egan |first3=Gary F. |last4=Johnston |first4=Leigh A. |title=Filtering induces correlation in fMRI resting state data |journal=NeuroImage |date=January 2013 |volume=64 |pages=728–740 |doi=10.1016/j.neuroimage.2012.08.022 |pmid=22939874 |hdl=11343/44035 |s2cid=207184701 |hdl-access=free }}</ref>

===Using the exact distribution===
For data that follow a [[bivariate normal distribution]], the exact density function ''f''(''r'') for the sample correlation coefficient ''r'' is<ref>{{cite journal |last1=Hotelling |first1=Harold |title=New Light on the Correlation Coefficient and its Transforms |journal=Journal of the Royal Statistical Society |series=Series B (Methodological) |date=1953 |volume=15 |issue=2 |pages=193–232 |jstor=2983768 |doi=10.1111/j.2517-6161.1953.tb00135.x }}</ref><ref>{{cite book |author1=Kenney, J.F. |author2=Keeping, E.S. |title=Mathematics of Statistics |volume=Part 2 |edition=2nd |place=Princeton, NJ |publisher=Van Nostrand |year=1951}}</ref><ref>{{cite web |url=http://mathworld.wolfram.com/CorrelationCoefficientBivariateNormalDistribution.html |title=Correlation Coefficient—Bivariate Normal Distribution |first=Eric W. |last=Weisstein |website=Wolfram MathWorld}}</ref>
:<math>f(r) = \frac{(n - 2)\, \mathrm{\Gamma}(n - 1) \left(1 - \rho^2\right)^{\frac{n - 1}{2}} \left(1 - r^2\right)^{\frac{n - 4}{2}}}{\sqrt{2\pi}\, \operatorname{\Gamma}\mathord\left(n - \tfrac{1}{2}\right) (1 - \rho r)^{n - \frac{3}{2}}} {}_{2}\mathrm{F}_{1}\mathord\left(\tfrac{1}{2}, \tfrac{1}{2}; \tfrac{1}{2}(2n - 1); \tfrac{1}{2}(\rho r + 1)\right)</math>
where <math>\Gamma</math> is the [[gamma function]] and <math>{}_{2}\mathrm{F}_{1}(a,b;c;z)</math> is the [[hypergeometric function|Gaussian hypergeometric function]].

In the special case when <math>\rho = 0</math> (zero population correlation), the exact density function ''f''(''r'') can be written as
:<math>f(r) = \frac{\left( 1-r^2 \right)^{\frac{n - 4}{2}}}{\operatorname{\Beta}\mathord\left(\tfrac{1}{2}, \tfrac{n - 2}{2}\right)},</math>
where <math>\Beta</math> is the [[beta function]]. This is one way of writing the density of a Student's ''t''-distribution for a [[studentized]] sample correlation coefficient, as above.

===Using the Fisher transformation===
{{main|Fisher transformation}}
In practice, [[confidence intervals]] and [[hypothesis test]]s relating to ''ρ'' are usually carried out using a [[variance-stabilizing transformation]], the [[Fisher transformation]] <math>F</math>:
:<math>F(r) \equiv \tfrac{1}{2} \, \ln \left(\frac{1 + r}{1 - r}\right) = \operatorname{artanh}(r)</math>
''F''(''r'') approximately follows a [[normal distribution]] with
:<math>\text{mean} = F(\rho) = \operatorname{artanh}(\rho)</math>{{spaces|4}}and [[standard error]] <math>=\text{SE} = \frac{1}{\sqrt{n - 3}},</math>
where ''n'' is the sample size. The approximation error is lowest for large sample sizes <math>n</math> and small values of <math>r</math> and <math>\rho_0</math>, and increases otherwise.
Using the approximation, a [[standard score|z-score]] is
:<math>z = \frac{x - \text{mean}}{\text{SE}} = [F(r) - F(\rho_0)]\sqrt{n - 3}</math>
under the [[null hypothesis]] that <math>\rho = \rho_0</math>, given the assumption that the sample pairs are [[independent and identically distributed]] and follow a [[bivariate normal distribution]]. Thus an approximate [[p-value]] can be obtained from a normal probability table. For example, if ''z'' = 2.2 is observed and a two-sided p-value is desired to test the null hypothesis that <math>\rho = 0</math>, the p-value is {{nowrap|1=2Φ(−2.2) = 0.028}}, where Φ is the standard normal [[cumulative distribution function]].

To obtain a confidence interval for ''ρ'', we first compute a confidence interval for ''F''(''ρ''):
:<math>100(1 - \alpha)\%\text{CI}: \operatorname{artanh}(\rho) \in [\operatorname{artanh}(r) \pm z_{\alpha/2}\text{SE}]</math>
The inverse Fisher transformation brings the interval back to the correlation scale.
:<math>100(1 - \alpha)\%\text{CI}: \rho \in [\tanh(\operatorname{artanh}(r) - z_{\alpha/2}\text{SE}), \tanh(\operatorname{artanh}(r) + z_{\alpha/2}\text{SE})]</math>
For example, suppose we observe ''r'' = 0.7 with a sample size of ''n'' = 50, and we wish to obtain a 95% confidence interval for ''ρ''. The transformed value is <math display="inline">\operatorname{artanh}(r) = 0.8673</math>, so the confidence interval on the transformed scale is <math>0.8673 \pm \frac{1.96}{\sqrt{47}}</math>, or (0.5814, 1.1532). Converting back to the correlation scale yields (0.5237, 0.8188).
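The worked example above can be reproduced with a few lines of stdlib Python (a sketch; `math.atanh` is the transformation ''F'' and `math.tanh` its inverse):

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for rho via the Fisher
    transformation (default z_crit = 1.96 gives a 95% interval)."""
    f = math.atanh(r)                  # F(r) = artanh(r)
    se = 1 / math.sqrt(n - 3)          # standard error on the transformed scale
    lo = f - z_crit * se
    hi = f + z_crit * se
    return math.tanh(lo), math.tanh(hi)  # back to the correlation scale
```

Calling `fisher_ci(0.7, 50)` returns approximately (0.5237, 0.8188), matching the interval in the text; note the interval is asymmetric about ''r'' = 0.7 once transformed back.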