==Sensitivity to the data distribution==
{{Further|Correlation and dependence#Sensitivity to the data distribution}}

===Existence===
The population Pearson correlation coefficient is defined in terms of [[moment (mathematics)|moments]], and therefore exists for any bivariate [[probability distribution]] for which the [[statistical population|population]] [[covariance]] is defined and the [[marginal distribution|marginal]] [[population variance]]s are defined and are non-zero. Some probability distributions, such as the [[Cauchy distribution]], have undefined variance, and hence ''ρ'' is not defined if ''X'' or ''Y'' follows such a distribution. In some practical applications, such as those involving data suspected to follow a [[heavy-tailed distribution]], this is an important consideration. However, the existence of the correlation coefficient is usually not a concern; for instance, if the range of the distribution is bounded, ''ρ'' is always defined.

===Sample size===
*If the sample size is moderate or large and the population is bivariate [[normal distribution|normal]], then the sample correlation coefficient is the [[maximum likelihood estimate]] of the population correlation coefficient, and is [[asymptotic distribution|asymptotically]] [[bias of an estimator|unbiased]] and [[efficiency (statistics)|efficient]], which roughly means that it is impossible to construct a more accurate estimate than the sample correlation coefficient.
*If the sample size is large and the population is not normal, then the sample correlation coefficient remains approximately unbiased, but may not be efficient.
*If the sample size is large, then the sample correlation coefficient is a [[consistent estimator]] of the population correlation coefficient as long as the sample means, variances, and covariance are consistent (which is guaranteed when the [[law of large numbers]] can be applied).
*If the sample size is small, then the sample correlation coefficient ''r'' is not an unbiased estimate of ''ρ''.<ref name="RealCorBasic"/> The adjusted correlation coefficient must be used instead: see elsewhere in this article for the definition.
*Correlations can be different for imbalanced [[dichotomous variable|dichotomous]] data when there is variance error in the sample.<ref>{{cite journal |last1=Lai |first1=Chun Sing |last2=Tao |first2=Yingshan |last3=Xu |first3=Fangyuan |last4=Ng |first4=Wing W.Y. |last5=Jia |first5=Youwei |last6=Yuan |first6=Haoliang |last7=Huang |first7=Chao |last8=Lai |first8=Loi Lei |last9=Xu |first9=Zhao |last10=Locatelli |first10=Giorgio |title=A robust correlation analysis framework for imbalanced and dichotomous data with uncertainty |journal=Information Sciences |date=January 2019 |volume=470 |pages=58–77 |doi=10.1016/j.ins.2018.08.017 |s2cid=52878443 |url=http://eprints.whiterose.ac.uk/134706/2/ELSEVI_3.pdf }}</ref>

===Robustness===
Like many commonly used statistics, the sample [[statistic]] ''r'' is not [[robust statistics|robust]],<ref name="wilcox">{{Cite book| title=Introduction to robust estimation and hypothesis testing | last = Wilcox | first = Rand R. | publisher= Academic Press | year=2005}}</ref> so its value can be misleading if [[outlier]]s are present.<ref>{{Cite journal |title=Robust estimation and outlier detection with correlation coefficients |author1=Devlin, Susan J. |author1-link=Susan J. Devlin |author2=Gnanadesikan, R. |author3=Kettenring, J.R. |journal=Biometrika |volume=62 |issue=3 |year=1975 |pages=531–545 |doi=10.1093/biomet/62.3.531 |jstor=2335508}}</ref><ref>{{Cite book| title=Robust Statistics | last = Huber | first = Peter J.| publisher= Wiley | year=2004}}{{Page needed|date=September 2010}}</ref> Specifically, the PMCC is neither distributionally robust,<ref>{{Cite book |last=Vaart |first=A. W. van der |url=http://dx.doi.org/10.1017/cbo9780511802256 |title=Asymptotic Statistics |date=1998-10-13 |publisher=Cambridge University Press |doi=10.1017/cbo9780511802256 |isbn=978-0-511-80225-6}}</ref> nor outlier resistant<ref name="wilcox"/> (see ''{{section link|Robust statistics#Definition}}''). Inspection of the [[scatterplot]] between ''X'' and ''Y'' will typically reveal a situation where lack of robustness might be an issue, and in such cases it may be advisable to use a robust measure of association. Note, however, that while most robust estimators of association measure [[statistical dependence]] in some way, they are generally not interpretable on the same scale as the Pearson correlation coefficient.

Statistical inference for Pearson's correlation coefficient is sensitive to the data distribution. Exact tests, and asymptotic tests based on the [[Fisher transformation]], can be applied if the data are approximately normally distributed, but may be misleading otherwise. In some situations, the [[bootstrapping (statistics)|bootstrap]] can be applied to construct confidence intervals, and [[permutation test]]s can be applied to carry out hypothesis tests. These [[non-parametric statistics|non-parametric]] approaches may give more meaningful results in some situations where bivariate normality does not hold. However, the standard versions of these approaches rely on [[exchangeable random variables|exchangeability]] of the data, meaning that there is no ordering or grouping of the data pairs being analyzed that might affect the behavior of the correlation estimate.

A stratified analysis is one way to either accommodate a lack of bivariate normality, or to isolate the correlation resulting from one factor while controlling for another. If ''W'' represents cluster membership or another factor that it is desirable to control, we can [[Stratified sampling|stratify]] the data based on the value of ''W'', then calculate a correlation coefficient within each stratum.
The stratum-level estimates can then be combined to estimate the overall correlation while controlling for ''W''.<ref>Katz, Mitchell H. (2006) ''Multivariable Analysis – A Practical Guide for Clinicians''. 2nd Edition. Cambridge University Press. {{isbn|978-0-521-54985-1}}. {{isbn|0-521-54985-X}}</ref>
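
The stratified procedure described above can be sketched in a few lines of Python. The text does not say how the stratum-level estimates are combined, so this sketch assumes one common choice: averaging the within-stratum coefficients on the Fisher ''z'' scale with weights ''n'' − 3 and transforming back. The function names <code>pearson_r</code> and <code>stratified_r</code> and the data are hypothetical and serve only as illustration.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def stratified_r(x, y, w):
    """Pool within-stratum correlations on the Fisher z scale.

    Assumed pooling rule (not specified in the text): weight each
    stratum's z = atanh(r) by n - 3, then back-transform with tanh.
    """
    strata = {}
    for a, b, label in zip(x, y, w):
        xs, ys = strata.setdefault(label, ([], []))
        xs.append(a)
        ys.append(b)
    weighted_sum = 0.0
    total_weight = 0.0
    for xs, ys in strata.values():
        n = len(xs)
        if n < 4:
            continue  # the weight n - 3 would not be positive
        z = math.atanh(pearson_r(xs, ys))  # raises ValueError if |r| is exactly 1
        weighted_sum += (n - 3) * z
        total_weight += n - 3
    return math.tanh(weighted_sum / total_weight)


# Hypothetical data: within each stratum Y falls as X rises, but
# stratum "b" sits higher on both variables than stratum "a".
x = [1, 2, 3, 4, 5, 11, 12, 13, 14, 15]
y = [5, 4, 4, 2, 1, 15, 14, 14, 12, 11]
w = ["a"] * 5 + ["b"] * 5

print(round(pearson_r(x, y), 3))        # unstratified estimate: positive
print(round(stratified_r(x, y, w), 3))  # controlling for W: negative
```

In this made-up data set the unstratified coefficient is positive only because the two strata differ in level; once ''W'' is controlled for, the pooled estimate reflects the negative within-stratum relationship.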
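
The bootstrap and permutation approaches mentioned under Robustness can be illustrated in the same way. The following is a minimal sketch, assuming exchangeable data pairs, a percentile bootstrap interval, and a two-sided permutation test of no association; the function names, the number of resamples, the seed, and the data are all assumptions made for the example rather than recommendations.

```python
import math
import random


def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for r, resampling (x, y) pairs."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue  # r is undefined when a resample has zero variance
        estimates.append(pearson_r(xs, ys))
    estimates.sort()
    low = estimates[int((alpha / 2) * len(estimates))]
    high = estimates[int((1 - alpha / 2) * len(estimates)) - 1]
    return low, high


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the hypothesis of no association."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    y_perm = list(y)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break the pairing between x and y
        if abs(pearson_r(x, y_perm)) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)  # add-one rule keeps p above zero


# Hypothetical sample of 12 pairs with a strong positive relationship.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
y = [2.1, 1.9, 3.6, 3.2, 5.5, 5.1, 7.4, 6.6, 9.3, 8.8, 11.9, 10.7]

ci_low, ci_high = bootstrap_ci(x, y)
p_value = permutation_p_value(x, y)
print(ci_low, ci_high, p_value)
```

Both procedures resample or reshuffle whole observations, which is why they depend on exchangeability: if the pairs are ordered in time or grouped into clusters, a plain shuffle does not reproduce the structure of the data.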
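
The lack of outlier resistance noted under Robustness can also be seen in a small numerical example. This sketch uses invented data and compares ''r'' before and after adding a single aberrant point. It also computes a rank-based coefficient (Spearman's) as one example of an alternative measure of association; that choice is an assumption made for the illustration, and the helper <code>ranks</code> does not handle ties.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks 1..n of the values; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman_rho(x, y):
    """Rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))


# Hypothetical data lying close to a straight line.
x_clean = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y_clean = [1.2, 1.9, 3.1, 4.2, 4.8, 6.1, 7.0, 8.2, 8.9, 10.1]

# The same data with one aberrant observation appended.
x_out = x_clean + [11]
y_out = y_clean + [-20.0]

print(round(pearson_r(x_clean, y_clean), 3))  # close to 1
print(round(pearson_r(x_out, y_out), 3))      # negative: the sign flips
print(round(spearman_rho(x_out, y_out), 3))   # lower than before, but still positive
```

A single point out of eleven is enough to reverse the sign of ''r'' here, whereas the rank-based coefficient is reduced but stays positive. As the text notes, the two coefficients are not on the same scale, so their values should not be compared directly.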