{{Short description|Linear dependency situation in a regression model}}
{{redirect-distinguish|Collinearity (statistics)|Collinearity (geometry)}}
{{morefootnotes|date=January 2024}}
{{Use dmy dates|date=March 2020}}
{{unbalanced|date=October 2024}}

In [[statistics]], '''multicollinearity''' or '''collinearity''' is a situation where the [[Independent variable|predictors]] in a [[Regression analysis|regression model]] are [[Linear independence|linearly dependent]].

'''Perfect multicollinearity''' refers to a situation where the [[Independent variable|predictive variables]] have an ''exact'' linear relationship. When there is perfect collinearity, the [[design matrix]] <math>X</math> has less than full [[Rank (linear algebra)|rank]], and therefore the [[moment matrix]] <math>X^{\mathsf{T}}X</math> cannot be [[Matrix inversion|inverted]]. In this situation, the [[Regression coefficient|parameter estimates]] of the regression are not well-defined, as the system of equations has [[Underdetermined system|infinitely many solutions]].

'''Imperfect multicollinearity''' refers to a situation where the [[Independent variable|predictive variables]] have a ''nearly'' exact linear relationship.

Contrary to popular belief, neither the [[Gauss–Markov theorem]] nor the more common [[Maximum likelihood estimation|maximum likelihood]] justification for [[ordinary least squares]] relies on any kind of correlation structure between dependent predictors<ref name=":3">{{cite book |last=Gujarati |first=Damodar |url=https://archive.org/details/basiceconometric05edguja |title=Basic Econometrics |publisher=McGraw-Hill |year=2009 |isbn=9780073375779 |edition=4th |pages=[https://archive.org/details/basiceconometric05edguja/page/363 363] |chapter=Multicollinearity: what happens if the regressors are correlated? |author-link=Damodar N. Gujarati |url-access=registration}}</ref><ref name=":6">{{Cite journal |last1=Kalnins |first1=Arturs |last2=Praitis Hill |first2=Kendall |date=2023-12-13 |title=The VIF Score. What is it Good For? Absolutely Nothing |url=http://journals.sagepub.com/doi/10.1177/10944281231216381 |journal=Organizational Research Methods |volume=28 |pages=58–75 |language=en |doi=10.1177/10944281231216381 |issn=1094-4281 |url-access=subscription}}</ref><ref name=":5">{{Cite journal |last=Leamer |first=Edward E. |date=1973 |title=Multicollinearity: A Bayesian Interpretation |url=https://www.jstor.org/stable/1927962 |journal=The Review of Economics and Statistics |volume=55 |issue=3 |pages=371–380 |doi=10.2307/1927962 |jstor=1927962 |issn=0034-6535 |url-access=subscription}}</ref> (although perfect collinearity can cause problems with some software). There is no justification for the practice of removing collinear variables as part of regression analysis,<ref name=":3" /><ref name=":0">{{Cite web |last=Giles |first=Dave |date=2011-09-15 |title=Econometrics Beat: Dave Giles' Blog: Micronumerosity |url=https://davegiles.blogspot.com/2011/09/micronumerosity.html |access-date=2023-09-03 |website=Econometrics Beat}}</ref><ref>{{Cite book |last=Goldberger |first=A.S. |title=Econometric Theory |publisher=Wiley |year=1964 |location=New York}}</ref><ref name=":1">{{Cite book |last=Goldberger |first=A.S. |title=A Course in Econometrics |publisher=Harvard University Press |location=Cambridge MA |chapter=Chapter 23.3}}</ref><ref name=":2">{{Cite journal |last=Blanchard |first=Olivier Jean |date=October 1987 |title=Comment |url=http://www.tandfonline.com/doi/abs/10.1080/07350015.1987.10509611 |journal=Journal of Business & Economic Statistics |language=en |volume=5 |issue=4 |pages=449–451 |doi=10.1080/07350015.1987.10509611 |issn=0735-0015 |url-access=subscription}}</ref> and doing so may constitute [[scientific misconduct]]. Including collinear variables does not reduce the predictive power or [[Reliability (statistics)|reliability]] of the model as a whole,<ref name=":1" /> and does not reduce the accuracy of coefficient estimates.<ref name=":3" />

High collinearity indicates that it is exceptionally important to include all collinear variables, as excluding any will cause worse coefficient estimates, strong [[confounding]], and downward-biased estimates of [[standard error]]s.<ref name=":6" /> The [[variance inflation factor]] can be used to identify which predictor variables in a dataset are collinear.

==Perfect multicollinearity==
{{stack|
[[File:Multicollinearity.jpg|thumb|357x357px|A depiction of multicollinearity.]]
[[File:Effect_of_multicollinearity_on_coefficients_of_linear_model.png|thumb|In a linear regression, the true parameters are <math>a_1 = 2, a_2 = 4</math>, which are reliably estimated when <math>X_1</math> and <math>X_2</math> are uncorrelated (black case) but unreliably estimated when <math>X_1</math> and <math>X_2</math> are correlated (red case).]]
}}
Perfect multicollinearity refers to a situation where the predictors are [[Linear dependence|linearly dependent]] (one can be written as an exact linear function of the others).<ref>{{cite book |last1=James |first1=Gareth |last2=Witten |first2=Daniela |last3=Hastie |first3=Trevor |last4=Tibshirani |first4=Robert |title=An introduction to statistical learning: with applications in R |date=2021 |publisher=Springer |location=New York, NY |isbn=978-1-0716-1418-1 |page=115 |edition=Second |doi=10.1007/978-1-0716-1418-1 |url=https://link.springer.com/book/10.1007/978-1-0716-1418-1 |access-date=1 November 2024 |language=en}}</ref> [[Ordinary least squares]] requires inverting the matrix <math>X^{\mathsf{T}}X</math>, where
: <math> X = \begin{bmatrix} 1 & X_{11} & \cdots & X_{k1} \\ \vdots & \vdots & & \vdots \\ 1 & X_{1N} & \cdots & X_{kN} \end{bmatrix}</math>
is an <math> N \times (k+1) </math> matrix, where <math> N </math> is the number of observations, <math> k </math> is the number of explanatory variables, and <math> N \ge k+1 </math>. If there is an exact linear relationship among the independent variables, then at least one of the columns of <math> X </math> is a linear combination of the others, so the [[Rank (linear algebra)|rank]] of <math> X </math> (and therefore of <math>X^{\mathsf{T}}X</math>) is less than <math> k+1 </math>, and the matrix <math>X^{\mathsf{T}}X</math> will not be invertible.

=== Resolution ===
Perfect collinearity is typically caused by including redundant variables in a regression. For example, a dataset may include variables for income, expenses, and savings. However, because income is equal to expenses plus savings by definition, it is incorrect to include all three variables in a regression simultaneously.
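
For illustration, the following minimal [[Python (programming language)|Python]] sketch (using [[NumPy]], with simulated figures for income, expenses, and savings) shows how such an exact linear relationship leaves the design matrix rank-deficient, so that <math>X^{\mathsf{T}}X</math> cannot be meaningfully inverted:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n = 100
expenses = rng.normal(4000, 500, size=n)
savings = rng.normal(1000, 300, size=n)
income = expenses + savings                 # exact linear relationship by construction

X = np.column_stack([np.ones(n), income, expenses, savings])
print(np.linalg.matrix_rank(X))             # 3, not 4: one column is redundant
print(np.linalg.cond(X.T @ X))              # enormous condition number (numerically singular)
# np.linalg.inv(X.T @ X) would either raise LinAlgError or return meaningless values,
# so the OLS coefficients are not uniquely determined.
</syntaxhighlight>

Omitting any one of the three redundant columns restores full rank and makes the coefficients estimable again.
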

Similarly, including a [[Dummy variable (statistics)|dummy variable]] for every category (e.g., summer, autumn, winter, and spring) as well as an intercept term will result in perfect collinearity. This is known as the dummy variable trap.<ref>{{Cite web |title=Dummy Variable Trap - What is the Dummy Variable Trap? |work=LearnDataSci (www.learndatasci.com) |url=https://www.learndatasci.com/glossary/dummy-variable-trap/ |access-date=2024-01-18 |first=Fatih |last=Karabiber}}</ref>

The other common cause of perfect collinearity is attempting to use [[ordinary least squares]] when working with very wide datasets (those with more variables than observations). These require more advanced data analysis techniques like [[Hierarchical linear model|Bayesian hierarchical modeling]] to produce meaningful results.{{fact|date=March 2024}}

== Numerical issues ==
Sometimes, the variables <math> X_j </math> are nearly collinear. In this case, the matrix <math>X^{\mathsf{T}}X</math> has an inverse, but it is [[ill-conditioned]]. A computer algorithm may or may not be able to compute an approximate inverse; even if it can, the resulting inverse may have large [[rounding error]]s.

The standard measure of [[Condition number|ill-conditioning]] in a matrix is the condition number. It determines whether the inversion of the matrix is numerically unstable with finite-precision numbers, indicating the potential sensitivity of the computed inverse to small changes in the original matrix. The condition number is computed by dividing the maximum [[singular value]] of the [[design matrix]] by its minimum singular value.<ref name="Belsley19912">{{cite book |last=Belsley |first=David |url=https://archive.org/details/conditioningdiag0000bels |title=Conditioning Diagnostics: Collinearity and Weak Data in Regression |publisher=Wiley |year=1991 |isbn=978-0-471-52889-0 |location=New York |url-access=registration}}</ref> In the context of collinear variables, the [[variance inflation factor]] is the condition number for a particular coefficient.
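
As an illustrative sketch (simulated data, using [[NumPy]]), the condition number can be computed directly from the singular values of the design matrix; a nearly collinear predictor makes it much larger than an independent one:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)

def condition_number(x2):
    """Largest singular value of the design matrix [1, x1, x2]
    divided by its smallest singular value (same as np.linalg.cond)."""
    X = np.column_stack([np.ones(n), x1, x2])
    s = np.linalg.svd(X, compute_uv=False)   # singular values, in descending order
    return s[0] / s[-1]

print(condition_number(rng.normal(size=n)))               # independent predictor: modest value
print(condition_number(x1 + 0.001 * rng.normal(size=n)))  # nearly collinear: very large value
</syntaxhighlight>
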
=== Solutions ===
Numerical problems in estimating can be solved by applying standard techniques from [[linear algebra]] to estimate the equations more precisely:
# [[Standard score|'''Standardizing''']] '''predictor variables.''' Working with polynomial terms (e.g. <math>x_1</math>, <math>x_1^2</math>) or interaction terms (e.g. <math>x_1 \times x_2</math>) can cause multicollinearity. This is especially true when the variable in question has a limited range. Standardizing predictor variables will eliminate this special kind of multicollinearity for polynomials of up to 3rd order.<ref>{{Cite web |title=12.6 - Reducing Structural Multicollinearity {{!}} STAT 501 |url=https://newonlinecourses.science.psu.edu/stat501/lesson/12/12.6 |access-date=2019-03-16 |website=newonlinecourses.science.psu.edu}}</ref>
#* For higher-order polynomials, an [[Orthogonal polynomials|orthogonal polynomial]] representation will generally fix any collinearity problems.<ref name=":4">{{Cite web |title=Computational Tricks with Turing (Non-Centered Parametrization and QR Decomposition) |url=https://storopoli.io/Bayesian-Julia/pages/12_Turing_tricks/#qr_decomposition |access-date=2023-09-03 |website=storopoli.io}}</ref> However, polynomial regressions are [[Runge's phenomenon|generally unstable]], making them unsuitable for [[nonparametric regression]] and inferior to newer methods based on [[smoothing spline]]s, [[LOESS]], or [[Gaussian process]] regression.<ref>{{Cite journal |last1=Gelman |first1=Andrew |last2=Imbens |first2=Guido |date=2019-07-03 |title=Why High-Order Polynomials Should Not Be Used in Regression Discontinuity Designs |url=https://www.tandfonline.com/doi/full/10.1080/07350015.2017.1366909 |journal=Journal of Business & Economic Statistics |language=en |volume=37 |issue=3 |pages=447–456 |doi=10.1080/07350015.2017.1366909 |issn=0735-0015 |url-access=subscription}}</ref>
# '''Use an [[QR decomposition|orthogonal representation]] of the data'''.<ref name=":4" /> Poorly written statistical software will sometimes fail to converge to a correct representation when variables are strongly correlated. However, it is still possible to rewrite the regression to use only uncorrelated variables by performing a [[change of basis]] (a sketch is given below).
#* For polynomial terms in particular, it is possible to rewrite the regression as a function of uncorrelated variables using [[orthogonal polynomials]].
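
A minimal sketch of this change of basis, using NumPy's [[QR decomposition]] on simulated data, fits the regression on the orthogonal columns of <math>Q</math> and then maps the coefficients back to the original predictors:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)    # strongly correlated with x1
y = 2 * x1 + 4 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
Q, R = np.linalg.qr(X)                 # X = QR, with orthonormal columns in Q
theta = Q.T @ y                        # least-squares coefficients in the orthogonal basis
beta = np.linalg.solve(R, theta)       # back-transform: solves R @ beta = theta
print(beta)                            # intercept and slopes for the original predictors
</syntaxhighlight>

Because the columns of <math>Q</math> are orthonormal, the fit in the new basis is numerically well-conditioned even when the original predictors are strongly correlated.
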
== Effects on coefficient estimates ==
In addition to causing numerical problems, imperfect collinearity makes precise estimation of the individual coefficients difficult. In other words, highly correlated variables lead to poor estimates and large standard errors.

As an example, say that we notice Alice wears her boots whenever it is raining and that there are only puddles when it rains. Then, we cannot tell whether she wears boots to keep the rain from landing on her feet, or to keep her feet dry if she steps in a puddle. The problem with trying to identify how much each of the two variables matters is that they are [[Confounding|confounded]] with each other: our observations are explained equally well by either variable, so we do not know which one of them causes the observed correlations.

There are two ways to discover this information:
# Using prior information or theory. For example, if we notice Alice never steps in puddles, we can reasonably argue puddles are not why she wears boots, as she does not need boots for puddles she never steps in.
# Collecting more data. If we observe Alice enough times, we will eventually see her on days where there are puddles but not rain (e.g. because the rain stops before she leaves home).

This confounding becomes substantially worse when researchers [[Scientific misconduct|attempt to ignore or suppress it]] by excluding these variables from the regression (see [[#Misuse]]). Excluding multicollinear variables from regressions will invalidate [[causal inference]] and produce worse estimates by removing important confounders.
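
The loss of precision can be seen in a small simulation (an illustrative sketch, not taken from the cited sources). With true coefficients <math>a_1 = 2</math> and <math>a_2 = 4</math>, as in the figure above, the estimate of <math>a_1</math> varies far more from sample to sample when the two predictors are highly correlated:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n, reps = 50, 2000

def spread_of_estimate(rho):
    """Standard deviation of the OLS estimate of a1 across repeated samples,
    for the true model y = 2*x1 + 4*x2 + noise with corr(x1, x2) = rho."""
    estimates = []
    for _ in range(reps):
        x1 = rng.normal(size=n)
        x2 = rho * x1 + np.sqrt(1 - rho ** 2) * rng.normal(size=n)
        y = 2 * x1 + 4 * x2 + rng.normal(size=n)
        X = np.column_stack([np.ones(n), x1, x2])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        estimates.append(beta[1])          # estimate of a1
    return np.std(estimates)

print(spread_of_estimate(0.0))    # uncorrelated predictors: small spread
print(spread_of_estimate(0.95))   # highly correlated predictors: roughly 3x the spread
</syntaxhighlight>

In both cases the estimates are unbiased; collinearity only increases their sampling variability, which is what the larger standard errors reflect.
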
=== Remedies ===
There are many ways to prevent multicollinearity from affecting results by planning ahead of time. However, these methods all require a researcher to decide on a procedure and analysis ''before'' data has been collected (see [[post hoc analysis]] and {{Section link|2=Misuse}}).

==== Regularized estimators ====
Many regression methods are naturally "robust" to multicollinearity and generally perform better than [[ordinary least squares]] regression, even when variables are independent. [[Regularization (mathematics)|Regularized regression]] techniques such as [[ridge regression]], [[Lasso (statistics)|LASSO]], [[Elastic net regularization|elastic net regression]], or [[spike-and-slab regression]] are less sensitive to including "useless" predictors, a common cause of collinearity. These techniques can detect and remove such predictors automatically to avoid problems. [[Bayesian hierarchical modeling|Bayesian hierarchical models]] (provided by software like [[Stan (software)|BRMS]]) can perform such regularization automatically, learning informative priors from the data.

Often, problems caused by the use of [[Frequentist statistical inference|frequentist estimation]] are misunderstood or misdiagnosed as being related to multicollinearity.<ref name=":5"/> Researchers are often frustrated not by multicollinearity, but by their inability to incorporate relevant [[Prior probability|prior information]] in regressions. For example, complaints that coefficients have "wrong signs" or confidence intervals that "include unrealistic values" indicate there is important prior information that is not being incorporated into the model. When this information is available, it should be incorporated into the prior using [[Bayesian linear regression|Bayesian regression]] techniques.<ref name=":5" />

[[Stepwise regression]] (the procedure of excluding "collinear" or "insignificant" variables) is especially vulnerable to multicollinearity, and is one of the few procedures wholly invalidated by it (with any collinearity resulting in heavily biased estimates and invalidated p-values).<ref name=":6"/>
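
As an illustrative sketch (simulated data; using the [[scikit-learn]] library, which is one of many implementations and is not referenced by the sources above), ridge and LASSO estimators remain stable in the presence of a strongly correlated pair of predictors and can shrink a useless predictor's coefficient toward zero:

<syntaxhighlight lang="python">
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)   # strongly correlated with x1
x3 = rng.normal(size=n)               # a "useless" predictor with no effect on y
y = 2 * x1 + 4 * x2 + rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

ridge = Ridge(alpha=1.0).fit(X, y)    # shrinks all coefficients, stabilizing the correlated pair
lasso = Lasso(alpha=0.1).fit(X, y)    # can set the useless predictor's coefficient exactly to zero
print(ridge.coef_, ridge.intercept_)
print(lasso.coef_, lasso.intercept_)
</syntaxhighlight>
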
==== Improved experimental design ====
When conducting experiments where researchers have control over the predictive variables, researchers can often avoid collinearity by choosing an [[optimal experimental design]] in consultation with a statistician.

==== Acceptance ====
While the above strategies work in some situations, estimates using advanced techniques may still produce large standard errors. In such cases, the correct response to multicollinearity is to "do nothing".<ref name=":3" /> The [[Scientific method|scientific process]] often involves [[Null result|null]] or inconclusive results; not every experiment will be "successful" in the sense of decisively confirming the researcher's original hypothesis.

Edward Leamer notes that "The solution to the weak evidence problem is more and better data. Within the confines of the given data set there is nothing that can be done about weak evidence".<ref name=":5" /> Leamer notes that "bad" regression results that are often misattributed to multicollinearity instead indicate the researcher has chosen an unrealistic [[prior probability]] (generally the [[flat prior]] used in [[Ordinary least squares|OLS]]).<ref name=":5"/> [[Damodar N. Gujarati|Damodar Gujarati]] writes that "we should rightly accept [our data] are sometimes not very informative about parameters of interest".<ref name=":3" /> [[Olivier Blanchard]] quips that "multicollinearity is God's will, not a problem with [[Ordinary least squares|OLS]]";<ref name=":2" /> in other words, when working with [[observational data]], researchers cannot "fix" multicollinearity, only accept it.

== Misuse ==
Variance inflation factors are often misused as criteria in [[stepwise regression]] (i.e. for variable inclusion/exclusion), a use that "lacks any logical basis but also is fundamentally misleading as a rule-of-thumb".<ref name=":6" /> Excluding collinear variables leads to artificially small estimates for standard errors, but does not reduce the true (not estimated) standard errors for regression coefficients.<ref name=":3" /> Excluding variables with a high [[variance inflation factor]] also invalidates the calculated standard errors and p-values, by turning the results of the regression into a [[post hoc analysis]].<ref>{{Cite journal |last1=Gelman |first1=Andrew |last2=Loken |first2=Eric |date=14 Nov 2013 |title=The garden of forking paths |url=http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf |journal=Unpublished |via=Columbia}}</ref>

Because collinearity leads to large standard errors and p-values, which can make publishing articles more difficult, some researchers will try to [[Scientific misconduct|suppress inconvenient data]] by removing strongly correlated variables from their regression. This procedure falls into the broader categories of [[p-hacking]], [[data dredging]], and [[post hoc analysis]]. Dropping (useful) collinear predictors will generally worsen the accuracy of the model and coefficient estimates. Similarly, trying many different models or estimation procedures (e.g. [[ordinary least squares]], ridge regression, etc.) until finding one that can "deal with" the collinearity creates a [[forking paths problem]]. P-values and confidence intervals derived from [[Post hoc analysis|post hoc analyses]] are invalidated by ignoring the uncertainty in the [[model selection]] procedure.

It is reasonable to exclude unimportant predictors if they are known ahead of time to have little or no effect on the outcome; for example, local cheese production should not be used to predict the height of skyscrapers. However, this must be done when first specifying the model, prior to observing any data, and potentially informative variables should always be included.

==See also==
* [[Ill-conditioned matrix]]
* [[Linear independence|Linear dependence]]

==References==
{{Reflist}}

==Further reading==
* {{cite book |last1=Belsley |first1=David A. |last2=Kuh |first2=Edwin |last3=Welsch |first3=Roy E. |title=Regression Diagnostics: Identifying Influential Data and Sources of Collinearity |publisher=Wiley |year=1980 |isbn=978-0-471-05856-4 |location=New York |author-link2=Edwin Kuh}}
* {{cite book |last=Goldberger |first=Arthur S. |title=A Course in Econometrics |publisher=Harvard University Press |year=1991 |isbn=9780674175440 |location=Cambridge |pages=245–53 |chapter=Multicollinearity |author-link=Arthur Goldberger |chapter-url=https://books.google.com/books?id=mHmxNGKRlQsC&pg=PA245}}
* {{cite book |last1=Hill |first1=R. Carter |last2=Adkins |first2=Lee C. |title=A Companion to Theoretical Econometrics |publisher=Blackwell |year=2001 |isbn=978-0-631-21254-6 |editor-last=Baltagi |editor-first=Badi H. |pages=256–278 |chapter=Collinearity |doi=10.1002/9780470996249.ch13}}
* {{cite book |last=Johnston |first=John |url=https://archive.org/details/econometricmetho0000john_t7q9 |title=Econometric Methods |publisher=McGraw-Hill |year=1972 |isbn=9780070326798 |edition=Second |location=New York |pages=[https://archive.org/details/econometricmetho0000john_t7q9/page/159 159]–168 |author-link=John Johnston (econometrician) |url-access=registration}}
* {{cite journal |last=Kalnins |first=Arturs |year=2022 |title=When does multicollinearity bias coefficients and cause type 1 errors? A reconciliation of Lindner, Puck, and Verbeke (2020) with Kalnins (2018). |journal=Journal of International Business Studies |volume=53 |issue=7 |pages=1536–1548 |doi=10.1057/s41267-022-00531-9 |s2cid=249323519}}
* {{cite book |last=Kmenta |first=Jan |url=https://archive.org/details/elementsofeconom0003kmen/page/430 |title=Elements of Econometrics |publisher=Macmillan |year=1986 |isbn=978-0-02-365070-3 |edition=Second |location=New York |pages=[https://archive.org/details/elementsofeconom0003kmen/page/430 430–442] |author-link=Jan Kmenta |url-access=registration}}
* {{cite book |last1=Maddala |first1=G. S. |last2=Lahiri |first2=Kajal |title=Introduction to Econometrics |publisher=Wiley |year=2009 |isbn=978-0-470-01512-4 |edition=Fourth |location=Chichester |pages=279–312 |author-link=G. S. Maddala}}
* {{cite journal |last1=Tomaschek |first1=Fabian |last2=Hendrix |first2=Peter |last3=Baayen |first3=R. Harald |year=2018 |title=Strategies for addressing collinearity in multivariate linguistic data |journal=Journal of Phonetics |volume=71 |pages=249–267 |doi=10.1016/j.wocn.2018.09.004 |doi-access=free}}

==External links==
* {{cite web |last=Thoma |first=Mark |author-link=Mark Thoma |date=2 March 2011 |title=Econometrics Lecture (topic: multicollinearity) |url=https://www.youtube.com/watch?v=K8eFiMIb8qo&list=PLD15D38DC7AA3B737&index=16#t=25m09s |url-status=live |archive-url=https://ghostarchive.org/varchive/youtube/20211212/K8eFiMIb8qo |archive-date=2021-12-12 |publisher=[[University of Oregon]] |via=[[YouTube]]}}{{cbignore}}
* [http://jeff560.tripod.com/m.html Earliest Uses: The entry on Multicollinearity has some historical information.]

{{Authority control}}

[[Category:Regression analysis]]
[[Category:Design of experiments]]