== Effects on coefficient estimates ==
In addition to causing numerical problems, imperfect collinearity makes precise estimation of variables difficult. In other words, highly correlated variables lead to poor estimates and large standard errors.

As an example, say that we notice Alice wears her boots whenever it is raining and that there are only puddles when it rains. Then, we cannot tell whether she wears boots to keep the rain from landing on her feet, or to keep her feet dry if she steps in a puddle. The problem with trying to identify how much each of the two variables matters is that they are [[Confounding|confounded]] with each other: our observations are explained equally well by either variable, so we do not know which one of them causes the observed correlations. There are two ways to discover this information:
# Using prior information or theory. For example, if we notice Alice never steps in puddles, we can reasonably argue puddles are not why she wears boots, as she does not need the boots to avoid puddles.
# Collecting more data. If we observe Alice enough times, we will eventually see her on days when there are puddles but not rain (e.g. because the rain stops before she leaves home).

This confounding becomes substantially worse when researchers [[Scientific misconduct|attempt to ignore or suppress it]] by excluding these variables from the regression (see [[#Misuse]]). Excluding multicollinear variables from regressions will invalidate [[causal inference]] and produce worse estimates by removing important confounders.

=== Remedies ===
There are many ways to prevent multicollinearity from affecting results by planning ahead of time. However, these methods all require a researcher to decide on a procedure and analysis ''before'' data has been collected (see [[post hoc analysis]] and {{Section link|2=Misuse}}).

==== Regularized estimators ====
Many regression methods are naturally "robust" to multicollinearity and generally perform better than [[ordinary least squares]] regression, even when variables are independent. [[Regularization (mathematics)|Regularized regression]] techniques such as [[ridge regression]], [[Lasso (statistics)|LASSO]], [[Elastic net regularization|elastic net regression]], or [[spike-and-slab regression]] are less sensitive to including "useless" predictors, a common cause of collinearity. These techniques can detect and remove such predictors automatically to avoid problems. [[Bayesian hierarchical modeling|Bayesian hierarchical models]] (provided by software like [[Stan (software)|BRMS]]) can perform such regularization automatically, learning informative priors from the data.
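A minimal simulation sketch of this behavior (illustrative only: it assumes the scikit-learn <code>LinearRegression</code> and <code>Ridge</code> estimators, an arbitrary predictor correlation of 0.99, and a fixed rather than tuned penalty) compares how much OLS and ridge coefficient estimates vary across repeated samples of collinear data:

<syntaxhighlight lang="python">
# Sketch: coefficient instability under collinearity, OLS vs. ridge regression.
# All settings (correlation, sample size, penalty) are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n, trials = 50, 200
cov = [[1.0, 0.99], [0.99, 1.0]]  # x1 and x2 are almost perfectly correlated

ols_coefs, ridge_coefs = [], []
for _ in range(trials):
    X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    y = X[:, 0] + X[:, 1] + rng.normal(size=n)  # true coefficients are both 1
    ols_coefs.append(LinearRegression().fit(X, y).coef_)
    ridge_coefs.append(Ridge(alpha=1.0).fit(X, y).coef_)

# The ridge estimates are slightly biased toward zero, but they vary far less
# from sample to sample than the OLS estimates.
print("std of OLS   estimate of b1:", np.std([c[0] for c in ols_coefs]))
print("std of ridge estimate of b1:", np.std([c[0] for c in ridge_coefs]))
</syntaxhighlight>

In practice the penalty strength would be chosen by cross-validation (for example with scikit-learn's <code>RidgeCV</code> or <code>LassoCV</code>) rather than fixed in advance.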
Often, problems caused by the use of [[Frequentist statistical inference|frequentist estimation]] are misunderstood or misdiagnosed as being related to multicollinearity.<ref name=":5"/> Researchers are often frustrated not by multicollinearity, but by their inability to incorporate relevant [[Prior probability|prior information]] in regressions. For example, complaints that coefficients have "wrong signs" or confidence intervals that "include unrealistic values" indicate there is important prior information that is not being incorporated into the model. When this information is available, it should be incorporated into the prior using [[Bayesian linear regression|Bayesian regression]] techniques.<ref name=":5" />

[[Stepwise regression]] (the procedure of excluding "collinear" or "insignificant" variables) is especially vulnerable to multicollinearity, and is one of the few procedures wholly invalidated by it (with any collinearity resulting in heavily biased estimates and invalidated p-values).<ref name=":6"/>

==== Improved experimental design ====
When conducting experiments in which they control the predictive variables, researchers can often avoid collinearity by choosing an [[optimal experimental design]] in consultation with a statistician.

==== Acceptance ====
While the above strategies work in some situations, estimates using advanced techniques may still produce large standard errors. In such cases, the correct response to multicollinearity is to "do nothing".<ref name=":3" /> The [[Scientific method|scientific process]] often involves [[Null result|null]] or inconclusive results; not every experiment will be "successful" in the sense of decisively confirming the researcher's original hypothesis.

Edward Leamer notes that "The solution to the weak evidence problem is more and better data. Within the confines of the given data set there is nothing that can be done about weak evidence".<ref name=":5" /> He adds that "bad" regression results which are often misattributed to multicollinearity instead indicate the researcher has chosen an unrealistic [[prior probability]] (generally the [[flat prior]] used in [[Ordinary least squares|OLS]]).<ref name=":5"/>

[[Damodar N. Gujarati|Damodar Gujarati]] writes that "we should rightly accept [our data] are sometimes not very informative about parameters of interest".<ref name=":3" /> [[Olivier Blanchard]] quips that "multicollinearity is God's will, not a problem with [[Ordinary least squares|OLS]]";<ref name=":2" /> in other words, when working with [[observational data]], researchers cannot "fix" multicollinearity, only accept it.
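As a rough illustration of Leamer's point (a sketch under assumed simulation settings, not drawn from the cited sources), the ordinary least squares standard errors below remain large for any fixed sample of strongly collinear predictors and shrink only as more data are collected, roughly in proportion to <math>1/\sqrt{n}</math>:

<syntaxhighlight lang="python">
# Sketch: with collinear predictors, only more data shrinks the standard errors.
# Simulation settings are illustrative; standard errors use the usual OLS formula.
import numpy as np

def ols_se(X, y):
    """Standard errors of the slope coefficients from an OLS fit with intercept."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    sigma2 = resid @ resid / (len(y) - Xd.shape[1])
    cov_beta = sigma2 * np.linalg.inv(Xd.T @ Xd)
    return np.sqrt(np.diag(cov_beta))[1:]  # drop the intercept

rng = np.random.default_rng(1)
cov = [[1.0, 0.99], [0.99, 1.0]]  # strongly collinear predictors
for n in (50, 500, 5000):
    X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    y = X[:, 0] + X[:, 1] + rng.normal(size=n)
    print(n, ols_se(X, y))  # standard errors fall roughly as 1/sqrt(n)
</syntaxhighlight>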