== Misuse ==
Variance inflation factors are often misused as criteria in [[stepwise regression]] (i.e. for variable inclusion/exclusion), a use that "lacks any logical basis but also is fundamentally misleading as a rule-of-thumb".<ref name=":6" /> Excluding collinear variables leads to artificially small estimates of the standard errors, but does not reduce the true (not estimated) standard errors of the regression coefficients.<ref name=":3" /> Excluding variables with a high [[variance inflation factor]] also invalidates the calculated standard errors and p-values, by turning the results of the regression into a [[post hoc analysis]].<ref>{{Cite journal |last1=Gelman |first1=Andrew |last2=Loken |first2=Eric |date=14 Nov 2013 |title=The garden of forking paths |url=http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf |journal=Unpublished |via=Columbia}}</ref>

Because collinearity leads to large standard errors and p-values, which can make publishing articles more difficult, some researchers will try to [[Scientific misconduct|suppress inconvenient data]] by removing strongly correlated variables from their regression. This procedure falls into the broader categories of [[p-hacking]], [[data dredging]], and [[post hoc analysis]]. Dropping (useful) collinear predictors will generally worsen the accuracy of the model and of the coefficient estimates. Similarly, trying many different models or estimation procedures (e.g. [[ordinary least squares]], ridge regression, etc.) until finding one that can "deal with" the collinearity creates a [[forking paths problem]]. P-values and confidence intervals derived from [[Post hoc analysis|post hoc analyses]] are invalidated by ignoring the uncertainty in the [[model selection]] procedure.

It is reasonable to exclude unimportant predictors if they are known ahead of time to have little or no effect on the outcome; for example, local cheese production should not be used to predict the height of skyscrapers. However, this must be done when first specifying the model, prior to observing any data, and potentially informative variables should always be included.
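A minimal simulation sketch of the point above: dropping a collinear but relevant predictor shrinks the ''reported'' standard error while biasing the remaining coefficient estimate. The data-generating process, variable names, and use of the Python library statsmodels are illustrative assumptions, not taken from the cited sources.

<syntaxhighlight lang="python">
# Illustrative simulation: x2 is strongly collinear with x1, and both affect y.
# Dropping x2 because of its high VIF makes the reported SE on x1 much smaller,
# but the x1 coefficient becomes biased (omitted-variable bias).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)   # strongly collinear with x1
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

# Full model: correctly specified, but collinearity inflates the standard errors
full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# Reduced model: x2 dropped because of its high variance inflation factor
reduced = sm.OLS(y, sm.add_constant(x1)).fit()

print("full model:    coef on x1 =", round(full.params[1], 2),
      " SE =", round(full.bse[1], 2))
print("reduced model: coef on x1 =", round(reduced.params[1], 2),
      " SE =", round(reduced.bse[1], 2))
# The reduced model reports a far smaller SE, but its x1 coefficient is
# roughly 1.9 rather than the true value of 1.0.
</syntaxhighlight>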