== Misuse ==
Variance inflation factors are often misused as criteria in [[stepwise regression]] (i.e. for variable inclusion/exclusion), a use that "lacks any logical basis but also is fundamentally misleading as a rule-of-thumb".<ref name=":6" /> Excluding collinear variables leads to artificially small estimates of the standard errors, but does not reduce the true (not estimated) standard errors of the regression coefficients.<ref name=":3" /> Excluding variables with a high [[variance inflation factor]] also invalidates the calculated standard errors and p-values, by turning the results of the regression into a [[post hoc analysis]].<ref>{{Cite journal |last1=Gelman |first1=Andrew |last2=Loken |first2=Eric |date=14 Nov 2013 |title=The garden of forking paths |url=http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf |journal=Unpublished |via=Columbia}}</ref>

Because collinearity leads to large standard errors and p-values, which can make publishing articles more difficult, some researchers will try to [[Scientific misconduct|suppress inconvenient data]] by removing strongly correlated variables from their regression. This procedure falls into the broader categories of [[p-hacking]], [[data dredging]], and [[post hoc analysis]]. Dropping (useful) collinear predictors will generally worsen the accuracy of the model and of the coefficient estimates. Similarly, trying many different models or estimation procedures (e.g. [[ordinary least squares]], ridge regression, etc.) until finding one that can "deal with" the collinearity creates a [[forking paths problem]]. P-values and confidence intervals derived from [[Post hoc analysis|post hoc analyses]] are invalidated by ignoring the uncertainty in the [[model selection]] procedure.

It is reasonable to exclude unimportant predictors if they are known ahead of time to have little or no effect on the outcome; for example, local cheese production should not be used to predict the height of skyscrapers. However, this must be done when first specifying the model, prior to observing any data, and potentially informative variables should always be included.
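A minimal simulation sketch of the point above: dropping a collinear but relevant predictor shrinks the ''reported'' standard error while biasing the remaining coefficient estimate. The data-generating process, variable names, and use of the Python library statsmodels are illustrative assumptions, not taken from the cited sources.

<syntaxhighlight lang="python">
# Illustrative simulation: x2 is strongly collinear with x1, and both affect y.
# Dropping x2 because of its high VIF makes the reported SE on x1 much smaller,
# but the x1 coefficient becomes biased (omitted-variable bias).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)   # strongly collinear with x1
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

# Full model: correctly specified, but collinearity inflates the standard errors
full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# Reduced model: x2 dropped because of its high variance inflation factor
reduced = sm.OLS(y, sm.add_constant(x1)).fit()

print("full model:    coef on x1 =", round(full.params[1], 2),
      " SE =", round(full.bse[1], 2))
print("reduced model: coef on x1 =", round(reduced.params[1], 2),
      " SE =", round(reduced.bse[1], 2))
# The reduced model reports a far smaller SE, but its x1 coefficient is
# roughly 1.9 rather than the true value of 1.0.
</syntaxhighlight>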