==Perfect multicollinearity==
{{stack|
[[File:Multicollinearity.jpg|thumb|357x357px|A depiction of multicollinearity.]]
[[File:Effect_of_multicollinearity_on_coefficients_of_linear_model.png|thumb|In a linear regression, the true parameters are <math>a_1 = 2, a_2 = 4</math>, which are reliably estimated when <math>X_1</math> and <math>X_2</math> are uncorrelated (black case) but unreliably estimated when <math>X_1</math> and <math>X_2</math> are correlated (red case).]]
}}
Perfect multicollinearity refers to a situation where the predictors are [[Linear dependence|linearly dependent]] (one can be written as an exact linear function of the others).<ref>{{cite book |last1=James |first1=Gareth |last2=Witten |first2=Daniela |last3=Hastie |first3=Trevor |last4=Tibshirani |first4=Robert |title=An introduction to statistical learning: with applications in R |date=2021 |publisher=Springer |location=New York, NY |isbn=978-1-0716-1418-1 |page=115 |edition=Second |doi=10.1007/978-1-0716-1418-1 |url=https://link.springer.com/book/10.1007/978-1-0716-1418-1 |access-date=1 November 2024 |language=en}}</ref> [[Ordinary least squares]] requires inverting the matrix <math>X^{\mathsf{T}}X</math>, where

: <math> X = \begin{bmatrix} 1 & X_{11} & \cdots & X_{k1} \\ \vdots & \vdots & & \vdots \\ 1 & X_{1N} & \cdots & X_{kN} \end{bmatrix}</math>

is an <math>N \times (k+1)</math> matrix, <math>N</math> is the number of observations, <math>k</math> is the number of explanatory variables, and <math>N \ge k+1</math>. If there is an exact linear relationship among the independent variables, then at least one column of <math>X</math> is a linear combination of the others, so the [[Rank (linear algebra)|rank]] of <math>X</math> (and therefore of <math>X^{\mathsf{T}}X</math>) is less than <math>k+1</math>, and the matrix <math>X^{\mathsf{T}}X</math> is not invertible.

=== Resolution ===
Perfect collinearity is typically caused by including redundant variables in a regression. For example, a dataset may include variables for income, expenses, and savings. Because income equals expenses plus savings by definition, it is incorrect to include all three variables in a regression simultaneously. Similarly, including a [[Dummy variable (statistics)|dummy variable]] for every category (e.g., summer, autumn, winter, and spring) as well as an intercept term results in perfect collinearity. This is known as the dummy variable trap.<ref>{{Cite web |title=Dummy Variable Trap - What is the Dummy Variable Trap? |work=LearnDataSci (www.learndatasci.com) |url=https://www.learndatasci.com/glossary/dummy-variable-trap/ |access-date=2024-01-18 |first=Fatih |last=Karabiber }}</ref>

The other common cause of perfect collinearity is attempting to use [[ordinary least squares]] on very wide datasets (those with more variables than observations). These require more advanced analysis techniques such as [[Hierarchical linear model|Bayesian hierarchical modeling]] to produce meaningful results.{{fact|date=March 2024}}
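The following is a minimal sketch of the income/expenses/savings example above, using [[NumPy]]; the variable names and simulated values are illustrative assumptions, not taken from any real dataset. Because one predictor is an exact linear function of the others, <math>X^{\mathsf{T}}X</math> is rank deficient, and dropping the redundant column restores full rank.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
N = 100  # number of observations

# Illustrative data: income equals expenses plus savings by construction,
# so the three predictors are exactly linearly dependent.
expenses = rng.normal(50, 10, N)
savings = rng.normal(20, 5, N)
income = expenses + savings

# Design matrix with an intercept column: N x (k+1) with k = 3.
X = np.column_stack([np.ones(N), income, expenses, savings])

# The rank of X'X is 3, less than k+1 = 4: the matrix is singular in exact
# arithmetic.  (In floating point, np.linalg.inv may raise LinAlgError or
# silently return numerically meaningless values, so check the rank instead.)
print(np.linalg.matrix_rank(X.T @ X))  # 3

# Dropping the redundant column (income) restores full rank.
X_reduced = np.column_stack([np.ones(N), expenses, savings])
print(np.linalg.matrix_rank(X_reduced.T @ X_reduced))  # 3 = k+1 for k = 2
</syntaxhighlight>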
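A similar sketch, again with illustrative data, shows the dummy variable trap: one-hot encoding all four seasons alongside an intercept makes the dummy columns sum to the intercept column, and dropping one reference category resolves the dependence.

<syntaxhighlight lang="python">
import numpy as np

N = 12
season = np.arange(N) % 4    # cycle through 0=summer, 1=autumn, 2=winter, 3=spring
dummies = np.eye(4)[season]  # one-hot encoding, N x 4

# Trap: the four dummy columns sum to the intercept column, so the
# design matrix has rank 4 even though it has 5 columns.
X_trap = np.column_stack([np.ones(N), dummies])
print(np.linalg.matrix_rank(X_trap))  # 4 < 5

# Fix: drop one reference category (here, summer).
X_ok = np.column_stack([np.ones(N), dummies[:, 1:]])
print(np.linalg.matrix_rank(X_ok))    # 4: full column rank
</syntaxhighlight>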