
==Statement==
Suppose we have, in matrix notation, the linear relationship
:<math> y = X \beta + \varepsilon,\quad (y,\varepsilon \in \mathbb{R}^n, \beta \in \mathbb{R}^K \text{ and } X\in\mathbb{R}^{n\times K}) </math>
which expands to
:<math> y_i=\sum_{j=1}^{K}\beta_j X_{ij}+\varepsilon_i \quad \forall i=1,2,\ldots,n</math>

where <math>\beta_j</math> are non-random but '''un'''observable parameters, <math>X_{ij}</math> are non-random and observable (called the "explanatory variables"), <math>\varepsilon_i</math> are random, and so <math>y_i</math> are random. The random variables <math>\varepsilon_i</math> are called the "disturbance", "noise" or simply "error" (to be contrasted with "residual" later in the article; see [[errors and residuals in statistics]]). To include a constant in the model above, one can introduce it as an additional coefficient <math>\beta_{K+1}</math> whose corresponding, newly added last column of <math>X</math> consists of ones, i.e., <math>X_{i(K+1)} = 1</math> for all <math> i </math>. Note that although the <math>y_i,</math> as sample responses, are observable, the statements and arguments that follow (the assumptions, the proofs and so on) are conditional '''only''' on knowing the <math>X_{ij},</math> '''not''' the <math>y_i.</math>
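
As a minimal sketch of the model and of the constant-column construction described above, the following NumPy snippet builds a small design matrix, appends a column of ones, and checks that the matrix form and the expanded sum agree. The sizes, the coefficient values and the names <code>X_const</code> and <code>y_sum</code> are illustrative assumptions, not part of the theorem.

```python
import numpy as np

rng = np.random.default_rng(4)
n, K = 6, 2                                   # illustrative sizes (assumed)

X = rng.normal(size=(n, K))                   # observable explanatory variables
X_const = np.column_stack([X, np.ones(n)])    # append a last column of ones

beta = np.array([0.5, -1.0, 3.0])             # last entry plays the role of the constant
eps = rng.normal(scale=0.1, size=n)           # random errors

# Matrix form: y = X beta + eps
y = X_const @ beta + eps

# Expanded form: y_i = sum_j beta_j X_ij + eps_i
y_sum = np.array([
    sum(beta[j] * X_const[i, j] for j in range(K + 1)) + eps[i]
    for i in range(n)
])
```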

The '''Gauss–Markov''' assumptions concern the set of error random variables, <math>\varepsilon_i</math>:

*They have mean zero: <math>\operatorname{E}[\varepsilon_i]=0.</math>
*They are [[homoscedasticity|homoscedastic]], that is, all have the same finite variance: <math>\operatorname{Var}(\varepsilon_i)= \sigma^2 < \infty</math> for all <math>i.</math>
*Distinct error terms are uncorrelated: <math>\text{Cov}(\varepsilon_i,\varepsilon_j) = 0, \forall i \neq j.</math>
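
In matrix notation, these three assumptions can be summarised compactly as follows, where <math>I_n</math> denotes the <math>n\times n</math> identity matrix (a symbol introduced here only for this restatement):

:<math>\operatorname{E}[\varepsilon]=0, \qquad \operatorname{Var}(\varepsilon)=\operatorname{E}\left[\varepsilon\varepsilon^\operatorname{T}\right]=\sigma^2 I_n.</math>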

A '''linear estimator''' of <math> \beta_j  </math> is a linear combination

:<math>\widehat\beta_j = c_{1j}y_1+\cdots+c_{nj}y_n</math>

in which the coefficients <math> c_{ij} </math>  are not allowed to depend on the underlying coefficients <math>\beta_j</math>, since those are not observable, but are allowed to depend on the values <math> X_{ij} </math>, since these data are observable.  (The dependence of the coefficients on each <math>X_{ij}</math> is typically nonlinear; the estimator is linear in each <math> y_i </math> and hence in each random <math> \varepsilon,</math> which is why this is [[linear regression|"linear" regression]].)  The estimator is said to be '''unbiased''' [[if and only if]]

:<math>\operatorname{E}\left [\widehat\beta_j \right ]=\beta_j</math>

regardless of the values of <math> X_{ij} </math>. Now, let <math display="inline">\sum_{j=1}^K\lambda_j\beta_j</math> be some linear combination of the coefficients. Then the '''[[mean squared error]]''' of the corresponding estimation is

:<math>\operatorname{E} \left [\left (\sum_{j=1}^K\lambda_j \left(\widehat\beta_j-\beta_j \right ) \right)^2\right ],</math>

in other words, it is the expectation of the square of the weighted sum (across parameters) of the differences between the estimators and the corresponding parameters to be estimated. (Since we are considering the case in which all the parameter estimates are unbiased, this mean squared error is the same as the variance of the linear combination.) The '''best linear unbiased estimator''' (BLUE) of the vector <math> \beta </math> of parameters <math> \beta_j </math> is one with the smallest mean squared error for every vector <math> \lambda </math> of linear combination parameters.  This is equivalent to the condition that

:<math>\operatorname{Var}\left(\widetilde\beta\right)- \operatorname{Var} \left( \widehat \beta \right)</math>

is a positive semi-definite matrix for every other linear unbiased estimator <math>\widetilde\beta</math>.
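
The positive semi-definiteness condition can be checked numerically in a small case. The sketch below assumes a randomly generated <math>X</math>, an assumed value of <math>\sigma^2</math>, and, as the competing linear unbiased estimator, a weighted least squares estimator with arbitrary positive weights (chosen only for illustration). A linear estimator <math>Cy</math> is unbiased for every <math>\beta</math> exactly when <math>CX = I</math>, and under the Gauss–Markov assumptions its covariance matrix is <math>\sigma^2 CC^\operatorname{T}</math>.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 40, 3                       # illustrative sizes (assumed)
X = rng.normal(size=(n, K))
sigma2 = 2.0                       # assumed error variance

# OLS weights: C_ols = (X^T X)^{-1} X^T, so that beta_hat = C_ols @ y
C_ols = np.linalg.solve(X.T @ X, X.T)

# A competing linear unbiased estimator (illustrative choice):
# weighted least squares with arbitrary positive weights w
w = rng.uniform(0.5, 2.0, size=n)
XtW = X.T * w                      # X^T diag(w), via broadcasting
C_alt = np.linalg.solve(XtW @ X, XtW)

def estimator_cov(C, sigma2):
    """Covariance of C @ y when Var(eps) = sigma2 * I."""
    return sigma2 * C @ C.T

diff = estimator_cov(C_alt, sigma2) - estimator_cov(C_ols, sigma2)

# Eigenvalues of the (symmetrised) difference; all should be >= 0 up to rounding
eigvals = np.linalg.eigvalsh((diff + diff.T) / 2)

# Equivalent view: for any lambda, the MSE gap lambda^T diff lambda is >= 0
lam = rng.normal(size=K)
mse_gap = lam @ diff @ lam
```

Both <code>eigvals.min()</code> and <code>mse_gap</code> should be non-negative up to floating-point rounding, which is the matrix statement and the "for every <math>\lambda</math>" statement respectively.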

The '''ordinary least squares estimator (OLS)''' is the function

:<math>\widehat\beta=(X^\operatorname{T}X)^{-1}X^\operatorname{T}y</math>

of <math> y </math> and <math>X</math> (where <math>X^\operatorname{T}</math> denotes the [[transpose]] of <math> X </math>) that minimizes the '''sum of squares of [[errors and residuals in statistics|residuals]]''' (misprediction amounts):

:<math>\sum_{i=1}^n \left(y_i-\widehat{y}_i\right)^2=\sum_{i=1}^n \left(y_i-\sum_{j=1}^K \widehat\beta_j X_{ij}\right)^2.</math>
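
A minimal numerical sketch of this estimator, assuming simulated data with illustrative coefficients and noise level: the closed-form expression is evaluated by solving the normal equations <math>X^\operatorname{T}X\widehat\beta = X^\operatorname{T}y</math> and compared with NumPy's general least-squares solver. The helper names <code>ols</code> and <code>rss</code> are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 50, 3                              # illustrative sizes (assumed)
X = rng.normal(size=(n, K))
beta = np.array([1.0, -2.0, 0.5])         # "true" coefficients, assumed for the demo
y = X @ beta + rng.normal(scale=0.3, size=n)

def ols(X, y):
    """OLS estimate obtained by solving the normal equations X^T X b = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def rss(X, y, b):
    """Sum of squared residuals for a candidate coefficient vector b."""
    r = y - X @ b
    return float(r @ r)

beta_hat = ols(X, y)

# Cross-check against NumPy's least-squares routine
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
```

Solving the linear system is used here instead of forming the inverse explicitly; the two are mathematically equivalent when <math>X^\operatorname{T}X</math> is invertible, and solving is generally the better-conditioned choice.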

The theorem now states that the OLS estimator is a best linear unbiased estimator (BLUE). 

The main idea of the proof is that the least-squares estimator is uncorrelated with every linear unbiased estimator of zero, i.e., with every linear combination <math>a_1y_1+\cdots+a_ny_n</math> whose coefficients do not depend upon the unobservable <math> \beta </math> but whose expected value is always zero.
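
This idea can be illustrated numerically. A combination <math>a^\operatorname{T}y</math> has expected value <math>a^\operatorname{T}X\beta</math>, which is zero for every <math>\beta</math> exactly when <math>X^\operatorname{T}a = 0</math>. The sketch below assumes a random <math>X</math> and an assumed <math>\sigma^2</math>, builds such an <math>a</math> by projecting a random vector onto the orthogonal complement of the column space of <math>X</math>, and evaluates the covariance <math>\operatorname{Cov}(\widehat\beta, a^\operatorname{T}y) = \sigma^2 (X^\operatorname{T}X)^{-1}X^\operatorname{T}a</math>.

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 30, 4                       # illustrative sizes (assumed)
X = rng.normal(size=(n, K))
sigma2 = 1.5                       # assumed error variance

# OLS weights: beta_hat = C @ y
C = np.linalg.solve(X.T @ X, X.T)

# Build a with X^T a = 0: remove from a random z its projection onto col(X)
z = rng.normal(size=n)
a = z - X @ (C @ z)

# Cov(beta_hat, a^T y) = C Var(y) a = sigma2 * C @ a under the assumptions
cov = sigma2 * C @ a
```

Up to floating-point rounding, <code>X.T @ a</code> and <code>cov</code> are both zero vectors.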

=== Remark ===
A proof that the OLS estimator indeed ''minimizes'' the sum of squares of residuals may proceed by calculating the [[Hessian matrix]] and showing that it is positive definite.

The function we want to minimize, the sum of squared residuals, is
<math display="block">f(\beta_0,\beta_1,\dots,\beta_p) = \sum_{i=1}^n (y_i-\beta_0-\beta_1x_{i1}-\dots-\beta_px_{ip})^2</math>
for a multiple regression model with ''p'' variables. Setting the first derivative to zero gives
<math display="block">\begin{aligned}
\frac{d}{d\boldsymbol{\beta}}f &= -2X^\operatorname{T} \left(\mathbf{y}-X\boldsymbol{\beta}\right)\\
&=-2\begin{bmatrix}
\sum_{i=1}^{n} (y_i - \dots - \beta_px_{ip})\\
\sum_{i=1}^nx_{i1} (y_i-\dots-\beta_px_{ip})\\
\vdots\\ 
\sum_{i=1}^nx_{ip} (y_i-\dots-\beta_px_{ip})
\end{bmatrix}\\
&= \mathbf{0}_{p+1},
\end{aligned}</math>
where <math>X</math> is the design matrix
<math display="block">X=\begin{bmatrix}
1 & x_{11} & \cdots & x_{1p}\\
1 & x_{21} & \cdots & x_{2p}\\
&&\vdots\\
1 & x_{n1} & \cdots & x_{np}
\end{bmatrix}\in \R^{n\times(p+1)}; \qquad n\geq p+1</math>

The [[Hessian matrix]] of second derivatives is 
<math display="block">\mathcal{H} = 2\begin{bmatrix}
n & \sum_{i=1}^n x_{i1} & \cdots & \sum_{i=1}^n x_{ip} \\
\sum_{i=1}^n x_{i1}& \sum_{i=1}^n x_{i1}^2 & \cdots & \sum_{i=1}^nx_{i1}x_{ip}\\
\vdots & \vdots &\ddots & \vdots \\
\sum_{i=1}^n x_{ip} & \sum_{i=1}^n x_{ip}x_{i1}& \cdots & \sum_{i=1}^n x_{ip}^2
\end{bmatrix} = 2X^\operatorname{T}X</math>

Assuming the columns of <math>X</math> are linearly independent so that <math>X^\operatorname{T} X</math> is invertible, let <math>X=\begin{bmatrix}\mathbf{v_1}& \mathbf{v_2}& \cdots & \mathbf{v}_{p+1}\end{bmatrix}</math>, then 
<math display="block">k_1\mathbf{v_1} + \dots + k_{p+1} \mathbf{v}_{p+1} = \mathbf 0\iff k_1= \dots =k_{p+1}=0</math>

Now let <math>\mathbf{k} = (k_1,\dots,k_{p+1})^T \in \R^{(p+1)\times 1}</math> be an eigenvector of <math>\mathcal{H}</math>. 

<math display="block">\mathbf{k} \ne \mathbf{0} \implies \left\|k_1\mathbf{v_1}+\dots+k_{p+1}\mathbf{v}_{p+1}\right\|^2 > 0</math>

In terms of vector multiplication, this means 
<math display="block">\begin{bmatrix} k_1 & \cdots & k_{p+1} \end{bmatrix}
\begin{bmatrix}\mathbf{v_1}^\operatorname{T} \\ \vdots \\ \mathbf{v}_{p+1}^\operatorname{T}\end{bmatrix}
\begin{bmatrix}\mathbf{v_1} & \cdots & \mathbf{v}_{p+1}\end{bmatrix}
\begin{bmatrix}k_1 \\ \vdots\\ k_{p+1}\end{bmatrix}
= \mathbf{k}^\operatorname{T}X^\operatorname{T}X\mathbf{k} = \tfrac{1}{2}\mathbf{k}^\operatorname{T}\mathcal{H}\mathbf{k} = \tfrac{1}{2}\lambda \mathbf{k}^\operatorname{T}\mathbf{k}>0</math>
where <math>\lambda</math> is the eigenvalue corresponding to <math>\mathbf{k}</math>. Moreover, 
<math display="block">\mathbf{k}^\operatorname{T}\mathbf{k} = \sum_{i=1}^{p+1}k_i^2 > 0 \implies \lambda > 0</math>

Finally, as the eigenvector <math>\mathbf{k}</math> was arbitrary, all eigenvalues of <math>\mathcal{H}</math> are positive, and therefore <math>\mathcal{H}</math> is positive definite. Thus,
<math display="block">\widehat{\boldsymbol{\beta}} = \left(X^\operatorname{T}X\right)^{-1}X^\operatorname{T}\mathbf{y}</math>
is indeed the global minimizer.

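Alternatively, note that for every vector <math>\mathbf{v}</math>, <math>\mathbf{v}^\operatorname{T} X^\operatorname{T} X \mathbf{v} = \|X\mathbf{v}\|^2 \ge 0 </math>, with equality only when <math>X\mathbf{v} = \mathbf{0}</math>. Hence the Hessian is positive definite whenever <math>X</math> has full column rank.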
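
The role of the full-rank condition can be seen in a small numerical sketch. It assumes a random design matrix with an intercept column, and a hypothetical rank-deficient variant <code>X_bad</code> obtained by duplicating one column. The eigenvalues of the Hessian <math>2X^\operatorname{T}X</math> are strictly positive in the first case, while one of them is zero (up to rounding) in the second.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 25, 3                                   # illustrative sizes (assumed)

# Design matrix with a leading column of ones, as in the remark
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])

H = 2 * X.T @ X                                # Hessian of the sum of squares
eig = np.linalg.eigvalsh(H)                    # eigenvalues in ascending order

# Rank-deficient variant: repeat a column, so the columns are linearly dependent
X_bad = np.column_stack([X, X[:, 1]])
H_bad = 2 * X_bad.T @ X_bad
eig_bad = np.linalg.eigvalsh(H_bad)
```

In the rank-deficient case the Hessian is only positive semi-definite, so the minimizer of the sum of squares is no longer unique.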