==Regression model==
In practice, researchers first select a model they would like to estimate and then use their chosen method (e.g., [[ordinary least squares]]) to estimate the parameters of that model. Regression models involve the following components:
*The '''unknown parameters''', often denoted as a [[scalar (physics)|scalar]] or [[Euclidean vector|vector]] <math>\beta</math>.
*The '''independent variables''', which are observed in data and are often denoted as a vector <math>X_i</math> (where <math>i</math> denotes a row of data).
*The '''dependent variable''', which is observed in data and often denoted using the scalar <math>Y_i</math>.
*The '''error terms''', which are ''not'' directly observed in data and are often denoted using the scalar <math>e_i</math>.

In various [[List of fields of application of statistics|fields of application]], different terminologies are used in place of [[dependent and independent variables]].

Most regression models propose that <math>Y_i</math> is a [[Function (mathematics)|function]] ('''regression function''') of <math>X_i</math> and <math>\beta</math>, with <math>e_i</math> representing an [[Errors and residuals|additive error term]] that may stand in for un-modeled determinants of <math>Y_i</math> or random statistical noise:
:<math>Y_i = f (X_i, \beta) + e_i</math>
Note that the independent variables <math>X_i</math> are assumed to be free of error. This important assumption is often overlooked, although [[errors-in-variables models]] can be used when the independent variables are assumed to contain errors.

The researchers' goal is to estimate the function <math>f(X_i, \beta)</math> that most closely fits the data. To carry out regression analysis, the form of the function <math>f</math> must be specified. Sometimes the form of this function is based on knowledge about the relationship between <math>Y_i</math> and <math>X_i</math> that does not rely on the data. If no such knowledge is available, a flexible or convenient form for <math>f</math> is chosen. For example, a simple univariate regression may propose <math>f(X_i, \beta) = \beta_0 + \beta_1 X_i</math>, suggesting that the researcher believes <math>Y_i = \beta_0 + \beta_1 X_i + e_i</math> to be a reasonable approximation for the statistical process generating the data.

Once researchers determine their preferred [[statistical model]], different forms of regression analysis provide tools to estimate the parameters <math>\beta</math>. For example, [[least squares]] (including its most common variant, [[ordinary least squares]]) finds the value of <math>\beta</math> that minimizes the sum of squared errors <math>\sum_i (Y_i - f(X_i, \beta))^2</math>. A given regression method will ultimately provide an estimate of <math>\beta</math>, usually denoted <math>\hat{\beta}</math> to distinguish the estimate from the true (unknown) parameter value that generated the data. Using this estimate, the researcher can then use the ''fitted value'' <math>\hat{Y_i} = f(X_i,\hat{\beta})</math> for prediction or to assess the accuracy of the model in explaining the data. Whether the researcher is intrinsically interested in the estimate <math>\hat{\beta}</math> or the predicted value <math>\hat{Y_i}</math> will depend on context and their goals.
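The estimation step above can be illustrated with a minimal numerical sketch, assuming hypothetical data and the Python library NumPy; it fits the univariate model <math>Y_i = \beta_0 + \beta_1 X_i + e_i</math> by minimizing the sum of squared errors and is intended only as an illustration, not a prescribed implementation.

<syntaxhighlight lang="python">
# A minimal sketch (hypothetical data): estimating beta_0 and beta_1 in the
# univariate model Y_i = beta_0 + beta_1 * X_i + e_i by ordinary least squares.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # independent variable X_i
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # dependent variable Y_i

# Design matrix with a leading column of ones for the intercept beta_0.
A = np.column_stack([np.ones_like(X), X])

# The least-squares estimate beta_hat minimizes sum_i (Y_i - f(X_i, beta))^2.
beta_hat, *_ = np.linalg.lstsq(A, Y, rcond=None)

# Fitted values Y_hat_i = f(X_i, beta_hat), usable for prediction or for
# assessing how well the model explains the data.
Y_hat = A @ beta_hat

print(beta_hat)   # approximately [0.14, 1.96] for these illustrative numbers
</syntaxhighlight>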
As described in [[ordinary least squares]], least squares is widely used because the estimated function <math>f(X_i, \hat{\beta})</math> approximates the [[conditional expectation]] <math>E(Y_i|X_i)</math>.<ref name="Gauss" /> However, alternative variants (e.g., [[least absolute deviations]] or [[quantile regression]]) are useful when researchers want to model other functions <math>f(X_i,\beta)</math>.

It is important to note that there must be sufficient data to estimate a regression model. For example, suppose that a researcher has access to <math>N</math> rows of data with one dependent and two independent variables: <math>(Y_i, X_{1i}, X_{2i})</math>. Suppose further that the researcher wants to estimate a bivariate linear model via [[least squares]]: <math>Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + e_i</math>. If the researcher only has access to <math>N=2</math> data points, then they could find infinitely many combinations <math>(\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2)</math> that explain the data equally well: any combination can be chosen that satisfies <math>\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i}</math>, all of which lead to <math>\sum_i \hat{e}_i^2 = \sum_i (\hat{Y}_i - (\hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i}))^2 = 0</math> and are therefore valid solutions that minimize the sum of squared [[Errors and residuals|residuals]]. To understand why there are infinitely many options, note that the system of <math>N=2</math> equations is to be solved for 3 unknowns, which makes the system [[Underdetermined system|underdetermined]]. Alternatively, one can visualize infinitely many planes in three-dimensional space that pass through <math>N=2</math> fixed points.

More generally, to estimate a [[least squares]] model with <math>k</math> distinct parameters, one must have <math>N \geq k</math> distinct data points. If <math>N > k</math>, then there does not generally exist a set of parameters that will perfectly fit the data. The quantity <math>N-k</math> appears often in regression analysis, and is referred to as the [[Degrees of freedom (statistics)|degrees of freedom]] in the model. Moreover, to estimate a least squares model, the independent variables <math>(X_{1i}, X_{2i}, ..., X_{ki})</math> must be [[Linear independence|linearly independent]]: one must ''not'' be able to reconstruct any of the independent variables by adding and multiplying the remaining independent variables. As discussed in [[ordinary least squares]], this condition ensures that <math>X^{T}X</math> is an [[invertible matrix]] and therefore that a unique solution <math>\hat{\beta}</math> exists. A sketch of the underdetermined case follows below.
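The underdetermined case described above can be illustrated with a short numerical sketch, again assuming hypothetical numbers and the Python library NumPy: with <math>N=2</math> observations and <math>k=3</math> parameters, <math>X^{T}X</math> is singular and infinitely many coefficient vectors fit the data exactly.

<syntaxhighlight lang="python">
# A minimal sketch (hypothetical numbers): N = 2 observations, k = 3 parameters.
import numpy as np

X1 = np.array([1.0, 2.0])
X2 = np.array([3.0, 5.0])
Y  = np.array([4.0, 7.0])

# Design matrix of shape (N, k) = (2, 3): intercept column plus X1 and X2.
X = np.column_stack([np.ones_like(X1), X1, X2])

# X^T X is 3x3 but has rank at most 2, so it is singular (not invertible)
# and no unique least-squares solution exists.
print(np.linalg.matrix_rank(X.T @ X))   # prints 2, not 3

# lstsq returns one particular solution with zero residual, but adding any
# multiple of a null-space vector of X yields another exact fit.
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
null_direction = np.linalg.svd(X)[2][-1]    # a vector v with X @ v = 0
print(np.allclose(X @ (beta_hat + 5.0 * null_direction), Y))   # prints True
</syntaxhighlight>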