==Linear regression==
{{Main|Linear regression}}
{{Hatnote|See [[simple linear regression]] for a derivation of these formulas and a numerical example}}

In linear regression, the model specification is that the dependent variable, <math> y_i </math>, is a [[linear combination]] of the ''parameters'' (but need not be linear in the ''independent variables''). For example, in [[simple linear regression]] for modeling <math> n </math> data points there is one independent variable, <math> x_i </math>, and two parameters, <math>\beta_0</math> and <math>\beta_1</math>:

:straight line: <math>y_i=\beta_0 +\beta_1 x_i +\varepsilon_i,\quad i=1,\dots,n.</math>

In multiple linear regression, there are several independent variables or functions of independent variables. Adding a term in <math>x_i^2</math> to the preceding regression gives:

:parabola: <math>y_i=\beta_0 +\beta_1 x_i +\beta_2 x_i^2+\varepsilon_i,\quad i=1,\dots,n.</math>

This is still linear regression; although the expression on the right hand side is quadratic in the independent variable <math>x_i</math>, it is linear in the parameters <math>\beta_0</math>, <math>\beta_1</math> and <math>\beta_2.</math>

In both cases, <math>\varepsilon_i</math> is an error term and the subscript <math>i</math> indexes a particular observation.

Returning our attention to the straight line case: given a random sample from the population, we estimate the population parameters and obtain the sample linear regression model:

: <math> \widehat{y}_i = \widehat{\beta}_0 + \widehat{\beta}_1 x_i. </math>

The [[errors and residuals in statistics|residual]], <math> e_i = y_i - \widehat{y}_i </math>, is the difference between the true value of the dependent variable, <math>y_i</math>, and the value predicted by the model, <math> \widehat{y}_i</math>.

One method of estimation is [[ordinary least squares]]. This method obtains parameter estimates that minimize the sum of squared [[errors and residuals in statistics|residuals]], [[Residual sum of squares|SSR]]:

:<math>SSR=\sum_{i=1}^n e_i^2.</math>

Minimization of this function results in a set of [[Linear least squares (mathematics)|normal equations]], a set of simultaneous linear equations in the parameters, which are solved to yield the parameter estimators, <math>\widehat{\beta}_0, \widehat{\beta}_1</math>.

[[Image:Linear regression.svg|thumb|upright=1.3|Illustration of linear regression on a data set]]

In the case of simple regression, the formulas for the least squares estimates are

:<math>\widehat{\beta}_1=\frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sum(x_i-\bar{x})^2}</math>
:<math>\widehat{\beta}_0=\bar{y}-\widehat{\beta}_1\bar{x}</math>

where <math>\bar{x}</math> is the [[Arithmetic mean|mean]] (average) of the <math>x</math> values and <math>\bar{y}</math> is the mean of the <math>y</math> values.

Under the assumption that the population error term has a constant variance, the estimate of that variance is given by:

: <math> \hat{\sigma}^2_\varepsilon = \frac{SSR}{n-2}.</math>

This is called the [[mean square error]] (MSE) of the regression. The denominator is the sample size reduced by the number of model parameters estimated from the same data: <math>(n-p)</math> for <math>p</math> [[regressor]]s, or <math>(n-p-1)</math> if an intercept is used in addition to the regressors.<ref>Steel, R.G.D, and Torrie, J. H., ''Principles and Procedures of Statistics with Special Reference to the Biological Sciences'', [[McGraw Hill]], 1960, page 288.</ref> In this case, <math>p=1</math> so the denominator is <math>n-2</math>.
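As a rough illustration (not part of the formal development above), the closed-form estimates for the straight-line case can be reproduced numerically. The following Python sketch assumes a small, made-up data set (the arrays <code>x</code> and <code>y</code> are illustrative, not taken from this article) and uses only NumPy:

<syntaxhighlight lang="python">
import numpy as np

# Illustrative data (assumed for this sketch, not from the article)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

n = len(x)
x_bar, y_bar = x.mean(), y.mean()

# Least squares estimates for the straight-line model
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# Residuals, SSR and the mean square error
# (denominator n - 2: one regressor plus an intercept)
residuals = y - (beta0_hat + beta1_hat * x)
ssr = np.sum(residuals ** 2)
mse = ssr / (n - 2)

print(beta0_hat, beta1_hat, mse)
</syntaxhighlight>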
The [[standard error (statistics)|standard error]]s of the parameter estimates are given by

:<math>\hat\sigma_{\beta_1}=\hat\sigma_{\varepsilon} \sqrt{\frac{1}{\sum(x_i-\bar x)^2}}</math>
:<math>\hat\sigma_{\beta_0}=\hat\sigma_\varepsilon \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum(x_i-\bar x)^2}}=\hat\sigma_{\beta_1} \sqrt{\frac{\sum x_i^2}{n}}. </math>

Under the further assumption that the population error term is normally distributed, the researcher can use these estimated standard errors to create [[confidence interval]]s and conduct [[hypothesis test]]s about the [[population parameter]]s.

===General linear model===
{{Hatnote|For a derivation, see [[linear least squares]]}}
{{Hatnote|For a numerical example, see [[linear regression]]}}

In the more general multiple regression model, there are <math>p</math> independent variables:

: <math> y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i, </math>

where <math>x_{ij}</math> is the <math>i</math>-th observation on the <math>j</math>-th independent variable. If the first independent variable takes the value 1 for all <math>i</math>, <math>x_{i1} = 1</math>, then <math>\beta_1</math> is called the [[regression intercept]].

The least squares parameter estimates are obtained from <math>p</math> normal equations. The residual can be written as

:<math>e_i=y_i - \hat\beta_1 x_{i1} - \cdots - \hat\beta_p x_{ip}.</math>

The '''normal equations''' are

:<math>\sum_{i=1}^n \sum_{k=1}^p x_{ij}x_{ik}\hat\beta_k=\sum_{i=1}^n x_{ij}y_i,\quad j=1,\dots,p.</math>

In matrix notation, the normal equations are written as

:<math>\mathbf{X}^\top \mathbf{X} \,\hat{\boldsymbol{\beta}}= \mathbf{X}^\top \mathbf{Y},</math>

where the <math>ij</math> element of <math>\mathbf X</math> is <math>x_{ij}</math>, the <math>i</math> element of the column vector <math>\mathbf Y</math> is <math>y_i</math>, and the <math>j</math> element of <math>\hat{\boldsymbol{\beta}}</math> is <math>\hat \beta_j</math>. Thus <math>\mathbf X</math> is <math>n \times p</math>, <math>\mathbf Y</math> is <math>n \times 1</math>, and <math>\hat{\boldsymbol{\beta}}</math> is <math>p \times 1</math>. The solution is

:<math>\hat{\boldsymbol{\beta}}= (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{Y}.</math>

===Diagnostics===
{{main|Regression diagnostics}}
{{Category see also|Regression diagnostics}}

Once a regression model has been constructed, it may be important to confirm the [[goodness of fit]] of the model and the [[statistical significance]] of the estimated parameters. Commonly used checks of goodness of fit include the [[R-squared]], analyses of the pattern of [[errors and residuals in statistics|residuals]] and hypothesis testing. Statistical significance can be checked by an [[F-test]] of the overall fit, followed by [[t-test]]s of individual parameters.

Interpretations of these diagnostic tests rest heavily on the model's assumptions. Although examination of the residuals can be used to invalidate a model, the results of a [[t-test]] or [[F-test]] are sometimes more difficult to interpret if the model's assumptions are violated. For example, if the error term does not have a normal distribution, then in small samples the estimated parameters will not follow normal distributions, which complicates inference. With relatively large samples, however, a [[central limit theorem]] can be invoked such that hypothesis testing may proceed using asymptotic approximations.
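The matrix solution of the normal equations, and the associated standard errors, can likewise be sketched numerically. The following Python example assumes a simulated design matrix and response (all names and values are illustrative, not from the article); the intercept is included as a column of ones among the <math>p</math> regressors, so the error-variance denominator is <math>n-p</math>:

<syntaxhighlight lang="python">
import numpy as np

# Illustrative data: first column of ones so that beta_1 is the intercept
rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Solve the normal equations (X^T X) beta_hat = X^T y;
# solving the linear system is preferred to forming the explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Estimated error variance and parameter covariance matrix;
# the standard errors are the square roots of the diagonal entries
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - p)
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)
std_errors = np.sqrt(np.diag(cov_beta))

print(beta_hat, std_errors)
</syntaxhighlight>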
===Limited dependent variables===

[[Limited dependent variable]]s, which are response variables that are [[categorical variable|categorical]] or constrained to fall only in a certain range, often arise in [[econometrics]]. The response variable may be non-continuous ("limited" to lie on some subset of the real line).

For binary (zero or one) variables, if analysis proceeds with least-squares linear regression, the model is called the [[linear probability model]]. Nonlinear models for binary dependent variables include the [[probit model|probit]] and [[logistic regression|logit model]]. The [[multivariate probit]] model is a standard method of estimating a joint relationship between several binary dependent variables and some independent variables. For [[categorical variable]]s with more than two values there is the [[multinomial logit]]. For [[ordinal variable]]s with more than two values, there are the [[ordered logit]] and [[ordered probit]] models. [[Censored regression model]]s may be used when the dependent variable is only sometimes observed, and [[Heckman correction]] type models may be used when the sample is not randomly selected from the population of interest. An alternative to such procedures is linear regression based on [[polychoric correlation]] (or polyserial correlations) between the categorical variables. Such procedures differ in the assumptions made about the distribution of the variables in the population. If the dependent variable is a count, taking small non-negative integer values that record the number of occurrences of an event, then count models such as [[Poisson regression]] or the [[negative binomial]] model may be used.
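As a brief illustration of the binary case (a sketch only, assuming the <code>statsmodels</code> package is available; the data are simulated, not from the article), a logit model can be fitted as follows:

<syntaxhighlight lang="python">
import numpy as np
import statsmodels.api as sm

# Illustrative binary response generated from a logistic model
rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
X = sm.add_constant(x)                      # intercept plus one regressor
p = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * x)))  # true success probabilities
y = rng.binomial(1, p)

# Logit model for a binary dependent variable;
# sm.Probit would give the probit alternative mentioned above
logit_fit = sm.Logit(y, X).fit(disp=False)
print(logit_fit.params)
</syntaxhighlight>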