==Regression model==
In practice, researchers first select a model they would like to estimate and then use their chosen method (e.g., [[ordinary least squares]]) to estimate the parameters of that model. Regression models involve the following components:
*The '''unknown parameters''', often denoted as a [[scalar (physics)|scalar]] or [[Euclidean vector|vector]] <math>\beta</math>.
*The '''independent variables''', which are observed in data and are often denoted as a vector <math>X_i</math> (where <math>i</math> denotes a row of data).
*The '''dependent variable''', which is observed in data and often denoted using the scalar <math>Y_i</math>.
*The '''error terms''', which are ''not'' directly observed in data and are often denoted using the scalar <math>e_i</math>.

In various [[List of fields of application of statistics|fields of application]], different terminologies are used in place of [[dependent and independent variables]].

Most regression models propose that <math>Y_i</math> is a [[Function (mathematics)|function]] ('''regression function''') of <math>X_i</math> and <math>\beta</math>, with <math>e_i</math> representing an [[Errors and residuals|additive error term]] that may stand in for un-modeled determinants of <math>Y_i</math> or random statistical noise:
:<math>Y_i = f (X_i, \beta) + e_i</math>
Note that the independent variables <math>X_i</math> are assumed to be free of error. This important assumption is often overlooked, although [[errors-in-variables models]] can be used when the independent variables are assumed to contain errors.

The researchers' goal is to estimate the function <math>f(X_i, \beta)</math> that most closely fits the data. To carry out regression analysis, the form of the function <math>f</math> must be specified. Sometimes the form of this function is based on knowledge about the relationship between <math>Y_i</math> and <math>X_i</math> that does not rely on the data. If no such knowledge is available, a flexible or convenient form for <math>f</math> is chosen. For example, a simple univariate regression may propose <math>f(X_i, \beta) = \beta_0 + \beta_1 X_i</math>, suggesting that the researcher believes <math>Y_i = \beta_0 + \beta_1 X_i + e_i</math> to be a reasonable approximation for the statistical process generating the data.

Once researchers determine their preferred [[statistical model]], different forms of regression analysis provide tools to estimate the parameters <math>\beta</math>. For example, [[least squares]] (including its most common variant, [[ordinary least squares]]) finds the value of <math>\beta</math> that minimizes the sum of squared errors <math>\sum_i (Y_i - f(X_i, \beta))^2</math>. A given regression method will ultimately provide an estimate of <math>\beta</math>, usually denoted <math>\hat{\beta}</math> to distinguish the estimate from the true (unknown) parameter value that generated the data. Using this estimate, the researcher can then use the ''fitted value'' <math>\hat{Y_i} = f(X_i,\hat{\beta})</math> for prediction or to assess the accuracy of the model in explaining the data. Whether the researcher is intrinsically interested in the estimate <math>\hat{\beta}</math> or the predicted value <math>\hat{Y_i}</math> will depend on context and their goals.
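The estimation step above can be illustrated with a minimal numerical sketch, assuming hypothetical data and the Python library NumPy; it fits the univariate model <math>Y_i = \beta_0 + \beta_1 X_i + e_i</math> by minimizing the sum of squared errors and is intended only as an illustration, not a prescribed implementation.

<syntaxhighlight lang="python">
# A minimal sketch (hypothetical data): estimating beta_0 and beta_1 in the
# univariate model Y_i = beta_0 + beta_1 * X_i + e_i by ordinary least squares.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # independent variable X_i
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # dependent variable Y_i

# Design matrix with a leading column of ones for the intercept beta_0.
A = np.column_stack([np.ones_like(X), X])

# The least-squares estimate beta_hat minimizes sum_i (Y_i - f(X_i, beta))^2.
beta_hat, *_ = np.linalg.lstsq(A, Y, rcond=None)

# Fitted values Y_hat_i = f(X_i, beta_hat), usable for prediction or for
# assessing how well the model explains the data.
Y_hat = A @ beta_hat

print(beta_hat)   # approximately [0.14, 1.96] for these illustrative numbers
</syntaxhighlight>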
As described in [[ordinary least squares]], least squares is widely used because the estimated function <math>f(X_i, \hat{\beta})</math> approximates the [[conditional expectation]] <math>E(Y_i|X_i)</math>.<ref name="Gauss" /> However, alternative variants (e.g., [[least absolute deviations]] or [[quantile regression]]) are useful when researchers want to model other functions <math>f(X_i,\beta)</math>.

It is important to note that there must be sufficient data to estimate a regression model. For example, suppose that a researcher has access to <math>N</math> rows of data with one dependent and two independent variables: <math>(Y_i, X_{1i}, X_{2i})</math>. Suppose further that the researcher wants to estimate a bivariate linear model via [[least squares]]: <math>Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + e_i</math>. If the researcher only has access to <math>N=2</math> data points, then they could find infinitely many combinations <math>(\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2)</math> that explain the data equally well: any combination can be chosen that satisfies <math>\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i}</math>, all of which lead to <math>\sum_i \hat{e}_i^2 = \sum_i (\hat{Y}_i - (\hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i}))^2 = 0</math> and are therefore valid solutions that minimize the sum of squared [[Errors and residuals|residuals]]. To understand why there are infinitely many options, note that the system of <math>N=2</math> equations is to be solved for 3 unknowns, which makes the system [[Underdetermined system|underdetermined]]. Alternatively, one can visualize infinitely many planes in three-dimensional space that pass through <math>N=2</math> fixed points.

More generally, to estimate a [[least squares]] model with <math>k</math> distinct parameters, one must have <math>N \geq k</math> distinct data points. If <math>N > k</math>, then there does not generally exist a set of parameters that will perfectly fit the data. The quantity <math>N-k</math> appears often in regression analysis, and is referred to as the [[Degrees of freedom (statistics)|degrees of freedom]] in the model. Moreover, to estimate a least squares model, the independent variables <math>(X_{1i}, X_{2i}, ..., X_{ki})</math> must be [[Linear independence|linearly independent]]: one must ''not'' be able to reconstruct any of the independent variables by adding and multiplying the remaining independent variables. As discussed in [[ordinary least squares]], this condition ensures that <math>X^{T}X</math> is an [[invertible matrix]] and therefore that a unique solution <math>\hat{\beta}</math> exists. A sketch of the underdetermined case follows below.
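The underdetermined case described above can be illustrated with a short numerical sketch, again assuming hypothetical numbers and the Python library NumPy: with <math>N=2</math> observations and <math>k=3</math> parameters, <math>X^{T}X</math> is singular and infinitely many coefficient vectors fit the data exactly.

<syntaxhighlight lang="python">
# A minimal sketch (hypothetical numbers): N = 2 observations, k = 3 parameters.
import numpy as np

X1 = np.array([1.0, 2.0])
X2 = np.array([3.0, 5.0])
Y  = np.array([4.0, 7.0])

# Design matrix of shape (N, k) = (2, 3): intercept column plus X1 and X2.
X = np.column_stack([np.ones_like(X1), X1, X2])

# X^T X is 3x3 but has rank at most 2, so it is singular (not invertible)
# and no unique least-squares solution exists.
print(np.linalg.matrix_rank(X.T @ X))   # prints 2, not 3

# lstsq returns one particular solution with zero residual, but adding any
# multiple of a null-space vector of X yields another exact fit.
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
null_direction = np.linalg.svd(X)[2][-1]    # a vector v with X @ v = 0
print(np.allclose(X @ (beta_hat + 5.0 * null_direction), Y))   # prints True
</syntaxhighlight>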