== Model components ==

The GLM consists of three elements:
# A particular distribution for modeling <math> Y </math> from among those which are considered exponential families of probability distributions,
# A linear predictor <math>\eta = X \beta</math>, and
# A link function <math>g</math> such that <math>\operatorname{E}(Y \mid X) = \mu = g^{-1}(\eta)</math>.

=== Probability distribution ===

An '''overdispersed exponential family''' of distributions is a generalization of an [[exponential family]] and the [[exponential dispersion model]] of distributions and includes those families of probability distributions, parameterized by <math>\boldsymbol\theta</math> and <math>\tau</math>, whose density functions ''f'' (or [[probability mass function]]s, in the case of a [[discrete distribution]]) can be expressed in the form

:<math> f_Y(\mathbf{y} \mid \boldsymbol\theta, \tau) = h(\mathbf{y},\tau) \exp \left(\frac{\mathbf{b}(\boldsymbol\theta)^{\rm T}\mathbf{T}(\mathbf{y}) - A(\boldsymbol\theta)} {d(\tau)} \right). \,\!</math>

The ''dispersion parameter'', <math>\tau</math>, is typically known and is usually related to the variance of the distribution. The functions <math>h(\mathbf{y},\tau)</math>, <math>\mathbf{b}(\boldsymbol\theta)</math>, <math>\mathbf{T}(\mathbf{y})</math>, <math>A(\boldsymbol\theta)</math>, and <math>d(\tau)</math> are known. Many common distributions are in this family, including the normal, exponential, gamma, Poisson, Bernoulli, and (for a fixed number of trials) binomial, multinomial, and negative binomial.

For scalar <math>\mathbf{y}</math> and <math>\boldsymbol\theta</math> (denoted <math>y</math> and <math>\theta</math> in this case), this reduces to

:<math> f_Y(y \mid \theta, \tau) = h(y,\tau) \exp \left(\frac{b(\theta)T(y) - A(\theta)}{d(\tau)} \right). \,\!</math>

<math>\boldsymbol\theta</math> is related to the mean of the distribution.
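For concreteness, the Poisson distribution fits this scalar form with <math>\theta = \ln \lambda</math>, <math>b(\theta) = \theta</math>, <math>T(y) = y</math>, <math>A(\theta) = e^\theta</math>, <math>h(y, \tau) = 1/y!</math>, and <math>d(\tau) = 1</math>. A minimal Python sketch (the function names are illustrative, not from any library) checks that the exponential-family form reproduces the textbook probability mass function:

```python
import math

def poisson_pmf_standard(y, lam):
    """Poisson pmf in its textbook form: lam^y * exp(-lam) / y!."""
    return lam**y * math.exp(-lam) / math.factorial(y)

def poisson_pmf_expfam(y, theta):
    """Same pmf written as h(y) * exp((b(theta)*T(y) - A(theta)) / d(tau)),
    with theta = log(lam), b(theta) = theta, T(y) = y,
    A(theta) = exp(theta), h(y) = 1/y!, and d(tau) = 1."""
    h = 1.0 / math.factorial(y)
    return h * math.exp(theta * y - math.exp(theta))

lam = 3.5
for y in range(10):
    assert abs(poisson_pmf_standard(y, lam)
               - poisson_pmf_expfam(y, math.log(lam))) < 1e-12
```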
If <math>\mathbf{b}(\boldsymbol\theta)</math> is the identity function, then the distribution is said to be in [[canonical form]] (or ''natural form''). Note that any distribution can be converted to canonical form by rewriting <math>\boldsymbol\theta</math> as <math>\boldsymbol\theta'</math> and then applying the transformation <math>\boldsymbol\theta = \mathbf{b}(\boldsymbol\theta')</math>. It is always possible to express <math>A(\boldsymbol\theta)</math> in terms of the new parametrization, even if <math>\mathbf{b}(\boldsymbol\theta')</math> is not a [[one-to-one function]]; see the comments in the article on [[exponential families]].

If, in addition, <math>\mathbf{T}(\mathbf{y})</math> and <math>\mathbf{b}(\boldsymbol\theta)</math> are the identity, then <math>\boldsymbol\theta</math> is called the ''canonical parameter'' (or ''natural parameter'') and is related to the mean through

:<math> \boldsymbol\mu = \operatorname{E}(\mathbf{y}) = \nabla_{\boldsymbol{\theta}} A(\boldsymbol\theta). \,\!</math>

For scalar <math>\mathbf{y}</math> and <math>\boldsymbol\theta</math>, this reduces to

:<math> \mu = \operatorname{E}(y) = A'(\theta).</math>

Under this scenario, the variance of the distribution can be shown to be<ref>{{harvnb|McCullagh|Nelder|1989}}, Chapter 2.</ref>

:<math>\operatorname{Var}(\mathbf{y}) = \nabla^2_{\boldsymbol{\theta}} A(\boldsymbol\theta)d(\tau). \,\!</math>

For scalar <math>\mathbf{y}</math> and <math>\boldsymbol\theta</math>, this reduces to

:<math>\operatorname{Var}(y) = A''(\theta) d(\tau). \,\!</math>

=== Linear predictor ===

The linear predictor is the quantity that incorporates the information about the independent variables into the model. The symbol ''η'' ([[Greek alphabet|Greek]] "[[Eta (letter)|eta]]") denotes a linear predictor. It is related to the [[expected value]] of the data through the link function. ''η'' is expressed as a linear combination (thus, "linear") of unknown parameters '''''β'''''.
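These derivative identities can be checked numerically. For the Poisson family, <math>A(\theta) = e^\theta</math> with <math>d(\tau) = 1</math>, so both <math>A'(\theta)</math> and <math>A''(\theta)</math> should equal <math>\lambda = e^\theta</math>, recovering the familiar fact that a Poisson variable's mean and variance coincide. A short Python sketch (finite differences, not from any library) illustrates this:

```python
import math

def A(theta):
    """Log-partition function for the Poisson family: A(theta) = exp(theta)."""
    return math.exp(theta)

lam = 3.5
theta = math.log(lam)  # canonical (natural) parameter
eps = 1e-5

# numerical first derivative: mean = A'(theta)
mean = (A(theta + eps) - A(theta - eps)) / (2 * eps)
# numerical second derivative: variance = A''(theta) * d(tau), with d(tau) = 1
var = (A(theta + eps) - 2 * A(theta) + A(theta - eps)) / eps**2

# for the Poisson, mean and variance both equal lam
assert abs(mean - lam) < 1e-4
assert abs(var - lam) < 1e-2
```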
The coefficients of the linear combination are represented as the matrix of independent variables '''X'''. ''η'' can thus be expressed as

:<math> \eta = \mathbf{X}\boldsymbol{\beta}.\,</math>

=== Link function ===

The link function provides the relationship between the linear predictor and the [[Expected value|mean]] of the distribution function. There are many commonly used link functions, and their choice is informed by several considerations. There is always a well-defined ''canonical'' link function which is derived from the exponential of the response's [[density function]]. However, in some cases it makes sense to try to match the [[Domain of a function|domain]] of the link function to the [[range of a function|range]] of the distribution function's mean, or to use a non-canonical link function for algorithmic purposes, for example [[Probit model#Gibbs sampling|Bayesian probit regression]].

When using a distribution function with a canonical parameter <math>\theta,</math> the canonical link function is the function that expresses <math>\theta</math> in terms of <math>\mu,</math> i.e. <math>\theta = g(\mu).</math> For the most common distributions, the mean <math>\mu</math> is one of the parameters in the standard form of the distribution's [[density function]], and then <math>g(\mu)</math> is the function as defined above that maps the density function into its canonical form. When using the canonical link function, <math>g(\mu) = \theta = \mathbf{X}\boldsymbol{\beta},</math> which allows <math>\mathbf{X}^{\rm T} \mathbf{Y}</math> to be a [[sufficiency (statistics)|sufficient statistic]] for <math>\boldsymbol{\beta}</math>.

Following is a table of several exponential-family distributions in common use and the data they are typically used for, along with the canonical link functions and their inverses (sometimes referred to as the mean function, as done here).

{| class="wikitable"
|+ Common distributions with typical uses and canonical link functions
! Distribution !! Support of distribution !! Typical uses !! Link name !! Link function, <math>\mathbf{X}\boldsymbol{\beta}=g(\mu)\,\!</math> !! Mean function
|-
| [[normal distribution|Normal]]
| rowspan="2" | real: <math>(-\infty,+\infty)</math>
| rowspan="2" | Linear-response data
| rowspan="2" | Identity
| rowspan="2" | <math>\mathbf{X}\boldsymbol{\beta}=\mu\,\!</math>
| rowspan="2" | <math>\mu=\mathbf{X}\boldsymbol{\beta}\,\!</math>
|-
| [[Laplace distribution|Laplace]]
|-
| [[exponential distribution|Exponential]]
| rowspan="2" | real: <math>(0,+\infty)</math>
| rowspan="2" | Exponential-response data, scale parameters
| rowspan="2" | [[Multiplicative inverse|Negative inverse]]
| rowspan="2" | <math>\mathbf{X}\boldsymbol{\beta}=-\mu^{-1}\,\!</math>
| rowspan="2" | <math>\mu=-(\mathbf{X}\boldsymbol{\beta})^{-1}\,\!</math>
|-
| [[gamma distribution|Gamma]]
|-
| [[Inverse Gaussian distribution|Inverse <br>Gaussian]]
| real: <math>(0, +\infty)</math>
|
| Inverse <br>squared
| <math>\mathbf{X}\boldsymbol{\beta}=\mu^{-2}\,\!</math>
| <math>\mu=(\mathbf{X}\boldsymbol{\beta})^{-1/2}\,\!</math>
|-
| [[Poisson distribution|Poisson]]
| integer: <math>0,1,2,\ldots</math>
| count of occurrences in fixed amount of time/space
| [[Natural logarithm|Log]]
| <math>\mathbf{X}\boldsymbol{\beta} = \ln(\mu) \,\!</math>
| <math>\mu=\exp (\mathbf{X}\boldsymbol{\beta}) \,\!</math>
|-
| [[Bernoulli distribution|Bernoulli]]
| integer: <math>\{0,1\}</math>
| outcome of single yes/no occurrence
| rowspan="5" | [[Logit]]
| <math>\mathbf{X}\boldsymbol{\beta}=\ln \left(\frac \mu {1-\mu}\right) \,\!</math>
| rowspan="5" | <math>\mu=\frac{\exp(\mathbf{X}\boldsymbol{\beta})}{1 + \exp(\mathbf{X}\boldsymbol{\beta})} = \frac 1 {1 + \exp(-\mathbf{X} \boldsymbol{\beta})} \,\!</math>
|-
| [[binomial distribution|Binomial]]
| integer: <math>0,1,\ldots,N</math>
| count of # of "yes" occurrences out of ''N'' yes/no occurrences
| <math>\mathbf{X}\boldsymbol{\beta}=\ln \left(\frac \mu {n-\mu}\right) \,\!</math>
|-
| rowspan="2" | [[categorical distribution|Categorical]]
| integer: <math>[0,K)</math>
| rowspan="2" | outcome of single ''K''-way occurrence
| rowspan="3" | <math>\mathbf{X}\boldsymbol{\beta}=\ln \left(\frac \mu {1-\mu}\right) \,\!</math>
|-
| ''K''-vector of integer: <math>[0,1]</math>, where exactly one element in the vector has the value 1
|-
| [[multinomial distribution|Multinomial]]
| ''K''-vector of integer: <math>[0,N]</math>
| count of occurrences of different types (1, ..., ''K'') out of ''N'' total ''K''-way occurrences
|}

In the cases of the exponential and gamma distributions, the domain of the canonical link function is not the same as the permitted range of the mean. In particular, the linear predictor may be positive, which would give an impossible negative mean. When maximizing the likelihood, precautions must be taken to avoid this. An alternative is to use a noncanonical link function.

In the case of the Bernoulli, binomial, categorical and multinomial distributions, the support of the distributions is not the same type of data as the parameter being predicted. In all of these cases, the predicted parameter is one or more probabilities, i.e. real numbers in the range <math>[0,1]</math>. The resulting model is known as ''[[logistic regression]]'' (or ''[[multinomial logistic regression]]'' in the case that ''K''-way rather than binary values are being predicted).

For the Bernoulli and binomial distributions, the parameter is a single probability, indicating the likelihood of occurrence of a single event. The Bernoulli still satisfies the basic condition of the generalized linear model in that, even though a single outcome will always be either 0 or 1, the ''[[expected value]]'' will nonetheless be a real-valued probability, i.e. the probability of occurrence of a "yes" (or 1) outcome. Similarly, in a binomial distribution, the expected value is ''Np'', i.e. the expected proportion of "yes" outcomes will be the probability to be predicted.
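The Bernoulli case can be made concrete with a small Python sketch (the data values here are hypothetical, chosen only for illustration): the logit link maps a probability to the linear-predictor scale, and its inverse (the logistic mean function from the table) maps a linear predictor back to a probability in <math>(0,1)</math>:

```python
import math

def logit(mu):
    """Canonical link for the Bernoulli family: eta = ln(mu / (1 - mu))."""
    return math.log(mu / (1.0 - mu))

def logistic(eta):
    """Inverse of the logit (the mean function): mu = 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + math.exp(-eta))

# a toy linear predictor eta = x . beta for one observation
x = [1.0, 2.0, -0.5]       # first entry is an intercept term
beta = [0.3, 0.8, 1.1]     # hypothetical coefficients
eta = sum(xi * bi for xi, bi in zip(x, beta))

mu = logistic(eta)         # predicted probability, always strictly in (0, 1)
assert 0.0 < mu < 1.0
assert abs(logit(mu) - eta) < 1e-12  # link and mean function are inverses
```

Whatever real values the linear predictor takes, the mean function returns a valid probability, which is why the logit is a natural choice for yes/no outcomes.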
For categorical and multinomial distributions, the parameter to be predicted is a ''K''-vector of probabilities, with the further restriction that all probabilities must add up to 1. Each probability indicates the likelihood of occurrence of one of the ''K'' possible values. For the multinomial distribution, and for the vector form of the categorical distribution, the expected values of the elements of the vector can be related to the predicted probabilities similarly to the binomial and Bernoulli distributions.