==Likelihoods that eliminate nuisance parameters==
In many cases, the likelihood is a function of more than one parameter but interest focuses on the estimation of only one, or at most a few of them, with the others being considered as [[nuisance parameter]]s. Several alternative approaches have been developed to eliminate such nuisance parameters, so that a likelihood can be written as a function of only the parameter (or parameters) of interest: the main approaches are profile, conditional, and marginal likelihoods.<ref>{{cite book |title=In All Likelihood: Statistical Modelling and Inference Using Likelihood |first=Yudi |last=Pawitan |year=2001 |publisher=[[Oxford University Press]] }}</ref><ref>{{cite web |author=Wen Hsiang Wei |url=http://web.thu.edu.tw/wenwei/www/glmpdfmargin.htm |title=Generalized Linear Model - course notes |pages=Chapter 5 |publisher=[[Tunghai University]] |location=Taichung, Taiwan |access-date=2017-10-01 }}</ref> These approaches are also useful when a high-dimensional likelihood surface needs to be reduced to one or two parameters of interest in order to allow a [[Graph of a function|graph]].

===Profile likelihood===
It is possible to reduce the dimensions by concentrating the likelihood function for a subset of parameters by expressing the nuisance parameters as functions of the parameters of interest and replacing them in the likelihood function.<ref>{{cite book |first=Takeshi |last=Amemiya |author-link=Takeshi Amemiya |title=Advanced Econometrics |chapter=Concentrated Likelihood Function |location=Cambridge |publisher=Harvard University Press |year=1985 |pages=[https://archive.org/details/advancedeconomet00amem/page/125 125–127] |isbn=978-0-674-00560-0 |chapter-url=https://books.google.com/books?id=0bzGQE14CwEC&pg=PA125 |url-access=registration |url=https://archive.org/details/advancedeconomet00amem/page/125 }}</ref><ref>{{cite book |first1=Russell |last1=Davidson |first2=James G. |last2=MacKinnon |author-link2=James G. MacKinnon |title=Estimation and Inference in Econometrics |chapter=Concentrating the Loglikelihood Function |location=New York |publisher=Oxford University Press |year=1993 |pages=267–269 |isbn=978-0-19-506011-9 }}</ref> In general, for a likelihood function depending on the parameter vector <math display="inline">\mathbf{\theta}</math> that can be partitioned into <math display="inline">\mathbf{\theta} = \left( \mathbf{\theta}_{1} : \mathbf{\theta}_{2} \right)</math>, and where a correspondence <math display="inline">\mathbf{\hat{\theta}}_{2} = \mathbf{\hat{\theta}}_{2} \left( \mathbf{\theta}_{1} \right)</math> can be determined explicitly, concentration reduces the [[Computational complexity|computational burden]] of the original maximization problem.<ref>{{cite book |first1=Christian |last1=Gourieroux |first2=Alain |last2=Monfort |title=Statistics and Econometric Models |chapter=Concentrated Likelihood Function |location=New York |publisher=Cambridge University Press |year=1995 |isbn=978-0-521-40551-5 |pages=170–175 |chapter-url=https://books.google.com/books?id=gqI-pAP2JZ8C&pg=PA170 }}</ref>

For instance, in a [[linear regression]] with normally distributed errors, <math display="inline">\mathbf{y} = \mathbf{X} \beta + u</math>, the coefficient vector could be [[Partition of a set|partitioned]] into <math display="inline">\beta = \left[ \beta_{1} : \beta_{2} \right]</math> (and consequently the [[design matrix]] <math display="inline">\mathbf{X} = \left[ \mathbf{X}_{1} : \mathbf{X}_{2} \right]</math>). Maximizing with respect to <math display="inline">\beta_{2}</math> yields an optimal value function <math display="inline">\beta_{2} (\beta_{1}) = \left( \mathbf{X}_{2}^{\mathsf{T}} \mathbf{X}_{2} \right)^{-1} \mathbf{X}_{2}^{\mathsf{T}} \left( \mathbf{y} - \mathbf{X}_{1} \beta_{1} \right)</math>. Using this result, the maximum likelihood estimator for <math display="inline">\beta_{1}</math> can then be derived as
<math display="block">\hat{\beta}_{1} = \left( \mathbf{X}_{1}^{\mathsf{T}} \left( \mathbf{I} - \mathbf{P}_{2} \right) \mathbf{X}_{1} \right)^{-1} \mathbf{X}_{1}^{\mathsf{T}} \left( \mathbf{I} - \mathbf{P}_{2} \right) \mathbf{y}</math>
where <math display="inline">\mathbf{P}_{2} = \mathbf{X}_{2} \left( \mathbf{X}_{2}^{\mathsf{T}} \mathbf{X}_{2} \right)^{-1} \mathbf{X}_{2}^{\mathsf{T}}</math> is the [[projection matrix]] onto the column space of <math display="inline">\mathbf{X}_{2}</math>. This result is known as the [[Frisch–Waugh–Lovell theorem]].

Graphically, the procedure of concentration is equivalent to slicing the likelihood surface along the ridge of values of the nuisance parameter <math display="inline">\beta_{2}</math> that maximize the likelihood function, creating an [[Contour line|isometric]] [[Topographic profile|profile]] of the likelihood function for a given <math display="inline">\beta_{1}</math>; the result of this procedure is therefore also known as the ''profile likelihood''.<ref>{{citation |first=Andrew |last=Pickles |title=An Introduction to Likelihood Analysis |location=Norwich |publisher=W. H. Hutchins & Sons |year=1985 |isbn=0-86094-190-6 |pages=[https://archive.org/details/introductiontoli0000pick/page/21 21–24] |mode=cs1 |url=https://archive.org/details/introductiontoli0000pick/page/21 }}</ref><ref>{{cite book |first=Benjamin M. |last=Bolker |title=Ecological Models and Data in R |publisher=Princeton University Press |year=2008 |isbn=978-0-691-12522-0 |pages=187–189 |url=https://books.google.com/books?id=flyBd1rpqeoC&pg=PA188 }}</ref> In addition to being graphed, the profile likelihood can also be used to compute [[confidence interval]]s that often have better small-sample properties than those based on asymptotic [[Standard error (statistics)|standard errors]] calculated from the full likelihood.<ref>{{citation |last=Aitkin |first=Murray |title=GLIM 82: Proceedings of the International Conference on Generalised Linear Models |pages=76–86 |year=1982 |chapter=Direct Likelihood Inference |publisher=Springer |isbn=0-387-90777-7 |author-link=Murray Aitkin |mode=cs1 }}</ref><ref>{{citation |first1=D. J. |last1=Venzon |first2=S. H. |last2=Moolgavkar |title=A Method for Computing Profile-Likelihood-Based Confidence Intervals |journal=[[Journal of the Royal Statistical Society]] |series=Series C (Applied Statistics) |volume=37 |issue=1 |year=1988 |pages=87–94 |doi=10.2307/2347496 |jstor=2347496 |mode=cs1 }}</ref>
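The equivalence between the concentrated (profile) estimator above and the corresponding block of the full maximum likelihood estimator can be checked numerically. The following Python sketch uses simulated data and illustrative variable names, none of which come from the cited sources: it forms the projection matrix <math display="inline">\mathbf{P}_{2}</math>, applies the concentrated formula for <math display="inline">\hat{\beta}_{1}</math>, and compares the result with ordinary least squares on the full design matrix.

<syntaxhighlight lang="python">
# Minimal numerical check of the concentrated (profile) likelihood estimator
# for linear regression, using the Frisch-Waugh-Lovell formula above.
# Data are simulated; all names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 200
X1 = rng.normal(size=(n, 2))          # columns of interest
X2 = rng.normal(size=(n, 3))          # nuisance columns
beta1_true = np.array([1.0, -2.0])
beta2_true = np.array([0.5, 0.0, 3.0])
y = X1 @ beta1_true + X2 @ beta2_true + rng.normal(size=n)

# Projection matrix onto the column space of X2, and its annihilator I - P2
P2 = X2 @ np.linalg.inv(X2.T @ X2) @ X2.T
M2 = np.eye(n) - P2

# Concentrated-likelihood (profile) estimator of beta_1
beta1_profile = np.linalg.solve(X1.T @ M2 @ X1, X1.T @ M2 @ y)

# Full maximum likelihood (ordinary least squares) on [X1 : X2]
X = np.hstack([X1, X2])
beta_full = np.linalg.solve(X.T @ X, X.T @ y)

print(beta1_profile)    # matches the first two entries of beta_full
print(beta_full[:2])
</syntaxhighlight>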
===Conditional likelihood===
Sometimes it is possible to find a [[sufficient statistic]] for the nuisance parameters, and conditioning on this statistic results in a likelihood which does not depend on the nuisance parameters.<ref>{{cite journal |first1=J. D. |last1=Kalbfleisch |first2=D. A. |last2=Sprott |title=Marginal and Conditional Likelihoods |journal=Sankhyā: The Indian Journal of Statistics |series=Series A |volume=35 |issue=3 |year=1973 |pages=311–328 |jstor=25049882 }}</ref> One example occurs in 2×2 tables, where conditioning on all four marginal totals leads to a conditional likelihood based on the non-central [[hypergeometric distribution]]. This form of conditioning is also the basis for [[Fisher's exact test]].
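As a numerical illustration of this conditioning, the following Python sketch evaluates the conditional likelihood of the odds ratio in a 2×2 table given all four margins, which is proportional to a non-central hypergeometric probability; the table, grid, and function names are illustrative assumptions, not taken from the cited sources. Fisher's exact test evaluates the same conditional distribution at an odds ratio of one.

<syntaxhighlight lang="python">
# Sketch of a conditional likelihood for the odds ratio psi in a 2x2 table,
# conditioning on all four marginal totals (non-central hypergeometric form).
# The table and grid are illustrative.
import numpy as np
from scipy.special import comb
from scipy.stats import fisher_exact

table = np.array([[8, 2],
                  [1, 5]])
a = table[0, 0]
row1, row2 = table.sum(axis=1)     # fixed row totals
col1 = table[:, 0].sum()           # fixed first-column total

def conditional_likelihood(psi):
    """L(psi | margins): proportional to the non-central hypergeometric pmf at a."""
    u = np.arange(max(0, col1 - row2), min(row1, col1) + 1)   # support of the cell count
    weights = comb(row1, u) * comb(row2, col1 - u) * psi**u
    return comb(row1, a) * comb(row2, col1 - a) * psi**a / weights.sum()

# Conditional maximum likelihood estimate of the odds ratio by grid search
grid = np.linspace(0.01, 100, 10000)
psi_hat = grid[np.argmax([conditional_likelihood(p) for p in grid])]

# Fisher's exact test uses the same conditional distribution at psi = 1
_, p_value = fisher_exact(table)
print(psi_hat, p_value)
</syntaxhighlight>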
===Marginal likelihood===
{{Main|Marginal likelihood}}
Sometimes we can remove the nuisance parameters by considering a likelihood based on only part of the information in the data, for example by using the set of ranks rather than the numerical values. Another example occurs in linear [[mixed model]]s, where considering a likelihood for the residuals only after fitting the fixed effects leads to [[residual maximum likelihood]] estimation of the variance components.

===Partial likelihood===
A partial likelihood is an adaptation of the full likelihood such that only a part of the parameters (the parameters of interest) occur in it.<ref>{{citation |last=Cox |first=D. R. |author-link=David Cox (statistician) |title=Partial likelihood |journal=[[Biometrika]] |year=1975 |volume=62 |issue=2 |pages=269–276 |doi=10.1093/biomet/62.2.269 |mr=0400509 |mode=cs1 }}</ref> It is a key component of the [[proportional hazards model]]: using a restriction on the hazard function, the likelihood does not contain the shape of the hazard over time.
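As an illustration, the Cox partial log-likelihood can be written down and maximized directly. The following Python sketch uses simulated survival data without tied event times; the data-generating process and all names are illustrative assumptions rather than material from the cited source. The baseline hazard, that is, the shape of the hazard over time, never enters the function being maximized.

<syntaxhighlight lang="python">
# Sketch of the Cox proportional-hazards partial log-likelihood on simulated
# survival data (no tied event times); names and data are illustrative.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 300
x = rng.normal(size=(n, 2))                 # covariates
beta_true = np.array([0.8, -0.5])
# Exponential event times with rate exp(x @ beta); independent censoring
t_event = rng.exponential(1.0 / np.exp(x @ beta_true))
t_cens = rng.exponential(2.0, size=n)
time = np.minimum(t_event, t_cens)
event = (t_event <= t_cens).astype(float)   # 1 = observed event, 0 = censored

def neg_partial_loglik(beta):
    """Negative Cox partial log-likelihood; the baseline hazard has dropped out."""
    eta = x @ beta
    # Risk set for subject i: everyone whose observed time is >= time[i]
    at_risk = time[None, :] >= time[:, None]
    log_risk = np.log((at_risk * np.exp(eta)[None, :]).sum(axis=1))
    return -np.sum(event * (eta - log_risk))

fit = minimize(neg_partial_loglik, x0=np.zeros(2), method="BFGS")
print(fit.x)   # close to beta_true; no baseline-hazard parameters appear
</syntaxhighlight>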