{{Short description|Function related to statistics and probability theory}} {{Bayesian statistics}} A '''likelihood function''' (often simply called the '''likelihood''') measures how well a [[statistical model]] explains [[Realization (probability)|observed data]] by calculating the probability of seeing that data under different [[Statistical parameter|parameter]] values of the model. It is constructed from the [[joint probability distribution]] of the [[random variable]] that (presumably) generated the observations.<ref>{{cite book |first1=George |last1=Casella |first2=Roger L. |last2=Berger |title=Statistical Inference |location= |publisher=Duxbury |edition=2nd |year=2002 |isbn=0-534-24312-6 |page=290 }}</ref><ref>{{cite book |first=Jon |last=Wakefield |title=Frequentist and Bayesian Regression Methods |location= |publisher=Springer |edition=1st |year=2013 |isbn=978-1-4419-0925-1 |page=36 }}</ref><ref>{{cite book |first1 = Erich L. |last1=Lehmann | first2 = George |last2 = Casella |title=Theory of Point Estimation |location= |publisher=Springer |edition=2nd |year=1998 |isbn= 0-387-98502-6 |page=444 }}</ref> When evaluated on the actual data points, it becomes a function solely of the model parameters. In [[maximum likelihood estimation]], the [[arg max|argument that maximizes]] the likelihood function serves as a [[Point estimation|point estimate]] for the unknown parameter, while the [[Fisher information]] (often approximated by the likelihood's [[Hessian matrix]] at the maximum) gives an indication of the estimate's [[Precision (statistics)|precision]]. In contrast, in [[Bayesian statistics]], the estimate of interest is the ''converse'' of the likelihood, the so-called [[posterior probability]] of the parameter given the observed data, which is calculated via [[Bayes' theorem|Bayes' rule]].<ref>{{cite book |first=Arnold |last=Zellner |title=An Introduction to Bayesian Inference in Econometrics |location=New York |publisher=Wiley |year=1971 |pages=13–14 |isbn=0-471-98165-6 }}</ref> ==Definition== The likelihood function, parameterized by a (possibly multivariate) parameter <math display="inline">\theta</math>, is usually defined differently for [[Continuous or discrete variable|discrete and continuous]] [[Probability distribution|probability distributions]] (a more general definition is discussed below). Given a probability density or mass function <math display="block">x\mapsto f(x \mid \theta),</math> where <math display="inline">x</math> is a realization of the random variable <math display="inline">X</math>, the likelihood function is <math display="block">\theta\mapsto f(x \mid \theta),</math> often written <math display="block">\mathcal{L}(\theta \mid x). </math> In other words, when <math display="inline">f(x\mid\theta)</math> is viewed as a function of <math display="inline">x</math> with <math display="inline">\theta</math> fixed, it is a probability density function, and when viewed as a function of <math display="inline">\theta</math> with <math display="inline">x</math> fixed, it is a likelihood function. In the [[Frequentist_probability|frequentist paradigm]], the notation <math display="inline">f(x\mid\theta)</math> is often avoided and instead <math display="inline">f(x;\theta)</math> or <math display="inline">f(x,\theta)</math> are used to indicate that <math display="inline">\theta</math> is regarded as a fixed unknown quantity rather than as a [[random variable]] being conditioned on. 
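The dual reading of <math display="inline">f(x\mid\theta)</math> can be illustrated numerically. The following sketch (in Python, using only the standard library; the normal model, the candidate parameter values, and the observed value 1.3 are illustrative assumptions, not part of the definition) evaluates the same density once as a function of <math display="inline">x</math> and once as a function of <math display="inline">\theta</math>:
<syntaxhighlight lang="python">
import math

def normal_density(x, mu, sigma=1.0):
    """f(x | theta) for a Normal(mu, sigma^2) model, with theta = (mu, sigma)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

# Viewed as a function of x with theta fixed: a probability density function.
density_curve = [normal_density(x, mu=0.0) for x in (-1.0, 0.0, 1.0)]

# Viewed as a function of theta with the observation x fixed: a likelihood function.
x_observed = 1.3
likelihood_curve = {mu: normal_density(x_observed, mu) for mu in (0.0, 1.0, 2.0)}
# The likelihood is highest at the candidate mu closest to the observation (mu = 1.0 here),
# even though none of these candidate values need be the "true" parameter.
</syntaxhighlight>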
The likelihood function does ''not'' specify the probability that <math display="inline">\theta</math> is the truth, given the observed sample <math display="inline">X = x</math>. Such an interpretation is a common error, with potentially disastrous consequences (see [[prosecutor's fallacy]]). ===Discrete probability distribution=== Let <math display="inline">X</math> be a discrete [[random variable]] with [[probability mass function]] <math display="inline">p</math> depending on a parameter <math display="inline">\theta</math>. Then the function <math display="block">\mathcal{L}(\theta \mid x) = p_\theta (x) = P_\theta (X=x), </math> considered as a function of <math display="inline">\theta</math>, is the ''likelihood function'', given the [[Outcome (probability)|outcome]] <math display="inline">x</math> of the random variable <math display="inline">X</math>. Sometimes the probability of "the value <math display="inline">x</math> of <math display="inline">X</math> for the parameter value <math display="inline">\theta</math>{{resize|20%| }}" is written as {{math|''P''(''X'' {{=}} ''x'' {{!}} ''θ'')}} or {{math|''P''(''X'' {{=}} ''x''; ''θ'')}}. The likelihood is the probability that a particular outcome <math display="inline">x</math> is observed when the true value of the parameter is <math display="inline">\theta</math>, equivalent to the probability mass on <math display="inline">x</math>; it is ''not'' a probability density over the parameter <math display="inline">\theta</math>. The likelihood, <math display="inline">\mathcal{L}(\theta \mid x) </math>, should not be confused with <math display="inline">P(\theta \mid x)</math>, which is the posterior probability of <math display="inline">\theta</math> given the data <math display="inline">x</math>. ====Example==== [[Image:likelihoodFunctionAfterHH.png|thumb|400px|Figure 1. The likelihood function (<math display="inline">p_\text{H}^2</math>) for the probability of a coin landing heads-up (without prior knowledge of the coin's fairness), given that we have observed HH.]] [[Image:likelihoodFunctionAfterHHT.png|thumb|400px|Figure 2. The likelihood function (<math display="inline">p_\text{H}^2(1-p_\text{H})</math>) for the probability of a coin landing heads-up (without prior knowledge of the coin's fairness), given that we have observed HHT.]] Consider a simple statistical model of a coin flip: a single parameter <math display="inline">p_\text{H}</math> that expresses the "fairness" of the coin. The parameter is the probability that a coin lands heads up ("H") when tossed. <math display="inline">p_\text{H}</math> can take on any value within the range 0.0 to 1.0. For a perfectly [[fair coin]], <math display="inline">p_\text{H} = 0.5</math>. Imagine flipping a fair coin twice, and observing two heads in two tosses ("HH"). Assuming that each successive coin flip is [[Independent and identically distributed random variables|i.i.d.]], then the probability of observing HH is <math display="block">P(\text{HH} \mid p_\text{H}=0.5) = 0.5^2 = 0.25.</math> Equivalently, the likelihood of observing "HH" assuming <math display="inline">p_\text{H} = 0.5</math> is <math display="block">\mathcal{L}(p_\text{H}=0.5 \mid \text{HH}) = 0.25.</math> This is not the same as saying that <math display="inline">P(p_\text{H} = 0.5 \mid HH) = 0.25</math>, a conclusion which could only be reached via [[Bayes' theorem]] given knowledge about the marginal probabilities <math display="inline">P(p_\text{H} = 0.5)</math> and <math display="inline">P(\text{HH})</math>. 
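The likelihood value just computed, <math display="inline">\mathcal{L}(p_\text{H}=0.5 \mid \text{HH}) = 0.25</math>, can be checked directly with a minimal Python sketch (the function name and the restriction to exactly two tosses are illustrative choices):
<syntaxhighlight lang="python">
def likelihood_two_heads(p_heads):
    """L(p_H | HH): probability of heads on two independent tosses with P(heads) = p_heads."""
    return p_heads ** 2

print(likelihood_two_heads(0.5))  # 0.25, matching L(p_H = 0.5 | HH) above
</syntaxhighlight>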
Now suppose that the coin is not a fair coin, but instead that <math display="inline">p_\text{H} = 0.3</math>. Then the probability of two heads on two flips is <math display="block">P(\text{HH} \mid p_\text{H}=0.3) = 0.3^2 = 0.09.</math> Hence <math display="block">\mathcal{L}(p_\text{H}=0.3 \mid \text{HH}) = 0.09.</math> More generally, for each value of <math display="inline">p_\text{H}</math>, we can calculate the corresponding likelihood. The result of such calculations is displayed in Figure 1. The integral of <math display="inline">\mathcal{L}</math> over [0, 1] is 1/3; likelihoods need not integrate or sum to one over the parameter space. ===Continuous probability distribution=== Let <math display="inline">X</math> be a [[random variable]] following an [[Probability distribution#Continuous probability distribution|absolutely continuous probability distribution]] with [[probability density function|density function]] <math display="inline">f</math> (a function of <math display="inline">x</math>) which depends on a parameter <math display="inline">\theta</math>. Then the function <math display="block">\mathcal{L}(\theta \mid x) = f_\theta (x), </math> considered as a function of <math display="inline">\theta</math>, is the ''likelihood function'' (of <math display="inline">\theta</math>, given the [[Outcome (probability)|outcome]] <math display="inline">X=x</math>). Again, <math display="inline">\mathcal{L}</math> is not a probability density or mass function over <math display="inline">\theta</math>, despite being a function of <math display="inline">\theta</math> given the observation <math display="inline">X = x</math>. ====Relationship between the likelihood and probability density functions==== The use of the [[probability density function|probability density]] in specifying the likelihood function above is justified as follows. Given an observation <math display="inline">x_j</math>, the likelihood for the interval <math display="inline">[x_j, x_j + h]</math>, where <math display="inline">h > 0</math> is a constant, is given by <math display="inline"> \mathcal{L}(\theta\mid x \in [x_j, x_j + h]) </math>. Observe that <math display="block"> \mathop\operatorname{arg\,max}_\theta \mathcal{L}(\theta\mid x \in [x_j, x_j + h]) = \mathop\operatorname{arg\,max}_\theta \frac{1}{h} \mathcal{L}(\theta\mid x \in [x_j, x_j + h]) ,</math> since <math display="inline"> h </math> is positive and constant. Because <math display="block"> \mathop\operatorname{arg\,max}_\theta \frac 1 h \mathcal{L}(\theta\mid x \in [x_j, x_j + h]) = \mathop\operatorname{arg\,max}_\theta \frac 1 h \Pr(x_j \leq x \leq x_j + h \mid \theta) = \mathop\operatorname{arg\,max}_\theta \frac 1 h \int_{x_j}^{x_j+h} f(x\mid \theta) \,dx, </math> where <math display="inline"> f(x\mid \theta) </math> is the probability density function, it follows that <math display="block"> \mathop\operatorname{arg\,max}_\theta \mathcal{L}(\theta\mid x \in [x_j, x_j + h]) = \mathop\operatorname{arg\,max}_\theta \frac{1}{h} \int_{x_j}^{x_j+h} f(x\mid\theta) \,dx .</math> The first [[fundamental theorem of calculus]] provides that <math display="block"> \lim_{h \to 0^{+}} \frac 1 h \int_{x_j}^{x_j+h} f(x\mid\theta) \,dx = f(x_j \mid \theta). 
</math> Then <math display="block"> \begin{align} \mathop\operatorname{arg\,max}_\theta \mathcal{L}(\theta\mid x_j) &= \mathop\operatorname{arg\,max}_\theta \left[ \lim_{h\to 0^{+}} \mathcal{L}(\theta\mid x \in [x_j, x_j + h]) \right] \\[4pt] &= \mathop\operatorname{arg\,max}_\theta \left[ \lim_{h\to 0^{+}} \frac{1}{h} \int_{x_j}^{x_j+h} f(x\mid\theta) \,dx \right] \\[4pt] &= \mathop\operatorname{arg\,max}_\theta f(x_j \mid \theta). \end{align} </math> Therefore, <math display="block"> \mathop\operatorname{arg\,max}_\theta \mathcal{L}(\theta\mid x_j) = \mathop\operatorname{arg\,max}_\theta f(x_j \mid \theta), </math> and so maximizing the probability density at <math display="inline"> x_j </math> amounts to maximizing the likelihood of the specific observation <math display="inline"> x_j </math>. ===In general=== In [[Probability theory#Measure-theoretic probability theory|measure-theoretic probability theory]], the [[Probability density function|density function]] is defined as the [[Radon–Nikodym theorem|Radon–Nikodym derivative]] of the probability distribution relative to a common dominating measure.<ref>{{citation |first=Patrick |last=Billingsley | author-link= Patrick Billingsley|title=Probability and Measure |publisher= [[John Wiley & Sons]] |edition=Third |year=1995 |pages=422–423 |mode=cs1 }}</ref> The likelihood function is this density interpreted as a function of the parameter, rather than the random variable.<ref name="Shao03">{{citation| first= Jun| last= Shao| year= 2003 | title= Mathematical Statistics | edition= 2nd | publisher= Springer | at= §4.4.1 |mode=cs1 }}</ref> Thus, we can construct a likelihood function for any distribution, whether discrete, continuous, a mixture, or otherwise. (Likelihoods are comparable, e.g. for parameter estimation, only if they are Radon–Nikodym derivatives with respect to the same dominating measure.) The above discussion of the likelihood for discrete random variables uses the [[counting measure]], under which the probability density at any outcome equals the probability of that outcome. ===Likelihoods for mixed continuous–discrete distributions=== The above can be extended in a simple way to allow consideration of distributions which contain both discrete and continuous components. Suppose that the distribution consists of a number of discrete probability masses <math display="inline">p_k (\theta)</math> and a density <math display="inline">f(x\mid\theta)</math>, where the sum of all the <math display="inline">p</math>'s added to the integral of <math display="inline">f</math> is always one. Assuming that it is possible to distinguish an observation corresponding to one of the discrete probability masses from one which corresponds to the density component, the likelihood function for an observation from the continuous component can be dealt with in the manner shown above. For an observation from the discrete component, the likelihood function is simply <math display="block">\mathcal{L}(\theta \mid x )= p_k(\theta), </math> where <math display="inline">k</math> is the index of the discrete probability mass corresponding to observation <math display="inline">x</math>, because maximizing the probability mass (or probability) at <math display="inline">x</math> amounts to maximizing the likelihood of the specific observation. 
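As a concrete (and purely illustrative) case, the following Python sketch assumes a zero-inflated exponential model, which places a probability mass at zero and spreads the remaining probability over the positive reals with a density; the likelihood of an observation uses whichever component it falls in:
<syntaxhighlight lang="python">
import math

def mixed_likelihood(p_zero, rate, x):
    """Likelihood contribution of one observation under a zero-inflated exponential model:
    probability mass p_zero at x = 0, density (1 - p_zero) * rate * exp(-rate * x) for x > 0."""
    if x == 0.0:
        return p_zero                                    # discrete component: a probability mass
    return (1.0 - p_zero) * rate * math.exp(-rate * x)   # continuous component: a density value

# Joint likelihood of independent observations, mixing both kinds of contributions.
sample = [0.0, 0.7, 0.0, 2.3]
L = 1.0
for x in sample:
    L *= mixed_likelihood(p_zero=0.4, rate=1.0, x=x)
</syntaxhighlight>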
The fact that the likelihood function can be defined in a way that includes contributions that are not commensurate (the density and the probability mass) arises from the way in which the likelihood function is defined up to a constant of proportionality, where this "constant" can change with the observation <math display="inline">x</math>, but not with the parameter <math display="inline">\theta</math>. === Regularity conditions === In the context of parameter estimation, the likelihood function is usually assumed to obey certain conditions, known as regularity conditions. These conditions are {{em|assumed}} in various proofs involving likelihood functions, and need to be verified in each particular application. For maximum likelihood estimation, the existence of a global maximum of the likelihood function is of the utmost importance. By the [[extreme value theorem]], it suffices that the likelihood function is [[Continuous function|continuous]] on a [[compactness|compact]] parameter space for the maximum likelihood estimator to exist.<ref>{{cite book |first1=Christian |last1=Gouriéroux |author-link=Christian Gouriéroux |first2=Alain |last2=Monfort |year=1995 |title=Statistics and Econometric Models |location=New York |publisher=Cambridge University Press |isbn=0-521-40551-3 |page=161 |url=https://books.google.com/books?id=gqI-pAP2JZ8C&pg=PA161 }}</ref> While the continuity assumption is usually met, the compactness assumption about the parameter space is often not, as the bounds of the true parameter values might be unknown. In that case, [[Concave function|concavity]] of the likelihood function plays a key role. More specifically, if the likelihood function is twice continuously differentiable on the <var>k</var>-dimensional parameter space <math display="inline"> \Theta </math> assumed to be an [[Open set|open]] [[Connected space|connected]] subset of <math display="inline"> \mathbb{R}^{k} \,,</math> there exists a unique maximum <math display="inline">\hat{\theta} \in \Theta</math> if the [[Hessian matrix|matrix of second partials]] <math display="block"> \mathbf{H}(\theta) \equiv \left[\, \frac{ \partial^2 L }{\, \partial \theta_i \, \partial \theta_j \,} \,\right]_{i,j=1,1}^{n_\mathrm{i},n_\mathrm{j}} \;</math> is [[negative definite]] for every <math display="inline">\, \theta \in \Theta \,</math> at which the gradient <math display="inline">\; \nabla L \equiv \left[\, \frac{ \partial L }{\, \partial \theta_i \,} \,\right]_{i=1}^{n_\mathrm{i}} \;</math> vanishes, and if the likelihood function approaches a constant on the [[Boundary (topology)|boundary]] of the parameter space, <math display="inline">\; \partial \Theta \;,</math> i.e., <math display="block"> \lim_{\theta \to \partial \Theta} L(\theta) = 0 \;,</math> which may include the points at infinity if <math display="inline"> \, \Theta \, </math> is unbounded. Mäkeläinen and co-authors prove this result using [[Morse theory]] while informally appealing to a mountain pass property.<ref>{{cite journal |first1=Timo |last1=Mäkeläinen |first2=Klaus |last2=Schmidt |first3=George P.H. |last3=Styan |year=1981 |title=On the existence and uniqueness of the maximum likelihood estimate of a vector-valued parameter in fixed-size samples |journal=[[Annals of Statistics]] |volume=9 |issue=4 |pages=758–767 |doi=10.1214/aos/1176345516 |jstor=2240844 |doi-access=free }}</ref> Mascarenhas restates their proof using the [[mountain pass theorem]].<ref>{{cite journal |first=W.F. 
|last=Mascarenhas |year=2011 |title=A mountain pass lemma and its implications regarding the uniqueness of constrained minimizers |journal=Optimization |volume=60 |issue=8–9 |pages=1121–1159 |doi=10.1080/02331934.2010.527973 |s2cid=15896597 }}</ref> In the proofs of [[Consistent estimator|consistency]] and asymptotic normality of the maximum likelihood estimator, additional assumptions are made about the probability densities that form the basis of a particular likelihood function. These conditions were first established by Chanda.<ref>{{cite journal |first=K.C. |last=Chanda |year=1954 |title=A note on the consistency and maxima of the roots of likelihood equations |journal=[[Biometrika]] |volume=41 |issue=1–2 |pages=56–61 |doi=10.2307/2333005 |jstor=2333005 }}</ref> In particular, for [[almost all]] <math display="inline">x</math>, and for all <math display="inline">\, \theta \in \Theta \,,</math> <math display="block">\frac{\partial \log f}{\partial \theta_r} \,, \quad \frac{\partial^2 \log f}{\partial \theta_r \partial \theta_s} \,, \quad \frac{\partial^3 \log f}{\partial \theta_r \, \partial \theta_s \, \partial \theta_t} \,</math> exist for all <math display="inline">\, r, s, t = 1, 2, \ldots, k \,</math> in order to ensure the existence of a [[Taylor expansion]]. Second, for almost all <math display="inline">x</math> and for every <math display="inline">\, \theta \in \Theta \,</math> it must be that <math display="block"> \left| \frac{\partial f}{\partial \theta_r} \right| < F_r(x) \,, \quad \left| \frac{\partial^2 f}{\partial \theta_r \, \partial \theta_s} \right| < F_{rs}(x) \,, \quad \left| \frac{\partial^3 f}{\partial \theta_r \, \partial \theta_s \, \partial \theta_t} \right| < H_{rst}(x) </math> where <math display="inline">H</math> is such that <math display="inline">\, \int_{-\infty}^{\infty} H_{rst}(z) \mathrm{d}z \leq M < \infty \;.</math> This boundedness of the derivatives is needed to allow for [[differentiation under the integral sign]]. And lastly, it is assumed that the [[information matrix]], <math display="block">\mathbf{I}(\theta) = \int_{-\infty}^{\infty} \frac{\partial \log f}{\partial \theta_r}\ \frac{\partial \log f}{\partial \theta_s}\ f\ \mathrm{d}z </math> is [[positive definite]] and <math display="inline">\, \left| \mathbf{I}(\theta) \right| \,</math> is finite. This ensures that the [[Score (statistics)|score]] has a finite variance.<ref>{{cite book |first1=Edward |last1=Greenberg |first2=Charles E. Jr. |last2=Webster |title=Advanced Econometrics: A Bridge to the Literature |location=New York, NY |publisher=John Wiley & Sons |year=1983 |isbn=0-471-09077-8 |pages=24–25 }}</ref> The above conditions are sufficient, but not necessary. That is, a model that does not meet these regularity conditions may or may not have a maximum likelihood estimator of the properties mentioned above. Further, in the case of non-independently or non-identically distributed observations, additional properties may need to be assumed. In Bayesian statistics, almost identical regularity conditions are imposed on the likelihood function in order to prove asymptotic normality of the [[posterior probability]],<ref>{{cite journal |first1=C. C. |last1=Heyde |first2=I. M. 
|last2=Johnstone |title=On Asymptotic Posterior Normality for Stochastic Processes |journal=Journal of the Royal Statistical Society |series=Series B (Methodological) |volume=41 |issue=2 |year=1979 |pages=184–189 |doi=10.1111/j.2517-6161.1979.tb01071.x }}</ref><ref>{{cite journal |first=Chan-Fu |last=Chen |title=On Asymptotic Normality of Limiting Density Functions with Bayesian Implications |journal=Journal of the Royal Statistical Society |series=Series B (Methodological) |volume=47 |issue=3 |year=1985 |pages=540–546 |doi=10.1111/j.2517-6161.1985.tb01384.x }}</ref> and therefore to justify a [[Laplace approximation]] of the posterior in large samples.<ref>{{cite book |first1=Robert E. |last1=Kass |first2=Luke |last2=Tierney |first3=Joseph B. |last3=Kadane |chapter=The Validity of Posterior Expansions Based on Laplace's Method |editor-first=S. |editor-last=Geisser |editor2-first=J. S. |editor2-last=Hodges |editor3-first=S. J. |editor3-last=Press |editor4-first=A. |editor4-last=Zellner |pages=473–488 |publisher=Elsevier |title=Bayesian and Likelihood Methods in Statistics and Econometrics |year=1990 |isbn=0-444-88376-2 }}</ref> ==Likelihood ratio and relative likelihood== {{See also|Pseudo-R-squared}} ===Likelihood ratio=== {{About|the likelihood ratio in general|the use of likelihood ratios in interpreting diagnostic tests|Likelihood ratios in diagnostic testing|the statistical test to compare goodness of fit|Likelihood-ratio test|section=yes}} A ''likelihood ratio'' is the ratio of any two specified likelihoods, frequently written as: <math display="block">\Lambda(\theta_1:\theta_2 \mid x) = \frac{\mathcal{L}(\theta_1 \mid x)}{\mathcal{L}(\theta_2 \mid x)}.</math> The likelihood ratio is central to [[likelihoodist statistics]]: the ''[[law of likelihood]]'' states that the degree to which data (considered as evidence) supports one parameter value versus another is measured by the likelihood ratio. In [[frequentist inference]], the likelihood ratio is the basis for a [[test statistic]], the so-called [[likelihood-ratio test]]. By the [[Neyman–Pearson lemma]], this is the most [[Statistical power|powerful]] test for comparing two [[simple hypothesis|simple hypotheses]] at a given [[significance level]]. Numerous other tests can be viewed as likelihood-ratio tests or approximations thereof.<ref>{{cite journal |first=A. |last=Buse |title=The Likelihood Ratio, Wald, and Lagrange Multiplier Tests: An Expository Note |journal=[[The American Statistician]] |volume=36 |issue=3a |year=1982 |pages=153–157 |doi=10.1080/00031305.1982.10482817 }}</ref> The asymptotic distribution of the log-likelihood ratio, considered as a test statistic, is given by [[Wilks' theorem]]. The likelihood ratio is also of central importance in [[Bayesian inference]], where it is known as the [[Bayes factor]], and is used in [[Bayes' rule]]. Stated in terms of [[odds]], Bayes' rule states that the ''posterior'' odds of two alternatives, {{tmath|A_1}} and {{tmath|A_2}}, given an event {{tmath|B}}, is the ''prior'' odds, times the likelihood ratio. As an equation: <math display="block">O(A_1:A_2 \mid B) = O(A_1:A_2) \cdot \Lambda(A_1:A_2 \mid B).</math> The likelihood ratio is not directly used in AIC-based statistics. Instead, what is used is the relative likelihood of models (see below). In [[evidence-based medicine]], likelihood ratios [[Likelihood ratios in diagnostic testing|are used in diagnostic testing]] to assess the value of performing a [[diagnostic test]]. 
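For instance (a Python sketch; the binomial model and the observed count of 7 heads in 10 tosses are hypothetical), the likelihood ratio of two candidate parameter values given the same data can be computed as:
<syntaxhighlight lang="python">
from math import comb

def binomial_likelihood(theta, heads, tosses):
    """L(theta | data): probability of the observed count under Binomial(tosses, theta)."""
    return comb(tosses, heads) * theta ** heads * (1 - theta) ** (tosses - heads)

heads, tosses = 7, 10
ratio = binomial_likelihood(0.7, heads, tosses) / binomial_likelihood(0.5, heads, tosses)
# ratio > 1: by the law of likelihood, the data support theta = 0.7 over theta = 0.5.
</syntaxhighlight>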
===Relative likelihood function=== {{See also|Relative likelihood}} Since the actual value of the likelihood function depends on the sample, it is often convenient to work with a standardized measure. Suppose that the [[maximum likelihood estimate]] for the parameter {{mvar|θ}} is <math display="inline">\hat{\theta}</math>. Relative plausibilities of other {{mvar|θ}} values may be found by comparing the likelihoods of those other values with the likelihood of <math display="inline">\hat{\theta}</math>. The ''relative likelihood'' of {{mvar|θ}} is defined to be<ref name='Kalbfleisch'>{{citation | author-link= James G. Kalbfleisch | last= Kalbfleisch | first= J. G. | year=1985 | title= Probability and Statistical Inference | publisher= Springer}} (§9.3).</ref><ref>{{citation| last= Azzalini | first= A. | title= Statistical Inference—Based on the likelihood | year= 1996 | publisher= [[Chapman & Hall]] | url= https://books.google.com/books?id=hyN6gXHvSo0C | isbn= 9780412606502 }} (§1.4.2).</ref><ref name='Sprott'>Sprott, D. A. (2000), ''Statistical Inference in Science'', Springer (chap. 2).</ref><ref>Davison, A. C. (2008), ''Statistical Models'', [[Cambridge University Press]] (§4.1.2).</ref><ref>{{citation|first1= L. | last1= Held | first2= D. S. | last2= Sabanés Bové | title= Applied Statistical Inference—Likelihood and Bayes | year= 2014 | publisher= Springer}} (§2.1).</ref> <math display="block">R(\theta) = \frac{\mathcal{L}(\theta \mid x)}{\mathcal{L}(\hat{\theta} \mid x)}.</math> Thus, the relative likelihood is the likelihood ratio (discussed above) with the fixed denominator <math display="inline"> \mathcal{L}(\hat{\theta})</math>. This corresponds to standardizing the likelihood to have a maximum of 1. ====Likelihood region==== A ''likelihood region'' is the set of all values of {{mvar|θ}} whose relative likelihood is greater than or equal to a given threshold. In terms of percentages, a ''{{mvar|p}}% likelihood region'' for {{mvar|θ}} is defined to be<ref name='Kalbfleisch'/><ref name='Sprott'/><ref name="Rossi2018">{{citation | last= Rossi | first= R. J. | year= 2018 | title= Mathematical Statistics | publisher= [[Wiley (publisher)|Wiley]] | page= 267 }}.</ref> <math display="block"> \left\{\theta : R(\theta) \ge \frac p {100} \right\}. </math> If {{mvar|θ}} is a single real parameter, a {{mvar|p}}% likelihood region will usually comprise an [[Interval (mathematics)|interval]] of real values. If the region does comprise an interval, then it is called a ''likelihood interval''.<ref name='Kalbfleisch'/><ref name='Sprott'/><ref name=Hudson>{{Citation | last1 = Hudson | first1 = D. J. | title = Interval estimation from the likelihood function | journal = [[Journal of the Royal Statistical Society, Series B]] | volume = 33 | issue = 2 | pages = 256–262 | year = 1971 | doi = 10.1111/j.2517-6161.1971.tb00877.x }}.</ref> Likelihood intervals, and more generally likelihood regions, are used for [[interval estimation]] within likelihoodist statistics: they are similar to [[confidence interval]]s in frequentist statistics and [[credible interval]]s in Bayesian statistics. Likelihood intervals are interpreted directly in terms of relative likelihood, not in terms of [[coverage probability]] (frequentism) or [[posterior probability]] (Bayesianism). Given a model, likelihood intervals can be compared to confidence intervals. 
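A rough Python sketch of these definitions (the binomial data, the grid over <math display="inline">\theta</math>, and the 10% threshold are illustrative assumptions, and the grid search stands in for an exact maximization):
<syntaxhighlight lang="python">
from math import comb

def likelihood(theta, heads=7, tosses=10):
    """Binomial likelihood L(theta | data) for an observed number of heads."""
    return comb(tosses, heads) * theta ** heads * (1 - theta) ** (tosses - heads)

grid = [i / 1000 for i in range(1, 1000)]              # candidate theta values in (0, 1)
L_hat = max(likelihood(t) for t in grid)               # likelihood at the maximum (theta_hat = 0.7 here)
relative = {t: likelihood(t) / L_hat for t in grid}    # relative likelihood R(theta)

# 10% likelihood region: all theta whose relative likelihood is at least 0.10.
region = [t for t, r in relative.items() if r >= 0.10]
likelihood_interval = (min(region), max(region))       # an interval here, since R(theta) is unimodal
</syntaxhighlight>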
If {{mvar|θ}} is a single real parameter, then under certain conditions, a 14.65% likelihood interval (about 1:7 likelihood) for {{mvar|θ}} will be the same as a 95% confidence interval (19/20 coverage probability).<ref name='Kalbfleisch'/><ref name="Rossi2018"/> In a slightly different formulation suited to the use of log-likelihoods (see [[Likelihood-ratio test#Distribution: Wilks.27 theorem|Wilks' theorem]]), the test statistic is twice the difference in log-likelihoods and the probability distribution of the test statistic is approximately a [[chi-squared distribution]] with degrees-of-freedom (df) equal to the difference in df's between the two models (therefore, the {{mvar|e}}<sup>−2</sup> likelihood interval is the same as the 0.954 confidence interval; assuming difference in df's to be 1).<ref name="Rossi2018"/><ref name=Hudson/> ==Likelihoods that eliminate nuisance parameters== In many cases, the likelihood is a function of more than one parameter but interest focuses on the estimation of only one, or at most a few of them, with the others being considered as [[nuisance parameter]]s. Several alternative approaches have been developed to eliminate such nuisance parameters, so that a likelihood can be written as a function of only the parameter (or parameters) of interest: the main approaches are profile, conditional, and marginal likelihoods.<ref>{{cite book |title=In All Likelihood: Statistical Modelling and Inference Using Likelihood |first=Yudi |last=Pawitan |year=2001 |publisher= [[Oxford University Press]] }}</ref><ref>{{cite web | author = Wen Hsiang Wei |url= http://web.thu.edu.tw/wenwei/www/glmpdfmargin.htm |title = Generalized Linear Model - course notes | pages = Chapter 5 | publisher = [[Tunghai University]] | location= Taichung, Taiwan | access-date = 2017-10-01 }}</ref> These approaches are also useful when a high-dimensional likelihood surface needs to be reduced to one or two parameters of interest in order to allow a [[Graph of a function|graph]]. ===Profile likelihood=== It is possible to reduce the dimensions by concentrating the likelihood function for a subset of parameters by expressing the nuisance parameters as functions of the parameters of interest and replacing them in the likelihood function.<ref>{{cite book |first=Takeshi |last=Amemiya |author-link=Takeshi Amemiya |title=Advanced Econometrics |chapter=Concentrated Likelihood Function |location=Cambridge |publisher=Harvard University Press |year=1985 |pages=[https://archive.org/details/advancedeconomet00amem/page/125 125–127] |isbn=978-0-674-00560-0 |chapter-url=https://books.google.com/books?id=0bzGQE14CwEC&pg=PA125 |url-access=registration |url=https://archive.org/details/advancedeconomet00amem/page/125 }}</ref><ref>{{cite book |first1=Russell |last1=Davidson |first2=James G. |last2=MacKinnon |author-link2=James G. 
MacKinnon |title=Estimation and Inference in Econometrics |chapter=Concentrating the Loglikelihood Function |location=New York |publisher=Oxford University Press |year=1993 |pages=267–269 |isbn=978-0-19-506011-9 }}</ref> In general, for a likelihood function depending on the parameter vector <math display="inline">\mathbf{\theta}</math> that can be partitioned into <math display="inline">\mathbf{\theta} = \left( \mathbf{\theta}_{1} : \mathbf{\theta}_{2} \right)</math>, and where a correspondence <math display="inline">\mathbf{\hat{\theta}}_{2} = \mathbf{\hat{\theta}}_{2} \left( \mathbf{\theta}_{1} \right)</math> can be determined explicitly, concentration reduces [[Computational complexity|computational burden]] of the original maximization problem.<ref>{{cite book |first1=Christian |last1=Gourieroux |first2=Alain |last2=Monfort |title=Statistics and Econometric Models |chapter=Concentrated Likelihood Function |location=New York |publisher=Cambridge University Press |year=1995 |isbn=978-0-521-40551-5 |pages=170–175 |chapter-url=https://books.google.com/books?id=gqI-pAP2JZ8C&pg=PA170 }}</ref> For instance, in a [[linear regression]] with normally distributed errors, <math display="inline">\mathbf{y} = \mathbf{X} \beta + u</math>, the coefficient vector could be [[Partition of a set|partitioned]] into <math display="inline">\beta = \left[ \beta_{1} : \beta_{2} \right]</math> (and consequently the [[design matrix]] <math display="inline">\mathbf{X} = \left[ \mathbf{X}_{1} : \mathbf{X}_{2} \right]</math>). Maximizing with respect to <math display="inline">\beta_{2}</math> yields an optimal value function <math display="inline">\beta_{2} (\beta_{1}) = \left( \mathbf{X}_{2}^{\mathsf{T}} \mathbf{X}_{2} \right)^{-1} \mathbf{X}_{2}^{\mathsf{T}} \left( \mathbf{y} - \mathbf{X}_{1} \beta_{1} \right)</math>. Using this result, the maximum likelihood estimator for <math display="inline">\beta_{1}</math> can then be derived as <math display="block">\hat{\beta}_{1} = \left( \mathbf{X}_{1}^{\mathsf{T}} \left( \mathbf{I} - \mathbf{P}_{2} \right) \mathbf{X}_{1} \right)^{-1} \mathbf{X}_{1}^{\mathsf{T}} \left( \mathbf{I} - \mathbf{P}_{2} \right) \mathbf{y}</math> where <math display="inline">\mathbf{P}_{2} = \mathbf{X}_{2} \left( \mathbf{X}_{2}^{\mathsf{T}} \mathbf{X}_{2} \right)^{-1} \mathbf{X}_{2}^{\mathsf{T}}</math> is the [[projection matrix]] of <math display="inline">\mathbf{X}_{2}</math>. This result is known as the [[Frisch–Waugh–Lovell theorem]]. Since graphically the procedure of concentration is equivalent to slicing the likelihood surface along the ridge of values of the nuisance parameter <math display="inline">\beta_{2}</math> that maximizes the likelihood function, creating an [[Contour line|isometric]] [[Topographic profile|profile]] of the likelihood function for a given <math display="inline">\beta_{1}</math>, the result of this procedure is also known as ''profile likelihood''.<ref>{{citation |first=Andrew |last=Pickles |title=An Introduction to Likelihood Analysis |location=Norwich |publisher=W. H. Hutchins & Sons |year=1985 |isbn=0-86094-190-6 |pages=[https://archive.org/details/introductiontoli0000pick/page/21 21–24] |mode=cs1 |url=https://archive.org/details/introductiontoli0000pick/page/21 }}</ref><ref>{{cite book |first=Benjamin M. 
|last=Bolker |title=Ecological Models and Data in R |publisher=Princeton University Press |year=2008 |isbn=978-0-691-12522-0 |pages=187–189 |url=https://books.google.com/books?id=flyBd1rpqeoC&pg=PA188 }}</ref> In addition to being graphed, the profile likelihood can also be used to compute [[confidence interval]]s that often have better small-sample properties than those based on asymptotic [[Standard error (statistics)|standard errors]] calculated from the full likelihood.<ref>{{citation|last=Aitkin|first=Murray|title=GLIM 82: Proceedings of the International Conference on Generalised Linear Models|pages=76–86|year=1982|chapter=Direct Likelihood Inference|publisher=Springer|isbn=0-387-90777-7|author-link=Murray Aitkin|mode=cs1}}</ref><ref>{{citation |first1=D. J. |last1=Venzon |first2=S. H. |last2=Moolgavkar |title=A Method for Computing Profile-Likelihood-Based Confidence Intervals |journal=[[Journal of the Royal Statistical Society]] |series=Series C (Applied Statistics) |volume=37 |issue=1 |year=1988 |pages=87–94 |doi=10.2307/2347496 |jstor=2347496 |mode=cs1 }}</ref> ===Conditional likelihood=== Sometimes it is possible to find a [[sufficient statistic]] for the nuisance parameters, and conditioning on this statistic results in a likelihood which does not depend on the nuisance parameters.<ref>{{cite journal |first1=J. D. |last1=Kalbfleisch |first2=D. A. |last2=Sprott |title=Marginal and Conditional Likelihoods |journal=Sankhyā: The Indian Journal of Statistics |series=Series A |volume=35 |issue=3 |year=1973 |pages=311–328 |jstor=25049882 }}</ref> One example occurs in 2×2 tables, where conditioning on all four marginal totals leads to a conditional likelihood based on the non-central [[hypergeometric distribution]]. This form of conditioning is also the basis for [[Fisher's exact test]]. ===Marginal likelihood=== {{Main|Marginal likelihood}} Sometimes we can remove the nuisance parameters by considering a likelihood based on only part of the information in the data, for example by using the set of ranks rather than the numerical values. Another example occurs in linear [[mixed model]]s, where considering a likelihood for the residuals only after fitting the fixed effects leads to [[residual maximum likelihood]] estimation of the variance components. ===Partial likelihood=== A partial likelihood is an adaption of the full likelihood such that only a part of the parameters (the parameters of interest) occur in it.<ref> {{citation |last=Cox |first=D. R. |author-link=David Cox (statistician) |title=Partial likelihood |journal=[[Biometrika]] |year=1975 |volume=62 |issue=2 |pages=269–276 |doi=10.1093/biomet/62.2.269 |mr=0400509 |mode=cs1 }}</ref> It is a key component of the [[proportional hazards model]]: using a restriction on the hazard function, the likelihood does not contain the shape of the hazard over time. ==Products of likelihoods== The likelihood, given two or more [[independence (probability theory)|independent]] [[Event (probability theory)|events]], is the product of the likelihoods of each of the individual events: <math display="block">\Lambda(A \mid X_1 \land X_2) = \Lambda(A \mid X_1) \cdot \Lambda(A \mid X_2).</math> This follows from the definition of independence in probability: the probabilities of two independent events happening, given a model, is the product of the probabilities. This is particularly important when the events are from [[independent and identically distributed random variables]], such as independent observations or [[sampling with replacement]]. 
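A quick numerical check of this product rule (a Python sketch with a hypothetical Bernoulli parameter and three tosses):
<syntaxhighlight lang="python">
def bernoulli_likelihood(theta, outcome):
    """Likelihood of one coin toss: theta for heads (1), 1 - theta for tails (0)."""
    return theta if outcome == 1 else 1.0 - theta

theta = 0.6
tosses = [1, 0, 1]  # heads, tails, heads; assumed independent

# Likelihood given all three observations at once ...
joint = theta * (1.0 - theta) * theta
# ... equals the product of the single-observation likelihoods.
product = 1.0
for outcome in tosses:
    product *= bernoulli_likelihood(theta, outcome)
assert abs(joint - product) < 1e-12
</syntaxhighlight>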
In such a situation, the likelihood function factors into a product of individual likelihood functions. The empty product has value 1, which corresponds to the likelihood, given no event, being 1: before any data, the likelihood is always 1. This is similar to a [[uniform prior]] in Bayesian statistics, but in likelihoodist statistics this is not an [[improper prior]] because likelihoods are not integrated. ==Log-likelihood== {{see also|Log-probability}} The ''log-likelihood function'' is the logarithm of the likelihood function, often denoted by a lowercase {{math|''l''}} or {{tmath|\ell}}, to contrast with the uppercase {{math|''L''}} or <math display="inline">\mathcal{L}</math> for the likelihood. Because logarithms are [[strictly increasing]] functions, maximizing the likelihood is equivalent to maximizing the log-likelihood. But for practical purposes it is more convenient to work with the log-likelihood function in [[maximum likelihood estimation]], in particular since most common [[probability distribution]]s—notably the [[exponential family]]—are only [[Logarithmically concave function|logarithmically concave]],<ref>{{citation |first1=Robert E. |last1=Kass |first2=Paul W. |last2=Vos |title=Geometrical Foundations of Asymptotic Inference |location=New York |publisher=John Wiley & Sons |year=1997 |isbn=0-471-82668-5 |page=14 |url=https://books.google.com/books?id=e43EAIfUPCwC&pg=PA14 |mode=cs1 }}</ref><ref>{{cite web |first=Alecos |last=Papadopoulos |title=Why we always put log() before the joint pdf when we use MLE (Maximum likelihood Estimation)? |date=September 25, 2013 |work=[[Stack Exchange]] |url=https://stats.stackexchange.com/q/70975 }}</ref> and [[Concave function|concavity]] of the [[objective function]] plays a key role in the [[Mathematical optimization|maximization]]. Given the independence of each event, the overall log-likelihood of intersection equals the sum of the log-likelihoods of the individual events. This is analogous to the fact that the overall [[log-probability]] is the sum of the log-probability of the individual events. In addition to this mathematical convenience, the additivity of the log-likelihood has an intuitive interpretation, often expressed as "support" from the data. When the parameters are estimated by [[maximum likelihood estimation]] using the log-likelihood, each data point contributes by being added to the total log-likelihood. As the data can be viewed as evidence that supports the estimated parameters, this process can be interpreted as "support from independent evidence ''adds''", and the log-likelihood is the "weight of evidence". Interpreting negative log-probability as [[information content]] or [[surprisal]], the support (log-likelihood) of a model, given an event, is the negative of the surprisal of the event, given the model: a model is supported by an event to the extent that the event is unsurprising, given the model. The logarithm of a likelihood ratio is equal to the difference of the log-likelihoods: <math display="block">\log \frac{\mathcal{L}(A)}{\mathcal{L}(B)} = \log \mathcal{L}(A) - \log \mathcal{L}(B) = \ell(A) - \ell(B).</math> Just as the likelihood, given no event, is 1, the log-likelihood, given no event, is 0, which corresponds to the value of the empty sum: without any data, there is no support for any model. ===Graph=== The [[Graph of a function|graph]] of the log-likelihood is called the '''support curve''' (in the [[univariate]] case).<ref name="Edwards72">{{cite book|last=Edwards|first=A. W. 
F.|authorlink=A. W. F. Edwards| orig-date=1972| year=1992| title=Likelihood| publisher=[[Johns Hopkins University Press]]|isbn=0-8018-4443-6}}</ref> In the multivariate case, the concept generalizes into a '''support surface''' over the [[parameter space]]. It has a relation to, but is distinct from, the [[Support (mathematics)#Support (statistics)|support of a distribution]]. The term was coined by [[A. W. F. Edwards]]<ref name="Edwards72" /> in the context of [[statistical hypothesis testing]], i.e. whether or not the data "support" one hypothesis (or parameter value) being tested more than any other. The log-likelihood function being plotted is used in the computation of the [[score (statistics)|score]] (the gradient of the log-likelihood) and [[Fisher information]] (the curvature of the log-likelihood). Thus, the graph has a direct interpretation in the context of [[maximum likelihood estimation]] and [[likelihood-ratio test]]s. ===Likelihood equations=== If the log-likelihood function is [[Smoothness|smooth]], its [[gradient]] with respect to the parameter, known as the [[Score (statistics)|score]] and written <math display="inline">s_{n}(\theta) \equiv \nabla_{\theta} \ell_{n}(\theta)</math>, exists and allows for the application of [[differential calculus]]. The basic way to maximize a differentiable function is to find the [[stationary point]]s (the points where the [[derivative]] is zero); since the derivative of a sum is just the sum of the derivatives, but the derivative of a product requires the [[product rule]], it is easier to compute the stationary points of the log-likelihood of independent events than for the likelihood of independent events. The equations defined by the stationary point of the score function serve as [[estimating equations]] for the maximum likelihood estimator. <math display="block">s_{n}(\theta) = \mathbf{0}</math> In that sense, the maximum likelihood estimator is implicitly defined by the value at <math display="inline">\mathbf{0}</math> of the [[inverse function]] <math display="inline">s_{n}^{-1}: \mathbb{E}^{d} \to \Theta</math>, where <math display="inline">\mathbb{E}^{d}</math> is the <var>d</var>-dimensional [[Euclidean space]], and <math display="inline">\Theta</math> is the parameter space. Using the [[inverse function theorem]], it can be shown that <math display="inline">s_{n}^{-1}</math> is [[well-defined]] in an [[open neighborhood]] about <math display="inline">\mathbf{0}</math> with probability going to one, and <math display="inline">\hat{\theta}_{n} = s_{n}^{-1}(\mathbf{0})</math> is a consistent estimate of <math display="inline">\theta</math>. As a consequence there exists a sequence <math display="inline">\left\{ \hat{\theta}_{n} \right\}</math> such that <math display="inline">s_{n}(\hat{\theta}_{n}) = \mathbf{0}</math> asymptotically [[almost surely]], and <math display="inline">\hat{\theta}_{n} \xrightarrow{\text{p}} \theta_{0}</math>.<ref>{{cite journal |first=Robert V. |last=Foutz |title=On the Unique Consistent Solution to the Likelihood Equations |journal=[[Journal of the American Statistical Association]] |volume=72 |year=1977 |issue=357 |pages=147–148 |doi=10.1080/01621459.1977.10479926 }}</ref> A similar result can be established using [[Rolle's theorem]].<ref>{{cite journal |first1=Robert E. 
|last1=Tarone |first2=Gary |last2=Gruenhage |title=A Note on the Uniqueness of Roots of the Likelihood Equations for Vector-Valued Parameters |journal=Journal of the American Statistical Association |volume=70 |year=1975 |issue=352 |pages=903–904 |doi=10.1080/01621459.1975.10480321 }}</ref><ref>{{cite journal |first1=Kamta |last1=Rai |first2=John |last2=Van Ryzin |title=A Note on a Multivariate Version of Rolle's Theorem and Uniqueness of Maximum Likelihood Roots |journal=Communications in Statistics |series=Theory and Methods |volume=11 |year=1982 |issue=13 |pages=1505–1510 |doi=10.1080/03610928208828325 }}</ref> The second derivative evaluated at <math display="inline">\hat{\theta}</math>, known as [[Fisher information]], determines the curvature of the likelihood surface,<ref>{{citation |first=B. Raja |last=Rao |title=A formula for the curvature of the likelihood surface of a sample drawn from a distribution admitting sufficient statistics |journal=[[Biometrika]] |volume=47 |issue=1–2 |year=1960 |pages=203–207 |doi=10.1093/biomet/47.1-2.203 |mode=cs1 }}</ref> and thus indicates the [[Precision (statistics)|precision]] of the estimate.<ref>{{citation |first1=Michael D. |last1=Ward |first2=John S. |last2=Ahlquist |title=Maximum Likelihood for Social Science : Strategies for Analysis |publisher= [[Cambridge University Press]] |year=2018 |pages=25–27 |mode=cs1 }}</ref> ===Exponential families=== {{further|Exponential family}} The log-likelihood is also particularly useful for [[exponential families]] of distributions, which include many of the common [[parametric model|parametric probability distributions]]. The probability distribution function (and thus likelihood function) for exponential families contains products of factors involving [[exponentiation]]. The logarithm of such a function is a sum of products, again easier to differentiate than the original function. An exponential family is one whose probability density function is of the form (for some functions, writing <math display="inline">\langle -, - \rangle</math> for the [[inner product]]): <math display="block"> p(x \mid \boldsymbol \theta) = h(x) \exp\Big(\langle \boldsymbol\eta({\boldsymbol \theta}), \mathbf{T}(x)\rangle -A({\boldsymbol \theta}) \Big).</math> Each of these terms has an interpretation,{{efn|See {{slink|Exponential family|Interpretation}}}} but simply switching from probability to likelihood and taking logarithms yields the sum: <math display="block"> \ell(\boldsymbol \theta \mid x) = \langle \boldsymbol\eta({\boldsymbol \theta}), \mathbf{T}(x)\rangle - A({\boldsymbol \theta}) + \log h(x).</math> The <math display="inline">\boldsymbol \eta(\boldsymbol \theta)</math> and <math display="inline">h(x)</math> each correspond to a [[change of coordinates]], so in these coordinates, the log-likelihood of an exponential family is given by the simple formula: <math display="block"> \ell(\boldsymbol \eta \mid x) = \langle \boldsymbol\eta, \mathbf{T}(x)\rangle - A({\boldsymbol \eta}).</math> In words, the log-likelihood of an exponential family is the inner product of the natural parameter {{tmath|\boldsymbol\eta}} and the [[sufficient statistic]] {{tmath|\mathbf{T}(x)}}, minus the normalization factor ([[log-partition function]]) {{tmath|A({\boldsymbol \eta})}}. Thus, for example, the maximum likelihood estimate can be computed by taking derivatives of the sufficient statistic {{math|''T''}} and the log-partition function {{math|''A''}}. 
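As an illustration of this form (a Python sketch; the Poisson family, with natural parameter <math display="inline">\eta = \log\lambda</math>, sufficient statistic <math display="inline">T(x) = x</math>, and log-partition function <math display="inline">A(\eta) = e^{\eta}</math>, is an assumed example rather than anything prescribed above):
<syntaxhighlight lang="python">
import math

def poisson_log_likelihood(eta, x):
    """log-likelihood in natural-parameter form: <eta, T(x)> - A(eta) + log h(x),
    with T(x) = x, A(eta) = exp(eta) and h(x) = 1/x! for the Poisson family."""
    return eta * x - math.exp(eta) - math.lgamma(x + 1)

data = [2, 4, 3, 5]

def total_log_likelihood(eta):
    """Sum of per-observation log-likelihoods for independent data."""
    return sum(poisson_log_likelihood(eta, x) for x in data)

# Setting the derivative of sum(x)*eta - n*exp(eta) with respect to eta to zero
# gives exp(eta_hat) = mean(data), i.e. the maximum likelihood estimate of lambda.
lambda_hat = sum(data) / len(data)
eta_hat = math.log(lambda_hat)
</syntaxhighlight>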
====Example: the gamma distribution==== The [[gamma distribution]] is an exponential family with two parameters, <math display="inline">\alpha</math> and <math display="inline">\beta</math>. The likelihood function is <math display="block">\mathcal{L} (\alpha, \beta \mid x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}.</math> Finding the maximum likelihood estimate of <math display="inline">\beta</math> for a single observed value <math display="inline">x</math> looks rather daunting. Its logarithm is much simpler to work with: <math display="block">\log \mathcal{L}(\alpha,\beta \mid x) = \alpha \log \beta - \log \Gamma(\alpha) + (\alpha-1) \log x - \beta x. \, </math> To maximize the log-likelihood, we first take the [[partial derivative]] with respect to <math display="inline">\beta</math>: <math display="block">\frac{\partial \log \mathcal{L}(\alpha,\beta \mid x)}{\partial \beta} = \frac{\alpha}{\beta} - x.</math> If there are a number of independent observations <math display="inline">x_1, \ldots, x_n</math>, then the joint log-likelihood will be the sum of individual log-likelihoods, and the derivative of this sum will be a sum of derivatives of each individual log-likelihood: <math display="block"> \begin{align} & \frac{\partial \log \mathcal{L}(\alpha,\beta \mid x_1, \ldots, x_n)}{\partial \beta} \\ &= \frac{\partial \log \mathcal{L}(\alpha,\beta \mid x_1)}{\partial \beta} + \cdots + \frac{\partial \log \mathcal{L}(\alpha,\beta \mid x_n)}{\partial \beta} \\ &= \frac{n \alpha} \beta - \sum_{i=1}^n x_i. \end{align} </math> To complete the maximization procedure for the joint log-likelihood, the equation is set to zero and solved for <math display="inline">\beta</math>: <math display="block">\widehat\beta = \frac{\alpha}{\bar{x}}.</math> Here <math display="inline">\widehat\beta</math> denotes the maximum-likelihood estimate, and <math display="inline">\textstyle \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i</math> is the [[sample mean]] of the observations. ==Background and interpretation== ===Historical remarks=== {{see also|History of statistics|History of probability}} The term "likelihood" has been in use in English since at least late [[Middle English]].<ref>"likelihood", ''[[Shorter Oxford English Dictionary]]'' (2007).</ref> Its formal use to refer to a specific [[Function (mathematics)|function]] in mathematical statistics was proposed by [[Ronald Fisher]],<ref>{{Citation | title=On the history of maximum likelihood in relation to inverse probability and least squares| first= A. | last=Hald |author-link=Anders Hald |journal=[[Statistical Science]] |volume= 14| issue=2 |year=1999 | pages =214–222 | doi=10.1214/ss/1009212248 | jstor = 2676741|url=http://projecteuclid.org/download/pdf_1/euclid.ss/1009212248 |mode=cs1 | doi-access=free }}</ref> in two research papers published in 1921<ref>{{citation | last=Fisher | first=R.A. |author-link=Ronald Fisher | journal= Metron | title= On the "probable error" of a coefficient of correlation deduced from a small sample | volume=1 | year=1921 | pages=3–32 |mode=cs1 }}</ref> and 1922.<ref name=Fisher1922>{{citation | last=Fisher | first=R.A. 
|author-link=Ronald Fisher | journal= Philosophical Transactions of the Royal Society A | title=On the mathematical foundations of theoretical statistics | volume=222 | issue=594–604 | year=1922 | pages=309–368 | url=http://digital.library.adelaide.edu.au/dspace/handle/2440/15172 | jstor=91208 | jfm = 48.1280.02 |doi=10.1098/rsta.1922.0009 | bibcode=1922RSPTA.222..309F |mode=cs1 | doi-access=free | hdl=2440/15172 | hdl-access=free }}</ref> The 1921 paper introduced what is today called a "likelihood interval"; the 1922 paper introduced the term "[[method of maximum likelihood]]". Quoting Fisher: {{Cquote|[I]n 1922, I proposed the term 'likelihood,' in view of the fact that, with respect to [the parameter], it is not a probability, and does not obey the laws of probability, while at the same time it bears to the problem of rational choice among the possible values of [the parameter] a relation similar to that which probability bears to the problem of predicting events in games of chance. . . . Whereas, however, in relation to psychological judgment, likelihood has some resemblance to probability, the two concepts are wholly distinct. . . ."<ref>{{citation |last=Klemens |first=Ben |title=Modeling with Data: Tools and Techniques for Scientific Computing |publisher= [[Princeton University Press]] |year=2008 |page=329 |mode=cs1 }}</ref>}} The concept of likelihood should not be confused with probability as mentioned by Sir Ronald Fisher {{Cquote|I stress this because in spite of the emphasis that I have always laid upon the difference between probability and likelihood there is still a tendency to treat likelihood as though it were a sort of probability. The first result is thus that there are two different measures of rational belief appropriate to different cases. Knowing the population we can express our incomplete knowledge of, or expectation of, the sample in terms of probability; knowing the sample we can express our incomplete knowledge of the population in terms of likelihood.<ref>{{citation | last = Fisher | first = Ronald | authorlink=Ronald Fisher | title = Inverse Probability | year = 1930 | journal = [[Mathematical Proceedings of the Cambridge Philosophical Society]] | volume = 26 | issue = 4 | pages= 528–535 | doi = 10.1017/S0305004100016297 | bibcode = 1930PCPS...26..528F |mode=cs1 }}</ref>}} Fisher's invention of statistical likelihood was in reaction against an earlier form of reasoning called [[inverse probability]].<ref>{{citation | last1 = Fienberg | first1 = Stephen E | year = 1997 | title = Introduction to R.A. Fisher on inverse probability and likelihood | journal = [[Statistical Science]] | volume = 12 | issue = 3| page = 161 | doi = 10.1214/ss/1030037905 |mode=cs1 | doi-access = free }}</ref> His use of the term "likelihood" fixed the meaning of the term within mathematical statistics. [[A. W. F. Edwards]] (1972) established the axiomatic basis for use of the log-likelihood ratio as a measure of relative support for one hypothesis against another. The ''support function'' is then the natural logarithm of the likelihood function. Both terms are used in [[phylogenetics]], but were not adopted in a general treatment of the topic of statistical evidence.<ref>{{citation |last=Royall |first=R. |year=1997 |title=Statistical Evidence |publisher=[[Chapman & Hall]] |mode=cs1 }}</ref> ===Interpretations under different foundations=== Among statisticians, there is no consensus about what the [[Foundations of statistics|foundation of statistics]] should be. 
There are four main paradigms that have been proposed for the foundation: [[frequentism]], [[Bayesianism]], [[likelihoodism]], and [[Akaike information criterion|AIC-based]].<ref name="BF11">{{Citation |editor1-last= Bandyopadhyay |editor1-first= P. S. |editor-first2= M. R. |editor-last2= Forster | title = Philosophy of Statistics | publisher= [[North-Holland Publishing]] | year = 2011 |mode=cs1 }}</ref> For each of the proposed foundations, the interpretation of likelihood is different. The four interpretations are described in the subsections below. ====Frequentist interpretation==== {{empty section|date=March 2019}} ====Bayesian interpretation==== In [[Bayesian inference]], although one can speak about the likelihood of any proposition or [[random variable]] given another random variable: for example the likelihood of a parameter value or of a [[statistical model]] (see [[marginal likelihood]]), given specified data or other evidence,<ref name='good1950'>I. J. Good: ''Probability and the Weighing of Evidence'' (Griffin 1950), §6.1</ref><ref name='jeffreys1983'>H. Jeffreys: ''Theory of Probability'' (3rd ed., Oxford University Press 1983), §1.22</ref><ref name='jaynes2003'>E. T. Jaynes: ''Probability Theory: The Logic of Science'' (Cambridge University Press 2003), §4.1</ref><ref name='lindley1980'>D. V. Lindley: ''Introduction to Probability and Statistics from a Bayesian Viewpoint. Part 1: Probability'' (Cambridge University Press 1980), §1.6</ref> the likelihood function remains the same entity, with the additional interpretations of (i) a [[Conditional probability distribution|conditional density]] of the data given the parameter (since the parameter is then a random variable) and (ii) a measure or amount of information brought by the data about the parameter value or even the model.<ref name='good1950'/><ref name='jeffreys1983'/><ref name='jaynes2003'/><ref name='lindley1980'/><ref name='gelmanetal2014'>A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, D. B. Rubin: ''Bayesian Data Analysis'' (3rd ed., Chapman & Hall/CRC 2014), §1.3</ref> Due to the introduction of a probability structure on the parameter space or on the collection of models, it is possible that a parameter value or a statistical model have a large likelihood value for given data, and yet have a low ''probability'', or vice versa.<ref name='jaynes2003'/><ref name='gelmanetal2014'/> This is often the case in medical contexts.<ref>{{citation |first1=H. C. |last1=Sox |first2=M. C. |last2=Higgins |first3=D. K. 
|last3=Owens |title=Medical Decision Making |edition=2nd |publisher=Wiley |year=2013 |doi=10.1002/9781118341544 |isbn=9781118341544 |at=chapters 3–4 }}</ref> Following [[Bayes' Rule]], the likelihood, when seen as a conditional density, can be multiplied by the [[prior probability]] density of the parameter and then normalized to give a [[posterior probability]] density.<ref name='good1950'/><ref name='jeffreys1983'/><ref name='jaynes2003'/><ref name='lindley1980'/><ref name="gelmanetal2014"/> More generally, the likelihood of an unknown quantity <math display="inline">X</math> given another unknown quantity <math display="inline">Y</math> is proportional to the ''probability of <math display="inline">Y</math> given <math display="inline">X</math>''.<ref name='good1950'/><ref name='jeffreys1983'/><ref name='jaynes2003'/><ref name='lindley1980'/><ref name='gelmanetal2014'/>

====Likelihoodist interpretation====
{{more footnotes needed|date=April 2019}}
In frequentist statistics, the likelihood function is itself a [[statistic]] that summarizes a single sample from a population; its calculated value depends on a choice of several parameters ''θ''<sub>1</sub> ... ''θ''<sub>p</sub>, where ''p'' is the count of parameters in some already-selected [[statistical model]]. The value of the likelihood serves as a figure of merit for the choice used for the parameters, and the parameter set with maximum likelihood is the best choice, given the data available. The specific calculation of the likelihood is the probability that the observed sample would be assigned, assuming that the model chosen and the values of the several parameters '''''θ''''' give an accurate approximation of the [[frequency distribution]] of the population that the observed sample was drawn from. Heuristically, it makes sense that a good choice of parameters is one that gives the sample actually observed the maximum possible ''post-hoc'' probability of having happened. [[Wilks' theorem]] quantifies this heuristic rule by showing that the difference between the logarithm of the likelihood generated by the estimate's parameter values and the logarithm of the likelihood generated by the population's "true" (but unknown) parameter values is asymptotically [[chi-squared distribution|χ<sup>2</sup> distributed]].

Each independent sample's maximum likelihood estimate is a separate estimate of the "true" parameter set describing the population sampled. Successive estimates from many independent samples will cluster together with the population's "true" set of parameter values hidden somewhere in their midst. The difference in the logarithms of the maximum likelihood and adjacent parameter sets' likelihoods may be used to draw a [[confidence region]] on a plot whose co-ordinates are the parameters ''θ''<sub>1</sub> ... ''θ''<sub>p</sub>. The region surrounds the maximum-likelihood estimate, and all points (parameter sets) within that region differ at most in log-likelihood by some fixed value. The [[chi-squared distribution|χ<sup>2</sup> distribution]] given by [[Wilks' theorem]] converts the region's log-likelihood differences into the "confidence" that the population's "true" parameter set lies inside. The art of choosing the fixed log-likelihood difference is to make the confidence acceptably high while keeping the region acceptably small (narrow range of estimates).
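As a rough illustrative sketch (not taken from the sources cited in this section), the following Python code applies this rule to a single binomial proportion, using hypothetical counts of 7 successes in 20 trials: it evaluates the log-likelihood over a grid and keeps every parameter value whose log-likelihood lies within 0.5·χ²<sub>0.95,1</sub> ≈ 1.92 of the maximum, which gives an approximate 95% likelihood interval.

<syntaxhighlight lang="python">
# Minimal sketch: a Wilks-type likelihood interval for a binomial proportion.
# The data (7 successes in 20 trials) are hypothetical, chosen only for illustration.
import numpy as np
from scipy.stats import chi2

k, n = 7, 20  # hypothetical observed successes and trials

def log_likelihood(theta):
    """Binomial log-likelihood in theta, up to an additive constant."""
    return k * np.log(theta) + (n - k) * np.log(1.0 - theta)

theta_grid = np.linspace(1e-6, 1 - 1e-6, 100_000)
loglik = log_likelihood(theta_grid)
theta_hat = theta_grid[np.argmax(loglik)]  # maximum-likelihood estimate, close to k/n

# Wilks' theorem: 2 * (max log-likelihood - log-likelihood) is asymptotically
# chi-squared with 1 degree of freedom, so an approximate 95% region keeps every
# theta whose log-likelihood is within this fixed difference of the maximum.
cutoff = 0.5 * chi2.ppf(0.95, df=1)  # about 1.92 log-likelihood units

inside = theta_grid[loglik >= loglik.max() - cutoff]
print(f"MLE = {theta_hat:.3f}, "
      f"95% likelihood interval = ({inside.min():.3f}, {inside.max():.3f})")
</syntaxhighlight>

For a single parameter the retained set is an interval; with several parameters the same fixed log-likelihood cutoff defines a region in the space of ''θ''<sub>1</sub> ... ''θ''<sub>p</sub>.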
As more data are observed, instead of being used to make independent estimates, they can be combined with the previous samples to make a single combined sample, and that large sample may be used for a new maximum likelihood estimate. As the size of the combined sample increases, the size of the likelihood region with the same confidence shrinks. Eventually, either the size of the confidence region is very nearly a single point, or the entire population has been sampled; in both cases, the estimated parameter set is essentially the same as the population parameter set.

====AIC-based interpretation====
{{expand section|date=March 2019}}
Under the [[Akaike information criterion|AIC]] paradigm, likelihood is interpreted within the context of [[information theory]].<ref>{{Citation | first=H. |last=Akaike |author-link=Hirotugu Akaike | contribution = Prediction and entropy | pages=1–24 | title= A Celebration of Statistics | editor1-first= A. C. | editor1-last= Atkinson | editor2-first= S. E. | editor2-last= Fienberg | editor2-link= Stephen Fienberg | year = 1985 | publisher= Springer |mode=cs1 }}</ref><ref>{{Citation | author1-first= Y. | author1-last= Sakamoto | author2-first= M. | author2-last= Ishiguro | author3-first= G. | author3-last= Kitagawa | title= Akaike Information Criterion Statistics | year= 1986 | publisher= [[D. Reidel]] | at= Part I |mode=cs1 }}</ref><ref>{{Citation |last1=Burnham |first1=K. P. |last2=Anderson |first2=D. R. |year=2002 |title=Model Selection and Multimodel Inference: A practical information-theoretic approach |edition=2nd |publisher= [[Springer-Verlag]] | at= chap. 7 |mode=cs1 }}</ref>

== See also ==
{{Columns-list|colwidth=20em|
* [[Bayes factor]]
* [[Conditional entropy]]
* [[Conditional probability]]
* [[Empirical likelihood]]
* [[Likelihood principle]]
* [[Likelihood-ratio test]]
* [[Likelihoodist statistics]]
* [[Maximum likelihood estimation]]
* [[Principle of maximum entropy]]
* [[Pseudolikelihood]]
* [[Score (statistics)]]
}}

==Notes==
{{notelist}}

==References==
{{Reflist}}

==Further reading==
*{{citation |first=Adelchi |last=Azzalini |chapter=Likelihood |title=Statistical Inference Based on the Likelihood |publisher=Chapman and Hall |year=1996 |isbn=0-412-60650-X |pages=17–50 |mode=cs1 }}
*{{citation |first1=Dennis D. |last1=Boos |first2=L. A. |last2=Stefanski |chapter=Likelihood Construction and Estimation |title=Essential Statistical Inference : Theory and Methods |location=New York |publisher=Springer |year=2013 |isbn=978-1-4614-4817-4 |pages=27–124 |doi=10.1007/978-1-4614-4818-1_2 |mode=cs1 }}
*{{citation |last=Edwards |first=A. W. F. |author-link=A. W. F. Edwards |orig-year=1972 |title=Likelihood |publisher=[[Johns Hopkins University Press]] |isbn=0-8018-4443-6 |edition=Expanded |year=1992 |mode=cs1 }}
*{{citation |last=King |first=Gary |author-link=Gary King (political scientist) |chapter=The Likelihood Model of Inference |title=Unifying Political Methodology : the Likelihood Theory of Statistical Inference |publisher=Cambridge University Press |year=1989 |isbn=0-521-36697-6 |pages=59–94 |mode=cs1 |chapter-url=https://books.google.com/books?id=cligOwrd7XoC&pg=PA59 }}
* {{cite journal |title= Efficiency Testing of Prediction Markets: Martingale Approach, Likelihood Ratio and Bayes Factor Analysis |date=1 February 2021 |first1=Mark |last1=Richard |first2=Jan |last2=Vecer |journal=Risks |volume=9 |issue=2|page= 31 |doi= 10.3390/risks9020031 |doi-access= free |hdl=10419/258120 |hdl-access=free }}
*{{citation |last=Lindsey |first=J. K. |chapter=Likelihood |title=Parametric Statistical Inference |publisher=Oxford University Press |year=1996 |isbn=0-19-852359-9 |pages=69–139 |chapter-url=https://archive.org/details/parametricstatis0000lind/page/69 |mode=cs1 }}
*{{citation |last=Rohde |first=Charles A. |title=Introductory Statistical Inference with the Likelihood Function |location=Berlin |publisher=Springer |year=2014 |isbn=978-3-319-10460-7 |mode=cs1 }}
*{{citation |last=Royall |first=Richard |title=Statistical Evidence : A Likelihood Paradigm |location=London |publisher=Chapman & Hall |year=1997 |isbn=0-412-04411-0 |mode=cs1 |url-access=registration |url=https://archive.org/details/statisticalevide0000roya }}
*{{citation |first1=Michael D. |last1=Ward |author-link=Michael D. Ward |first2=John S. |last2=Ahlquist |chapter=The Likelihood Function: A Deeper Dive |title=Maximum Likelihood for Social Science : Strategies for Analysis |publisher= [[Cambridge University Press]] |year=2018 |isbn=978-1-316-63682-4 |pages=21–28 |mode=cs1 |chapter-url=https://books.google.com/books?id=iqRyDwAAQBAJ&pg=PA21 }}

==External links==
{{Wiktionary|likelihood}}
* [http://planetmath.org/likelihoodfunction Likelihood function at Planetmath]
*{{cite web |title=Log-likelihood |url=https://www.statlect.com/glossary/log-likelihood |work=Statlect}}

{{Statistics|inference}}
{{Portal bar|Mathematics}}

{{DEFAULTSORT:Likelihood Function}}
[[Category:Likelihood| ]]
[[Category:Bayesian statistics]]