Likelihood function
==Definition== The likelihood function, parameterized by a (possibly multivariate) parameter <math display="inline">\theta</math>, is usually defined differently for [[Continuous or discrete variable|discrete and continuous]] [[Probability distribution|probability distributions]] (a more general definition is discussed below). Given a probability density or mass function <math display="block">x\mapsto f(x \mid \theta),</math> where <math display="inline">x</math> is a realization of the random variable <math display="inline">X</math>, the likelihood function is <math display="block">\theta\mapsto f(x \mid \theta),</math> often written <math display="block">\mathcal{L}(\theta \mid x). </math> In other words, when <math display="inline">f(x\mid\theta)</math> is viewed as a function of <math display="inline">x</math> with <math display="inline">\theta</math> fixed, it is a probability density function, and when viewed as a function of <math display="inline">\theta</math> with <math display="inline">x</math> fixed, it is a likelihood function. In the [[Frequentist_probability|frequentist paradigm]], the notation <math display="inline">f(x\mid\theta)</math> is often avoided and instead <math display="inline">f(x;\theta)</math> or <math display="inline">f(x,\theta)</math> are used to indicate that <math display="inline">\theta</math> is regarded as a fixed unknown quantity rather than as a [[random variable]] being conditioned on. The likelihood function does ''not'' specify the probability that <math display="inline">\theta</math> is the truth, given the observed sample <math display="inline">X = x</math>. Such an interpretation is a common error, with potentially disastrous consequences (see [[prosecutor's fallacy]]). ===Discrete probability distribution=== Let <math display="inline">X</math> be a discrete [[random variable]] with [[probability mass function]] <math display="inline">p</math> depending on a parameter <math display="inline">\theta</math>. 
Then the function <math display="block">\mathcal{L}(\theta \mid x) = p_\theta (x) = P_\theta (X=x), </math> considered as a function of <math display="inline">\theta</math>, is the ''likelihood function'', given the [[Outcome (probability)|outcome]] <math display="inline">x</math> of the random variable <math display="inline">X</math>. Sometimes the probability of "the value <math display="inline">x</math> of <math display="inline">X</math> for the parameter value <math display="inline">\theta</math>{{resize|20%| }}" is written as {{math|''P''(''X'' {{=}} ''x'' {{!}} ''θ'')}} or {{math|''P''(''X'' {{=}} ''x''; ''θ'')}}. The likelihood is the probability that a particular outcome <math display="inline">x</math> is observed when the true value of the parameter is <math display="inline">\theta</math>, equivalent to the probability mass on <math display="inline">x</math>; it is ''not'' a probability density over the parameter <math display="inline">\theta</math>. The likelihood, <math display="inline">\mathcal{L}(\theta \mid x) </math>, should not be confused with <math display="inline">P(\theta \mid x)</math>, which is the posterior probability of <math display="inline">\theta</math> given the data <math display="inline">x</math>. ====Example==== [[Image:likelihoodFunctionAfterHH.png|thumb|400px|Figure 1. The likelihood function (<math display="inline">p_\text{H}^2</math>) for the probability of a coin landing heads-up (without prior knowledge of the coin's fairness), given that we have observed HH.]] [[Image:likelihoodFunctionAfterHHT.png|thumb|400px|Figure 2. The likelihood function (<math display="inline">p_\text{H}^2(1-p_\text{H})</math>) for the probability of a coin landing heads-up (without prior knowledge of the coin's fairness), given that we have observed HHT.]] Consider a simple statistical model of a coin flip: a single parameter <math display="inline">p_\text{H}</math> that expresses the "fairness" of the coin. 
The parameter is the probability that a coin lands heads up ("H") when tossed. <math display="inline">p_\text{H}</math> can take on any value within the range 0.0 to 1.0. For a perfectly [[fair coin]], <math display="inline">p_\text{H} = 0.5</math>. Imagine flipping a fair coin twice, and observing two heads in two tosses ("HH"). Assuming that each successive coin flip is [[Independent and identically distributed random variables|i.i.d.]], the probability of observing HH is <math display="block">P(\text{HH} \mid p_\text{H}=0.5) = 0.5^2 = 0.25.</math> Equivalently, the likelihood of observing "HH" assuming <math display="inline">p_\text{H} = 0.5</math> is <math display="block">\mathcal{L}(p_\text{H}=0.5 \mid \text{HH}) = 0.25.</math> This is not the same as saying that <math display="inline">P(p_\text{H} = 0.5 \mid \text{HH}) = 0.25</math>, a conclusion which could only be reached via [[Bayes' theorem]] given knowledge about the marginal probabilities <math display="inline">P(p_\text{H} = 0.5)</math> and <math display="inline">P(\text{HH})</math>. Now suppose that the coin is not a fair coin, but instead that <math display="inline">p_\text{H} = 0.3</math>. Then the probability of two heads on two flips is <math display="block">P(\text{HH} \mid p_\text{H}=0.3) = 0.3^2 = 0.09.</math> Hence <math display="block">\mathcal{L}(p_\text{H}=0.3 \mid \text{HH}) = 0.09.</math> More generally, for each value of <math display="inline">p_\text{H}</math>, we can calculate the corresponding likelihood. The result of such calculations is displayed in Figure 1. The integral of <math display="inline">\mathcal{L}</math> over [0, 1] is 1/3; likelihoods need not integrate or sum to one over the parameter space. 
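The coin-flip calculations above can be checked numerically. A minimal sketch (the grid size is an arbitrary choice) evaluates the likelihood p_H² at the two parameter values discussed and approximates its integral over [0, 1]:

```python
def likelihood_hh(p_h):
    """Likelihood of observing HH in two i.i.d. flips, as a function of p_h."""
    return p_h ** 2

# Likelihood at the two parameter values discussed in the text.
print(likelihood_hh(0.5))  # 0.25
print(likelihood_hh(0.3))  # 0.09, up to floating-point rounding

# The likelihood is not a density over p_h: a midpoint-rule approximation of
# its integral over [0, 1] gives 1/3, not 1.
n = 100_000
integral = sum(likelihood_hh((i + 0.5) / n) for i in range(n)) / n
print(round(integral, 6))  # 0.333333
```

The integral reproduces the 1/3 stated in the text, confirming that the likelihood curve of Figure 1 is not a probability density over the parameter.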
===Continuous probability distribution=== Let <math display="inline">X</math> be a [[random variable]] following an [[Probability distribution#Continuous probability distribution|absolutely continuous probability distribution]] with [[probability density function|density function]] <math display="inline">f</math> (a function of <math display="inline">x</math>) which depends on a parameter <math display="inline">\theta</math>. Then the function <math display="block">\mathcal{L}(\theta \mid x) = f_\theta (x), </math> considered as a function of <math display="inline">\theta</math>, is the ''likelihood function'' (of <math display="inline">\theta</math>, given the [[Outcome (probability)|outcome]] <math display="inline">X=x</math>). Again, <math display="inline">\mathcal{L}</math> is not a probability density or mass function over <math display="inline">\theta</math>, despite being a function of <math display="inline">\theta</math> given the observation <math display="inline">X = x</math>. ====Relationship between the likelihood and probability density functions==== The use of the [[probability density function|probability density]] in specifying the likelihood function above is justified as follows. Given an observation <math display="inline">x_j</math>, the likelihood for the interval <math display="inline">[x_j, x_j + h]</math>, where <math display="inline">h > 0</math> is a constant, is given by <math display="inline"> \mathcal{L}(\theta\mid x \in [x_j, x_j + h]) </math>. Observe that <math display="block"> \mathop\operatorname{arg\,max}_\theta \mathcal{L}(\theta\mid x \in [x_j, x_j + h]) = \mathop\operatorname{arg\,max}_\theta \frac{1}{h} \mathcal{L}(\theta\mid x \in [x_j, x_j + h]) ,</math> since <math display="inline"> h </math> is positive and constant. 
Because <math display="block"> \mathop\operatorname{arg\,max}_\theta \frac 1 h \mathcal{L}(\theta\mid x \in [x_j, x_j + h]) = \mathop\operatorname{arg\,max}_\theta \frac 1 h \Pr(x_j \leq x \leq x_j + h \mid \theta) = \mathop\operatorname{arg\,max}_\theta \frac 1 h \int_{x_j}^{x_j+h} f(x\mid \theta) \,dx, </math> where <math display="inline"> f(x\mid \theta) </math> is the probability density function, it follows that <math display="block"> \mathop\operatorname{arg\,max}_\theta \mathcal{L}(\theta\mid x \in [x_j, x_j + h]) = \mathop\operatorname{arg\,max}_\theta \frac{1}{h} \int_{x_j}^{x_j+h} f(x\mid\theta) \,dx .</math> The first [[fundamental theorem of calculus]] provides that <math display="block"> \lim_{h \to 0^{+}} \frac 1 h \int_{x_j}^{x_j+h} f(x\mid\theta) \,dx = f(x_j \mid \theta). </math> Then <math display="block"> \begin{align} \mathop\operatorname{arg\,max}_\theta \mathcal{L}(\theta\mid x_j) &= \mathop\operatorname{arg\,max}_\theta \left[ \lim_{h\to 0^{+}} \mathcal{L}(\theta\mid x \in [x_j, x_j + h]) \right] \\[4pt] &= \mathop\operatorname{arg\,max}_\theta \left[ \lim_{h\to 0^{+}} \frac{1}{h} \int_{x_j}^{x_j+h} f(x\mid\theta) \,dx \right] \\[4pt] &= \mathop\operatorname{arg\,max}_\theta f(x_j \mid \theta). \end{align} </math> Therefore, <math display="block"> \mathop\operatorname{arg\,max}_\theta \mathcal{L}(\theta\mid x_j) = \mathop\operatorname{arg\,max}_\theta f(x_j \mid \theta), </math> and so maximizing the probability density at <math display="inline"> x_j </math> amounts to maximizing the likelihood of the specific observation <math display="inline"> x_j </math>. 
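The conclusion can be illustrated with a sketch (the observation value and the normal location model are hypothetical, chosen only for illustration): for a single observation from a normal distribution with unit variance, maximizing the likelihood over the location parameter by grid search recovers the observation itself, exactly where the density at <math display="inline">x_j</math> is largest.

```python
import math

def normal_density(x, theta, sigma=1.0):
    """Density of a Normal(theta, sigma^2) distribution evaluated at x."""
    z = (x - theta) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

x_obs = 1.7  # hypothetical single observation

# Grid search over theta for the maximizer of theta -> f(x_obs | theta).
grid = [i / 1000 for i in range(-5000, 5001)]  # theta in [-5, 5], step 0.001
theta_hat = max(grid, key=lambda theta: normal_density(x_obs, theta))
print(theta_hat)  # 1.7 -- the likelihood peaks where the density at x_obs peaks
```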
===In general=== In [[Probability theory#Measure-theoretic probability theory|measure-theoretic probability theory]], the [[Probability density function|density function]] is defined as the [[Radon–Nikodym theorem|Radon–Nikodym derivative]] of the probability distribution relative to a common dominating measure.<ref>{{citation |first=Patrick |last=Billingsley | author-link= Patrick Billingsley|title=Probability and Measure |publisher= [[John Wiley & Sons]] |edition=Third |year=1995 |pages=422–423 |mode=cs1 }}</ref> The likelihood function is this density interpreted as a function of the parameter, rather than the random variable.<ref name="Shao03">{{citation| first= Jun| last= Shao| year= 2003 | title= Mathematical Statistics | edition= 2nd | publisher= Springer | at= §4.4.1 |mode=cs1 }}</ref> Thus, we can construct a likelihood function for any distribution, whether discrete, continuous, a mixture, or otherwise. (Likelihoods are comparable, e.g. for parameter estimation, only if they are Radon–Nikodym derivatives with respect to the same dominating measure.) The above discussion of the likelihood for discrete random variables uses the [[counting measure]], under which the probability density at any outcome equals the probability of that outcome. ===Likelihoods for mixed continuous–discrete distributions=== The above can be extended in a simple way to allow consideration of distributions which contain both discrete and continuous components. Suppose that the distribution consists of a number of discrete probability masses <math display="inline">p_k (\theta)</math> and a density <math display="inline">f(x\mid\theta)</math>, where the sum of all the <math display="inline">p</math>'s added to the integral of <math display="inline">f</math> is always one. 
Assuming that it is possible to distinguish an observation corresponding to one of the discrete probability masses from one which corresponds to the density component, the likelihood function for an observation from the continuous component can be dealt with in the manner shown above. The likelihood function for an observation from the discrete component is simply <math display="block">\mathcal{L}(\theta \mid x )= p_k(\theta), </math> where <math display="inline">k</math> is the index of the discrete probability mass corresponding to observation <math display="inline">x</math>, because maximizing the probability mass (or probability) at <math display="inline">x</math> amounts to maximizing the likelihood of the specific observation. The fact that the likelihood function can be defined in a way that includes contributions that are not commensurate (the density and the probability mass) arises from the way in which the likelihood function is defined up to a constant of proportionality, where this "constant" can change with the observation <math display="inline">x</math>, but not with the parameter <math display="inline">\theta</math>. === Regularity conditions === In the context of parameter estimation, the likelihood function is usually assumed to obey certain conditions, known as regularity conditions. These conditions are {{em|assumed}} in various proofs involving likelihood functions, and need to be verified in each particular application. For maximum likelihood estimation, the existence of a global maximum of the likelihood function is of the utmost importance. 
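Returning to the mixed continuous–discrete case above, a minimal sketch (the zero-inflated exponential model and its parameter values are hypothetical, not taken from the text): the likelihood returns the probability mass when the observation hits the discrete atom, and the density value otherwise.

```python
import math

def likelihood(theta, x):
    """Likelihood of one observation under a zero-inflated exponential model:
    a discrete mass p0 at x == 0 plus a density (1 - p0) * lam * exp(-lam * x)
    on x > 0, so the masses and the density together account for probability one."""
    p0, lam = theta
    if x == 0:
        return p0  # discrete component: a probability mass
    return (1.0 - p0) * lam * math.exp(-lam * x)  # continuous component: a density

theta = (0.2, 1.5)  # hypothetical parameter values
print(likelihood(theta, 0))    # 0.2 -- the mass at the atom
print(likelihood(theta, 2.0))  # (0.8)(1.5)e^{-3}, a density value
```

The two return values are not commensurate (one is a probability, the other a density), which is exactly the situation the proportionality argument above accommodates.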
By the [[extreme value theorem]], it suffices that the likelihood function is [[Continuous function|continuous]] on a [[compactness|compact]] parameter space for the maximum likelihood estimator to exist.<ref>{{cite book |first1=Christian |last1=Gouriéroux |author-link=Christian Gouriéroux |first2=Alain |last2=Monfort |year=1995 |title=Statistics and Econometric Models |location=New York |publisher=Cambridge University Press |isbn=0-521-40551-3 |page=161 |url=https://books.google.com/books?id=gqI-pAP2JZ8C&pg=PA161 }}</ref> While the continuity assumption is usually met, the compactness assumption about the parameter space is often not, as the bounds of the true parameter values might be unknown. In that case, [[Concave function|concavity]] of the likelihood function plays a key role. More specifically, if the likelihood function is twice continuously differentiable on the <var>k</var>-dimensional parameter space <math display="inline"> \Theta </math> assumed to be an [[Open set|open]] [[Connected space|connected]] subset of <math display="inline"> \mathbb{R}^{k} \,,</math> there exists a unique maximum <math display="inline">\hat{\theta} \in \Theta</math> if the [[Hessian matrix|matrix of second partials]] <math display="block"> \mathbf{H}(\theta) \equiv \left[\, \frac{ \partial^2 L }{\, \partial \theta_i \, \partial \theta_j \,} \,\right]_{i,j=1}^{k} \;</math> is [[negative definite]] for every <math display="inline">\, \theta \in \Theta \,</math> at which the gradient <math display="inline">\; \nabla L \equiv \left[\, \frac{ \partial L }{\, \partial \theta_i \,} \,\right]_{i=1}^{k} \;</math> vanishes, and if the likelihood function vanishes on the [[Boundary (topology)|boundary]] of the parameter space, <math display="inline">\; \partial \Theta \;,</math> i.e., <math display="block"> \lim_{\theta \to \partial \Theta} L(\theta) = 0 \;,</math> which may include the points at infinity if <math display="inline"> \,
\Theta \, </math> is unbounded. Mäkeläinen and co-authors prove this result using [[Morse theory]] while informally appealing to a mountain pass property.<ref>{{cite journal |first1=Timo |last1=Mäkeläinen |first2=Klaus |last2=Schmidt |first3=George P.H. |last3=Styan |year=1981 |title=On the existence and uniqueness of the maximum likelihood estimate of a vector-valued parameter in fixed-size samples |journal=[[Annals of Statistics]] |volume=9 |issue=4 |pages=758–767 |doi=10.1214/aos/1176345516 |jstor=2240844 |doi-access=free }}</ref> Mascarenhas restates their proof using the [[mountain pass theorem]].<ref>{{cite journal |first=W.F. |last=Mascarenhas |year=2011 |title=A mountain pass lemma and its implications regarding the uniqueness of constrained minimizers |journal=Optimization |volume=60 |issue=8–9 |pages=1121–1159 |doi=10.1080/02331934.2010.527973 |s2cid=15896597 }}</ref> In the proofs of [[Consistent estimator|consistency]] and asymptotic normality of the maximum likelihood estimator, additional assumptions are made about the probability densities that form the basis of a particular likelihood function. These conditions were first established by Chanda.<ref>{{cite journal |first=K.C. |last=Chanda |year=1954 |title=A note on the consistency and maxima of the roots of likelihood equations |journal=[[Biometrika]] |volume=41 |issue=1–2 |pages=56–61 |doi=10.2307/2333005 |jstor=2333005 }}</ref> In particular, for [[almost all]] <math display="inline">x</math>, and for all <math display="inline">\, \theta \in \Theta \,,</math> <math display="block">\frac{\partial \log f}{\partial \theta_r} \,, \quad \frac{\partial^2 \log f}{\partial \theta_r \partial \theta_s} \,, \quad \frac{\partial^3 \log f}{\partial \theta_r \, \partial \theta_s \, \partial \theta_t} \,</math> exist for all <math display="inline">\, r, s, t = 1, 2, \ldots, k \,</math> in order to ensure the existence of a [[Taylor expansion]]. 
Second, for almost all <math display="inline">x</math> and for every <math display="inline">\, \theta \in \Theta \,</math> it must be that <math display="block"> \left| \frac{\partial f}{\partial \theta_r} \right| < F_r(x) \,, \quad \left| \frac{\partial^2 f}{\partial \theta_r \, \partial \theta_s} \right| < F_{rs}(x) \,, \quad \left| \frac{\partial^3 f}{\partial \theta_r \, \partial \theta_s \, \partial \theta_t} \right| < H_{rst}(x) </math> where <math display="inline">H</math> is such that <math display="inline">\, \int_{-\infty}^{\infty} H_{rst}(z) \mathrm{d}z \leq M < \infty \;.</math> This boundedness of the derivatives is needed to allow for [[differentiation under the integral sign]]. Lastly, it is assumed that the [[information matrix]], <math display="block">\mathbf{I}(\theta) = \int_{-\infty}^{\infty} \frac{\partial \log f}{\partial \theta_r}\ \frac{\partial \log f}{\partial \theta_s}\ f\ \mathrm{d}z </math> is [[positive definite]] and <math display="inline">\, \left| \mathbf{I}(\theta) \right| \,</math> is finite. This ensures that the [[Score (statistics)|score]] has a finite variance.<ref>{{cite book |first1=Edward |last1=Greenberg |first2=Charles E. Jr. |last2=Webster |title=Advanced Econometrics: A Bridge to the Literature |location=New York, NY |publisher=John Wiley & Sons |year=1983 |isbn=0-471-09077-8 |pages=24–25 }}</ref> The above conditions are sufficient, but not necessary. That is, a model that does not meet these regularity conditions may or may not have a maximum likelihood estimator with the properties mentioned above. Further, in the case of non-independent or non-identically distributed observations, additional properties may need to be assumed. In Bayesian statistics, almost identical regularity conditions are imposed on the likelihood function in order to prove asymptotic normality of the [[posterior probability]],<ref>{{cite journal |first1=C. C. |last1=Heyde |first2=I. M.
|last2=Johnstone |title=On Asymptotic Posterior Normality for Stochastic Processes |journal=Journal of the Royal Statistical Society |series=Series B (Methodological) |volume=41 |issue=2 |year=1979 |pages=184–189 |doi=10.1111/j.2517-6161.1979.tb01071.x }}</ref><ref>{{cite journal |first=Chan-Fu |last=Chen |title=On Asymptotic Normality of Limiting Density Functions with Bayesian Implications |journal=Journal of the Royal Statistical Society |series=Series B (Methodological) |volume=47 |issue=3 |year=1985 |pages=540–546 |doi=10.1111/j.2517-6161.1985.tb01384.x }}</ref> and therefore to justify a [[Laplace approximation]] of the posterior in large samples.<ref>{{cite book |first1=Robert E. |last1=Kass |first2=Luke |last2=Tierney |first3=Joseph B. |last3=Kadane |chapter=The Validity of Posterior Expansions Based on Laplace's Method |editor-first=S. |editor-last=Geisser |editor2-first=J. S. |editor2-last=Hodges |editor3-first=S. J. |editor3-last=Press |editor4-first=A. |editor4-last=Zellner |pages=473–488 |publisher=Elsevier |title=Bayesian and Likelihood Methods in Statistics and Econometrics |year=1990 |isbn=0-444-88376-2 }}</ref>
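The positive-definiteness of the information matrix can be made concrete in the scalar case. A sketch (the Bernoulli model is chosen here purely for illustration): for a single Bernoulli(''p'') observation, computing <math display="inline">\mathbf{I}(p) = \operatorname{E}\left[(\partial \log f / \partial p)^2\right]</math> by summing over the two outcomes reproduces the standard closed form <math display="inline">1/(p(1-p))</math>, which is positive, so the score has finite, positive variance.

```python
def bernoulli_fisher_info(p):
    """Fisher information for one Bernoulli(p) observation, from the definition
    I(p) = E[(d/dp log f(X | p))^2], summing over the outcomes x in {0, 1}."""
    info = 0.0
    for x in (0, 1):
        pmf = p if x == 1 else 1.0 - p
        # The score d/dp log f(x | p) is 1/p for x = 1 and -1/(1 - p) for x = 0.
        score = 1.0 / p if x == 1 else -1.0 / (1.0 - p)
        info += score * score * pmf
    return info

p = 0.3  # an arbitrary interior parameter value
print(bernoulli_fisher_info(p))   # agrees with the closed form below
print(1.0 / (p * (1.0 - p)))
```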