==Log-likelihood==
{{see also|Log-probability}}

The ''log-likelihood function'' is the logarithm of the likelihood function, often denoted by a lowercase {{math|''l''}} or {{tmath|\ell}}, to contrast with the uppercase {{math|''L''}} or <math display="inline">\mathcal{L}</math> for the likelihood. Because logarithms are [[strictly increasing]] functions, maximizing the likelihood is equivalent to maximizing the log-likelihood. For practical purposes, however, it is more convenient to work with the log-likelihood function in [[maximum likelihood estimation]], in particular since most common [[probability distribution]]s – notably the [[exponential family]] – are only [[Logarithmically concave function|logarithmically concave]],<ref>{{citation |first1=Robert E. |last1=Kass |first2=Paul W. |last2=Vos |title=Geometrical Foundations of Asymptotic Inference |location=New York |publisher=John Wiley & Sons |year=1997 |isbn=0-471-82668-5 |page=14 |url=https://books.google.com/books?id=e43EAIfUPCwC&pg=PA14 |mode=cs1 }}</ref><ref>{{cite web |first=Alecos |last=Papadopoulos |title=Why we always put log() before the joint pdf when we use MLE (Maximum likelihood Estimation)? |date=September 25, 2013 |work=[[Stack Exchange]] |url=https://stats.stackexchange.com/q/70975 }}</ref> and [[Concave function|concavity]] of the [[objective function]] plays a key role in the [[Mathematical optimization|maximization]].

Given independent events, the overall log-likelihood of their intersection equals the sum of the log-likelihoods of the individual events, just as the overall [[log-probability]] is the sum of the log-probabilities of the individual events. Beyond this mathematical convenience, the additivity of the log-likelihood has an intuitive interpretation, often expressed as "support" from the data. When the parameters are estimated by [[maximum likelihood estimation]], each data point contributes to the estimate by being added to the total log-likelihood. Since the data can be viewed as evidence that supports the estimated parameters, this process can be interpreted as "support from independent evidence ''adds''", and the log-likelihood is the "weight of evidence". Interpreting negative log-probability as [[information content]] or [[surprisal]], the support (log-likelihood) of a model, given an event, is the negative of the surprisal of the event, given the model: a model is supported by an event to the extent that the event is unsurprising, given the model.

The logarithm of a likelihood ratio equals the difference of the log-likelihoods:
<math display="block">\log \frac{\mathcal{L}(A)}{\mathcal{L}(B)} = \log \mathcal{L}(A) - \log \mathcal{L}(B) = \ell(A) - \ell(B).</math>
Just as the likelihood given no event is 1, the log-likelihood given no event is 0, which corresponds to the value of the empty sum: without any data, there is no support for any model.
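As a concrete illustration (a minimal sketch, not drawn from the cited sources, assuming NumPy and SciPy and a hypothetical normal sample <code>x</code> with unit variance and unknown mean), the following checks numerically that the joint log-likelihood of independent observations is the sum of the individual log-densities, and that maximizing the log-likelihood selects the same parameter value as maximizing the likelihood itself:

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=100)   # hypothetical i.i.d. sample

mu_grid = np.linspace(0.0, 4.0, 401)           # candidate values of the mean

# Joint log-likelihood of independent observations = sum of log-densities.
loglik = np.array([stats.norm.logpdf(x, loc=mu, scale=1.0).sum() for mu in mu_grid])
# Joint likelihood = product of densities (prone to underflow for large samples).
lik = np.array([stats.norm.pdf(x, loc=mu, scale=1.0).prod() for mu in mu_grid])

# The maximizers coincide, and both are close to the sample mean.
print(mu_grid[np.argmax(loglik)], mu_grid[np.argmax(lik)], x.mean())
</syntaxhighlight>

The grid search is used only to keep the sketch self-contained; it also shows why logarithms are preferred in practice, since the raw likelihood is a product of many small factors.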
===Graph===
The [[Graph of a function|graph]] of the log-likelihood is called the '''support curve''' (in the [[univariate]] case).<ref name="Edwards72">{{cite book|last=Edwards|first=A. W. F.|authorlink=A. W. F. Edwards|orig-date=1972|year=1992|title=Likelihood|publisher=[[Johns Hopkins University Press]]|isbn=0-8018-4443-6}}</ref> In the multivariate case, the concept generalizes into a '''support surface''' over the [[parameter space]]. It is related to, but distinct from, the [[Support (mathematics)#Support (statistics)|support of a distribution]]. The term was coined by [[A. W. F. Edwards]]<ref name="Edwards72" /> in the context of [[statistical hypothesis testing]], i.e. whether or not the data "support" one hypothesis (or parameter value) being tested more than any other.

The log-likelihood function being plotted is used in the computation of the [[score (statistics)|score]] (the gradient of the log-likelihood) and [[Fisher information]] (the curvature of the log-likelihood). Thus, the graph has a direct interpretation in the context of [[maximum likelihood estimation]] and [[likelihood-ratio test]]s.

===Likelihood equations===
If the log-likelihood function is [[Smoothness|smooth]], its [[gradient]] with respect to the parameter, known as the [[Score (statistics)|score]] and written <math display="inline">s_{n}(\theta) \equiv \nabla_{\theta} \ell_{n}(\theta)</math>, exists and allows for the application of [[differential calculus]]. The basic way to maximize a differentiable function is to find the [[stationary point]]s (the points where the [[derivative]] is zero); since the derivative of a sum is just the sum of the derivatives, whereas the derivative of a product requires the [[product rule]], it is easier to compute the stationary points of the log-likelihood of independent events than of their likelihood. The equations defined by the stationary points of the score function serve as [[estimating equations]] for the maximum likelihood estimator:
<math display="block">s_{n}(\theta) = \mathbf{0}.</math>
In that sense, the maximum likelihood estimator is implicitly defined by the value at <math display="inline">\mathbf{0}</math> of the [[inverse function]] <math display="inline">s_{n}^{-1}: \mathbb{E}^{d} \to \Theta</math>, where <math display="inline">\mathbb{E}^{d}</math> is the <var>d</var>-dimensional [[Euclidean space]], and <math display="inline">\Theta</math> is the parameter space. Using the [[inverse function theorem]], it can be shown that <math display="inline">s_{n}^{-1}</math> is [[well-defined]] in an [[open neighborhood]] about <math display="inline">\mathbf{0}</math> with probability going to one, and <math display="inline">\hat{\theta}_{n} = s_{n}^{-1}(\mathbf{0})</math> is a consistent estimate of <math display="inline">\theta</math>. As a consequence there exists a sequence <math display="inline">\left\{ \hat{\theta}_{n} \right\}</math> such that <math display="inline">s_{n}(\hat{\theta}_{n}) = \mathbf{0}</math> asymptotically [[almost surely]], and <math display="inline">\hat{\theta}_{n} \xrightarrow{\text{p}} \theta_{0}</math>.<ref>{{cite journal |first=Robert V. |last=Foutz |title=On the Unique Consistent Solution to the Likelihood Equations |journal=[[Journal of the American Statistical Association]] |volume=72 |year=1977 |issue=357 |pages=147–148 |doi=10.1080/01621459.1977.10479926 }}</ref> A similar result can be established using [[Rolle's theorem]].<ref>{{cite journal |first1=Robert E. |last1=Tarone |first2=Gary |last2=Gruenhage |title=A Note on the Uniqueness of Roots of the Likelihood Equations for Vector-Valued Parameters |journal=Journal of the American Statistical Association |volume=70 |year=1975 |issue=352 |pages=903–904 |doi=10.1080/01621459.1975.10480321 }}</ref><ref>{{cite journal |first1=Kamta |last1=Rai |first2=John |last2=Van Ryzin |title=A Note on a Multivariate Version of Rolle's Theorem and Uniqueness of Maximum Likelihood Roots |journal=Communications in Statistics |series=Theory and Methods |volume=11 |year=1982 |issue=13 |pages=1505–1510 |doi=10.1080/03610928208828325 }}</ref>

The second derivative evaluated at <math display="inline">\hat{\theta}</math>, known as [[Fisher information]], determines the curvature of the likelihood surface,<ref>{{citation |first=B. Raja |last=Rao |title=A formula for the curvature of the likelihood surface of a sample drawn from a distribution admitting sufficient statistics |journal=[[Biometrika]] |volume=47 |issue=1–2 |year=1960 |pages=203–207 |doi=10.1093/biomet/47.1-2.203 |mode=cs1 }}</ref> and thus indicates the [[Precision (statistics)|precision]] of the estimate.<ref>{{citation |first1=Michael D. |last1=Ward |first2=John S. |last2=Ahlquist |title=Maximum Likelihood for Social Science: Strategies for Analysis |publisher=[[Cambridge University Press]] |year=2018 |pages=25–27 |mode=cs1 }}</ref>
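As an illustration of the likelihood equation in one dimension (a minimal sketch, not taken from the cited sources, assuming NumPy and SciPy and a hypothetical exponential sample with unknown rate <math display="inline">\lambda</math>), the score can be solved numerically and the observed Fisher information evaluated at the root:

<syntaxhighlight lang="python">
import numpy as np
from scipy import optimize

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=200)   # hypothetical sample; true rate is 0.5

n, s = x.size, x.sum()

def score(lam):
    # Gradient of the exponential log-likelihood  n*log(lam) - lam*sum(x).
    return n / lam - s

# Likelihood equation: score(lambda) = 0, solved numerically on a bracket.
lam_hat = optimize.brentq(score, 1e-6, 100.0)
print(lam_hat, 1.0 / x.mean())             # both equal the analytic MLE 1 / sample mean

# Observed Fisher information: negative second derivative of the log-likelihood
# at the root, i.e. the curvature that indicates the precision of the estimate.
observed_info = n / lam_hat**2
print(observed_info)
</syntaxhighlight>

In this model the root is also available in closed form, so the numerical solution only mirrors the general recipe: find the zero of the score, then read off the curvature there.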
===Exponential families===
{{further|Exponential family}}

The log-likelihood is also particularly useful for [[exponential families]] of distributions, which include many of the common [[parametric model|parametric probability distributions]]. The probability density function (and thus the likelihood function) of an exponential family contains products of factors involving [[exponentiation]]. The logarithm of such a function is a sum of products, again easier to differentiate than the original function.

An exponential family is one whose probability density function is of the form (for some functions, writing <math display="inline">\langle -, - \rangle</math> for the [[inner product]]):
<math display="block"> p(x \mid \boldsymbol \theta) = h(x) \exp\Big(\langle \boldsymbol\eta({\boldsymbol \theta}), \mathbf{T}(x)\rangle - A({\boldsymbol \theta}) \Big).</math>
Each of these terms has an interpretation,{{efn|See {{slink|Exponential family|Interpretation}}}} but simply switching from probability to likelihood and taking logarithms yields the sum:
<math display="block"> \ell(\boldsymbol \theta \mid x) = \langle \boldsymbol\eta({\boldsymbol \theta}), \mathbf{T}(x)\rangle - A({\boldsymbol \theta}) + \log h(x).</math>
The <math display="inline">\boldsymbol \eta(\boldsymbol \theta)</math> and <math display="inline">h(x)</math> each correspond to a [[change of coordinates]], so in these coordinates, the log-likelihood of an exponential family is given by the simple formula:
<math display="block"> \ell(\boldsymbol \eta \mid x) = \langle \boldsymbol\eta, \mathbf{T}(x)\rangle - A({\boldsymbol \eta}).</math>
In words, the log-likelihood of an exponential family is the inner product of the natural parameter {{tmath|\boldsymbol\eta}} and the [[sufficient statistic]] {{tmath|\mathbf{T}(x)}}, minus the normalization factor ([[log-partition function]]) {{tmath|A({\boldsymbol \eta})}}. Thus for example the maximum likelihood estimate can be computed by taking derivatives of the sufficient statistic {{math|''T''}} and the log-partition function {{math|''A''}}.
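For instance, for <math display="inline">n</math> independent observations the terms <math display="inline">\log h(x_i)</math> do not depend on the parameter, so up to that additive constant the log-likelihood in natural coordinates is (a standard computation, sketched here without regularity details)
<math display="block">\ell(\boldsymbol\eta \mid x_1, \ldots, x_n) = \Big\langle \boldsymbol\eta, \sum_{i=1}^n \mathbf{T}(x_i) \Big\rangle - n A(\boldsymbol\eta), \qquad \nabla_{\boldsymbol\eta}\, \ell = \mathbf{0} \iff \nabla A(\hat{\boldsymbol\eta}) = \frac{1}{n} \sum_{i=1}^n \mathbf{T}(x_i),</math>
so the likelihood equation matches the gradient of the log-partition function to the sample mean of the sufficient statistic.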
====Example: the gamma distribution====
The [[gamma distribution]] is an exponential family with two parameters, <math display="inline">\alpha</math> and <math display="inline">\beta</math>. The likelihood function is
<math display="block">\mathcal{L} (\alpha, \beta \mid x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}.</math>
Finding the maximum likelihood estimate of <math display="inline">\beta</math> for a single observed value <math display="inline">x</math> looks rather daunting. Its logarithm is much simpler to work with:
<math display="block">\log \mathcal{L}(\alpha,\beta \mid x) = \alpha \log \beta - \log \Gamma(\alpha) + (\alpha-1) \log x - \beta x.</math>
To maximize the log-likelihood, we first take the [[partial derivative]] with respect to <math display="inline">\beta</math>:
<math display="block">\frac{\partial \log \mathcal{L}(\alpha,\beta \mid x)}{\partial \beta} = \frac{\alpha}{\beta} - x.</math>
If there are a number of independent observations <math display="inline">x_1, \ldots, x_n</math>, then the joint log-likelihood is the sum of the individual log-likelihoods, and the derivative of this sum is the sum of the derivatives of the individual log-likelihoods:
<math display="block"> \begin{align} & \frac{\partial \log \mathcal{L}(\alpha,\beta \mid x_1, \ldots, x_n)}{\partial \beta} \\ &= \frac{\partial \log \mathcal{L}(\alpha,\beta \mid x_1)}{\partial \beta} + \cdots + \frac{\partial \log \mathcal{L}(\alpha,\beta \mid x_n)}{\partial \beta} \\ &= \frac{n \alpha}{\beta} - \sum_{i=1}^n x_i. \end{align} </math>
To complete the maximization procedure for the joint log-likelihood, the equation is set to zero and solved for <math display="inline">\beta</math>:
<math display="block">\widehat\beta = \frac{\alpha}{\bar{x}}.</math>
Here <math display="inline">\widehat\beta</math> denotes the maximum-likelihood estimate, and <math display="inline">\textstyle \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i</math> is the [[sample mean]] of the observations.
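As a quick numerical check of this closed form (a minimal sketch, assuming NumPy and SciPy, with <math display="inline">\alpha</math> treated as known and a hypothetical simulated sample), the analytic estimate <math display="inline">\widehat\beta = \alpha / \bar{x}</math> can be compared against a direct numerical maximization of the joint log-likelihood in <math display="inline">\beta</math>:

<syntaxhighlight lang="python">
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(2)
alpha, beta = 3.0, 1.5                      # hypothetical true parameters
x = rng.gamma(alpha, scale=1.0 / beta, size=500)

def neg_loglik(b):
    # Joint log-likelihood with alpha held fixed; SciPy's gamma uses scale = 1/beta.
    return -stats.gamma.logpdf(x, alpha, scale=1.0 / b).sum()

res = optimize.minimize_scalar(neg_loglik, bounds=(1e-6, 50.0), method="bounded")
print(res.x, alpha / x.mean())              # numerical maximizer vs. alpha / sample mean
</syntaxhighlight>

The two values agree up to the solver's tolerance, mirroring the derivation above.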