==Log-likelihood==
{{see also|Log-probability}}

The ''log-likelihood function'' is the logarithm of the likelihood function, often denoted by a lowercase {{math|''l''}} or {{tmath|\ell}}, to contrast with the uppercase {{math|''L''}} or <math display="inline">\mathcal{L}</math> for the likelihood. Because logarithms are [[strictly increasing]] functions, maximizing the likelihood is equivalent to maximizing the log-likelihood. For practical purposes, however, it is more convenient to work with the log-likelihood function in [[maximum likelihood estimation]], in particular since most common [[probability distribution]]s – notably the [[exponential family]] – are only [[Logarithmically concave function|logarithmically concave]],<ref>{{citation |first1=Robert E. |last1=Kass |first2=Paul W. |last2=Vos |title=Geometrical Foundations of Asymptotic Inference |location=New York |publisher=John Wiley & Sons |year=1997 |isbn=0-471-82668-5 |page=14 |url=https://books.google.com/books?id=e43EAIfUPCwC&pg=PA14 |mode=cs1 }}</ref><ref>{{cite web |first=Alecos |last=Papadopoulos |title=Why we always put log() before the joint pdf when we use MLE (Maximum likelihood Estimation)? |date=September 25, 2013 |work=[[Stack Exchange]] |url=https://stats.stackexchange.com/q/70975 }}</ref> and [[Concave function|concavity]] of the [[objective function]] plays a key role in the [[Mathematical optimization|maximization]].

Given independent events, the overall log-likelihood of their intersection equals the sum of the log-likelihoods of the individual events, just as the overall [[log-probability]] is the sum of the log-probabilities of the individual events. Beyond this mathematical convenience, the additivity of the log-likelihood has an intuitive interpretation, often expressed as "support" from the data. When the parameters are estimated by [[maximum likelihood estimation]], each data point contributes to the estimate by being added to the total log-likelihood. Since the data can be viewed as evidence that supports the estimated parameters, this process can be interpreted as "support from independent evidence ''adds''", and the log-likelihood is the "weight of evidence". Interpreting negative log-probability as [[information content]] or [[surprisal]], the support (log-likelihood) of a model, given an event, is the negative of the surprisal of the event, given the model: a model is supported by an event to the extent that the event is unsurprising, given the model.

The logarithm of a likelihood ratio equals the difference of the log-likelihoods:
<math display="block">\log \frac{\mathcal{L}(A)}{\mathcal{L}(B)} = \log \mathcal{L}(A) - \log \mathcal{L}(B) = \ell(A) - \ell(B).</math>
Just as the likelihood given no event is 1, the log-likelihood given no event is 0, which corresponds to the value of the empty sum: without any data, there is no support for any model.
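As a concrete illustration (a minimal sketch, not drawn from the cited sources, assuming NumPy and SciPy and a hypothetical normal sample <code>x</code> with unit variance and unknown mean), the following checks numerically that the joint log-likelihood of independent observations is the sum of the individual log-densities, and that maximizing the log-likelihood selects the same parameter value as maximizing the likelihood itself:

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=100)   # hypothetical i.i.d. sample

mu_grid = np.linspace(0.0, 4.0, 401)           # candidate values of the mean

# Joint log-likelihood of independent observations = sum of log-densities.
loglik = np.array([stats.norm.logpdf(x, loc=mu, scale=1.0).sum() for mu in mu_grid])
# Joint likelihood = product of densities (prone to underflow for large samples).
lik = np.array([stats.norm.pdf(x, loc=mu, scale=1.0).prod() for mu in mu_grid])

# The maximizers coincide, and both are close to the sample mean.
print(mu_grid[np.argmax(loglik)], mu_grid[np.argmax(lik)], x.mean())
</syntaxhighlight>

The grid search is used only to keep the sketch self-contained; it also shows why logarithms are preferred in practice, since the raw likelihood is a product of many small factors.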
===Graph===
The [[Graph of a function|graph]] of the log-likelihood is called the '''support curve''' (in the [[univariate]] case).<ref name="Edwards72">{{cite book|last=Edwards|first=A. W. F.|authorlink=A. W. F. Edwards|orig-date=1972|year=1992|title=Likelihood|publisher=[[Johns Hopkins University Press]]|isbn=0-8018-4443-6}}</ref> In the multivariate case, the concept generalizes into a '''support surface''' over the [[parameter space]]. It is related to, but distinct from, the [[Support (mathematics)#Support (statistics)|support of a distribution]]. The term was coined by [[A. W. F. Edwards]]<ref name="Edwards72" /> in the context of [[statistical hypothesis testing]], i.e. whether or not the data "support" one hypothesis (or parameter value) being tested more than any other.

The log-likelihood function being plotted is used in the computation of the [[score (statistics)|score]] (the gradient of the log-likelihood) and [[Fisher information]] (the curvature of the log-likelihood). Thus, the graph has a direct interpretation in the context of [[maximum likelihood estimation]] and [[likelihood-ratio test]]s.

===Likelihood equations===
If the log-likelihood function is [[Smoothness|smooth]], its [[gradient]] with respect to the parameter, known as the [[Score (statistics)|score]] and written <math display="inline">s_{n}(\theta) \equiv \nabla_{\theta} \ell_{n}(\theta)</math>, exists and allows for the application of [[differential calculus]]. The basic way to maximize a differentiable function is to find the [[stationary point]]s (the points where the [[derivative]] is zero); since the derivative of a sum is just the sum of the derivatives, whereas the derivative of a product requires the [[product rule]], it is easier to compute the stationary points of the log-likelihood of independent events than of their likelihood. The equations defined by the stationary points of the score function serve as [[estimating equations]] for the maximum likelihood estimator:
<math display="block">s_{n}(\theta) = \mathbf{0}.</math>
In that sense, the maximum likelihood estimator is implicitly defined by the value at <math display="inline">\mathbf{0}</math> of the [[inverse function]] <math display="inline">s_{n}^{-1}: \mathbb{E}^{d} \to \Theta</math>, where <math display="inline">\mathbb{E}^{d}</math> is the <var>d</var>-dimensional [[Euclidean space]], and <math display="inline">\Theta</math> is the parameter space. Using the [[inverse function theorem]], it can be shown that <math display="inline">s_{n}^{-1}</math> is [[well-defined]] in an [[open neighborhood]] about <math display="inline">\mathbf{0}</math> with probability going to one, and <math display="inline">\hat{\theta}_{n} = s_{n}^{-1}(\mathbf{0})</math> is a consistent estimate of <math display="inline">\theta</math>. As a consequence there exists a sequence <math display="inline">\left\{ \hat{\theta}_{n} \right\}</math> such that <math display="inline">s_{n}(\hat{\theta}_{n}) = \mathbf{0}</math> asymptotically [[almost surely]], and <math display="inline">\hat{\theta}_{n} \xrightarrow{\text{p}} \theta_{0}</math>.<ref>{{cite journal |first=Robert V. |last=Foutz |title=On the Unique Consistent Solution to the Likelihood Equations |journal=[[Journal of the American Statistical Association]] |volume=72 |year=1977 |issue=357 |pages=147–148 |doi=10.1080/01621459.1977.10479926 }}</ref> A similar result can be established using [[Rolle's theorem]].<ref>{{cite journal |first1=Robert E. |last1=Tarone |first2=Gary |last2=Gruenhage |title=A Note on the Uniqueness of Roots of the Likelihood Equations for Vector-Valued Parameters |journal=Journal of the American Statistical Association |volume=70 |year=1975 |issue=352 |pages=903–904 |doi=10.1080/01621459.1975.10480321 }}</ref><ref>{{cite journal |first1=Kamta |last1=Rai |first2=John |last2=Van Ryzin |title=A Note on a Multivariate Version of Rolle's Theorem and Uniqueness of Maximum Likelihood Roots |journal=Communications in Statistics |series=Theory and Methods |volume=11 |year=1982 |issue=13 |pages=1505–1510 |doi=10.1080/03610928208828325 }}</ref>

The second derivative evaluated at <math display="inline">\hat{\theta}</math>, known as [[Fisher information]], determines the curvature of the likelihood surface,<ref>{{citation |first=B. Raja |last=Rao |title=A formula for the curvature of the likelihood surface of a sample drawn from a distribution admitting sufficient statistics |journal=[[Biometrika]] |volume=47 |issue=1–2 |year=1960 |pages=203–207 |doi=10.1093/biomet/47.1-2.203 |mode=cs1 }}</ref> and thus indicates the [[Precision (statistics)|precision]] of the estimate.<ref>{{citation |first1=Michael D. |last1=Ward |first2=John S. |last2=Ahlquist |title=Maximum Likelihood for Social Science: Strategies for Analysis |publisher=[[Cambridge University Press]] |year=2018 |pages=25–27 |mode=cs1 }}</ref>
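As an illustration of the likelihood equation in one dimension (a minimal sketch, not taken from the cited sources, assuming NumPy and SciPy and a hypothetical exponential sample with unknown rate <math display="inline">\lambda</math>), the score can be solved numerically and the observed Fisher information evaluated at the root:

<syntaxhighlight lang="python">
import numpy as np
from scipy import optimize

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=200)   # hypothetical sample; true rate is 0.5

n, s = x.size, x.sum()

def score(lam):
    # Gradient of the exponential log-likelihood  n*log(lam) - lam*sum(x).
    return n / lam - s

# Likelihood equation: score(lambda) = 0, solved numerically on a bracket.
lam_hat = optimize.brentq(score, 1e-6, 100.0)
print(lam_hat, 1.0 / x.mean())             # both equal the analytic MLE 1 / sample mean

# Observed Fisher information: negative second derivative of the log-likelihood
# at the root, i.e. the curvature that indicates the precision of the estimate.
observed_info = n / lam_hat**2
print(observed_info)
</syntaxhighlight>

In this model the root is also available in closed form, so the numerical solution only mirrors the general recipe: find the zero of the score, then read off the curvature there.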
===Exponential families===
{{further|Exponential family}}

The log-likelihood is also particularly useful for [[exponential families]] of distributions, which include many of the common [[parametric model|parametric probability distributions]]. The probability density function (and thus the likelihood function) of an exponential family contains products of factors involving [[exponentiation]]. The logarithm of such a function is a sum of products, again easier to differentiate than the original function.

An exponential family is one whose probability density function is of the form (for some functions, writing <math display="inline">\langle -, - \rangle</math> for the [[inner product]]):
<math display="block"> p(x \mid \boldsymbol \theta) = h(x) \exp\Big(\langle \boldsymbol\eta({\boldsymbol \theta}), \mathbf{T}(x)\rangle - A({\boldsymbol \theta}) \Big).</math>
Each of these terms has an interpretation,{{efn|See {{slink|Exponential family|Interpretation}}}} but simply switching from probability to likelihood and taking logarithms yields the sum:
<math display="block"> \ell(\boldsymbol \theta \mid x) = \langle \boldsymbol\eta({\boldsymbol \theta}), \mathbf{T}(x)\rangle - A({\boldsymbol \theta}) + \log h(x).</math>
The <math display="inline">\boldsymbol \eta(\boldsymbol \theta)</math> and <math display="inline">h(x)</math> each correspond to a [[change of coordinates]], so in these coordinates, the log-likelihood of an exponential family is given by the simple formula:
<math display="block"> \ell(\boldsymbol \eta \mid x) = \langle \boldsymbol\eta, \mathbf{T}(x)\rangle - A({\boldsymbol \eta}).</math>
In words, the log-likelihood of an exponential family is the inner product of the natural parameter {{tmath|\boldsymbol\eta}} and the [[sufficient statistic]] {{tmath|\mathbf{T}(x)}}, minus the normalization factor ([[log-partition function]]) {{tmath|A({\boldsymbol \eta})}}. Thus for example the maximum likelihood estimate can be computed by taking derivatives of the sufficient statistic {{math|''T''}} and the log-partition function {{math|''A''}}.
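For instance, for <math display="inline">n</math> independent observations the terms <math display="inline">\log h(x_i)</math> do not depend on the parameter, so up to that additive constant the log-likelihood in natural coordinates is (a standard computation, sketched here without regularity details)
<math display="block">\ell(\boldsymbol\eta \mid x_1, \ldots, x_n) = \Big\langle \boldsymbol\eta, \sum_{i=1}^n \mathbf{T}(x_i) \Big\rangle - n A(\boldsymbol\eta), \qquad \nabla_{\boldsymbol\eta}\, \ell = \mathbf{0} \iff \nabla A(\hat{\boldsymbol\eta}) = \frac{1}{n} \sum_{i=1}^n \mathbf{T}(x_i),</math>
so the likelihood equation matches the gradient of the log-partition function to the sample mean of the sufficient statistic.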
====Example: the gamma distribution====
The [[gamma distribution]] is an exponential family with two parameters, <math display="inline">\alpha</math> and <math display="inline">\beta</math>. The likelihood function is
<math display="block">\mathcal{L} (\alpha, \beta \mid x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}.</math>
Finding the maximum likelihood estimate of <math display="inline">\beta</math> for a single observed value <math display="inline">x</math> looks rather daunting. Its logarithm is much simpler to work with:
<math display="block">\log \mathcal{L}(\alpha,\beta \mid x) = \alpha \log \beta - \log \Gamma(\alpha) + (\alpha-1) \log x - \beta x.</math>
To maximize the log-likelihood, we first take the [[partial derivative]] with respect to <math display="inline">\beta</math>:
<math display="block">\frac{\partial \log \mathcal{L}(\alpha,\beta \mid x)}{\partial \beta} = \frac{\alpha}{\beta} - x.</math>
If there are a number of independent observations <math display="inline">x_1, \ldots, x_n</math>, then the joint log-likelihood is the sum of the individual log-likelihoods, and the derivative of this sum is the sum of the derivatives of the individual log-likelihoods:
<math display="block"> \begin{align} & \frac{\partial \log \mathcal{L}(\alpha,\beta \mid x_1, \ldots, x_n)}{\partial \beta} \\ &= \frac{\partial \log \mathcal{L}(\alpha,\beta \mid x_1)}{\partial \beta} + \cdots + \frac{\partial \log \mathcal{L}(\alpha,\beta \mid x_n)}{\partial \beta} \\ &= \frac{n \alpha}{\beta} - \sum_{i=1}^n x_i. \end{align} </math>
To complete the maximization procedure for the joint log-likelihood, the equation is set to zero and solved for <math display="inline">\beta</math>:
<math display="block">\widehat\beta = \frac{\alpha}{\bar{x}}.</math>
Here <math display="inline">\widehat\beta</math> denotes the maximum-likelihood estimate, and <math display="inline">\textstyle \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i</math> is the [[sample mean]] of the observations.
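As a quick numerical check of this closed form (a minimal sketch, assuming NumPy and SciPy, with <math display="inline">\alpha</math> treated as known and a hypothetical simulated sample), the analytic estimate <math display="inline">\widehat\beta = \alpha / \bar{x}</math> can be compared against a direct numerical maximization of the joint log-likelihood in <math display="inline">\beta</math>:

<syntaxhighlight lang="python">
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(2)
alpha, beta = 3.0, 1.5                      # hypothetical true parameters
x = rng.gamma(alpha, scale=1.0 / beta, size=500)

def neg_loglik(b):
    # Joint log-likelihood with alpha held fixed; SciPy's gamma uses scale = 1/beta.
    return -stats.gamma.logpdf(x, alpha, scale=1.0 / b).sum()

res = optimize.minimize_scalar(neg_loglik, bounds=(1e-6, 50.0), method="bounded")
print(res.x, alpha / x.mean())              # numerical maximizer vs. alpha / sample mean
</syntaxhighlight>

The two values agree up to the solver's tolerance, mirroring the derivation above.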