==Properties==

===Moments===
Let <math>X = (X_1, \ldots, X_K)\sim\operatorname{Dir}(\boldsymbol\alpha)</math>. Let <math display=block>\alpha_0 = \sum_{i=1}^K \alpha_i.</math> Then<ref>Eq. (49.9) on page 488 of [http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471183873.html Kotz, Balakrishnan & Johnson (2000). Continuous Multivariate Distributions. Volume 1: Models and Applications. New York: Wiley.]</ref><ref>{{cite book|author1=Balakrishnan, N.|author2=Nevzorov, V. B.|year=2005|title=A Primer on Statistical Distributions|publisher=John Wiley & Sons, Inc.|location=Hoboken, NJ|isbn=978-0-471-42798-8|chapter=Chapter 27. Dirichlet Distribution|page=[https://archive.org/details/primeronstatisti0000bala/page/274 274]|chapter-url=https://archive.org/details/primeronstatisti0000bala/page/274}}</ref>
<math display=block>\operatorname{E}[X_i] = \frac{\alpha_i}{\alpha_0},</math>
<math display=block>\operatorname{Var}[X_i] = \frac{\alpha_i (\alpha_0-\alpha_i)}{\alpha_0^2 (\alpha_0+1)}.</math>
Furthermore, if <math> i\neq j</math>, then
<math display=block>\operatorname{Cov}[X_i,X_j] = \frac{- \alpha_i \alpha_j}{\alpha_0^2 (\alpha_0+1)}.</math>
The covariance matrix is [[invertible matrix|singular]].

More generally, moments of Dirichlet-distributed random variables can be expressed in the following way. For <math> \boldsymbol{t}=(t_1,\dotsc,t_K) \in \mathbb{R}^K</math>, denote by <math>\boldsymbol{t}^{\circ i} = (t_1^i,\dotsc,t_K^i)</math> its {{mvar|i}}-th [[Hadamard product (matrices)#Analogous operations|Hadamard power]]. Then,<ref>{{Cite journal |last=Dello Schiavo |first=Lorenzo |date=2019 |title=Characteristic functionals of Dirichlet measures |journal=Electron. J. Probab. |volume=24 |pages=1–38 |doi=10.1214/19-EJP371 |doi-access=free|arxiv=1810.09790 }}</ref>
<math display=block>\operatorname{E}\left[ (\boldsymbol{t} \cdot \boldsymbol{X})^n \right] = \frac{n! \, \Gamma ( \alpha_0 )}{\Gamma (\alpha_0+n)} \sum \frac{{t_1}^{k_1} \cdots {t_K}^{k_K}}{k_1! \cdots k_K!} \prod_{i=1}^K \frac{\Gamma(\alpha_i + k_i)}{\Gamma(\alpha_i)} = \frac{n! \, \Gamma ( \alpha_0 )}{\Gamma (\alpha_0+n)} Z_n(\boldsymbol{t}^{\circ 1} \cdot \boldsymbol{\alpha}, \cdots, \boldsymbol{t}^{\circ n} \cdot \boldsymbol{\alpha}),</math>
where the sum is over non-negative integers <math>k_1,\ldots,k_K</math> with <math>n=k_1+\cdots+k_K</math>, and <math>Z_n</math> is the [[Cycle index#Symmetric group Sn|cycle index polynomial]] of the [[symmetric group]] of degree {{mvar|n}}. We have the special case
<math display=block>\operatorname{E}\left[ \boldsymbol{t} \cdot \boldsymbol{X} \right] = \frac{\boldsymbol{t} \cdot \boldsymbol{\alpha}}{\alpha_0}. </math>
The multivariate analogue <math display="inline">\operatorname{E}\left[ (\boldsymbol{t}_1 \cdot \boldsymbol{X})^{n_1} \cdots (\boldsymbol{t}_q \cdot \boldsymbol{X})^{n_q} \right]</math> for vectors <math>\boldsymbol{t}_1, \dotsc, \boldsymbol{t}_q \in \mathbb{R}^K</math> can be expressed<ref>{{ cite arXiv | last1=Dello Schiavo | first1=Lorenzo | last2=Quattrocchi | first2=Filippo | date=2023 | title=Multivariate Dirichlet Moments and a Polychromatic Ewens Sampling Formula | eprint=2309.11292 | class=math.PR }}</ref> in terms of a color pattern of the exponents <math>n_1, \dotsc, n_q</math> in the sense of the [[Pólya enumeration theorem]].

Particular cases include the simple computation<ref>{{cite web|last1=Hoffmann|first1=Till|title=Moments of the Dirichlet distribution|url=https://tillahoffmann.github.io/Moments-of-the-Dirichlet-distribution/|archive-url=https://web.archive.org/web/20160214015422/https://tillahoffmann.github.io/Moments-of-the-Dirichlet-distribution/ |access-date=14 February 2016|archive-date=2016-02-14 }}</ref>
<math display=block>\operatorname{E}\left[\prod_{i=1}^K X_i^{\beta_i}\right] = \frac{B\left(\boldsymbol{\alpha} + \boldsymbol{\beta}\right)}{B\left(\boldsymbol{\alpha}\right)} = \frac{\Gamma\left(\sum\limits_{i=1}^K \alpha_{i}\right)}{\Gamma\left[\sum\limits_{i=1}^K (\alpha_i+\beta_i)\right]}\times\prod_{i=1}^K \frac{\Gamma(\alpha_i+\beta_i)}{\Gamma(\alpha_i)}.</math>
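These first- and second-moment formulas are straightforward to check by simulation. The following is a minimal numerical sketch (illustrative only; the parameter vector and sample size are arbitrary choices) that compares them with Monte Carlo estimates using NumPy:

<syntaxhighlight lang="python">
# Minimal sketch: compare the analytic mean, variance and covariance
# of a Dirichlet distribution with Monte Carlo estimates.
import numpy as np

alpha = np.array([2.0, 3.0, 5.0])   # arbitrary concentration parameters
alpha0 = alpha.sum()

rng = np.random.default_rng(0)
samples = rng.dirichlet(alpha, size=200_000)   # shape (n, K)

mean_exact = alpha / alpha0
var_exact = alpha * (alpha0 - alpha) / (alpha0**2 * (alpha0 + 1))
cov01_exact = -alpha[0] * alpha[1] / (alpha0**2 * (alpha0 + 1))

print(mean_exact, samples.mean(axis=0))                           # E[X_i]
print(var_exact, samples.var(axis=0))                             # Var[X_i]
print(cov01_exact, np.cov(samples[:, 0], samples[:, 1])[0, 1])    # Cov[X_1, X_2]
</syntaxhighlight>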
===Mode===
The [[mode (statistics)|mode]] of the distribution is<ref name="Bishop2006">{{cite book|author=Christopher M. Bishop|title=Pattern Recognition and Machine Learning|url=https://books.google.com/books?id=kTNoQgAACAAJ|date=17 August 2006|publisher=Springer|isbn=978-0-387-31073-2}}</ref> the vector {{math|(''x''{{sub|1}}, ..., ''x{{sub|K}}'')}} with
<math display=block>x_i = \frac{\alpha_i - 1}{\alpha_0 - K}, \qquad \alpha_i > 1. </math>

===Marginal distributions===
The [[marginal distribution]]s are [[beta distribution]]s:<ref>{{cite web|last=Farrow|first=Malcolm|title=MAS3301 Bayesian Statistics|url=http://www.mas.ncl.ac.uk/~nmf16/teaching/mas3301/week6.pdf|work=Newcastle University|access-date=10 April 2013}}</ref>
<math display=block>X_i \sim \operatorname{Beta} (\alpha_i, \alpha_0 - \alpha_i). </math>
Also see {{slink||Related distributions}} below.
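The Beta marginals can likewise be checked by simulation. A minimal sketch, assuming NumPy and SciPy are available and using an arbitrary parameter vector, compares one coordinate of Dirichlet samples against the stated Beta distribution with a Kolmogorov–Smirnov test:

<syntaxhighlight lang="python">
# Minimal sketch: a single Dirichlet coordinate should follow the Beta marginal.
import numpy as np
from scipy import stats

alpha = np.array([1.5, 2.0, 4.5])   # arbitrary concentration parameters
alpha0 = alpha.sum()

rng = np.random.default_rng(1)
x1 = rng.dirichlet(alpha, size=100_000)[:, 0]

# Kolmogorov-Smirnov test against Beta(alpha_1, alpha_0 - alpha_1);
# the KS statistic should be small.
print(stats.kstest(x1, stats.beta(alpha[0], alpha0 - alpha[0]).cdf))
</syntaxhighlight>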
===Conjugate to categorical or multinomial===
The Dirichlet distribution is the [[conjugate prior]] distribution of the [[categorical distribution]] (a generic [[discrete probability distribution]] with a given number of possible outcomes) and the [[multinomial distribution]] (the distribution over observed counts of each possible category in a set of categorically distributed observations). This means that if a data point has either a categorical or multinomial distribution, and the [[prior distribution]] of the distribution's parameter (the vector of probabilities that generates the data point) is Dirichlet-distributed, then the [[posterior distribution]] of the parameter is also a Dirichlet. Intuitively, starting from what we know about the parameter before observing the data point, we can update our knowledge based on the data point and end up with a new distribution of the same form as the old one. This means that we can successively update our knowledge of a parameter by incorporating new observations one at a time, without running into mathematical difficulties.

Formally, this can be expressed as follows. Given a model
<math display=block>\begin{array}{rcccl} \boldsymbol\alpha &=& \left(\alpha_1, \ldots, \alpha_K \right) &=& \text{concentration hyperparameter} \\ \mathbf{p}\mid\boldsymbol\alpha &=& \left(p_1, \ldots, p_K \right ) &\sim& \operatorname{Dir}(K, \boldsymbol\alpha) \\ \mathbb{X}\mid\mathbf{p} &=& \left(\mathbf{x}_1, \ldots, \mathbf{x}_N \right ) &\sim& \operatorname{Cat}(K,\mathbf{p}) \end{array}</math>
the following holds:
<math display=block>\begin{array}{rcccl} \mathbf{c} &=& \left(c_1, \ldots, c_K \right ) &=& \text{number of occurrences of category }i \\ \mathbf{p} \mid \mathbb{X},\boldsymbol\alpha &\sim& \operatorname{Dir}(K,\mathbf{c}+\boldsymbol\alpha) &=& \operatorname{Dir} \left (K,c_1+\alpha_1,\ldots,c_K+\alpha_K \right) \end{array}</math>
This relationship is used in [[Bayesian statistics]] to estimate the underlying parameter {{math|'''p'''}} of a [[categorical distribution]] given a collection of {{mvar|N}} samples. Intuitively, we can view the [[hyperprior]] vector {{math|'''α'''}} as [[pseudocount]]s, i.e. as representing the number of observations in each category that we have already seen. Then we simply add the counts for all the new observations (the vector {{math|'''c'''}}) to obtain the posterior distribution.
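The update therefore amounts to adding the observed category counts to the prior concentration parameters. A minimal sketch, assuming NumPy and using hypothetical prior parameters and observations:

<syntaxhighlight lang="python">
# Minimal sketch of the conjugate update: posterior = Dir(alpha + counts).
import numpy as np

alpha_prior = np.array([1.0, 1.0, 1.0])      # symmetric prior over K = 3 categories
observations = [0, 2, 2, 1, 2, 0, 2]         # categorical samples coded 0..K-1

counts = np.bincount(observations, minlength=alpha_prior.size)   # the vector c
alpha_post = alpha_prior + counts                                 # Dir(alpha + c)

print(alpha_post)                       # [3. 2. 5.]
print(alpha_post / alpha_post.sum())    # posterior mean estimate of p
</syntaxhighlight>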
In Bayesian [[mixture model]]s and other [[hierarchical Bayesian model]]s with mixture components, Dirichlet distributions are commonly used as the prior distributions for the [[categorical distribution|categorical variable]]s appearing in the models. See the section on [[#Occurrence and applications|applications]] below for more information.

===Relation to Dirichlet-multinomial distribution===
In a model where a Dirichlet prior distribution is placed over a set of [[categorical distribution|categorical-valued]] observations, the [[marginal distribution|marginal]] [[joint distribution]] of the observations (i.e. the joint distribution of the observations, with the prior parameter [[marginalized out]]) is a [[Dirichlet-multinomial distribution]]. This distribution plays an important role in [[hierarchical Bayesian model]]s, because when doing [[statistical inference|inference]] over such models using methods such as [[Gibbs sampling]] or [[variational Bayes]], Dirichlet prior distributions are often marginalized out. See the [[Dirichlet-multinomial distribution|article on this distribution]] for more details.

===Entropy===
If {{mvar|X}} is a <math>\operatorname{Dir}(\boldsymbol\alpha)</math> random variable, the [[differential entropy]] of {{mvar|X}} (in [[nat (unit)|nat units]]) is<ref>{{cite book |last1=Lin |first1=Jiayu |title=On The Dirichlet Distribution |date=2016 |publisher=Queen's University |location=Kingston, Canada |pages=§ 2.4.9 |url=https://mast.queensu.ca/~communications/Papers/msc-jiayu-lin.pdf}}</ref>
<math display=block>h(\boldsymbol X) = \operatorname{E}[- \ln f(\boldsymbol X)] = \ln \operatorname{B}(\boldsymbol\alpha) + (\alpha_0-K)\psi(\alpha_0) - \sum_{j=1}^K (\alpha_j-1)\psi(\alpha_j) </math>
where <math>\psi</math> is the [[digamma function]].

The following formula for <math> \operatorname{E}[\ln(X_i)]</math> can be used to derive the differential entropy above. Since the functions <math>\ln(X_i)</math> are the sufficient statistics of the Dirichlet distribution, the [[Exponential family#Moments and cumulants of the sufficient statistic|exponential family differential identities]] can be used to get an analytic expression for the expectation of <math>\ln(X_i)</math> (see equation (2.62) in <ref>{{cite web|last=Nguyen|first=Duy|title=An In Depth Introduction to Variational Bayes Note|date=15 August 2023 |ssrn=4541076 |url=https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4541076|access-date=15 August 2023}}</ref>) and its associated covariance matrix:
<math display=block>\operatorname{E}[\ln(X_i)] = \psi(\alpha_i)-\psi(\alpha_0)</math>
and
<math display=block>\operatorname{Cov}[\ln(X_i),\ln(X_j)] = \psi'(\alpha_i) \delta_{ij} - \psi'(\alpha_0)</math>
where <math>\psi</math> is the [[digamma function]], <math>\psi'</math> is the [[trigamma function]], and <math>\delta_{ij}</math> is the [[Kronecker delta]].

The spectrum of [[Rényi entropy|Rényi information]] for values other than <math> \lambda = 1</math> is given by<ref>{{cite journal | journal=Journal of Statistical Planning and Inference | volume=93 | pages=51–69 | year=2001 | author=Song, Kai-Sheng | title=Rényi information, loglikelihood, and an intrinsic distribution measure| doi = 10.1016/S0378-3758(00)00169-5 | publisher=Elsevier}}</ref>
<math display=block>F_R(\lambda) = (1-\lambda)^{-1} \left( - \lambda \log \mathrm{B}(\boldsymbol\alpha) + \sum_{i=1}^K \log \Gamma(\lambda(\alpha_i - 1) + 1) - \log \Gamma(\lambda (\alpha_0 - K) + K ) \right) </math>
and the information entropy is the limit as <math>\lambda</math> goes to 1.

A related quantity is the entropy of a discrete categorical (one-of-''K'' binary) vector {{math|'''Z'''}} with probability-mass distribution {{math|'''X'''}}, i.e., <math> P(Z_i=1, Z_{j\ne i} = 0 | \boldsymbol X) = X_i </math>. The conditional [[information entropy]] of {{math|'''Z'''}}, given {{math|'''X'''}}, is
<math display=block>S(\boldsymbol X) = H(\boldsymbol Z | \boldsymbol X) = \operatorname{E}_{\boldsymbol Z}[- \log P(\boldsymbol Z | \boldsymbol X ) ] = \sum_{i=1}^K - X_i \log X_i </math>
This function of {{math|'''X'''}} is a scalar random variable. If {{math|'''X'''}} has a symmetric Dirichlet distribution with all <math>\alpha_i = \alpha</math>, the expected value of the entropy (in [[nat (unit)|nat units]]) is<ref>{{cite conference |last1=Nemenman |first1=Ilya |last2=Shafee |first2=Fariel |last3=Bialek |first3=William |title= Entropy and Inference, revisited |date=2002 |conference=NIPS 14 |url=http://papers.nips.cc/paper/1965-entropy-and-inference-revisited.pdf}}, eq. 8</ref>
<math display=block>\operatorname{E}[S(\boldsymbol X)] = \sum_{i=1}^K \operatorname{E}[- X_i \ln X_i] = \psi(K\alpha + 1) - \psi(\alpha + 1) </math>

===Aggregation===
If
<math display=block>X = (X_1, \ldots, X_K)\sim\operatorname{Dir}(\alpha_1,\ldots,\alpha_K)</math>
then, if the random variables with subscripts {{mvar|i}} and {{mvar|j}} are dropped from the vector and replaced by their sum,
<math display=block>X' = (X_1, \ldots, X_i + X_j, \ldots, X_K)\sim\operatorname{Dir} (\alpha_1, \ldots, \alpha_i + \alpha_j, \ldots, \alpha_K).</math>
This aggregation property may be used to derive the marginal distribution of <math>X_i</math> mentioned above.
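The aggregation property can be illustrated numerically by merging two coordinates of Dirichlet samples and comparing moments with samples drawn directly from the Dirichlet distribution with the summed parameter. A minimal sketch, assuming NumPy and arbitrary parameters:

<syntaxhighlight lang="python">
# Minimal sketch: merging X_2 and X_3 should behave like Dir(alpha_1, alpha_2 + alpha_3).
import numpy as np

alpha = np.array([2.0, 3.0, 4.0])   # arbitrary concentration parameters
rng = np.random.default_rng(2)

x = rng.dirichlet(alpha, size=200_000)
merged = np.column_stack([x[:, 0], x[:, 1] + x[:, 2]])            # (X_1, X_2 + X_3)
direct = rng.dirichlet([alpha[0], alpha[1] + alpha[2]], size=200_000)

print(merged.mean(axis=0), direct.mean(axis=0))   # means should agree
print(merged.var(axis=0), direct.var(axis=0))     # variances should agree
</syntaxhighlight>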
===Neutrality===
{{main|Neutral vector}}

If <math>X = (X_1, \ldots, X_K)\sim\operatorname{Dir}(\boldsymbol\alpha)</math>, then the vector {{mvar|X}} is said to be ''neutral''<ref>{{cite journal | journal=Journal of the American Statistical Association | volume=64 | issue=325 | pages=194–206 | year=1969 | author=Connor, Robert J. | title=Concepts of Independence for Proportions with a Generalization of the Dirichlet Distribution | doi = 10.2307/2283728 | jstor=2283728 | author2=Mosimann, James E | publisher=American Statistical Association }}</ref> in the sense that ''X{{sub|K}}'' is independent of <math>X^{(-K)}</math><ref name=FKG/> where
<math display=block>X^{(-K)}=\left(\frac{X_1}{1-X_K},\frac{X_2}{1-X_K},\ldots,\frac{X_{K-1}}{1-X_K} \right),</math>
and similarly for removing any of <math>X_1,\ldots,X_{K-1}</math>. Observe that any permutation of {{mvar|X}} is also neutral (a property not possessed by samples drawn from a [[generalized Dirichlet distribution]]).<ref>See Kotz, Balakrishnan & Johnson (2000), Section 8.5, "Connor and Mosimann's Generalization", pp. 519–521.</ref>

Combining this with the aggregation property, it follows that {{math|''X''{{sub|''j''}} + ... + ''X''{{sub|''K''}}}} is independent of <math>\left(\frac{X_1}{X_1+\cdots +X_{j-1}},\frac{X_2}{X_1+\cdots +X_{j-1}},\ldots,\frac{X_{j-1}}{X_1+\cdots +X_{j-1}} \right)</math>. In fact it is true, further, that for the Dirichlet distribution and <math>3\le j\le K-1</math>, the pair <math>\left(X_1+\cdots +X_{j-1}, X_j+\cdots +X_K\right)</math> and the two vectors <math>\left(\frac{X_1}{X_1+\cdots +X_{j-1}},\frac{X_2}{X_1+\cdots +X_{j-1}},\ldots,\frac{X_{j-1}}{X_1+\cdots +X_{j-1}} \right)</math> and <math>\left(\frac{X_j}{X_j+\cdots +X_K},\frac{X_{j+1}}{X_j+\cdots +X_K},\ldots,\frac{X_K}{X_j+\cdots +X_K} \right)</math>, viewed as a triple of normalised random vectors, are [[Independence (probability theory)#More than two random variables|mutually independent]]. The analogous result holds for any other partition of the indices {{math|{{mset|1, 2, ..., ''K''}}}} into a pair of non-singleton subsets.
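Neutrality can be probed, weakly, by simulation: full independence is stronger than zero correlation, but the correlations between ''X{{sub|K}}'' and the coordinates of <math>X^{(-K)}</math> should be near zero. A minimal sketch, assuming NumPy and arbitrary parameters:

<syntaxhighlight lang="python">
# Minimal sketch: correlations between X_K and the renormalised remainder
# X^(-K) should be close to zero (a necessary, not sufficient, check of neutrality).
import numpy as np

alpha = np.array([1.0, 2.0, 3.0, 4.0])   # arbitrary concentration parameters
rng = np.random.default_rng(3)

x = rng.dirichlet(alpha, size=200_000)
x_K = x[:, -1]
x_rest = x[:, :-1] / (1.0 - x_K)[:, None]   # X^(-K), renormalised to the simplex

for j in range(x_rest.shape[1]):
    print(np.corrcoef(x_K, x_rest[:, j])[0, 1])   # all close to 0
</syntaxhighlight>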
===Characteristic function===
The characteristic function of the Dirichlet distribution is a [[confluent hypergeometric function|confluent]] form of the [[Lauricella hypergeometric series]]. It is given by [[Peter C. B. Phillips|Phillips]] as<ref name="phillips1988">{{cite journal |first=P. C. B. |last=Phillips |year=1988 |url=https://cowles.yale.edu/sites/default/files/files/pub/d08/d0865.pdf |title=The characteristic function of the Dirichlet and multivariate F distribution |journal=Cowles Foundation Discussion Paper 865 }}</ref>
<math display=block> CF\left(s_1,\ldots,s_{K-1}\right) = \operatorname{E}\left(e^{i\left(s_1X_1+\cdots+s_{K-1}X_{K-1} \right)} \right)= \Psi^{\left[K-1\right]} (\alpha_1,\ldots,\alpha_{K-1};\alpha_0;is_1,\ldots, is_{K-1}) </math>
where
<math display=block> \Psi^{[m]} (a_1,\ldots,a_m;c;z_1,\ldots, z_m) = \sum\frac{(a_1)_{k_1} \cdots (a_m)_{k_m} \, z_1^{k_1} \cdots z_m^{k_m}}{(c)_k\,k_1!\cdots k_m!}. </math>
The sum is over non-negative integers <math>k_1,\ldots,k_m</math> with <math>k=k_1+\cdots+k_m</math>.

Phillips goes on to state that this form is "inconvenient for numerical calculation" and gives an alternative in terms of a [[Methods of contour integration|complex path integral]]:
<math display=block> \Psi^{[m]} = \frac{\Gamma(c)}{2\pi i}\int_L e^t\,t^{a_1+\cdots+a_m-c}\,\prod_{j=1}^m (t-z_j)^{-a_j} \, dt</math>
where {{mvar|L}} denotes any path in the complex plane originating at <math>-\infty</math>, encircling in the positive direction all the singularities of the integrand and returning to <math>-\infty</math>.

===Inequality===
The probability density function <math>f \left(x_1,\ldots, x_{K-1}; \alpha_1,\ldots, \alpha_K \right)</math> plays a key role in a multifunctional inequality which implies various bounds for the Dirichlet distribution.<ref>{{cite journal | last1=Grinshpan | first1=A. Z. | title=An inequality for multiple convolutions with respect to Dirichlet probability measure | doi=10.1016/j.aam.2016.08.001 | year=2017 | journal=Advances in Applied Mathematics | volume=82 | issue=1 | pages=102–119 | doi-access=free }}</ref>

Another inequality relates the moment-generating function of the Dirichlet distribution to the convex conjugate of the scaled reversed Kullback–Leibler divergence:<ref>{{cite arXiv | last1=Perrault| first1=P. | title=A New Bound on the Cumulant Generating Function of Dirichlet Processes |eprint=2409.18621 | year=2024| class=math.PR }} Theorem 3.3</ref>
<math display=block> \log \operatorname{E}\left(\exp{\sum_{i=1}^K s_i X_i } \right) \leq \sup_p \sum_{i=1}^K \left(p_i s_i - \alpha_i\log\left(\frac{\alpha_i}{\alpha_0 p_i} \right)\right), </math>
where the supremum is taken over {{mvar|p}} ranging over the {{math|(''K'' − 1)}}-simplex.
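Both sides of this bound can be evaluated numerically: the left-hand side by Monte Carlo, the right-hand side by constrained optimisation over the simplex. A minimal sketch (not taken from the cited paper), assuming NumPy and SciPy and arbitrary choices of <math>\boldsymbol\alpha</math> and <math>\boldsymbol s</math>:

<syntaxhighlight lang="python">
# Minimal sketch: Monte Carlo estimate of log E[exp(s . X)] versus a numerical
# evaluation of sup_p sum_i ( p_i s_i - alpha_i log(alpha_i / (alpha0 p_i)) ).
import numpy as np
from scipy.optimize import minimize

alpha = np.array([2.0, 3.0, 5.0])   # arbitrary concentration parameters
alpha0 = alpha.sum()
s = np.array([1.0, -0.5, 2.0])      # arbitrary test vector

rng = np.random.default_rng(4)
x = rng.dirichlet(alpha, size=500_000)
lhs = np.log(np.mean(np.exp(x @ s)))          # Monte Carlo left-hand side

def neg_objective(p):
    # negative of sum_i ( p_i s_i - alpha_i log(alpha_i / (alpha0 p_i)) )
    return -(p @ s - np.sum(alpha * np.log(alpha / (alpha0 * p))))

res = minimize(
    neg_objective,
    x0=alpha / alpha0,                                        # start at the mean
    bounds=[(1e-9, 1.0)] * len(alpha),
    constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}],
    method="SLSQP",
)
rhs = -res.fun

print(lhs, rhs)   # the bound asserts lhs <= rhs
</syntaxhighlight>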