==Properties==

===Moments===
Let <math>X = (X_1, \ldots, X_K)\sim\operatorname{Dir}(\boldsymbol\alpha)</math>. Let <math display=block>\alpha_0 = \sum_{i=1}^K \alpha_i.</math> Then<ref>Eq. (49.9) on page 488 of [http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471183873.html Kotz, Balakrishnan & Johnson (2000). Continuous Multivariate Distributions. Volume 1: Models and Applications. New York: Wiley.]</ref><ref>{{cite book|author1=Balakrishnan, N.|author2=Nevzorov, V. B.|year=2005|title=A Primer on Statistical Distributions|publisher=John Wiley & Sons, Inc.|location=Hoboken, NJ|isbn=978-0-471-42798-8|chapter=Chapter 27. Dirichlet Distribution|page=[https://archive.org/details/primeronstatisti0000bala/page/274 274]|chapter-url=https://archive.org/details/primeronstatisti0000bala/page/274}}</ref>
<math display=block>\operatorname{E}[X_i] = \frac{\alpha_i}{\alpha_0},</math>
<math display=block>\operatorname{Var}[X_i] = \frac{\alpha_i (\alpha_0-\alpha_i)}{\alpha_0^2 (\alpha_0+1)}.</math>
Furthermore, if <math> i\neq j</math>, then
<math display=block>\operatorname{Cov}[X_i,X_j] = \frac{- \alpha_i \alpha_j}{\alpha_0^2 (\alpha_0+1)}.</math>
The covariance matrix is [[invertible matrix|singular]].

More generally, moments of Dirichlet-distributed random variables can be expressed in the following way. For <math> \boldsymbol{t}=(t_1,\dotsc,t_K) \in \mathbb{R}^K</math>, denote by <math>\boldsymbol{t}^{\circ i} = (t_1^i,\dotsc,t_K^i)</math> its {{mvar|i}}-th [[Hadamard product (matrices)#Analogous operations|Hadamard power]]. Then,<ref>{{Cite journal |last=Dello Schiavo |first=Lorenzo |date=2019 |title=Characteristic functionals of Dirichlet measures |journal=Electron. J. Probab. |volume=24 |pages=1–38 |doi=10.1214/19-EJP371 |doi-access=free|arxiv=1810.09790 }}</ref>
<math display=block>\operatorname{E}\left[ (\boldsymbol{t} \cdot \boldsymbol{X})^n \right] = \frac{n! \, \Gamma ( \alpha_0 )}{\Gamma (\alpha_0+n)} \sum \frac{{t_1}^{k_1} \cdots {t_K}^{k_K}}{k_1! \cdots k_K!} \prod_{i=1}^K \frac{\Gamma(\alpha_i + k_i)}{\Gamma(\alpha_i)} = \frac{n! \, \Gamma ( \alpha_0 )}{\Gamma (\alpha_0+n)} Z_n(\boldsymbol{t}^{\circ 1} \cdot \boldsymbol{\alpha}, \cdots, \boldsymbol{t}^{\circ n} \cdot \boldsymbol{\alpha}),</math>
where the sum is over non-negative integers <math>k_1,\ldots,k_K</math> with <math>n=k_1+\cdots+k_K</math>, and <math>Z_n</math> is the [[Cycle index#Symmetric group Sn|cycle index polynomial]] of the [[symmetric group]] of degree {{mvar|n}}. We have the special case
<math display=block>\operatorname{E}\left[ \boldsymbol{t} \cdot \boldsymbol{X} \right] = \frac{\boldsymbol{t} \cdot \boldsymbol{\alpha}}{\alpha_0}. </math>
The multivariate analogue <math display="inline">\operatorname{E}\left[ (\boldsymbol{t}_1 \cdot \boldsymbol{X})^{n_1} \cdots (\boldsymbol{t}_q \cdot \boldsymbol{X})^{n_q} \right]</math> for vectors <math>\boldsymbol{t}_1, \dotsc, \boldsymbol{t}_q \in \mathbb{R}^K</math> can be expressed<ref>{{ cite arXiv | last1=Dello Schiavo | first1=Lorenzo | last2=Quattrocchi | first2=Filippo | date=2023 | title=Multivariate Dirichlet Moments and a Polychromatic Ewens Sampling Formula | eprint=2309.11292 | class=math.PR }}</ref> in terms of a color pattern of the exponents <math>n_1, \dotsc, n_q</math> in the sense of the [[Pólya enumeration theorem]].

Particular cases include the simple computation<ref>{{cite web|last1=Hoffmann|first1=Till|title=Moments of the Dirichlet distribution|url=https://tillahoffmann.github.io/Moments-of-the-Dirichlet-distribution/|archive-url=https://web.archive.org/web/20160214015422/https://tillahoffmann.github.io/Moments-of-the-Dirichlet-distribution/ |access-date=14 February 2016|archive-date=2016-02-14 }}</ref>
<math display=block>\operatorname{E}\left[\prod_{i=1}^K X_i^{\beta_i}\right] = \frac{B\left(\boldsymbol{\alpha} + \boldsymbol{\beta}\right)}{B\left(\boldsymbol{\alpha}\right)} = \frac{\Gamma\left(\sum\limits_{i=1}^K \alpha_{i}\right)}{\Gamma\left[\sum\limits_{i=1}^K (\alpha_i+\beta_i)\right]}\times\prod_{i=1}^K \frac{\Gamma(\alpha_i+\beta_i)}{\Gamma(\alpha_i)}.</math>
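These first- and second-moment formulas are straightforward to check by simulation. The following is a minimal numerical sketch (illustrative only; the parameter vector and sample size are arbitrary choices) that compares them with Monte Carlo estimates using NumPy:

<syntaxhighlight lang="python">
# Minimal sketch: compare the analytic mean, variance and covariance
# of a Dirichlet distribution with Monte Carlo estimates.
import numpy as np

alpha = np.array([2.0, 3.0, 5.0])   # arbitrary concentration parameters
alpha0 = alpha.sum()

rng = np.random.default_rng(0)
samples = rng.dirichlet(alpha, size=200_000)   # shape (n, K)

mean_exact = alpha / alpha0
var_exact = alpha * (alpha0 - alpha) / (alpha0**2 * (alpha0 + 1))
cov01_exact = -alpha[0] * alpha[1] / (alpha0**2 * (alpha0 + 1))

print(mean_exact, samples.mean(axis=0))                           # E[X_i]
print(var_exact, samples.var(axis=0))                             # Var[X_i]
print(cov01_exact, np.cov(samples[:, 0], samples[:, 1])[0, 1])    # Cov[X_1, X_2]
</syntaxhighlight>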
===Mode===
The [[mode (statistics)|mode]] of the distribution is<ref name="Bishop2006">{{cite book|author=Christopher M. Bishop|title=Pattern Recognition and Machine Learning|url=https://books.google.com/books?id=kTNoQgAACAAJ|date=17 August 2006|publisher=Springer|isbn=978-0-387-31073-2}}</ref> the vector {{math|(''x''{{sub|1}}, ..., ''x{{sub|K}}'')}} with
<math display=block>x_i = \frac{\alpha_i - 1}{\alpha_0 - K}, \qquad \alpha_i > 1. </math>

===Marginal distributions===
The [[marginal distribution]]s are [[beta distribution]]s:<ref>{{cite web|last=Farrow|first=Malcolm|title=MAS3301 Bayesian Statistics|url=http://www.mas.ncl.ac.uk/~nmf16/teaching/mas3301/week6.pdf|work=Newcastle University|access-date=10 April 2013}}</ref>
<math display=block>X_i \sim \operatorname{Beta} (\alpha_i, \alpha_0 - \alpha_i). </math>
Also see {{slink||Related distributions}} below.
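The Beta marginals can likewise be checked by simulation. A minimal sketch, assuming NumPy and SciPy are available and using an arbitrary parameter vector, compares one coordinate of Dirichlet samples against the stated Beta distribution with a Kolmogorov–Smirnov test:

<syntaxhighlight lang="python">
# Minimal sketch: a single Dirichlet coordinate should follow the Beta marginal.
import numpy as np
from scipy import stats

alpha = np.array([1.5, 2.0, 4.5])   # arbitrary concentration parameters
alpha0 = alpha.sum()

rng = np.random.default_rng(1)
x1 = rng.dirichlet(alpha, size=100_000)[:, 0]

# Kolmogorov-Smirnov test against Beta(alpha_1, alpha_0 - alpha_1);
# the KS statistic should be small.
print(stats.kstest(x1, stats.beta(alpha[0], alpha0 - alpha[0]).cdf))
</syntaxhighlight>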
===Conjugate to categorical or multinomial===
The Dirichlet distribution is the [[conjugate prior]] distribution of the [[categorical distribution]] (a generic [[discrete probability distribution]] with a given number of possible outcomes) and the [[multinomial distribution]] (the distribution over observed counts of each possible category in a set of categorically distributed observations). This means that if a data point has either a categorical or multinomial distribution, and the [[prior distribution]] of the distribution's parameter (the vector of probabilities that generates the data point) is Dirichlet-distributed, then the [[posterior distribution]] of the parameter is also a Dirichlet. Intuitively, starting from what we know about the parameter before observing the data point, we can update our knowledge based on the data point and end up with a new distribution of the same form as the old one. This means that we can successively update our knowledge of a parameter by incorporating new observations one at a time, without running into mathematical difficulties.

Formally, this can be expressed as follows. Given a model
<math display=block>\begin{array}{rcccl} \boldsymbol\alpha &=& \left(\alpha_1, \ldots, \alpha_K \right) &=& \text{concentration hyperparameter} \\ \mathbf{p}\mid\boldsymbol\alpha &=& \left(p_1, \ldots, p_K \right ) &\sim& \operatorname{Dir}(K, \boldsymbol\alpha) \\ \mathbb{X}\mid\mathbf{p} &=& \left(\mathbf{x}_1, \ldots, \mathbf{x}_N \right ) &\sim& \operatorname{Cat}(K,\mathbf{p}) \end{array}</math>
the following holds:
<math display=block>\begin{array}{rcccl} \mathbf{c} &=& \left(c_1, \ldots, c_K \right ) &=& \text{number of occurrences of category }i \\ \mathbf{p} \mid \mathbb{X},\boldsymbol\alpha &\sim& \operatorname{Dir}(K,\mathbf{c}+\boldsymbol\alpha) &=& \operatorname{Dir} \left (K,c_1+\alpha_1,\ldots,c_K+\alpha_K \right) \end{array}</math>
This relationship is used in [[Bayesian statistics]] to estimate the underlying parameter {{math|'''p'''}} of a [[categorical distribution]] given a collection of {{mvar|N}} samples. Intuitively, we can view the [[hyperprior]] vector {{math|'''α'''}} as [[pseudocount]]s, i.e. as representing the number of observations in each category that we have already seen. Then we simply add the counts for all the new observations (the vector {{math|'''c'''}}) to obtain the posterior distribution.
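The update therefore amounts to adding the observed category counts to the prior concentration parameters. A minimal sketch, assuming NumPy and using hypothetical prior parameters and observations:

<syntaxhighlight lang="python">
# Minimal sketch of the conjugate update: posterior = Dir(alpha + counts).
import numpy as np

alpha_prior = np.array([1.0, 1.0, 1.0])      # symmetric prior over K = 3 categories
observations = [0, 2, 2, 1, 2, 0, 2]         # categorical samples coded 0..K-1

counts = np.bincount(observations, minlength=alpha_prior.size)   # the vector c
alpha_post = alpha_prior + counts                                 # Dir(alpha + c)

print(alpha_post)                       # [3. 2. 5.]
print(alpha_post / alpha_post.sum())    # posterior mean estimate of p
</syntaxhighlight>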
In Bayesian [[mixture model]]s and other [[hierarchical Bayesian model]]s with mixture components, Dirichlet distributions are commonly used as the prior distributions for the [[categorical distribution|categorical variable]]s appearing in the models. See the section on [[#Occurrence and applications|applications]] below for more information.

===Relation to Dirichlet-multinomial distribution===
In a model where a Dirichlet prior distribution is placed over a set of [[categorical distribution|categorical-valued]] observations, the [[marginal distribution|marginal]] [[joint distribution]] of the observations (i.e. the joint distribution of the observations, with the prior parameter [[marginalized out]]) is a [[Dirichlet-multinomial distribution]]. This distribution plays an important role in [[hierarchical Bayesian model]]s, because when doing [[statistical inference|inference]] over such models using methods such as [[Gibbs sampling]] or [[variational Bayes]], Dirichlet prior distributions are often marginalized out. See the [[Dirichlet-multinomial distribution|article on this distribution]] for more details.

===Entropy===
If {{mvar|X}} is a <math>\operatorname{Dir}(\boldsymbol\alpha)</math> random variable, the [[differential entropy]] of {{mvar|X}} (in [[nat (unit)|nat units]]) is<ref>{{cite book |last1=Lin |first1=Jiayu |title=On The Dirichlet Distribution |date=2016 |publisher=Queen's University |location=Kingston, Canada |pages=§ 2.4.9 |url=https://mast.queensu.ca/~communications/Papers/msc-jiayu-lin.pdf}}</ref>
<math display=block>h(\boldsymbol X) = \operatorname{E}[- \ln f(\boldsymbol X)] = \ln \operatorname{B}(\boldsymbol\alpha) + (\alpha_0-K)\psi(\alpha_0) - \sum_{j=1}^K (\alpha_j-1)\psi(\alpha_j) </math>
where <math>\psi</math> is the [[digamma function]].

The following formula for <math> \operatorname{E}[\ln(X_i)]</math> can be used to derive the differential entropy above. Since the functions <math>\ln(X_i)</math> are the sufficient statistics of the Dirichlet distribution, the [[Exponential family#Moments and cumulants of the sufficient statistic|exponential family differential identities]] can be used to get an analytic expression for the expectation of <math>\ln(X_i)</math> (see equation (2.62) in <ref>{{cite web|last=Nguyen|first=Duy|title=An In Depth Introduction to Variational Bayes Note|date=15 August 2023 |ssrn=4541076 |url=https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4541076|access-date=15 August 2023}}</ref>) and its associated covariance matrix:
<math display=block>\operatorname{E}[\ln(X_i)] = \psi(\alpha_i)-\psi(\alpha_0)</math>
and
<math display=block>\operatorname{Cov}[\ln(X_i),\ln(X_j)] = \psi'(\alpha_i) \delta_{ij} - \psi'(\alpha_0)</math>
where <math>\psi</math> is the [[digamma function]], <math>\psi'</math> is the [[trigamma function]], and <math>\delta_{ij}</math> is the [[Kronecker delta]].

The spectrum of [[Rényi entropy|Rényi information]] for values other than <math> \lambda = 1</math> is given by<ref>{{cite journal | journal=Journal of Statistical Planning and Inference | volume=93 | pages=51–69 | year=2001 | author=Song, Kai-Sheng | title=Rényi information, loglikelihood, and an intrinsic distribution measure| doi = 10.1016/S0378-3758(00)00169-5 | publisher=Elsevier}}</ref>
<math display=block>F_R(\lambda) = (1-\lambda)^{-1} \left( - \lambda \log \mathrm{B}(\boldsymbol\alpha) + \sum_{i=1}^K \log \Gamma(\lambda(\alpha_i - 1) + 1) - \log \Gamma(\lambda (\alpha_0 - K) + K ) \right) </math>
and the information entropy is the limit as <math>\lambda</math> goes to 1.

A related quantity is the entropy of a discrete categorical (one-of-''K'' binary) vector {{math|'''Z'''}} with probability-mass distribution {{math|'''X'''}}, i.e., <math> P(Z_i=1, Z_{j\ne i} = 0 | \boldsymbol X) = X_i </math>. The conditional [[information entropy]] of {{math|'''Z'''}}, given {{math|'''X'''}}, is
<math display=block>S(\boldsymbol X) = H(\boldsymbol Z | \boldsymbol X) = \operatorname{E}_{\boldsymbol Z}[- \log P(\boldsymbol Z | \boldsymbol X ) ] = \sum_{i=1}^K - X_i \log X_i </math>
This function of {{math|'''X'''}} is a scalar random variable. If {{math|'''X'''}} has a symmetric Dirichlet distribution with all <math>\alpha_i = \alpha</math>, the expected value of the entropy (in [[nat (unit)|nat units]]) is<ref>{{cite conference |last1=Nemenman |first1=Ilya |last2=Shafee |first2=Fariel |last3=Bialek |first3=William |title= Entropy and Inference, revisited |date=2002 |conference=NIPS 14 |url=http://papers.nips.cc/paper/1965-entropy-and-inference-revisited.pdf}}, eq. 8</ref>
<math display=block>\operatorname{E}[S(\boldsymbol X)] = \sum_{i=1}^K \operatorname{E}[- X_i \ln X_i] = \psi(K\alpha + 1) - \psi(\alpha + 1) </math>

===Aggregation===
If
<math display=block>X = (X_1, \ldots, X_K)\sim\operatorname{Dir}(\alpha_1,\ldots,\alpha_K)</math>
then, if the random variables with subscripts {{mvar|i}} and {{mvar|j}} are dropped from the vector and replaced by their sum,
<math display=block>X' = (X_1, \ldots, X_i + X_j, \ldots, X_K)\sim\operatorname{Dir} (\alpha_1, \ldots, \alpha_i + \alpha_j, \ldots, \alpha_K).</math>
This aggregation property may be used to derive the marginal distribution of <math>X_i</math> mentioned above.
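The aggregation property can be illustrated numerically by merging two coordinates of Dirichlet samples and comparing moments with samples drawn directly from the Dirichlet distribution with the summed parameter. A minimal sketch, assuming NumPy and arbitrary parameters:

<syntaxhighlight lang="python">
# Minimal sketch: merging X_2 and X_3 should behave like Dir(alpha_1, alpha_2 + alpha_3).
import numpy as np

alpha = np.array([2.0, 3.0, 4.0])   # arbitrary concentration parameters
rng = np.random.default_rng(2)

x = rng.dirichlet(alpha, size=200_000)
merged = np.column_stack([x[:, 0], x[:, 1] + x[:, 2]])            # (X_1, X_2 + X_3)
direct = rng.dirichlet([alpha[0], alpha[1] + alpha[2]], size=200_000)

print(merged.mean(axis=0), direct.mean(axis=0))   # means should agree
print(merged.var(axis=0), direct.var(axis=0))     # variances should agree
</syntaxhighlight>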
===Neutrality===
{{main|Neutral vector}}

If <math>X = (X_1, \ldots, X_K)\sim\operatorname{Dir}(\boldsymbol\alpha)</math>, then the vector {{mvar|X}} is said to be ''neutral''<ref>{{cite journal | journal=Journal of the American Statistical Association | volume=64 | issue=325 | pages=194–206 | year=1969 | author=Connor, Robert J. | title=Concepts of Independence for Proportions with a Generalization of the Dirichlet Distribution | doi = 10.2307/2283728 | jstor=2283728 | author2=Mosimann, James E | publisher=American Statistical Association }}</ref> in the sense that ''X{{sub|K}}'' is independent of <math>X^{(-K)}</math><ref name=FKG/> where
<math display=block>X^{(-K)}=\left(\frac{X_1}{1-X_K},\frac{X_2}{1-X_K},\ldots,\frac{X_{K-1}}{1-X_K} \right),</math>
and similarly for removing any of <math>X_1,\ldots,X_{K-1}</math>. Observe that any permutation of {{mvar|X}} is also neutral (a property not possessed by samples drawn from a [[generalized Dirichlet distribution]]).<ref>See Kotz, Balakrishnan & Johnson (2000), Section 8.5, "Connor and Mosimann's Generalization", pp. 519–521.</ref>

Combining this with the aggregation property, it follows that {{math|''X''{{sub|''j''}} + ... + ''X''{{sub|''K''}}}} is independent of <math>\left(\frac{X_1}{X_1+\cdots +X_{j-1}},\frac{X_2}{X_1+\cdots +X_{j-1}},\ldots,\frac{X_{j-1}}{X_1+\cdots +X_{j-1}} \right)</math>. In fact it is true, further, that for the Dirichlet distribution and <math>3\le j\le K-1</math>, the pair <math>\left(X_1+\cdots +X_{j-1}, X_j+\cdots +X_K\right)</math> and the two vectors <math>\left(\frac{X_1}{X_1+\cdots +X_{j-1}},\frac{X_2}{X_1+\cdots +X_{j-1}},\ldots,\frac{X_{j-1}}{X_1+\cdots +X_{j-1}} \right)</math> and <math>\left(\frac{X_j}{X_j+\cdots +X_K},\frac{X_{j+1}}{X_j+\cdots +X_K},\ldots,\frac{X_K}{X_j+\cdots +X_K} \right)</math>, viewed as a triple of normalised random vectors, are [[Independence (probability theory)#More than two random variables|mutually independent]]. The analogous result holds for any other partition of the indices {{math|{{mset|1, 2, ..., ''K''}}}} into a pair of non-singleton subsets.
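Neutrality can be probed, weakly, by simulation: full independence is stronger than zero correlation, but the correlations between ''X{{sub|K}}'' and the coordinates of <math>X^{(-K)}</math> should be near zero. A minimal sketch, assuming NumPy and arbitrary parameters:

<syntaxhighlight lang="python">
# Minimal sketch: correlations between X_K and the renormalised remainder
# X^(-K) should be close to zero (a necessary, not sufficient, check of neutrality).
import numpy as np

alpha = np.array([1.0, 2.0, 3.0, 4.0])   # arbitrary concentration parameters
rng = np.random.default_rng(3)

x = rng.dirichlet(alpha, size=200_000)
x_K = x[:, -1]
x_rest = x[:, :-1] / (1.0 - x_K)[:, None]   # X^(-K), renormalised to the simplex

for j in range(x_rest.shape[1]):
    print(np.corrcoef(x_K, x_rest[:, j])[0, 1])   # all close to 0
</syntaxhighlight>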
===Characteristic function===
The characteristic function of the Dirichlet distribution is a [[confluent hypergeometric function|confluent]] form of the [[Lauricella hypergeometric series]]. It is given by [[Peter C. B. Phillips|Phillips]] as<ref name="phillips1988">{{cite journal |first=P. C. B. |last=Phillips |year=1988 |url=https://cowles.yale.edu/sites/default/files/files/pub/d08/d0865.pdf |title=The characteristic function of the Dirichlet and multivariate F distribution |journal=Cowles Foundation Discussion Paper 865 }}</ref>
<math display=block> CF\left(s_1,\ldots,s_{K-1}\right) = \operatorname{E}\left(e^{i\left(s_1X_1+\cdots+s_{K-1}X_{K-1} \right)} \right)= \Psi^{\left[K-1\right]} (\alpha_1,\ldots,\alpha_{K-1};\alpha_0;is_1,\ldots, is_{K-1}) </math>
where
<math display=block> \Psi^{[m]} (a_1,\ldots,a_m;c;z_1,\ldots, z_m) = \sum\frac{(a_1)_{k_1} \cdots (a_m)_{k_m} \, z_1^{k_1} \cdots z_m^{k_m}}{(c)_k\,k_1!\cdots k_m!}. </math>
The sum is over non-negative integers <math>k_1,\ldots,k_m</math> with <math>k=k_1+\cdots+k_m</math>.

Phillips goes on to state that this form is "inconvenient for numerical calculation" and gives an alternative in terms of a [[Methods of contour integration|complex path integral]]:
<math display=block> \Psi^{[m]} = \frac{\Gamma(c)}{2\pi i}\int_L e^t\,t^{a_1+\cdots+a_m-c}\,\prod_{j=1}^m (t-z_j)^{-a_j} \, dt</math>
where {{mvar|L}} denotes any path in the complex plane originating at <math>-\infty</math>, encircling in the positive direction all the singularities of the integrand and returning to <math>-\infty</math>.

===Inequality===
The probability density function <math>f \left(x_1,\ldots, x_{K-1}; \alpha_1,\ldots, \alpha_K \right)</math> plays a key role in a multifunctional inequality which implies various bounds for the Dirichlet distribution.<ref>{{cite journal | last1=Grinshpan | first1=A. Z. | title=An inequality for multiple convolutions with respect to Dirichlet probability measure | doi=10.1016/j.aam.2016.08.001 | year=2017 | journal=Advances in Applied Mathematics | volume=82 | issue=1 | pages=102–119 | doi-access=free }}</ref>

Another inequality relates the moment-generating function of the Dirichlet distribution to the convex conjugate of the scaled reversed Kullback–Leibler divergence:<ref>{{cite arXiv | last1=Perrault| first1=P. | title=A New Bound on the Cumulant Generating Function of Dirichlet Processes |eprint=2409.18621 | year=2024| class=math.PR }} Theorem 3.3</ref>
<math display=block> \log \operatorname{E}\left(\exp{\sum_{i=1}^K s_i X_i } \right) \leq \sup_p \sum_{i=1}^K \left(p_i s_i - \alpha_i\log\left(\frac{\alpha_i}{\alpha_0 p_i} \right)\right), </math>
where the supremum is taken over {{mvar|p}} ranging over the {{math|(''K'' − 1)}}-simplex.
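Both sides of this bound can be evaluated numerically: the left-hand side by Monte Carlo, the right-hand side by constrained optimisation over the simplex. A minimal sketch (not taken from the cited paper), assuming NumPy and SciPy and arbitrary choices of <math>\boldsymbol\alpha</math> and <math>\boldsymbol s</math>:

<syntaxhighlight lang="python">
# Minimal sketch: Monte Carlo estimate of log E[exp(s . X)] versus a numerical
# evaluation of sup_p sum_i ( p_i s_i - alpha_i log(alpha_i / (alpha0 p_i)) ).
import numpy as np
from scipy.optimize import minimize

alpha = np.array([2.0, 3.0, 5.0])   # arbitrary concentration parameters
alpha0 = alpha.sum()
s = np.array([1.0, -0.5, 2.0])      # arbitrary test vector

rng = np.random.default_rng(4)
x = rng.dirichlet(alpha, size=500_000)
lhs = np.log(np.mean(np.exp(x @ s)))          # Monte Carlo left-hand side

def neg_objective(p):
    # negative of sum_i ( p_i s_i - alpha_i log(alpha_i / (alpha0 p_i)) )
    return -(p @ s - np.sum(alpha * np.log(alpha / (alpha0 * p))))

res = minimize(
    neg_objective,
    x0=alpha / alpha0,                                        # start at the mean
    bounds=[(1e-9, 1.0)] * len(alpha),
    constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}],
    method="SLSQP",
)
rhs = -res.fun

print(lhs, rhs)   # the bound asserts lhs <= rhs
</syntaxhighlight>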