===Entropy===

If {{mvar|X}} is a <math>\operatorname{Dir}(\boldsymbol\alpha)</math> random variable, the [[differential entropy]] of {{mvar|X}} (in [[nat (unit)|nat units]]) is<ref>{{cite book |last1=Lin |first1=Jiayu |title=On The Dirichlet Distribution |date=2016 |publisher=Queen's University |location=Kingston, Canada |pages=§ 2.4.9 |url=https://mast.queensu.ca/~communications/Papers/msc-jiayu-lin.pdf}}</ref>

<math display=block>h(\boldsymbol X) = \operatorname{E}[- \ln f(\boldsymbol X)] = \ln \operatorname{B}(\boldsymbol\alpha) + (\alpha_0-K)\psi(\alpha_0) - \sum_{j=1}^K (\alpha_j-1)\psi(\alpha_j) </math>

where <math>\psi</math> is the [[digamma function]].
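As a numerical illustration, the closed form above can be evaluated with only the Python standard library. This is a minimal sketch and is not taken from the cited source: the names <code>digamma</code> and <code>dirichlet_entropy</code> are hypothetical, and the digamma helper is a simple approximation (upward recurrence followed by a truncated asymptotic series) standing in for a library routine such as <code>scipy.special.digamma</code>.

```python
import math


def digamma(x):
    """Approximate the digamma function psi(x) for x > 0.

    Hypothetical helper: shifts x upward with psi(x) = psi(x + 1) - 1/x,
    then applies a truncated asymptotic expansion for large x.
    """
    if x <= 0:
        raise ValueError("this sketch only handles x > 0")
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # psi(x) ~ ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 * inv - series


def dirichlet_entropy(alpha):
    """Differential entropy (in nats) of Dir(alpha), using the closed form."""
    k = len(alpha)
    alpha0 = sum(alpha)
    # ln B(alpha) = sum_j ln Gamma(alpha_j) - ln Gamma(alpha_0)
    log_beta = sum(math.lgamma(a) for a in alpha) - math.lgamma(alpha0)
    return (
        log_beta
        + (alpha0 - k) * digamma(alpha0)
        - sum((a - 1.0) * digamma(a) for a in alpha)
    )
```

For <math>\boldsymbol\alpha = (1, 1, 1)</math> the density is uniform on the simplex, both digamma terms vanish, and the function returns <math>\ln \operatorname{B}(\boldsymbol\alpha) = -\ln 2</math>.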

The differential [[information entropy|entropy]] above can be derived from the following formula for <math>\operatorname{E}[\ln(X_i)]</math>. Since the functions <math>\ln(X_i)</math> are the sufficient statistics of the Dirichlet distribution, the [[Exponential family#Moments and cumulants of the sufficient statistic|exponential family differential identities]] yield analytic expressions for the expectation of <math>\ln(X_i)</math> and for the associated covariance matrix:<ref>{{cite web|last=Nguyen|first=Duy|title=AN IN DEPTH INTRODUCTION TO VARIATIONAL BAYES NOTE|date=15 August 2023 |ssrn=4541076 |url=https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4541076|access-date=15 August 2023}}, eq. 2.62</ref>

<math display=block>\operatorname{E}[\ln(X_i)] = \psi(\alpha_i)-\psi(\alpha_0)</math>

and

<math display=block>\operatorname{Cov}[\ln(X_i),\ln(X_j)] = \psi'(\alpha_i) \delta_{ij} - \psi'(\alpha_0)</math>

where <math>\psi</math> is the [[digamma function]], <math>\psi'</math> is the [[trigamma function]], and <math>\delta_{ij}</math> is the [[Kronecker delta]].
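The step from the expectation of <math>\ln(X_i)</math> to the differential entropy can be sketched as follows. Writing the density as <math>f(\boldsymbol x) = \operatorname{B}(\boldsymbol\alpha)^{-1} \prod_{j=1}^K x_j^{\alpha_j-1}</math> and taking the expectation of <math>-\ln f(\boldsymbol X)</math> term by term gives

<math display=block>\begin{align}
h(\boldsymbol X) &= \ln \operatorname{B}(\boldsymbol\alpha) - \sum_{j=1}^K (\alpha_j-1)\operatorname{E}[\ln(X_j)] \\
&= \ln \operatorname{B}(\boldsymbol\alpha) - \sum_{j=1}^K (\alpha_j-1)\bigl(\psi(\alpha_j)-\psi(\alpha_0)\bigr) \\
&= \ln \operatorname{B}(\boldsymbol\alpha) + (\alpha_0-K)\psi(\alpha_0) - \sum_{j=1}^K (\alpha_j-1)\psi(\alpha_j)
\end{align}</math>

where the last line uses <math>\textstyle\sum_{j=1}^K (\alpha_j-1) = \alpha_0-K</math>.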

The spectrum of [[Rényi entropy|Rényi information]] for values other than <math> \lambda = 1</math> is given by<ref>{{cite journal | journal=Journal of Statistical Planning and Inference | volume=93 | issue=325 | pages=51–69 | year=2001  | author=Song, Kai-Sheng | title=Rényi information, loglikelihood, and an intrinsic distribution measure| doi = 10.1016/S0378-3758(00)00169-5 | publisher=Elsevier}}</ref>

<math display=block>F_R(\lambda) = (1-\lambda)^{-1} \left( - \lambda \log \mathrm{B}(\boldsymbol\alpha) + \sum_{i=1}^K \log \Gamma(\lambda(\alpha_i - 1) + 1) - \log \Gamma(\lambda (\alpha_0 - K) + K ) \right) </math>

and the differential entropy <math>h(\boldsymbol X)</math> is recovered in the limit as <math>\lambda</math> goes to 1.
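The Rényi formula involves only log-gamma terms, so it can be evaluated directly with the Python standard library. The following is a minimal sketch rather than a reference implementation: the function names are hypothetical, and the check that every <math>\lambda(\alpha_i - 1) + 1</math> is positive is an added guard reflecting the assumption that the underlying integral must converge.

```python
import math


def log_beta(alpha):
    """ln B(alpha) = sum_i ln Gamma(alpha_i) - ln Gamma(alpha_0)."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))


def renyi_information(alpha, lam):
    """Renyi information F_R(lambda) of Dir(alpha), in nats, for lambda != 1."""
    if lam == 1:
        raise ValueError("use the differential entropy for lambda = 1")
    k = len(alpha)
    alpha0 = sum(alpha)
    # Exponents of the integrand f(x)**lam, shifted to Dirichlet parameters.
    shifted = [lam * (a - 1.0) + 1.0 for a in alpha]
    if min(shifted) <= 0:
        raise ValueError("integral of f**lam diverges for these parameters")
    total = (
        -lam * log_beta(alpha)
        + sum(math.lgamma(s) for s in shifted)
        - math.lgamma(lam * (alpha0 - k) + k)
    )
    return total / (1.0 - lam)
```

Evaluating the function at values of <math>\lambda</math> slightly below and slightly above 1 brackets the differential entropy. For <math>\boldsymbol\alpha = (2, 2)</math> and <math>\lambda = 2</math> it returns <math>-\ln 1.2</math>, which matches a direct calculation of <math>\textstyle -\ln \int_0^1 (6x(1-x))^2 \, dx</math>.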

Another related measure of interest is the entropy of a discrete categorical (one-of-K binary) vector {{math|'''Z'''}} with probability-mass distribution {{math|'''X'''}}, i.e., <math> P(Z_i=1, Z_{j\ne i} = 0 | \boldsymbol X) = X_i </math>. The conditional [[information entropy]] of {{math|'''Z'''}} given {{math|'''X'''}} is

<math display=block>S(\boldsymbol X) = H(\boldsymbol Z | \boldsymbol X) = \operatorname{E}_{\boldsymbol Z}[- \log P(\boldsymbol Z | \boldsymbol X ) ] = \sum_{i=1}^K - X_i \log X_i </math>

This function of {{math|'''X'''}} is a scalar random variable. If {{math|'''X'''}} has a symmetric Dirichlet distribution with all <math>\alpha_i = \alpha</math>, the expected value of <math>S(\boldsymbol X)</math> (in [[nat (unit)|nat units]]) is<ref>{{cite conference |last1=Nemenman |first1=Ilya |last2=Shafee |first2=Fariel |last3=Bialek |first3=William |title= Entropy and Inference, revisited |date=2002 |conference=NIPS 14 |url=http://papers.nips.cc/paper/1965-entropy-and-inference-revisited.pdf}}, eq. 8</ref>

<math display=block>\operatorname{E}[S(\boldsymbol X)] = \sum_{i=1}^K \operatorname{E}[- X_i \ln X_i] = \psi(K\alpha + 1) - \psi(\alpha + 1) </math>
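The expected entropy can be checked by simulation. The sketch below makes two assumptions that are not part of the cited source. First, it restricts <math>\alpha</math> to positive integers, so that the digamma difference reduces to a difference of harmonic numbers (<math>\psi(n+1) = H_n - \gamma</math>, and the Euler constant <math>\gamma</math> cancels). Second, it draws symmetric Dirichlet samples by normalizing independent gamma variates. All function names are hypothetical.

```python
import math
import random


def categorical_entropy(x):
    """S(x) = -sum_i x_i ln x_i in nats, with 0 ln 0 treated as 0."""
    return -sum(p * math.log(p) for p in x if p > 0.0)


def sample_symmetric_dirichlet(k, alpha, rng):
    """Draw one sample from Dir(alpha, ..., alpha) by normalizing gammas."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(g)
    return [v / total for v in g]


def harmonic(n):
    """n-th harmonic number H_n, with H_0 = 0."""
    return sum(1.0 / i for i in range(1, n + 1))


def expected_entropy_integer_alpha(k, alpha):
    """psi(K*alpha + 1) - psi(alpha + 1) for integer alpha >= 1."""
    return harmonic(k * alpha) - harmonic(alpha)


def monte_carlo_expected_entropy(k, alpha, n_samples=20000, seed=0):
    """Average S(X) over samples X ~ Dir(alpha, ..., alpha)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += categorical_entropy(sample_symmetric_dirichlet(k, alpha, rng))
    return total / n_samples
```

For <math>K = 2</math> and <math>\alpha = 1</math> the closed form gives <math>H_2 - H_1 = 1/2</math>, and the Monte Carlo average should agree with it up to sampling error.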