==Related distributions==
When <math>\boldsymbol X=(X_1, \ldots,X_K)\sim \operatorname{Dir}\left(\alpha_1, \ldots, \alpha_K \right)</math>, the marginal distribution of each component is <math>X_i \sim \operatorname{Beta}(\alpha_i, \alpha_0-\alpha_i)</math>, a [[Beta distribution]]. In particular, if {{math|''K'' {{=}} 2}} then <math>X_1 \sim \operatorname{Beta}(\alpha_1, \alpha_2)</math> is equivalent to <math>\boldsymbol X=(X_1,1-X_1) \sim \operatorname{Dir}\left(\alpha_1, \alpha_2 \right)</math>.

For {{mvar|K}} independently distributed [[Gamma distribution]]s:
<math display=block>Y_1 \sim \operatorname{Gamma}(\alpha_1, \theta), \ldots, Y_K \sim \operatorname{Gamma}(\alpha_K, \theta)</math>
we have:<ref name=devroye>{{cite book |publisher=Springer-Verlag |year=1986 |last=Devroye |first=Luc |url=http://luc.devroye.org/rnbookindex.html |title=Non-Uniform Random Variate Generation |isbn=0-387-96305-7}}</ref>{{Rp|402}}
<math display=block>V=\sum_{i=1}^K Y_i\sim\operatorname{Gamma} \left(\alpha_0, \theta \right ),</math>
<math display=block>X = (X_1, \ldots, X_K) = \left(\frac{Y_1}{V}, \ldots, \frac{Y_K}{V} \right)\sim \operatorname{Dir}\left (\alpha_1, \ldots, \alpha_K \right).</math>
Although the ''X{{sub|i}}''s are not independent of one another, they can be seen to be generated from a set of {{mvar|K}} independent [[Gamma distribution|gamma]] random variables.<ref name=devroye/>{{Rp|594}} Unfortunately, since the sum {{mvar|V}} is lost in forming {{mvar|X}} (in fact it can be shown that {{mvar|V}} is stochastically independent of {{mvar|X}}), it is not possible to recover the original gamma random variables from these values alone. Nevertheless, because independent random variables are simpler to work with, this reparametrization can still be useful for proofs about properties of the Dirichlet distribution.
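The gamma-normalization construction above can be illustrated with a short NumPy sketch (an illustrative example, not drawn from the cited sources; the values of <code>alpha</code> and <code>theta</code> are arbitrary choices):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 5.0])  # example shape parameters (arbitrary)
theta = 1.0                        # common scale; any theta > 0 yields the same X

# K independent gamma variates Y_i ~ Gamma(alpha_i, theta), drawn 10000 times
y = rng.gamma(shape=alpha, scale=theta, size=(10_000, alpha.size))

v = y.sum(axis=1, keepdims=True)   # V = sum_i Y_i ~ Gamma(alpha_0, theta)
x = y / v                          # X = Y / V ~ Dir(alpha), independent of V

# Sanity check against the known mean E[X_i] = alpha_i / alpha_0
print(x.mean(axis=0), alpha / alpha.sum())
</syntaxhighlight>

Sampling <code>x</code> this way is equivalent in distribution to calling <code>rng.dirichlet(alpha)</code> directly.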
===Conjugate prior of the Dirichlet distribution===
Because the Dirichlet distribution is an [[exponential family|exponential family distribution]], it has a conjugate prior. The conjugate prior is of the form:<ref name=Lefkimmiatis2009>{{cite journal |first1=Stamatios |last1=Lefkimmiatis |first2=Petros |last2=Maragos |first3=George |last3=Papandreou |year=2009 |title=Bayesian Inference on Multiscale Models for Poisson Intensity Estimation: Applications to Photon-Limited Image Denoising |journal=IEEE Transactions on Image Processing |volume=18 |issue=8 |pages=1724–1741 |doi=10.1109/TIP.2009.2022008 |pmid=19414285 |bibcode=2009ITIP...18.1724L |s2cid=859561 }}</ref>
<math display=block>\operatorname{CD}(\boldsymbol\alpha \mid \boldsymbol{v},\eta) \propto \left(\frac{1}{\operatorname{B}(\boldsymbol\alpha)}\right)^\eta \exp\left(-\sum_k v_k \alpha_k\right).</math>
Here <math>\boldsymbol{v}</math> is a {{mvar|K}}-dimensional real vector and <math>\eta</math> is a scalar parameter. The domain of <math>(\boldsymbol{v},\eta)</math> is restricted to the set of parameters for which the above unnormalized density function can be normalized. The (necessary and sufficient) condition is:<ref name=Andreoli2018>{{cite arXiv |last=Andreoli |first=Jean-Marc |year=2018 |eprint=1811.05266 |title=A conjugate prior for the Dirichlet distribution |class=cs.LG }}</ref>
<math display=block>\forall k\;\; v_k>0 \;\;\;\;\text{ and } \;\;\;\;\eta>-1 \;\;\;\;\text{ and } \;\;\;\;\left(\eta\leq 0\;\;\;\;\text{ or }\;\;\;\;\sum_k \exp\left(-\frac{v_k}{\eta}\right) < 1\right)</math>
The conjugation property can be expressed as: if [''prior'': <math>\boldsymbol{\alpha}\sim\operatorname{CD}(\cdot \mid \boldsymbol{v},\eta)</math>] and [''observation'': <math>\boldsymbol{x}\mid\boldsymbol{\alpha}\sim\operatorname{Dirichlet}(\cdot \mid \boldsymbol{\alpha})</math>] then [''posterior'': <math>\boldsymbol{\alpha}\mid\boldsymbol{x}\sim\operatorname{CD}(\cdot \mid \boldsymbol{v}-\log \boldsymbol{x}, \eta+1)</math>]. In the published literature there is no practical algorithm to efficiently generate samples from <math>\operatorname{CD}(\boldsymbol{\alpha} \mid \boldsymbol{v},\eta)</math>.
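A minimal sketch of the conjugation property, assuming the unnormalized density given above (the hyperparameter values are arbitrary illustrations, not taken from the cited references):

<syntaxhighlight lang="python">
import numpy as np
from scipy.special import gammaln

def log_cd_unnormalized(alpha, v, eta):
    """Unnormalized log-density of CD(alpha | v, eta) from the expression above."""
    # log(1/B(alpha)) = log Gamma(alpha_0) - sum_k log Gamma(alpha_k)
    log_inv_beta = gammaln(alpha.sum()) - gammaln(alpha).sum()
    return eta * log_inv_beta - np.dot(v, alpha)

def cd_posterior_update(v, eta, x):
    """Posterior hyperparameters after observing x ~ Dirichlet(alpha)."""
    return v - np.log(x), eta + 1

# Arbitrary example hyperparameters; they satisfy the normalizability condition:
# v_k > 0, eta > -1, and sum_k exp(-v_k/eta) = 3*exp(-2) < 1.
v, eta = np.array([2.0, 2.0, 2.0]), 1.0
x = np.array([0.2, 0.3, 0.5])            # one observed point on the simplex
v_post, eta_post = cd_posterior_update(v, eta, x)
print(log_cd_unnormalized(np.array([2.0, 3.0, 5.0]), v_post, eta_post))
</syntaxhighlight>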
===Generalization by scaling and translation of log-probabilities===
As noted above, Dirichlet variates can be generated by normalizing independent [[Gamma distribution|gamma]] variates. If instead one normalizes [[Generalized gamma distribution|generalized gamma]] variates, one obtains variates from the [[simplicial generalized beta distribution]] (SGB).<ref name="sgb">{{cite web |last1=Graf |first1=Monique |year=2019 |title=The Simplicial Generalized Beta distribution - R-package SGB and applications |url=https://libra.unine.ch/server/api/core/bitstreams/dd593778-b1fd-4856-855b-7b21e005ee77/content |website=Libra |access-date=26 May 2025}}</ref> Conversely, SGB variates can be obtained by applying the [[softmax function]] to scaled and translated logarithms of Dirichlet variates. Specifically, let <math>\mathbf x = (x_1, \ldots, x_K)\sim\operatorname{Dir}(\boldsymbol\alpha)</math> and let <math>\mathbf y = (y_1, \ldots, y_K)</math>, where, applying the logarithm elementwise:
<math display=block>\mathbf y = \operatorname{softmax}(a^{-1}\log\mathbf x + \log\mathbf b)\;\iff\;\mathbf x = \operatorname{softmax}(a\log\mathbf y - a\log\mathbf b)</math>
or
<math display=block>y_k = \frac{b_k x_k^{1/a}}{\sum_{i=1}^K b_i x_i^{1/a}}\;\iff\; x_k = \frac{(y_k/b_k)^a}{\sum_{i=1}^K (y_i/b_i)^a}</math>
where <math>a>0</math> and <math>\mathbf b = (b_1, \ldots, b_K)</math>, with all <math>b_k>0</math>. Then <math>\mathbf y\sim\operatorname{SGB}(a, \mathbf b, \boldsymbol\alpha)</math>.

The SGB density function can be derived by noting that the transformation <math>\mathbf x\mapsto\mathbf y</math>, which is a [[bijection]] from the simplex to itself, induces a differential volume change factor<ref name='manifold_flow'>{{cite web |last1=Sorrenson |first1=Peter |display-authors=etal |year=2024 |title=Learning Distributions on Manifolds with Free-Form Flows |url=https://arxiv.org/abs/2312.09852 |website=arXiv}}</ref> of:
<math display=block>R(\mathbf y, a,\mathbf b) = a^{1-K}\prod_{k=1}^K\frac{y_k}{x_k}</math>
where it is understood that <math>\mathbf x</math> is recovered as a function of <math>\mathbf y</math>, as shown above. This facilitates writing the SGB density in terms of the Dirichlet density:
<math display=block>f_{\text{SGB}}(\mathbf y\mid a, \mathbf b, \boldsymbol\alpha) = \frac{f_{\text{Dir}}(\mathbf x\mid\boldsymbol\alpha)}{R(\mathbf y,a,\mathbf b)}</math>
This generalization of the Dirichlet density, via a [[change of variables]], is closely related to a [[normalizing flow]], although the differential volume change is not given by the [[Jacobian determinant]] of <math>\mathbf x\mapsto\mathbf y:\mathbb R^K\to\mathbb R^K</math>, which is zero, but by the Jacobian determinant of <math>(x_1,\ldots,x_{K-1})\mapsto(y_1,\ldots,y_{K-1})</math>.

For further insight into the interaction between the Dirichlet shape parameters <math>\boldsymbol\alpha</math> and the transformation parameters <math>a, \mathbf b</math>, it may be helpful to consider the logarithmic marginals, <math>\log\frac{x_k}{1-x_k}</math>, which follow the [[logistic-beta distribution]] <math>B_\sigma(\alpha_k,\sum_{i\ne k} \alpha_i)</math>. See in particular the sections on [[Generalized_logistic_distribution#Tail_behaviour|tail behaviour]] and [[Generalized_logistic_distribution#Generalization_with_location_and_scale_parameters|generalization with location and scale parameters]].

====Application====
When <math>b_1=b_2=\cdots=b_K</math>, the transformation simplifies to <math>\mathbf x\mapsto\operatorname{softmax}(a^{-1}\log\mathbf x)</math>, which is known as [[Platt_scaling#Analysis|temperature scaling]] in [[machine learning]], where it is used as a calibration transform for multiclass probabilistic classifiers.<ref>{{cite journal |last1=Ferrer |first1=Luciana |last2=Ramos |first2=Daniel |title=Evaluating Posterior Probabilities: Decision Theory, Proper Scoring Rules, and Calibration |journal=Transactions on Machine Learning Research |date=2025 |url=https://openreview.net/forum?id=qbrE0LR7fF}}</ref> Traditionally, the temperature parameter (<math>a</math> here) is learnt [[Discriminative_model|discriminatively]] by minimizing multiclass [[cross-entropy]] over a supervised calibration data set with known class labels, but the above PDF transformation mechanism can also be used to facilitate the design of [[Generative_model|generatively trained]] calibration models with a temperature scaling component.
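The transformation and its inverse can be sketched in a few lines of NumPy (illustrative only; the values of <code>a</code>, <code>b</code> and <code>alpha</code> are arbitrary). With all components of <code>b</code> equal, <code>dirichlet_to_sgb</code> reduces to temperature scaling with temperature <code>a</code>:

<syntaxhighlight lang="python">
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # stabilize before exponentiation
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dirichlet_to_sgb(x, a, b):
    """y = softmax(log(x)/a + log(b)): maps Dir(alpha) variates to SGB(a, b, alpha)."""
    return softmax(np.log(x) / a + np.log(b), axis=-1)

def sgb_to_dirichlet(y, a, b):
    """Inverse map: x = softmax(a*log(y) - a*log(b))."""
    return softmax(a * (np.log(y) - np.log(b)), axis=-1)

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 5.0])                 # arbitrary example shape parameters
a, b = 2.0, np.array([1.0, 0.5, 2.0])             # arbitrary transformation parameters
x = rng.dirichlet(alpha, size=4)                  # Dirichlet variates
y = dirichlet_to_sgb(x, a, b)                     # SGB(a, b, alpha) variates
print(np.allclose(x, sgb_to_dirichlet(y, a, b)))  # round trip recovers x
</syntaxhighlight>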