==Related distributions==
When <math>\boldsymbol X=(X_1, \ldots,X_K)\sim \operatorname{Dir}\left(\alpha_1, \ldots, \alpha_K \right)</math>, the marginal distribution of each component is <math>X_i \sim \operatorname{Beta}(\alpha_i, \alpha_0-\alpha_i)</math>, a [[Beta distribution]]. In particular, if {{math|''K'' {{=}} 2}} then <math>X_1 \sim \operatorname{Beta}(\alpha_1, \alpha_2)</math> is equivalent to <math>\boldsymbol X=(X_1,1-X_1) \sim \operatorname{Dir}\left(\alpha_1, \alpha_2 \right)</math>.

For {{mvar|K}} independently distributed [[Gamma distribution]]s:
<math display=block>Y_1 \sim \operatorname{Gamma}(\alpha_1, \theta), \ldots, Y_K \sim \operatorname{Gamma}(\alpha_K, \theta)</math>
we have:<ref name=devroye>{{cite book |publisher=Springer-Verlag |year=1986 |last=Devroye |first=Luc |url=http://luc.devroye.org/rnbookindex.html |title=Non-Uniform Random Variate Generation |isbn=0-387-96305-7}}</ref>{{Rp|402}}
<math display=block>V=\sum_{i=1}^K Y_i\sim\operatorname{Gamma} \left(\alpha_0, \theta \right ),</math>
<math display=block>X = (X_1, \ldots, X_K) = \left(\frac{Y_1}{V}, \ldots, \frac{Y_K}{V} \right)\sim \operatorname{Dir}\left (\alpha_1, \ldots, \alpha_K \right).</math>
Although the ''X{{sub|i}}''s are not independent of one another, they can be seen to be generated from a set of {{mvar|K}} independent [[Gamma distribution|gamma]] random variables.<ref name=devroye/>{{Rp|594}} Unfortunately, since the sum {{mvar|V}} is lost in forming {{mvar|X}} (in fact it can be shown that {{mvar|V}} is stochastically independent of {{mvar|X}}), it is not possible to recover the original gamma random variables from these values alone. Nevertheless, because independent random variables are simpler to work with, this reparametrization can still be useful for proofs about properties of the Dirichlet distribution.
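The gamma-normalization construction above can be illustrated with a short NumPy sketch (an illustrative example, not drawn from the cited sources; the values of <code>alpha</code> and <code>theta</code> are arbitrary choices):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 5.0])  # example shape parameters (arbitrary)
theta = 1.0                        # common scale; any theta > 0 yields the same X

# K independent gamma variates Y_i ~ Gamma(alpha_i, theta), drawn 10000 times
y = rng.gamma(shape=alpha, scale=theta, size=(10_000, alpha.size))

v = y.sum(axis=1, keepdims=True)   # V = sum_i Y_i ~ Gamma(alpha_0, theta)
x = y / v                          # X = Y / V ~ Dir(alpha), independent of V

# Sanity check against the known mean E[X_i] = alpha_i / alpha_0
print(x.mean(axis=0), alpha / alpha.sum())
</syntaxhighlight>

Sampling <code>x</code> this way is equivalent in distribution to calling <code>rng.dirichlet(alpha)</code> directly.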
===Conjugate prior of the Dirichlet distribution===
Because the Dirichlet distribution is an [[exponential family|exponential family distribution]], it has a conjugate prior. The conjugate prior is of the form:<ref name=Lefkimmiatis2009>{{cite journal |first1=Stamatios |last1=Lefkimmiatis |first2=Petros |last2=Maragos |first3=George |last3=Papandreou |year=2009 |title=Bayesian Inference on Multiscale Models for Poisson Intensity Estimation: Applications to Photon-Limited Image Denoising |journal=IEEE Transactions on Image Processing |volume=18 |issue=8 |pages=1724–1741 |doi=10.1109/TIP.2009.2022008 |pmid=19414285 |bibcode=2009ITIP...18.1724L |s2cid=859561 }}</ref>
<math display=block>\operatorname{CD}(\boldsymbol\alpha \mid \boldsymbol{v},\eta) \propto \left(\frac{1}{\operatorname{B}(\boldsymbol\alpha)}\right)^\eta \exp\left(-\sum_k v_k \alpha_k\right).</math>
Here <math>\boldsymbol{v}</math> is a {{mvar|K}}-dimensional real vector and <math>\eta</math> is a scalar parameter. The domain of <math>(\boldsymbol{v},\eta)</math> is restricted to the set of parameters for which the above unnormalized density function can be normalized. The (necessary and sufficient) condition is:<ref name=Andreoli2018>{{cite arXiv |last=Andreoli |first=Jean-Marc |year=2018 |eprint=1811.05266 |title=A conjugate prior for the Dirichlet distribution |class=cs.LG }}</ref>
<math display=block>\forall k\;\; v_k>0 \;\;\;\;\text{ and } \;\;\;\;\eta>-1 \;\;\;\;\text{ and } \;\;\;\;\left(\eta\leq 0\;\;\;\;\text{ or }\;\;\;\;\sum_k \exp\left(-\frac{v_k}{\eta}\right) < 1\right)</math>
The conjugation property can be expressed as: if [''prior'': <math>\boldsymbol{\alpha}\sim\operatorname{CD}(\cdot \mid \boldsymbol{v},\eta)</math>] and [''observation'': <math>\boldsymbol{x}\mid\boldsymbol{\alpha}\sim\operatorname{Dirichlet}(\cdot \mid \boldsymbol{\alpha})</math>] then [''posterior'': <math>\boldsymbol{\alpha}\mid\boldsymbol{x}\sim\operatorname{CD}(\cdot \mid \boldsymbol{v}-\log \boldsymbol{x}, \eta+1)</math>]. In the published literature there is no practical algorithm to efficiently generate samples from <math>\operatorname{CD}(\boldsymbol{\alpha} \mid \boldsymbol{v},\eta)</math>.
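A minimal sketch of the conjugation property, assuming the unnormalized density given above (the hyperparameter values are arbitrary illustrations, not taken from the cited references):

<syntaxhighlight lang="python">
import numpy as np
from scipy.special import gammaln

def log_cd_unnormalized(alpha, v, eta):
    """Unnormalized log-density of CD(alpha | v, eta) from the expression above."""
    # log(1/B(alpha)) = log Gamma(alpha_0) - sum_k log Gamma(alpha_k)
    log_inv_beta = gammaln(alpha.sum()) - gammaln(alpha).sum()
    return eta * log_inv_beta - np.dot(v, alpha)

def cd_posterior_update(v, eta, x):
    """Posterior hyperparameters after observing x ~ Dirichlet(alpha)."""
    return v - np.log(x), eta + 1

# Arbitrary example hyperparameters; they satisfy the normalizability condition:
# v_k > 0, eta > -1, and sum_k exp(-v_k/eta) = 3*exp(-2) < 1.
v, eta = np.array([2.0, 2.0, 2.0]), 1.0
x = np.array([0.2, 0.3, 0.5])            # one observed point on the simplex
v_post, eta_post = cd_posterior_update(v, eta, x)
print(log_cd_unnormalized(np.array([2.0, 3.0, 5.0]), v_post, eta_post))
</syntaxhighlight>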
===Generalization by scaling and translation of log-probabilities===
As noted above, Dirichlet variates can be generated by normalizing independent [[Gamma distribution|gamma]] variates. If instead one normalizes [[Generalized gamma distribution|generalized gamma]] variates, one obtains variates from the [[simplicial generalized beta distribution]] (SGB).<ref name="sgb">{{cite web |last1=Graf |first1=Monique |year=2019 |title=The Simplicial Generalized Beta distribution - R-package SGB and applications |url=https://libra.unine.ch/server/api/core/bitstreams/dd593778-b1fd-4856-855b-7b21e005ee77/content |website=Libra |access-date=26 May 2025}}</ref> Conversely, SGB variates can be obtained by applying the [[softmax function]] to scaled and translated logarithms of Dirichlet variates. Specifically, let <math>\mathbf x = (x_1, \ldots, x_K)\sim\operatorname{Dir}(\boldsymbol\alpha)</math> and let <math>\mathbf y = (y_1, \ldots, y_K)</math>, where, applying the logarithm elementwise:
<math display=block>\mathbf y = \operatorname{softmax}(a^{-1}\log\mathbf x + \log\mathbf b)\;\iff\;\mathbf x = \operatorname{softmax}(a\log\mathbf y - a\log\mathbf b)</math>
or
<math display=block>y_k = \frac{b_k x_k^{1/a}}{\sum_{i=1}^K b_i x_i^{1/a}}\;\iff\; x_k = \frac{(y_k/b_k)^a}{\sum_{i=1}^K (y_i/b_i)^a}</math>
where <math>a>0</math> and <math>\mathbf b = (b_1, \ldots, b_K)</math>, with all <math>b_k>0</math>. Then <math>\mathbf y\sim\operatorname{SGB}(a, \mathbf b, \boldsymbol\alpha)</math>.

The SGB density function can be derived by noting that the transformation <math>\mathbf x\mapsto\mathbf y</math>, which is a [[bijection]] from the simplex to itself, induces a differential volume change factor<ref name='manifold_flow'>{{cite web |last1=Sorrenson |first1=Peter |display-authors=etal |year=2024 |title=Learning Distributions on Manifolds with Free-Form Flows |url=https://arxiv.org/abs/2312.09852 |website=arXiv}}</ref> of:
<math display=block>R(\mathbf y, a,\mathbf b) = a^{1-K}\prod_{k=1}^K\frac{y_k}{x_k}</math>
where it is understood that <math>\mathbf x</math> is recovered as a function of <math>\mathbf y</math>, as shown above. This facilitates writing the SGB density in terms of the Dirichlet density:
<math display=block>f_{\text{SGB}}(\mathbf y\mid a, \mathbf b, \boldsymbol\alpha) = \frac{f_{\text{Dir}}(\mathbf x\mid\boldsymbol\alpha)}{R(\mathbf y,a,\mathbf b)}</math>
This generalization of the Dirichlet density, via a [[change of variables]], is closely related to a [[normalizing flow]], although the differential volume change is not given by the [[Jacobian determinant]] of <math>\mathbf x\mapsto\mathbf y:\mathbb R^K\to\mathbb R^K</math>, which is zero, but by the Jacobian determinant of <math>(x_1,\ldots,x_{K-1})\mapsto(y_1,\ldots,y_{K-1})</math>.

For further insight into the interaction between the Dirichlet shape parameters <math>\boldsymbol\alpha</math> and the transformation parameters <math>a, \mathbf b</math>, it may be helpful to consider the logarithmic marginals, <math>\log\frac{x_k}{1-x_k}</math>, which follow the [[logistic-beta distribution]] <math>B_\sigma(\alpha_k,\sum_{i\ne k} \alpha_i)</math>. See in particular the sections on [[Generalized_logistic_distribution#Tail_behaviour|tail behaviour]] and [[Generalized_logistic_distribution#Generalization_with_location_and_scale_parameters|generalization with location and scale parameters]].

====Application====
When <math>b_1=b_2=\cdots=b_K</math>, the transformation simplifies to <math>\mathbf x\mapsto\operatorname{softmax}(a^{-1}\log\mathbf x)</math>, which is known as [[Platt_scaling#Analysis|temperature scaling]] in [[machine learning]], where it is used as a calibration transform for multiclass probabilistic classifiers.<ref>{{cite journal |last1=Ferrer |first1=Luciana |last2=Ramos |first2=Daniel |title=Evaluating Posterior Probabilities: Decision Theory, Proper Scoring Rules, and Calibration |journal=Transactions on Machine Learning Research |date=2025 |url=https://openreview.net/forum?id=qbrE0LR7fF}}</ref> Traditionally, the temperature parameter (<math>a</math> here) is learnt [[Discriminative_model|discriminatively]] by minimizing multiclass [[cross-entropy]] over a supervised calibration data set with known class labels, but the above PDF transformation mechanism can also be used to facilitate the design of [[Generative_model|generatively trained]] calibration models with a temperature scaling component.
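The transformation and its inverse can be sketched in a few lines of NumPy (illustrative only; the values of <code>a</code>, <code>b</code> and <code>alpha</code> are arbitrary). With all components of <code>b</code> equal, <code>dirichlet_to_sgb</code> reduces to temperature scaling with temperature <code>a</code>:

<syntaxhighlight lang="python">
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # stabilize before exponentiation
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dirichlet_to_sgb(x, a, b):
    """y = softmax(log(x)/a + log(b)): maps Dir(alpha) variates to SGB(a, b, alpha)."""
    return softmax(np.log(x) / a + np.log(b), axis=-1)

def sgb_to_dirichlet(y, a, b):
    """Inverse map: x = softmax(a*log(y) - a*log(b))."""
    return softmax(a * (np.log(y) - np.log(b)), axis=-1)

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 5.0])                 # arbitrary example shape parameters
a, b = 2.0, np.array([1.0, 0.5, 2.0])             # arbitrary transformation parameters
x = rng.dirichlet(alpha, size=4)                  # Dirichlet variates
y = dirichlet_to_sgb(x, a, b)                     # SGB(a, b, alpha) variates
print(np.allclose(x, sgb_to_dirichlet(y, a, b)))  # round trip recovers x
</syntaxhighlight>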