===== Collapsing Dirichlet distributions =====
In [[hierarchical Bayesian model]]s with [[categorical distribution|categorical variable]]s, such as [[latent Dirichlet allocation]] and various other models used in [[natural language processing]], it is quite common to collapse out the [[Dirichlet distribution]]s that are typically used as [[prior distribution]]s over the categorical variables. This collapsing introduces dependencies among all the categorical variables dependent on a given Dirichlet prior, and the joint distribution of these variables after collapsing is a [[Dirichlet-multinomial distribution]]. The conditional distribution of a given categorical variable in this joint distribution, conditioned on the others, assumes an extremely simple form that makes Gibbs sampling even easier than if the collapsing had not been done. The rules are as follows:
#Collapsing out a Dirichlet prior node affects only the parent and children nodes of the prior. Since the parent is often a constant, it is typically only the children that need to be considered.
#Collapsing out a Dirichlet prior introduces dependencies among all the categorical children dependent on that prior, but ''no'' extra dependencies among any other categorical children. (This is important to keep in mind, for example, when there are multiple Dirichlet priors related by the same hyperprior: each Dirichlet prior can be independently collapsed and affects only its direct children.)
#After collapsing, the conditional distribution of one dependent child given the others assumes a very simple form: the probability of seeing a given value is proportional to the sum of the corresponding hyperprior for this value and the count of all of the ''other dependent nodes'' assuming the same value. Nodes not dependent on the same prior '''must not''' be counted. The same rule applies in other iterative inference methods, such as [[variational Bayes]] or [[expectation maximization]]; however, if the method involves keeping partial counts, then the partial counts for the value in question must be summed across all the other dependent nodes. This summed partial count is sometimes termed the ''expected count'' or similar. The probability is ''proportional to'' the resulting value; the actual probability must be determined by normalizing across all the possible values that the categorical variable can take (i.e. adding up the computed result for each possible value of the categorical variable, and dividing all the computed results by this sum); see the code sketch after this list.
#If a given categorical node has dependent children (e.g. when it is a [[latent variable]] in a [[mixture model]]), the value computed in the previous step (expected count plus prior, or whatever is computed) must be multiplied by the actual conditional probabilities (''not'' a computed value that is merely proportional to the probability!) of all children given their parents. See the article on the [[Dirichlet-multinomial distribution]] for a detailed discussion.
#In the case where the group membership of the nodes dependent on a given Dirichlet prior may change dynamically depending on some other variable (e.g. a categorical variable indexed by another latent categorical variable, as in a [[topic model]]), the same expected counts are still computed, but this must be done carefully so that the correct set of variables is included. See the article on the [[Dirichlet-multinomial distribution]] for more discussion, including in the context of a topic model.
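The following is a minimal sketch in Python of the count-based conditional from rule 3, under simplifying assumptions: a single symmetric Dirichlet prior over K values, no dependent children (so the likelihood factor from rule 4 is omitted), and hypothetical names such as <code>collapsed_gibbs_step</code> that do not come from any particular library.

<syntaxhighlight lang="python">
import numpy as np

def collapsed_gibbs_step(z, counts, alpha, i, rng):
    """Resample the assignment z[i] given all the other assignments.

    After collapsing the Dirichlet prior, the conditional probability of
    value k is proportional to alpha[k] plus the count of the *other*
    dependent nodes currently assigned value k (rule 3 above).
    """
    counts[z[i]] -= 1        # remove node i, leaving counts of the others only
    probs = counts + alpha   # proportional to the conditional probability
    probs /= probs.sum()     # normalize over all K possible values
    # In a model with dependent children (rule 4), probs[k] would additionally
    # be multiplied here by the likelihood of node i's children given z[i] = k.
    z[i] = rng.choice(len(alpha), p=probs)
    counts[z[i]] += 1        # add node i back under its newly sampled value
    return z[i]

rng = np.random.default_rng(0)
K, N = 3, 10
alpha = np.full(K, 0.5)                             # symmetric Dirichlet hyperprior
z = rng.integers(0, K, size=N)                      # initial assignments
counts = np.bincount(z, minlength=K).astype(float)  # counts per value
for sweep in range(100):                            # Gibbs sweeps over all nodes
    for i in range(N):
        collapsed_gibbs_step(z, counts, alpha, i, rng)
</syntaxhighlight>

Note that node i is subtracted from the counts before its new value is drawn and added back afterwards, so the counts used in the conditional always reflect only the ''other'' nodes dependent on the same prior, as rule 3 requires.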