===== Collapsing other conjugate priors =====

In general, any conjugate prior can be collapsed out, if its only children have distributions conjugate to it. The relevant math is discussed in the article on [[compound distribution]]s. If there is only one child node, the result will often assume a known distribution. For example, collapsing an [[inverse gamma distribution|inverse-gamma-distributed]] [[variance]] out of a network with a single [[Gaussian distribution|Gaussian]] child will yield a [[Student's t-distribution]]. (For that matter, collapsing both the mean and variance of a single Gaussian child will still yield a Student's t-distribution, provided both are conjugate, i.e. Gaussian mean, inverse-gamma variance.) If there are multiple child nodes, they will all become dependent, as in the [[Dirichlet distribution|Dirichlet]]-[[categorical distribution|categorical]] case. The resulting [[joint distribution]] will have a closed form that resembles the compound distribution in some ways, although it will contain a product of factors, one for each child node.

In addition, and most importantly, the resulting [[conditional distribution]] of one of the child nodes given the others (and also given the parents of the collapsed node(s), but ''not'' given the children of the child nodes) will have the same density as the [[posterior predictive distribution]] of all the remaining child nodes. Furthermore, the posterior predictive distribution has the same density as the basic compound distribution of a single node, although with different parameters. The general formula is given in the article on [[compound distribution]]s. For example, given a Bayes network with a set of conditionally [[independent identically distributed]] [[Gaussian distribution|Gaussian-distributed]] nodes with [[conjugate prior]] distributions placed on the mean and variance, the conditional distribution of one node given the others after compounding out both the mean and variance will be a [[Student's t-distribution]] (see the sketch below). Similarly, compounding out the [[gamma distribution|gamma]] prior of a number of [[Poisson distribution|Poisson-distributed]] nodes causes the conditional distribution of one node given the others to assume a [[negative binomial distribution]].

In these cases where compounding produces a well-known distribution, efficient sampling procedures often exist, and using them will often (although not necessarily) be more efficient than not collapsing and instead sampling both prior and child nodes separately. However, when the compound distribution is not well known, it may not be easy to sample from, since it generally will not belong to the [[exponential family]] and typically will not be [[Logarithmically concave function|log-concave]] (which would make it easy to sample using [[adaptive rejection sampling]], since a closed form always exists).

In the case where the child nodes of the collapsed nodes themselves have children, the conditional distribution of one of these child nodes given all other nodes in the graph will have to take into account the distribution of these second-level children. In particular, the resulting conditional distribution will be proportional to a product of the compound distribution as defined above and the conditional distributions of all of the child nodes given their parents (but not given their own children). This follows from the fact that the full conditional distribution is proportional to the joint distribution.
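As an illustration of the Gaussian case mentioned above, the following is a minimal sketch (not drawn from the article itself) of the leave-one-out conditional of one node given the others, assuming a normal-inverse-gamma prior on the shared mean and variance with hypothetical hyperparameters <code>mu0</code>, <code>kappa0</code>, <code>a0</code>, <code>b0</code>. With the prior collapsed out, the conditional of node <code>i</code> is the Student's t posterior-predictive distribution computed from the remaining nodes:

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import t as student_t

def leave_one_out_predictive(x, i, mu0=0.0, kappa0=1.0, a0=1.0, b0=1.0):
    """Conditional distribution of x[i] given the other nodes, with the shared
    Gaussian mean and variance (normal-inverse-gamma prior) collapsed out.
    The result is a Student's t posterior-predictive distribution."""
    x_rest = np.delete(x, i)            # every node except node i
    n = x_rest.size
    xbar = x_rest.mean()
    # Standard normal-inverse-gamma posterior updates given x_rest
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    a_n = a0 + n / 2.0
    b_n = (b0 + 0.5 * ((x_rest - xbar) ** 2).sum()
           + kappa0 * n * (xbar - mu0) ** 2 / (2.0 * kappa_n))
    # Posterior predictive: Student's t with 2 * a_n degrees of freedom
    scale = np.sqrt(b_n * (kappa_n + 1.0) / (a_n * kappa_n))
    return student_t(df=2.0 * a_n, loc=mu_n, scale=scale)

# Evaluate (or resample) node 0 conditioned on the remaining nodes
x = np.array([1.2, 0.7, 1.9, 1.4, 0.3])
conditional = leave_one_out_predictive(x, 0)
print(conditional.pdf(x[0]))    # density of the current value of node 0
print(conditional.rvs())        # a collapsed-Gibbs draw for node 0
</syntaxhighlight>

In a collapsed Gibbs sweep over such a network, each node would in turn be redrawn (or have its density evaluated) from this leave-one-out predictive distribution.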
If the child nodes of the collapsed nodes are [[continuous distribution|continuous]], the resulting conditional distribution will generally not be of a known form, and it may well be difficult to sample from despite the fact that a closed form can be written, for the same reasons described above for non-well-known compound distributions. However, in the particular case that the child nodes are [[discrete distribution|discrete]], sampling is feasible regardless of whether the children of these child nodes are continuous or discrete. The principle involved is described in detail in the article on the [[Dirichlet-multinomial distribution]].
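For the discrete case, the following minimal sketch (assuming a Dirichlet prior with a hypothetical concentration vector <code>alpha</code> and no second-level children) resamples one categorical node given the others after the Dirichlet prior has been integrated out. The conditional probabilities are the Dirichlet-multinomial (Pólya urn) predictive, proportional to the leave-one-out category counts plus <code>alpha</code>:

<syntaxhighlight lang="python">
import numpy as np

def resample_collapsed_categorical(z, i, K, alpha, rng):
    """Collapsed-Gibbs update of one categorical node z[i], with the Dirichlet
    prior over the category probabilities integrated out.  The conditional
    P(z[i] = k | all other z) is proportional to
    (leave-one-out count of category k) + alpha[k], i.e. the
    Dirichlet-multinomial posterior-predictive distribution."""
    counts = np.bincount(np.delete(z, i), minlength=K)
    weights = counts + alpha            # unnormalised predictive probabilities
    z[i] = rng.choice(K, p=weights / weights.sum())

# One Gibbs sweep over all nodes, with K = 3 categories
rng = np.random.default_rng(0)
z = np.array([0, 2, 1, 0, 2, 2, 1])
alpha = np.full(3, 0.5)
for i in range(z.size):
    resample_collapsed_categorical(z, i, K=3, alpha=alpha, rng=rng)
print(z)
</syntaxhighlight>

If the categorical nodes do have discrete children (as in topic models such as latent Dirichlet allocation), each unnormalised weight would additionally be multiplied by the likelihood of node <code>i</code>'s children under category <code>k</code>, following the product form described above.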