===Structure learning===
In the simplest case, a Bayesian network is specified by an expert and is then used to perform inference. In other applications, the task of defining the network is too complex for humans. In this case, the network structure and the parameters of the local distributions must be learned from data.

Automatically learning the graph structure of a Bayesian network (BN) is a challenge pursued within [[machine learning]]. The basic idea goes back to a recovery algorithm developed by Rebane and [[Judea Pearl|Pearl]]<ref>{{cite book | vauthors = Rebane G, Pearl J | chapter = The Recovery of Causal Poly-trees from Statistical Data | title = Proceedings, 3rd Workshop on Uncertainty in AI | location = Seattle, WA | pages = 222–228 | year = 1987 | arxiv = 1304.2736}}</ref> and rests on the distinction between the three possible patterns allowed in a 3-node DAG:

{| class="wikitable"
|+Junction patterns
!Pattern
!Model
|-
|Chain
|<math>X \rightarrow Y \rightarrow Z</math>
|-
|Fork
|<math>X \leftarrow Y \rightarrow Z</math>
|-
|Collider
|<math>X \rightarrow Y \leftarrow Z</math>
|}

The first two represent the same dependencies (<math>X</math> and <math>Z</math> are independent given <math>Y</math>) and are therefore indistinguishable. The collider, however, can be uniquely identified, since <math>X</math> and <math>Z</math> are marginally independent and all other pairs are dependent. Thus, while the ''skeletons'' (the graphs stripped of arrows) of these three triplets are identical, the directionality of the arrows is partially identifiable. The same distinction applies when <math>X</math> and <math>Z</math> have common parents, except that one must first condition on those parents. Algorithms have been developed to systematically determine the skeleton of the underlying graph and then orient all arrows whose directionality is dictated by the observed conditional independences.<ref name="pearl2000">{{Cite book | first = Judea | last = Pearl | author-link = Judea Pearl | title = Causality: Models, Reasoning, and Inference |url={{google books |plainurl=y |id=LLkhAwAAQBAJ}}| publisher = [[Cambridge University Press]] | year = 2000 | isbn = 978-0-521-77362-1 | oclc = 42291253 }}</ref><ref>{{cite journal | vauthors = Spirtes P, Glymour C |title=An algorithm for fast recovery of sparse causal graphs |journal=Social Science Computer Review |volume=9 |issue=1 |pages=62–72 |year=1991 |doi=10.1177/089443939100900106 |s2cid=38398322 |url=http://repository.cmu.edu/cgi/viewcontent.cgi?article=1316&context=philosophy |format=PDF|citeseerx=10.1.1.650.2922 }}</ref><ref>{{cite book |first1=Peter |last1=Spirtes |first2=Clark N. |last2=Glymour |first3=Richard |last3=Scheines | name-list-style = vanc |title=Causation, Prediction, and Search |url={{google books |plainurl=y |id=VkawQgAACAAJ}} |year=1993 |publisher=Springer-Verlag |isbn=978-0-387-97979-3 |edition=1st}}</ref><ref>{{cite conference |title=Equivalence and synthesis of causal models |url={{google books |plainurl=y |id=ikuuHAAACAAJ}}|first1=Thomas |last1=Verma |first2=Judea |last2=Pearl |year=1991 | veditors = Bonissone P, Henrion M, Kanal LN, Lemmer JF | book-title = UAI '90 Proceedings of the Sixth Annual Conference on Uncertainty in Artificial Intelligence |publisher=Elsevier |pages=255–270 |isbn=0-444-89264-8 }}</ref>
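This pattern-based reasoning can be made concrete with a minimal sketch, which is not any of the cited algorithms: assuming jointly Gaussian variables, (conditional) independence can be checked with (partial) correlations, and the collider <math>X \rightarrow Y \leftarrow Z</math> is recognised because <math>X</math> and <math>Z</math> are marginally independent but become dependent once <math>Y</math> is conditioned on. The function names, threshold, and toy data below are invented for the illustration.

<syntaxhighlight lang="python">
import numpy as np

def corr(a, b):
    """Pearson correlation between two samples."""
    return np.corrcoef(a, b)[0, 1]

def partial_corr(a, b, c):
    """Correlation between a and b after regressing out c (Gaussian assumption)."""
    ra = a - np.polyval(np.polyfit(c, a, 1), c)  # residual of a given c
    rb = b - np.polyval(np.polyfit(c, b, 1), c)  # residual of b given c
    return corr(ra, rb)

def classify_triplet(x, z, y, eps=0.05):
    """Distinguish a collider X -> Y <- Z from a chain/fork on the skeleton X - Y - Z.

    x and z are the non-adjacent end nodes, y the middle node. In a collider,
    X and Z are marginally independent but dependent given Y; in a chain or
    fork it is the other way around.
    """
    marginal = abs(corr(x, z))
    conditional = abs(partial_corr(x, z, y))
    if marginal < eps and conditional >= eps:
        return "collider: X -> Y <- Z"
    if marginal >= eps and conditional < eps:
        return "chain or fork (indistinguishable from data alone)"
    return "undecided at this threshold"

# Toy example: data generated from a collider X -> Y <- Z.
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
z = rng.normal(size=5000)
y = x + z + 0.5 * rng.normal(size=5000)
print(classify_triplet(x, z, y))
</syntaxhighlight>

Constraint-based algorithms such as the PC algorithm of Spirtes and Glymour repeat tests of this kind over all pairs of variables and candidate separating sets to recover the skeleton and then orient the colliders.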
An alternative method of structural learning uses optimization-based search. It requires a [[scoring function]] and a search strategy. A common scoring function is the [[posterior probability]] of the structure given the training data, like the [[Bayesian information criterion|BIC]] or the BDeu. The time requirement of an [[exhaustive search]] returning a structure that maximizes the score is [[Tetration|superexponential]] in the number of variables. A local search strategy makes incremental changes aimed at improving the score of the structure. A global search algorithm like [[Markov chain Monte Carlo]] can avoid getting trapped in [[maxima and minima|local minima]]. Friedman et al.<ref>{{cite journal |last1=Friedman |first1=Nir |last2=Geiger |first2=Dan |last3=Goldszmidt |first3=Moises | name-list-style = vanc |date=November 1997 |title=Bayesian Network Classifiers |journal=Machine Learning |volume=29 |issue=2–3 |pages=131–163 |doi=10.1023/A:1007465528199|doi-access=free }}</ref><ref>{{cite journal | vauthors = Friedman N, Linial M, Nachman I, Pe'er D | title = Using Bayesian networks to analyze expression data | journal = Journal of Computational Biology | volume = 7 | issue = 3–4 | pages = 601–20 | date = August 2000 | pmid = 11108481 | doi = 10.1089/106652700750050961 | citeseerx = 10.1.1.191.139 }}</ref> discuss using [[mutual information]] between variables and finding a structure that maximizes this. They do this by restricting the parent candidate set to ''k'' nodes and exhaustively searching therein.

A particularly fast method for exact BN learning is to cast the problem as an optimization problem and solve it using [[integer programming]]. Acyclicity constraints are added to the integer program (IP) during solving in the form of [[Cutting-plane method|cutting planes]].<ref>{{Cite journal|last=Cussens|first=James | name-list-style = vanc |year=2011|title=Bayesian network learning with cutting planes|url=https://dslpitt.org/papers/11/p153-cussens.pdf|archive-url=https://web.archive.org/web/20220327163338/https://dslpitt.org/papers/11/p153-cussens.pdf|url-status=usurped|archive-date=March 27, 2022|journal=Proceedings of the 27th Annual Conference on Uncertainty in Artificial Intelligence|pages=153–160|bibcode=2012arXiv1202.3713C |arxiv=1202.3713 }}</ref> Such a method can handle problems with up to 100 variables.

To deal with problems with thousands of variables, a different approach is necessary. One is to first sample an ordering, and then find the optimal BN structure with respect to that ordering. This implies working on the search space of the possible orderings, which is convenient as it is smaller than the space of network structures. Multiple orderings are then sampled and evaluated. This method has been shown to be the best available in the literature when the number of variables is huge.<ref>{{cite book | vauthors = Scanagatta M, de Campos CP, Corani G, Zaffalon M | chapter-url = https://papers.nips.cc/paper/5803-learning-bayesian-networks-with-thousands-of-variables | chapter = Learning Bayesian Networks with Thousands of Variables | title = NIPS-15: Advances in Neural Information Processing Systems | volume = 28 | pages = 1855–1863 | year = 2015 | publisher = Curran Associates }}</ref>
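The ordering-based idea can be illustrated with a short sketch, which is not the cited algorithm of Scanagatta et al.: once an ordering is fixed, each node's parent set can be chosen independently of the others by exhaustively scoring small subsets of its predecessors, here with a simple BIC score for discrete data. The function names, the parent-size limit, and the toy data are invented for the illustration.

<syntaxhighlight lang="python">
import numpy as np
from itertools import combinations

def bic_family(data, child, parents):
    """BIC score of one node given a candidate parent set.

    data: (n_samples, n_vars) array of non-negative integer codes.
    The score is the maximum log-likelihood of the child given its parents
    minus the usual BIC penalty for the number of free parameters.
    """
    n = data.shape[0]
    child_vals = data[:, child]
    r = int(child_vals.max()) + 1  # child cardinality
    keys = [tuple(row) for row in data[:, list(parents)]] if parents else [()] * n
    groups = {}                    # parent configuration -> observed child values
    for key, c in zip(keys, child_vals):
        groups.setdefault(key, []).append(int(c))
    loglik = 0.0
    for vals in groups.values():
        counts = np.bincount(vals, minlength=r)
        nz = counts[counts > 0]
        loglik += float(np.sum(nz * np.log(nz / len(vals))))
    q = int(np.prod([data[:, p].max() + 1 for p in parents])) if parents else 1
    return loglik - 0.5 * np.log(n) * (r - 1) * q

def best_structure_for_ordering(data, order, max_parents=2):
    """For a fixed ordering, pick each node's best-scoring parent set among
    its predecessors, exhausting subsets of size <= max_parents."""
    structure = {}
    for i, node in enumerate(order):
        best, best_score = (), bic_family(data, node, ())
        for k in range(1, max_parents + 1):
            for subset in combinations(order[:i], k):
                score = bic_family(data, node, subset)
                if score > best_score:
                    best, best_score = subset, score
        structure[node] = best
    return structure

# Toy data from the chain X0 -> X1 -> X2 (binary, with 10% noise).
rng = np.random.default_rng(1)
x0 = rng.integers(0, 2, 2000)
x1 = (x0 ^ (rng.random(2000) < 0.1)).astype(int)
x2 = (x1 ^ (rng.random(2000) < 0.1)).astype(int)
data = np.column_stack([x0, x1, x2])
print(best_structure_for_ordering(data, order=[0, 1, 2]))  # expected: {0: (), 1: (0,), 2: (1,)}
</syntaxhighlight>

Because the parent sets of different nodes can be optimized separately under a fixed ordering, the expensive part of the search reduces to evaluating candidate orderings.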
Another method consists of focusing on the sub-class of decomposable models, for which the [[Maximum likelihood estimate|MLE]] has a closed form. It is then possible to discover a consistent structure for hundreds of variables.<ref name="Petitjean">{{cite conference |url=http://www.tiny-clues.eu/Research/Petitjean2013-ICDM.pdf |title= Scaling log-linear analysis to high-dimensional data | vauthors = Petitjean F, Webb GI, Nicholson AE |year=2013 |publisher=IEEE |conference=International Conference on Data Mining |location=Dallas, TX, USA }}</ref>

Learning Bayesian networks with bounded treewidth is necessary to allow exact, tractable inference, since the worst-case inference complexity is exponential in the treewidth ''k'' (under the exponential time hypothesis). Yet, as a global property of the graph, it considerably increases the difficulty of the learning process. In this context it is possible to use [[K-tree]]s for effective learning.<ref>M. Scanagatta, G. Corani, C. P. de Campos, and M. Zaffalon. [http://papers.nips.cc/paper/6232-learning-treewidth-bounded-bayesian-networks-with-thousands-of-variables Learning Treewidth-Bounded Bayesian Networks with Thousands of Variables.] In NIPS-16: Advances in Neural Information Processing Systems 29, 2016.</ref>
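The connection between treewidth and inference cost can be illustrated with a small sketch: exact inference by variable elimination creates, for each eliminated variable, a factor over its current neighbours in the moralized graph, so its cost is exponential in the induced width of the elimination order, and the treewidth is the minimum induced width over all orders. The function name and the example graph below are invented for the illustration.

<syntaxhighlight lang="python">
import itertools

def induced_width(adj, order):
    """Induced width of an elimination order on an undirected graph.

    adj: dict mapping each vertex to a set of neighbours.
    Eliminating a vertex connects all its remaining neighbours (fill-in edges);
    the induced width is the largest neighbourhood encountered, and exact
    inference builds a factor over each such neighbourhood, so its size is
    exponential in this width.
    """
    g = {v: set(ns) for v, ns in adj.items()}
    width = 0
    for v in order:
        nbrs = g[v]
        width = max(width, len(nbrs))
        for a, b in itertools.combinations(nbrs, 2):  # add fill-in edges
            g[a].add(b)
            g[b].add(a)
        for n in nbrs:                                # remove v from the graph
            g[n].discard(v)
        del g[v]
    return width

# Moralized graph of a small network: a path A-B-C-D plus an extra edge A-C.
adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
print(induced_width(adj, order=["D", "B", "A", "C"]))  # 2 for this order
</syntaxhighlight>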