==Definitions and concepts==
{{See also|Glossary of graph theory#Directed acyclic graphs}}

Several equivalent definitions of a Bayesian network have been offered. For the following, let ''G'' = (''V'', ''E'') be a [[directed acyclic graph]] (DAG) and let ''X'' = (''X''<sub>''v''</sub>), ''v'' ∈ ''V'' be a set of [[random variable]]s indexed by ''V''.

===Factorization definition===
''X'' is a Bayesian network with respect to ''G'' if its joint [[probability density function]] (with respect to a [[product measure]]) can be written as a product of the individual density functions, conditional on their parent variables:{{sfn|Russell|Norvig|2003|p=496}}

: <math> p (x) = \prod_{v \in V} p \left(x_v \,\big|\, x_{\operatorname{pa}(v)} \right) </math>

where pa(''v'') is the set of parents of ''v'' (i.e. those vertices pointing directly to ''v'' via a single edge).

For any set of random variables, the probability of any member of a [[joint distribution]] can be calculated from conditional probabilities using the [[chain rule (probability)|chain rule]] (given a [[topological ordering]] of ''X'') as follows:{{sfn|Russell|Norvig|2003|p=496}}

: <math>\operatorname P(X_1=x_1, \ldots, X_n=x_n) = \prod_{v=1}^n \operatorname P \left(X_v=x_v \mid X_{v+1}=x_{v+1}, \ldots, X_n=x_n \right)</math>

Using the definition above, this can be written as:

: <math>\operatorname P(X_1=x_1, \ldots, X_n=x_n) = \prod_{v=1}^n \operatorname P (X_v=x_v \mid X_j=x_j \text{ for each } X_j\, \text{ that is a parent of } X_v\, )</math>

The difference between the two expressions is the [[conditional independence]] of the variables from any of their non-descendants, given the values of their parent variables.

===Local Markov property===
''X'' is a Bayesian network with respect to ''G'' if it satisfies the ''local Markov property'': each variable is [[Conditional independence|conditionally independent]] of its non-descendants given its parent variables:{{sfn|Russell|Norvig|2003|p=499}}

:<math> X_v \perp\!\!\!\perp X_{V \,\smallsetminus\, \operatorname{de}(v)} \mid X_{\operatorname{pa}(v)} \quad\text{for all }v \in V</math>

where de(''v'') is the set of descendants and ''V'' \ de(''v'') is the set of non-descendants of ''v''. This can be expressed in terms similar to the first definition, as

:<math>
\begin{align}
& \operatorname P(X_v=x_v \mid X_i=x_i \text{ for each } X_i \text{ that is not a descendant of } X_v\, ) \\[6pt]
= {} & \operatorname P(X_v=x_v \mid X_j=x_j \text{ for each } X_j \text{ that is a parent of } X_v\, )
\end{align}
</math>

The set of parents is a subset of the set of non-descendants because the graph is [[Cycle (graph theory)|acyclic]].
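As a concrete illustration of the factorization definition, the following minimal sketch builds the chain ''a'' → ''b'' → ''c'' and evaluates the joint distribution as the product of each node's conditional given its parents; the probability tables are invented for the example, and the final lines check the local Markov property numerically (''c'' is independent of its non-descendant ''a'' given its parent ''b'').

<syntaxhighlight lang="python">
# Minimal sketch of the factorization definition on the chain a -> b -> c.
# All probability tables are invented for illustration.
import itertools

p_a = {True: 0.3, False: 0.7}                    # p(a): a has no parents
p_b_given_a = {True: {True: 0.8, False: 0.2},    # p(b | a): pa(b) = {a}
               False: {True: 0.1, False: 0.9}}
p_c_given_b = {True: {True: 0.5, False: 0.5},    # p(c | b): pa(c) = {b}
               False: {True: 0.2, False: 0.8}}

def joint(a, b, c):
    """p(a, b, c) as the product of each node's conditional given its parents."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

# The factored product defines a valid distribution: it sums to 1 over all states.
states = itertools.product([True, False], repeat=3)
print(sum(joint(a, b, c) for a, b, c in states))  # 1.0

# Local Markov property, numerically: p(c | a, b) reduces to p(c | b),
# since a is a non-descendant of c and b is c's only parent.
def p_c_given(a, b):
    return joint(a, b, True) / (joint(a, b, True) + joint(a, b, False))

print(p_c_given(True, True), p_c_given(False, True))  # both 0.5 = p(c=True | b=True)
</syntaxhighlight>

Because each factor conditions only on a node's parents, the three small tables above fully determine all eight joint probabilities, whereas the raw chain rule would require a table for ''c'' conditioned on both ''a'' and ''b''.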
===Marginal independence structure===
In general, learning a Bayesian network from data is known to be [[NP-hardness|NP-hard]].<ref>{{cite journal |last1=Chickering |first1=David M. |last2=Heckerman |first2=David |last3=Meek |first3=Christopher |title=Large-sample learning of Bayesian networks is NP-hard |journal=Journal of Machine Learning Research |date=2004 |volume=5 |pages=1287–1330 |url=https://www.jmlr.org/papers/volume5/chickering04a/chickering04a.pdf}}</ref> This is due in part to the [[combinatorial explosion]] of [[Directed acyclic graph#Combinatorial enumeration|enumerating DAGs]] as the number of variables increases. Nevertheless, insights about an underlying Bayesian network can be learned from data in polynomial time by focusing on its marginal independence structure:<ref>{{cite journal |last1=Deligeorgaki |first1=Danai |last2=Markham |first2=Alex |last3=Misra |first3=Pratik |last4=Solus |first4=Liam |title=Combinatorial and algebraic perspectives on the marginal independence structure of Bayesian networks |journal=Algebraic Statistics |date=2023 |volume=14 |issue=2 |pages=233–286 |doi=10.2140/astat.2023.14.233 |arxiv=2210.00822}}</ref> while the conditional independence statements of a distribution modeled by a Bayesian network are encoded by a DAG (according to the factorization and Markov properties above), its marginal independence statements—the conditional independence statements in which the conditioning set is empty—are encoded by a [[Graph (discrete mathematics)|simple undirected graph]] with special properties, such as equal [[Intersection number (graph theory)|intersection]] and [[Independent set (graph theory)|independence numbers]].

===Developing Bayesian networks===
Developing a Bayesian network often begins with creating a DAG ''G'' such that ''X'' satisfies the local Markov property with respect to ''G''. Sometimes this is a [[Causal graph|causal]] DAG. The conditional probability distributions of each variable given its parents in ''G'' are assessed. In many cases, in particular in the case where the variables are discrete, if the joint distribution of ''X'' is the product of these conditional distributions, then ''X'' is a Bayesian network with respect to ''G''.<ref>{{cite book |first=Richard E. |last=Neapolitan |name-list-style=vanc |title=Learning Bayesian networks |url={{google books |plainurl=y |id=OlMZAQAAIAAJ}} |year=2004 |publisher=Prentice Hall |isbn=978-0-13-012534-7}}</ref>

===Markov blanket===
The [[Markov blanket]] of a node is the set of nodes consisting of its parents, its children, and any other parents of its children. The Markov blanket renders the node independent of the rest of the network; the joint distribution of the variables in the Markov blanket of a node is sufficient knowledge for calculating the distribution of the node. ''X'' is a Bayesian network with respect to ''G'' if every node is conditionally independent of all other nodes in the network, given its Markov blanket.{{sfn|Russell|Norvig|2003|p=499}}

===={{anchor|d-separation}}''d''-separation====
This definition can be made more general by defining the "d"-separation of two nodes, where d stands for directional.<ref name=pearl2000/> We first define the "d"-separation of a trail and then define the "d"-separation of two nodes in terms of that.

Let ''P'' be a trail from node ''u'' to ''v''. A trail is a loop-free, undirected (i.e. all edge directions are ignored) path between two nodes. Then ''P'' is said to be ''d''-separated by a set of nodes ''Z'' if any of the following conditions holds:
*''P'' contains a directed chain (though it need not consist entirely of one), <math> u \cdots \leftarrow m \leftarrow \cdots v</math> or <math> u \cdots \rightarrow m \rightarrow \cdots v</math>, such that the middle node ''m'' is in ''Z'',
*''P'' contains a fork, <math> u \cdots \leftarrow m \rightarrow \cdots v</math>, such that the middle node ''m'' is in ''Z'', or
*''P'' contains an inverted fork (or collider), <math> u \cdots \rightarrow m \leftarrow \cdots v</math>, such that the middle node ''m'' is not in ''Z'' and no descendant of ''m'' is in ''Z''.

The nodes ''u'' and ''v'' are ''d''-separated by ''Z'' if all trails between them are ''d''-separated. If ''u'' and ''v'' are not ''d''-separated, they are ''d''-connected. ''X'' is a Bayesian network with respect to ''G'' if, for any two nodes ''u'', ''v'':

: <math>X_u \perp\!\!\!\perp X_v \mid X_Z</math>

where ''Z'' is a set which ''d''-separates ''u'' and ''v''. (The [[Markov blanket]] is the minimal set of nodes which ''d''-separates node ''v'' from all other nodes.)
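To make the three blocking conditions concrete, the following sketch enumerates every trail between two nodes of a small, hypothetical DAG and applies the chain, fork, and collider rules; exhaustive trail enumeration is exponential in general and is used here only because the example graph is tiny.

<syntaxhighlight lang="python">
# Toy d-separation check: enumerate all loop-free trails between two nodes
# and apply the chain/fork/collider blocking rules from the text.
# Hypothetical DAG: a -> b, d -> b, b -> c (b is a collider for a and d).
edges = {("a", "b"), ("d", "b"), ("b", "c")}

def parents(n):  return {u for (u, v) in edges if v == n}
def children(n): return {v for (u, v) in edges if u == n}

def descendants(n):
    found, stack = set(), [n]
    while stack:
        for c in children(stack.pop()):
            if c not in found:
                found.add(c)
                stack.append(c)
    return found

def trails(u, v, path=()):
    """Yield every loop-free undirected path from u to v."""
    path = path + (u,)
    if u == v:
        yield path
        return
    for n in parents(u) | children(u):
        if n not in path:
            yield from trails(n, v, path)

def blocked(trail, Z):
    """True if some intermediate node of the trail blocks it given Z."""
    for prev, m, nxt in zip(trail, trail[1:], trail[2:]):
        collider = (prev, m) in edges and (nxt, m) in edges  # prev -> m <- nxt
        if collider:
            if m not in Z and not (descendants(m) & Z):
                return True   # a collider blocks unless m or a descendant is in Z
        elif m in Z:
            return True       # a chain or fork blocks when m is in Z
    return False

def d_separated(u, v, Z):
    return all(blocked(t, Z) for t in trails(u, v))

print(d_separated("a", "c", {"b"}))   # True: the chain a -> b -> c is blocked by b
print(d_separated("a", "d", set()))   # True: the collider at b blocks when Z is empty
print(d_separated("a", "d", {"c"}))   # False: conditioning on descendant c opens the collider
</syntaxhighlight>

In this graph the empty set ''d''-separates ''a'' and ''d'', since their only trail meets head-to-head at the collider ''b''; conditioning on ''b'', or on its descendant ''c'', ''d''-connects them again.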
===Causal networks===
Although Bayesian networks are often used to represent [[causality|causal]] relationships, this need not be the case: a directed edge from ''u'' to ''v'' does not require that ''X<sub>v</sub>'' be causally dependent on ''X<sub>u</sub>''. This is demonstrated by the fact that Bayesian networks on the graphs

:<math> a \rightarrow b \rightarrow c \qquad \text{and} \qquad a \leftarrow b \leftarrow c </math>

are equivalent: that is, they impose exactly the same conditional independence requirements.

A causal network is a Bayesian network with the requirement that the relationships be causal. The additional semantics of causal networks specify that if a node ''X'' is actively caused to be in a given state ''x'' (an action written as do(''X'' = ''x'')), then the probability density function changes to that of the network obtained by cutting the links from the parents of ''X'' to ''X'', and setting ''X'' to the caused value ''x''.<ref name=pearl2000/> Using these semantics, the impact of external interventions can be predicted from data obtained prior to intervention.
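The effect of the do-operation can be sketched numerically. In a hypothetical network with edges ''a'' → ''b'', ''a'' → ''c'' and ''b'' → ''c'' (so that ''a'' confounds ''b'' and ''c''), computing in the mutilated network, with the link ''a'' → ''b'' cut, gives a different answer from ordinary conditioning on ''b''; all probability tables are invented for the illustration.

<syntaxhighlight lang="python">
# Sketch of do(b) semantics on the hypothetical DAG a -> b, a -> c, b -> c.
# do(b = value) cuts the edge a -> b and fixes b, so a keeps its prior p(a);
# observing b instead reweights a by Bayes' rule. Tables are invented.
p_a = {True: 0.4, False: 0.6}
p_b_given_a = {True: {True: 0.9, False: 0.1},
               False: {True: 0.2, False: 0.8}}
p_c_given_ab = {(True, True): 0.8, (True, False): 0.6,    # p(c=True | a, b)
                (False, True): 0.5, (False, False): 0.1}

def p_c_do_b(b):
    """p(c=True | do(b)): intervene on b in the mutilated network."""
    return sum(p_a[a] * p_c_given_ab[(a, b)] for a in (True, False))

def p_c_obs_b(b):
    """p(c=True | b): ordinary conditioning, which also updates belief in a."""
    norm = sum(p_a[a] * p_b_given_a[a][b] for a in (True, False))
    return sum(p_a[a] * p_b_given_a[a][b] * p_c_given_ab[(a, b)]
               for a in (True, False)) / norm

print(p_c_do_b(True))   # 0.62: a keeps its prior weight under the intervention
print(p_c_obs_b(True))  # 0.725: observing b is also evidence about the confounder a
</syntaxhighlight>

The two quantities differ precisely because the intervention removes the link from pa(''b'') to ''b'', so observing the intervened value carries no information about ''a''.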