
== Definitions ==

=== Conditioning on an event ===
If {{mvar|A}} is an event in <math>\mathcal{F}</math> with nonzero probability,
and {{mvar|X}} is a [[discrete random variable]], the conditional expectation
of {{mvar|X}} given {{mvar|A}} is
:<math>
\begin{aligned}
\operatorname{E} (X \mid A) &= \sum_x x P(X = x \mid A) \\
& =\sum_x x \frac{P(\{X = x\} \cap A)}{P(A)}
\end{aligned}
</math>
where the sum is taken over all possible outcomes of {{mvar|X}}.

If <math>P(A) = 0</math>, the conditional expectation is undefined due to the division by zero.
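
As a minimal illustration, the formula can be evaluated exactly for a small example. The fair six-sided die and the event "the outcome is even" are assumptions chosen for this sketch, and the helper name <code>cond_exp_given_event</code> is hypothetical.

```python
from fractions import Fraction

# Hypothetical example: X is the outcome of a fair six-sided die,
# A is the event "the outcome is even".
pmf = {x: Fraction(1, 6) for x in range(1, 7)}
A = {2, 4, 6}

def cond_exp_given_event(pmf, event):
    """E(X | A) = sum_x x * P({X = x} and A) / P(A) for a discrete X."""
    p_event = sum(p for x, p in pmf.items() if x in event)
    if p_event == 0:
        raise ValueError("E(X | A) is undefined when P(A) = 0")
    return sum(x * p for x, p in pmf.items() if x in event) / p_event

e_given_even = cond_exp_given_event(pmf, A)
print(e_given_even)  # 4
```

Exact rational arithmetic is used so that the result can be compared with the hand computation <math>(2 + 4 + 6)/3 = 4</math> without rounding error.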

=== Discrete random variables ===
If {{mvar|X}} and {{mvar|Y}} are [[discrete random variable]]s,
the conditional expectation of {{mvar|X}} given {{mvar|Y}} is
:<math>
\begin{aligned}
\operatorname{E} (X \mid Y=y) &=  \sum_x x P(X = x \mid Y = y) \\
&= \sum_x x \frac{P(X = x, Y = y)}{P(Y=y)}
\end{aligned}
</math>
where <math>P(X = x, Y = y)</math> is the [[joint probability mass function]] of {{mvar|X}} and {{mvar|Y}}. The sum is taken over all possible outcomes of {{mvar|X}}. 

As above, the expression is undefined if <math>P(Y=y) = 0</math>.

Conditioning on a discrete random variable is the same as conditioning on the corresponding event:
:<math>\operatorname{E} (X \mid Y=y) = \operatorname{E} (X \mid A)</math> 
where {{mvar|A}} is the set <math>\{ Y = y \}</math>.
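
The same computation can be sketched for a joint probability mass function. The table <code>joint</code> and the function name <code>cond_exp</code> below are illustrative assumptions, not part of the definition.

```python
from fractions import Fraction

# Hypothetical joint pmf P(X = x, Y = y), stored as {(x, y): probability}.
joint = {
    (0, 0): Fraction(1, 4),
    (1, 0): Fraction(1, 4),
    (1, 1): Fraction(1, 6),
    (2, 1): Fraction(1, 3),
}

def cond_exp(joint, y):
    """E(X | Y = y) = sum_x x * P(X = x, Y = y) / P(Y = y)."""
    p_y = sum(p for (_, yy), p in joint.items() if yy == y)
    if p_y == 0:
        raise ValueError("E(X | Y = y) is undefined when P(Y = y) = 0")
    return sum(x * p for (x, yy), p in joint.items() if yy == y) / p_y

e0 = cond_exp(joint, 0)  # 1/2
e1 = cond_exp(joint, 1)  # 5/3
```

Here <math>P(Y=0) = P(Y=1) = \tfrac{1}{2}</math>, so averaging the two conditional expectations with these weights gives <math>\tfrac{13}{12}</math>, which equals the unconditional expectation of {{mvar|X}} in this example.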

=== Continuous random variables ===
Let <math>X</math> and <math>Y</math> be [[continuous random variable]]s with joint density
<math>f_{X,Y}(x,y),</math>
<math>Y</math>'s density
<math>f_{Y}(y),</math>
and conditional density <math>\textstyle f_{X\mid Y}(x\mid y) = \frac{ f_{X,Y}(x,y) }{f_{Y}(y)}</math> of <math>X</math> given the event <math>Y=y.</math>
The conditional expectation of <math>X</math> given <math>Y=y</math> is
:<math>
\begin{aligned}
\operatorname{E} (X \mid Y=y) &=  \int_{-\infty}^\infty x f_{X\mid Y}(x\mid y) \, \mathrm{d}x \\
&= \frac{1}{f_Y(y)}\int_{-\infty}^\infty x f_{X,Y}(x,y) \, \mathrm{d}x.
\end{aligned}
</math>
When the denominator is zero, the expression is undefined.
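
As a worked illustration (the density is an assumption chosen for this example), let <math>f_{X,Y}(x,y) = x + y</math> for <math>0 \le x, y \le 1</math> and <math>0</math> elsewhere. Then, for <math>0 \le y \le 1</math>,
:<math>
\begin{aligned}
f_Y(y) &= \int_0^1 (x + y) \, \mathrm{d}x = \tfrac{1}{2} + y, \\
\operatorname{E} (X \mid Y=y) &= \frac{1}{f_Y(y)}\int_0^1 x (x + y) \, \mathrm{d}x = \frac{\tfrac{1}{3} + \tfrac{y}{2}}{\tfrac{1}{2} + y} = \frac{2 + 3y}{3 + 6y}.
\end{aligned}
</math>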

Conditioning on a continuous random variable is not the same as conditioning on the event <math>\{ Y = y \}</math>, as it was in the discrete case, because this event has probability zero. For a discussion, see [[Conditional probability#Conditioning on an event of probability zero|Conditioning on an event of probability zero]]. Ignoring this distinction can lead to contradictory conclusions, as illustrated by the [[Borel–Kolmogorov paradox]].

=== L<sup>2</sup> random variables ===
All random variables in this section are assumed to be in <math>L^2</math>, that is, [[square integrable]].
In its full generality, conditional expectation is developed without this assumption; see below under [[Conditional expectation#Conditional expectation with respect to a sub-''σ''-algebra|Conditional expectation with respect to a sub-''σ''-algebra]]. The <math>L^2</math> theory is, however, considered more intuitive<ref>{{cite web |title=probability - Intuition behind Conditional Expectation |url=https://math.stackexchange.com/a/23613/357269 |website=Mathematics Stack Exchange}}</ref> and admits [[Conditional expectation#Connections to regression|important generalizations]].
In the context of <math>L^2</math> random variables, conditional expectation is also called [[Regression analysis|regression]].
 
In what follows let <math>(\Omega, \mathcal{F}, P)</math> be a probability space, and <math>X: \Omega \to \mathbb{R}</math> in 
<math>L^2</math> with mean <math>\mu_X</math> and [[variance]] <math>\sigma_X^2</math>.
The expectation <math>\mu_X</math> minimizes the [[mean squared error]]:
:<math> \min_{x \in \mathbb{R}} \operatorname{E}\left((X - x)^2\right) = \operatorname{E}\left((X - \mu_X)^2\right)
= \sigma_X^2. </math>

The conditional expectation of {{mvar|X}} is defined analogously, except instead of a single number 
<math>\mu_X</math>, the result will be a function <math>e_X(y)</math>. Let <math>Y: \Omega \to \mathbb{R}^n</math> be a [[random vector]]. The conditional expectation <math>e_X: \mathbb{R}^n \to \mathbb{R}</math> is a measurable function such that
:<math> \min_{g \text{ measurable }} \operatorname{E}\left((X - g(Y))^2\right) = \operatorname{E}\left((X - e_X(Y))^2\right).
</math>

Note that unlike <math>\mu_X</math>, the conditional expectation <math>e_X</math> is not generally unique: there may be multiple minimizers of the mean squared error.
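
The minimizing property can be checked exactly on a finite probability space. The sketch below uses a hypothetical joint distribution and compares the mean squared error of the conditional mean with that of competing functions on a grid; all names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Hypothetical joint pmf of (X, Y) on a finite set, {(x, y): probability}.
joint = {
    (0, 0): Fraction(1, 4),
    (2, 0): Fraction(1, 4),
    (1, 1): Fraction(1, 6),
    (4, 1): Fraction(1, 3),
}

def mse(g):
    """E((X - g(Y))^2) for a function g given as a dict {y: g(y)}."""
    return sum(p * (x - g[y]) ** 2 for (x, y), p in joint.items())

def conditional_mean():
    """e_X(y) = E(X | Y = y) for every y with P(Y = y) > 0."""
    e = {}
    for y in {y for (_, y) in joint}:
        p_y = sum(p for (_, yy), p in joint.items() if yy == y)
        e[y] = sum(x * p for (x, yy), p in joint.items() if yy == y) / p_y
    return e

e_X = conditional_mean()   # {0: 1, 1: 3}
best = mse(e_X)            # 3/2

# Compare against a grid of competing functions g: {0, 1} -> R.
grid = [Fraction(k, 4) for k in range(0, 21)]  # 0, 1/4, ..., 5
gaps = [mse({0: a, 1: b}) - best for a, b in product(grid, grid)]
smallest_gap = min(gaps)   # 0, attained at g = e_X
```

No function on the grid achieves a smaller error than <code>e_X</code>. A grid search is of course not a proof; it only illustrates the statement for this one distribution.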

==== Uniqueness ====

'''Example 1''': Consider the case where {{mvar|Y}} is the constant random variable that is always 1.
Then the mean squared error is minimized by any function of the form
:<math>
e_X(y) = \begin{cases}
\mu_X & \text{if } y = 1, \\
\text{any number} & \text{otherwise.}
\end{cases}
</math>

'''Example 2''': Consider the case where {{mvar|Y}} is the 2-dimensional random vector <math>(X, 2X)</math>. Then clearly
:<math>\operatorname{E}(X \mid Y) = X</math>
but in terms of functions it can be expressed as <math>e_X(y_1, y_2) = 3y_1-y_2</math> or <math>e'_X(y_1, y_2) = y_2 - y_1</math> or infinitely many other ways. In the context of [[linear regression]], this lack of uniqueness is called [[multicollinearity]].

Conditional expectation is unique up to a set of measure zero in <math>\mathbb{R}^n</math>. The measure used is the [[pushforward measure]] induced by {{mvar|Y}}.

In the first example, the pushforward measure is a [[Dirac distribution]] at 1. In the second it is concentrated on the "diagonal" <math>\{ y : y_2 = 2 y_1 \}</math>, so that any set not intersecting it has measure 0.
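
For Example 2, the agreement of the two versions on the support of {{mvar|Y}} can be checked directly. The sample values of {{mvar|X}} below are arbitrary assumptions made for this sketch.

```python
# Two candidate versions of e_X from Example 2, where Y = (X, 2X).
def e1(y1, y2):
    return 3 * y1 - y2

def e2(y1, y2):
    return y2 - y1

# On the support of Y (the line y2 = 2*y1) both versions return x.
xs = [-2, 0, 1, 5]  # hypothetical sample values of X
agree_on_support = all(e1(x, 2 * x) == e2(x, 2 * x) == x for x in xs)

# Off the support they may disagree, which does not affect E(X | Y).
differ_off_support = e1(1, 0) != e2(1, 0)  # 3 versus -1
```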

==== Existence ====

The existence of a minimizer for <math> \min_g \operatorname{E}\left((X - g(Y))^2\right)</math> is non-trivial. It can be shown that
:<math> M := \{ g(Y) : g \text{ is measurable and }\operatorname{E}(g(Y)^2) < \infty \} = L^2(\Omega, \sigma(Y)) </math>
is a closed subspace of the Hilbert space <math>L^2(\Omega)</math>.<ref>{{cite book |last1=Brockwell |first1=Peter J. |title=Time series : theory and methods |date=1991 |publisher=Springer-Verlag |location=New York |isbn=978-1-4419-0320-4 |edition=2nd}}</ref>
By the [[Hilbert projection theorem]], the necessary and sufficient condition for
<math>e_X</math> to be a minimizer is that for all <math>f(Y)</math> in {{mvar|M}} we have
:<math> \langle X - e_X(Y), f(Y) \rangle = 0. </math>
In words, this equation says that the [[residual (statistics)|residual]] <math>X - e_X(Y)</math> is orthogonal to the space {{mvar|M}} of all functions of {{mvar|Y}}.
This orthogonality condition, applied to the [[indicator function]]s <math>f(Y) = 1_{Y \in H}</math>,
is used below to extend conditional expectation to the case that {{mvar|X}} and {{mvar|Y}} are not necessarily in <math>L^2</math>.
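
On a finite probability space the orthogonality condition reduces to finitely many sums and can be verified exactly. The joint distribution and the test function below are assumptions made for this sketch.

```python
from fractions import Fraction

# Hypothetical joint pmf of (X, Y), stored as {(x, y): probability}.
joint = {
    (0, 0): Fraction(1, 4),
    (2, 0): Fraction(1, 4),
    (1, 1): Fraction(1, 6),
    (4, 1): Fraction(1, 3),
}
ys = sorted({y for (_, y) in joint})

def e_X(y):
    """Conditional mean E(X | Y = y)."""
    p_y = sum(p for (_, yy), p in joint.items() if yy == y)
    return sum(x * p for (x, yy), p in joint.items() if yy == y) / p_y

def inner(f):
    """<X - e_X(Y), f(Y)> = E((X - e_X(Y)) * f(Y))."""
    return sum(p * (x - e_X(y)) * f(y) for (x, y), p in joint.items())

# Indicator functions f(Y) = 1_{Y = y0}, one for each value y0 of Y.
indicator_products = [inner(lambda y, y0=y0: 1 if y == y0 else 0) for y0 in ys]

# An arbitrary (hypothetical) function of Y.
arbitrary_product = inner(lambda y: 7 * y * y - 3)
```

All inner products are zero: the residual is orthogonal to the indicators, and hence to every function of {{mvar|Y}} on this finite space, since each such function is a linear combination of indicators.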

==== Connections to regression ====

The conditional expectation is often approximated in [[applied mathematics]] and [[statistics]] due to the difficulties in analytically calculating it, and for interpolation.<ref>{{cite book |last1=Hastie |first1=Trevor |title=The elements of statistical learning : data mining, inference, and prediction |date=26 August 2009 |location=New York |isbn=978-0-387-84858-7 |edition=Second, corrected 7th printing |url=https://web.stanford.edu/~hastie/Papers/ESLII.pdf}}</ref>

The Hilbert subspace
:<math> M = \{ g(Y) : \operatorname{E}(g(Y)^2) < \infty \}</math> 
defined above is replaced with subsets thereof by restricting the functional form of {{mvar|g}}, rather than allowing any measurable function. Examples of this are [[Decision tree learning|decision tree regression]] when {{mvar|g}} is required to be a [[simple function]], [[linear regression]] when {{mvar|g}} is required to be [[affine transformation|affine]], etc.

These generalizations of conditional expectation come at the cost of many of [[Conditional expectation#Basic properties|its properties]] no longer holding.
For example, let {{mvar|M}}
be the space of all linear functions of {{mvar|Y}} and let <math>\mathcal{E}_{M}</math> denote the corresponding generalized conditional expectation, that is, the <math>L^2</math> projection onto {{mvar|M}}. If {{mvar|M}} does not contain the [[constant function]]s, the [[tower property]]
<math> \operatorname{E}(\mathcal{E}_M(X)) = \operatorname{E}(X) </math>
need not hold.
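
A small exact computation illustrates the failure. The sketch assumes a scalar {{mvar|Y}} that is uniform on {1, 2, 3}, a constant <math>X = 1</math>, and {{mvar|M}} consisting of the multiples <math>bY</math> (regression through the origin); these choices are made only for illustration.

```python
from fractions import Fraction

# Hypothetical example: Y uniform on {1, 2, 3}, X = 1 (constant).
# M = {b * Y : b real} contains no nonzero constant function.
pmf_Y = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}

def expect(f):
    """E(f(Y)) under pmf_Y."""
    return sum(p * f(y) for y, p in pmf_Y.items())

def X(y):
    return Fraction(1)

# L^2 projection of X onto M: minimizing E((X - b*Y)^2) gives
# b = E(X*Y) / E(Y^2).
b = expect(lambda y: X(y) * y) / expect(lambda y: y * y)  # 3/7

mean_of_projection = expect(lambda y: b * y)  # E(E_M(X)) = 6/7
mean_of_X = expect(X)                         # E(X) = 1
```

Since <math>\tfrac{6}{7} \neq 1</math>, the projection onto this {{mvar|M}} does not preserve the mean. Adding the constant functions to {{mvar|M}}, that is, fitting an intercept, would restore the identity in this example.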

An important special case is when {{mvar|X}} and {{mvar|Y}} are jointly normally distributed. In this case
it can be shown that the conditional expectation is equivalent to linear regression:
:<math> e_X(Y) = \alpha_0 + \sum_i \alpha_i Y_i</math>
for coefficients <math>\{\alpha_i\}_{i = 0, \ldots, n}</math> described in [[Multivariate normal distribution#Conditional distributions]].
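
For instance, in the bivariate case <math>n = 1</math>, write <math>\mu_Y</math> and <math>\sigma_Y^2 > 0</math> for the mean and variance of {{mvar|Y}} and <math>\rho</math> for the correlation of {{mvar|X}} and {{mvar|Y}} (notation introduced here for illustration). The standard formula for the conditional mean of a bivariate normal distribution then reads
:<math> e_X(y) = \mu_X + \rho \frac{\sigma_X}{\sigma_Y} (y - \mu_Y), </math>
so that <math>\alpha_1 = \rho \sigma_X / \sigma_Y</math> and <math>\alpha_0 = \mu_X - \alpha_1 \mu_Y</math>.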

=== Conditional expectation with respect to a sub-''σ''-algebra ===
[[File:LokaleMittelwertbildung.svg|thumb|upright=1.5|'''Conditional expectation with respect to a ''σ''-algebra:''' in this example the probability space <math>(\Omega, \mathcal{F}, P)</math> is the [0,1] interval with the [[Lebesgue measure]].  We define the following ''σ''-algebras: <math>\mathcal{A} = \mathcal{F}</math>; <math>\mathcal{B}</math> is the ''σ''-algebra generated by the intervals with end-points 0, {{frac|1|4}}, {{frac|1|2}}, {{frac|3|4}}, 1; and <math>\mathcal{C}</math> is the ''σ''-algebra generated by the intervals with end-points 0, {{frac|1|2}}, 1. Here the conditional expectation is effectively the average over the minimal sets of the ''σ''-algebra.]]

Consider the following:
* <math>(\Omega, \mathcal{F}, P)</math> is a [[probability space]].
* <math>X\colon\Omega \to \mathbb{R}^n</math> is a [[random variable#Definition|random variable]] on that probability space with finite expectation.
* <math>\mathcal{H} \subseteq \mathcal{F}</math> is a sub-[[sigma-algebra|''σ''-algebra]] of <math>\mathcal{F}</math>.

Since <math>\mathcal{H}</math> is a sub-''σ''-algebra of <math>\mathcal{F}</math>, the function <math>X\colon\Omega \to \mathbb{R}^n</math> is usually not <math>\mathcal{H}</math>-measurable, so the existence of integrals of the form <math display="inline">\int_H X \,dP|_\mathcal{H}</math>, where <math>H\in\mathcal{H}</math> and <math>P|_\mathcal{H}</math> is the restriction of <math>P</math> to <math>\mathcal{H}</math>, cannot be asserted in general. However, the local averages <math display="inline">\int_H X\,dP</math> can be recovered in <math>(\Omega, \mathcal{H}, P|_\mathcal{H})</math> with the help of the conditional expectation.

A '''conditional expectation''' of ''X'' given <math>\mathcal{H}</math>, denoted as <math>\operatorname{E}(X\mid\mathcal{H})</math>, is any <math>\mathcal{H}</math>-[[measurable function]] <math>\Omega \to \mathbb{R}^n</math> which satisfies:

:<math> \int_H\operatorname{E}(X \mid \mathcal{H})\,\mathrm{d}P = \int_H X \,\mathrm{d}P</math>

for each <math>H \in \mathcal{H}</math>.<ref name=billingsley1995/>

As noted in the <math>L^2</math> discussion, this condition is equivalent to saying that the [[residual (statistics)|residual]] <math>X - \operatorname{E}(X \mid \mathcal{H})</math> is orthogonal to the indicator functions <math>1_H</math>:
:<math> \langle X - \operatorname{E}(X \mid \mathcal{H}), 1_H \rangle = 0 </math>
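
When <math>\Omega</math> is finite and <math>\mathcal{H}</math> is generated by a partition, the defining identity can be checked on every event of <math>\mathcal{H}</math>. The following sketch assumes eight equally likely outcomes, <math>X(\omega) = \omega^2</math>, and a three-block partition; all of these are illustrative choices.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical finite probability space: 8 equally likely outcomes.
omega = range(8)
P = {w: Fraction(1, 8) for w in omega}
X = {w: w * w for w in omega}

# H is the sigma-algebra generated by this partition of omega.
blocks = [frozenset({0, 1, 2, 3}), frozenset({4, 5}), frozenset({6, 7})]

def integral(f, H):
    """Integral of f over the event H with respect to P."""
    return sum(f[w] * P[w] for w in H)

# Candidate E(X | H): on each block, the P-weighted average of X.
cond = {}
for block in blocks:
    avg = integral(X, block) / sum(P[w] for w in block)
    for w in block:
        cond[w] = avg

# Every event of H is a union of blocks; check the identity on all 8 of them.
events = [frozenset().union(*c)
          for r in range(len(blocks) + 1)
          for c in combinations(blocks, r)]
defining_property_holds = all(
    integral(cond, H) == integral(X, H) for H in events
)
```

The candidate is constant on each block, hence <math>\mathcal{H}</math>-measurable, and it satisfies the integral identity on all events of <math>\mathcal{H}</math>, so it is a version of <math>\operatorname{E}(X\mid\mathcal{H})</math> for this example.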

==== Existence ====

The existence of <math>\operatorname{E}(X\mid\mathcal{H})</math> can be established by noting that <math display="inline">\mu^X\colon F \mapsto \int_F X \, \mathrm{d}P</math> for <math>F \in \mathcal{F}</math> is a finite measure on <math>(\Omega, \mathcal{F})</math> that is [[absolute continuity|absolutely continuous]] with respect to  <math>P</math>.  If <math>h</math> is the [[natural injection]] from <math>\mathcal{H}</math> to <math>\mathcal{F}</math>, then <math>\mu^X \circ h = \mu^X|_\mathcal{H}</math> is the restriction of <math>\mu^X</math> to <math>\mathcal{H}</math> and <math>P \circ h = P|_\mathcal{H}</math> is the restriction of <math>P</math> to <math>\mathcal{H}</math>.  Furthermore, <math>\mu^X \circ h</math> is absolutely continuous with respect to <math>P \circ h</math>, because the condition
:<math>P \circ h (H) = 0 \iff P(h(H)) = 0</math>
implies
:<math>\mu^X(h(H)) = 0 \iff \mu^X \circ h(H) = 0.</math>

Thus, we have 
:<math>\operatorname{E}(X\mid\mathcal{H}) = \frac{\mathrm{d}\mu^X|_\mathcal{H}}{\mathrm{d}P|_\mathcal{H}} = \frac{\mathrm{d}(\mu^X \circ h)}{\mathrm{d}(P \circ h)},</math>
where the derivatives are [[Radon–Nikodym theorem|Radon–Nikodym derivatives]] of measures.
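
In the special case where <math>\mathcal{H}</math> is generated by a countable partition <math>\{H_i\}</math> of <math>\Omega</math> with <math>P(H_i) > 0</math> for all <math>i</math> (an assumption made here for illustration), the Radon–Nikodym derivative can be written explicitly as
:<math>\operatorname{E}(X\mid\mathcal{H}) = \sum_i \frac{\mu^X(H_i)}{P(H_i)} \, 1_{H_i} = \sum_i \left( \frac{1}{P(H_i)} \int_{H_i} X \,\mathrm{d}P \right) 1_{H_i},</math>
that is, the conditional expectation equals the average of <math>X</math> over the partition block containing <math>\omega</math>.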

==== Conditional expectation with respect to a random variable ====
Consider, in addition to the above,
* A [[measurable space]] <math>(U, \Sigma)</math>, and
* A random variable <math>Y\colon\Omega \to U</math>.

The conditional expectation of {{mvar|X}} given {{mvar|Y}} is defined by applying the above construction to the [[Σ-algebra#σ-algebra generated by random variable or vector|''σ''-algebra generated by]] {{mvar|Y}}:
:<math>\operatorname{E}[X\mid Y] := \operatorname{E}[X\mid\sigma(Y)]. </math>

By the [[Doob–Dynkin lemma]], there exists a function <math>e_X \colon U \to \mathbb{R}^n</math> such that
:<math>\operatorname{E}[X\mid Y] = e_X(Y). </math>

==== Discussion ====

* This is not a constructive definition; we are merely given the required property that a conditional expectation must satisfy.
** The definition of <math>\operatorname{E}(X \mid \mathcal{H})</math> may resemble that of <math>\operatorname{E}(X \mid H)</math> for an event <math>H</math>, but these are very different objects. The former is an <math>\mathcal{H}</math>-measurable function <math>\Omega \to \mathbb{R}^n</math>, while the latter is an element of <math>\mathbb{R}^n</math>; the two are related by <math>\operatorname{E}(X \mid H)\ P(H)= \int_H X \,\mathrm{d}P= \int_H \operatorname{E} (X\mid\mathcal{H})\,\mathrm{d}P</math> for <math>H\in\mathcal{H}</math>.
** Uniqueness can be shown to be [[almost surely|almost sure]]: that is, versions of the same conditional expectation will only differ on a [[null set|set of probability zero]].
*** Often, one would like to think of <math>\operatorname{E}(X \mid \mathcal{H})</math> as a measure on <math>\Omega</math> for fixed H. For example, it is extremely useful to claim that <math>\sum_i\operatorname{E}(X_i \mid \mathcal{H})</math> is additive for almost all H. However, this does not immediately follow because each <math>\operatorname{E}(X_i \mid \mathcal{H})</math> may have a different null set. Because countable unions of null sets are null sets, for a countable set of <math>X_i</math>, one can choose "versions" of each <math>\operatorname{E}(X_i \mid \mathcal{H})</math> with aligned null sets so as to maintain additivity for almost all H. However, to align the "null sets of dysfunction" of <math>\operatorname{E}(X_i \mid \mathcal{H})</math> over all possible <math>X_i</math>, and thus treat <math>\operatorname{E}(X \mid \mathcal{H} = H)</math> as an almost surely unique measure over <math>\Omega</math> (a "regular probability measure"), we need further regularity conditions. Intuitively, to do this, we need to be able to approximate all possible <math>X_i</math> with a countable set of them. This directly corresponds to the conditions for creating a regular probability measure, which are separability and completeness.
* The ''σ''-algebra <math>\mathcal{H}</math> controls the "granularity" of the conditioning.  A conditional expectation <math>E(X\mid\mathcal{H})</math> over a finer (larger) ''σ''-algebra <math>\mathcal{H}</math> retains information about the probabilities of a larger class of events.  A conditional expectation over a coarser (smaller) ''σ''-algebra averages over more events.

==== Conditional probability ====

{{Main|Regular conditional probability}}

For a Borel subset {{mvar|B}} in <math>\mathcal{B}(\mathbb{R}^n)</math>, one can consider the collection of random variables
:<math> \kappa_\mathcal{H}(\omega, B) := \operatorname{E}(1_{X \in B}|\mathcal{H})(\omega). </math>
It can be shown that they form a [[Markov kernel]], that is, for almost all <math>\omega</math>,
<math>\kappa_\mathcal{H}(\omega, -)</math> is a probability measure.<ref>{{cite book |last1=Klenke |first1=Achim |title=Probability theory : a comprehensive course |date=30 August 2013 |location=London |isbn=978-1-4471-5361-0 |edition=Second}}</ref>

The [[law of the unconscious statistician]] is then
:<math> \operatorname{E}[f(X)\mid\mathcal{H}] = \int f(x) \kappa_\mathcal{H}(-, \mathrm{d}x). </math>
This shows that conditional expectations are, like their unconditional counterparts, integrals
against a conditional measure.
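
In a finite setting the kernel and the integral against it reduce to finite sums. The sketch below assumes six equally likely outcomes, a two-block partition generating <math>\mathcal{H}</math>, and <math>f(x) = x^2</math>; the names <code>kappa</code> and <code>cond_exp</code> are hypothetical.

```python
from fractions import Fraction

# Hypothetical finite setting: 6 equally likely outcomes and a
# partition of omega into two blocks generating H.
omega = range(6)
P = {w: Fraction(1, 6) for w in omega}
X = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
blocks = [(0, 1, 2), (3, 4, 5)]

def cond_exp(Z, w):
    """E(Z | H)(w): the P-weighted average of Z over the block containing w."""
    block = next(b for b in blocks if w in b)
    return sum(Z[v] * P[v] for v in block) / sum(P[v] for v in block)

def kappa(w, x):
    """kappa_H(w, {x}) = E(1_{X = x} | H)(w)."""
    indicator = {v: 1 if X[v] == x else 0 for v in omega}
    return cond_exp(indicator, w)

def f(x):
    return x * x

values = sorted(set(X.values()))

# Left side: E(f(X) | H)(w). Right side: sum over x of f(x) * kappa(w, {x}).
lhs = {w: cond_exp({v: f(X[v]) for v in omega}, w) for w in omega}
rhs = {w: sum(f(x) * kappa(w, x) for x in values) for w in omega}

# For each w, kappa(w, .) has total mass 1, as a probability measure should.
kernel_mass = {w: sum(kappa(w, x) for x in values) for w in omega}
```

Both sides equal <math>\tfrac{1}{3}</math> on the first block and <math>3</math> on the second, which illustrates the identity for this example.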

=== General definition ===
In full generality, consider:
* A probability space <math>(\Omega,\mathcal{A},P)</math>.
* A [[Banach space]] <math>(E,\|\cdot\|_E)</math>.
* A [[Bochner integral|Bochner integrable]] random variable <math>X:\Omega\to E</math>.
* A sub-''σ''-algebra <math>\mathcal{H}\subseteq \mathcal{A}</math>.

The '''conditional expectation''' of <math>X</math> given <math>\mathcal{H}</math> is the integrable, <math>E</math>-valued, <math>\mathcal{H}</math>-measurable random variable <math>\operatorname{E}(X \mid \mathcal{H})</math>, unique up to a <math>P</math>-null set, satisfying
:<math>\int_H \operatorname{E}(X \mid \mathcal{H}) \,\mathrm{d}P = \int_H X \,\mathrm{d}P</math>
for all <math>H \in \mathcal{H}</math>.<ref>{{cite book|first1=Giuseppe|last1=Da Prato|first2=Jerzy|last2=Zabczyk|date=2014|title=Stochastic Equations in Infinite Dimensions|publisher=Cambridge University Press|doi=10.1017/CBO9781107295513|page=26|isbn=978-1-107-05584-1 }} (Definition in separable Banach spaces)</ref><ref>{{cite book|first1=Tuomas|last1=Hytönen|first2=Jan|last2=van Neerven|first3=Mark|last3=Veraar|first4=Lutz|last4=Weis|date=2016|title=Analysis in Banach Spaces, Volume I: Martingales and Littlewood-Paley Theory|publisher=Springer Cham|doi=10.1007/978-3-319-48520-1|isbn=978-3-319-48519-5 }} (Definition in general Banach spaces)</ref> 

In this setting the conditional expectation is sometimes also denoted in operator notation as <math>\operatorname{E}^\mathcal{H}X</math>.