
=== L<sup>2</sup> random variables ===
All random variables in this section are assumed to be in <math>L^2</math>, that is, [[square integrable]].
In its full generality, conditional expectation is developed without this assumption; see below under [[Conditional expectation#Conditional expectation with respect to a sub-''σ''-algebra|Conditional expectation with respect to a sub-''σ''-algebra]]. The <math>L^2</math> theory is, however, considered more intuitive<ref>{{cite web |title=probability - Intuition behind Conditional Expectation |url=https://math.stackexchange.com/a/23613/357269 |website=Mathematics Stack Exchange}}</ref> and admits [[Conditional expectation#Connections to regression|important generalizations]].
In the context of <math>L^2</math> random variables, conditional expectation is also called [[Regression analysis|regression]].
 
In what follows, let <math>(\Omega, \mathcal{F}, P)</math> be a probability space, and let <math>X: \Omega \to \mathbb{R}</math> be in <math>L^2</math> with mean <math>\mu_X</math> and [[variance]] <math>\sigma_X^2</math>.
The expectation <math>\mu_X</math> minimizes the [[mean squared error]]:
:<math> \min_{x \in \mathbb{R}} \operatorname{E}\left((X - x)^2\right) = \operatorname{E}\left((X - \mu_X)^2\right)
= \sigma_X^2. </math>
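
This identity can be checked by expanding the square around <math>\mu_X</math>. The following short derivation is a standard argument, included here as an illustration; it uses only that <math>\operatorname{E}(X - \mu_X) = 0</math>:
:<math> \operatorname{E}\left((X - x)^2\right) = \operatorname{E}\left((X - \mu_X)^2\right) + 2(\mu_X - x)\operatorname{E}(X - \mu_X) + (\mu_X - x)^2 = \sigma_X^2 + (\mu_X - x)^2. </math>
The right-hand side is smallest exactly when <math>x = \mu_X</math>, which also shows that this minimizer is unique.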

The conditional expectation of {{mvar|X}} is defined analogously, except that instead of a single number <math>\mu_X</math>, the result is a function <math>e_X(y)</math>. Let <math>Y: \Omega \to \mathbb{R}^n</math> be a [[random vector]]. The conditional expectation <math>e_X: \mathbb{R}^n \to \mathbb{R}</math> is a measurable function such that
:<math> \min_{g \text{ measurable }} \operatorname{E}\left((X - g(Y))^2\right) = \operatorname{E}\left((X - e_X(Y))^2\right).
</math>

Unlike <math>\mu_X</math>, the conditional expectation <math>e_X</math> is generally not unique: several measurable functions may attain the minimum of the mean squared error.
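
The defining minimization can be illustrated on a finite probability space, where every expectation is a finite sum. The sketch below is only an illustration: the joint distribution <code>joint</code> and the helper names <code>cond_mean</code> and <code>mse</code> are hypothetical and do not come from the cited sources. It computes the average of {{mvar|X}} on each level set of {{mvar|Y}} and checks that a few other choices of {{mvar|g}} give a larger mean squared error.

```python
from fractions import Fraction as F

# Hypothetical joint distribution of (X, Y) on a finite sample space:
# each entry is (x, y, probability). The values are illustrative only.
joint = [
    (0, 0, F(1, 4)),
    (2, 0, F(1, 4)),
    (1, 1, F(1, 8)),
    (5, 1, F(3, 8)),
]

def cond_mean(y):
    """Probability-weighted average of X over the outcomes with Y == y."""
    mass = sum(p for _, yy, p in joint if yy == y)
    return sum(x * p for x, yy, p in joint if yy == y) / mass

def mse(g):
    """Mean squared error E((X - g(Y))^2) for a function g of y."""
    return sum(p * (x - g(y)) ** 2 for x, y, p in joint)

# Candidate for e_X: the level-set averages of X.
e_X = {y: cond_mean(y) for y in {0, 1}}
best = mse(lambda y: e_X[y])

# Other choices of g: two perturbations of e_X and one constant function.
candidates = [
    lambda y: e_X[y] + (F(1, 2) if y == 0 else 0),
    lambda y: e_X[y] - (1 if y == 1 else 0),
    lambda y: F(5, 2),  # the constant equal to the unconditional mean E(X)
]
worse = [mse(g) for g in candidates]

print(e_X, best, worse)
```

In this toy example both values of {{mvar|Y}} have positive probability, so the minimizer is determined at both points; the non-uniqueness discussed in this section arises only at values of {{mvar|y}} that {{mvar|Y}} takes with probability zero.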

==== Uniqueness ====

'''Example 1''': Consider the case where {{mvar|Y}} is the constant random variable that is always 1.
Then the mean squared error is minimized by any function of the form
:<math>
e_X(y) = \begin{cases}
\mu_X & \text{if } y = 1, \\
\text{any number} & \text{otherwise.}
\end{cases}
</math>

'''Example 2''': Consider the case where {{mvar|Y}} is the 2-dimensional random vector <math>(X, 2X)</math>. Then clearly
:<math>\operatorname{E}(X \mid Y) = X</math>
but, as a function of <math>y</math>, it can be expressed as <math>e_X(y_1, y_2) = 3y_1-y_2</math> or <math>e'_X(y_1, y_2) = y_2 - y_1</math> or in infinitely many other ways. In the context of [[linear regression]], this lack of uniqueness is called [[multicollinearity]].

Conditional expectation is unique up to a set of measure zero in <math>\mathbb{R}^n</math>. The measure used is the [[pushforward measure]] induced by {{mvar|Y}}.

In the first example, the pushforward measure is a [[Dirac distribution]] at 1. In the second, it is concentrated on the line <math>\{ y : y_2 = 2 y_1 \}</math>, so that any set not intersecting this line has measure 0.
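
For Example 2, the agreement of the two versions on the support of {{mvar|Y}} can be checked directly. The following is a minimal sketch; the function names <code>e1</code> and <code>e2</code> and the choice of integer test values are assumptions made for illustration.

```python
import random

random.seed(1)

def e1(y1, y2):
    """First version from Example 2: e_X(y1, y2) = 3*y1 - y2."""
    return 3 * y1 - y2

def e2(y1, y2):
    """Second version from Example 2: e'_X(y1, y2) = y2 - y1."""
    return y2 - y1

# Points of the form Y = (X, 2X) lie on the line y2 = 2*y1, where both
# versions return X. Integer samples keep the arithmetic exact.
samples = [random.randint(-100, 100) for _ in range(1000)]
agree_on_support = all(e1(x, 2 * x) == x == e2(x, 2 * x) for x in samples)

# Off that line the two functions differ, but such points form a set of
# measure zero under the pushforward measure of Y.
off_support = (1, 0)
print(agree_on_support, e1(*off_support), e2(*off_support))
# prints: True 3 -1
```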

==== Existence ====

The existence of a minimizer for <math> \min_g \operatorname{E}\left((X - g(Y))^2\right)</math> is non-trivial. It can be shown that
:<math> M := \{ g(Y) : g \text{ is measurable and }\operatorname{E}(g(Y)^2) < \infty \} = L^2(\Omega, \sigma(Y)) </math>
is a closed subspace of the Hilbert space <math>L^2(\Omega)</math>.<ref>{{cite book |last1=Brockwell |first1=Peter J. |title=Time series : theory and methods |date=1991 |publisher=Springer-Verlag |location=New York |isbn=978-1-4419-0320-4 |edition=2nd}}</ref>
By the [[Hilbert projection theorem]], the necessary and sufficient condition for
<math>e_X</math> to be a minimizer is that for all <math>f(Y)</math> in {{mvar|M}} we have
:<math> \langle X - e_X(Y), f(Y) \rangle = 0. </math>
In words, this equation says that the [[residual (statistics)|residual]] <math>X - e_X(Y)</math> is orthogonal to the space {{mvar|M}} of all functions of {{mvar|Y}}.
This orthogonality condition, applied to the [[indicator function]]s <math>f(Y) = 1_{Y \in H}</math>,
is used below to extend conditional expectation to the case that {{mvar|X}} and {{mvar|Y}} are not necessarily in <math>L^2</math>.
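
Written out for an indicator function, with <math>H</math> assumed here to be a [[Borel set|Borel]] subset of <math>\mathbb{R}^n</math>, the orthogonality condition reads
:<math> \operatorname{E}\left(X \, 1_{Y \in H}\right) = \operatorname{E}\left(e_X(Y) \, 1_{Y \in H}\right), </math>
or equivalently, as integrals over the event <math>Y^{-1}(H)</math>,
:<math> \int_{Y^{-1}(H)} X \, dP = \int_{Y^{-1}(H)} e_X(Y) \, dP. </math>
Both sides remain well defined when {{mvar|X}} is merely integrable, which is why this form of the condition lends itself to the extension beyond <math>L^2</math>.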

==== Connections to regression ====

In [[applied mathematics]] and [[statistics]], the conditional expectation is often approximated, both because it is difficult to calculate analytically and for the purpose of interpolation.<ref>{{cite book |last1=Hastie |first1=Trevor |title=The elements of statistical learning : data mining, inference, and prediction |date=26 August 2009 |location=New York |isbn=978-0-387-84858-7 |edition=Second, corrected 7th printing |url=https://web.stanford.edu/~hastie/Papers/ESLII.pdf}}</ref>

The Hilbert subspace
:<math> M = \{ g(Y) : \operatorname{E}(g(Y)^2) < \infty \}</math> 
defined above is replaced with subsets thereof by restricting the functional form of {{mvar|g}}, rather than allowing any measurable function. Examples of this are [[Decision tree learning|decision tree regression]] when {{mvar|g}} is required to be a [[simple function]], [[linear regression]] when {{mvar|g}} is required to be [[affine transformation|affine]], etc.

These generalizations of conditional expectation come at the cost of many of [[Conditional expectation#Basic properties|its properties]] no longer holding.
For example, let {{mvar|M}} be the space of all linear functions of {{mvar|Y}} and let <math>\mathcal{E}_{M}</math> denote this generalized conditional expectation/<math>L^2</math> projection. If {{mvar|M}} does not contain the [[constant function]]s, the [[tower property]]
<math> \operatorname{E}(\mathcal{E}_M(X)) = \operatorname{E}(X) </math>
will not hold.
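
The failure of the tower property can be seen in a small exact computation. The sketch below is illustrative only: the two-point distribution and the names <code>outcomes</code> and <code>E</code> are assumptions, not taken from the cited sources. It projects a constant {{mvar|X}} onto the functions of the form <math>bY</math> (no intercept) and, for comparison, onto the affine functions <math>a + bY</math>, which do contain the constants.

```python
from fractions import Fraction as F

# Hypothetical finite example: Y is 1 or 2 with probability 1/2 each,
# and X is the constant 1, so E(X) = 1. Entries are (x, y, probability).
outcomes = [(F(1), F(1), F(1, 2)), (F(1), F(2), F(1, 2))]

def E(h):
    """Expectation of h(x, y) under the finite distribution above."""
    return sum(p * h(x, y) for x, y, p in outcomes)

mean_x = E(lambda x, y: x)
mean_y = E(lambda x, y: y)

# Projection onto M = {b*Y}: linear functions without an intercept.
# The orthogonality condition E((X - b*Y) * Y) = 0 gives b = E(XY) / E(Y^2).
b = E(lambda x, y: x * y) / E(lambda x, y: y * y)
mean_of_projection = E(lambda x, y: b * y)

# Projection onto {a + b*Y}: affine functions, which include the constants.
var_y = E(lambda x, y: y * y) - mean_y ** 2
slope = (E(lambda x, y: x * y) - mean_x * mean_y) / var_y
intercept = mean_x - slope * mean_y
mean_of_affine_projection = E(lambda x, y: intercept + slope * y)

print(mean_x, mean_of_projection, mean_of_affine_projection)
# prints: 1 9/10 1
```

Without the intercept the projection has mean 9/10 rather than <math>\operatorname{E}(X) = 1</math>; once the constants are included, the mean is recovered.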

An important special case is when {{mvar|X}} and {{mvar|Y}} are jointly normally distributed. In this case
it can be shown that the conditional expectation is equivalent to linear regression:
:<math> e_X(Y) = \alpha_0 + \sum_i \alpha_i Y_i</math>
for coefficients <math>\{\alpha_i\}_{i = 0, \ldots, n}</math> described in [[Multivariate normal distribution#Conditional distributions]].
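
The jointly normal case can be explored by simulation. The following is a rough sketch for <math>n = 1</math> under an assumed toy model (the parameters, sample size and bin edges are arbitrary choices, not from the cited sources): <math>Y</math> is normal with mean 1 and standard deviation 2, and <math>X = 3 + 0.5\,Y</math> plus independent standard normal noise. It fits the regression line from sample moments and compares it with the empirical averages of {{mvar|X}} over bins of {{mvar|Y}}.

```python
import random

random.seed(0)

# Assumed toy model (not from the text): Y ~ N(1, 2^2) and
# X = 3 + 0.5*Y + independent N(0, 1) noise, so (X, Y) is jointly normal.
n = 100_000
ys = [random.gauss(1.0, 2.0) for _ in range(n)]
xs = [3.0 + 0.5 * y + random.gauss(0.0, 1.0) for y in ys]

# Least-squares (linear regression) coefficients from sample moments.
mean_x = sum(xs) / n
mean_y = sum(ys) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
var_y = sum((y - mean_y) ** 2 for y in ys) / n
alpha_1 = cov_xy / var_y
alpha_0 = mean_x - alpha_1 * mean_y

# Empirical averages of X over unit-width bins of Y, compared with the
# fitted line evaluated at the average of Y within the same bin.
max_gap = 0.0
for lo in range(-3, 5):
    in_bin = [(x, y) for x, y in zip(xs, ys) if lo <= y < lo + 1]
    bin_mean_x = sum(x for x, _ in in_bin) / len(in_bin)
    bin_mean_y = sum(y for _, y in in_bin) / len(in_bin)
    max_gap = max(max_gap, abs(bin_mean_x - (alpha_0 + alpha_1 * bin_mean_y)))

print(alpha_0, alpha_1, max_gap)
```

Under this model the fitted coefficients should be close to 3 and 0.5, and the binned averages should lie near the fitted line up to sampling error. A simulation of this kind illustrates the statement but does not prove it.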