==Forms==
There are two different versions of the law of large numbers, described below: the '''''strong law''' of large numbers'' and the '''''weak law''' of large numbers''.<ref>{{Cite book|title=A Course in Mathematical Statistics and Large Sample Theory| last1=Bhattacharya|first1=Rabi| last2=Lin|first2=Lizhen| last3=Patrangenaru|first3=Victor| date=2016| publisher=Springer New York| isbn=978-1-4939-4030-1| series=Springer Texts in Statistics| location=New York, NY| doi=10.1007/978-1-4939-4032-5}}</ref><ref name=":0" /> Stated for the case where ''X''<sub>1</sub>, ''X''<sub>2</sub>, ... is an infinite sequence of [[Independent and identically distributed random variables|independent and identically distributed (i.i.d.)]] [[Lebesgue integration|Lebesgue integrable]] random variables with expected value E(''X''<sub>1</sub>) = E(''X''<sub>2</sub>) = ... = ''μ'', both versions of the law state that the sample average

<math display="block">\overline{X}_n=\frac1n(X_1+\cdots+X_n) </math>

converges to the expected value:
{{NumBlk||<math display="block">\overline{X}_n \to \mu \quad\textrm{as}\ n \to \infty.</math>|{{EquationRef|1}}}}

(Lebesgue integrability of ''X<sub>j</sub>'' means that the expected value E(''X<sub>j</sub>'') exists according to Lebesgue integration and is finite. It does ''not'' mean that the associated probability measure is [[absolutely continuous]] with respect to [[Lebesgue measure]].)

Introductory probability texts often additionally assume identical finite [[variance]] <math> \operatorname{Var} (X_i) = \sigma^2 </math> (for all <math>i</math>) and no correlation between random variables.  In that case, the variance of the average of ''n'' random variables is

<math display="block">\operatorname{Var}(\overline{X}_n) = \operatorname{Var}(\tfrac1n(X_1+\cdots+X_n)) = \frac{1}{n^2} \operatorname{Var}(X_1+\cdots+X_n) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n},</math>

which can be used to shorten and simplify the proofs. This assumption of finite [[variance]] is ''not necessary''. Large or infinite variance will make the convergence slower, but the law of large numbers holds anyway.<ref name="TaoBlog">{{cite web|title=The strong law of large numbers – What's new|date=19 June 2008|url=http://terrytao.wordpress.com/2008/06/18/the-strong-law-of-large-numbers/|access-date=2012-06-09|publisher=Terrytao.wordpress.com}}</ref> 
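
The variance formula above can be checked numerically. The following Python sketch is only an illustration, not taken from the cited sources: it assumes Uniform(0, 1) draws (so that ''σ''<sup>2</sup> = 1/12), and the function name <code>variance_of_sample_mean</code>, the number of trials and the seed are arbitrary choices.

```python
import random
import statistics


def variance_of_sample_mean(n, trials=20000, seed=0):
    """Estimate Var(mean of n i.i.d. Uniform(0, 1) draws) by simulation."""
    rng = random.Random(seed)
    means = [sum(rng.random() for _ in range(n)) / n for _ in range(trials)]
    return statistics.pvariance(means)


if __name__ == "__main__":
    sigma2 = 1 / 12  # variance of a single Uniform(0, 1) draw
    for n in (1, 4, 16, 64):
        # The simulated value should be close to sigma^2 / n.
        print(n, variance_of_sample_mean(n), sigma2 / n)
```

Under these assumptions the simulated variance should track ''σ''<sup>2</sup>/''n'' up to sampling noise.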

[[Independence (probability theory)#More than two random variables|Mutual independence]] of the random variables can be replaced by [[pairwise independence]]<ref>{{cite journal|last1=Etemadi|first1=N. Z.|date=1981|title=An elementary proof of the strong law of large numbers|journal=Wahrscheinlichkeitstheorie Verw Gebiete| volume=55| issue=1| pages=119–122| doi=10.1007/BF01013465|s2cid=122166046|doi-access=free}}</ref> or [[Exchangeable random variables|exchangeability]]<ref>{{Cite journal| last=Kingman|first=J. F. C.|date=April 1978|title=Uses of Exchangeability|journal=The Annals of Probability| language=en| volume=6|issue=2|doi=10.1214/aop/1176995566|issn=0091-1798|doi-access=free}}</ref> in both versions of the law.

The strong and weak versions differ in the mode of convergence being asserted. For interpretation of these modes, see [[Convergence of random variables]].

===Weak law===
{{multiple image |width1=50 |image1=Blank300.png
|width2=100 |image2=Lawoflargenumbersanimation2.gif |footer=Simulation illustrating the law of large numbers. Each frame, a coin that is red on one side and blue on the other is flipped, and a dot is added in the corresponding column. A pie chart shows the proportion of red and blue so far. Notice that while the proportion varies significantly at first, it approaches 50% as the number of trials increases.
|width3=50 |image3=Blank300.png}}
The '''weak law of large numbers''' (also called [[Aleksandr Khinchin|Khinchin]]'s law) states that given a collection of [[Independent and identically distributed random variables|independent and identically distributed]] (i.i.d.) samples from a random variable with finite mean, the sample mean [[Convergence in probability|converges in probability]] to the expected value<ref>{{harvnb|Loève|1977|loc=Chapter 1.4, p. 14}}</ref>
{{NumBlk||<math display="block">
    \overline{X}_n\ \overset{P}{\rightarrow}\ \mu \qquad\textrm{when}\ n \to \infty.
</math>|{{EquationRef|2}}}}

That is, for any positive number ''ε'',

<math display="block">
    \lim_{n\to\infty}\Pr\!\left(\,|\overline{X}_n-\mu| < \varepsilon\,\right) = 1.
</math>

Interpreting this result, the weak law states that for any nonzero margin specified (''ε''), no matter how small, with a sufficiently large sample there will be a very high probability that the average of the observations will be close to the expected value; that is, within the margin.
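
As an illustration of this interpretation, the probability of falling within the margin can be estimated by simulation. The sketch below is a hedged example rather than part of the theorem: it assumes Uniform(0, 1) samples (so ''μ'' = 0.5), and the name <code>prob_within_margin</code>, the margin 0.05, the number of trials and the seed are illustrative choices.

```python
import random


def prob_within_margin(n, eps, trials=2000, seed=0):
    """Estimate Pr(|sample mean - mu| < eps) for n Uniform(0, 1) draws (mu = 0.5)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(n)) / n
        if abs(mean - 0.5) < eps:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # The estimated probability should approach 1 as n grows.
    for n in (10, 100, 1000):
        print(n, prob_within_margin(n, eps=0.05))
```

A smaller margin ''ε'' requires a larger ''n'' before the estimated probability gets close to 1.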

As mentioned earlier, the weak law applies in the case of i.i.d. random variables, but it also applies in some other cases. For example, the variance may be different for each random variable in the series, keeping the expected value constant. If the variances are bounded, then the law applies, as shown by [[Pafnuty Chebyshev|Chebyshev]] as early as 1867. (If the expected values change during the series, then we can simply apply the law to the average deviation from the respective expected values. The law then states that this converges in probability to zero.) In fact, Chebyshev's proof works so long as the variance of the average of the first ''n'' values goes to zero as ''n'' goes to infinity.<ref name=EncMath/> As an example, assume that the random variables in the series are independent and that the ''n''th one follows a [[Gaussian distribution]] (normal distribution) with mean zero and variance equal to <math>2n/\log(n+1)</math>, which is not bounded. At each stage, the average will be normally distributed (as the average of a set of independent normally distributed variables). The variance of the sum is equal to the sum of the variances, which is [[asymptotic]] to <math>n^2 / \log n</math>. The variance of the average is therefore asymptotic to <math>1 / \log n</math> and goes to zero.
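
The asymptotic claim in this example can be checked by direct computation, since the variance of the average is a finite sum. The following Python sketch is illustrative only; the function name <code>variance_of_average</code> is hypothetical and natural logarithms are assumed.

```python
import math


def variance_of_average(n):
    """Exact variance of the average of n independent terms with Var(X_k) = 2k / log(k + 1)."""
    total = sum(2 * k / math.log(k + 1) for k in range(1, n + 1))
    return total / n**2


if __name__ == "__main__":
    for n in (10**2, 10**4, 10**6):
        v = variance_of_average(n)
        # The product v * log(n) drifts slowly toward 1, matching the 1 / log n asymptotics.
        print(n, v, v * math.log(n))
```

The convergence to zero is slow, of order 1/log ''n'', so very large ''n'' is needed before the average concentrates tightly.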

There are also examples of the weak law applying even though the expected value does not exist.

===Strong law===
The '''strong law of large numbers''' (also called [[Andrey Kolmogorov|Kolmogorov]]'s law) states that the sample average [[Almost sure convergence|converges almost surely]] to the expected value<ref>{{harvnb|Loève|1977|loc=Chapter 17.3, p. 251}}</ref>
{{NumBlk||<math display="block">
    \overline{X}_n\ \overset{\text{a.s.}}{\longrightarrow}\ \mu \qquad\textrm{when}\ n \to \infty.
</math>|{{EquationRef|3}}}}

That is,

<math display="block">
    \Pr\!\left( \lim_{n\to\infty}\overline{X}_n = \mu \right) = 1.
</math>

This means that, with probability one, the average of the observations converges to the expected value as the number of trials ''n'' goes to infinity. The modern proof of the strong law is more complex than that of the weak law, and relies on passing to an appropriate sub-sequence.<ref name="TaoBlog" />
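
The strong law is a statement about individual sample paths, which can be illustrated by following the running average along a single simulated sequence. The sketch below assumes fair coin flips (''μ'' = 0.5); the function names and the seed are illustrative, and a finite simulation can only suggest almost sure convergence, not prove it.

```python
import random


def running_means(n, seed=0):
    """Return the running averages of n fair coin flips (1 = heads, 0 = tails)."""
    rng = random.Random(seed)
    total = 0
    means = []
    for i in range(1, n + 1):
        total += 1 if rng.random() < 0.5 else 0
        means.append(total / i)
    return means


def max_tail_deviation(means, start, mu=0.5):
    """Largest |running mean - mu| over all indices from `start` (1-based) onward."""
    return max(abs(m - mu) for m in means[start - 1:])


if __name__ == "__main__":
    path = running_means(200000)
    for start in (100, 1000, 10000, 100000):
        # Along this one path, the worst deviation after `start` shrinks as `start` grows.
        print(start, max_tail_deviation(path, start))
```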

The strong law of large numbers can itself be seen as a special case of the [[Ergodic theory#Ergodic theorems|pointwise ergodic theorem]]. This view justifies the intuitive interpretation of the expected value (for Lebesgue integration only) of a random variable when sampled repeatedly as the "long-term average".  

Law 3 is called the strong law because random variables which converge strongly (almost surely) are guaranteed to converge weakly (in probability). However, the weak law is known to hold in certain conditions where the strong law does not hold, and then the convergence is only weak (in probability). See [[#Differences between the weak law and the strong law|Differences between the weak law and the strong law]].

The strong law applies to independent identically distributed random variables having an expected value (like the weak law). This was proved by Kolmogorov in 1930. It can also apply in other cases. Kolmogorov also showed, in 1933, that if the variables are independent and identically distributed, then for the average to converge almost surely on ''something'' (this can be considered another statement of the strong law), it is necessary that they have an expected value (and then of course the average will converge almost surely on that).<ref name=EMStrong>{{cite web|author1=Yuri Prokhorov| title=Strong law of large numbers|url=https://www.encyclopediaofmath.org/index.php/Strong_law_of_large_numbers| website=Encyclopedia of Mathematics}}</ref>

If the summands are independent but not identically distributed, then
{{NumBlk||<math display="block">
    \overline{X}_n - \operatorname{E}\big[\overline{X}_n\big]\ \overset{\text{a.s.}}{\longrightarrow}\ 0,
</math>|{{EquationRef|4}}}}

provided that each ''X''<sub>''k''</sub> has a finite second moment and

<math display="block">
    \sum_{k=1}^{\infty} \frac{1}{k^2} \operatorname{Var}[X_k] < \infty.
</math>

This statement is known as ''Kolmogorov's strong law''; see e.g. {{harvtxt|Sen|Singer|1993|loc=Theorem 2.3.10}}.
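
To make the summability condition concrete, its partial sums can be evaluated for a given variance sequence. In the hedged sketch below, the helper <code>kolmogorov_partial_sum</code> and both variance sequences are hypothetical examples chosen for illustration: Var[''X''<sub>''k''</sub>] = √''k'' gives a convergent series, while Var[''X''<sub>''k''</sub>] = ''k'' gives the divergent harmonic series.

```python
def kolmogorov_partial_sum(variance, n):
    """Partial sum of Var[X_k] / k**2 for k = 1..n, given a variance function of k."""
    return sum(variance(k) / k**2 for k in range(1, n + 1))


if __name__ == "__main__":
    for n in (10**3, 10**4, 10**5):
        bounded = kolmogorov_partial_sum(lambda k: k**0.5, n)  # sum of k**-1.5, converges
        unbounded = kolmogorov_partial_sum(lambda k: k, n)     # harmonic series, diverges
        print(n, bounded, unbounded)
```

The condition is sufficient rather than necessary: when the series diverges, the theorem simply does not apply, which by itself does not show that almost sure convergence fails.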

===Differences between the weak law and the strong law===
The ''weak law'' states that for a specified large ''n'', the average <math style="vertical-align:-.35em">\overline{X}_n</math> is likely to be near ''μ''.<ref>{{Cite web |title=What Is the Law of Large Numbers? (Definition) {{!}} Built In |url=https://builtin.com/data-science/law-of-large-numbers |access-date=2023-10-20 |website=builtin.com |language=en}}</ref> Thus, it leaves open the possibility that <math style="vertical-align:-.4em">|\overline{X}_n -\mu| > \varepsilon</math> happens an infinite number of times, although at infrequent intervals. (Not necessarily <math style="vertical-align:-.4em">|\overline{X}_n -\mu| \neq 0</math> for all ''n'').

The ''strong law'' shows that this [[almost surely]] will not occur. In particular, it implies that with probability 1, for any {{math|''ε'' > 0}} the inequality <math style="vertical-align:-.4em">|\overline{X}_n -\mu| < \varepsilon</math> holds for all large enough ''n''. How large ''n'' must be may depend on the outcome, since the convergence is not necessarily uniform on the set where it holds.<ref>{{harvtxt|Ross|2009}}</ref>

The strong law does not hold in the following cases, but the weak law does.<ref name="Weak law converges to constant">{{cite book |last1=Lehmann |first1=Erich L. |last2=Romano |first2=Joseph P. |date=2006-03-30 |title=Weak law converges to constant |publisher=Springer |isbn=9780387276052 |url=https://books.google.com/books?id=K6t5qn-SEp8C&pg=PA432}}</ref><ref>{{cite journal| title=A Note on the Weak Law of Large Numbers for Exchangeable Random Variables |author1=Dug Hun Hong |author2=Sung Ho Lee |url=http://www.mathnet.or.kr/mathnet/kms_tex/31810.pdf |journal=Communications of the Korean Mathematical Society| volume=13|year=1998|issue=2|pages=385–391 |access-date=2014-06-28|archive-url=https://web.archive.org/web/20160701234328/http://www.mathnet.or.kr/mathnet/kms_tex/31810.pdf|archive-date=2016-07-01|url-status=dead}}</ref><!-- Stack Exchange is not a reliable source -->

{{ordered list
|1= Let ''X'' be an [[Exponential distribution|exponentially]] distributed random variable with parameter 1. The random variable <math>\sin(X)e^X X^{-1}</math> has no expected value according to Lebesgue integration, but using conditional convergence and interpreting the integral as a [[Dirichlet integral]], which is an improper [[Riemann integral]], we can say:

<math display="block"> E\left(\frac{\sin(X)e^X}{X}\right) =\ \int_{x=0}^{\infty}\frac{\sin(x)e^x}{x}e^{-x}dx = \frac{\pi}{2} </math>

|2= Let ''X'' be a [[Geometric distribution|geometrically]] distributed random variable with probability 0.5. The random variable <math>2^X(-1)^X X^{-1}</math> does not have an expected value in the conventional sense because the infinite [[Series (mathematics)|series]] is not absolutely convergent, but using conditional convergence, we can say:

<math display="block"> E\left(\frac{2^X(-1)^X}{X}\right) =\ \sum_{x=1}^{\infty}\frac{2^x(-1)^x}{x}2^{-x}=-\ln(2) </math>

|3= If the [[cumulative distribution function]] of a random variable is

<math display="block">\begin{cases}
1-F(x)&=\frac{e}{2x\ln(x)},&x \ge e \\
F(x)&=\frac{e}{-2x\ln(-x)},&x \le -e
\end{cases}</math>

then it has no expected value, but the weak law is true.<ref>{{cite web|last1=Mukherjee|first1=Sayan|title=Law of large numbers| url=http://www.isds.duke.edu/courses/Fall09/sta205/lec/lln.pdf|access-date=2014-06-28|archive-url=https://web.archive.org/web/20130309032810/http://www.isds.duke.edu/courses/Fall09/sta205/lec/lln.pdf|archive-date=2013-03-09| url-status=dead}}</ref><ref>{{cite web|last1=J. Geyer|first1=Charles|title=Law of large numbers| url=http://www.stat.umn.edu/geyer/8112/notes/weaklaw.pdf}}</ref>

|4= Let ''X''<sub>''k''</sub> be plus or minus <math display="inline">\sqrt{k/\log\log\log k}</math> (starting at sufficiently large ''k'' so that the denominator is positive) with probability {{frac|1|2}} for each.<ref name=EMStrong/> The variance of ''X''<sub>''k''</sub> is then <math display="inline">k/\log\log\log k.</math> Kolmogorov's strong law does not apply because the partial sum in his criterion up to ''k''&nbsp;=&nbsp;''n'' is asymptotic to <math>\log n/\log\log\log n</math> and this is unbounded. If we replace the random variables with Gaussian variables having the same variances, namely <math display="inline">k/\log\log\log k</math>, then the average at any point will also be normally distributed. The width of the distribution of the average will tend toward zero (standard deviation asymptotic to <math display="inline">1/\sqrt{2\log\log\log n}</math>), but for a given ''ε'', the probability that the average will come back up to ''ε'' sometime after the ''n''th trial does not go to zero with ''n''. Since the width of the distribution of the average is not zero, this probability must have a positive lower bound ''p''(''ε''), which means there is a probability of at least ''p''(''ε'') that the average will attain ''ε'' after ''n'' trials. It will happen with probability ''p''(''ε'')/2 before some ''m'' which depends on ''n''. But even after ''m'', there is still a probability of at least ''p''(''ε'') that it will happen. (This seems to indicate that ''p''(''ε'')=1 and the average will attain ''ε'' an infinite number of times.)
}}
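
The conditional convergence used in the second example can be checked numerically. The following sketch is illustrative only, with hypothetical function names: it compares partial sums of the alternating series, which approach −ln 2, with partial sums of the absolute values, which form the divergent harmonic series.

```python
import math


def alternating_partial_sum(n):
    """Partial sum of (-1)**x / x for x = 1..n (the series from the second example)."""
    return sum((-1) ** x / x for x in range(1, n + 1))


def absolute_partial_sum(n):
    """Partial sum of |(-1)**x / x| = 1 / x, i.e. the harmonic series."""
    return sum(1 / x for x in range(1, n + 1))


if __name__ == "__main__":
    for n in (10**2, 10**4, 10**6):
        # The first column settles near -ln 2; the second keeps growing like log n.
        print(n, alternating_partial_sum(n), absolute_partial_sum(n))
    print("-ln 2 =", -math.log(2))
```

Because the absolute series diverges, the value −ln 2 depends on the order of summation, which is why the expected value does not exist in the conventional sense.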

===Uniform laws of large numbers===
There are extensions of the law of large numbers to collections of estimators, where the convergence is uniform over the collection; thus the name ''uniform law of large numbers''.

Suppose ''f''(''x'',''θ'') is some [[Function (mathematics)|function]] defined for ''θ'' ∈ Θ, and continuous in ''θ''. Then for any fixed ''θ'', the sequence {''f''(''X''<sub>1</sub>,''θ''), ''f''(''X''<sub>2</sub>,''θ''), ...} will be a sequence of independent and identically distributed random variables, such that the sample mean of this sequence converges in probability to E[''f''(''X'',''θ'')]. This is the ''pointwise'' (in ''θ'') convergence.

A particular example of a '''uniform law of large numbers''' states the conditions under which the convergence happens ''uniformly'' in ''θ''. If<ref>{{harvnb|Newey|McFadden|1994|loc=Lemma 2.4}}</ref><ref>{{cite journal|doi=10.1214/aoms/1177697731|title=Asymptotic Properties of Non-Linear Least Squares Estimators|year=1969|last1=Jennrich|first1=Robert I.|journal=The Annals of Mathematical Statistics|volume=40|issue=2|pages=633–643|doi-access=free}}</ref>

# ''Θ'' is compact,
# ''f''(''x'',''θ'') is continuous at each ''θ'' ∈ Θ for [[Almost everywhere|almost all]] ''x'', and is a measurable function of ''x'' at each ''θ'', and
# there exists a [[Dominated convergence theorem|dominating]] function ''d''(''x'') such that E[''d''(''X'')] < ∞, and <math display="block"> \left\| f(x,\theta) \right\| \leq d(x) \quad\text{for all}\ \theta\in\Theta.</math>

Then E[''f''(''X'',''θ'')] is continuous in ''θ'', and

<math display="block">
    \sup_{\theta\in\Theta} \left\| \frac 1 n \sum_{i=1}^n f(X_i,\theta) - \operatorname{E}[f(X,\theta)] \right\| \overset{\mathrm{P}}{\rightarrow} \ 0.
  </math>

This result is useful to derive consistency of a large class of estimators (see [[Extremum estimator]]).
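
A simulation can illustrate the uniform convergence. The sketch below rests on assumptions chosen purely for illustration: ''f''(''x'',''θ'') = cos(''θx''), ''X'' standard normal (so that E[''f''(''X'',''θ'')] = exp(−''θ''<sup>2</sup>/2)), Θ = [0, 2], and dominating function ''d''(''x'') = 1. The supremum is approximated on a finite grid, and the name <code>sup_deviation</code> is hypothetical.

```python
import math
import random


def sup_deviation(n, grid_size=41, seed=0):
    """Grid approximation of sup over theta in [0, 2] of
    |mean of cos(theta * X_i) - exp(-theta**2 / 2)| for n standard normal draws."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    worst = 0.0
    for j in range(grid_size):
        theta = 2.0 * j / (grid_size - 1)
        sample_mean = sum(math.cos(theta * x) for x in xs) / n
        worst = max(worst, abs(sample_mean - math.exp(-theta**2 / 2)))
    return worst


if __name__ == "__main__":
    for n in (100, 1000, 20000):
        # The worst-case error over the whole grid should shrink as n grows.
        print(n, sup_deviation(n))
```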

===Borel's law of large numbers===
'''Borel's law of large numbers''', named after [[Émile Borel]], states that if an experiment is repeated a large number of times, independently under identical conditions, then the proportion of times that any specified event is expected to occur approximately equals the probability of the event's occurrence on any particular trial; the larger the number of repetitions, the better the approximation tends to be. More precisely, if ''E'' denotes the event in question, ''p'' its probability of occurrence, and ''N<sub>n</sub>''(''E'') the number of times ''E'' occurs in the first ''n'' trials, then with probability one,<ref>{{cite journal | url=https://www.jstor.org/stable/2323947 | jstor=2323947 | doi=10.2307/2323947 | last1=Wen | first1=Liu | title=An Analytic Technique to Prove Borel's Strong Law of Large Numbers | journal=The American Mathematical Monthly | date=1991 | volume=98 | issue=2 | pages=146–148 }}</ref>
<math display="block"> \frac{N_n(E)}{n}\to p\text{ as }n\to\infty.</math>

This theorem makes rigorous the intuitive notion of probability as the expected long-run relative frequency of an event's occurrence.  It is a special case of any of several more general laws of large numbers in probability theory.
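
As a simple illustration, the relative frequency ''N<sub>n</sub>''(''E'')/''n'' can be simulated for a concrete event. The sketch below assumes ''E'' is "a fair die shows a six", so ''p'' = 1/6; the function name and the seed are illustrative choices.

```python
import random


def relative_frequency(n, seed=0):
    """Fraction of n simulated fair-die rolls that show a six (the event E)."""
    rng = random.Random(seed)
    count = sum(1 for _ in range(n) if rng.randint(1, 6) == 6)
    return count / n


if __name__ == "__main__":
    for n in (100, 10000, 200000):
        # The relative frequency should settle near p = 1/6.
        print(n, relative_frequency(n), 1 / 6)
```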

'''[[Chebyshev's inequality]]'''. Let ''X'' be a [[random variable]] with finite [[expected value]] ''μ'' and finite non-zero [[variance]] ''σ''<sup>2</sup>. Then for any [[real number]] {{math|''k'' > 0}},

<math display="block">
    \Pr(|X-\mu|\geq k\sigma) \leq \frac{1}{k^2}.
</math>
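
The inequality can be checked against a distribution whose tail probabilities are known in closed form. The following sketch assumes an exponential distribution with parameter 1, for which ''μ'' = ''σ'' = 1; the function name <code>exponential_tail</code> is hypothetical.

```python
import math


def exponential_tail(k):
    """Exact Pr(|X - 1| >= k) for X ~ Exponential(1), where mu = sigma = 1."""
    upper = math.exp(-(1 + k))  # Pr(X >= 1 + k)
    # Pr(X <= 1 - k) is zero once 1 - k <= 0, because X is non-negative.
    lower = 1 - math.exp(-(1 - k)) if k < 1 else 0.0
    return upper + lower


if __name__ == "__main__":
    for k in (0.5, 1, 1.5, 2, 3):
        # The exact tail probability never exceeds the Chebyshev bound 1 / k**2.
        print(k, exponential_tail(k), 1 / k**2)
```

For this distribution the bound is far from tight, which is typical: Chebyshev's inequality holds for every distribution with finite variance and is therefore conservative for any particular one.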