==Remarks==

===Proof of classical CLT===
The central limit theorem has a proof using [[characteristic function (probability theory)|characteristic functions]].<ref>{{cite book|url=https://jhupbooks.press.jhu.edu/content/introduction-stochastic-processes-physics|title=An Introduction to Stochastic Processes in Physics|publisher=Johns Hopkins University Press|year=2003 |doi=10.56021/9780801868665 |access-date=2016-08-11 |last1=Lemons |first1=Don |isbn=9780801876387 }}</ref> It is similar to the proof of the (weak) [[Proof of the law of large numbers|law of large numbers]].

Assume <math display="inline">\{X_1, \ldots, X_n, \ldots \}</math> are independent and identically distributed random variables, each with mean <math display="inline">\mu</math> and finite variance {{nowrap|<math display="inline">\sigma^2</math>.}} The sum <math display="inline">X_1 + \cdots + X_n</math> has [[Linearity of expectation|mean]] <math display="inline">n\mu</math> and [[Variance#Sum of uncorrelated variables (Bienaymé formula)|variance]] {{nowrap|<math display="inline">n\sigma^2</math>.}} Consider the random variable
<math display="block">Z_n = \frac{X_1+\cdots+X_n - n \mu}{\sqrt{n \sigma^2}} = \sum_{i=1}^n \frac{X_i - \mu}{\sqrt{n \sigma^2}} = \sum_{i=1}^n \frac{1}{\sqrt{n}} Y_i,</math>
where in the last step we defined the new random variables {{nowrap|<math display="inline">Y_i = \frac{X_i - \mu}{\sigma} </math>,}} each with zero mean and unit variance {{nowrap|(<math display="inline">\operatorname{var}(Y_i) = 1</math>).}} The [[Characteristic function (probability theory)|characteristic function]] of <math display="inline">Z_n</math> is given by
<math display="block">\varphi_{Z_n}\!(t) = \varphi_{\sum_{i=1}^n {\frac{1}{\sqrt{n}}Y_i}}\!(t) \ =\ \varphi_{Y_1}\!\!\left(\frac{t}{\sqrt{n}}\right) \varphi_{Y_2}\!\! \left(\frac{t}{\sqrt{n}}\right)\cdots \varphi_{Y_n}\!\! \left(\frac{t}{\sqrt{n}}\right) \ =\ \left[\varphi_{Y_1}\!\!\left(\frac{t}{\sqrt{n}}\right)\right]^n, </math>
where in the second step we used the independence of the <math display="inline">Y_i</math>, and in the last step the fact that all of the <math display="inline">Y_i</math> are identically distributed.

The characteristic function of <math display="inline">Y_1</math> is, by [[Taylor's theorem]],
<math display="block">\varphi_{Y_1}\!\left(\frac{t}{\sqrt{n}}\right) = 1 - \frac{t^2}{2n} + o\!\left(\frac{t^2}{n}\right), \quad \frac{t}{\sqrt{n}} \to 0,</math>
where <math display="inline">o(t^2 / n)</math> is "[[Little-o notation|little {{mvar|o}} notation]]" for some function of <math display="inline">t</math> that goes to zero more rapidly than {{nowrap|<math display="inline">t^2 / n</math>.}} By the limit of the [[exponential function]] {{nowrap|(<math display="inline">e^x = \lim_{n \to \infty} \left(1 + \frac{x}{n}\right)^n</math>),}} the characteristic function of <math>Z_n</math> equals
<math display="block">\varphi_{Z_n}(t) = \left(1 - \frac{t^2}{2n} + o\left(\frac{t^2}{n}\right) \right)^n \rightarrow e^{-\frac{1}{2} t^2}, \quad n \to \infty.</math>
All of the higher-order terms vanish in the limit {{nowrap|<math display="inline">n\to\infty</math>.}} The right-hand side equals the characteristic function of a standard normal distribution <math display="inline">\mathcal{N}(0, 1)</math>, which implies through [[Lévy continuity theorem|Lévy's continuity theorem]] that the distribution of <math display="inline">Z_n</math> will approach <math display="inline">\mathcal{N}(0,1)</math> as {{nowrap|<math display="inline">n\to\infty</math>.}} Therefore, the [[sample mean|sample average]]
<math display="block">\bar{X}_n = \frac{X_1+\cdots+X_n}{n}</math>
is such that
<math display="block">\frac{\sqrt{n}}{\sigma}(\bar{X}_n - \mu) = Z_n</math>
converges in distribution to the normal distribution {{nowrap|<math display="inline">\mathcal{N}(0, 1)</math>,}} from which the central limit theorem follows.
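The convergence <math display="inline">\varphi_{Z_n}(t) \to e^{-t^2/2}</math> can be checked numerically. The following is a minimal illustrative sketch in Python (not part of the proof above; it assumes NumPy is available and uses the exponential distribution as an arbitrary non-normal law with <math display="inline">\mu = \sigma = 1</math>), comparing the empirical characteristic function of <math display="inline">Z_n</math> with the normal target:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

def empirical_cf(z, t):
    """Estimate E[exp(i*t*Z)] from samples z, for each value in t."""
    return np.exp(1j * np.outer(t, z)).mean(axis=1)

mu, sigma = 1.0, 1.0              # mean and standard deviation of Exp(1)
t = np.linspace(-3.0, 3.0, 7)
target = np.exp(-t**2 / 2)        # characteristic function of N(0, 1)

for n in (1, 10, 100, 500):
    x = rng.exponential(scale=1.0, size=(20_000, n))        # 20,000 replications
    z = (x.sum(axis=1) - n * mu) / np.sqrt(n * sigma**2)    # Z_n as defined above
    err = np.abs(empirical_cf(z, t) - target).max()
    print(f"n = {n:3d}   max_t |phi_Zn(t) - exp(-t^2/2)| = {err:.4f}")
</syntaxhighlight>

The printed error shrinks toward the Monte Carlo noise floor as <math display="inline">n</math> grows, mirroring the pointwise convergence used in the proof.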
===Convergence to the limit===
The central limit theorem gives only an [[asymptotic distribution]]. As an approximation for a finite number of observations, it provides a reasonable approximation only when close to the peak of the normal distribution; it requires a very large number of observations to stretch into the tails.{{citation needed|reason=Not immediately obvious, I didn't find a source via google|date=July 2016}}

The convergence in the central limit theorem is [[uniform convergence|uniform]] because the limiting cumulative distribution function is continuous. If the third central [[Moment (mathematics)|moment]] <math display="inline">\operatorname{E}\left[(X_1 - \mu)^3\right]</math> exists and is finite, then the speed of convergence is at least on the order of <math display="inline">1 / \sqrt{n}</math> (see [[Berry–Esseen theorem]]). [[Stein's method]]<ref name="stein1972">{{Cite journal| last = Stein |first=C. |author-link=Charles Stein (statistician)| title = A bound for the error in the normal approximation to the distribution of a sum of dependent random variables| journal = Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability| pages= 583–602| year = 1972|volume=6 |issue=2 | mr=402873 | zbl = 0278.60026| url=http://projecteuclid.org/euclid.bsmsp/1200514239 }}</ref> can be used not only to prove the central limit theorem, but also to provide bounds on the rates of convergence for selected metrics.<ref>{{Cite book| title = Normal approximation by Stein's method| publisher = Springer| year = 2011|last1=Chen |first1=L. H. Y. |last2=Goldstein |first2=L. |last3=Shao |first3=Q. M. |isbn = 978-3-642-15006-7}}</ref>
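The <math display="inline">1/\sqrt{n}</math> rate can be observed empirically. A minimal Python sketch (an illustration, not a proof; it assumes NumPy and SciPy and uses Bernoulli summands, a lattice case for which the Berry–Esseen order is attained) estimates the Kolmogorov distance <math display="inline">\sup_x |F_{Z_n}(x) - \Phi(x)|</math> with the one-sample Kolmogorov–Smirnov statistic:

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu, sigma = 0.5, 0.5   # mean and standard deviation of a Bernoulli(1/2) variable

for n in (4, 16, 64, 256):
    s = rng.binomial(n, 0.5, size=200_000)     # S_n = X_1 + ... + X_n, drawn directly
    z = (s - n * mu) / np.sqrt(n * sigma**2)
    d = stats.kstest(z, "norm").statistic      # sup-distance to the N(0,1) CDF
    print(f"n = {n:3d}   sup|F - Phi| = {d:.4f}   sqrt(n)*sup|F - Phi| = {np.sqrt(n) * d:.3f}")
</syntaxhighlight>

The scaled distance <math display="inline">\sqrt{n}\,\sup_x |F_{Z_n}(x) - \Phi(x)|</math> stays roughly constant rather than shrinking, consistent with convergence of order exactly <math display="inline">1/\sqrt{n}</math> for this distribution.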
The convergence to the normal distribution is monotonic, in the sense that the [[information entropy|entropy]] of <math display="inline">Z_n</math> increases [[monotonic function|monotonically]] to that of the normal distribution.<ref name=ABBN/>

The central limit theorem applies in particular to sums of independent and identically distributed [[discrete random variable]]s. A sum of discrete random variables is still a discrete random variable, so that we are confronted with a sequence of discrete random variables whose cumulative probability distribution function converges towards a cumulative probability distribution function corresponding to a continuous variable (namely that of the [[normal distribution]]). This means that if we build a [[histogram]] of the realizations of the sum of {{mvar|n}} independent identical discrete variables, the piecewise-linear curve that joins the centers of the upper faces of the rectangles forming the histogram converges toward a Gaussian curve as {{mvar|n}} approaches infinity; this relation is known as the [[de Moivre–Laplace theorem]]. The [[binomial distribution]] article details such an application of the central limit theorem in the simple case of a discrete variable taking only two possible values.
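The binomial case can be made concrete. A minimal Python sketch (illustrative only, assuming NumPy and SciPy) compares the Binomial(<math display="inline">n, p</math>) probability mass function, exactly computed, with the matching normal density evaluated at the centers of the histogram bars:

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

p = 0.3
for n in (10, 100, 1000):
    k = np.arange(n + 1)
    pmf = stats.binom.pmf(k, n, p)
    # Normal density with the binomial's mean np and variance np(1-p),
    # evaluated at the bar centers k = 0, 1, ..., n.
    approx = stats.norm.pdf(k, loc=n * p, scale=np.sqrt(n * p * (1 - p)))
    print(f"n = {n:4d}   max_k |pmf(k) - normal density(k)| = {np.abs(pmf - approx).max():.6f}")
</syntaxhighlight>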
===Common misconceptions===
Studies have shown that the central limit theorem is subject to several common but serious misconceptions, some of which appear in widely used textbooks.<ref>{{cite journal |last=Brewer |first=J. K. |date=1985 |title=Behavioral statistics textbooks: Source of myths and misconceptions? |journal=Journal of Educational Statistics |volume=10 |issue=3 |pages=252–268|doi=10.3102/10769986010003252 |s2cid=119611584 }}</ref><ref>Yu, C.; Behrens, J.; Spencer, A. Identification of Misconception in the Central Limit Theorem and Related Concepts, ''American Educational Research Association'' lecture 19 April 1995</ref><ref>{{cite journal |last1=Sotos |first1=A. E. C. |last2=Vanhoof |first2=S. |last3=Van den Noortgate |first3=W. |last4=Onghena |first4=P. |date=2007 |title=Students' misconceptions of statistical inference: A review of the empirical evidence from research on statistics education |journal=Educational Research Review |volume=2 |issue=2 |pages=98–113|doi=10.1016/j.edurev.2007.04.001 |url=https://lirias.kuleuven.be/handle/123456789/136347 }}</ref> These include:

* The misconceived belief that the theorem applies to random sampling of any variable, rather than to the mean values (or sums) of [[iid]] random variables extracted from a population by repeated sampling. That is, the theorem assumes the random sampling produces a sampling distribution formed from different values of means (or sums) of such random variables.
* The misconceived belief that the theorem ensures that random sampling leads to the emergence of a normal distribution for sufficiently large samples of any random variable, regardless of the population distribution. In reality, such sampling asymptotically reproduces the properties of the population, an intuitive result underpinned by the [[Glivenko–Cantelli theorem]].
* The misconceived belief that the theorem leads to a good approximation of a normal distribution for sample sizes greater than around 30,<ref>{{Cite web |date=2023-06-02 |title=Sampling distribution of the sample mean |format=video |website=Khan Academy |url=https://www.khanacademy.org/math/statistics-probability/sampling-distributions-library/sample-means/v/sampling-distribution-of-the-sample-mean |access-date=2023-10-08 |archive-url=https://web.archive.org/web/20230602200310/https://www.khanacademy.org/math/statistics-probability/sampling-distributions-library/sample-means/v/sampling-distribution-of-the-sample-mean |archive-date=2 June 2023 }}</ref> allowing reliable inferences regardless of the nature of the population. In reality, this empirical rule of thumb has no valid justification, and can lead to seriously flawed inferences; the sketch following this list illustrates one failure mode. See [[Z-test]] for where the approximation holds.
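As an illustration of the last point, a minimal Python sketch (assuming NumPy and SciPy; the lognormal population is an arbitrary choice of a heavily skewed law) shows that the sampling distribution of the mean remains far from normal at <math display="inline">n = 30</math>:

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Sampling distribution of the mean of n lognormal(0, 1.5) observations,
# estimated from 10,000 replications. A normal distribution has skewness 0.
for n in (30, 100, 1000):
    means = rng.lognormal(mean=0.0, sigma=1.5, size=(10_000, n)).mean(axis=1)
    print(f"n = {n:4d}   skewness of the sample mean = {stats.skew(means):.2f}")
</syntaxhighlight>

Even at <math display="inline">n = 1000</math> the estimated skewness stays well away from the value 0 of a normal distribution, so the "<math display="inline">n > 30</math>" rule of thumb fails badly for this population.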
===Relation to the law of large numbers===
The [[law of large numbers]] as well as the central limit theorem are partial solutions to a general problem: "What is the limiting behavior of {{mvar|S<sub>n</sub>}} as {{mvar|n}} approaches infinity?" In mathematical analysis, [[asymptotic series]] are one of the most popular tools employed to approach such questions.

Suppose we have an asymptotic expansion of <math display="inline">f(n)</math>:
<math display="block">f(n)= a_1 \varphi_{1}(n)+a_2 \varphi_{2}(n)+O\big(\varphi_{3}(n)\big) \qquad (n \to \infty).</math>
Dividing both sides by {{math|''φ''<sub>1</sub>(''n'')}} and taking the limit produces {{math|''a''<sub>1</sub>}}, the coefficient of the highest-order term in the expansion, which describes the rate at which {{math|''f''(''n'')}} changes in its leading term:
<math display="block">\lim_{n\to\infty} \frac{f(n)}{\varphi_{1}(n)} = a_1.</math>
Informally, one can say: "{{math|''f''(''n'')}} grows approximately as {{math|''a''<sub>1</sub>''φ''<sub>1</sub>(''n'')}}". Taking the difference between {{math|''f''(''n'')}} and its approximation and then dividing by the next term in the expansion, we arrive at a more refined statement about {{math|''f''(''n'')}}:
<math display="block">\lim_{n\to\infty} \frac{f(n)-a_1 \varphi_{1}(n)}{\varphi_{2}(n)} = a_2 .</math>
Here one can say that the difference between the function and its approximation grows approximately as {{math|''a''<sub>2</sub>''φ''<sub>2</sub>(''n'')}}. The idea is that dividing the function by appropriate normalizing functions, and looking at the limiting behavior of the result, can tell us much about the limiting behavior of the original function itself.

Informally, something along these lines happens when the sum, {{mvar|S<sub>n</sub>}}, of independent identically distributed random variables, {{math|''X''<sub>1</sub>, ..., ''X<sub>n</sub>''}}, is studied in classical probability theory.{{Citation needed|date=April 2012}} If each {{mvar|X<sub>i</sub>}} has finite mean {{mvar|μ}}, then by the law of large numbers, {{math|{{sfrac|''S<sub>n</sub>''|''n''}} → ''μ''}}.<ref>{{cite book|last=Rosenthal |first=Jeffrey Seth |date=2000 |title=A First Look at Rigorous Probability Theory |publisher=World Scientific |isbn=981-02-4322-7 |at=Theorem 5.3.4, p. 47}}</ref> If in addition each {{mvar|X<sub>i</sub>}} has finite variance {{math|''σ''<sup>2</sup>}}, then by the central limit theorem,
<math display="block"> \frac{S_n-n\mu}{\sqrt{n}} \to \xi ,</math>
where {{mvar|ξ}} is distributed as {{math|''N''(0,''σ''<sup>2</sup>)}}. This provides values of the first two constants in the informal expansion
<math display="block">S_n \approx \mu n+\xi \sqrt{n}. </math>

In the case where the {{mvar|X<sub>i</sub>}} do not have finite mean or variance, convergence of the shifted and rescaled sum can also occur with different centering and scaling factors:
<math display="block">\frac{S_n-a_n}{b_n} \rightarrow \Xi,</math>
or informally
<math display="block">S_n \approx a_n+\Xi b_n. </math>
Distributions {{math|Ξ}} which can arise in this way are called ''[[stable distribution|stable]]''.<ref>{{cite book|last=Johnson |first=Oliver Thomas |date=2004 |title=Information Theory and the Central Limit Theorem |publisher=Imperial College Press |isbn= 1-86094-473-6 |page= 88}}</ref> Clearly, the normal distribution is stable, but there are also other stable distributions, such as the [[Cauchy distribution]], for which the mean and variance are not defined; the sketch below illustrates this case. The scaling factor {{mvar|b<sub>n</sub>}} may be proportional to {{mvar|n<sup>c</sup>}}, for any {{math|''c'' ≥ {{sfrac|1|2}}}}; it may also be multiplied by a [[slowly varying function]] of {{mvar|n}}.<ref name=Uchaikin>{{cite book |first1=Vladimir V. |last1=Uchaikin |first2=V.M. |last2=Zolotarev |year=1999 |title=Chance and Stability: Stable distributions and their applications |publisher=VSP |isbn=90-6764-301-7 |pages=61–62}}</ref><ref>{{cite book|last1=Borodin |first1=A. N. |last2=Ibragimov |first2=I. A. |last3=Sudakov |first3=V. N. |date=1995 |title=Limit Theorems for Functionals of Random Walks |publisher=AMS Bookstore |isbn= 0-8218-0438-3 |at=Theorem 1.1, p. 8}}</ref>

The [[law of the iterated logarithm]] specifies what is happening "in between" the [[law of large numbers]] and the central limit theorem. Specifically it says that the normalizing function {{math|{{sqrt|''n'' log log ''n''}}}}, intermediate in size between the {{mvar|n}} of the law of large numbers and the {{math|{{sqrt|''n''}}}} of the central limit theorem, provides a non-trivial limiting behavior.
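The Cauchy case can be checked directly: the average of <math display="inline">n</math> independent standard Cauchy variables is again standard Cauchy for every <math display="inline">n</math> (the stable scaling here is {{math|1=''b<sub>n</sub>'' = ''n''}}), so no amount of averaging produces a normal limit. A minimal Python sketch, assuming NumPy and SciPy:

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

for n in (1, 10, 1000):
    means = rng.standard_cauchy(size=(20_000, n)).mean(axis=1)
    # Kolmogorov-Smirnov distances: the fit to the Cauchy law stays at the
    # Monte Carlo noise floor for every n, while the fit to the normal law
    # never improves.
    d_cauchy = stats.kstest(means, "cauchy").statistic
    d_normal = stats.kstest(means, "norm").statistic
    print(f"n = {n:4d}   KS to Cauchy: {d_cauchy:.4f}   KS to normal: {d_normal:.4f}")
</syntaxhighlight>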
===Alternative statements of the theorem===

====Density functions====
The [[probability density function|density]] of the sum of two or more independent variables is the [[convolution]] of their densities (if these densities exist). Thus the central limit theorem can be interpreted as a statement about the properties of density functions under convolution: the convolution of a number of density functions tends to the normal density as the number of density functions increases without bound. These theorems require stronger hypotheses than the forms of the central limit theorem given above. Theorems of this type are often called local limit theorems. See Petrov<ref>{{Cite book|last=Petrov|first=V. V. |title=Sums of Independent Random Variables|year=1976|publisher=Springer-Verlag|location=New York-Heidelberg | isbn=9783642658099 | at=ch. 7|url=https://books.google.com/books?id=zSDqCAAAQBAJ}}</ref> for a particular local limit theorem for sums of [[independent and identically distributed random variables]].

====Characteristic functions====
Since the [[characteristic function (probability theory)|characteristic function]] of a convolution is the product of the characteristic functions of the densities involved, the central limit theorem has yet another restatement: the product of the characteristic functions of a number of density functions becomes close to the characteristic function of the normal density as the number of density functions increases without bound, under the conditions stated above. Specifically, an appropriate scaling factor needs to be applied to the argument of the characteristic function.

An equivalent statement can be made about [[Fourier transform]]s, since the characteristic function is essentially a Fourier transform.

===Calculating the variance===
Let {{mvar|S<sub>n</sub>}} be the sum of {{mvar|n}} random variables. Many central limit theorems provide conditions such that {{math|''S<sub>n</sub>''/{{sqrt|Var(''S<sub>n</sub>'')}}}} converges in distribution to {{math|''N''(0,1)}} (the normal distribution with mean 0, variance 1) as {{math|''n'' → ∞}}. In some cases, it is possible to find a constant {{math|''σ''<sup>2</sup>}} and a function {{math|''f''(''n'')}} such that {{math|''S<sub>n</sub>''/(''σ''{{sqrt|''n''⋅''f''(''n'')}})}} converges in distribution to {{math|''N''(0,1)}} as {{math|''n'' → ∞}}.

{{math theorem | name = Lemma<ref>{{cite journal|last1=Hew|first1=Patrick Chisan|title=Asymptotic distribution of rewards accumulated by alternating renewal processes|journal=Statistics and Probability Letters|date=2017|volume=129 |pages=355–359 |doi=10.1016/j.spl.2017.06.027}}</ref> | math_statement = Suppose <math>X_1, X_2, \dots</math> is a sequence of real-valued and strictly stationary random variables with <math>\operatorname E(X_i) = 0</math> for all {{nowrap|<math>i</math>,}} {{nowrap|<math>g : [0,1] \to \R</math>,}} and {{nowrap|<math>S_n = \sum_{i=1}^{n} g\left(\tfrac{i}{n}\right) X_i</math>.}} Construct
<math display="block">\sigma^2 = \operatorname E(X_1^2) + 2\sum_{i=1}^\infty \operatorname E(X_1 X_{1+i}).</math>
# If <math>\sum_{i=1}^\infty \operatorname E(X_1 X_{1+i})</math> is absolutely convergent, <math>\left| \int_0^1 g(x)g'(x) \, dx\right| < \infty</math>, and <math>0 < \int_0^1 (g(x))^2 \, dx < \infty</math>, then <math>\mathrm{Var}(S_n)/(n \gamma_n) \to \sigma^2</math> as {{nowrap|<math>n \to \infty</math>,}} where {{nowrap|<math>\gamma_n = \frac{1}{n}\sum_{i=1}^{n} \left(g\left(\tfrac{i}{n}\right)\right)^2</math>.}}
# If in addition <math>\sigma > 0</math> and <math>S_n/\sqrt{\mathrm{Var}(S_n)}</math> converges in distribution to <math>\mathcal{N}(0,1)</math> as {{nowrap|<math>n \to \infty</math>,}} then <math>S_n/(\sigma\sqrt{n \gamma_n})</math> also converges in distribution to <math>\mathcal{N}(0,1)</math> as {{nowrap|<math>n \to \infty</math>.}}
}}
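Part 1 of the lemma can be sanity-checked by simulation. A minimal Python sketch (illustrative only, assuming NumPy) uses a stationary Gaussian AR(1) sequence <math display="inline">X_t = \phi X_{t-1} + \varepsilon_t</math> with weight function {{math|1=''g''(''x'') = ''x''}}; for this process <math display="inline">\sigma^2 = 1/(1-\phi)^2</math> and <math display="inline">\gamma_n \to \int_0^1 x^2\,dx = 1/3</math>:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(4)
phi, n, reps = 0.5, 2_000, 5_000

# Long-run variance of the stationary AR(1) with Var(eps) = 1:
# sigma^2 = E[X_1^2] + 2 * sum_i E[X_1 X_{1+i}] = 1 / (1 - phi)^2 = 4.
sigma2 = 1.0 / (1.0 - phi) ** 2

g = np.arange(1, n + 1) / n            # g(i/n) with g(x) = x
gamma_n = np.mean(g**2)                # approaches 1/3

# Simulate `reps` stationary AR(1) paths of length n.
x = np.empty((reps, n))
x[:, 0] = rng.normal(scale=np.sqrt(1 / (1 - phi**2)), size=reps)  # stationary start
for t in range(1, n):
    x[:, t] = phi * x[:, t - 1] + rng.normal(size=reps)

s_n = (g * x).sum(axis=1)              # S_n = sum_i g(i/n) X_i
print(f"Var(S_n) / (n * gamma_n) = {s_n.var() / (n * gamma_n):.2f}   (sigma^2 = {sigma2:.0f})")
</syntaxhighlight>

The printed ratio lands close to 4, as part 1 of the lemma predicts for these choices of process and weight function.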