=== PCA and information theory ===
Dimensionality reduction results in a loss of information, in general. PCA-based dimensionality reduction tends to minimize that information loss, under certain signal and noise models.

Under the assumption that
:<math>\mathbf{x}=\mathbf{s}+\mathbf{n},</math>
that is, that the data vector <math>\mathbf{x}</math> is the sum of the desired information-bearing signal <math>\mathbf{s}</math> and a noise signal <math>\mathbf{n}</math>, one can show that PCA can be optimal for dimensionality reduction, from an information-theoretic point of view.

In particular, Linsker showed that if <math>\mathbf{s}</math> is Gaussian and <math>\mathbf{n}</math> is Gaussian noise with a covariance matrix proportional to the identity matrix, then PCA maximizes the [[mutual information]] <math>I(\mathbf{y};\mathbf{s})</math> between the desired information <math>\mathbf{s}</math> and the dimensionality-reduced output <math>\mathbf{y}=\mathbf{W}_L^T\mathbf{x}</math>.<ref>{{cite journal|last=Linsker|first=Ralph|title=Self-organization in a perceptual network|journal=IEEE Computer|date=March 1988|volume=21|issue=3|pages=105–117|doi=10.1109/2.36|s2cid=1527671}}</ref>

If the noise is still Gaussian and has a covariance matrix proportional to the identity matrix (that is, the components of the vector <math>\mathbf{n}</math> are [[iid]]), but the information-bearing signal <math>\mathbf{s}</math> is non-Gaussian (which is a common scenario), PCA at least minimizes an upper bound on the ''information loss'', which is defined as<ref>{{cite book|last=Deco & Obradovic|title=An Information-Theoretic Approach to Neural Computing|year=1996|publisher=Springer|location=New York, NY|url=https://books.google.com/books?id=z4XTBwAAQBAJ|isbn=9781461240167}}</ref><ref>{{cite book|last=Plumbley|first=Mark|title=Information theory and unsupervised neural networks|year=1991}}Tech Note</ref>
:<math>I(\mathbf{x};\mathbf{s}) - I(\mathbf{y};\mathbf{s}).</math>

The optimality of PCA is also preserved if the noise <math>\mathbf{n}</math> is iid and at least more Gaussian (in terms of the [[Kullback–Leibler divergence]]) than the information-bearing signal <math>\mathbf{s}</math>.<ref>{{cite journal|last=Geiger|first=Bernhard|author2=Kubin, Gernot|title=Signal Enhancement as Minimization of Relevant Information Loss|journal=Proc. ITG Conf. On Systems, Communication and Coding|date=January 2013|arxiv=1205.6935|bibcode=2012arXiv1205.6935G}}</ref> In general, even if the above signal model holds, PCA loses its information-theoretic optimality as soon as the noise <math>\mathbf{n}</math> becomes dependent.
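The Gaussian case can be illustrated numerically. The following is a minimal sketch (not part of the original article) that assumes a hypothetical anisotropic Gaussian signal, isotropic Gaussian noise <math>\mathbf{n}\sim\mathcal{N}(0,\sigma^2 I)</math>, and orthonormal projections; the dimensions, noise variance, and function names are illustrative choices. It uses the closed-form Gaussian mutual information <math>I(\mathbf{y};\mathbf{s}) = \tfrac{1}{2}\left[\log\det\operatorname{Cov}(\mathbf{y}) - \log\det\operatorname{Cov}(\mathbf{y}\mid\mathbf{s})\right]</math> to compare the PCA projection with a random rank-<math>L</math> projection; the PCA projection should yield the larger value.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: x = s + n with anisotropic Gaussian signal s
# and isotropic Gaussian noise n (covariance proportional to the identity).
d, L, sigma2 = 10, 3, 0.5                      # ambient dim, reduced dim, noise variance
A = rng.standard_normal((d, d))
Sigma_s = A @ A.T / d                          # signal covariance (positive definite)
Sigma_x = Sigma_s + sigma2 * np.eye(d)         # covariance of x = s + n

def mutual_information(W):
    """I(y; s) in nats for y = W^T x with jointly Gaussian (s, n).
    Uses I(y;s) = 0.5 * [log det Cov(y) - log det Cov(y | s)],
    where Cov(y | s) = sigma2 * W^T W."""
    Sigma_y = W.T @ Sigma_x @ W
    Sigma_y_given_s = sigma2 * (W.T @ W)
    _, logdet_y = np.linalg.slogdet(Sigma_y)
    _, logdet_c = np.linalg.slogdet(Sigma_y_given_s)
    return 0.5 * (logdet_y - logdet_c)

# PCA projection W_L: top-L eigenvectors of the covariance of x
eigvals, eigvecs = np.linalg.eigh(Sigma_x)     # ascending eigenvalues
W_pca = eigvecs[:, ::-1][:, :L]                # columns = leading principal directions

# A random orthonormal rank-L projection for comparison
Q, _ = np.linalg.qr(rng.standard_normal((d, L)))

print("I(y;s), PCA projection:   ", mutual_information(W_pca))
print("I(y;s), random projection:", mutual_information(Q))
</syntaxhighlight>

Because <math>\operatorname{Cov}(\mathbf{y}\mid\mathbf{s}) = \sigma^2 I_L</math> for any orthonormal <math>\mathbf{W}</math> in this model, maximizing <math>I(\mathbf{y};\mathbf{s})</math> reduces to maximizing <math>\det(\mathbf{W}^T\Sigma_x\mathbf{W})</math>, which the leading eigenvectors achieve.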