== Properties and limitations ==

=== Properties ===
Some properties of PCA include:<ref name="Jolliffe2002"/>{{page needed|date=November 2020}}

:<big>'''''Property 1'':'''</big> For any integer ''q'', 1 ≤ ''q'' ≤ ''p'', consider the orthogonal [[linear transformation]]
::<math>y =\mathbf{B'}x</math>
:where <math>y</math> is a ''q''-element vector and <math>\mathbf{B'}</math> is a (''q'' × ''p'') matrix, and let <math>\mathbf{{\Sigma}}_y = \mathbf{B'}\mathbf{\Sigma}\mathbf{B}</math> be the [[variance]]-[[covariance]] matrix for <math>y</math>. Then the trace of <math>\mathbf{\Sigma}_y</math>, denoted <math>\operatorname{tr} (\mathbf{\Sigma}_y)</math>, is maximized by taking <math>\mathbf{B} = \mathbf{A}_q</math>, where <math>\mathbf{A}_q</math> consists of the first ''q'' columns of <math>\mathbf{A}</math>, the orthogonal matrix whose ''k''th column is the eigenvector <math>\alpha_k</math> of <math>\mathbf{\Sigma}</math> corresponding to its ''k''th largest eigenvalue <math>\lambda_k</math>; here <math>\mathbf{B'}</math> is the transpose of <math>\mathbf{B}</math>.

:<big>'''''Property 2'':'''</big> Consider again the [[orthonormal transformation]]
::<math>y = \mathbf{B'}x</math>
:with <math>x, \mathbf{B}, \mathbf{A}</math> and <math>\mathbf{\Sigma}_y</math> defined as before. Then <math>\operatorname{tr}(\mathbf{\Sigma}_y)</math> is minimized by taking <math>\mathbf{B} = \mathbf{A}_q^*,</math> where <math>\mathbf{A}_q^*</math> consists of the last ''q'' columns of <math>\mathbf{A}</math>.

:The statistical implication of this property is that the last few PCs are not simply unstructured left-overs after removing the important PCs. Because these last PCs have variances as small as possible, they are useful in their own right. They can help to detect unsuspected near-constant linear relationships between the elements of {{mvar|x}}, and they may also be useful in [[regression analysis|regression]], in selecting a subset of variables from {{mvar|x}}, and in outlier detection.

:<big>'''''Property 3'':'''</big> (Spectral decomposition of {{math|'''Σ'''}})
::<math>\mathbf{{\Sigma}} = \lambda_1\alpha_1\alpha_1' + \cdots + \lambda_p\alpha_p\alpha_p'</math>

Before looking at how this result is used, first consider the [[diagonal]] elements,

:<math>\operatorname{Var}(x_j) = \sum_{k=1}^p \lambda_k\alpha_{kj}^2</math>

Then, perhaps the main statistical implication of the result is that not only can we decompose the combined variances of all the elements of {{mvar|x}} into decreasing contributions due to each PC, but we can also decompose the whole [[covariance matrix]] into contributions <math>\lambda_k\alpha_k\alpha_k'</math> from each PC. Although not strictly decreasing, the elements of <math>\lambda_k\alpha_k\alpha_k'</math> will tend to become smaller as <math>k</math> increases, because <math>\lambda_k</math> decreases for increasing <math>k</math>, whereas the elements of <math>\alpha_k</math> tend to stay about the same size because of the normalization constraints: <math>\alpha_{k}'\alpha_{k}=1, k=1, \dots, p</math>.
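These decompositions can be checked numerically. The following is a minimal sketch (not taken from the cited sources; it assumes [[NumPy]] and uses an arbitrary simulated covariance matrix as a stand-in for <math>\mathbf{\Sigma}</math>) that reconstructs <math>\mathbf{\Sigma}</math> from the rank-one contributions <math>\lambda_k\alpha_k\alpha_k'</math>, verifies the diagonal identity <math>\operatorname{Var}(x_j) = \sum_{k=1}^p \lambda_k\alpha_{kj}^2</math>, and illustrates the trace-maximization statement of Property 1:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary covariance matrix (p = 5) standing in for Sigma -- purely illustrative.
X = rng.standard_normal((200, 5))
Sigma = np.cov(X, rowvar=False)

# Eigendecomposition: the columns of A are the eigenvectors alpha_k of Sigma,
# reordered so that the eigenvalues lambda_k are decreasing.
lambdas, A = np.linalg.eigh(Sigma)
order = np.argsort(lambdas)[::-1]
lambdas, A = lambdas[order], A[:, order]

# Property 3: Sigma is the sum of the rank-one contributions lambda_k * alpha_k alpha_k'.
Sigma_rebuilt = sum(lambdas[k] * np.outer(A[:, k], A[:, k]) for k in range(len(lambdas)))
assert np.allclose(Sigma, Sigma_rebuilt)

# Diagonal elements: Var(x_j) = sum_k lambda_k * alpha_kj^2.
assert np.allclose(np.diag(Sigma), (lambdas * A**2).sum(axis=1))

# Property 1 (illustration): tr(B' Sigma B) is largest when B holds the first q eigenvectors.
q = 2
B_random, _ = np.linalg.qr(rng.standard_normal((5, q)))   # an arbitrary orthonormal (p x q) B
assert np.trace(A[:, :q].T @ Sigma @ A[:, :q]) >= np.trace(B_random.T @ Sigma @ B_random)
</syntaxhighlight>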
=== Limitations ===
As noted above, the results of PCA depend on the scaling of the variables. This can be cured by scaling each feature by its standard deviation, so that one ends up with dimensionless features with unit variance.<ref name=Leznik>Leznik, M; Tofallis, C. 2005 [https://uhra.herts.ac.uk/bitstream/handle/2299/715/S56.pdf Estimating Invariant Principal Components Using Diagonal Regression.]</ref>

The applicability of PCA as described above is limited by certain (tacit) assumptions<ref>Jonathon Shlens, [https://arxiv.org/abs/1404.1100 A Tutorial on Principal Component Analysis.]</ref> made in its derivation. In particular, PCA can capture linear correlations between the features but fails when this assumption is violated (see Figure 6a in the reference). In some cases, coordinate transformations can restore the linearity assumption and PCA can then be applied (see [[Kernel principal component analysis|kernel PCA]]).

Another limitation is the mean-removal process before constructing the covariance matrix for PCA. In fields such as astronomy, all the signals are non-negative, and the mean-removal process will force the mean of some astrophysical exposures to be zero, which consequently creates unphysical negative fluxes,<ref name="soummer12"/> and forward modeling has to be performed to recover the true magnitude of the signals.<ref name="pueyo16">{{Cite journal|arxiv= 1604.06097 |last1= Pueyo|first1= Laurent |title= Detection and Characterization of Exoplanets using Projections on Karhunen Loeve Eigenimages: Forward Modeling |journal= The Astrophysical Journal |volume= 824|issue= 2|pages= 117|year= 2016|doi= 10.3847/0004-637X/824/2/117|bibcode = 2016ApJ...824..117P|s2cid= 118349503|doi-access= free}}</ref> As an alternative method, [[non-negative matrix factorization]], which focuses only on the non-negative elements in the matrices, is well suited for astrophysical observations.<ref name="blantonRoweis07"/><ref name="zhu16"/><ref name="ren18"/> See more at [[#Non-negative matrix factorization|the relation between PCA and non-negative matrix factorization]].

PCA is at a disadvantage if the data have not been standardized before applying the algorithm (a minimal sketch of such standardization is given below). Because PCA re-expresses the data in terms of its principal components, the new variables are linear combinations of the original variables and cannot be interpreted in the same way; in addition, if the number of retained components is chosen poorly, substantial information can be lost.<ref>{{cite web | title=What are the Pros and cons of the PCA? | website=i2tutorials | date=September 1, 2019 | url=https://www.i2tutorials.com/what-are-the-pros-and-cons-of-the-pca/ | access-date=June 4, 2021}}</ref> PCA also relies on a linear model: if a dataset has a nonlinear pattern hidden inside it, PCA can steer the analysis in entirely the wrong direction.<ref name=abbott>{{cite book | title=Applied Predictive Analytics | last=Abbott | first=Dean | isbn=9781118727966 | date=May 2014 | publisher=Wiley}}</ref>{{Page needed|date=June 2021}}

Researchers at Kansas State University found that PCA could be "seriously biased if the autocorrelation structure of the data is not correctly handled", with sampling error in their experiments affecting the bias of the PCA results: "If the number of subjects or blocks is smaller than 30, and/or the researcher is interested in PC's beyond the first, it may be better to first correct for the serial correlation, before PCA is conducted".<ref name=jiang>{{cite journal| title=Bias in Principal Components Analysis Due to Correlated Observations| url=https://newprairiepress.org/agstatconference/2000/proceedings/13/ |last1=Jiang | first1=Hong| last2=Eskridge | first2=Kent M.| year=2000 | journal=Conference on Applied Statistics in Agriculture |issn=2475-7772| doi=10.4148/2475-7772.1247| doi-access=free}}</ref>
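As a concrete illustration of the standardization issue noted above, the following minimal sketch (not taken from the cited sources; it assumes NumPy, and the function name and simulated data are purely illustrative) z-scores each feature to zero mean and unit variance before the eigendecomposition, so that a feature with a large numeric range cannot dominate the leading components simply because of its units:

<syntaxhighlight lang="python">
import numpy as np

def pca_standardized(X, n_components):
    """PCA on z-scored data: each column of X (rows = observations) is scaled to unit variance."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # dimensionless, unit-variance features
    cov = np.cov(Z, rowvar=False)                      # equals the correlation matrix of X
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]                  # components sorted by decreasing variance
    components = eigvecs[:, order[:n_components]]
    scores = Z @ components                            # coordinates of the observations in PC space
    return scores, components, eigvals[order]

# Without standardization, a feature that merely has larger units (e.g. grams instead of
# kilograms) would dominate the leading principal component purely because of its scale.
rng = np.random.default_rng(1)
X = np.column_stack([rng.normal(0.0, 1000.0, 500),   # large-scale feature
                     rng.normal(0.0, 1.0, 500),
                     rng.normal(0.0, 5.0, 500)])
scores, components, variances = pca_standardized(X, n_components=2)
</syntaxhighlight>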
=== PCA and information theory ===
Dimensionality reduction results in a loss of information, in general. PCA-based dimensionality reduction tends to minimize that information loss, under certain signal and noise models.

Under the assumption that

:<math>\mathbf{x}=\mathbf{s}+\mathbf{n},</math>

that is, that the data vector <math>\mathbf{x}</math> is the sum of the desired information-bearing signal <math>\mathbf{s}</math> and a noise signal <math>\mathbf{n}</math>, one can show that PCA can be optimal for dimensionality reduction, from an information-theoretic point of view.

In particular, Linsker showed that if <math>\mathbf{s}</math> is Gaussian and <math>\mathbf{n}</math> is Gaussian noise with a covariance matrix proportional to the identity matrix, the PCA maximizes the [[mutual information]] <math>I(\mathbf{y};\mathbf{s})</math> between the desired information <math>\mathbf{s}</math> and the dimensionality-reduced output <math>\mathbf{y}=\mathbf{W}_L^T\mathbf{x}</math>.<ref>{{cite journal|last=Linsker|first=Ralph|title=Self-organization in a perceptual network|journal=IEEE Computer|date=March 1988|volume=21|issue=3|pages=105–117|doi=10.1109/2.36|s2cid=1527671}}</ref>

If the noise is still Gaussian and has a covariance matrix proportional to the identity matrix (that is, the components of the vector <math>\mathbf{n}</math> are [[iid]]), but the information-bearing signal <math>\mathbf{s}</math> is non-Gaussian (which is a common scenario), PCA at least minimizes an upper bound on the ''information loss'', which is defined as<ref>{{cite book|last1=Deco|last2=Obradovic|title=An Information-Theoretic Approach to Neural Computing|year=1996|publisher=Springer|location=New York, NY|url=https://books.google.com/books?id=z4XTBwAAQBAJ|isbn=9781461240167}}</ref><ref>{{cite book |last=Plumbley|first=Mark|title=Information theory and unsupervised neural networks|year=1991}} Tech Note</ref>

:<math>I(\mathbf{x};\mathbf{s}) - I(\mathbf{y};\mathbf{s}).</math>

The optimality of PCA is also preserved if the noise <math>\mathbf{n}</math> is iid and at least more Gaussian (in terms of the [[Kullback–Leibler divergence]]) than the information-bearing signal <math>\mathbf{s}</math>.<ref>{{cite journal|last=Geiger|first=Bernhard|author2=Kubin, Gernot|title=Signal Enhancement as Minimization of Relevant Information Loss|journal=Proc. ITG Conf. On Systems, Communication and Coding|date=January 2013|arxiv=1205.6935|bibcode=2012arXiv1205.6935G}}</ref> In general, even if the above signal model holds, PCA loses its information-theoretic optimality as soon as the noise <math>\mathbf{n}</math> becomes dependent.
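For the fully Gaussian case considered by Linsker, the maximized mutual information can be written out explicitly. The following is a brief sketch rather than a statement from the cited papers: assume <math>\mathbf{s}\sim\mathcal{N}(\mathbf{0},\mathbf{\Sigma}_s)</math> and <math>\mathbf{n}\sim\mathcal{N}(\mathbf{0},\sigma^2\mathbf{I})</math> independent of <math>\mathbf{s}</math>, let <math>\mathbf{W}_L</math> have orthonormal columns, and write <math>h(\cdot)</math> for differential entropy (the symbols <math>\mathbf{\Sigma}_s</math> and <math>\sigma^2</math> are introduced here for illustration). Then

:<math>I(\mathbf{y};\mathbf{s}) = h(\mathbf{y}) - h(\mathbf{y}\mid\mathbf{s}) = \tfrac{1}{2}\log\frac{\det\left(\mathbf{W}_L^T(\mathbf{\Sigma}_s + \sigma^2\mathbf{I})\mathbf{W}_L\right)}{\det\left(\sigma^2\,\mathbf{W}_L^T\mathbf{W}_L\right)} = \tfrac{1}{2}\log\det\left(\mathbf{I} + \sigma^{-2}\,\mathbf{W}_L^T\mathbf{\Sigma}_s\mathbf{W}_L\right),</math>

which is maximized over matrices with orthonormal columns by taking the columns of <math>\mathbf{W}_L</math> to be the leading eigenvectors of <math>\mathbf{\Sigma}_s</math>; since the data covariance <math>\mathbf{\Sigma}_s + \sigma^2\mathbf{I}</math> has the same eigenvectors, this coincides with the subspace chosen by PCA.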