=== Dimensionality reduction ===
The transformation '''T''' = '''X''' '''W''' maps a data vector '''x'''<sub>(''i'')</sub> from an original space of ''p'' variables to a new space of ''p'' variables which are uncorrelated over the dataset.

To non-dimensionalize the centered data, let ''X<sub>c</sub>'' represent the characteristic values of data vectors ''X<sub>i</sub>'', given by:
* <math>\|X\|_{\infty}</math> (maximum norm),
* <math>\frac{1}{n} \|X\|_1</math> (mean absolute value), or
* <math>\frac{1}{\sqrt{n}} \|X\|_2</math> (normalized Euclidean norm),
for a dataset of size ''n''. These norms are used to transform the original space of variables ''x'', ''y'' to a new space of uncorrelated variables ''p'', ''q'' (with ''Y<sub>c</sub>'' defined analogously), such that <math>p_i = \frac{X_i}{X_c}, \quad q_i = \frac{Y_i}{Y_c}</math>; the new variables are assumed to be linearly related as <math>q = \alpha p</math>. To find the optimal linear relationship, we minimize the total squared perpendicular (reconstruction) error <math>E(\alpha) = \frac{1}{1 + \alpha^2} \sum_{i=1}^{n} (\alpha p_i - q_i)^2</math>; setting the derivative of the error function to zero <math>(E'(\alpha) = 0)</math> yields <math>\alpha = \frac{1}{2} \left( -\lambda \pm \sqrt{\lambda^2 + 4} \right)</math>, where <math>\lambda = \frac{p \cdot p - q \cdot q}{p \cdot q}</math>.<ref name="Holmes2023" />

[[File:PCA of Haplogroup J using 37 STRs.png|thumb|right|A principal components analysis scatterplot of [[Y-STR]] [[haplotype]]s calculated from repeat-count values for 37 Y-chromosomal STR markers from 354 individuals.<br /> PCA has successfully found linear combinations of the markers that separate out different clusters corresponding to different lines of individuals' Y-chromosomal genetic descent.]]

Such [[dimensionality reduction]] can be a very useful step for visualising and processing high-dimensional datasets, while still retaining as much of the variance in the dataset as possible. For example, selecting ''L'' = 2 and keeping only the first two principal components finds the two-dimensional plane through the high-dimensional dataset in which the data is most spread out, so if the data contains [[Cluster analysis|clusters]] these too may be most spread out, and therefore most visible when plotted in a two-dimensional diagram; whereas if two directions through the data (or two of the original variables) are chosen at random, the clusters may be much less spread apart from each other, and may in fact be much more likely to substantially overlay each other, making them indistinguishable.

Similarly, in [[regression analysis]], the larger the number of [[explanatory variable]]s allowed, the greater is the chance of [[overfitting]] the model, producing conclusions that fail to generalise to other datasets. One approach, especially when there are strong correlations between different possible explanatory variables, is to reduce them to a few principal components and then run the regression against them, a method called [[principal component regression]].

Dimensionality reduction may also be appropriate when the variables in a dataset are noisy. If each column of the dataset contains independent identically distributed Gaussian noise, then the columns of '''T''' will also contain similarly identically distributed Gaussian noise (such a distribution is invariant under the effects of the matrix '''W''', which can be thought of as a high-dimensional rotation of the co-ordinate axes).
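This invariance can be checked numerically. The following is a minimal NumPy sketch; the dimensions, the noise level, and the randomly generated orthogonal matrix standing in for '''W''' are illustrative assumptions rather than values taken from the text.

<syntaxhighlight lang="python">
import numpy as np

# Numerical check that isotropic Gaussian noise keeps the same distribution
# after multiplication by an orthogonal loading matrix W.
# (Sizes, sigma, and the random W are illustrative assumptions.)
rng = np.random.default_rng(0)
n, p, sigma = 100_000, 5, 0.3

E = sigma * rng.standard_normal((n, p))            # i.i.d. Gaussian noise in each column
W, _ = np.linalg.qr(rng.standard_normal((p, p)))   # random orthogonal matrix, standing in for the PCA loadings

# The rotated noise E @ W still has covariance close to sigma^2 * I,
# i.e. independent, identically distributed Gaussian noise in each column.
print(np.round(np.cov(E @ W, rowvar=False), 3))
print(sigma**2)
</syntaxhighlight>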
However, with more of the total variance concentrated in the first few principal components compared to the same noise variance, the proportionate effect of the noise is less: the first few components achieve a higher [[signal-to-noise ratio]]. PCA thus can have the effect of concentrating much of the signal into the first few principal components, which can usefully be captured by dimensionality reduction; while the later principal components may be dominated by noise, and so disposed of without great loss. If the dataset is not too large, the significance of the principal components can be tested using [[Bootstrapping (statistics)#Parametric bootstrap|parametric bootstrap]], as an aid in determining how many principal components to retain.<ref>{{Cite journal |author=Forkman, J.; Josse, J.; Piepho, H. P. |year=2019 |title=Hypothesis tests for principal component analysis when variables are standardized |journal=Journal of Agricultural, Biological, and Environmental Statistics |volume=24 |issue=2 |pages=289–308 |doi=10.1007/s13253-019-00355-5 |doi-access=free |bibcode=2019JABES..24..289F}}</ref>
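As a rough illustration of how such a retention decision might be automated, the sketch below compares the observed eigenvalues of the correlation matrix with eigenvalues simulated under a null model of uncorrelated Gaussian variables. It is a simplified parametric-bootstrap-style check in the spirit of the cited test, not the exact procedure of Forkman et al.; the function name, data, and all parameter values are illustrative assumptions.

<syntaxhighlight lang="python">
import numpy as np

def n_components_to_retain(X, n_sim=500, level=0.95, seed=0):
    """Simplified parametric-bootstrap-style check (illustrative, not the exact
    test of the cited reference): keep leading components whose correlation-matrix
    eigenvalue exceeds the `level` quantile of eigenvalues simulated under a null
    model of uncorrelated standard-normal variables."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    obs = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]

    sim = np.empty((n_sim, p))
    for b in range(n_sim):
        Z = rng.standard_normal((n, p))              # data drawn under the null model
        sim[b] = np.sort(np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False)))[::-1]
    thresholds = np.quantile(sim, level, axis=0)

    k = 0
    while k < p and obs[k] > thresholds[k]:          # stop at the first non-significant component
        k += 1
    return k

# Example: two strong components buried in noise (illustrative data).
rng = np.random.default_rng(1)
scores = rng.standard_normal((300, 2)) @ rng.normal(size=(2, 6))
X = scores + 0.2 * rng.standard_normal((300, 6))
print(n_components_to_retain(X))                     # typically reports 2 for data like this
</syntaxhighlight>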