==Feature projection==
{{Main|Feature extraction}}
Feature projection (also called feature extraction) transforms the data from the [[high-dimensional space]] to a space of fewer dimensions. The data transformation may be linear, as in [[principal component analysis]] (PCA), but many [[nonlinear dimensionality reduction]] techniques also exist.<ref>Samet, H. (2006) ''Foundations of Multidimensional and Metric Data Structures''. Morgan Kaufmann. {{ISBN|0-12-369446-9}}</ref><ref>C. Ding, X. He, H. Zha, H.D. Simon, [https://escholarship.org/uc/item/8pv153t1 Adaptive Dimension Reduction for Clustering High Dimensional Data], Proceedings of International Conference on Data Mining, 2002</ref> For multidimensional data, a [[tensor representation]] can be used in dimensionality reduction through [[multilinear subspace learning]].<ref name="MSLsurvey">{{cite journal |first1=Haiping |last1=Lu |first2=K.N. |last2=Plataniotis |first3=A.N. |last3=Venetsanopoulos |url=https://www.dsp.utoronto.ca/~haiping/Publication/SurveyMSL_PR2011.pdf |title=A Survey of Multilinear Subspace Learning for Tensor Data |journal=Pattern Recognition |volume=44 |number=7 |pages=1540–1551 |year=2011 |doi=10.1016/j.patcog.2011.01.004 |bibcode=2011PatRe..44.1540L }}</ref>

[[File:PCA Projection Illustration.gif|alt=A scatterplot showing two groups of points. An axis runs through the groups. They transition into a histogram showing where each point lands in the PCA projection.|thumb|A visual depiction of the resulting PCA projection for a set of 2D points.]]

===Principal component analysis (PCA)===
{{Main|Principal component analysis}}
The main linear technique for dimensionality reduction, principal component analysis, performs a linear mapping of the data to a lower-dimensional space in such a way that the variance of the data in the low-dimensional representation is maximized. In practice, the [[covariance]] (and sometimes the [[correlation and dependence|correlation]]) [[matrix (mathematics)|matrix]] of the data is constructed and the [[eigenvalues and eigenvectors|eigenvectors]] of this matrix are computed. The eigenvectors that correspond to the largest eigenvalues (the principal components) can then be used to reconstruct a large fraction of the variance of the original data. Moreover, the first few eigenvectors can often be interpreted in terms of the large-scale physical behavior of the system, because they often contribute the vast majority of the system's energy, especially in low-dimensional systems. Still, this must be proved on a case-by-case basis, as not all systems exhibit this behavior. The original space (with dimension equal to the number of features) is thereby reduced (with data loss, but ideally retaining the most important variance) to the space spanned by a few eigenvectors.{{Citation needed|date=September 2017}}
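The covariance-eigendecomposition procedure described above can be illustrated with a minimal [[NumPy]] sketch; the synthetic data, the number of retained components, and all variable names are illustrative assumptions rather than part of any standard implementation.

<syntaxhighlight lang="python">
import numpy as np

# Illustrative data: n points in a d-dimensional feature space (assumed shapes).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))           # n = 200 points, d = 5 features
k = 2                                   # number of principal components to keep

# Center the data and form the covariance matrix of the features.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)  # d x d covariance matrix

# Eigendecomposition; sort eigenvectors by decreasing eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, order[:k]]  # top-k principal components (d x k)

# Project the centered data onto the principal components.
X_reduced = X_centered @ components      # n x k low-dimensional representation

# Fraction of the total variance retained by the first k components.
explained = eigenvalues[order[:k]].sum() / eigenvalues.sum()
</syntaxhighlight>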
===Non-negative matrix factorization (NMF)===
{{Main|Non-negative matrix factorization}}
NMF decomposes a non-negative matrix into the product of two non-negative ones, which has been a promising tool in fields where only non-negative signals exist,<ref name="lee-seung">{{cite journal |author=Daniel D. Lee |author2=H. Sebastian Seung |author2-link=Sebastian Seung |name-list-style=amp |year=1999 |title=Learning the parts of objects by non-negative matrix factorization |journal=[[Nature (journal)|Nature]] |volume=401 |issue=6755 |pages=788–791 |doi=10.1038/44565 |pmid=10548103 |bibcode=1999Natur.401..788L |s2cid=4428232 }}</ref><ref name="lee2001algorithms">{{cite conference |author1=Daniel D. Lee |author2=H. Sebastian Seung |name-list-style=amp |year=2001 |url=https://proceedings.neurips.cc/paper/2000/file/f9d1152547c0bde01830b7e8bd60024c-Paper.pdf |title=Algorithms for Non-negative Matrix Factorization |conference=Advances in Neural Information Processing Systems 13: Proceedings of the 2000 Conference |pages=556–562 |publisher=[[MIT Press]] }}</ref> such as astronomy.<ref name="blantonRoweis07">{{cite journal |arxiv=astro-ph/0606170 |last1=Blanton |first1=Michael R. |title=K-corrections and filter transformations in the ultraviolet, optical, and near infrared |journal=The Astronomical Journal |volume=133 |issue=2 |pages=734–754 |last2=Roweis |first2=Sam |year=2007 |doi=10.1086/510127 |bibcode=2007AJ....133..734B |s2cid=18561804}}</ref><ref name="ren18">{{cite journal |arxiv=1712.10317 |last1=Ren |first1=Bin |title=Non-negative Matrix Factorization: Robust Extraction of Extended Structures |journal=The Astrophysical Journal |volume=852 |issue=2 |pages=104 |last2=Pueyo |first2=Laurent |last3=Zhu |first3=Guangtun B. |last4=Duchêne |first4=Gaspard |year=2018 |doi=10.3847/1538-4357/aaa1f2 |bibcode=2018ApJ...852..104R |s2cid=3966513 |doi-access=free }}</ref> NMF is well known for the multiplicative update rule of Lee & Seung,<ref name="lee-seung"/> which has been continuously developed: the inclusion of uncertainties,<ref name="blantonRoweis07"/> the consideration of missing data and parallel computation,<ref name="zhu16">{{cite arXiv |last=Zhu |first=Guangtun B. |date=2016-12-19 |title=Nonnegative Matrix Factorization (NMF) with Heteroscedastic Uncertainties and Missing data |eprint=1612.06037 |class=astro-ph.IM}}</ref> and sequential construction,<ref name="zhu16"/> which leads to the stability and linearity of NMF,<ref name="ren18"/> as well as other [[non-negative matrix factorization|updates]] including handling missing data in [[digital image processing]].<ref name="ren20">{{cite journal |arxiv=2001.00563 |last1=Ren |first1=Bin |title=Using Data Imputation for Signal Separation in High Contrast Imaging |journal=The Astrophysical Journal |volume=892 |issue=2 |pages=74 |last2=Pueyo |first2=Laurent |last3=Chen |first3=Christine |last4=Choquet |first4=Elodie |last5=Debes |first5=John H. |last6=Duechene |first6=Gaspard |last7=Menard |first7=Francois |last8=Perrin |first8=Marshall D. |year=2020 |doi=10.3847/1538-4357/ab7024 |bibcode=2020ApJ...892...74R |s2cid=209531731 |doi-access=free }}</ref>

With a stable component basis during construction and a linear modeling process, [[non-negative matrix factorization#Sequential NMF|sequential NMF]]<ref name="zhu16"/> is able to preserve the flux in direct imaging of circumstellar structures in astronomy,<ref name="ren18"/> as one of the [[methods of detecting exoplanets]], especially for the direct imaging of [[circumstellar disc]]s. In comparison with PCA, NMF does not remove the mean of the matrices, which leads to physical non-negative fluxes; therefore NMF is able to preserve more information than PCA, as demonstrated by Ren et al.<ref name="ren18"/>

===Kernel PCA===
{{Main|Kernel principal component analysis}}
Principal component analysis can be employed in a nonlinear way by means of the [[kernel trick]]. The resulting technique, known as [[kernel principal component analysis|kernel PCA]], is capable of constructing nonlinear mappings that maximize the variance in the data.
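Both techniques can be sketched with [[scikit-learn]] as one possible implementation; the synthetic data and parameter values below are illustrative assumptions only.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.decomposition import NMF, KernelPCA

rng = np.random.default_rng(0)
X_nonneg = rng.random((100, 20))        # non-negative data, as NMF requires

# NMF: factor X ~ W @ H with W, H >= 0; W gives the low-dimensional coding.
nmf = NMF(n_components=3, init='nndsvda', max_iter=500, random_state=0)
W = nmf.fit_transform(X_nonneg)         # (100, 3) non-negative coefficients
H = nmf.components_                     # (3, 20) non-negative basis

# Kernel PCA: PCA in an implicit feature space defined by a kernel (here RBF).
X = rng.normal(size=(100, 20))
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=0.1)
X_kpca = kpca.fit_transform(X)          # (100, 2) nonlinear embedding
</syntaxhighlight>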
===Graph-based kernel PCA===
Other prominent nonlinear techniques include [[manifold learning]] techniques such as [[Isomap]], [[locally linear embedding]] (LLE),<ref>{{cite journal |last1=Roweis |first1=S. T. |last2=Saul |first2=L. K. |title=Nonlinear Dimensionality Reduction by Locally Linear Embedding |doi=10.1126/science.290.5500.2323 |journal=Science |volume=290 |issue=5500 |pages=2323–2326 |year=2000 |pmid=11125150 |bibcode=2000Sci...290.2323R |citeseerx=10.1.1.111.3313|s2cid=5987139 }}</ref> Hessian LLE, Laplacian eigenmaps, and methods based on tangent space analysis.<ref>{{cite journal |last1=Zhang |first1=Zhenyue |last2=Zha |first2=Hongyuan |date=2004 |title=Principal Manifolds and Nonlinear Dimensionality Reduction via Tangent Space Alignment |journal=SIAM Journal on Scientific Computing |volume=26 |issue=1 |pages=313–338 |doi=10.1137/s1064827502419154|bibcode=2004SJSC...26..313Z }}</ref> These techniques construct a low-dimensional data representation using a cost function that retains local properties of the data, and can be viewed as defining a graph-based kernel for kernel PCA.

More recently, techniques have been proposed that, instead of defining a fixed kernel, try to learn the kernel using [[semidefinite programming]]. The most prominent example of such a technique is [[maximum variance unfolding]] (MVU). The central idea of MVU is to exactly preserve all pairwise distances between nearest neighbors (in the inner product space) while maximizing the distances between points that are not nearest neighbors.

An alternative approach to neighborhood preservation is through the minimization of a cost function that measures differences between distances in the input and output spaces. Important examples of such techniques include: classical [[multidimensional scaling]], which is identical to PCA; [[Isomap]], which uses geodesic distances in the data space; [[diffusion map]]s, which use diffusion distances in the data space; [[t-distributed stochastic neighbor embedding]] (t-SNE), which minimizes the divergence between distributions over pairs of points; and curvilinear component analysis.

A different approach to nonlinear dimensionality reduction is through the use of [[autoencoder]]s, a special kind of [[feedforward neural network]] with a bottleneck hidden layer.<ref>Hongbing Hu, Stephen A. Zahorian, (2010) [http://ws2.binghamton.edu/zahorian/pdf/Hu2010Dimensionality.pdf "Dimensionality Reduction Methods for HMM Phonetic Recognition"], ICASSP 2010, Dallas, TX</ref> The training of deep encoders is typically performed using a greedy layer-wise pre-training (e.g., using a stack of [[restricted Boltzmann machine]]s) followed by a fine-tuning stage based on [[backpropagation]].

[[File:LDA Projection Illustration 01.gif|thumb|A visual depiction of the resulting LDA projection for a set of 2D points.]]

===Linear discriminant analysis (LDA)===
{{Main|Linear discriminant analysis}}
Linear discriminant analysis (LDA) is a generalization of Fisher's linear discriminant, a method used in statistics, pattern recognition, and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events.
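The graph-based manifold learners above and the supervised LDA projection can be sketched with [[scikit-learn]]; the dataset, neighborhood sizes, and output dimensionalities below are illustrative assumptions.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.datasets import make_classification
from sklearn.manifold import Isomap, LocallyLinearEmbedding
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Illustrative labeled data with 3 classes in a 10-dimensional feature space.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# Graph-based nonlinear embeddings that preserve local neighborhood structure.
X_isomap = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
X_lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2).fit_transform(X)

# LDA: supervised projection maximizing between-class vs. within-class scatter;
# it yields at most (number of classes - 1) output dimensions.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
</syntaxhighlight>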
===Generalized discriminant analysis (GDA)===
GDA deals with nonlinear discriminant analysis using a kernel function operator. The underlying theory is close to that of [[support-vector machine]]s (SVM), insofar as the GDA method provides a mapping of the input vectors into a high-dimensional feature space.<ref name="gda">{{cite journal |doi=10.1162/089976600300014980 |pmid=11032039 |title=Generalized Discriminant Analysis Using a Kernel Approach |journal=Neural Computation |volume=12 |issue=10 |pages=2385–2404 |year=2000 |last1=Baudat |first1=G. |last2=Anouar |first2=F. |citeseerx=10.1.1.412.760 |s2cid=7036341}}</ref><ref name="cloudid">{{cite journal |doi=10.1016/j.eswa.2015.06.025 |title=CloudID: Trustworthy cloud-based and cross-enterprise biometric identification |journal=Expert Systems with Applications |volume=42 |issue=21 |pages=7905–7916 |year=2015 |last1=Haghighat |first1=Mohammad |last2=Zonouz |first2=Saman |last3=Abdel-Mottaleb |first3=Mohamed}}</ref> Similar to LDA, the objective of GDA is to find a projection of the features into a lower-dimensional space by maximizing the ratio of between-class scatter to within-class scatter.

===Autoencoder===
{{Main|Autoencoder}}
Autoencoders can be used to learn nonlinear dimension reduction functions and codings together with an inverse function from the coding to the original representation.

===t-SNE===
{{Main|t-distributed stochastic neighbor embedding}}
T-distributed stochastic neighbor embedding (t-SNE) is a nonlinear dimensionality reduction technique useful for the visualization of high-dimensional datasets. It is not recommended for use in analysis such as clustering or outlier detection, since it does not necessarily preserve densities or distances well.<ref>{{cite book |last1=Schubert |first1=Erich |last2=Gertz |first2=Michael |title=Similarity Search and Applications |chapter=Intrinsic t-Stochastic Neighbor Embedding for Visualization and Outlier Detection |date=2017 |editor-last=Beecks |editor-first=Christian |editor2-last=Borutta |editor2-first=Felix |editor3-last=Kröger |editor3-first=Peer |editor4-last=Seidl |editor4-first=Thomas |chapter-url=https://link.springer.com/chapter/10.1007/978-3-319-68474-1_13 |series=Lecture Notes in Computer Science |volume=10609 |language=en |location=Cham |publisher=Springer International Publishing |pages=188–203 |doi=10.1007/978-3-319-68474-1_13 |isbn=978-3-319-68474-1}}</ref>

===UMAP===
{{Main|Uniform manifold approximation and projection}}
[[Uniform manifold approximation and projection]] (UMAP) is a nonlinear dimensionality reduction technique. Visually, it is similar to t-SNE, but it assumes that the data is uniformly distributed on a [[locally connected]] [[Riemannian manifold]] and that the [[Riemannian metric]] is locally constant or approximately locally constant.
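A minimal visualization-oriented sketch of t-SNE and UMAP follows; it assumes [[scikit-learn]] for t-SNE and the third-party <code>umap-learn</code> package for UMAP, and the synthetic data and parameter values are illustrative assumptions.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.manifold import TSNE
# UMAP is provided by the third-party umap-learn package (assumed installed).
import umap

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))          # illustrative high-dimensional data

# t-SNE: 2-D embedding for visualization; densities and distances are not
# necessarily preserved, so the result is unsuitable for clustering or
# outlier detection.
X_tsne = TSNE(n_components=2, perplexity=30, init='pca',
              random_state=0).fit_transform(X)

# UMAP: assumes the data lie on a locally connected Riemannian manifold with
# an approximately locally constant metric.
X_umap = umap.UMAP(n_neighbors=15, min_dist=0.1,
                   random_state=0).fit_transform(X)
</syntaxhighlight>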