== Important concepts ==

=== Sammon's mapping ===

[[Sammon's mapping]] is one of the first and most popular NLDR techniques.

[[File:SOMsPCA.PNG|thumb|200px|left|Approximation of a principal curve by one-dimensional [[Self-organizing map|SOM]] (a [[broken line]] with red squares, 20 nodes). The first [[Principal component analysis|principal component]] is shown as a blue straight line. Data points are the small grey circles. For PCA, the [[Fraction of variance unexplained]] in this example is 23.23%; for SOM it is 6.86%.<ref>The illustration is prepared using free software: {{cite web |first=E.M. |last=Mirkes |url=http://www.math.le.ac.uk/people/ag153/homepage/PCA_SOM/PCA_SOM.html |title=Principal Component Analysis and Self-Organizing Maps: applet |publisher=University of Leicester |date=2011}}</ref>]]

=== Self-organizing map ===

The [[self-organizing map]] (SOM, also called ''Kohonen map'') and its probabilistic variant [[generative topographic mapping]] (GTM) use a point representation in the embedded space to form a [[latent variable model]] based on a non-linear mapping from the embedded space to the high-dimensional space.<ref>{{cite book |last=Yin |first=Hujun |chapter-url=http://pca.narod.ru/contentsgkwz.htm |chapter=3. Learning Nonlinear Principal Manifolds by Self-Organising Maps |editor1-first=A.N. |editor1-last=Gorban |editor2-first=B. |editor2-last=Kégl |editor3-first=D.C. |editor3-last=Wunsch |editor4-first=A. |editor4-last=Zinovyev |title=Principal Manifolds for Data Visualization and Dimension Reduction |series=Lecture Notes in Computer Science and Engineering |volume=58 |publisher=Springer |date=2007 |pages=68–95 |isbn=978-3-540-73749-0}}</ref> These techniques are related to work on [[density networks]], which are also based on the same probabilistic model.

=== Kernel principal component analysis ===

Perhaps the most widely used algorithm for nonlinear dimensionality reduction is [[kernel principal component analysis|kernel PCA]].<ref>{{cite journal |first1=B. |last1=Schölkopf |first2=A. |last2=Smola |author3-link=Klaus-Robert Müller |first3=K.-R. |last3=Müller |title=Nonlinear Component Analysis as a Kernel Eigenvalue Problem |journal=Neural Computation |volume=10 |issue=5 |pages=1299–1319 |date=1998 |publisher=[[MIT Press]] |doi=10.1162/089976698300017467 |s2cid=6674407}}</ref> PCA begins by computing the covariance matrix of the <math>m \times n</math> matrix <math>\mathbf{X}</math>:

:<math>C = \frac{1}{m}\sum_{i=1}^m{\mathbf{x}_i\mathbf{x}_i^\mathsf{T}}.</math>

It then projects the data onto the first ''k'' eigenvectors of that matrix. By comparison, KPCA begins by computing the covariance matrix of the data after being transformed into a higher-dimensional space,

:<math>C = \frac{1}{m}\sum_{i=1}^m{\Phi(\mathbf{x}_i)\Phi(\mathbf{x}_i)^\mathsf{T}}.</math>

It then projects the transformed data onto the first ''k'' eigenvectors of that matrix, just like PCA. It uses the [[Kernel_method#Mathematics:_the_kernel_trick|kernel trick]] to factor away much of the computation, such that the entire process can be performed without actually computing <math>\Phi(\mathbf{x})</math>. Of course, <math>\Phi</math> must be chosen such that it has a known corresponding kernel. Unfortunately, it is not trivial to find a good kernel for a given problem, so KPCA does not yield good results with some problems when using standard kernels. For example, it is known to perform poorly with these kernels on the [[Swiss roll]] manifold.
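
The following is a minimal usage-level sketch of kernel PCA (it assumes the scikit-learn library; the RBF kernel, its <code>gamma</code> value, and the concentric-circles data are illustrative choices only, not part of the method description above):

<syntaxhighlight lang="python">
# Minimal kernel-PCA sketch (scikit-learn assumed); kernel choice and gamma are illustrative.
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Two concentric circles: a classic example that linear PCA cannot separate.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# The kernel trick projects onto the first k = 2 principal components of the
# (implicit) feature-space covariance matrix without ever computing Phi(x).
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0)
X_kpca = kpca.fit_transform(X)   # shape (400, 2)
</syntaxhighlight>
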
However, one can view certain other methods that perform well in such settings (e.g., Laplacian eigenmaps, LLE) as special cases of kernel PCA by constructing a data-dependent kernel matrix.<ref>{{cite conference |first1=Jihun |last1=Ham |first2=Daniel D. |last2=Lee |first3=Sebastian |last3=Mika |first4=Bernhard |last4=Schölkopf |title=A kernel view of the dimensionality reduction of manifolds |book-title=Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004 |doi=10.1145/1015330.1015417}}</ref> KPCA has an internal model, so it can be used to map points that were not available at training time onto its embedding.

=== Principal curves and manifolds ===

[[File:SlideQualityLife.png|thumb|300px|Application of principal curves: nonlinear quality-of-life index.<ref>{{cite journal |first1=A. N. |last1=Gorban |first2=A. |last2=Zinovyev |arxiv=1001.1122 |title=Principal manifolds and graphs in practice: from molecular biology to dynamical systems |journal=[[International Journal of Neural Systems]] |volume=20 |issue=3 |year=2010 |pages=219–232 |doi=10.1142/S0129065710002383 |pmid=20556849 |s2cid=2170982}}</ref> Points represent data for 171 [[United Nations|UN]] countries in a 4-dimensional space formed by the values of 4 indicators: [[Gross domestic product|gross product per capita]], [[life expectancy]], [[infant mortality]], and [[tuberculosis]] incidence. Different forms and colors correspond to different geographical locations. The bold red line represents the '''principal curve''' approximating the dataset, produced by the [[elastic map]] method.<ref>A. Zinovyev, [http://bioinfo-out.curie.fr/projects/vidaexpert/ ViDaExpert] - Multidimensional Data Visualization Tool, [[Curie Institute (Paris)|Institut Curie]], Paris.</ref>]]

'''[[Principal curve]]s and manifolds''' give the natural geometric framework for nonlinear dimensionality reduction and extend the geometric interpretation of PCA by explicitly constructing an embedded manifold and by encoding data using standard geometric projection onto the manifold. This approach was originally proposed by [[Trevor Hastie]] in his 1984 thesis,<ref>{{cite thesis |first=T. |last=Hastie |url=https://apps.dtic.mil/dtic/tr/fulltext/u2/a148833.pdf |archive-url=https://web.archive.org/web/20190802195018/https://apps.dtic.mil/dtic/tr/fulltext/u2/a148833.pdf |url-status=live |archive-date=August 2, 2019 |title=Principal Curves and Surfaces |type=PhD |publisher=Stanford Linear Accelerator Center, Stanford University |date=November 1984}}</ref> and formally introduced in 1989.<ref>{{cite journal |author1-last=Hastie |author1-first=T. |author1-link=Trevor Hastie |author2-last=Stuetzle |author2-first=W. |title=Principal Curves |journal=[[Journal of the American Statistical Association]] |date=June 1989 |volume=84 |issue=406 |pages=502–6 |doi=10.1080/01621459.1989.10478797 |url=https://web.stanford.edu/~hastie/Papers/Principal_Curves.pdf}}</ref> This idea has been explored further by many authors.<ref>{{cite book |editor-link=Alexander Nikolaevich Gorban |editor-first=A. N. |editor-last=Gorban |editor2-first=B. |editor2-last=Kégl |editor3-first=D. C. |editor3-last=Wunsch |editor4-first=A.
|editor4-last=Zinovyev |url=https://www.researchgate.net/publication/271642170 |title=Principal Manifolds for Data Visualisation and Dimension Reduction |series=Lecture Notes in Computer Science and Engineering (LNCSE) |volume=58 |publisher=Springer |year=2007 |isbn=978-3-540-73749-0}}</ref> How to define the "simplicity" of the manifold is problem-dependent; however, it is commonly measured by the intrinsic dimensionality and/or the smoothness of the manifold. Usually, the principal manifold is defined as a solution to an optimization problem. The objective function includes a measure of data-approximation quality and penalty terms for the bending of the manifold. Popular initial approximations are generated by linear PCA and Kohonen's SOM.

=== Laplacian eigenmaps ===

{{see also|Manifold regularization}}

Laplacian eigenmaps uses spectral techniques to perform dimensionality reduction.<ref>{{cite journal |first1=Mikhail |last1=Belkin |author-link2=Partha Niyogi |first2=Partha |last2=Niyogi |title=Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering |journal=Advances in Neural Information Processing Systems |volume=14 |year=2001 |pages=586–691 |publisher=MIT Press |oclc=52710683 |isbn=0-262-27173-7 |url=http://www.its.caltech.edu/~matilde/BelkinNiyogiLaplacian.pdf}}</ref> This technique relies on the basic assumption that the data lies on a low-dimensional manifold in a high-dimensional space.<ref name="Belkin">{{cite thesis |first=Mikhail |last=Belkin |title=Problems of Learning on Manifolds |type=PhD |publisher=Department of Mathematics, The University of Chicago |date=August 2003 |url=https://web.cse.ohio-state.edu/~belkin.8/papers/papers.html#thesis }} Matlab code for Laplacian Eigenmaps can be found in algorithms at [https://www.cse.ohio-state.edu/~mbelkin/algorithms/algorithms.html Ohio-state.edu]</ref> This algorithm cannot embed out-of-sample points, but techniques based on [[Reproducing kernel Hilbert space]] regularization exist for adding this capability.<ref>{{cite book |last1=Bengio |first1=Yoshua |first2=Jean-Francois |last2=Paiement |first3=Pascal |last3=Vincent |first4=Olivier |last4=Delalleau |first5=Nicolas |last5=Le Roux |first6=Marie |last6=Ouimet |chapter-url=https://papers.nips.cc/paper/2461-out-of-sample-extensions-for-lle-isomap-mds-eigenmaps-and-spectral-clustering.pdf |chapter=Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering |title=Advances in Neural Information Processing Systems |year=2004 |isbn=0-262-20152-6 |volume=16 |publisher=MIT Press}}</ref> Such techniques can be applied to other nonlinear dimensionality reduction algorithms as well.

Traditional techniques like principal component analysis do not consider the intrinsic geometry of the data. Laplacian eigenmaps builds a graph from the neighborhood information of the data set. Each data point serves as a node of the graph, and connectivity between nodes is governed by the proximity of neighboring points (determined using, e.g., the [[k-nearest neighbor algorithm]]). The graph thus generated can be considered a discrete approximation of the low-dimensional manifold in the high-dimensional space. Minimization of a cost function based on the graph ensures that points close to each other on the manifold are mapped close to each other in the low-dimensional space, preserving local distances.
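
The following is a compact illustrative sketch of this graph-based procedure (plain NumPy/SciPy; the heat-kernel weights, the unnormalized graph Laplacian, and the parameter values are assumptions of this sketch, not the only possible formulation):

<syntaxhighlight lang="python">
# Illustrative Laplacian-eigenmaps sketch (NumPy/SciPy assumed); k, t and the
# unnormalized Laplacian are arbitrary choices for this example.
import numpy as np
from scipy.linalg import eigh
from scipy.spatial import cKDTree

def laplacian_eigenmaps(X, n_components=2, k=10, t=1.0):
    """Embed X (n_samples, n_features) into n_components dimensions."""
    n = X.shape[0]
    # 1. Symmetrized k-nearest-neighbour graph with heat-kernel weights.
    dists, idx = cKDTree(X).query(X, k=k + 1)      # column 0 is the point itself
    W = np.zeros((n, n))
    for i in range(n):
        for dist, j in zip(dists[i, 1:], idx[i, 1:]):
            W[i, j] = W[j, i] = np.exp(-dist**2 / t)
    # 2. Unnormalized graph Laplacian L = D - W.
    D = np.diag(W.sum(axis=1))
    L = D - W
    # 3. Generalized eigenproblem L y = lambda D y; drop the trivial constant eigenvector.
    eigvals, eigvecs = eigh(L, D)
    return eigvecs[:, 1:n_components + 1]

# Y = laplacian_eigenmaps(X)   # X: a (n_samples, n_features) NumPy array
</syntaxhighlight>
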
The eigenfunctions of the [[Laplace–Beltrami operator]] on the manifold serve as the embedding dimensions, since under mild conditions this operator has a countable spectrum that is a basis for square-integrable functions on the manifold (compare to [[Fourier series]] on the unit circle manifold). Attempts to place Laplacian eigenmaps on solid theoretical ground have met with some success, as under certain nonrestrictive assumptions, the graph Laplacian matrix has been shown to converge to the Laplace–Beltrami operator as the number of points goes to infinity.<ref name="Belkin" />

=== Isomap ===

[[Isomap]]<ref>{{cite journal |first1=J B. |last1=Tenenbaum |first2=V. |last2=de Silva |first3=J.C. |last3=Langford |url=http://www.robots.ox.ac.uk/~az/lectures/ml/tenenbaum-isomap-Science2000.pdf |title=A Global Geometric Framework for Nonlinear Dimensionality Reduction |journal=Science |volume=290 |date=2000 |issue=5500 |pages=2319–23 |doi=10.1126/science.290.5500.2319 |pmid=11125149 |bibcode=2000Sci...290.2319T |s2cid=221338160}}</ref> is a combination of the [[Floyd–Warshall algorithm]] with classic [[Multidimensional Scaling]] (MDS). Classic MDS takes a matrix of pair-wise distances between all points and computes a position for each point. Isomap assumes that the pair-wise distances are only known between neighboring points, and uses the Floyd–Warshall algorithm to compute the pair-wise distances between all other points. This effectively estimates the full matrix of pair-wise [[geodesic distance]]s between all of the points. Isomap then uses classic MDS to compute the reduced-dimensional positions of all the points. Landmark-Isomap is a variant of this algorithm that uses landmarks to increase speed, at the cost of some accuracy.

In manifold learning, the input data is assumed to be sampled from a low-dimensional [[manifold]] that is embedded inside a higher-dimensional vector space. The main intuition behind maximum variance unfolding (MVU, described below) is to exploit the local linearity of manifolds and create a mapping that preserves local neighbourhoods at every point of the underlying manifold.

=== Locally-linear embedding ===

[[Locally-linear Embedding]] (LLE) was presented at approximately the same time as Isomap.<ref>{{cite journal |first1=S. T. |last1=Roweis |first2=L. K. |last2=Saul |title=Nonlinear Dimensionality Reduction by Locally Linear Embedding |journal=Science |volume=290 |issue=5500 |year=2000 |pages=2323–6 |doi=10.1126/science.290.5500.2323 |pmid=11125150 |bibcode=2000Sci...290.2323R |s2cid=5987139}}</ref> It has several advantages over Isomap, including faster optimization when implemented to take advantage of [[sparse matrix]] algorithms, and better results on many problems. LLE also begins by finding a set of nearest neighbors of each point. It then computes a set of weights for each point that best describes the point as a linear combination of its neighbors. Finally, it uses an eigenvector-based optimization technique to find the low-dimensional embedding of points, such that each point is still described by the same linear combination of its neighbors. LLE tends to handle non-uniform sample densities poorly because there is no fixed unit to prevent the weights from drifting as various regions differ in sample densities. LLE has no internal model.

The original data points are <math>X_i \in \R^D</math>, and the goal of LLE is to embed each point <math>X_i</math> into a low-dimensional point <math>Y_i \in \R^d</math>, where <math>d \ll D</math>. LLE has two steps, detailed below.
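
Before the two steps are described in detail, a minimal usage-level sketch may be helpful (it assumes the scikit-learn implementation; the Swiss roll data and the values of ''K'' and ''d'' are illustrative only):

<syntaxhighlight lang="python">
# Minimal LLE usage sketch (scikit-learn assumed); K and d are illustrative choices.
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1500, noise=0.05, random_state=0)  # points in R^3

# K = 12 nearest neighbors per point; embed into d = 2 dimensions.
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, method="standard")
Y = lle.fit_transform(X)  # shape (1500, 2)
</syntaxhighlight>
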
In the '''first''' step, it computes, for each point ''X''<sub>''i''</sub>, the best approximation of ''X''<sub>''i''</sub> based on barycentric coordinates of its neighbors ''X''<sub>''j''</sub>. The original point is approximately reconstructed by a linear combination, given by the weight matrix ''W''<sub>''ij''</sub>, of its neighbors. The reconstruction error is:

:<math> E(W) = \sum_i \left|\mathbf{X}_i - \sum_j {\mathbf{W}_{ij}\mathbf{X}_j}\right|^2 </math>

The weight ''W''<sub>''ij''</sub> is the amount that the point ''X''<sub>''j''</sub> contributes to the reconstruction of the point ''X''<sub>''i''</sub>. The cost function is minimized under two constraints:
* Each data point ''X''<sub>''i''</sub> is reconstructed only from its neighbors. That is, <math>W_{ij} = 0</math> if point ''X''<sub>''j''</sub> is not a neighbor of the point ''X''<sub>''i''</sub>.
* The sum of every row of the weight matrix equals 1, that is, <math> \sum_j {\mathbf{W}_{ij}} = 1 </math>.
These two constraints ensure that <math>W</math> is unaffected by rotation and translation.

In the '''second''' step, a neighborhood-preserving map is created based on the weights. Each point <math>X_i \in \R^D</math> is mapped onto a point <math>Y_i \in \R^d</math> by minimizing another cost:

:<math> C(Y) = \sum_i \left|\mathbf{Y}_i - \sum_j {\mathbf{W}_{ij}\mathbf{Y}_j}\right|^{2} </math>

Unlike in the previous cost function, the weights ''W''<sub>''ij''</sub> are kept fixed and the minimization is done with respect to the points ''Y''<sub>''i''</sub> to optimize the coordinates. This minimization problem can be solved by solving a sparse ''N'' × ''N'' [[Eigendecomposition of a matrix|eigenvalue problem]] (''N'' being the number of data points), whose bottom ''d'' nonzero eigenvectors provide an orthogonal set of coordinates.

The only hyperparameter in the algorithm is what counts as a "neighbor" of a point. Generally, the data points are reconstructed from their ''K'' nearest neighbors, as measured by [[Euclidean distance]]. In this case, the algorithm has only one integer-valued hyperparameter ''K'', which can be chosen by cross-validation.

==== Hessian locally-linear embedding (Hessian LLE) ====

Like LLE, [[Hessian LLE]] is also based on sparse matrix techniques.<ref>{{cite journal |first1=D. |last1=Donoho |first2=C. |last2=Grimes |title=Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data |journal=Proc Natl Acad Sci U S A |date=2003 |volume=100 |issue=10 |pages=5591–6 |doi=10.1073/pnas.1031596100 |pmid=16576753 |pmc=156245 |bibcode=2003PNAS..100.5591D |doi-access=free}}</ref> It tends to yield results of a much higher quality than LLE. Unfortunately, it has a very high computational cost, so it is not well-suited for heavily sampled manifolds. It has no internal model.

==== Modified Locally-Linear Embedding (MLLE) ====

Modified LLE (MLLE)<ref>{{cite journal |first1=Z. |last1=Zhang |first2=J. |last2=Wang |title=MLLE: Modified Locally Linear Embedding Using Multiple Weights |journal=NIPS'06: Proceedings of the 19th International Conference on Neural Information Processing Systems |year=2006 |pages=1593–1600 |url=https://dl.acm.org/doi/abs/10.5555/2976456.2976656}}</ref> is another LLE variant which uses multiple weights in each neighborhood to address the local weight-matrix conditioning problem that leads to distortions in LLE maps. Loosely speaking, the multiple weights are the local [[orthogonal projection]] of the original weights produced by LLE.
The creators of this regularised variant also developed Local Tangent Space Alignment (LTSA), which is implicit in the MLLE formulation: the global optimisation of the orthogonal projections of each weight vector, in essence, aligns the local tangent spaces of every data point. The theoretical and empirical implications of the correct application of this algorithm are far-reaching.<ref name=borntolose>{{cite journal |author-last=Sidhu |author-first=Gagan |title=Locally Linear Embedding and fMRI feature selection in psychiatric classification |journal=IEEE Journal of Translational Engineering in Health and Medicine |year=2019 |volume=7 |pages=1–11 |doi=10.1109/JTEHM.2019.2936348 |pmid=31497410 |pmc=6726465 |arxiv=1908.06319 |s2cid=201832756}}</ref>

=== Local tangent space alignment ===

{{Main|Local tangent space alignment}}

[[local tangent space alignment|LTSA]]<ref>{{Cite journal |last=Zhang |first=Zhenyue |author2=Hongyuan Zha |title=Principal Manifolds and Nonlinear Dimension Reduction via Local Tangent Space Alignment |journal=SIAM Journal on Scientific Computing |volume=26 |issue=1 |year=2005 |pages=313–338 |doi=10.1137/s1064827502419154 |citeseerx=10.1.1.211.9957}}</ref> is based on the intuition that when a manifold is correctly unfolded, all of the tangent hyperplanes to the manifold will become aligned. It begins by computing the ''k''-nearest neighbors of every point. It computes the tangent space at every point by computing the first ''d'' principal components in each local neighborhood. It then optimizes to find an embedding that aligns the tangent spaces.

=== Maximum variance unfolding ===

[[Maximum Variance Unfolding]], Isomap and Locally Linear Embedding share a common intuition relying on the notion that if a manifold is properly unfolded, then variance over the points is maximized. Its initial step, like Isomap and Locally Linear Embedding, is finding the ''k''-nearest neighbors of every point. It then seeks to solve the problem of maximizing the distance between all non-neighboring points, constrained such that the distances between neighboring points are preserved. The primary contribution of this algorithm is a technique for casting this problem as a semidefinite programming problem. Unfortunately, semidefinite programming solvers have a high computational cost. Like Locally Linear Embedding, it has no internal model.

=== Autoencoders ===

An [[autoencoder]] is a feed-forward [[neural network]] which is trained to approximate the identity function. That is, it is trained to map from a vector of values to the same vector. When used for dimensionality reduction purposes, one of the hidden layers in the network is limited to contain only a small number of network units. Thus, the network must learn to encode the vector into a small number of dimensions and then decode it back into the original space. In this way, the first half of the network is a model which maps from high- to low-dimensional space, and the second half maps from low- to high-dimensional space. Although the idea of autoencoders is quite old,<ref>{{cite book |last1=DeMers |first1=D. |last2=Cottrell |first2=G.W. |date=1993 |chapter=Non-linear dimensionality reduction |title=Advances in neural information processing systems |pages=580–7 |volume=5 |publisher=Morgan Kaufmann |isbn=1558600159 |oclc=928936290}}</ref> training of deep autoencoders has only recently become possible through the use of [[restricted Boltzmann machine]]s and stacked denoising autoencoders.
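
A minimal sketch of a dimensionality-reducing autoencoder follows (it assumes the PyTorch library; the layer sizes, bottleneck width ''d'', placeholder data, and training loop are illustrative choices only):

<syntaxhighlight lang="python">
# Minimal autoencoder sketch (PyTorch assumed); architecture and hyperparameters
# are illustrative, not prescriptive.
import torch
import torch.nn as nn

D, d = 784, 2                     # original and bottleneck dimensions
encoder = nn.Sequential(nn.Linear(D, 128), nn.ReLU(), nn.Linear(128, d))
decoder = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, D))
model = nn.Sequential(encoder, decoder)   # trained to approximate the identity

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

X = torch.rand(256, D)            # placeholder data batch
for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X), X)   # reconstruction error ||f(x) - x||^2
    loss.backward()
    optimizer.step()

Y = encoder(X).detach()           # low-dimensional embedding, shape (256, d)
</syntaxhighlight>
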
Related to autoencoders is the [[NeuroScale]] algorithm, which uses stress functions inspired by [[multidimensional scaling]] and [[Sammon mapping]]s (see above) to learn a non-linear mapping from the high-dimensional space to the embedded space. The mappings in NeuroScale are based on [[radial basis function network]]s.

=== Gaussian process latent variable models ===

[[Gaussian process latent variable model]]s (GPLVM)<ref>{{cite journal |first=N. |last=Lawrence |url=http://jmlr.csail.mit.edu/papers/v6/lawrence05a.html |title=Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models |journal=Journal of Machine Learning Research |volume=6 |pages=1783–1816 |date=2005}}</ref> are probabilistic dimensionality reduction methods that use Gaussian processes (GPs) to find a lower-dimensional non-linear embedding of high-dimensional data. They are an extension of the probabilistic formulation of PCA. The model is defined probabilistically, the latent variables are then marginalized, and the parameters are obtained by maximizing the likelihood. Like kernel PCA, they use a kernel function to form a non-linear mapping (in the form of a [[Gaussian process]]). However, in the GPLVM the mapping is from the embedded (latent) space to the data space (like density networks and GTM), whereas in kernel PCA it is in the opposite direction. It was originally proposed for visualization of high-dimensional data but has been extended to construct a shared manifold model between two observation spaces.

GPLVM and its many variants have been proposed especially for human motion modeling, e.g., back-constrained GPLVM, GP dynamic model (GPDM), balanced GPDM (B-GPDM) and topologically constrained GPDM. To capture the coupling effect of the pose and gait manifolds in gait analysis, a multi-layer joint gait-pose manifold model was proposed.<ref>{{cite journal |first1=M. |last1=Ding |first2=G. |last2=Fan |url=https://ieeexplore.ieee.org/document/6985586 |title=Multilayer Joint Gait-Pose Manifolds for Human Gait Motion Modeling |journal=IEEE Transactions on Cybernetics |volume=45 |issue=11 |date=2015 |pages=2413–24 |doi=10.1109/TCYB.2014.2373393 |pmid=25532201 |s2cid=15591304 |url-access=subscription}}</ref>

=== t-distributed stochastic neighbor embedding ===

<!-- This paragraph should be moved to the general class of stochastic neighbor embedding techniques. -->
[[t-distributed stochastic neighbor embedding]] (t-SNE)<ref>{{cite journal |last1=van der Maaten |first1=L.J.P. |last2=Hinton |first2=G.E. |title=Visualizing High-Dimensional Data Using t-SNE |journal=Journal of Machine Learning Research |volume=9 |date=2008 |pages=2579–2605 |url=http://jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf}}</ref> is widely used. It is one of a family of stochastic neighbor embedding methods. The algorithm computes the probability that pairs of data points in the high-dimensional space are related, and then chooses low-dimensional embeddings which produce a similar distribution.
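
A brief usage sketch of t-SNE (it assumes the scikit-learn implementation; the digits dataset and the perplexity value are illustrative choices):

<syntaxhighlight lang="python">
# Minimal t-SNE sketch (scikit-learn assumed); perplexity is an illustrative choice.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)     # 1797 samples in 64 dimensions

# Pairwise similarities in the 64-D space are matched by a Student-t
# distribution over pairwise distances in the 2-D embedding.
tsne = TSNE(n_components=2, perplexity=30.0, random_state=0)
X_2d = tsne.fit_transform(X)            # shape (1797, 2)
</syntaxhighlight>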