=== Diffusion maps ===
[[Diffusion map]]s leverage the relationship between heat [[diffusion]] and a [[random walk]] ([[Markov chain]]): an analogy is drawn between the diffusion operator on a manifold and a Markov transition matrix operating on functions defined on a graph whose nodes were sampled from the manifold.<ref>{{cite thesis |title=Diffusion Maps and Geometric Harmonics |first=Stephane |last=Lafon |type=PhD |publisher=[[Yale University]] |date=May 2004 |url=http://search.library.yale.edu/catalog/6714668 }}</ref> In particular, let a data set be represented by <math> \mathbf{X} = [x_1,x_2,\ldots,x_n] \in \Omega \subset \mathbf{R}^D</math>. The underlying assumption of the diffusion map is that the high-dimensional data lie on a low-dimensional manifold of dimension <math> d </math>. Let '''X''' represent the data set and <math> \mu </math> the distribution of the data points on '''X'''. Further, define a '''kernel''' <math> k </math>, which represents some notion of affinity between the points of '''X'''. The kernel <math> k </math> has the following properties:<ref name="ReferenceA">{{cite journal |title=Diffusion Maps |first1=Ronald R. |last1=Coifman |first2=Stephane |last2=Lafon |journal=Applied and Computational Harmonic Analysis |volume=21 |issue=1 |date=July 2006 |pages=5–30 |doi=10.1016/j.acha.2006.04.006 |s2cid=17160669 |url=https://www.math.ucdavis.edu/~strohmer/courses/270/diffusion_maps.pdf }}</ref>
: <math>k(x,y) = k(y,x), </math> ''k'' is symmetric;
: <math> k(x,y) \geq 0 \qquad \forall x,y, </math> ''k'' is positivity preserving.
Thus one can think of the individual data points as the nodes of a graph and the kernel ''k'' as defining an affinity on that graph. The graph is symmetric by construction since the kernel is symmetric. From the pair ('''X''', ''k'') one can construct a reversible [[Markov chain]]. This construction is common to a variety of fields and is known as the [[graph Laplacian]].
For example, the graph '''K''' = (''X'', ''E'') can be constructed using a Gaussian kernel:
: <math> K_{ij} = \begin{cases} e^{-\|x_i -x_j\|^2_2/\sigma ^2} & \text{if } x_i \sim x_j \\ 0 & \text{otherwise} \end{cases} </math>
In the above equation, <math> x_i \sim x_j </math> denotes that <math> x_i </math> is a nearest neighbor of <math> x_j </math>. Properly, [[geodesic]] distance should be used to measure distances on the [[manifold]]. Since the exact structure of the manifold is not available, for nearest neighbors the geodesic distance is approximated by the Euclidean distance. The choice of <math> \sigma </math> modulates the notion of proximity, in the sense that <math> K_{ij} \approx 0 </math> if <math> \|x_i - x_j\|_2 \gg \sigma </math> and <math> K_{ij} \approx 1 </math> if <math> \|x_i - x_j\|_2 \ll \sigma </math>. The former means that very little diffusion has taken place, while the latter implies that the diffusion process is nearly complete. Different strategies for choosing <math> \sigma </math> can be found in.<ref>{{cite thesis |first=B. |last=Bah |date=2008 |title=Diffusion Maps: Applications and Analysis |type=Masters |publisher=University of Oxford |url=http://solo.bodleian.ox.ac.uk/permalink/f/89vilt/oxfaleph017015682 }}</ref> In order to faithfully represent a Markov matrix, <math> K </math> must be normalized by the corresponding [[degree matrix]] <math> D </math>:
: <math> P = D^{-1}K. </math>
<math> P </math> now represents a Markov chain: <math> P(x_i,x_j) </math> is the probability of transitioning from <math> x_i </math> to <math> x_j </math> in one time step. Similarly, the probability of transitioning from <math> x_i </math> to <math> x_j </math> in ''t'' time steps is given by <math> P^t(x_i,x_j) </math>, where <math> P^t </math> is the matrix <math> P </math> multiplied by itself ''t'' times. The Markov matrix <math> P </math> captures some notion of the local geometry of the data set '''X'''.
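The two steps above (a nearest-neighbor Gaussian kernel followed by degree normalization <math> P = D^{-1}K </math>) can be sketched in NumPy as follows. This is a minimal illustration: the function name <code>markov_matrix</code> and the parameters <code>sigma</code> and <code>n_neighbors</code> are illustrative choices, not part of any reference implementation.

```python
import numpy as np

def markov_matrix(X, sigma=1.0, n_neighbors=10):
    """Build a nearest-neighbor Gaussian affinity matrix K and the
    row-normalized Markov matrix P = D^{-1} K for data X of shape (n, D)."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances (approximating geodesic
    # distance for nearby points on the manifold).
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    K = np.exp(-d2 / sigma**2)
    # Zero out entries beyond the n_neighbors nearest points
    # (the first sorted index of each row is the point itself).
    far = np.argsort(d2, axis=1)[:, n_neighbors + 1:]
    for i in range(n):
        K[i, far[i]] = 0.0
    # Symmetrize, since x_i may be a neighbor of x_j but not vice versa.
    K = np.maximum(K, K.T)
    # Degree normalization: P = D^{-1} K, so each row sums to 1.
    P = K / K.sum(axis=1, keepdims=True)
    return K, P
```

Each row of the returned <code>P</code> sums to one, so <code>P</code> is a valid transition matrix of a Markov chain on the data points.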
The major difference between diffusion maps and [[principal component analysis]] is that diffusion maps consider only local features of the data, as opposed to correlations across the entire data set. <math> K </math> defines a random walk on the data set, which means that the kernel captures some local geometry of the data set. The Markov chain defines fast and slow directions of propagation through the kernel values. As the walk propagates forward in time, the local geometry information aggregates in the same way as the local transitions (defined by differential equations) of a dynamical system.<ref name="ReferenceA"/> The metaphor of diffusion arises from the definition of a family of diffusion distances <math>\{ D_t \}_{ t \in \mathbb{N}} </math>:
: <math> D_t^2(x,y) = \|p_t(x,\cdot) - p_t(y,\cdot)\|^2 </math>
For fixed ''t'', <math> D_t </math> defines a distance between any two points of the data set based on path connectivity: the value of <math> D_t(x,y) </math> is smaller the more paths connect ''x'' to ''y'', and vice versa. Because the quantity <math> D_t(x,y) </math> involves a sum over all paths of length ''t'', <math> D_t </math> is much more robust to noise in the data than the geodesic distance. <math> D_t </math> takes into account all of the relations between ''x'' and ''y'' when calculating the distance, and so serves as a better notion of proximity than the [[Euclidean distance]] or even the geodesic distance.
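The diffusion distance at scale ''t'' can be sketched as the distance between the corresponding rows of <math> P^t </math>. The sketch below uses a plain Euclidean norm, matching the formula as written above; note that the full definition in Coifman and Lafon weights the norm by the reciprocal of the stationary density, which is omitted here for simplicity.

```python
import numpy as np

def diffusion_distance(P, t, i, j):
    """Diffusion distance D_t(x_i, x_j): the (unweighted) Euclidean
    distance between rows i and j of P^t, i.e. between the t-step
    transition distributions started at x_i and x_j."""
    Pt = np.linalg.matrix_power(P, t)  # P multiplied by itself t times
    return np.linalg.norm(Pt[i] - Pt[j])
```

Because the rows of <math> P^t </math> aggregate the probability mass of all length-''t'' paths, two points joined by many paths have nearly identical rows and hence a small diffusion distance.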