==Dimension reduction==
For high-dimensional datasets, dimension reduction is usually performed prior to applying a [[k-nearest neighbors algorithm|''k''-nearest neighbors]] (''k''-NN) algorithm in order to mitigate the [[curse of dimensionality]].<ref>Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, Uri Shaft (1999). [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.1422 "When is "nearest neighbor" meaningful?"]. ''Database Theory—ICDT '99'', pp. 217–235.</ref> [[Feature extraction]] and dimension reduction can be combined in one step, using [[principal component analysis]] (PCA), [[linear discriminant analysis]] (LDA), [[canonical correlation analysis]] (CCA), or [[non-negative matrix factorization]] (NMF) techniques to pre-process the data, followed by clustering via ''k''-NN on [[Feature (machine learning)|feature vectors]] in a reduced-dimension space. In [[machine learning]], this process is also called low-dimensional [[embedding]].<ref>{{cite book |last1=Shaw |first1=B. |last2=Jebara |first2=T. |doi=10.1145/1553374.1553494 |chapter=Structure preserving embedding |title=Proceedings of the 26th Annual International Conference on Machine Learning – ICML '09 |pages=1 |year=2009 |isbn=9781605585161 |chapter-url=http://www.cs.columbia.edu/~jebara/papers/spe-icml09.pdf |citeseerx=10.1.1.161.451 |s2cid=8522279}}</ref> For high-dimensional datasets (e.g., when performing similarity search on live video streams, DNA data, or high-dimensional [[time series]]), running a fast '''approximate''' ''k''-NN search using [[locality-sensitive hashing]], [[random projection]],<ref>{{cite book |last1=Bingham |first1=E. |last2=Mannila |first2=H. |doi=10.1145/502512.502546 |chapter=Random projection in dimensionality reduction |title=Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining – KDD '01 |pages=245 |year=2001 |isbn=978-1581133912 |s2cid=1854295}}</ref> "sketches",<ref>Shasha, D. (2004). ''High Performance Discovery in Time Series''. Berlin: Springer. {{ISBN|0-387-00857-8}}</ref> or other high-dimensional similarity-search techniques from the [[International Conference on Very Large Data Bases|VLDB conference]] toolbox may be the only feasible option.
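The pipeline above can be illustrated with a minimal sketch, assuming Python with NumPy and scikit-learn (neither is prescribed by this section, and the data, component count, and neighbor count are illustrative only): PCA pre-processing followed by an exact ''k''-NN search in the reduced-dimension space, with a random projection shown as the cheaper alternative suited to approximate search on very high-dimensional data.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 512))  # toy data: 1,000 points in 512 dimensions

# Combine feature extraction and dimension reduction in one step:
# project the data onto the top 20 principal components.
X_pca = PCA(n_components=20).fit_transform(X)

# Exact k-NN search in the reduced-dimension space. n_neighbors=6 because,
# when querying the training set, each point is its own nearest neighbor
# (distance 0) and is dropped below.
nn = NearestNeighbors(n_neighbors=6).fit(X_pca)
distances, indices = nn.kneighbors(X_pca)
neighbors = indices[:, 1:]  # the 5 nearest distinct points for each query

# Cheaper alternative for very high-dimensional data: a Gaussian random
# projection approximately preserves pairwise distances
# (Johnson-Lindenstrauss) without computing principal components.
X_rp = GaussianRandomProjection(n_components=20, random_state=0).fit_transform(X)
</syntaxhighlight>

Note that scikit-learn's <code>NearestNeighbors</code> performs an exact search; for the approximate methods named above (locality-sensitive hashing, sketches), a dedicated similarity-search library would be used instead.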