=== Machine learning ===
In [[machine learning]] problems that involve learning a "state of nature" from a finite number of data samples in a high-dimensional [[feature space]], with each feature having a range of possible values, an enormous amount of training data is typically required to ensure that there are several samples with each combination of values. In an abstract sense, as the number of features or dimensions grows, the amount of data needed to generalize accurately grows exponentially (a numerical sketch of this growth is given below).<ref>{{cite web |title=Curse of Dimensionality - Georgia Tech - Machine Learning |url=https://www.youtube.com/watch?v=QZ0DtNFdDko |language=en |access-date=2022-06-29 |author=Udacity |website=[[YouTube]] |date=2015-02-23}}</ref> A typical rule of thumb is that there should be at least 5 training examples for each dimension in the representation.<ref name="Pattern recog">{{cite book |last1=Theodoridis |first1=Sergios |last2=Koutroumbas |first2=Konstantinos |title=Pattern Recognition |edition=4th |date=2008 |location=Burlington |url=https://www.elsevier.com/books/pattern-recognition/theodoridis/978-1-59749-272-0 |isbn=978-1-59749-272-0 |access-date=2018-01-08}}</ref>

In machine learning, insofar as predictive performance is concerned, the term ''curse of dimensionality'' is used interchangeably with the ''peaking phenomenon'',<ref name="Pattern recog"/> also known as the ''Hughes phenomenon''.<ref>{{cite journal |last=Hughes |first=G.F. |s2cid=206729491 |title=On the mean accuracy of statistical pattern recognizers |journal=IEEE Transactions on Information Theory |volume=14 |issue=1 |pages=55–63 |date=January 1968 |doi=10.1109/TIT.1968.1054102}}</ref> This phenomenon refers to the observation that, with a fixed number of training samples, the average (expected) predictive power of a classifier or regressor first increases as the number of dimensions or features is increased, but beyond a certain dimensionality it starts deteriorating instead of improving steadily (a simulation sketch of this behavior is given below).<ref>{{cite journal |last1=Trunk |first1=G. V. |title=A Problem of Dimensionality: A Simple Example |journal=IEEE Transactions on Pattern Analysis and Machine Intelligence |date=July 1979 |volume=PAMI-1 |issue=3 |pages=306–307 |doi=10.1109/TPAMI.1979.4766926 |pmid=21868861 |s2cid=13086902}}</ref><ref>{{cite journal |author1=B. Chandrasekaran |author2=A. K. Jain |title=Quantization Complexity and Independent Measurements |journal=IEEE Transactions on Computers |year=1974 |doi=10.1109/T-C.1974.223789 |volume=23 |issue=8 |pages=102–106 |s2cid=35360973}}</ref><ref name="McLachlan:2004">{{cite book |title=Discriminant Analysis and Statistical Pattern Recognition |first1=G. J. |last1=McLachlan |publisher=Wiley Interscience |isbn=978-0-471-69115-0 |year=2004 |mr=1190469}}</ref>

Nevertheless, in the context of a ''simple'' classifier (e.g., [[linear discriminant analysis]] in the multivariate Gaussian model under the assumption of a common known covariance matrix), Zollanvari ''et al.'' showed both analytically and empirically that, as long as the relative cumulative efficacy of an additional feature set (with respect to the features already in the classifier) is greater (or less) than the size of this additional feature set, the expected error of the classifier constructed with these additional features will be less (or greater) than the expected error of the classifier constructed without them. In other words, both the size of the additional feature set and its (relative) cumulative discriminatory effect determine whether the average predictive power decreases or increases.<ref name="zollanvari">{{cite journal |first1=A. |last1=Zollanvari |first2=A. P. |last2=James |first3=R. |last3=Sameni |title=A Theoretical Analysis of the Peaking Phenomenon in Classification |journal=Journal of Classification |year=2020 |doi=10.1007/s00357-019-09327-3 |volume=37 |issue=2 |pages=421–434 |s2cid=253851666}}</ref>

In [[Similarity learning|metric learning]], higher dimensions can sometimes allow a model to achieve better performance. After normalizing embeddings to the surface of a hypersphere, FaceNet achieved its best performance with 128 dimensions, as opposed to 64, 256, or 512 dimensions, in one ablation study.<ref>{{Cite book |last1=Schroff |first1=Florian |last2=Kalenichenko |first2=Dmitry |last3=Philbin |first3=James |title=2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) |chapter=FaceNet: A unified embedding for face recognition and clustering |date=June 2015 |chapter-url=https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Schroff_FaceNet_A_Unified_2015_CVPR_paper.pdf |pages=815–823 |doi=10.1109/CVPR.2015.7298682 |arxiv=1503.03832 |isbn=978-1-4673-6964-0 |s2cid=206592766}}</ref> A loss function for unitary-invariant dissimilarity between word embeddings was likewise found to be minimized in high dimensions.<ref>{{Cite journal |last1=Yin |first1=Zi |last2=Shen |first2=Yuanyuan |date=2018 |title=On the Dimensionality of Word Embedding |url=https://proceedings.neurips.cc/paper_files/paper/2018/file/b534ba68236ba543ae44b22bd110a1d6-Paper.pdf |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=31}}</ref>
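The exponential growth described above can be made concrete with a short sketch. Purely for illustration, suppose every feature is discretized into 10 possible values (a hypothetical figure, not taken from any particular dataset); the number of distinct value combinations, each of which would ideally be represented by several training samples, is then 10<sup>''d''</sup> for ''d'' features:

<syntaxhighlight lang="python">
# Minimal sketch: how many distinct feature-value combinations ("cells")
# exist, assuming (hypothetically) 10 discrete values per feature.
values_per_feature = 10
for n_features in (1, 2, 3, 5, 10):
    n_cells = values_per_feature ** n_features
    print(f"{n_features:2d} features -> {n_cells:,} combinations")
# 10 features already give 10,000,000,000 combinations -- far more cells
# than any realistic training set could populate with several samples each.
</syntaxhighlight>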
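The peaking phenomenon itself can be reproduced with a small simulation. The sketch below is a simplified, hypothetical variant of the setting analyzed by Trunk: two unit-variance Gaussian classes whose means differ in the ''i''-th feature by an amount proportional to <math>1/\sqrt{i}</math>, classified by the nearest ''estimated'' class mean computed from a fixed, small training set (here 10 samples per class, an arbitrary choice). Measured test accuracy typically rises at first, peaks at a moderate number of features, and then falls back toward chance as ever weaker features are added while the training-set size stays fixed:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

def trial(dim, n_train_per_class=10, n_test_per_class=1000):
    """One run of a Trunk-style peaking experiment: two unit-variance
    Gaussian classes with means +/-mu, mu_i = 1/sqrt(i), classified by
    the nearest class mean estimated from the training data."""
    mu = 1.0 / np.sqrt(np.arange(1, dim + 1))
    # Estimate each class mean from a small, fixed-size training sample.
    m_pos = rng.normal(+mu, 1.0, (n_train_per_class, dim)).mean(axis=0)
    m_neg = rng.normal(-mu, 1.0, (n_train_per_class, dim)).mean(axis=0)
    w = m_pos - m_neg            # estimated discriminant direction
    b = (m_pos + m_neg) / 2.0    # estimated midpoint between the classes
    # Evaluate on fresh test data drawn from the true distributions.
    x_pos = rng.normal(+mu, 1.0, (n_test_per_class, dim))
    x_neg = rng.normal(-mu, 1.0, (n_test_per_class, dim))
    correct = ((x_pos - b) @ w > 0).sum() + ((x_neg - b) @ w < 0).sum()
    return correct / (2 * n_test_per_class)

for dim in (1, 3, 10, 30, 100, 300, 1000, 3000):
    acc = np.mean([trial(dim) for _ in range(30)])
    print(f"{dim:5d} features: test accuracy ~ {acc:.3f}")
# Accuracy typically improves up to a moderate number of features and then
# deteriorates as more, weakly informative features are added.
</syntaxhighlight>

Had the class means been known exactly rather than estimated, accuracy in this setting would keep improving as features are added; the eventual deterioration comes entirely from having to estimate more and more parameters from the same fixed training set.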