=== Machine learning ===
In [[machine learning]] problems that involve learning a "state of nature" from a finite number of data samples in a high-dimensional [[feature space]], with each feature having a range of possible values, an enormous amount of training data is typically required to ensure that there are several samples with each combination of values. In an abstract sense, as the number of features or dimensions grows, the amount of data needed to generalize accurately grows exponentially (a numerical sketch of this growth is given below).<ref>{{cite web |title=Curse of Dimensionality - Georgia Tech - Machine Learning |url=https://www.youtube.com/watch?v=QZ0DtNFdDko |language=en |access-date=2022-06-29 |author=Udacity |website=[[YouTube]] |date=2015-02-23}}</ref> A typical rule of thumb is that there should be at least 5 training examples for each dimension in the representation.<ref name="Pattern recog">{{cite book |last1=Theodoridis |first1=Sergios |last2=Koutroumbas |first2=Konstantinos |title=Pattern Recognition |edition=4th |date=2008 |location=Burlington |url=https://www.elsevier.com/books/pattern-recognition/theodoridis/978-1-59749-272-0 |isbn=978-1-59749-272-0 |access-date=2018-01-08}}</ref>

In machine learning, insofar as predictive performance is concerned, the term ''curse of dimensionality'' is used interchangeably with the ''peaking phenomenon'',<ref name="Pattern recog"/> also known as the ''Hughes phenomenon''.<ref>{{cite journal |last=Hughes |first=G.F. |s2cid=206729491 |title=On the mean accuracy of statistical pattern recognizers |journal=IEEE Transactions on Information Theory |volume=14 |issue=1 |pages=55–63 |date=January 1968 |doi=10.1109/TIT.1968.1054102}}</ref> This phenomenon refers to the observation that, with a fixed number of training samples, the average (expected) predictive power of a classifier or regressor first increases as the number of dimensions or features is increased, but beyond a certain dimensionality it starts deteriorating instead of improving steadily (a simulation sketch of this behavior is given below).<ref>{{cite journal |last1=Trunk |first1=G. V. |title=A Problem of Dimensionality: A Simple Example |journal=IEEE Transactions on Pattern Analysis and Machine Intelligence |date=July 1979 |volume=PAMI-1 |issue=3 |pages=306–307 |doi=10.1109/TPAMI.1979.4766926 |pmid=21868861 |s2cid=13086902}}</ref><ref>{{cite journal |author1=B. Chandrasekaran |author2=A. K. Jain |title=Quantization Complexity and Independent Measurements |journal=IEEE Transactions on Computers |year=1974 |doi=10.1109/T-C.1974.223789 |volume=23 |issue=8 |pages=102–106 |s2cid=35360973}}</ref><ref name="McLachlan:2004">{{cite book |title=Discriminant Analysis and Statistical Pattern Recognition |first1=G. J. |last1=McLachlan |publisher=Wiley Interscience |isbn=978-0-471-69115-0 |year=2004 |mr=1190469}}</ref>

Nevertheless, in the context of a ''simple'' classifier (e.g., [[linear discriminant analysis]] in the multivariate Gaussian model under the assumption of a common known covariance matrix), Zollanvari ''et al.'' showed both analytically and empirically that, as long as the relative cumulative efficacy of an additional feature set (with respect to the features already in the classifier) is greater (or less) than the size of this additional feature set, the expected error of the classifier constructed with these additional features will be less (or greater) than the expected error of the classifier constructed without them. In other words, both the size of the additional feature set and its (relative) cumulative discriminatory effect determine whether the average predictive power decreases or increases.<ref name="zollanvari">{{cite journal |first1=A. |last1=Zollanvari |first2=A. P. |last2=James |first3=R. |last3=Sameni |title=A Theoretical Analysis of the Peaking Phenomenon in Classification |journal=Journal of Classification |year=2020 |doi=10.1007/s00357-019-09327-3 |volume=37 |issue=2 |pages=421–434 |s2cid=253851666}}</ref>

In [[Similarity learning|metric learning]], higher dimensions can sometimes allow a model to achieve better performance. After normalizing embeddings to the surface of a hypersphere, FaceNet achieved its best performance with 128 dimensions, as opposed to 64, 256, or 512 dimensions, in one ablation study.<ref>{{Cite book |last1=Schroff |first1=Florian |last2=Kalenichenko |first2=Dmitry |last3=Philbin |first3=James |title=2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) |chapter=FaceNet: A unified embedding for face recognition and clustering |date=June 2015 |chapter-url=https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Schroff_FaceNet_A_Unified_2015_CVPR_paper.pdf |pages=815–823 |doi=10.1109/CVPR.2015.7298682 |arxiv=1503.03832 |isbn=978-1-4673-6964-0 |s2cid=206592766}}</ref> A loss function for unitary-invariant dissimilarity between word embeddings was likewise found to be minimized in high dimensions.<ref>{{Cite journal |last1=Yin |first1=Zi |last2=Shen |first2=Yuanyuan |date=2018 |title=On the Dimensionality of Word Embedding |url=https://proceedings.neurips.cc/paper_files/paper/2018/file/b534ba68236ba543ae44b22bd110a1d6-Paper.pdf |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=31}}</ref>
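The exponential growth described above can be made concrete with a short sketch. Purely for illustration, suppose every feature is discretized into 10 possible values (a hypothetical figure, not taken from any particular dataset); the number of distinct value combinations, each of which would ideally be represented by several training samples, is then 10<sup>''d''</sup> for ''d'' features:

<syntaxhighlight lang="python">
# Minimal sketch: how many distinct feature-value combinations ("cells")
# exist, assuming (hypothetically) 10 discrete values per feature.
values_per_feature = 10
for n_features in (1, 2, 3, 5, 10):
    n_cells = values_per_feature ** n_features
    print(f"{n_features:2d} features -> {n_cells:,} combinations")
# 10 features already give 10,000,000,000 combinations -- far more cells
# than any realistic training set could populate with several samples each.
</syntaxhighlight>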
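The peaking phenomenon itself can be reproduced with a small simulation. The sketch below is a simplified, hypothetical variant of the setting analyzed by Trunk: two unit-variance Gaussian classes whose means differ in the ''i''-th feature by an amount proportional to <math>1/\sqrt{i}</math>, classified by the nearest ''estimated'' class mean computed from a fixed, small training set (here 10 samples per class, an arbitrary choice). Measured test accuracy typically rises at first, peaks at a moderate number of features, and then falls back toward chance as ever weaker features are added while the training-set size stays fixed:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

def trial(dim, n_train_per_class=10, n_test_per_class=1000):
    """One run of a Trunk-style peaking experiment: two unit-variance
    Gaussian classes with means +/-mu, mu_i = 1/sqrt(i), classified by
    the nearest class mean estimated from the training data."""
    mu = 1.0 / np.sqrt(np.arange(1, dim + 1))
    # Estimate each class mean from a small, fixed-size training sample.
    m_pos = rng.normal(+mu, 1.0, (n_train_per_class, dim)).mean(axis=0)
    m_neg = rng.normal(-mu, 1.0, (n_train_per_class, dim)).mean(axis=0)
    w = m_pos - m_neg            # estimated discriminant direction
    b = (m_pos + m_neg) / 2.0    # estimated midpoint between the classes
    # Evaluate on fresh test data drawn from the true distributions.
    x_pos = rng.normal(+mu, 1.0, (n_test_per_class, dim))
    x_neg = rng.normal(-mu, 1.0, (n_test_per_class, dim))
    correct = ((x_pos - b) @ w > 0).sum() + ((x_neg - b) @ w < 0).sum()
    return correct / (2 * n_test_per_class)

for dim in (1, 3, 10, 30, 100, 300, 1000, 3000):
    acc = np.mean([trial(dim) for _ in range(30)])
    print(f"{dim:5d} features: test accuracy ~ {acc:.3f}")
# Accuracy typically improves up to a moderate number of features and then
# deteriorates as more, weakly informative features are added.
</syntaxhighlight>

Had the class means been known exactly rather than estimated, accuracy in this setting would keep improving as features are added; the eventual deterioration comes entirely from having to estimate more and more parameters from the same fixed training set.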