
==Theoretical properties==
===Computational power===
The [[multilayer perceptron]] is a universal function approximator, as proven by the [[universal approximation theorem]]. However, the proof is not constructive: it does not specify the number of neurons required, the network topology, the weights, or the learning parameters.
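
For a function of one real variable, an approximating network can be written down by hand, which gives some intuition for the theorem even though it is not the argument used in its proof. The following Python sketch builds a single-hidden-layer ReLU (rectified linear unit) network that linearly interpolates a target function at equally spaced knots. All function names are illustrative assumptions, and the construction does not carry over directly to higher dimensions.

```python
import math

def relu(z):
    return max(0.0, z)

def build_relu_interpolant(f, a, b, n_hidden):
    """Return (bias, knots, coeffs) of a one-hidden-layer ReLU network that
    linearly interpolates f at n_hidden + 1 equally spaced knots on [a, b]."""
    step = (b - a) / n_hidden
    knots = [a + i * step for i in range(n_hidden)]        # one hidden unit per knot
    values = [f(a + i * step) for i in range(n_hidden + 1)]
    slopes = [(values[i + 1] - values[i]) / step for i in range(n_hidden)]
    # The output weight of unit i is the change of slope at knot i.
    coeffs = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, n_hidden)]
    return values[0], knots, coeffs

def network(x, bias, knots, coeffs):
    """Evaluate the network: bias + sum_i coeffs[i] * relu(x - knots[i])."""
    return bias + sum(c * relu(x - k) for k, c in zip(knots, coeffs))

def max_error(f, a, b, n_hidden, n_test=1000):
    """Largest absolute error on a uniform test grid over [a, b]."""
    params = build_relu_interpolant(f, a, b, n_hidden)
    xs = [a + (b - a) * i / n_test for i in range(n_test + 1)]
    return max(abs(f(x) - network(x, *params)) for x in xs)

# Approximate sin on [0, pi] with 4, 16 and 64 hidden units.
errors = {n: max_error(math.sin, 0.0, math.pi, n) for n in (4, 16, 64)}
```

In this sketch the worst-case error falls as hidden units are added, but the number of units needed for a given accuracy depends on the target function, in line with the caveat above.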

A specific recurrent architecture with [[rational number|rational]]-valued weights (as opposed to full precision real number-valued weights) has the power of a [[universal Turing machine]],<ref>{{Cite journal | title = Turing computability with neural nets | url = http://www.math.rutgers.edu/~sontag/FTPDIR/aml-turing.pdf | year = 1991 | journal = Appl. Math. Lett. | pages = 77–80 | volume = 4 | issue = 6 | last1 = Siegelmann | first1 = H.T. | last2 = Sontag | first2 = E.D. | doi = 10.1016/0893-9659(91)90080-F | access-date = 10 January 2017 | archive-date = 19 May 2024 | archive-url = https://web.archive.org/web/20240519082138/http://www.math.rutgers.edu/~sontag/FTPDIR/aml-turing.pdf | url-status = live }}</ref> using a finite number of neurons and standard linear connections. Further, the use of [[Irrational number|irrational]] values for weights results in a machine with [[Hypercomputation|super-Turing]] power.<ref>{{cite news |title=Analog computer trumps Turing model |first=Sunny |last=Bains |date=3 November 1998 |work=EE Times |url=https://www.eetimes.com/analog-computer-trumps-turing-model/ |access-date=11 May 2023 |archive-date=11 May 2023 |archive-url=https://web.archive.org/web/20230511152308/https://www.eetimes.com/analog-computer-trumps-turing-model/ |url-status=live }}</ref><ref>{{cite journal |last1=Balcázar |first1=José |title=Computational Power of Neural Networks: A Kolmogorov Complexity Characterization |journal=IEEE Transactions on Information Theory|date=July 1997 |volume=43 |issue=4 |pages=1175–1183 |doi=10.1109/18.605580 |citeseerx=10.1.1.411.7782 }}</ref>{{Failed verification|date=May 2023}}

===Capacity===
A model's "capacity" property corresponds to its ability to model any given function. It is related to the amount of information that can be stored in the network and to the notion of complexity.
Two notions of capacity are commonly used: the information capacity and the VC dimension. The information capacity of a perceptron is discussed at length in [[David J. C. MacKay|Sir David MacKay]]'s book,<ref name="auto">{{cite book| last=MacKay| first=David J.C.| author-link=David J.C. MacKay| year=2003| publisher=[[Cambridge University Press]]| isbn=978-0-521-64298-9| title=Information Theory, Inference, and Learning Algorithms| url=http://www.inference.phy.cam.ac.uk/itprnn/book.pdf| access-date=11 June 2016| archive-date=19 October 2016| archive-url=https://web.archive.org/web/20161019163258/http://www.inference.phy.cam.ac.uk/itprnn/book.pdf| url-status=live}}</ref> which summarizes work by [[Thomas M. Cover|Thomas Cover]].<ref>{{cite journal|last=Cover|first=Thomas|author-link=Thomas M. Cover|year=1965|publisher=[[IEEE]]|url=http://www-isl.stanford.edu/people/cover/papers/paper2.pdf|title=Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition|journal=IEEE Transactions on Electronic Computers|issue=3|pages=326–334|volume=EC-14|doi=10.1109/PGEC.1965.264137|access-date=10 March 2020|archive-date=5 March 2016|archive-url=https://web.archive.org/web/20160305031348/http://www-isl.stanford.edu/people/cover/papers/paper2.pdf|url-status=live}}</ref> The capacity of a network of standard neurons (not convolutional) can be derived by four rules<ref>{{cite book| last=Friedland | first=Gerald| title=Proceedings of the 27th ACM International Conference on Multimedia| chapter=Reproducibility and Experimental Design for Machine Learning on Audio and Multimedia Data| author-link=Gerald Friedland|year=2019|publisher=[[Association for Computing Machinery|ACM]]| pages=2709–2710| doi=10.1145/3343031.3350545| isbn=978-1-4503-6889-6| s2cid=204837170}}</ref> that derive from understanding a neuron as an electrical element.
The information capacity captures the functions that the network can model given any data as input. The second notion is the [[VC dimension]], which uses the principles of [[measure theory]] and finds the maximum capacity under the best possible circumstances, that is, given input data in a specific form. As noted by MacKay,<ref name="auto"/> the VC dimension for arbitrary inputs is half the information capacity of a perceptron. The VC dimension for arbitrary points is sometimes referred to as memory capacity.<ref>{{cite web| url=http://tfmeter.icsi.berkeley.edu/| title=Stop tinkering, start measuring! Predictable experimental design of Neural Network experiments| website=The Tensorflow Meter| access-date=10 March 2020| archive-date=18 April 2022| archive-url=https://web.archive.org/web/20220418025904/http://tfmeter.icsi.berkeley.edu/| url-status=dead}}</ref>
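
The factor of two between the two notions can be illustrated with Cover's function-counting result for a single linear threshold unit. The Python sketch below assumes points in general position and a separating hyperplane through the origin (no bias term). The function names are illustrative and do not come from the cited sources.

```python
from math import comb

def separable_dichotomies(n_points, dim):
    """Cover's count of the labelings of n_points in general position in
    R^dim that a hyperplane through the origin can realize."""
    return 2 * sum(comb(n_points - 1, k) for k in range(dim))

def separable_fraction(n_points, dim):
    """Fraction of all 2**n_points labelings that are linearly separable."""
    return separable_dichotomies(n_points, dim) / 2 ** n_points

dim = 10  # number of weights of the threshold unit (assumed for illustration)
fractions = {n: separable_fraction(n, dim) for n in (5, 10, 20, 40)}
```

Under these assumptions every labeling is realizable as long as the number of points does not exceed the number of weights, and exactly half of the labelings remain realizable at twice that number, which is consistent with the factor of two noted above.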

===Convergence===
Models may not consistently converge on a single solution. First, local minima may exist, depending on the cost function and the model. Second, the optimization method used might not be guaranteed to converge when it begins far from any local minimum. Third, for sufficiently large datasets or parameter counts, some methods become impractical.
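
The dependence on the starting point can be seen in a one-dimensional toy problem. The Python sketch below runs plain gradient descent on a hand-picked non-convex function (chosen for illustration only, not taken from any cited source) from two different initial values.

```python
def loss(x):
    # Toy non-convex cost with two minima of different depth.
    return x ** 4 - 3 * x ** 2 + x

def grad(x):
    # Derivative of loss.
    return 4 * x ** 3 - 6 * x + 1

def gradient_descent(x0, lr=0.01, steps=1000):
    """Plain gradient descent with a fixed learning rate."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

left = gradient_descent(-2.0)   # ends in the deeper minimum (x < 0)
right = gradient_descent(2.0)   # ends in the shallower minimum (x > 0)
```

Both runs reach a point where the gradient vanishes, but only the run started at −2 ends in the lower of the two minima.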

Another issue is that training may pass near a [[saddle point]], where gradients become small and the optimization can stall or be led in the wrong direction.
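
A minimal Python sketch of this effect uses the illustrative function ''f''(''x'', ''y'') = ''x''<sup>2</sup> + ''y''<sup>4</sup>/4 − ''y''<sup>2</sup>/2, an assumption made for this example, which has a saddle point at the origin and minima at (0, ±1).

```python
def f(x, y):
    # Saddle point at (0, 0); minima at (0, 1) and (0, -1).
    return x ** 2 + y ** 4 / 4 - y ** 2 / 2

def grad(x, y):
    # Partial derivatives of f with respect to x and y.
    return 2 * x, y ** 3 - y

def descend(x, y, lr=0.1, steps=500):
    """Plain gradient descent with a fixed learning rate."""
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y

stalled = descend(1.0, 0.0)    # starts exactly on the line y = 0
escaped = descend(1.0, 1e-6)   # same start with a tiny perturbation in y
```

Started exactly on the line ''y'' = 0, gradient descent converges to the saddle point and stays there. The slightly perturbed run eventually leaves the saddle and reaches a minimum with a lower value of ''f''.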

The convergence behavior of certain types of ANN architectures is better understood than that of others. When the width of the network approaches infinity, the ANN is well described by its first-order [[Taylor expansion]] throughout training, and so inherits the convergence behavior of [[Linear model|affine models]].<ref>{{Cite journal|last1=Lee|first1=Jaehoon|last2=Xiao|first2=Lechao|last3=Schoenholz|first3=Samuel S.|last4=Bahri |first4=Yasaman|last5=Novak |first5=Roman|last6=Sohl-Dickstein|first6=Jascha|last7=Pennington |first7=Jeffrey|title=Wide neural networks of any depth evolve as linear models under gradient descent |journal=Journal of Statistical Mechanics: Theory and Experiment|year=2020|volume=2020|issue=12|page=124002 |doi=10.1088/1742-5468/abc62b|arxiv=1902.06720|bibcode=2020JSMTE2020l4002L|s2cid=62841516}}</ref><ref>{{cite conference |conference=32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montreal, Canada |author1=Arthur Jacot |author2=Franck Gabriel |author3=Clement Hongler |date=2018 |url=https://proceedings.neurips.cc/paper/2018/file/5a4be1fa34e62bb8a6ec6b91d2462f5a-Paper.pdf |title=Neural Tangent Kernel: Convergence and Generalization in Neural Networks |access-date=4 June 2022 |archive-date=22 June 2022 |archive-url=https://web.archive.org/web/20220622033100/https://proceedings.neurips.cc/paper/2018/file/5a4be1fa34e62bb8a6ec6b91d2462f5a-Paper.pdf |url-status=live }}</ref> As another example, when parameters are small, ANNs are often observed to fit target functions from low to high frequencies.
This behavior is referred to as the spectral bias, or frequency principle, of neural networks.<ref>{{cite book |vauthors=Xu ZJ, Zhang Y, Xiao Y |title=Neural Information Processing |date=2019 |veditors=Gedeon T, Wong K, Lee M |series=Lecture Notes in Computer Science |volume=11953 |publisher=Springer, Cham |doi=10.1007/978-3-030-36708-4_22 |chapter=Training Behavior of Deep Neural Network in Frequency Domain |pages=264–274 |arxiv=1807.01251 |isbn=978-3-030-36707-7 |s2cid=49562099 }}</ref><ref>{{cite journal |author1=Nasim Rahaman |author2=Aristide Baratin |author3=Devansh Arpit |author4=Felix Draxler |author5=Min Lin |author6=Fred Hamprecht |author7=Yoshua Bengio |author8=Aaron Courville |journal=Proceedings of the 36th International Conference on Machine Learning |volume=97 |pages=5301–5310 |date=2019 |title=On the Spectral Bias of Neural Networks |arxiv=1806.08734 |url=http://proceedings.mlr.press/v97/rahaman19a/rahaman19a.pdf |access-date=4 June 2022 |archive-date=22 October 2022 |archive-url=https://web.archive.org/web/20221022155951/http://proceedings.mlr.press/v97/rahaman19a/rahaman19a.pdf |url-status=live }}</ref><ref>{{cite journal |arxiv=1901.06523 |author1=Zhi-Qin John Xu |author2=Yaoyu Zhang |author3=Tao Luo |author4=Yanyang Xiao |author5=Zheng Ma |title=Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks|journal=Communications in Computational Physics |year=2020 |volume=28 |issue=5 |pages=1746–1767 |doi=10.4208/cicp.OA-2020-0085 |bibcode=2020CCoPh..28.1746X |s2cid=58981616 }}</ref><ref>{{cite arXiv |eprint=1906.09235 |author1=Tao Luo |author2=Zheng Ma |author3=Zhi-Qin John Xu |author4=Yaoyu Zhang |date=2019 |title=Theory of the Frequency Principle for General Deep Neural Networks|class=cs.LG }}</ref> This phenomenon is the opposite of the behavior of some well-studied iterative numerical schemes such as the [[Jacobi method]].
Deeper neural networks have been observed to be more biased towards low frequency functions.<ref>{{Cite journal|last1=Xu|first1=Zhiqin John|last2=Zhou|first2=Hanxu|title=Deep Frequency Principle Towards Understanding Why Deeper Learning is Faster |date=18 May 2021|url=https://ojs.aaai.org/index.php/AAAI/article/view/17261|journal=Proceedings of the AAAI Conference on Artificial Intelligence|volume=35|issue=12|pages=10541–10550|doi=10.1609/aaai.v35i12.17261|arxiv=2007.14313|s2cid=220831156|issn=2374-3468|access-date=5 October 2021|archive-date=5 October 2021|archive-url=https://web.archive.org/web/20211005142300/https://ojs.aaai.org/index.php/AAAI/article/view/17261|url-status=live}}</ref>
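
As a sketch of the infinite-width linearization described above, write the network output as ''f''(''x''; ''θ'') and the parameters at initialization as ''θ''<sub>0</sub>. This notation is assumed here for illustration. The first-order Taylor expansion around ''θ''<sub>0</sub> is then:

:<math>f(x;\theta) \approx f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^\top (\theta - \theta_0)</math>

The right-hand side is affine in ''θ'', which is why training in this regime behaves like training an affine model.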

===Generalization and statistics===
{{No footnotes|date=August 2019|section}}
Applications whose goal is to create a system that generalizes well to unseen examples face the possibility of [[Overfitting|over-training]]. This arises in convoluted or over-specified systems when the network capacity significantly exceeds the number of free parameters needed.

Two approaches address over-training. The first is to use [[cross-validation (statistics)|cross-validation]] and similar techniques to check for the presence of over-training and to select hyperparameters to minimize the generalization error. The second is to use some form of ''[[regularization (mathematics)|regularization]]''. This concept emerges in a probabilistic (Bayesian) framework, where regularization can be performed by selecting a larger prior probability over simpler models, and also in statistical learning theory, where the goal is to minimize two quantities: the 'empirical risk' and the 'structural risk', which roughly correspond to the error over the training set and the predicted error on unseen data due to overfitting.
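
The two approaches are often combined: a penalty term regularizes the fit, and cross-validation selects the strength of the penalty. The Python sketch below illustrates this on a deliberately over-specified polynomial model rather than a neural network. The data, the polynomial degree, the candidate penalty strengths and all function names are assumptions made for this example.

```python
import random

def features(x, degree):
    """Polynomial features 1, x, x**2, ..., x**degree."""
    return [x ** p for p in range(degree + 1)]

def solve(a, b):
    """Solve the linear system a w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (m[i][n] - sum(m[i][j] * w[j] for j in range(i + 1, n))) / m[i][i]
    return w

def fit_ridge(xs, ys, degree, lam):
    """Minimize squared error + lam * ||w||^2 (L2 penalty on all weights)."""
    rows = [features(x, degree) for x in xs]
    n = degree + 1
    gram = [[sum(r[i] * r[j] for r in rows) + (lam if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    rhs = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(n)]
    return solve(gram, rhs)

def predict(w, x):
    return sum(wi * fi for wi, fi in zip(w, features(x, len(w) - 1)))

def cv_error(xs, ys, degree, lam, folds):
    """Mean squared error on held-out folds (k-fold cross-validation)."""
    total = 0.0
    for k in range(folds):
        pairs = list(enumerate(zip(xs, ys)))
        train = [p for i, p in pairs if i % folds != k]
        held_out = [p for i, p in pairs if i % folds == k]
        w = fit_ridge([x for x, _ in train], [y for _, y in train], degree, lam)
        total += sum((predict(w, x) - y) ** 2 for x, y in held_out)
    return total / len(xs)

# Noisy samples of a straight line, fitted with a degree-9 polynomial.
rng = random.Random(0)
xs = [-1 + 2 * i / 19 for i in range(20)]
ys = [x + rng.gauss(0.0, 0.2) for x in xs]
lambdas = [1e-3, 1e-1, 10.0]
errors = {lam: cv_error(xs, ys, degree=9, lam=lam, folds=4) for lam in lambdas}
best_lam = min(errors, key=errors.get)  # penalty with the lowest held-out error
```

Larger penalties shrink the weight vector toward zero, which corresponds to a stronger preference for simpler models. The held-out error, not the training error, decides which penalty is kept.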

[[File:Synapse deployment.jpg|thumb|right|upright=1.15|Confidence analysis of a neural network]]
Supervised neural networks that use a [[mean squared error]] (MSE) cost function can use formal statistical methods to determine the confidence of the trained model. The MSE on a validation set can be used as an estimate for variance. This value can then be used to calculate the [[confidence interval]] of network output, assuming a [[normal distribution]]. A confidence analysis made this way is statistically valid as long as the output [[probability distribution]] stays the same and the network is not modified.
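
The calculation described above can be sketched in a few lines of Python. The validation values are made up for illustration, the multiplier 1.96 corresponds to an approximately 95% interval under the normality assumption, and the errors are assumed to have zero mean so that the MSE estimates their variance.

```python
import math

def validation_mse(predictions, targets):
    """Mean squared error of the network outputs on a validation set."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def confidence_interval(prediction, mse, z=1.96):
    """Interval prediction +/- z * sqrt(mse), assuming normally distributed errors."""
    half_width = z * math.sqrt(mse)
    return prediction - half_width, prediction + half_width

# Hypothetical validation outputs and targets (illustrative values only).
val_predictions = [2.0, 3.5, 5.0, 6.5]
val_targets = [2.5, 3.0, 5.5, 6.0]

mse = validation_mse(val_predictions, val_targets)   # 0.25
low, high = confidence_interval(4.0, mse)            # interval around a new output of 4.0
```

As the text notes, an interval obtained this way is only meaningful while the output distribution and the network remain unchanged.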

By assigning a [[softmax activation function]], a generalization of the [[logistic function]], to the output layer of the neural network (or a softmax component in a component-based network) for categorical target variables, the outputs can be interpreted as posterior probabilities. This is useful in classification, as it gives a certainty measure on classifications.

For ''c'' output classes with pre-activation values ''x''<sub>1</sub>, ..., ''x''<sub>''c''</sub>, the softmax activation function is:

:<math>y_i=\frac{e^{x_i}}{\sum_{j=1}^c e^{x_j}}</math>
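
A minimal Python sketch of this formula follows. Subtracting the largest input before exponentiating is a common implementation choice for numerical stability. It is not part of the formula above, but it leaves the result unchanged because the common factor cancels between numerator and denominator.

```python
import math

def softmax(xs):
    """Softmax of a list of real-valued inputs; the outputs are positive and sum to 1."""
    shift = max(xs)                            # guards against overflow in exp
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # larger inputs receive larger probabilities
```

The outputs keep the ordering of the inputs and sum to one, which is what allows them to be read as class probabilities.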
<section end="theory" />