===Convergence===
Models may not consistently converge on a single solution, firstly because local minima may exist, depending on the cost function and the model. Secondly, the optimization method used might not guarantee convergence when it begins far from any local minimum. Thirdly, for sufficiently large data or parameters, some methods become impractical. A further issue is that training may pass through a [[saddle point]], which can stall optimization or pull it in an unproductive direction. The convergence behavior of some ANN architectures is better understood than that of others. When the width of the network approaches infinity, the ANN is well described by its first-order [[Taylor expansion]] throughout training, and so it inherits the convergence behavior of [[Linear model|affine models]].<ref>{{Cite journal|last1=Lee|first1=Jaehoon|last2=Xiao|first2=Lechao|last3=Schoenholz|first3=Samuel S.|last4=Bahri |first4=Yasaman|last5=Novak |first5=Roman|last6=Sohl-Dickstein|first6=Jascha|last7=Pennington |first7=Jeffrey|title=Wide neural networks of any depth evolve as linear models under gradient descent |journal=Journal of Statistical Mechanics: Theory and Experiment|year=2020|volume=2020|issue=12|page=124002 |doi=10.1088/1742-5468/abc62b|arxiv=1902.06720|bibcode=2020JSMTE2020l4002L|s2cid=62841516}}</ref><ref>{{cite conference |conference=32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montreal, Canada |author1=Arthur Jacot |author2=Franck Gabriel |author3=Clement Hongler |date=2018 |url=https://proceedings.neurips.cc/paper/2018/file/5a4be1fa34e62bb8a6ec6b91d2462f5a-Paper.pdf |title=Neural Tangent Kernel: Convergence and Generalization in Neural Networks |access-date=4 June 2022 |archive-date=22 June 2022 |archive-url=https://web.archive.org/web/20220622033100/https://proceedings.neurips.cc/paper/2018/file/5a4be1fa34e62bb8a6ec6b91d2462f5a-Paper.pdf |url-status=live }}</ref>
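A minimal sketch (illustrative, not drawn from the cited sources) of this initialization dependence: gradient descent on a simple non-convex cost function with two local minima settles in different minima, with different final costs, depending on the starting point. The particular polynomial, learning rate, starting points and step count are arbitrary choices for illustration.

<syntaxhighlight lang="python">
# Hypothetical example: gradient descent on a one-dimensional non-convex
# "cost function" converges to different local minima depending on its
# starting point, so training need not reach a single global solution.
import numpy as np

def loss(w):
    # Non-convex cost with two local minima (near w ~ -1.30 and w ~ 1.13).
    return w**4 - 3 * w**2 + w

def grad(w):
    # Analytic derivative of the cost above.
    return 4 * w**3 - 6 * w + 1

def gradient_descent(w0, lr=0.01, steps=1000):
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w

for w0 in (-2.0, 0.0, 2.0):
    w_final = gradient_descent(w0)
    print(f"start {w0:+.1f} -> w = {w_final:+.4f}, cost = {loss(w_final):.4f}")
# The runs settle in different minima with different final costs,
# so the solution found depends on the initialization.
</syntaxhighlight>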
Another example is that when the parameters are small, ANNs are often observed to fit target functions from low to high frequencies. This behavior is referred to as the spectral bias, or frequency principle, of neural networks.<ref>{{cite book |vauthors=Xu ZJ, Zhang Y, Xiao Y |title=Neural Information Processing |date=2019 |veditors=Gedeon T, Wong K, Lee M |series=Lecture Notes in Computer Science |volume=11953 |publisher=Springer, Cham |doi=10.1007/978-3-030-36708-4_22 |chapter=Training Behavior of Deep Neural Network in Frequency Domain |pages=264–274 |arxiv=1807.01251 |isbn=978-3-030-36707-7 |s2cid=49562099 }}</ref><ref>{{cite journal |author1=Nasim Rahaman |author2=Aristide Baratin |author3=Devansh Arpit |author4=Felix Draxler |author5=Min Lin |author6=Fred Hamprecht |author7=Yoshua Bengio |author8=Aaron Courville |journal=Proceedings of the 36th International Conference on Machine Learning |volume=97 |pages=5301–5310 |date=2019 |title=On the Spectral Bias of Neural Networks |arxiv=1806.08734 |url=http://proceedings.mlr.press/v97/rahaman19a/rahaman19a.pdf |access-date=4 June 2022 |archive-date=22 October 2022 |archive-url=https://web.archive.org/web/20221022155951/http://proceedings.mlr.press/v97/rahaman19a/rahaman19a.pdf |url-status=live }}</ref><ref>{{cite journal |arxiv=1901.06523 |author1=Zhi-Qin John Xu |author2=Yaoyu Zhang |author3=Tao Luo |author4=Yanyang Xiao |author5=Zheng Ma |title=Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks |journal=Communications in Computational Physics |year=2020 |volume=28 |issue=5 |pages=1746–1767 |doi=10.4208/cicp.OA-2020-0085 |bibcode=2020CCoPh..28.1746X |s2cid=58981616 }}</ref><ref>{{cite arXiv |eprint=1906.09235 |author1=Tao Luo |author2=Zheng Ma |author3=Zhi-Qin John Xu |author4=Yaoyu Zhang |date=2019 |title=Theory of the Frequency Principle for General Deep Neural Networks |class=cs.LG }}</ref> This phenomenon is the opposite of the behavior of some well-studied iterative numerical schemes such as the [[Jacobi method]]. Deeper neural networks have been observed to be more biased towards low-frequency functions.<ref>{{Cite journal|last1=Xu|first1=Zhiqin John|last2=Zhou|first2=Hanxu|title=Deep Frequency Principle Towards Understanding Why Deeper Learning is Faster |date=18 May 2021|url=https://ojs.aaai.org/index.php/AAAI/article/view/17261|journal=Proceedings of the AAAI Conference on Artificial Intelligence|volume=35|issue=12|pages=10541–10550|doi=10.1609/aaai.v35i12.17261|arxiv=2007.14313|s2cid=220831156|issn=2374-3468|access-date=5 October 2021|archive-date=5 October 2021|archive-url=https://web.archive.org/web/20211005142300/https://ojs.aaai.org/index.php/AAAI/article/view/17261|url-status=live}}</ref>
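A minimal numerical sketch (illustrative, not taken from the cited sources) of the frequency principle: a small one-hidden-layer network is trained on a target containing one low and one high frequency, and the Fourier magnitude of the residual at each frequency is tracked during training. The width, learning rate, frequencies and step count are arbitrary assumptions for the demonstration; typically the low-frequency component of the error decays well before the high-frequency one.

<syntaxhighlight lang="python">
# Hypothetical example: track low- vs high-frequency error of a small
# tanh network trained by full-batch gradient descent on a 1-D target.
import numpy as np

rng = np.random.default_rng(0)

# Uniform grid on [0, 1) and a target with one low and one high frequency.
N, H = 256, 200
x = np.linspace(0.0, 1.0, N, endpoint=False).reshape(-1, 1)
y = np.sin(2 * np.pi * 1 * x) + np.sin(2 * np.pi * 20 * x)

# One hidden layer with tanh activation; weights start small.
W1 = rng.normal(0.0, 1.0, (1, H))
b1 = np.zeros(H)
W2 = rng.normal(0.0, 1.0 / np.sqrt(H), (H, 1))
b2 = 0.0
lr = 0.05

for step in range(1, 20001):
    # Forward pass.
    a = np.tanh(x @ W1 + b1)
    y_hat = a @ W2 + b2
    err = y_hat - y
    # Backpropagation for the (halved) mean squared error loss.
    d_out = err / N
    dW2 = a.T @ d_out
    db2 = d_out.sum()
    d_hidden = (d_out @ W2.T) * (1.0 - a**2)
    dW1 = x.T @ d_hidden
    db1 = d_hidden.sum(axis=0)
    # Gradient descent update.
    W2 -= lr * dW2
    b2 -= lr * db2
    W1 -= lr * dW1
    b1 -= lr * db1
    if step % 5000 == 0:
        # FFT index k corresponds to k cycles over [0, 1).
        spectrum = np.abs(np.fft.rfft(err.ravel())) / N
        print(f"step {step:5d}: low-freq error {spectrum[1]:.4f}, "
              f"high-freq error {spectrum[20]:.4f}")
# The k=1 component of the residual usually shrinks much earlier than the
# k=20 component, illustrating the spectral bias / frequency principle.
</syntaxhighlight>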