====2000s====
In the 2000s, DARPA sponsored two speech recognition programs: Effective Affordable Reusable Speech-to-Text (EARS) in 2002 and [[DARPA Global autonomous language exploitation program|Global Autonomous Language Exploitation]] (GALE). Four teams participated in the EARS program: [[IBM]], a team led by [[BBN Technologies|BBN]] with [[LIMSI]] and [[University of Pittsburgh|Univ. of Pittsburgh]], [[Cambridge University]], and a team composed of [[International Computer Science Institute|ICSI]], [[Stanford Research Institute|SRI]] and [[University of Washington]]. EARS funded the collection of the Switchboard telephone [[speech corpus]] containing 260 hours of recorded conversations from over 500 speakers.<ref>{{Cite web |title=Switchboard-1 Release 2 |url=https://catalog.ldc.upenn.edu/LDC97S62 |url-status=live |archive-url=https://web.archive.org/web/20170711061225/https://catalog.ldc.upenn.edu/LDC97S62 |archive-date=11 July 2017 |access-date=26 July 2017 |df=dmy-all}}</ref> The GALE program focused on [[Modern Standard Arabic|Arabic]] and [[Standard Chinese|Mandarin]] broadcast news speech. [[Google]]'s first effort at speech recognition came in 2007 after hiring some researchers from Nuance.<ref>{{Cite web |last=Jason Kincaid |date=13 February 2011 |title=The Power of Voice: A Conversation With The Head Of Google's Speech Technology |url=https://techcrunch.com/2011/02/13/the-power-of-voice-a-conversation-with-the-head-of-googles-speech-technology/ |url-status=live |archive-url=https://web.archive.org/web/20150721034447/http://techcrunch.com/2011/02/13/the-power-of-voice-a-conversation-with-the-head-of-googles-speech-technology/ |archive-date=21 July 2015 |access-date=21 July 2015 |website=Tech Crunch |df=dmy-all}}</ref> The first product was [[GOOG-411]], a telephone-based directory service. The recordings from GOOG-411 produced valuable data that helped Google improve its recognition systems. [[Google Voice Search]] is now supported in over 30 languages. In the United States, the [[National Security Agency]] has made use of a type of speech recognition for [[keyword spotting]] since at least 2006.<ref>{{Cite web |last=Froomkin |first=Dan |date=2015-05-05 |title=THE COMPUTERS ARE LISTENING |url=https://firstlook.org/theintercept/2015/05/05/nsa-speech-recognition-snowden-searchable-text/ |url-status=live |archive-url=https://web.archive.org/web/20150627185007/https://firstlook.org/theintercept/2015/05/05/nsa-speech-recognition-snowden-searchable-text/ |archive-date=27 June 2015 |access-date=20 June 2015 |website=The Intercept |df=dmy-all}}</ref> This technology allows analysts to search through large volumes of recorded conversations and isolate mentions of keywords. Recordings can be indexed, and analysts can run queries over the database to find conversations of interest. Some government research programs focused on intelligence applications of speech recognition, e.g. DARPA's EARS program and [[IARPA]]'s [[Babel program]]. In the early 2000s, speech recognition was still dominated by traditional approaches such as [[hidden Markov model]]s combined with feedforward [[artificial neural networks]].<ref name="bourlard1994">Herve Bourlard and [[Nelson Morgan]], Connectionist Speech Recognition: A Hybrid Approach, The Kluwer International Series in Engineering and Computer Science; v.
247, Boston: Kluwer Academic Publishers, 1994.</ref> Today, however, many aspects of speech recognition have been taken over by a [[deep learning]] method called [[Long short-term memory]] (LSTM), a [[recurrent neural network]] published by [[Sepp Hochreiter]] & [[Jürgen Schmidhuber]] in 1997.<ref name="lstm">{{Cite journal |last1=Sepp Hochreiter |author-link=Sepp Hochreiter |last2=J. Schmidhuber |author-link2=Jürgen Schmidhuber |year=1997 |title=Long Short-Term Memory |journal=Neural Computation |volume=9 |issue=8 |pages=1735–1780 |doi=10.1162/neco.1997.9.8.1735 |pmid=9377276 |s2cid=1915014}}</ref> LSTM RNNs avoid the [[vanishing gradient problem]] and can learn "Very Deep Learning" tasks<ref name="schmidhuber2015">{{Cite journal |last=Schmidhuber |first=Jürgen |author-link=Jürgen Schmidhuber |year=2015 |title=Deep learning in neural networks: An overview |journal=Neural Networks |volume=61 |pages=85–117 |arxiv=1404.7828 |doi=10.1016/j.neunet.2014.09.003 |pmid=25462637 |s2cid=11715509}}</ref> that require memories of events that happened thousands of discrete time steps ago, which is important for speech. Around 2007, LSTM trained by Connectionist Temporal Classification (CTC)<ref name="graves2006">Alex Graves, Santiago Fernandez, Faustino Gomez, and [[Jürgen Schmidhuber]] (2006). [https://mediatum.ub.tum.de/doc/1292048/file.pdf Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural nets] {{Webarchive|url=https://web.archive.org/web/20240909053409/https://mediatum.ub.tum.de/doc/1292048/file.pdf |date=9 September 2024 }}. Proceedings of ICML'06, pp. 369–376.</ref> started to outperform traditional speech recognition in certain applications.<ref name="fernandez2007keyword">Santiago Fernandez, Alex Graves, and Jürgen Schmidhuber (2007). [http://www6.in.tum.de/pub/Main/Publications/Fernandez2007b.pdf An application of recurrent neural networks to discriminative keyword spotting]{{Dead link|date=March 2023 |bot=InternetArchiveBot |fix-attempted=yes }}. Proceedings of ICANN (2), pp. 
220–229.</ref> In 2015, Google's speech recognition reportedly experienced a dramatic performance jump of 49% through CTC-trained LSTM, which is now available through [[Google Voice]] to all smartphone users.<ref name="sak2015">Haşim Sak, Andrew Senior, Kanishka Rao, Françoise Beaufays and Johan Schalkwyk (September 2015): "{{Cite web |title=Google voice search: faster and more accurate |url=http://googleresearch.blogspot.ch/2015/09/google-voice-search-faster-and-more.html |access-date=5 April 2016 |archive-date=9 March 2016 |archive-url=https://web.archive.org/web/20160309191532/http://googleresearch.blogspot.ch/2015/09/google-voice-search-faster-and-more.html |url-status=dead }}."</ref> [[Transformer (machine learning model)|Transformers]], a type of neural network based solely on "attention", have been widely adopted in computer vision<ref>{{Cite arXiv |eprint=2010.11929 |class=cs.CV |first1=Alexey |last1=Dosovitskiy |first2=Lucas |last2=Beyer |title=An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale |date=2021-06-03 |last3=Kolesnikov |first3=Alexander |last4=Weissenborn |first4=Dirk |last5=Zhai |first5=Xiaohua |last6=Unterthiner |first6=Thomas |last7=Dehghani |first7=Mostafa |last8=Minderer |first8=Matthias |last9=Heigold |first9=Georg |last10=Gelly |first10=Sylvain |last11=Uszkoreit |first11=Jakob |last12=Houlsby |first12=Neil}}</ref><ref>{{Cite arXiv |eprint=2103.15808 |class=cs.CV |first1=Haiping |last1=Wu |first2=Bin |last2=Xiao |title=CvT: Introducing Convolutions to Vision Transformers |date=2021-03-29 |last3=Codella |first3=Noel |last4=Liu |first4=Mengchen |last5=Dai |first5=Xiyang |last6=Yuan |first6=Lu |last7=Zhang |first7=Lei}}</ref> and language modeling,<ref>{{Cite journal |last1=Vaswani |first1=Ashish |last2=Shazeer |first2=Noam |last3=Parmar |first3=Niki |last4=Uszkoreit |first4=Jakob |last5=Jones |first5=Llion |last6=Gomez |first6=Aidan N |last7=Kaiser |first7=Łukasz |last8=Polosukhin |first8=Illia |date=2017 |title=Attention is All you Need |url=https://papers.nips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates |volume=30 |access-date=9 September 2024 |archive-date=9 September 2024 |archive-url=https://web.archive.org/web/20240909053411/https://papers.nips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html |url-status=live }}</ref><ref>{{Cite arXiv |eprint=1810.04805 |class=cs.CL |first1=Jacob |last1=Devlin |first2=Ming-Wei |last2=Chang |title=BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding |date=2019-05-24 |last3=Lee |first3=Kenton |last4=Toutanova |first4=Kristina}}</ref> sparking interest in adapting such models to new domains, including speech recognition.<ref name=":1">{{Cite arXiv |eprint=2104.01778 |class=cs.SD |first1=Yuan |last1=Gong |first2=Yu-An |last2=Chung |title=AST: Audio Spectrogram Transformer |date=2021-07-08 |last3=Glass |first3=James}}</ref><ref name=":3">{{Cite arXiv |eprint=2203.09581 |class=cs.CV |first1=Nicolae-Catalin |last1=Ristea |first2=Radu Tudor |last2=Ionescu |title=SepTr: Separable Transformer for Audio Spectrogram Processing |date=2022-06-20 |last3=Khan |first3=Fahad Shahbaz}}</ref><ref name=":4">{{Cite arXiv |eprint=2104.00120 |class=eess.AS |first1=Timo |last1=Lohrenz |first2=Zhengyang |last2=Li |title=Multi-Encoder Learning and Stream Fusion for Transformer-Based End-to-End Automatic Speech Recognition |date=2021-07-14 |last3=Fingscheidt
|first3=Tim}}</ref> Some recent papers have reported superior performance using transformer models for speech recognition, but these models usually require large-scale training datasets to reach high performance. The use of deep feedforward (non-recurrent) networks for [[acoustic model]]ing was introduced during the later part of 2009 by [[Geoffrey Hinton]] and his students at the University of Toronto and by Li Deng<ref>{{Cite web |title=Li Deng |url=https://lidengsite.wordpress.com/ |publisher=Li Deng Site |access-date=9 September 2024 |archive-date=9 September 2024 |archive-url=https://web.archive.org/web/20240909052323/https://lidengsite.wordpress.com/ |url-status=live }}</ref> and colleagues at Microsoft Research, initially in the collaborative work between Microsoft and the University of Toronto, which was subsequently expanded to include IBM and Google (hence the "The shared views of four research groups" subtitle of their 2012 review paper).<ref name="NIPS2009">NIPS Workshop: Deep Learning for Speech Recognition and Related Applications, Whistler, BC, Canada, Dec. 2009 (Organizers: Li Deng, Geoff Hinton, D. Yu).</ref><ref name=HintonDengYu2012/><ref name="ReferenceICASSP2013" /> A Microsoft research executive called this innovation "the most dramatic change in accuracy since 1979".<ref name="Scientists-see-advances">{{Cite news |last=Markoff |first=John |date=23 November 2012 |title=Scientists See Promise in Deep-Learning Programs |url=https://www.nytimes.com/2012/11/24/science/scientists-see-advances-in-deep-learning-a-part-of-artificial-intelligence.html |url-status=live |archive-url=https://web.archive.org/web/20121130080314/http://www.nytimes.com/2012/11/24/science/scientists-see-advances-in-deep-learning-a-part-of-artificial-intelligence.html |archive-date=30 November 2012 |access-date=20 January 2015 |work=New York Times |df=dmy-all}}</ref> In contrast to the steady incremental improvements of the past few decades, the application of deep learning decreased word error rate by 30%.<ref name="Scientists-see-advances" /> This innovation was quickly adopted across the field. Researchers have begun to use deep learning techniques for language modeling as well. In the long history of speech recognition, both shallow and deep forms of artificial neural networks (e.g. recurrent nets) had been explored for many years, throughout the 1980s, the 1990s and a few years into the 2000s.<ref name="Morgan1993">Morgan, Bourlard, Renals, Cohen, Franco (1993) "Hybrid neural network/hidden Markov model systems for continuous speech recognition. ICASSP/IJPRAI"</ref><ref name="Robinson1992">{{Cite book |last=T. Robinson |author-link=Tony Robinson (speech recognition) |title=[Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing |year=1992 |isbn=0-7803-0532-9 |pages=617–620 vol.1 |chapter=A real-time recurrent error propagation network word recognition system |doi=10.1109/ICASSP.1992.225833 |chapter-url=https://www.researchgate.net/publication/3532171 |s2cid=62446313}}</ref><ref name="Waibel1989">[[Alex Waibel|Waibel]], Hanazawa, Hinton, Shikano, Lang. (1989) "[http://www.inf.ufrgs.br/~engel/data/media/file/cmp121/waibel89_TDNN.pdf Phoneme recognition using time-delay neural networks] {{Webarchive|url=https://web.archive.org/web/20210225163001/http://www.inf.ufrgs.br/~engel/data/media/file/cmp121/waibel89_TDNN.pdf |date=25 February 2021 }}.
IEEE Transactions on Acoustics, Speech, and Signal Processing."</ref> But these methods never won out over the non-uniform, internally handcrafted [[Mixture model|Gaussian mixture model]]/[[hidden Markov model]] (GMM-HMM) technology, based on generative models of speech trained discriminatively.<ref name="Baker2009">{{Cite journal |last1=Baker |first1=J. |last2=Li Deng |last3=Glass |first3=J. |last4=Khudanpur |first4=S. |last5=Chin-Hui Lee |author-link5=Chin-Hui Lee |last6=Morgan |first6=N. |last7=O'Shaughnessy |first7=D. |year=2009 |title=Developments and Directions in Speech Recognition and Understanding, Part 1 |journal=IEEE Signal Processing Magazine |volume=26 |issue=3 |pages=75–80 |bibcode=2009ISPM...26...75B |doi=10.1109/MSP.2009.932166 |s2cid=357467 |hdl-access=free |hdl=1721.1/51891}}</ref> A number of key difficulties had been methodologically analyzed in the 1990s, including diminishing gradients<ref name="hochreiter1991">[[Sepp Hochreiter]] (1991), [http://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf Untersuchungen zu dynamischen neuronalen Netzen] {{webarchive|url=https://web.archive.org/web/20150306075401/http://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf |date=6 March 2015 }}, Diploma thesis. Institut f. Informatik, Technische Univ. Munich. Advisor: J. Schmidhuber.</ref> and weak temporal correlation structure in the neural predictive models.<ref name="Bengio1991">{{Cite thesis |last=Bengio |first=Y. |title=Artificial Neural Networks and their Application to Speech/Sequence Recognition |degree=Ph.D. |publisher=McGill University |url=https://elibrary.ru/item.asp?id=5790854 |year=1991}}</ref><ref name="Deng1994">{{Cite journal |last1=Deng |first1=L. |last2=Hassanein |first2=K. |last3=Elmasry |first3=M. |year=1994 |title=Analysis of the correlation structure for a neural predictive model with application to speech recognition |journal=Neural Networks |volume=7 |issue=2 |pages=331–339 |doi=10.1016/0893-6080(94)90027-2}}</ref> These difficulties were in addition to the lack of large training data and computing power in those early days. Most speech recognition researchers who understood such barriers consequently moved away from neural nets to pursue generative modeling approaches, until the resurgence of deep learning starting around 2009–2010 overcame these difficulties. Hinton et al. and Deng et al. reviewed part of this recent history, describing how their collaboration with each other, and then with colleagues across four groups (University of Toronto, Microsoft, Google, and IBM), ignited a renaissance of applications of deep feedforward neural networks for speech recognition.<ref name="HintonDengYu2012">{{Cite journal |last1=Hinton |first1=Geoffrey |last2=Deng |first2=Li |last3=Yu |first3=Dong |last4=Dahl |first4=George |last5=Mohamed |first5=Abdel-Rahman |last6=Jaitly |first6=Navdeep |last7=Senior |first7=Andrew |last8=Vanhoucke |first8=Vincent |last9=Nguyen |first9=Patrick |last10=Sainath |first10=Tara |author-link10=Tara Sainath |last11=Kingsbury |first11=Brian |year=2012 |title=Deep Neural Networks for Acoustic Modeling in Speech Recognition: The shared views of four research groups |journal=IEEE Signal Processing Magazine |volume=29 |issue=6 |pages=82–97 |bibcode=2012ISPM...29...82H |doi=10.1109/MSP.2012.2205597 |s2cid=206485943}}</ref><ref name="ReferenceICASSP2013">{{Cite book |last1=Deng |first1=L.
|title=2013 IEEE International Conference on Acoustics, Speech and Signal Processing: New types of deep neural network learning for speech recognition and related applications: An overview |last2=Hinton |first2=G. |last3=Kingsbury |first3=B. |date=2013 |isbn=978-1-4799-0356-6 |pages=8599 |chapter=New types of deep neural network learning for speech recognition and related applications: An overview |doi=10.1109/ICASSP.2013.6639344 |s2cid=13953660}}</ref><ref name="HintonKeynoteICASSP2013">Keynote talk: Recent Developments in Deep Neural Networks. ICASSP, 2013 (by Geoff Hinton).</ref><ref name="interspeech2014Keynote">Keynote talk: "[https://www.isca-speech.org/archive/interspeech_2014/i14_3505.html Achievements and Challenges of Deep Learning: From Speech Analysis and Recognition To Language and Multimodal Processing] {{Webarchive|url=https://web.archive.org/web/20210305043518/https://www.isca-speech.org/archive/interspeech_2014/i14_3505.html|date=5 March 2021}}," Interspeech, September 2014 (by [[Li Deng]]).</ref>