=== End-to-end automatic speech recognition ===
Since 2014, there has been much research interest in "end-to-end" ASR. Traditional phonetic-based (i.e., all [[Hidden Markov model|HMM]]-based) approaches required separate components and training for the pronunciation, acoustic, and [[language model]]s. End-to-end models jointly learn all the components of the speech recognizer, which simplifies both training and deployment. For example, an [[N-gram|n-gram language model]] is required for all HMM-based systems, and a typical n-gram language model often takes several gigabytes of memory, making it impractical to deploy on mobile devices.<ref>{{Cite book |last=Jurafsky |first=Daniel |title=Speech and Language Processing |year=2016}}</ref> Consequently, modern commercial ASR systems from [[Google]] and [[Apple Inc.|Apple]] ({{as of|2017|lc=y}}) are deployed in the cloud and require a network connection, as opposed to running locally on the device.

The first attempt at end-to-end ASR was with [[Connectionist temporal classification|Connectionist Temporal Classification]] (CTC)-based systems introduced by [[Alex Graves (computer scientist)|Alex Graves]] of [[DeepMind|Google DeepMind]] and Navdeep Jaitly of the [[University of Toronto]] in 2014.<ref>{{Cite journal |last=Graves |first=Alex |year=2014 |title=Towards End-to-End Speech Recognition with Recurrent Neural Networks |url=http://www.jmlr.org/proceedings/papers/v32/graves14.pdf |url-status=dead |journal=ICML |archive-url=https://web.archive.org/web/20170110184531/http://jmlr.org/proceedings/papers/v32/graves14.pdf |archive-date=10 January 2017 |access-date=22 July 2019}}</ref> The model consisted of [[recurrent neural network]]s and a CTC layer. The RNN-CTC model learns the pronunciation and acoustic model jointly, but it is incapable of learning the language model because of [[conditional independence]] assumptions similar to those of an HMM. Consequently, CTC models can directly learn to map speech acoustics to English characters, but they make many common spelling mistakes and must rely on a separate language model to clean up the transcripts. Later, [[Baidu]] expanded on the work with extremely large datasets and demonstrated some commercial success in Mandarin Chinese and English.<ref>{{Cite arXiv |eprint=1512.02595 |class=cs.CL |first=Dario |last=Amodei |title=Deep Speech 2: End-to-End Speech Recognition in English and Mandarin |year=2016}}</ref> In 2016, the [[University of Oxford]] presented [[LipNet]],<ref>{{Cite web |date=4 November 2016 |title=LipNet: How easy do you think lipreading is? |url=https://www.youtube.com/watch?v=fa5QGremQf8 |url-status=live |archive-url=https://web.archive.org/web/20170427104009/https://www.youtube.com/watch?v=fa5QGremQf8 |archive-date=27 April 2017 |access-date=5 May 2017 |website=YouTube |df=dmy-all}}</ref> the first end-to-end sentence-level lipreading model, which used spatiotemporal convolutions coupled with an RNN-CTC architecture and surpassed human-level performance on a restricted-grammar dataset.<ref>{{Cite arXiv |eprint=1611.01599 |class=cs.CV |first1=Yannis |last1=Assael |first2=Brendan |last2=Shillingford |title=LipNet: End-to-End Sentence-level Lipreading |date=5 November 2016 |last3=Whiteson |first3=Shimon |last4=de Freitas |first4=Nando}}</ref> A large-scale CNN-RNN-CTC architecture was presented in 2018 by [[DeepMind|Google DeepMind]], achieving performance six times better than that of human experts.<ref name=":0">{{Cite arXiv |eprint=1807.05162 |class=cs.CV |first1=Brendan |last1=Shillingford |first2=Yannis |last2=Assael |title=Large-Scale Visual Speech Recognition |date=2018-07-13 |last3=Hoffman |first3=Matthew W. |last4=Paine |first4=Thomas |last5=Hughes |first5=Cían |last6=Prabhu |first6=Utsav |last7=Liao |first7=Hank |last8=Sak |first8=Hasim |last9=Rao |first9=Kanishka}}</ref> In 2019, [[Nvidia]] launched two CNN-CTC ASR models, Jasper and QuartzNet, with an overall word error rate (WER) of 3%.<ref>{{Cite book |last1=Li |first1=Jason |last2=Lavrukhin |first2=Vitaly |last3=Ginsburg |first3=Boris |last4=Leary |first4=Ryan |last5=Kuchaiev |first5=Oleksii |last6=Cohen |first6=Jonathan M. |last7=Nguyen |first7=Huyen |last8=Gadde |first8=Ravi Teja |title=Interspeech 2019 |date=2019 |chapter=Jasper: An End-to-End Convolutional Neural Acoustic Model |chapter-url=https://www.isca-archive.org/interspeech_2019/li19_interspeech.html |pages=71–75 |doi=10.21437/Interspeech.2019-1819 |arxiv=1904.03288}}</ref><ref>{{Citation |last1=Kriman |first1=Samuel |title=QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions |date=2019-10-22 |arxiv=1910.10261 |last2=Beliaev |first2=Stanislav |last3=Ginsburg |first3=Boris |last4=Huang |first4=Jocelyn |last5=Kuchaiev |first5=Oleksii |last6=Lavrukhin |first6=Vitaly |last7=Leary |first7=Ryan |last8=Li |first8=Jason |last9=Zhang |first9=Yang}}</ref>
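The CTC approach can be illustrated with a minimal sketch, not drawn from any of the published systems above: an encoder network maps acoustic feature frames to per-frame character distributions, and the CTC loss marginalises over all frame-to-character alignments. The model size, feature dimension, and dummy data below are illustrative assumptions.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

# Illustrative RNN-CTC acoustic model: feature frames -> per-frame character
# distributions. All dimensions are arbitrary; real systems are far larger.
NUM_CHARS = 29          # 26 letters + space + apostrophe + CTC "blank" symbol
FEAT_DIM = 80           # e.g. log-mel filterbank features per frame

class RNNCTCModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.LSTM(FEAT_DIM, 256, num_layers=3,
                               batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * 256, NUM_CHARS)

    def forward(self, feats):                      # feats: (batch, time, FEAT_DIM)
        hidden, _ = self.encoder(feats)
        return self.classifier(hidden).log_softmax(dim=-1)  # (batch, time, chars)

model = RNNCTCModel()
ctc_loss = nn.CTCLoss(blank=0)                     # index 0 reserved for the blank label

feats = torch.randn(4, 200, FEAT_DIM)              # dummy batch: 4 utterances, 200 frames
targets = torch.randint(1, NUM_CHARS, (4, 30))     # dummy character transcripts
input_lengths = torch.full((4,), 200)
target_lengths = torch.full((4,), 30)

log_probs = model(feats).transpose(0, 1)           # CTCLoss expects (time, batch, chars)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()                                    # sums over all possible alignments
</syntaxhighlight>

Decoding then either takes the most probable character at each frame (greedy decoding) or combines the network's outputs with an external language model, as discussed above.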
Similar to other deep learning applications, [[transfer learning]] and [[domain adaptation]] are important strategies for reusing and extending the capabilities of deep learning models, particularly because of the high cost of training models from scratch and the small size of the available corpora in many languages and specific domains.<ref>{{Cite journal |last1=Medeiros |first1=Eduardo |last2=Corado |first2=Leonel |last3=Rato |first3=Luís |last4=Quaresma |first4=Paulo |last5=Salgueiro |first5=Pedro |date=May 2023 |title=Domain Adaptation Speech-to-Text for Low-Resource European Portuguese Using Deep Learning |journal=Future Internet |language=en |volume=15 |issue=5 |pages=159 |doi=10.3390/fi15050159 |doi-access=free |issn=1999-5903}}</ref><ref>{{Cite journal |last1=Joshi |first1=Raviraj |last2=Singh |first2=Anupam |date=May 2022 |editor-last=Malmasi |editor-first=Shervin |editor2-last=Rokhlenko |editor2-first=Oleg |editor3-last=Ueffing |editor3-first=Nicola |editor4-last=Guy |editor4-first=Ido |editor5-last=Agichtein |editor5-first=Eugene |editor6-last=Kallumadi |editor6-first=Surya |title=A Simple Baseline for Domain Adaptation in End to End ASR Systems Using Synthetic Data |url=https://aclanthology.org/2022.ecnlp-1.28/ |journal=Proceedings of the Fifth Workshop on E-Commerce and NLP (ECNLP 5) |location=Dublin, Ireland |publisher=Association for Computational Linguistics |pages=244–249 |doi=10.18653/v1/2022.ecnlp-1.28 |arxiv=2206.13240}}</ref><ref>{{Cite book |last1=Sukhadia |first1=Vrunda N. |last2=Umesh |first2=S. |chapter=Domain Adaptation of Low-Resource Target-Domain Models Using Well-Trained ASR Conformer Models |date=2023-01-09 |title=2022 IEEE Spoken Language Technology Workshop (SLT) |chapter-url=https://ieeexplore.ieee.org/document/10023233 |publisher=IEEE |pages=295–301 |doi=10.1109/SLT54892.2023.10023233 |arxiv=2202.09167 |isbn=979-8-3503-9690-4}}</ref>
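As an illustration of this transfer-learning strategy, the following sketch fine-tunes a publicly available pretrained English CTC model on a single in-domain example; the checkpoint name, transcript, and hyper-parameters are placeholders and do not reproduce the methods of the studies cited above.

<syntaxhighlight lang="python">
# Hypothetical domain adaptation of a pretrained CTC model (placeholder example).
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.freeze_feature_encoder()        # keep the low-level acoustic front end fixed

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # small LR for adaptation

# Stand-in for one in-domain training example: 1 s of 16 kHz audio plus a transcript.
audio = torch.randn(16000).numpy()
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
labels = processor.tokenizer("PATIENT REPORTS MILD CHEST PAIN",
                             return_tensors="pt").input_ids

model.train()
optimizer.zero_grad()
loss = model(inputs.input_values, labels=labels).loss   # built-in CTC loss
loss.backward()
optimizer.step()
</syntaxhighlight>

Keeping the lower layers frozen and the learning rate small is a common way to adapt to a new domain without discarding what was learned from the much larger source-domain corpus.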
An alternative approach to CTC-based models is attention-based models, introduced simultaneously by Chan et al. of [[Carnegie Mellon University]] and [[Google Brain]] and by Bahdanau et al. of the [[Université de Montréal|University of Montreal]] in 2016.<ref>{{Cite journal |last1=Chan |first1=William |last2=Jaitly |first2=Navdeep |last3=Le |first3=Quoc |last4=Vinyals |first4=Oriol |year=2016 |title=Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition |url=https://storage.googleapis.com/pub-tools-public-publication-data/pdf/44926.pdf |journal=ICASSP |access-date=9 September 2024 |archive-date=9 September 2024 |archive-url=https://web.archive.org/web/20240909053931/https://storage.googleapis.com/pub-tools-public-publication-data/pdf/44926.pdf |url-status=live}}</ref><ref>{{Cite arXiv |eprint=1508.04395 |class=cs.CL |first=Dzmitry |last=Bahdanau |title=End-to-End Attention-based Large Vocabulary Speech Recognition |year=2016}}</ref> The model, named "Listen, Attend and Spell" (LAS), literally "listens" to the acoustic signal, pays "attention" to different parts of the signal, and "spells" out the transcript one character at a time. Unlike CTC-based models, attention-based models do not have conditional-independence assumptions and can directly learn all the components of a speech recognizer, including the pronunciation, acoustic, and language models. This means that, during deployment, there is no need to carry around a language model, making such systems very practical for applications with limited memory. By the end of 2016, attention-based models had seen considerable success, including outperforming CTC models (with or without an external language model).<ref>{{Cite arXiv |eprint=1612.02695 |class=cs.NE |first1=Jan |last1=Chorowski |first2=Navdeep |last2=Jaitly |title=Towards better decoding and language model integration in sequence to sequence models |date=8 December 2016}}</ref>

Various extensions have been proposed since the original LAS model. Latent Sequence Decompositions (LSD) was proposed by [[Carnegie Mellon University]], [[Massachusetts Institute of Technology|MIT]] and [[Google Brain]] to directly emit sub-word units, which are more natural than English characters;<ref>{{Cite arXiv |eprint=1610.03035 |class=stat.ML |first1=William |last1=Chan |first2=Yu |last2=Zhang |title=Latent Sequence Decompositions |date=10 October 2016 |last3=Le |first3=Quoc |last4=Jaitly |first4=Navdeep}}</ref> the [[University of Oxford]] and [[DeepMind|Google DeepMind]] extended LAS to "Watch, Listen, Attend and Spell" (WLAS) to handle lip reading, surpassing human-level performance.<ref>{{Cite book |last1=Chung |first1=Joon Son |title=2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) |last2=Senior |first2=Andrew |last3=Vinyals |first3=Oriol |last4=Zisserman |first4=Andrew |date=16 November 2016 |isbn=978-1-5386-0457-1 |pages=3444–3453 |chapter=Lip Reading Sentences in the Wild |doi=10.1109/CVPR.2017.367 |arxiv=1611.05358 |s2cid=1662180}}</ref>
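The attention-based ("listen, attend and spell") approach described above can be sketched as follows: an encoder ("listener") turns acoustic frames into hidden states, and a decoder ("speller") emits one character per step while attending over the encoder outputs. All module sizes, the vocabulary, and the greedy decoding loop are illustrative assumptions, not the published LAS architecture.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy attention-based encoder-decoder; sizes are arbitrary and far smaller
# than published models.
FEAT_DIM, HID, NUM_CHARS = 80, 256, 30   # 30 = characters incl. <sos>/<eos>

class Listener(nn.Module):               # "listen": acoustic frames -> hidden states
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(FEAT_DIM, HID, batch_first=True, bidirectional=True)

    def forward(self, feats):             # (batch, time, FEAT_DIM)
        out, _ = self.rnn(feats)
        return out                        # (batch, time, 2*HID)

class Speller(nn.Module):                 # "attend and spell": one character per step
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(NUM_CHARS, HID)
        self.rnn = nn.LSTMCell(HID + 2 * HID, HID)
        self.query = nn.Linear(HID, 2 * HID)
        self.out = nn.Linear(HID + 2 * HID, NUM_CHARS)

    def step(self, prev_char, state, context, enc):
        h, c = self.rnn(torch.cat([self.embed(prev_char), context], dim=-1), state)
        # Dot-product attention over all encoder time steps.
        scores = torch.bmm(enc, self.query(h).unsqueeze(-1)).squeeze(-1)
        weights = F.softmax(scores, dim=-1)                        # (batch, time)
        context = torch.bmm(weights.unsqueeze(1), enc).squeeze(1)  # (batch, 2*HID)
        logits = self.out(torch.cat([h, context], dim=-1))
        return logits, (h, c), context

# Greedy decoding of one dummy utterance, character by character.
listener, speller = Listener(), Speller()
enc = listener(torch.randn(1, 200, FEAT_DIM))
state = (torch.zeros(1, HID), torch.zeros(1, HID))
context = torch.zeros(1, 2 * HID)
char = torch.tensor([1])                 # assume index 1 is the <sos> symbol
for _ in range(20):
    logits, state, context = speller.step(char, state, context, enc)
    char = logits.argmax(dim=-1)         # each step conditions on the previous output
</syntaxhighlight>

Because each character is predicted conditioned on the characters already emitted, the decoder implicitly plays the role of a language model, which is why no external language model is strictly required at deployment time.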