{{short description|Automatic conversion of spoken language into text}}
{{for|the human linguistic concept|Speech perception}}
{{Use dmy dates|date=February 2017}}
'''Speech recognition''' is an [[interdisciplinary]] subfield of [[computer science]] and [[computational linguistics]] that develops [[Methodology|methodologies]] and technologies that enable the recognition and [[translation]] of spoken language into text by computers. It is also known as '''automatic speech recognition''' ('''ASR'''), '''computer speech recognition''' or '''speech-to-text''' ('''STT'''). It incorporates knowledge and research in the [[computer science]], [[linguistics]] and [[computer engineering]] fields. The reverse process is [[speech synthesis]].

Some speech recognition systems require "training" (also called "enrollment"), in which an individual speaker reads text or isolated [[vocabulary]] into the system. The system analyzes the person's specific voice and uses it to fine-tune the recognition of that person's speech, resulting in increased accuracy. Systems that do not use training are called "speaker-independent"<ref>{{Cite web |title=Speaker Independent Connected Speech Recognition- Fifth Generation Computer Corporation |url=http://www.fifthgen.com/speaker-independent-connected-s-r.htm |url-status=live |archive-url=https://web.archive.org/web/20131111101228/http://www.fifthgen.com/speaker-independent-connected-s-r.htm |archive-date=11 November 2013 |access-date=15 June 2013 |publisher=Fifthgen.com |df=dmy-all}}</ref> systems. Systems that use training are called "speaker-dependent".

Speech recognition applications include [[voice user interface]]s such as voice dialing (e.g., "call home"), call routing (e.g., "I would like to make a collect call"), [[domotic]] appliance control, keyword search (e.g., finding a podcast where particular words were spoken), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g., a radiology report), determining speaker characteristics,<ref>{{Cite book |last=P. Nguyen |title=International Conference on Communications and Electronics 2010 |date=2010 |isbn=978-1-4244-7055-6 |pages=147–152 |chapter=Automatic classification of speaker characteristics |doi=10.1109/ICCE.2010.5670700 |s2cid=13482115}}</ref> speech-to-text processing (e.g., [[word processor]]s or [[email]]s), and use in [[aircraft]] (usually termed [[direct voice input]]). Automatic [[pronunciation assessment]] is used in education, for example in spoken language learning.

{{anchor|vs_voice_rec}}The term ''voice recognition''<ref name="Macmillan Brit. def of voice recognition">{{Cite web |title=British English definition of voice recognition |url=http://www.macmillandictionary.com/dictionary/british/voice-recognition |url-status=live |archive-url=https://web.archive.org/web/20110916050430/http://www.macmillandictionary.com/dictionary/british/voice-recognition |archive-date=16 September 2011 |access-date=21 February 2012 |publisher=Macmillan Publishers Limited. |df=dmy-all}}</ref><ref name="Voice rec, definition">{{Cite web |title=voice recognition, definition of |url=http://www.businessdictionary.com/definition/voice-recognition.html |url-status=live |archive-url=https://web.archive.org/web/20111203144647/http://www.businessdictionary.com/definition/voice-recognition.html |archive-date=3 December 2011 |access-date=21 February 2012 |publisher=WebFinance, Inc |df=dmy-all}}</ref><ref name="mail bag, gazette">{{Cite web |title=The Mailbag LG #114 |url=http://linuxgazette.net/114/lg_mail.html#mailbag.3 |url-status=live |archive-url=https://web.archive.org/web/20130219032501/http://linuxgazette.net/114/lg_mail.html#mailbag.3 |archive-date=19 February 2013 |access-date=15 June 2013 |publisher=Linuxgazette.net |df=dmy-all}}</ref> or ''[[Speaker recognition|speaker identification]]''<ref>{{Cite journal |last1=Sarangi |first1=Susanta |last2=Sahidullah, Md |last3=Saha, Goutam |date=September 2020 |title=Optimization of data-driven filterbank for automatic speaker verification |journal=Digital Signal Processing |volume=104 |page=102795 |arxiv=2007.10729 |bibcode=2020DSP...10402795S |doi=10.1016/j.dsp.2020.102795 |s2cid=220665533}}</ref><ref>{{Cite journal |last1=Reynolds |first1=Douglas |last2=Rose |first2=Richard |date=January 1995 |title=Robust text-independent speaker identification using Gaussian mixture speaker models |url=http://www.cs.toronto.edu/~frank/csc401/readings/ReynoldsRose.pdf |url-status=live |journal=IEEE Transactions on Speech and Audio Processing |volume=3 |issue=1 |pages=72–83 |doi=10.1109/89.365379 |issn=1063-6676 |oclc=26108901 |s2cid=7319345 |archive-url=https://web.archive.org/web/20140308001101/http://www.cs.toronto.edu/~frank/csc401/readings/ReynoldsRose.pdf |archive-date=8 March 2014 |access-date=21 February 2014 |df=dmy-all}}</ref><ref>{{Cite web |title=Speaker Identification (WhisperID) |url=http://research.microsoft.com/en-us/projects/whisperid/ |url-status=live |archive-url=https://web.archive.org/web/20140225190956/http://research.microsoft.com/en-us/projects/whisperid/ |archive-date=25 February 2014 |access-date=21 February 2014 |website=Microsoft Research |publisher=Microsoft |quote=When you speak to someone, they don't just recognize what you say: they recognize who you are. WhisperID will let computers do that, too, figuring out who you are by the way you sound. |df=dmy-all}}</ref> refers to identifying the speaker, rather than what they are saying. [[Speaker recognition|Recognizing the speaker]] can simplify the task of [[speech translation|translating speech]] in systems that have been trained on a specific person's voice or it can be used to [[Authentication|authenticate]] or verify the identity of a speaker as part of a security process.

From a technological perspective, speech recognition has a long history marked by several waves of major innovation. Most recently, the field has benefited from advances in [[deep learning]] and [[big data]]. These advances are evidenced not only by a surge of academic papers published in the field but, more importantly, by worldwide industry adoption of a variety of deep learning methods for designing and deploying speech recognition systems.