== History ==
Early attempts at speech processing and recognition focused primarily on understanding a handful of simple [[Phonetics|phonetic]] elements such as vowels. Pioneering work in the field of speech recognition based on analysis of the speech spectrum was reported in the 1940s.<ref>{{Cite book| publisher = Energiya| last1 = Myasnikov| first1 = L. L.| last2 = Myasnikova| first2 = Ye. N.| title = Automatic recognition of sound patterns| location = Leningrad| date = 1970|language=ru}}</ref> In 1952, three researchers at Bell Labs, Stephen Balashek, R. Biddulph, and K. H. Davis, developed a system that could recognize digits spoken by a single speaker.<ref>{{Citation|last1=Juang|first1=B.-H.|title=Speech Recognition, Automatic: History|date=2006|encyclopedia=Encyclopedia of Language & Linguistics|pages=806–819|publisher=Elsevier|isbn=9780080448541|last2=Rabiner|first2=L.R.|doi=10.1016/b0-08-044854-2/00906-8}}</ref>

[[Linear predictive coding]] (LPC), a speech processing algorithm, was first proposed by [[Fumitada Itakura]] of [[Nagoya University]] and Shuzo Saito of [[Nippon Telegraph and Telephone]] (NTT) in 1966.<ref name="Gray">{{cite journal |last1=Gray |first1=Robert M. |title=A History of Realtime Digital Speech on Packet Networks: Part II of Linear Predictive Coding and the Internet Protocol |journal=Found. Trends Signal Process. |date=2010 |volume=3 |issue=4 |pages=203–303 |doi=10.1561/2000000036 |url=https://ee.stanford.edu/~gray/lpcip.pdf |issn=1932-8346|doi-access=free }}</ref> Further developments in LPC technology were made by [[Bishnu S. Atal]] and [[Manfred R. Schroeder]] at [[Bell Labs]] during the 1970s.<ref name="Gray"/> LPC became the basis for [[voice-over-IP]] (VoIP) technology,<ref name="Gray"/> as well as for [[speech synthesizer]] chips such as the [[Texas Instruments LPC Speech Chips]] used in the [[Speak & Spell (toy)|Speak & Spell]] toys introduced in 1978.<ref name="vintagecomputing_article">{{cite web|url=http://www.vintagecomputing.com/index.php/archives/528|title=VC&G Interview: 30 Years Later, Richard Wiggins Talks Speak & Spell Development}}</ref>

One of the first commercially available speech recognition products was Dragon Dictate, released in 1990. In 1992, technology developed by [[Lawrence Rabiner]] and others at Bell Labs was used by [[AT&T]] in its Voice Recognition Call Processing service to route telephone calls without a human operator.
By this point, the vocabulary of these systems was larger than the average human vocabulary.<ref>{{Cite journal|last1=Huang|first1=Xuedong|last2=Baker|first2=James|last3=Reddy|first3=Raj|date=2014-01-01|title=A historical perspective of speech recognition|journal=Communications of the ACM|volume=57|issue=1|pages=94–103|doi=10.1145/2500887|s2cid=6175701 |issn=0001-0782}}</ref>

By the early 2000s, the dominant speech processing strategy began to shift away from [[Hidden Markov model|hidden Markov models]] (HMMs) towards [[Artificial neural network|neural networks]] and [[deep learning]].<ref>{{Cite journal |last=Furui |first=Sadaoki |date=2005 |title=50 Years of Progress in Speech and Speaker Recognition Research |journal=ECTI Transactions on Computer and Information Technology |language=en |volume=1 |issue=2 |pages=64–74 |doi=10.37936/ecti-cit.200512.51834 |issn=2286-9131|doi-access=free }}</ref> In 2012, [[Geoffrey Hinton]] and his team at the [[University of Toronto]] demonstrated that deep neural networks could significantly outperform traditional HMM-based systems on large-vocabulary continuous speech recognition tasks, a breakthrough that led to widespread adoption of deep learning techniques in the industry.<ref name=":0">{{Cite web |title=Deep Neural Networks for Acoustic Modeling in Speech Recognition |url=https://www.cs.toronto.edu/~hinton/absps/DNN-2012-proof.pdf |access-date=2024-11-05}}</ref><ref>{{Cite web |title=Speech Recognition with Deep Recurrent Neural Networks |url=https://www.cs.toronto.edu/~hinton/absps/DRNN_speech.pdf |access-date=2024-11-05}}</ref>

By the mid-2010s, companies such as [[Google]], [[Microsoft]], [[Amazon (company)|Amazon]], and [[Apple Inc.|Apple]] had integrated advanced speech recognition systems into their virtual assistants, including [[Google Assistant]], [[Cortana (virtual assistant)|Cortana]], [[Amazon Alexa|Alexa]], and [[Siri]].<ref>{{Cite journal |last=Hoy |first=Matthew B. |date=2018 |title=Alexa, Siri, Cortana, and More: An Introduction to Voice Assistants |url=https://pubmed.ncbi.nlm.nih.gov/29327988/ |journal=Medical Reference Services Quarterly |volume=37 |issue=1 |pages=81–88 |doi=10.1080/02763869.2018.1404391 |issn=1540-9597 |pmid=29327988}}</ref> These systems used deep learning models to provide more natural and accurate voice interactions. The development of Transformer-based models, such as Google's BERT (Bidirectional Encoder Representations from Transformers) and OpenAI's GPT (Generative Pre-trained Transformer), further pushed the boundaries of natural language processing and speech recognition, enabling more context-aware and semantically rich understanding of speech.<ref>{{Cite web |title=Vbee |url=https://vbee.vn |access-date=2024-11-05 |website=vbee.vn |language=vi}}</ref><ref name=":0" />

In recent years, end-to-end speech recognition models have gained popularity. These models simplify the speech recognition pipeline by mapping audio input directly to text output, bypassing intermediate stages such as separate feature extraction and acoustic modeling, which has streamlined development and improved performance.<ref>{{Cite book |last=Hagiwara |first=Masato |url=https://books.google.com/books?id=Ye9MEAAAQBAJ |title=Real-World Natural Language Processing: Practical applications with deep learning |date=2021-12-21 |publisher=Simon and Schuster |isbn=978-1-63835-039-2 |language=en}}</ref>