{{Short description|Study of speech signals and the processing methods of these signals}}
{{about|electronic speech processing|speech processing in the human brain|Language processing in the brain}}
'''Speech processing''' is the study of [[speech communication|speech]] [[signal (information theory)|signals]] and the processing methods of these signals. The signals are usually processed in a [[digital data|digital]] representation, so speech processing can be regarded as a special case of [[digital signal processing]] applied to [[audio signal|speech signals]]. Aspects of speech processing include the acquisition, manipulation, storage, transfer and output of speech signals. Speech processing tasks include [[speech recognition]], [[speech synthesis]], [[speaker diarization]], [[speech enhancement]], and [[speaker recognition]].<ref>{{cite arXiv |last1=Sahidullah |first1=Md |last2=Patino |first2=Jose |last3=Cornell |first3=Samuele |last4=Yin |first4=Ruiking |last5=Sivasankaran |first5=Sunit |last6=Bredin |first6=Herve |last7=Korshunov |first7=Pavel |last8=Brutti |first8=Alessio |last9=Serizel |first9=Romain |last10=Vincent |first10=Emmanuel |last11=Evans |first11=Nicholas |last12=Marcel |first12=Sebastien |last13=Squartini |first13=Stefano |last14=Barras |first14=Claude |date=2019-11-06 |title=The Speed Submission to DIHARD II: Contributions & Lessons Learned |class=eess.AS |eprint=1911.02388 }}</ref>

== History ==
Early attempts at speech processing and recognition were primarily focused on understanding a handful of simple [[Phonetics|phonetic]] elements such as vowels. In 1952, three researchers at Bell Labs, Stephen Balashek, R. Biddulph, and K. H. Davis, developed a system that could recognize digits spoken by a single speaker.<ref>{{Citation|last1=Juang|first1=B.-H.|last2=Rabiner|first2=L.R.|title=Speech Recognition, Automatic: History|date=2006|encyclopedia=Encyclopedia of Language & Linguistics|pages=806–819|publisher=Elsevier|isbn=9780080448541|doi=10.1016/b0-08-044854-2/00906-8}}</ref> Pioneering work in the field of speech recognition based on analysis of the speech spectrum had been reported in the 1940s.<ref>{{Cite book| publisher = Energiya| last1 = Myasnikov| first1 = L. L.| last2 = Myasnikova| first2 = Ye. N.| title = Automatic recognition of sound pattern| location = Leningrad| date = 1970|language=ru}}</ref>

[[Linear predictive coding]] (LPC), a speech processing algorithm, was first proposed by [[Fumitada Itakura]] of [[Nagoya University]] and Shuzo Saito of [[Nippon Telegraph and Telephone]] (NTT) in 1966.<ref name="Gray">{{cite journal |last1=Gray |first1=Robert M. |title=A History of Realtime Digital Speech on Packet Networks: Part II of Linear Predictive Coding and the Internet Protocol |journal=Found. Trends Signal Process. |date=2010 |volume=3 |issue=4 |pages=203–303 |doi=10.1561/2000000036 |url=https://ee.stanford.edu/~gray/lpcip.pdf |issn=1932-8346|doi-access=free }}</ref> Further developments in LPC technology were made by [[Bishnu S. Atal]] and [[Manfred R. Schroeder]] at [[Bell Labs]] during the 1970s.<ref name="Gray"/> LPC was the basis for [[voice-over-IP]] (VoIP) technology,<ref name="Gray"/> as well as for [[speech synthesizer]] chips such as the [[Texas Instruments LPC Speech Chips]] used in the [[Speak & Spell (toy)|Speak & Spell]] toys from 1978.<ref name="vintagecomputing_article">{{cite web|url=http://www.vintagecomputing.com/index.php/archives/528|title=VC&G - VC&G Interview: 30 Years Later, Richard Wiggins Talks Speak & Spell Development}}</ref>

One of the first commercially available speech recognition products was Dragon Dictate, released in 1990. In 1992, technology developed by [[Lawrence Rabiner]] and others at Bell Labs was used by [[AT&T]] in its Voice Recognition Call Processing service to route calls without a human operator. By this point, the vocabulary of these systems was larger than the average human vocabulary.<ref>{{Cite journal|last1=Huang|first1=Xuedong|last2=Baker|first2=James|last3=Reddy|first3=Raj|date=2014-01-01|title=A historical perspective of speech recognition|journal=Communications of the ACM|volume=57|issue=1|pages=94–103|doi=10.1145/2500887|s2cid=6175701 |issn=0001-0782}}</ref>

By the early 2000s, the dominant speech processing strategy started to shift away from [[Hidden Markov model|hidden Markov models]] towards more modern [[Artificial neural network|neural networks]] and [[deep learning]].<ref>{{Cite journal |last=Furui |first=Sadaoki |date=2005 |title=50 Years of Progress in Speech and Speaker Recognition Research |journal=ECTI Transactions on Computer and Information Technology |language=en |volume=1 |issue=2 |pages=64–74 |doi=10.37936/ecti-cit.200512.51834 |issn=2286-9131|doi-access=free }}</ref> In 2012, [[Geoffrey Hinton]] and his team at the [[University of Toronto]] demonstrated that deep neural networks could significantly outperform traditional HMM-based systems on large-vocabulary continuous speech recognition tasks, a breakthrough that led to widespread adoption of deep learning techniques in the industry.<ref name=":0">{{Cite news |date=2019-07-23 |title=Deep Neural Networks for Acoustic Modeling in Speech Recognition |url=https://www.cs.toronto.edu/~hinton/absps/DNN-2012-proof.pdf?form=MG0AV3 |access-date=2024-11-05 |language=en-GB}}</ref><ref>{{Cite news |date=2019-07-23 |title=Speech Recognition with Deep Recurrent Neural Networks |url=https://www.cs.toronto.edu/~hinton/absps/DRNN_speech.pdf?form=MG0AV3 |access-date=2024-11-05 |language=en-GB}}</ref> By the mid-2010s, companies such as [[Google]], [[Microsoft]], [[Amazon (company)|Amazon]], and [[Apple Inc.|Apple]] had integrated advanced speech recognition systems into their virtual assistants [[Google Assistant]], [[Cortana (virtual assistant)|Cortana]], [[Amazon Alexa|Alexa]], and [[Siri]].<ref>{{Cite journal |last=Hoy |first=Matthew B. |date=2018 |title=Alexa, Siri, Cortana, and More: An Introduction to Voice Assistants |url=https://pubmed.ncbi.nlm.nih.gov/29327988/ |journal=Medical Reference Services Quarterly |volume=37 |issue=1 |pages=81–88 |doi=10.1080/02763869.2018.1404391 |issn=1540-9597 |pmid=29327988}}</ref> These systems used deep learning models to provide more natural and accurate voice interactions.

The development of Transformer-based models, such as Google's BERT (Bidirectional Encoder Representations from Transformers) and OpenAI's GPT (Generative Pre-trained Transformer), further pushed the boundaries of natural language processing and speech recognition, enabling more context-aware and semantically rich understanding of speech.<ref>{{Cite web |title=Vbee |url=https://vbee.vn |access-date=2024-11-05 |website=vbee.vn |language=vi}}</ref><ref name=":0" /> In recent years, end-to-end speech recognition models have gained popularity. These models simplify the speech recognition pipeline by directly converting audio input into text output, bypassing intermediate steps such as feature extraction and acoustic modeling, which has streamlined development and improved performance.<ref>{{Cite book |last=Hagiwara |first=Masato |url=https://books.google.com/books?id=Ye9MEAAAQBAJ |title=Real-World Natural Language Processing: Practical applications with deep learning |date=2021-12-21 |publisher=Simon and Schuster |isbn=978-1-63835-039-2 |language=en}}</ref>

== Techniques ==

=== Dynamic time warping ===
{{Main|Dynamic time warping}}
Dynamic time warping (DTW) is an [[algorithm]] for measuring similarity between two [[Time series|temporal sequences]], which may vary in speed. In general, DTW computes an [[Optimal matching|optimal match]] between two given sequences (e.g. time series) subject to certain restrictions and rules. The optimal match is the one that satisfies all the restrictions and rules and has the minimal cost, where the cost is computed as the sum of absolute differences between the values of each matched pair of indices.{{citation needed|date=December 2018}}
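As an illustration, a minimal dynamic-programming sketch of DTW in plain Python follows; the sum-of-absolute-differences cost matches the description above, while the two example sequences are invented for the demonstration:

<syntaxhighlight lang="python">
def dtw_distance(a, b):
    """Cost of the optimal DTW alignment of sequences a and b.

    The cost of matching a[i] to b[j] is |a[i] - b[j]|; the total cost
    is the sum over all matched pairs along the optimal warping path.
    """
    n, m = len(a), len(b)
    INF = float("inf")
    # dtw[i][j] = cost of the best alignment of a[:i] and b[:j]
    dtw = [[INF] * (m + 1) for _ in range(n + 1)]
    dtw[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # A step may advance one sequence, the other, or both.
            dtw[i][j] = cost + min(dtw[i - 1][j],      # advance a only
                                   dtw[i][j - 1],      # advance b only
                                   dtw[i - 1][j - 1])  # advance both
    return dtw[n][m]

# Two "utterances" of the same contour spoken at different speeds:
slow = [0.0, 0.0, 1.0, 2.0, 2.0, 1.0, 0.0]
fast = [0.0, 1.0, 2.0, 1.0, 0.0]
print(dtw_distance(slow, fast))  # 0.0: identical up to time warping
</syntaxhighlight>

The zero distance shows the point of DTW: a plain sample-by-sample comparison of the two sequences would report a large difference, while the warped alignment recognizes them as the same shape.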
=== Hidden Markov models ===
{{Main|Hidden Markov model}}
A hidden Markov model can be represented as the simplest [[dynamic Bayesian network]]. The goal of the algorithm is to estimate a hidden variable ''x''(''t'') given a list of observations ''y''(''t''). By applying the [[Markov property]], the [[conditional probability distribution]] of the hidden variable ''x''(''t'') at time ''t'', given the values of the hidden variable ''x'' at all times, depends ''only'' on the value of the hidden variable ''x''(''t'' − 1). Similarly, the value of the observed variable ''y''(''t'') depends only on the value of the hidden variable ''x''(''t'') (both at time ''t'').{{citation needed|date=December 2018}}
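A standard way to recover the hidden state sequence from the observations is the [[Viterbi algorithm]]. The following NumPy sketch uses made-up two-state parameters, not values drawn from any real speech system:

<syntaxhighlight lang="python">
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden sequence x(0..T-1) given observations y(0..T-1).

    pi[i]   : initial probability of state i
    A[i, j] : transition probability from state i to state j
    B[i, k] : probability of emitting observation symbol k from state i
    """
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))           # best path probability ending in state i at time t
    psi = np.zeros((T, N), dtype=int)  # back-pointers to the previous state
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        trans = delta[t - 1, :, None] * A   # (N, N): from-state x to-state
        psi[t] = trans.argmax(axis=0)
        delta[t] = trans.max(axis=0) * B[:, obs[t]]
    # Reconstruct the best path by walking the back-pointers.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# Toy example: two hidden states, three observation symbols.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
print(viterbi([0, 1, 2], pi, A, B))  # -> [0, 0, 1]
</syntaxhighlight>

In speech recognition the hidden states would correspond to sub-phonetic units and the observations to acoustic feature vectors, with far larger state spaces than this toy model.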
=== Artificial neural networks ===
{{Main|Artificial neural network}}
An artificial neural network (ANN) is based on a collection of connected units or nodes called [[artificial neuron]]s, which loosely model the [[neuron]]s in a biological [[brain]]. Each connection, like the [[synapse]]s in a biological brain, can transmit a signal from one artificial neuron to another. An artificial neuron that receives a signal can process it and then signal additional artificial neurons connected to it. In common ANN implementations, the signal at a connection between artificial neurons is a [[real number]], and the output of each artificial neuron is computed by some non-linear function of the sum of its inputs.{{citation needed|date=December 2018}}
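The computation described above can be sketched in a few lines of NumPy; the layer sizes and random weights here are placeholders for illustration, not a trained model:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

def dense_layer(x, W, b):
    """One layer of artificial neurons: each output is a non-linear
    function (here tanh) of a weighted sum of the layer's inputs."""
    return np.tanh(W @ x + b)

# Toy network: 3 inputs -> 4 hidden neurons -> 2 outputs.
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

x = np.array([0.2, -0.5, 0.8])   # e.g. a tiny acoustic feature vector
h = dense_layer(x, W1, b1)       # signals passed on to the next layer
y = dense_layer(h, W2, b2)
print(y)
</syntaxhighlight>

In a real speech system the weights would be learned from data and the network would be far deeper, but the per-neuron computation is exactly this weighted sum followed by a non-linearity.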
=== Phase-aware processing ===
The phase of a speech signal is usually assumed to be a uniformly distributed random variable and therefore useless. This is due to phase wrapping:<ref name="limits">{{Cite journal| doi = 10.1109/TASLP.2015.2430820| issn = 2329-9290| volume = 23| issue = 8| pages = 1283–1294| last1 = Mowlaee| first1 = Pejman| last2 = Kulmer| first2 = Josef| title = Phase Estimation in Single-Channel Speech Enhancement: Limits-Potential| journal = IEEE/ACM Transactions on Audio, Speech, and Language Processing|access-date= 2017-12-03| date = August 2015| s2cid = 13058142| url = https://ieeexplore.ieee.org/document/7103305| url-access = subscription}}</ref> the result of the [[arctangent]] function is not continuous because of periodic jumps of <math>2\pi</math>. After phase unwrapping (see<ref>{{Cite book| publisher = Wiley| isbn = 978-1-119-23882-9| last1 = Mowlaee| first1 = Pejman| last2 = Kulmer| first2 = Josef| last3 = Stahl| first3 = Johannes| last4 = Mayer| first4 = Florian| title = Single channel phase-aware signal processing in speech communication: theory and practice| location = Chichester| date = 2017}}</ref> Chapter 2.3; [[Instantaneous phase and frequency]]), it can be expressed as:<ref name="limits" /><ref name="vonMises">{{Cite conference| publisher = IEEE| pages = 5063–5067| last1 = Kulmer| first1 = Josef| last2 = Mowlaee| first2 = Pejman| title = Harmonic phase estimation in single-channel speech enhancement using von Mises distribution and prior SNR|book-title= Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on| date = April 2015}}</ref>

:<math>\phi(h,l) = \phi_{\text{lin}}(h,l) + \Psi(h,l),</math>

where <math>\phi_{\text{lin}}(h,l) = \omega_0(l')\,\Delta t</math> is the linear phase (<math>\Delta t</math> is the temporal shift at each analysis frame) and <math>\Psi(h,l)</math> is the phase contribution of the vocal tract and of the source.<ref name="vonMises" /> The resulting phase estimates can be used for noise reduction: temporal smoothing of the instantaneous phase<ref>{{Cite journal| doi = 10.1109/LSP.2014.2365040| issn = 1070-9908| volume = 22| issue = 5| pages = 598–602| last1 = Kulmer| first1 = Josef| last2 = Mowlaee| first2 = Pejman| title = Phase Estimation in Single Channel Speech Enhancement Using Phase Decomposition| journal = IEEE Signal Processing Letters|access-date= 2017-12-03| date = May 2015| bibcode = 2015ISPL...22..598K| s2cid = 15503015| url = https://ieeexplore.ieee.org/document/6936313| url-access = subscription}}</ref> and of its derivatives with respect to time ([[Instantaneous phase and frequency|instantaneous frequency]]) and frequency ([[Group delay and phase delay|group delay]]),<ref name="Advances">{{Cite journal| doi = 10.1016/j.specom.2016.04.002| issn = 0167-6393| volume = 81| pages = 1–29| last1 = Mowlaee| first1 = Pejman| last2 = Saeidi| first2 = Rahim| last3 = Stylianou| first3 = Yannis| title = Advances in phase-aware signal processing in speech communication| journal = Speech Communication|access-date= 2017-12-03| date = July 2016| s2cid = 17409161| url = http://linkinghub.elsevier.com/retrieve/pii/S0167639316300784| url-access = subscription}}</ref> as well as smoothing of the phase across frequency.<ref name="Advances" /> Joint amplitude and phase estimators can recover speech more accurately based on the assumption that the phase follows a von Mises distribution.<ref name="vonMises" />
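The wrapping and unwrapping described above can be reproduced in a short NumPy sketch; the 5 Hz tone and the sampling rate are arbitrary illustrative choices:

<syntaxhighlight lang="python">
import numpy as np

# The instantaneous phase of a pure tone grows linearly with time ...
t = np.arange(0, 1, 1e-3)        # 1 s sampled at 1 kHz (arbitrary)
true_phase = 2 * np.pi * 5 * t   # 5 Hz tone: phase reaches ~10*pi

# ... but the arctangent-based phase only takes values in (-pi, pi],
# so the measured phase shows periodic jumps of 2*pi ("wrapping").
wrapped = np.angle(np.exp(1j * true_phase))

# np.unwrap removes the 2*pi discontinuities, recovering the linear
# phase up to numerical error.
unwrapped = np.unwrap(wrapped)
print(np.max(np.abs(unwrapped - true_phase)))  # close to machine precision
</syntaxhighlight>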
== Applications ==
* [[Interactive voice response]]
* [[Virtual assistant]]s
* [[Speaker recognition|Voice identification]]
* [[Emotion recognition]]
* Call center automation
* [[Robotics]]

== See also ==
* [[Computational audiology]]
* [[Neurocomputational speech processing]]
* [[Speech coding]]
* [[Speech technology]]
* [[Natural language processing]]

== References ==
{{reflist}}

{{Speech processing}}
{{Authority control}}

[[Category:Speech processing| ]]
[[Category:Speech]]
[[Category:Signal processing]]