== Synthesizer technologies ==
The most important qualities of a speech synthesis system are ''naturalness'' and ''[[Intelligibility (communication)|intelligibility]]''.<ref>{{cite book|last1=Taylor|first1=Paul|title=Text-to-speech synthesis|url=https://archive.org/details/texttospeechsynt00tayl_030|url-access=limited|date=2009|publisher=Cambridge University Press|location=Cambridge, UK|isbn=9780521899277|page=[https://archive.org/details/texttospeechsynt00tayl_030/page/n26 3]}}</ref> Naturalness describes how closely the output sounds like human speech, while intelligibility is the ease with which the output is understood. The ideal speech synthesizer is both natural and intelligible, and speech synthesis systems usually try to maximize both characteristics.

The two primary technologies for generating synthetic speech waveforms are ''concatenative synthesis'' and ''[[formant]] synthesis''. Each technology has strengths and weaknesses, and the intended uses of a synthesis system will typically determine which approach is used.

=== Concatenative synthesis ===
{{main|Concatenative synthesis}}
Concatenative synthesis is based on the concatenation (stringing together) of segments of recorded speech. Generally, concatenative synthesis produces the most natural-sounding synthesized speech. However, differences between natural variations in speech and the nature of the automated techniques for segmenting the waveforms sometimes result in audible glitches in the output. There are three main sub-types of concatenative synthesis.

==== Unit selection synthesis ====
Unit selection synthesis uses large databases of recorded speech. During database creation, each recorded utterance is segmented into some or all of the following: individual [[phone (phonetics)|phones]], [[diphone]]s, half-phones, [[syllable]]s, [[morpheme]]s, [[word]]s, [[phrase]]s, and [[sentence (linguistics)|sentence]]s. Typically, the division into segments is done using a specially modified [[speech recognition|speech recognizer]] set to a "forced alignment" mode with some manual correction afterward, using visual representations such as the [[waveform]] and [[spectrogram]].<ref>[[Alan W. Black]], [https://www.cs.cmu.edu/~awb/papers/IEEE2002/allthetime/allthetime.html Perfect synthesis for all of the people all of the time.] IEEE TTS Workshop 2002.</ref> An [[index (database)|index]] of the units in the speech database is then created based on the segmentation and acoustic parameters like the [[fundamental frequency]] ([[pitch (music)|pitch]]), duration, position in the syllable, and neighboring phones. At [[Run time (program lifecycle phase)|run time]], the desired target utterance is created by determining the best chain of candidate units from the database (unit selection). This process is typically achieved using a specially weighted [[decision tree]].

Unit selection provides the greatest naturalness, because it applies only a small amount of [[digital signal processing]] (DSP) to the recorded speech. DSP often makes recorded speech sound less natural, although some systems use a small amount of signal processing at the point of concatenation to smooth the waveform. The output from the best unit-selection systems is often indistinguishable from real human voices, especially in contexts for which the TTS system has been tuned. However, maximum naturalness typically requires unit-selection speech databases to be very large, in some systems ranging into the [[gigabyte]]s of recorded data, representing dozens of hours of speech.<ref>John Kominek and [[Alan W. Black]]. (2003). CMU ARCTIC databases for speech synthesis. CMU-LTI-03-177. Language Technologies Institute, School of Computer Science, Carnegie Mellon University.</ref> Also, unit selection algorithms have been known to select segments from a place that results in less-than-ideal synthesis (e.g. minor words become unclear) even when a better choice exists in the database.<ref>Julia Zhang. [http://groups.csail.mit.edu/sls/publications/2004/zhang_thesis.pdf Language Generation and Speech Synthesis in Dialogues for Language Learning], masters thesis, Section 5.6 on page 54.</ref> Recently, researchers have proposed various automated methods to detect unnatural segments in unit-selection speech synthesis systems.<ref>William Yang Wang and Kallirroi Georgila. (2011). [https://www.cs.cmu.edu/~yww/papers/asru2011.pdf Automatic Detection of Unnatural Word-Level Segments in Unit-Selection Speech Synthesis], IEEE ASRU 2011.</ref>
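The selection step is often described as minimizing a combination of "target costs" (how well a candidate unit matches the desired pitch and duration) and "join costs" (how smoothly adjacent units concatenate), searched with dynamic programming. The following is only a minimal sketch of that general idea; the feature names, cost weights, and toy database are invented for illustration and do not come from any particular system.

<syntaxhighlight lang="python">
# Illustrative sketch of unit selection: a Viterbi-style dynamic-programming
# search minimizing target cost + join cost. All values below are invented.

def target_cost(spec, unit):
    """Distance between the desired unit specification and a database candidate."""
    return (abs(spec["pitch"] - unit["pitch"]) / 100.0
            + abs(spec["duration"] - unit["duration"]) / 50.0)

def join_cost(prev_unit, unit):
    """Penalty for concatenating two units with mismatched boundary pitch."""
    return abs(prev_unit["pitch"] - unit["pitch"]) / 100.0

def select_units(target_specs, candidates_per_target):
    """Viterbi search: one list of candidate units per target position."""
    # best[i][j] = (cost of best path ending in candidate j of target i, backpointer)
    best = [[(target_cost(target_specs[0], u), None)
             for u in candidates_per_target[0]]]
    for i in range(1, len(target_specs)):
        row = []
        for u in candidates_per_target[i]:
            tc = target_cost(target_specs[i], u)
            prev_idx, prev_cost = min(
                ((j, best[i - 1][j][0] + join_cost(p, u))
                 for j, p in enumerate(candidates_per_target[i - 1])),
                key=lambda x: x[1])
            row.append((prev_cost + tc, prev_idx))
        best.append(row)
    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(best) - 1, -1, -1):
        path.append(candidates_per_target[i][j])
        j = best[i][j][1] if best[i][j][1] is not None else 0
    return list(reversed(path))

# Toy "database": two candidate recordings for each of three target units.
targets = [{"pitch": 120, "duration": 90}, {"pitch": 130, "duration": 80},
           {"pitch": 110, "duration": 100}]
candidates = [[{"id": "a1", "pitch": 118, "duration": 95}, {"id": "a2", "pitch": 150, "duration": 60}],
              [{"id": "b1", "pitch": 132, "duration": 85}, {"id": "b2", "pitch": 100, "duration": 80}],
              [{"id": "c1", "pitch": 112, "duration": 98}, {"id": "c2", "pitch": 140, "duration": 70}]]
print([u["id"] for u in select_units(targets, candidates)])  # ['a1', 'b1', 'c1']
</syntaxhighlight>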
==== Diphone synthesis ====
Diphone synthesis uses a minimal speech database containing all the [[diphone]]s (sound-to-sound transitions) occurring in a language. The number of diphones depends on the [[phonotactics]] of the language: for example, Spanish has about 800 diphones, and German about 2500. In diphone synthesis, only one example of each diphone is contained in the speech database. At runtime, the target [[prosody (linguistics)|prosody]] of a sentence is superimposed on these minimal units by means of [[digital signal processing]] techniques such as [[linear predictive coding]], [[PSOLA]]<ref>{{cite web|title=Pitch-Synchronous Overlap and Add (PSOLA) Synthesis|url=http://www.fon.hum.uva.nl/praat/manual/PSOLA.html|url-status=dead|archive-url=https://web.archive.org/web/20070222180903/http://www.fon.hum.uva.nl/praat/manual/PSOLA.html|archive-date=February 22, 2007|access-date=2008-05-28}}</ref> or [[MBROLA]],<ref>T. Dutoit, V. Pagel, N. Pierret, F. Bataille, O. van der Vrecken. [http://ai2-s2-pdfs.s3.amazonaws.com/7b1f/dadf05b8f968a5b361f6f82852ade62c8010.pdf The MBROLA Project: Towards a set of high quality speech synthesizers of use for non commercial purposes]. ''ICSLP Proceedings'', 1996.</ref> or more recent techniques such as pitch modification in the source domain using the [[discrete cosine transform]].<ref name="Muralishankar2004">{{cite journal | last1 = Muralishankar | first1 = R. | last2 = Ramakrishnan | first2 = A. G. | last3 = Prathibha | first3 = P. | date = February 2004 | title = Modification of Pitch using DCT in the Source Domain | journal = Speech Communication | volume = 42 | issue = 2 | pages = 143–154 | doi=10.1016/j.specom.2003.05.001}}</ref>

Diphone synthesis suffers from the sonic glitches of concatenative synthesis and the robotic-sounding nature of formant synthesis, and has few of the advantages of either approach other than small size. As such, its use in commercial applications is declining,{{Citation needed|date=January 2012}} although it continues to be used in research because there are a number of freely available software implementations. An early example of diphone synthesis is a teaching robot, [[Leachim (Robot)|Leachim]], that was invented by [[Michael J. Freeman]].<ref>{{Cite news|url=http://content.time.com/time/magazine/article/0,9171,904056,00.html|title=Education: Marvel of The Bronx|date=1974-04-01|magazine=Time|access-date=2019-05-28|language=en-US|issn=0040-781X}}</ref> Leachim contained information regarding class curricula and certain biographical information about the students whom it was programmed to teach.<ref>{{Cite web|url=http://cyberneticzoo.com/robots/1960-rudy-the-robot-michael-freeman-american/|title=1960 - Rudy the Robot - Michael Freeman (American)|date=2010-09-13|website=cyberneticzoo.com|language=en-US|access-date=2019-05-23}}</ref> It was tested in a fourth-grade classroom in [[The Bronx|the Bronx, New York]].<ref>{{Cite book|url=https://books.google.com/books?id=bNECAAAAMBAJ&q=Leachim+Michael+Freeman&pg=PA40|title=New York Magazine|date=1979-07-30|publisher=New York Media, LLC|language=en}}</ref><ref>{{Cite book|url=https://books.google.com/books?id=_QJmAAAAMAAJ&q=leachim|title=The Futurist|date=1978|publisher=World Future Society.|pages=359, 360, 361|language=en}}</ref>
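A minimal sketch of the concatenation step is shown below: a phone sequence is converted into a diphone sequence, one stored example of each diphone is fetched, and the pieces are joined with a short crossfade. The tiny "database" of synthetic waveforms is invented for illustration; a real diphone synthesizer would store recorded speech and reshape its pitch and duration (for example with PSOLA), which is omitted here.

<syntaxhighlight lang="python">
# Illustrative sketch of diphone concatenation with an invented database.
import numpy as np

SAMPLE_RATE = 16000

def phones_to_diphones(phones):
    """['sil','h','e','l','o','sil'] -> ['sil-h', 'h-e', 'e-l', 'l-o', 'o-sil']"""
    return [f"{a}-{b}" for a, b in zip(phones, phones[1:])]

def crossfade_concat(chunks, fade_samples=160):
    """Concatenate waveforms, linearly crossfading `fade_samples` at each join."""
    out = chunks[0]
    ramp = np.linspace(0.0, 1.0, fade_samples)
    for nxt in chunks[1:]:
        overlap = out[-fade_samples:] * (1 - ramp) + nxt[:fade_samples] * ramp
        out = np.concatenate([out[:-fade_samples], overlap, nxt[fade_samples:]])
    return out

# Stand-in database: one short synthetic waveform per diphone (real systems
# store recordings). Each "diphone" here is just a 100 ms tone.
def fake_recording(freq_hz, dur_s=0.1):
    t = np.arange(int(dur_s * SAMPLE_RATE)) / SAMPLE_RATE
    return 0.3 * np.sin(2 * np.pi * freq_hz * t)

database = {d: fake_recording(100 + 10 * i)
            for i, d in enumerate(["sil-h", "h-e", "e-l", "l-o", "o-sil"])}

diphones = phones_to_diphones(["sil", "h", "e", "l", "o", "sil"])
waveform = crossfade_concat([database[d] for d in diphones])
print(len(waveform) / SAMPLE_RATE, "seconds of audio")
</syntaxhighlight>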
==== Domain-specific synthesis ====
Domain-specific synthesis concatenates prerecorded words and phrases to create complete utterances. It is used in applications where the variety of texts the system will output is limited to a particular domain, like transit schedule announcements or weather reports.<ref>[[Lori Lamel|L.F. Lamel]], J.L. Gauvain, B. Prouts, C. Bouhier, R. Boesch. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.53.6101&rep=rep1&type=pdf Generation and Synthesis of Broadcast Messages], ''Proceedings ESCA-NATO Workshop and Applications of Speech Technology'', September 1993.</ref> The technology is very simple to implement, and has been in commercial use for a long time, in devices like talking clocks and calculators. The level of naturalness of these systems can be very high because the variety of sentence types is limited, and they closely match the prosody and intonation of the original recordings.{{Citation needed|date=February 2007}}

Because these systems are limited by the words and phrases in their databases, they are not general-purpose and can only synthesize the combinations of words and phrases with which they have been preprogrammed. The blending of words within naturally spoken language, however, can still cause problems unless the many variations are taken into account. For example, in [[rhotic and non-rhotic accents|non-rhotic]] dialects of English the ''"r"'' in words like ''"clear"'' {{IPA|/ˈklɪə/}} is usually only pronounced when the following word begins with a vowel (e.g. ''"clear out"'' is realized as {{IPA|/ˌklɪəɹˈʌʊt/}}). Likewise in [[French language|French]], many final consonants are no longer silent when followed by a word that begins with a vowel, an effect called [[Liaison (French)|liaison]]. This [[alternation (linguistics)|alternation]] cannot be reproduced by a simple word-concatenation system, which would require additional complexity to be [[context-sensitive grammar|context-sensitive]].
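A talking clock is the classic example of this approach: the output is assembled from a fixed inventory of prerecorded phrases. The sketch below is hypothetical; the phrase filenames and the <code>play_files()</code> helper are invented, and a real system would load and play the audio rather than print filenames.

<syntaxhighlight lang="python">
# Illustrative sketch of domain-specific synthesis: a talking clock built by
# concatenating prerecorded phrases. Filenames and helpers are hypothetical.

PHRASE_DIR = "recordings"  # hypothetical directory of prerecorded .wav phrases

def clock_phrases(hour, minute):
    """Map a time to an ordered list of prerecorded phrase names."""
    phrases = ["the_time_is", f"hour_{hour % 12 or 12}"]
    if minute == 0:
        phrases.append("oclock")
    else:
        if minute < 10:
            phrases.append("oh")
        phrases.append(f"number_{minute}")
    phrases.append("am" if hour < 12 else "pm")
    return phrases

def play_files(names):
    # Placeholder: a real implementation would concatenate and play the files.
    for name in names:
        print(f"playing {PHRASE_DIR}/{name}.wav")

play_files(clock_phrases(14, 5))
# playing recordings/the_time_is.wav
# playing recordings/hour_2.wav
# playing recordings/oh.wav
# playing recordings/number_5.wav
# playing recordings/pm.wav
</syntaxhighlight>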
=== Formant synthesis ===
[[Formant]] synthesis does not use human speech samples at runtime. Instead, the synthesized speech output is created using [[additive synthesis]] and an acoustic model ([[physical modelling synthesis]]).<ref>Dartmouth College: [http://digitalmusics.dartmouth.edu/~book/MATCpages/chap.4/4.4.formant_synth.html ''Music and Computers''] {{webarchive|url=https://web.archive.org/web/20110608035309/http://digitalmusics.dartmouth.edu/~book/MATCpages/chap.4/4.4.formant_synth.html |date=2011-06-08 }}, 1993.</ref> Parameters such as [[fundamental frequency]], [[phonation|voicing]], and [[noise]] levels are varied over time to create a [[waveform]] of artificial speech. This method is sometimes called ''rules-based synthesis''; however, many concatenative systems also have rules-based components.

Many systems based on formant synthesis technology generate artificial, robotic-sounding speech that would never be mistaken for human speech. However, maximum naturalness is not always the goal of a speech synthesis system, and formant synthesis systems have advantages over concatenative systems. Formant-synthesized speech can be reliably intelligible, even at very high speeds, avoiding the acoustic glitches that commonly plague concatenative systems. High-speed synthesized speech is used by the visually impaired to quickly navigate computers using a [[screen reader]]. Formant synthesizers are usually smaller programs than concatenative systems because they do not have a database of speech samples. They can therefore be used in [[embedded system]]s, where [[data storage device|memory]] and [[microprocessor]] power are especially limited. Because formant-based systems have complete control of all aspects of the output speech, a wide variety of prosodies and [[intonation (linguistics)|intonation]]s can be output, conveying not just questions and statements, but a variety of emotions and tones of voice.

Examples of non-real-time but highly accurate intonation control in formant synthesis include the work done in the late 1970s for the [[Texas Instruments]] toy [[Speak & Spell (game)|Speak & Spell]], in the early 1980s [[Sega]] [[Video arcade|arcade]] machines,<ref>Examples include [[Astro Blaster]], [[Space Fury]], and [[Star Trek (arcade game)|Star Trek: Strategic Operations Simulator]]</ref> and in many [[Atari, Inc.]] arcade games<ref>Examples include [[Star Wars (arcade game)|Star Wars]], [[Firefox (arcade game)|Firefox]], [[Star Wars: Return of the Jedi (arcade game)|Return of the Jedi]], [[Road Runner (video game)|Road Runner]], [[Star Wars: The Empire Strikes Back (arcade game)|The Empire Strikes Back]], [[Indiana Jones and the Temple of Doom (arcade game)|Indiana Jones and the Temple of Doom]], [[720°]], [[Gauntlet (arcade game)|Gauntlet]], [[Gauntlet II]], [[A.P.B. (video game)|A.P.B.]], [[Paperboy (video game)|Paperboy]], [[RoadBlasters]], [http://www.arcade-museum.com/game_detail.php?game_id=10319 Vindicators Part II], [[Escape from the Planet of the Robot Monsters]].</ref> using the [[Texas Instruments LPC Speech Chips|TMS5220 LPC chips]]. Creating proper intonation for these projects was painstaking, and the results have yet to be matched by real-time text-to-speech interfaces.<ref>{{Cite book |author=John Holmes and Wendy Holmes |title=Speech Synthesis and Recognition |edition=2nd |publisher=CRC |year=2001 |isbn=978-0-7484-0856-6}}</ref>{{When|date=April 2025}}
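The core of a formant synthesizer can be illustrated with a source-filter sketch: a periodic source at the fundamental frequency is passed through a cascade of second-order resonators, one per formant. This is only a minimal, Klatt-style illustration under simplifying assumptions; the formant frequencies and bandwidths below are rough textbook figures for an /a/-like vowel, not parameters from any specific system.

<syntaxhighlight lang="python">
# Minimal sketch of formant synthesis: impulse-train source + cascade of
# second-order resonators. Parameter values are illustrative only.
import numpy as np

FS = 16000  # sample rate in Hz

def resonator(signal, freq, bandwidth, fs=FS):
    """Second-order digital resonator modelling one formant."""
    r = np.exp(-np.pi * bandwidth / fs)
    a1 = 2 * r * np.cos(2 * np.pi * freq / fs)
    a2 = -r * r
    b0 = 1 - a1 - a2          # normalize gain at DC
    out = np.zeros_like(signal)
    for n in range(len(signal)):
        # Negative indices at n = 0, 1 read the still-zero tail of `out`,
        # which amounts to zero initial conditions.
        out[n] = b0 * signal[n] + a1 * out[n - 1] + a2 * out[n - 2]
    return out

def synth_vowel(f0=120, formants=((700, 130), (1220, 70), (2600, 160)), dur=0.4):
    """Impulse train at f0 passed through a cascade of formant resonators."""
    n = int(dur * FS)
    source = np.zeros(n)
    source[::int(FS / f0)] = 1.0          # crude glottal source: impulse train
    speech = source
    for freq, bw in formants:             # cascade the formant filters
        speech = resonator(speech, freq, bw)
    return 0.9 * speech / np.max(np.abs(speech))

wave = synth_vowel()
print(wave.shape)  # (6400,) -- write to a WAV file to listen
</syntaxhighlight>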
=== Articulatory synthesis ===
{{Main|Articulatory synthesis}}
Articulatory synthesis consists of computational techniques for synthesizing speech based on models of the human [[vocal tract]] and the articulation processes occurring there. The first articulatory synthesizer regularly used for laboratory experiments was developed at [[Haskins Laboratories]] in the mid-1970s by [[Philip Rubin]], Tom Baer, and Paul Mermelstein. This synthesizer, known as ASY, was based on vocal tract models developed at [[Bell Laboratories]] in the 1960s and 1970s by Paul Mermelstein, Cecil Coker, and colleagues.

Until recently, articulatory synthesis models had not been incorporated into commercial speech synthesis systems. A notable exception is the [[NeXT]]-based system originally developed and marketed by Trillium Sound Research, a spin-off company of the [[University of Calgary]], where much of the original research was conducted. Following the demise of the various incarnations of NeXT (started by [[Steve Jobs]] in the late 1980s and merged with Apple Computer in 1997), the Trillium software was published under the GNU General Public License, with work continuing as [[gnuspeech]]. The system, first marketed in 1994, provides full articulatory-based text-to-speech conversion using a waveguide or transmission-line analog of the human oral and nasal tracts controlled by Carré's "distinctive region model".

More recent synthesizers, developed by Jorge C. Lucero and colleagues, incorporate models of vocal fold biomechanics, glottal aerodynamics and acoustic wave propagation in the bronchi, trachea, nasal and oral cavities, and thus constitute full systems of physics-based speech simulation.<ref name=":0">{{Cite journal|url = http://www.cic.unb.br/~lucero/papers/768_Paper.pdf|title = Physics-based synthesis of disordered voices|last1 = Lucero|first1 = J. C.|date = 2013|journal = Interspeech 2013|access-date = Aug 27, 2015|last2 = Schoentgen|first2 = J.|last3 = Behlau|first3 = M.|pages = 587–591|publisher = International Speech Communication Association|location = Lyon, France|doi = 10.21437/Interspeech.2013-161| s2cid=17451802 }}</ref><ref name=":1">{{Cite journal|last1=Englert|first1=Marina|last2=Madazio|first2=Glaucya|last3=Gielow|first3=Ingrid|last4=Lucero|first4=Jorge|last5=Behlau|first5=Mara|date=2016|title=Perceptual error identification of human and synthesized voices|journal=Journal of Voice|volume=30|issue=5|pages=639.e17–639.e23|doi=10.1016/j.jvoice.2015.07.017|pmid=26337775}}</ref>

=== HMM-based synthesis ===
HMM-based synthesis is a synthesis method based on [[hidden Markov model]]s, also called statistical parametric synthesis. In this system, the [[frequency spectrum]] ([[vocal tract]]), [[fundamental frequency]] (voice source), and duration ([[prosody (linguistics)|prosody]]) of speech are modeled simultaneously by HMMs. Speech [[waveform]]s are generated from the HMMs themselves based on the [[maximum likelihood]] criterion.<ref>{{cite web |url=http://hts.sp.nitech.ac.jp/ |title=The HMM-based Speech Synthesis System |publisher=Hts.sp.nitech.ac.jp |access-date=2012-02-22 |archive-date=2012-02-13 |archive-url=https://web.archive.org/web/20120213232606/http://hts.sp.nitech.ac.jp/ |url-status=dead }}</ref>
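A heavily simplified sketch of the statistical-parametric idea follows: each phone is modeled by a few states with Gaussian means for a spectral parameter and log-F0, plus a duration per state. Without dynamic (delta) features, the maximum-likelihood parameter trajectory is just each state's mean repeated for its predicted duration; real systems such as HTS add delta features, smooth the trajectories, and drive a vocoder. The toy models below are invented.

<syntaxhighlight lang="python">
# Simplified sketch of HMM-based (statistical parametric) synthesis.
import numpy as np

# Invented toy models: phone -> list of states, each with duration (frames),
# a spectral-parameter mean, and a log-F0 mean (0.0 marks unvoiced).
PHONE_MODELS = {
    "a": [{"dur": 8, "spec": 1.2, "lf0": 4.8},
          {"dur": 12, "spec": 1.5, "lf0": 4.9},
          {"dur": 8, "spec": 1.3, "lf0": 4.85}],
    "t": [{"dur": 5, "spec": 0.2, "lf0": 0.0},
          {"dur": 4, "spec": 0.4, "lf0": 0.0}],
}

def generate_trajectories(phone_sequence):
    """Concatenate per-state means, repeated for each state's duration."""
    spec, lf0 = [], []
    for phone in phone_sequence:
        for state in PHONE_MODELS[phone]:
            spec.extend([state["spec"]] * state["dur"])
            lf0.extend([state["lf0"]] * state["dur"])
    return np.array(spec), np.array(lf0)

spec_track, lf0_track = generate_trajectories(["t", "a"])
print(len(spec_track), "frames")  # 37 frames (5+4+8+12+8)
# A vocoder (e.g. a source-filter model) would turn these parameter tracks
# into a waveform; that step is omitted here.
</syntaxhighlight>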
=== Sinewave synthesis ===
{{Main|Sinewave synthesis}}
Sinewave synthesis is a technique for synthesizing speech by replacing the [[formants]] (main bands of energy) with pure tone whistles.<ref>{{Cite journal |last1 = Remez |first1 = R. |last2 = Rubin |first2 = P. |last3 = Pisoni |first3 = D. |last4 = Carrell |first4 = T. |title = Speech perception without traditional speech cues |doi = 10.1126/science.7233191 |journal = Science |volume = 212 |issue = 4497 |pages = 947–949 |date = 22 May 1981 |pmid = 7233191 |bibcode = 1981Sci...212..947R |url = http://www.bsos.umd.edu/hesp/mwinn/Remez_et_al_1981.pdf |access-date = 2011-12-14 |archive-url = https://web.archive.org/web/20111216113028/http://www.bsos.umd.edu/hesp/mwinn/Remez_et_al_1981.pdf |archive-date = 2011-12-16 |url-status = dead }}<!-- in case PDF link dies, paper also available here and here: http://people.ece.cornell.edu/land/courses/ece4760/Speech/remez_rubin_pisoni_carrell1981.pdf http://www.haskins.yale.edu/Reprints/HL0338.pdf --></ref>
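The technique reduces the signal to a few time-varying sinusoids that follow the formant tracks. The sketch below illustrates this with invented formant tracks (a slow glide between /a/-like and /i/-like values) rather than tracks measured from real speech.

<syntaxhighlight lang="python">
# Illustrative sketch of sinewave synthesis: sum a few pure tones whose
# frequencies follow (invented) formant tracks.
import numpy as np

FS = 16000
DUR = 0.6
t = np.arange(int(FS * DUR)) / FS

def glide(start_hz, end_hz):
    """Linear frequency glide from start_hz to end_hz over the utterance."""
    return np.linspace(start_hz, end_hz, len(t))

# Three "formant" tracks (F1, F2, F3) gliding between vowel-like configurations.
tracks = [glide(700, 300), glide(1200, 2300), glide(2600, 3000)]
amps = [1.0, 0.6, 0.3]  # weaker higher formants

signal = np.zeros_like(t)
for track, amp in zip(tracks, amps):
    # Integrate the instantaneous frequency to get phase, then add a pure tone.
    phase = 2 * np.pi * np.cumsum(track) / FS
    signal += amp * np.sin(phase)
signal = 0.9 * signal / np.max(np.abs(signal))
print(signal.shape)  # (9600,) -- write to a WAV file to hear the "whistled" speech
</syntaxhighlight>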
=== Deep learning-based synthesis ===
{{Main|Deep learning speech synthesis}}
[[File:Larynx-HiFi-GAN speech sample.wav|thumb|Speech synthesis example using the HiFi-GAN neural vocoder]]
Deep learning speech synthesis uses [[deep neural network]]s (DNNs) to produce artificial speech from text (text-to-speech) or spectrum (vocoder). The deep neural networks are trained using a large amount of recorded speech and, in the case of a text-to-speech system, the associated labels and/or input text.

[[15.ai]] uses a ''multi-speaker model'': hundreds of voices are trained concurrently rather than sequentially, decreasing the required training time and enabling the model to learn and generalize shared emotional context, even for voices with no exposure to such emotional context.<ref>{{cite web |last=Temitope |first=Yusuf |date=December 10, 2024 |title=15.ai Creator reveals journey from MIT Project to internet phenomenon |url=https://guardian.ng/technology/15-ai-creator-reveals-journey-from-mit-project-to-internet-phenomenon/ |access-date=December 25, 2024 |website=[[The Guardian (Nigeria)|The Guardian]] |quote= |archive-url=https://web.archive.org/web/20241228152312/https://guardian.ng/technology/15-ai-creator-reveals-journey-from-mit-project-to-internet-phenomenon/ |archive-date=December 28, 2024}}</ref> The [[deep learning]] model used by the application is [[Nondeterministic algorithm|nondeterministic]]: each time speech is generated from the same string of text, its intonation is slightly different. The application also supports manually altering the [[Emotional prosody|emotion]] of a generated line using ''emotional contextualizers'' (a term coined by this project): a sentence or phrase that conveys the emotion of the take and serves as a guide for the model during inference.<ref name="automaton2">{{cite web |last=Kurosawa |first=Yuki |date=2021-01-19 |title=ゲームキャラ音声読み上げソフト「15.ai」公開中。『Undertale』や『Portal』のキャラに好きなセリフを言ってもらえる |trans-title=Game-character text-to-speech software "15.ai" now available: characters from Undertale and Portal can be made to say any line |language=ja |url=https://automaton-media.com/articles/newsjp/20210119-149494/ |url-status=live |archive-url=https://web.archive.org/web/20210119103031/https://automaton-media.com/articles/newsjp/20210119-149494/ |archive-date=2021-01-19 |access-date=2021-01-19 |website=AUTOMATON |quote=}}</ref><ref name="Denfaminicogamer2">{{cite web |last=Yoshiyuki |first=Furushima |date=2021-01-18 |title=『Portal』のGLaDOSや『UNDERTALE』のサンズがテキストを読み上げてくれる。文章に込められた感情まで再現することを目指すサービス「15.ai」が話題に |trans-title=GLaDOS from Portal and Sans from UNDERTALE will read your text aloud: "15.ai", a service that aims to reproduce even the emotion in a sentence, attracts attention |language=ja |url=https://news.denfaminicogamer.jp/news/210118f |url-status=live |archive-url=https://web.archive.org/web/20210118051321/https://news.denfaminicogamer.jp/news/210118f |archive-date=2021-01-18 |access-date=2021-01-18 |website=Denfaminicogamer |quote=}}</ref>

[[ElevenLabs]] is primarily known for its [[browser-based]], AI-assisted text-to-speech software, Speech Synthesis, which can produce lifelike speech by synthesizing [[vocal emotion]] and [[Intonation (linguistics)|intonation]].<ref>{{Cite web |date=January 23, 2023 |title=Generative AI comes for cinema dubbing: Audio AI startup ElevenLabs raises pre-seed |url=https://sifted.eu/articles/generative-ai-audio-elevenlabs/ |access-date=2023-02-03 |website=Sifted |language=en-US}}</ref> The company states its software is built to adjust the intonation and pacing of delivery based on the context of the language input.<ref name=":13">{{Cite magazine |last=Ashworth |first=Boone |date=April 12, 2023 |title=AI Can Clone Your Favorite Podcast Host's Voice |url=https://www.wired.com/story/ai-podcasts-podcastle-revoice-descript/ |magazine=Wired |language=en-US |access-date=2023-04-25}}</ref> It uses algorithms that analyze the contextual aspects of text, aiming to detect emotions like anger, sadness, happiness, or alarm so that the system can infer the user's sentiment,<ref>{{Cite magazine |author=WIRED Staff |title=This Podcast Is Not Hosted by AI Voice Clones. We Swear |url=https://www.wired.com/story/gadget-lab-podcast-594/ |magazine=Wired |language=en-US |issn=1059-1028 |access-date=2023-07-25}}</ref> resulting in more realistic, human-like inflection. Other features include multilingual speech generation and long-form content creation with contextually aware voices.<ref name=":34">{{Cite web |last=Wiggers |first=Kyle |date=2023-06-20 |title=Voice-generating platform ElevenLabs raises $19M, launches detection tool |url=https://techcrunch.com/2023/06/20/voice-generating-platform-elevenlabs-raises-19m-launches-detection-tool/ |access-date=2023-07-25 |website=TechCrunch |language=en-US}}</ref><ref>{{Cite web |last=Bonk |first=Lawrence |title=ElevenLabs' Powerful New AI Tool Lets You Make a Full Audiobook in Minutes |url=https://www.lifewire.com/elevenlabs-new-audiobook-ai-tool-7550061 |access-date=2023-07-25 |website=Lifewire |language=en}}</ref>

DNN-based speech synthesizers are approaching the naturalness of the human voice. Disadvantages of the method include low robustness when training data are insufficient, a lack of controllability, and low performance in auto-regressive models.
For tonal languages, such as Chinese or Taiwanese, different levels of [[tone sandhi]] are required, and the output of a speech synthesizer may sometimes contain tone sandhi errors.<ref>{{Cite journal |last=Zhu |first=Jian |date=2020-05-25 |title=Probing the phonetic and phonological knowledge of tones in Mandarin TTS models |url=http://dx.doi.org/10.21437/speechprosody.2020-190 |journal=Speech Prosody 2020 |pages=930–934 |location=ISCA |publisher=ISCA |doi=10.21437/speechprosody.2020-190|arxiv=1912.10915 |s2cid=209444942 }}</ref>

=== Audio deepfakes ===
{{excerpt|Audio deepfake}}
In 2023, [[Vice Media|VICE]] reporter [[Joseph Cox (journalist)|Joseph Cox]] published findings that he had recorded five minutes of himself talking and then used a tool developed by ElevenLabs to create voice deepfakes that defeated a bank's [[Speaker recognition|voice-authentication]] system.<ref>{{Cite magazine |last=Newman |first=Lily Hay |title=AI-Generated Voice Deepfakes Aren't Scary Good—Yet |url=https://www.wired.com/story/ai-voice-deep-fakes/ |magazine=Wired |language=en-US |issn=1059-1028 |access-date=2023-07-25}}</ref>