=== Deep learning-based synthesis ===
{{Main|Deep learning speech synthesis}}
[[File:Larynx-HiFi-GAN speech sample.wav|thumb|Speech synthesis example using the HiFi-GAN neural vocoder]]
Deep learning speech synthesis uses [[deep neural network]]s (DNNs) to produce artificial speech from text (text-to-speech) or from a spectrum (vocoder). The networks are trained on large amounts of recorded speech and, in the case of a text-to-speech system, on the associated labels and/or input text.

[[15.ai]] uses a ''multi-speaker model'': hundreds of voices are trained concurrently rather than sequentially, decreasing the required training time and enabling the model to learn and generalize shared emotional context, even for voices that were never exposed to that emotional context.<ref>{{cite web |last=Temitope |first=Yusuf |date=December 10, 2024 |title=15.ai Creator reveals journey from MIT Project to internet phenomenon |url=https://guardian.ng/technology/15-ai-creator-reveals-journey-from-mit-project-to-internet-phenomenon/ |access-date=December 25, 2024 |website=[[The Guardian (Nigeria)|The Guardian]] |archive-url=https://web.archive.org/web/20241228152312/https://guardian.ng/technology/15-ai-creator-reveals-journey-from-mit-project-to-internet-phenomenon/ |archive-date=December 28, 2024}}</ref> The [[deep learning]] model used by the application is [[Nondeterministic algorithm|nondeterministic]]: each time speech is generated from the same string of text, its intonation is slightly different. The application also supports manually altering the [[Emotional prosody|emotion]] of a generated line using ''emotional contextualizers'' (a term coined by 15.ai): a sentence or phrase that conveys the emotion of the take and serves as a guide for the model during inference.<ref name="automaton2">{{cite web |last=Kurosawa |first=Yuki |date=2021-01-19 |title=Game-character text-to-speech tool "15.ai" is now available: you can have characters from ''Undertale'' and ''Portal'' say any lines you like |url=https://automaton-media.com/articles/newsjp/20210119-149494/ |url-status=live |archive-url=https://web.archive.org/web/20210119103031/https://automaton-media.com/articles/newsjp/20210119-149494/ |archive-date=2021-01-19 |access-date=2021-01-19 |website=AUTOMATON}}</ref><ref name="Denfaminicogamer2">{{cite web |last=Yoshiyuki |first=Furushima |date=2021-01-18 |title=GLaDOS from ''Portal'' and Sans from ''Undertale'' will read your text aloud: "15.ai", a service that aims to reproduce even the emotion behind a sentence, is attracting attention |url=https://news.denfaminicogamer.jp/news/210118f |url-status=live |archive-url=https://web.archive.org/web/20210118051321/https://news.denfaminicogamer.jp/news/210118f |archive-date=2021-01-18 |access-date=2021-01-18 |website=Denfaminicogamer}}</ref>
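A minimal sketch of the two-stage pipeline described above (an acoustic model that maps text to a [[spectrogram]], followed by a neural vocoder that maps the spectrogram to a waveform) is shown below in [[PyTorch]]. The module names, layer choices, and tensor shapes are illustrative assumptions, not the architecture of HiFi-GAN or any other particular system:

<syntaxhighlight lang="python">
# Toy two-stage neural TTS pipeline: text -> mel spectrogram -> waveform.
# All names and shapes here are illustrative, not a real system's API.
import torch
import torch.nn as nn

class ToyAcousticModel(nn.Module):
    """Maps a sequence of character IDs to a mel spectrogram."""
    def __init__(self, vocab_size=256, n_mels=80, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.to_mel = nn.Linear(hidden, n_mels)

    def forward(self, char_ids):                   # (batch, text_len)
        x, _ = self.encoder(self.embed(char_ids))  # (batch, text_len, hidden)
        return self.to_mel(x)                      # (batch, text_len, n_mels)

class ToyVocoder(nn.Module):
    """Upsamples a mel spectrogram to a raw waveform, GAN-vocoder style."""
    def __init__(self, n_mels=80, upsample=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose1d(n_mels, 32, kernel_size=upsample, stride=upsample),
            nn.Tanh(),
            nn.Conv1d(32, 1, kernel_size=7, padding=3),
        )

    def forward(self, mel):                    # (batch, frames, n_mels)
        return self.net(mel.transpose(1, 2))   # (batch, 1, frames * upsample)

text = torch.randint(0, 256, (1, 20))  # stand-in for encoded input text
mel = ToyAcousticModel()(text)         # stage 1: text -> spectrogram
wav = ToyVocoder()(mel)                # stage 2: spectrogram -> waveform
print(wav.shape)                       # torch.Size([1, 1, 5120])
</syntaxhighlight>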
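Multi-speaker, emotion-conditioned synthesis of the kind described for 15.ai is commonly built by conditioning a single network on a learned per-voice speaker embedding plus a style vector encoded from a guiding phrase. The sketch below only illustrates that general conditioning technique; 15.ai's actual architecture is not publicly documented, and every name and shape here is hypothetical:

<syntaxhighlight lang="python">
# Hypothetical multi-speaker encoder with an "emotional contextualizer":
# one network serves many voices via a speaker-embedding table, and a
# guide phrase is encoded into a style vector that steers prosody.
import torch
import torch.nn as nn

class MultiSpeakerEncoder(nn.Module):
    def __init__(self, vocab=256, n_speakers=500, dim=128):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, dim)
        self.speaker_embed = nn.Embedding(n_speakers, dim)  # one row per voice
        self.context_encoder = nn.GRU(dim, dim, batch_first=True)
        self.fuse = nn.Linear(3 * dim, dim)

    def forward(self, char_ids, speaker_id, context_ids):
        text = self.text_embed(char_ids)                    # (B, T, D)
        # Encode the guide phrase into a single style vector. Because these
        # parameters are shared across all voices, an emotion learned from
        # some voices can in principle be applied to any voice.
        _, style = self.context_encoder(self.text_embed(context_ids))
        style = style[-1]                                   # (B, D)
        spk = self.speaker_embed(speaker_id)                # (B, D)
        cond = torch.cat([style, spk], dim=-1)              # (B, 2D)
        cond = cond.unsqueeze(1).expand(-1, text.size(1), -1)
        return self.fuse(torch.cat([text, cond], dim=-1))   # (B, T, D)

enc = MultiSpeakerEncoder()
line = torch.randint(0, 256, (1, 30))          # the line to be spoken
ctx = torch.randint(0, 256, (1, 12))           # e.g. an angry guide phrase
features = enc(line, torch.tensor([42]), ctx)  # condition on voice #42
print(features.shape)                          # torch.Size([1, 30, 128])
</syntaxhighlight>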
[[ElevenLabs]] is primarily known for its [[browser-based]], AI-assisted text-to-speech software, Speech Synthesis, which can produce lifelike speech by synthesizing [[vocal emotion]] and [[Intonation (linguistics)|intonation]].<ref>{{Cite web |date=January 23, 2023 |title=Generative AI comes for cinema dubbing: Audio AI startup ElevenLabs raises pre-seed |url=https://sifted.eu/articles/generative-ai-audio-elevenlabs/ |access-date=2023-02-03 |website=Sifted |language=en-US}}</ref> The company states that its software is built to adjust the intonation and pacing of delivery based on the context of the language input.<ref name=":13">{{Cite magazine |last=Ashworth |first=Boone |date=April 12, 2023 |title=AI Can Clone Your Favorite Podcast Host's Voice |url=https://www.wired.com/story/ai-podcasts-podcastle-revoice-descript/ |magazine=Wired |language=en-US |access-date=2023-04-25}}</ref> It uses algorithms to analyze the contextual aspects of text, aiming to detect emotions such as anger, sadness, happiness, or alarm, so that the system can infer the user's sentiment<ref>{{Cite magazine |author=WIRED Staff |title=This Podcast Is Not Hosted by AI Voice Clones. We Swear |url=https://www.wired.com/story/gadget-lab-podcast-594/ |magazine=Wired |language=en-US |issn=1059-1028 |access-date=2023-07-25}}</ref> and produce a more realistic, human-like inflection. Other features include multilingual speech generation and long-form content creation with contextually aware voices.<ref name=":34">{{Cite web |last=Wiggers |first=Kyle |date=2023-06-20 |title=Voice-generating platform ElevenLabs raises $19M, launches detection tool |url=https://techcrunch.com/2023/06/20/voice-generating-platform-elevenlabs-raises-19m-launches-detection-tool/ |access-date=2023-07-25 |website=TechCrunch |language=en-US}}</ref><ref>{{Cite web |last=Bonk |first=Lawrence |title=ElevenLabs' Powerful New AI Tool Lets You Make a Full Audiobook in Minutes |url=https://www.lifewire.com/elevenlabs-new-audiobook-ai-tool-7550061 |access-date=2023-07-25 |website=Lifewire |language=en}}</ref>

DNN-based speech synthesizers are approaching the naturalness of the human voice. Disadvantages of the method include low robustness when training data are insufficient, lack of controllability, and low performance in autoregressive models. For tonal languages such as Chinese or Taiwanese, several levels of [[tone sandhi]] must be applied, and a synthesizer's output can contain tone sandhi errors.<ref>{{Cite journal |last=Zhu |first=Jian |date=2020-05-25 |title=Probing the phonetic and phonological knowledge of tones in Mandarin TTS models |url=http://dx.doi.org/10.21437/speechprosody.2020-190 |journal=Speech Prosody 2020 |pages=930–934 |location=ISCA |publisher=ISCA |doi=10.21437/speechprosody.2020-190 |arxiv=1912.10915 |s2cid=209444942}}</ref>
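The best-known Mandarin sandhi rule, in which a third tone is pronounced as a second tone before another third tone, illustrates what a synthesizer's text front end must model. The sketch below assumes a simplified (syllable, tone) input and ignores the prosodic and word-boundary effects that real systems must also handle:

<syntaxhighlight lang="python">
# Minimal third-tone sandhi rule for a Mandarin TTS front end.
# Simplified: real sandhi also depends on prosodic phrasing and
# word boundaries, which this toy rule ignores.
def apply_third_tone_sandhi(syllables: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """syllables: (pinyin, tone) pairs, tones numbered 1-5 (5 = neutral)."""
    result = []
    for i, (syl, tone) in enumerate(syllables):
        next_tone = syllables[i + 1][1] if i + 1 < len(syllables) else None
        # A third tone becomes a second tone before another underlying third tone.
        result.append((syl, 2 if tone == 3 and next_tone == 3 else tone))
    return result

print(apply_third_tone_sandhi([("ni", 3), ("hao", 3)]))
# [('ni', 2), ('hao', 3)]
print(apply_third_tone_sandhi([("wo", 3), ("hen", 3), ("hao", 3)]))
# [('wo', 2), ('hen', 2), ('hao', 3)]
</syntaxhighlight>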