{{short description|Artificial production of human speech}}
{{pp-pc}}
{{listen | filename = JärDa-utrop.ogg | title = Automatic announcement | description = A synthetic voice announcing an arriving train in Sweden. | format = [[Ogg]] }}
'''Speech synthesis''' is the artificial production of human [[speech]]. A computer system used for this purpose is called a '''speech synthesizer''', and can be implemented in [[software]] or [[Computer hardware|hardware]] products. A '''text-to-speech''' ('''TTS''') system converts normal language text into speech; other systems render [[symbolic linguistic representation]]s like [[phonetic transcription]]s into speech.<ref>{{Cite book |first1=Jonathan |last1=Allen |first2=M. Sharon |last2=Hunnicutt |first3=Dennis |last3=Klatt |title=From Text to Speech: The MITalk system |publisher=Cambridge University Press |year=1987 |isbn=978-0-521-30641-6 |url-access=registration |url=https://archive.org/details/fromtexttospeech00alle }}</ref> The reverse process is [[speech recognition]].

Synthesized speech can be created by [[Concatenative synthesis|concatenating]] pieces of recorded speech that are stored in a [[database]]. Systems differ in the size of the stored speech units; a system that stores [[phone (phonetics)|phones]] or [[diphone]]s provides the largest output range, but may lack clarity.{{Citation needed|date=September 2024}} For specific usage domains, the storage of entire words or sentences allows for high-quality output. Alternatively, a synthesizer can incorporate a model of the [[vocal tract]] and other human voice characteristics to create a completely "synthetic" voice output.<ref>{{Cite journal | doi = 10.1121/1.386780 | last1 = Rubin | first1 = P. | last2 = Baer | first2 = T. | last3 = Mermelstein | first3 = P. | year = 1981 | title = An articulatory synthesizer for perceptual research | journal = Journal of the Acoustical Society of America | volume = 70 | issue = 2 | pages = 321–328 | bibcode = 1981ASAJ...70..321R }}</ref>

The quality of a speech synthesizer is judged by its similarity to the human voice and by its ability to be understood clearly. An intelligible text-to-speech program allows people with [[visual impairment]]s or [[reading disability|reading disabilities]] to listen to written words on a home computer. Many computer [[operating system]]s have included speech synthesizers since the early 1990s.{{Citation needed|date=September 2024}}

[[File:TTS System.svg|550px|thumb|Overview of a typical TTS system]]

A text-to-speech system (or "engine") is composed of two parts:<ref>{{Cite book |first1=Jan P. H. |last1=van Santen |first2=Richard W. |last2=Sproat |first3=Joseph P. |last3=Olive |first4=Julia |last4=Hirschberg |title=Progress in Speech Synthesis |publisher=Springer |year=1997 |isbn=978-0-387-94701-3 |url-access=registration |url=https://archive.org/details/progressinspeech0000unse }}</ref> a [[Input method|front-end]] and a [[Front and back ends|back-end]].

The front-end has two major tasks. First, it converts raw text containing symbols like numbers and abbreviations into the equivalent of written-out words. This process is often called ''[[text normalization]]'', ''pre-processing'', or ''[[Tokenization (lexical analysis)|tokenization]]''. The front-end then assigns [[phonetic transcription]]s to each word, and divides and marks the text into [[prosody (linguistics)|prosodic units]], like [[phrase]]s, [[clause]]s, and [[sentence (linguistics)|sentence]]s. The process of assigning phonetic transcriptions to words is called ''text-to-phoneme'' or ''[[grapheme]]-to-phoneme'' conversion. Phonetic transcriptions and [[Prosody (linguistics)|prosody]] information together make up the symbolic linguistic representation that is output by the front-end.
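The two front-end tasks can be illustrated with a minimal sketch. The abbreviation table, digit table, and lexicon entries below are toy assumptions invented for illustration; real front-ends use large pronunciation dictionaries (such as CMUdict) plus statistical letter-to-sound models for words not in the dictionary.

```python
# Toy TTS front-end: text normalization, then lexicon-based
# grapheme-to-phoneme conversion. All table contents are illustrative.

ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "no.": "number"}
DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}
LEXICON = {  # word -> ARPAbet-style phoneme list (toy entries)
    "the": ["DH", "AH"], "doctor": ["D", "AA", "K", "T", "ER"],
    "is": ["IH", "Z"], "in": ["IH", "N"],
    "room": ["R", "UW", "M"], "two": ["T", "UW"],
}

def normalize(text: str) -> list[str]:
    """Text normalization: expand digits and abbreviations into
    written-out words."""
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif token.isdigit():                      # "2" -> "two"
            words.extend(DIGITS[d] for d in token)
        else:
            words.append(token.strip(".,!?"))
    return words

def to_phonemes(words: list[str]) -> list[str]:
    """Grapheme-to-phoneme conversion by lexicon lookup; unknown words
    fall back to spelling out letters (a crude letter-to-sound stand-in)."""
    phones: list[str] = []
    for w in words:
        phones.extend(LEXICON.get(w, list(w.upper())))
    return phones

words = normalize("The Dr. is in room 2")
print(words)               # ['the', 'doctor', 'is', 'in', 'room', 'two']
print(to_phonemes(words))  # ['DH', 'AH', 'D', 'AA', 'K', 'T', 'ER', ...]
```

In a full system the normalized words would also be grouped into prosodic units (phrases, clauses, sentences) before being passed, together with the phoneme strings, to the back-end.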
The back-end—often referred to as the ''synthesizer''—then converts the symbolic linguistic representation into sound. In certain systems, this part includes the computation of the ''target prosody'' (pitch contour, phoneme durations),<ref>{{Cite journal | last1 = Van Santen | first1 = J. | title = Assignment of segmental duration in text-to-speech synthesis | doi = 10.1006/csla.1994.1005 | journal = Computer Speech & Language | volume = 8 | issue = 2 | pages = 95–128 |date=April 1994 }}</ref> which is then imposed on the output speech.
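A minimal sketch of this target-prosody step, assigning each phoneme a duration and a pitch (F0) target before waveform generation. The base durations, the phrase-final lengthening factor, and the linearly falling pitch contour are invented placeholder values in the spirit of rule-based duration models, not figures from any real synthesizer; production systems condition these on many more contextual factors.

```python
# Toy target-prosody computation: per-phoneme (duration, F0) targets.
# Base durations and the F0 declination line are illustrative assumptions.

BASE_DURATION_MS = {"vowel": 120, "consonant": 70}
VOWELS = {"AA", "AE", "AH", "AO", "EH", "ER", "EY", "IH", "IY", "OW", "UW"}

def target_prosody(phonemes: list[str], phrase_final_lengthening: float = 1.4,
                   f0_start: float = 130.0, f0_end: float = 100.0):
    """Return a list of (phoneme, duration_ms, f0_hz) targets.

    Durations come from a per-class base value, with the last phoneme
    stretched (phrase-final lengthening); F0 falls linearly across the
    phrase, a crude stand-in for a declination-based pitch contour."""
    n = len(phonemes)
    targets = []
    for i, ph in enumerate(phonemes):
        kind = "vowel" if ph in VOWELS else "consonant"
        dur = BASE_DURATION_MS[kind]
        if i == n - 1:  # lengthen the phrase-final phoneme
            dur = round(dur * phrase_final_lengthening)
        f0 = f0_start + (f0_end - f0_start) * (i / max(n - 1, 1))
        targets.append((ph, dur, round(f0, 1)))
    return targets

for t in target_prosody(["DH", "AH", "T", "UW"]):
    print(t)  # e.g. ('DH', 70, 130.0) ... ('UW', 168, 100.0)
```

The resulting (phoneme, duration, pitch) targets are what a back-end would then realize as audio, whether by concatenating recorded units or by driving a parametric model.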