{{short description|Artificial production of human speech}}
{{pp-pc}}
{{listen | filename = JärDa-utrop.ogg | title = Automatic announcement | description = A synthetic voice announcing an arriving train in Sweden. | format = [[Ogg]] }}
'''Speech synthesis''' is the artificial production of human [[speech]]. A computer system used for this purpose is called a '''speech synthesizer''', and can be implemented in [[software]] or [[Computer hardware|hardware]] products. A '''text-to-speech''' ('''TTS''') system converts normal language text into speech; other systems render [[symbolic linguistic representation]]s like [[phonetic transcription]]s into speech.<ref>{{Cite book |first1=Jonathan |last1=Allen |first2=M. Sharon |last2=Hunnicutt |first3=Dennis |last3=Klatt |title=From Text to Speech: The MITalk system |publisher=Cambridge University Press |year=1987 |isbn=978-0-521-30641-6 |url-access=registration |url=https://archive.org/details/fromtexttospeech00alle }}</ref> The reverse process is [[speech recognition]].

Synthesized speech can be created by [[Concatenative synthesis|concatenating]] pieces of recorded speech that are stored in a [[database]]. Systems differ in the size of the stored speech units; a system that stores [[phone (phonetics)|phones]] or [[diphone]]s provides the largest output range, but may lack clarity.{{Citation needed|date=September 2024}} For specific usage domains, the storage of entire words or sentences allows for high-quality output. Alternatively, a synthesizer can incorporate a model of the [[vocal tract]] and other human voice characteristics to create a completely "synthetic" voice output.<ref>{{Cite journal | doi = 10.1121/1.386780 | last1 = Rubin | first1 = P. | last2 = Baer | first2 = T. | last3 = Mermelstein | first3 = P. | year = 1981 | title = An articulatory synthesizer for perceptual research | journal = Journal of the Acoustical Society of America | volume = 70 | issue = 2 | pages = 321–328 | bibcode = 1981ASAJ...70..321R }}</ref>

The quality of a speech synthesizer is judged by its similarity to the human voice and by its ability to be understood clearly. An intelligible text-to-speech program allows people with [[visual impairment]]s or [[reading disability|reading disabilities]] to listen to written words on a home computer. Many computer [[operating system]]s have included speech synthesizers since the early 1990s.{{Citation needed|date=September 2024}}

[[File:TTS System.svg|550px|thumb|Overview of a typical TTS system]]

A text-to-speech system (or "engine") is composed of two parts:<ref>{{Cite book |first1=Jan P. H. |last1=van Santen |first2=Richard W. |last2=Sproat |first3=Joseph P. |last3=Olive |first4=Julia |last4=Hirschberg |title=Progress in Speech Synthesis |publisher=Springer |year=1997 |isbn=978-0-387-94701-3 |url-access=registration |url=https://archive.org/details/progressinspeech0000unse }}</ref> a [[Input method|front-end]] and a [[Front and back ends|back-end]].

The front-end has two major tasks. First, it converts raw text containing symbols like numbers and abbreviations into the equivalent of written-out words. This process is often called ''[[text normalization]]'', ''pre-processing'', or ''[[Tokenization (lexical analysis)|tokenization]]''. The front-end then assigns [[phonetic transcription]]s to each word, and divides and marks the text into [[prosody (linguistics)|prosodic units]], like [[phrase]]s, [[clause]]s, and [[sentence (linguistics)|sentence]]s. The process of assigning phonetic transcriptions to words is called ''text-to-phoneme'' or ''[[grapheme]]-to-phoneme'' conversion. Phonetic transcriptions and [[Prosody (linguistics)|prosody]] information together make up the symbolic linguistic representation that is output by the front-end.
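The two front-end tasks can be illustrated with a minimal sketch. The abbreviation table, digit table, and lexicon entries below are toy assumptions invented for illustration; real front-ends use large pronunciation dictionaries (such as CMUdict) plus statistical letter-to-sound models for words not in the dictionary.

```python
# Toy TTS front-end: text normalization, then lexicon-based
# grapheme-to-phoneme conversion. All table contents are illustrative.

ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "no.": "number"}
DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}
LEXICON = {  # word -> ARPAbet-style phoneme list (toy entries)
    "the": ["DH", "AH"], "doctor": ["D", "AA", "K", "T", "ER"],
    "is": ["IH", "Z"], "in": ["IH", "N"],
    "room": ["R", "UW", "M"], "two": ["T", "UW"],
}

def normalize(text: str) -> list[str]:
    """Text normalization: expand digits and abbreviations into
    written-out words."""
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif token.isdigit():                      # "2" -> "two"
            words.extend(DIGITS[d] for d in token)
        else:
            words.append(token.strip(".,!?"))
    return words

def to_phonemes(words: list[str]) -> list[str]:
    """Grapheme-to-phoneme conversion by lexicon lookup; unknown words
    fall back to spelling out letters (a crude letter-to-sound stand-in)."""
    phones: list[str] = []
    for w in words:
        phones.extend(LEXICON.get(w, list(w.upper())))
    return phones

words = normalize("The Dr. is in room 2")
print(words)               # ['the', 'doctor', 'is', 'in', 'room', 'two']
print(to_phonemes(words))  # ['DH', 'AH', 'D', 'AA', 'K', 'T', 'ER', ...]
```

In a full system the normalized words would also be grouped into prosodic units (phrases, clauses, sentences) before being passed, together with the phoneme strings, to the back-end.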
The back-end—often referred to as the ''synthesizer''—then converts the symbolic linguistic representation into sound. In certain systems, this part includes the computation of the ''target prosody'' (pitch contour, phoneme durations),<ref>{{Cite journal | last1 = Van Santen | first1 = J. | title = Assignment of segmental duration in text-to-speech synthesis | doi = 10.1006/csla.1994.1005 | journal = Computer Speech & Language | volume = 8 | issue = 2 | pages = 95–128 |date=April 1994 }}</ref> which is then imposed on the output speech.
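A minimal sketch of this target-prosody step, assigning each phoneme a duration and a pitch (F0) target before waveform generation. The base durations, the phrase-final lengthening factor, and the linearly falling pitch contour are invented placeholder values in the spirit of rule-based duration models, not figures from any real synthesizer; production systems condition these on many more contextual factors.

```python
# Toy target-prosody computation: per-phoneme (duration, F0) targets.
# Base durations and the F0 declination line are illustrative assumptions.

BASE_DURATION_MS = {"vowel": 120, "consonant": 70}
VOWELS = {"AA", "AE", "AH", "AO", "EH", "ER", "EY", "IH", "IY", "OW", "UW"}

def target_prosody(phonemes: list[str], phrase_final_lengthening: float = 1.4,
                   f0_start: float = 130.0, f0_end: float = 100.0):
    """Return a list of (phoneme, duration_ms, f0_hz) targets.

    Durations come from a per-class base value, with the last phoneme
    stretched (phrase-final lengthening); F0 falls linearly across the
    phrase, a crude stand-in for a declination-based pitch contour."""
    n = len(phonemes)
    targets = []
    for i, ph in enumerate(phonemes):
        kind = "vowel" if ph in VOWELS else "consonant"
        dur = BASE_DURATION_MS[kind]
        if i == n - 1:  # lengthen the phrase-final phoneme
            dur = round(dur * phrase_final_lengthening)
        f0 = f0_start + (f0_end - f0_start) * (i / max(n - 1, 1))
        targets.append((ph, dur, round(f0, 1)))
    return targets

for t in target_prosody(["DH", "AH", "T", "UW"]):
    print(t)  # e.g. ('DH', 70, 130.0) ... ('UW', 168, 100.0)
```

The resulting (phoneme, duration, pitch) targets are what a back-end would then realize as audio, whether by concatenating recorded units or by driving a parametric model.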