=== Concatenative synthesis ===
{{main|Concatenative synthesis}}
Concatenative synthesis is based on the concatenation (stringing together) of segments of recorded speech. Generally, concatenative synthesis produces the most natural-sounding synthesized speech. However, differences between natural variations in speech and the nature of the automated techniques for segmenting the waveforms sometimes result in audible glitches in the output. There are three main sub-types of concatenative synthesis.

==== Unit selection synthesis ====
Unit selection synthesis uses large databases of recorded speech. During database creation, each recorded utterance is segmented into some or all of the following: individual [[phone (phonetics)|phones]], [[diphone]]s, half-phones, [[syllable]]s, [[morpheme]]s, [[word]]s, [[phrase]]s, and [[sentence (linguistics)|sentence]]s. Typically, the division into segments is done using a specially modified [[speech recognition|speech recognizer]] set to a "forced alignment" mode, with some manual correction afterward using visual representations such as the [[waveform]] and [[spectrogram]].<ref>[[Alan W. Black]], [https://www.cs.cmu.edu/~awb/papers/IEEE2002/allthetime/allthetime.html Perfect synthesis for all of the people all of the time.] IEEE TTS Workshop 2002.</ref> An [[index (database)|index]] of the units in the speech database is then created based on the segmentation and acoustic parameters like the [[fundamental frequency]] ([[pitch (music)|pitch]]), duration, position in the syllable, and neighboring phones. At [[Run time (program lifecycle phase)|run time]], the desired target utterance is created by determining the best chain of candidate units from the database (unit selection). This process is typically achieved using a specially weighted [[decision tree]].

Unit selection provides the greatest naturalness, because it applies only a small amount of [[digital signal processing]] (DSP) to the recorded speech. DSP often makes recorded speech sound less natural, although some systems use a small amount of signal processing at the point of concatenation to smooth the waveform. The output from the best unit-selection systems is often indistinguishable from real human voices, especially in contexts for which the TTS system has been tuned. However, maximum naturalness typically requires unit-selection speech databases to be very large, in some systems ranging into the [[gigabyte]]s of recorded data, representing dozens of hours of speech.<ref>John Kominek and [[Alan W. Black]]. (2003). CMU ARCTIC databases for speech synthesis. CMU-LTI-03-177. Language Technologies Institute, School of Computer Science, Carnegie Mellon University.</ref> Also, unit selection algorithms have been known to select segments from a place that results in less than ideal synthesis (e.g. minor words become unclear) even when a better choice exists in the database.<ref>Julia Zhang. [http://groups.csail.mit.edu/sls/publications/2004/zhang_thesis.pdf Language Generation and Speech Synthesis in Dialogues for Language Learning], master's thesis, Section 5.6 on page 54.</ref> Recently, researchers have proposed various automated methods to detect unnatural segments in unit-selection speech synthesis systems.<ref>William Yang Wang and Kallirroi Georgila. (2011). [https://www.cs.cmu.edu/~yww/papers/asru2011.pdf Automatic Detection of Unnatural Word-Level Segments in Unit-Selection Speech Synthesis], IEEE ASRU 2011.</ref>
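The run-time search is commonly formulated as minimizing the sum of a ''target cost'' (how well each candidate unit matches the desired specification) and a ''join cost'' (how smoothly adjacent units concatenate), solved with dynamic programming over the lattice of candidates. The Python sketch below illustrates that lattice-search formulation only; the <code>Unit</code> fields, cost features, and weights are simplified assumptions for illustration, and real systems combine such a search with techniques like the weighted decision trees mentioned above.

<syntaxhighlight lang="python">
# Illustrative sketch of unit selection as a Viterbi (dynamic programming)
# search. The Unit fields and cost weights are simplified assumptions,
# not a description of any particular production system.
from dataclasses import dataclass

@dataclass
class Unit:
    phone: str        # phone label this unit realizes
    pitch: float      # mean fundamental frequency (Hz)
    duration: float   # duration in seconds
    start_frame: int  # position of the unit in the recorded database
    end_frame: int

def target_cost(unit: Unit, spec: dict) -> float:
    """How far a candidate unit is from the desired (target) prosody."""
    return (abs(unit.pitch - spec["pitch"]) / 100.0
            + abs(unit.duration - spec["duration"]))

def join_cost(prev: Unit, unit: Unit) -> float:
    """Penalize acoustic discontinuity at the concatenation point.
    Units that were adjacent in the original recording join for free."""
    if prev.end_frame == unit.start_frame:
        return 0.0
    return abs(prev.pitch - unit.pitch) / 100.0

def select_units(candidates: list[list[Unit]], specs: list[dict]) -> list[Unit]:
    """Return the chain of units (one per target position) minimizing
    the total target cost plus join cost."""
    # best[i][j] = (cumulative cost, backpointer) for candidate j at position i
    best = [[(target_cost(u, specs[0]), None) for u in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for u in candidates[i]:
            tc = target_cost(u, specs[i])
            cost, back = min(
                (best[i - 1][k][0] + join_cost(p, u) + tc, k)
                for k, p in enumerate(candidates[i - 1]))
            row.append((cost, back))
        best.append(row)
    # Trace back the cheapest path through the lattice.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))
</syntaxhighlight>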
==== Diphone synthesis ====
Diphone synthesis uses a minimal speech database containing all the [[diphone]]s (sound-to-sound transitions) occurring in a language. The number of diphones depends on the [[phonotactics]] of the language: for example, Spanish has about 800 diphones, and German about 2500. In diphone synthesis, only one example of each diphone is contained in the speech database. At runtime, the target [[prosody (linguistics)|prosody]] of a sentence is superimposed on these minimal units by means of [[digital signal processing]] techniques such as [[linear predictive coding]], [[PSOLA]],<ref>{{cite web|title=Pitch-Synchronous Overlap and Add (PSOLA) Synthesis|url=http://www.fon.hum.uva.nl/praat/manual/PSOLA.html|url-status=dead|archive-url=https://web.archive.org/web/20070222180903/http://www.fon.hum.uva.nl/praat/manual/PSOLA.html|archive-date=February 22, 2007|access-date=2008-05-28}}</ref> or [[MBROLA]],<ref>T. Dutoit, V. Pagel, N. Pierret, F. Bataille, O. van der Vrecken. [http://ai2-s2-pdfs.s3.amazonaws.com/7b1f/dadf05b8f968a5b361f6f82852ade62c8010.pdf The MBROLA Project: Towards a set of high quality speech synthesizers of use for non commercial purposes]. ''ICSLP Proceedings'', 1996.</ref> or by more recent techniques such as pitch modification in the source domain using the [[discrete cosine transform]].<ref name="Muralishankar2004">{{cite journal | last1 = Muralishankar | first1 = R. | last2 = Ramakrishnan | first2 = A. G. | last3 = Prathibha | first3 = P. | date = February 2004 | title = Modification of Pitch using DCT in the Source Domain | journal = Speech Communication | volume = 42 | issue = 2 | pages = 143–154 | doi=10.1016/j.specom.2003.05.001}}</ref>

Diphone synthesis suffers from the sonic glitches of concatenative synthesis and the robotic-sounding nature of formant synthesis, and has few of the advantages of either approach other than small size. As such, its use in commercial applications is declining,{{Citation needed|date=January 2012}} although it continues to be used in research because there are a number of freely available software implementations. An early example of diphone synthesis is a teaching robot, [[Leachim (Robot)|Leachim]], invented by [[Michael J. Freeman]].<ref>{{Cite news|url=http://content.time.com/time/magazine/article/0,9171,904056,00.html|title=Education: Marvel of The Bronx|date=1974-04-01|magazine=Time|access-date=2019-05-28|language=en-US|issn=0040-781X}}</ref> Leachim contained information regarding class curricula and certain biographical information about the students whom it was programmed to teach.<ref>{{Cite web|url=http://cyberneticzoo.com/robots/1960-rudy-the-robot-michael-freeman-american/|title=1960 - Rudy the Robot - Michael Freeman (American)|date=2010-09-13|website=cyberneticzoo.com|language=en-US|access-date=2019-05-23}}</ref> It was tested in a fourth-grade classroom in [[The Bronx|the Bronx, New York]].<ref>{{Cite book|url=https://books.google.com/books?id=bNECAAAAMBAJ&q=Leachim+Michael+Freeman&pg=PA40|title=New York Magazine|date=1979-07-30|publisher=New York Media, LLC|language=en}}</ref><ref>{{Cite book|url=https://books.google.com/books?id=_QJmAAAAMAAJ&q=leachim|title=The Futurist|date=1978|publisher=World Future Society.|pages=359, 360, 361|language=en}}</ref>
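At its core, a diphone synthesizer maps the input phone sequence to the corresponding diphone sequence, fetches the single stored example of each diphone, and joins them; the target prosody is then superimposed on the result. The following Python sketch shows only the lookup-and-join step, with a short cross-fade at each join; <code>diphone_db</code> is a hypothetical mapping from diphone names to waveforms, and the prosody-modification stage (e.g. PSOLA) is deliberately omitted.

<syntaxhighlight lang="python">
import numpy as np

def phones_to_diphones(phones: list[str]) -> list[str]:
    """['h', 'e', 'l', 'o'] -> ['h-e', 'e-l', 'l-o']: each diphone unit
    spans from the middle of one phone to the middle of the next."""
    return [f"{a}-{b}" for a, b in zip(phones, phones[1:])]

def crossfade_concat(units: list[np.ndarray], overlap: int = 64) -> np.ndarray:
    """Join waveforms with a short linear cross-fade to soften the
    discontinuity at each concatenation point (each unit must be
    longer than `overlap` samples)."""
    out = units[0].copy()
    ramp = np.linspace(0.0, 1.0, overlap)
    for u in units[1:]:
        out[-overlap:] = out[-overlap:] * (1.0 - ramp) + u[:overlap] * ramp
        out = np.concatenate([out, u[overlap:]])
    return out

def synthesize(phones: list[str], diphone_db: dict[str, np.ndarray]) -> np.ndarray:
    """Look up the single recorded example of each diphone and join them.
    diphone_db is a hypothetical name -> waveform mapping; a real system
    would next impose the target prosody (e.g. with PSOLA or LPC)."""
    units = [diphone_db[d] for d in phones_to_diphones(phones)]
    return crossfade_concat(units)
</syntaxhighlight>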
==== Domain-specific synthesis ====
Domain-specific synthesis concatenates prerecorded words and phrases to create complete utterances. It is used in applications where the variety of texts the system will output is limited to a particular domain, like transit schedule announcements or weather reports.<ref>[[Lori Lamel|L.F. Lamel]], J.L. Gauvain, B. Prouts, C. Bouhier, R. Boesch. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.53.6101&rep=rep1&type=pdf Generation and Synthesis of Broadcast Messages], ''Proceedings ESCA-NATO Workshop and Applications of Speech Technology'', September 1993.</ref> The technology is very simple to implement, and has been in commercial use for a long time, in devices like talking clocks and calculators. The level of naturalness of these systems can be very high because the variety of sentence types is limited, and they closely match the prosody and intonation of the original recordings.{{Citation needed|date=February 2007}}

Because these systems are limited by the words and phrases in their databases, they are not general-purpose and can only synthesize the combinations of words and phrases with which they have been preprogrammed. The blending of words within naturally spoken language, however, can still cause problems unless the many variations are taken into account. For example, in [[rhotic and non-rhotic accents|non-rhotic]] dialects of English the ''"r"'' in words like ''"clear"'' {{IPA|/ˈklɪə/}} is usually only pronounced when the following word begins with a vowel sound (e.g. ''"clear out"'' is realized as {{IPA|/ˌklɪəɹˈʌʊt/}}). Likewise in [[French language|French]], many final consonants are no longer silent if followed by a word that begins with a vowel, an effect called [[Liaison (French)|liaison]]. This [[alternation (linguistics)|alternation]] cannot be reproduced by a simple word-concatenation system, which would require additional complexity to be [[context-sensitive grammar|context-sensitive]].
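A talking clock illustrates how simple domain-specific synthesis can be: because the output vocabulary is fixed, every word or phrase can be recorded once with natural prosody and played back in sequence. In the Python sketch below, the clip filenames and minute coverage are illustrative assumptions. Note that the lookup is purely word-by-word; handling alternations such as the liaison effects described above would require the clip chosen for one word to depend on the word that follows it.

<syntaxhighlight lang="python">
def clock_clips(hour: int, minute: int) -> list[str]:
    """Map a 24-hour time to a sequence of prerecorded clip names
    (hypothetical filenames; a real system records clips for 0-59)."""
    suffix = "am.wav" if hour < 12 else "pm.wav"
    h = hour % 12 or 12
    clips = ["it_is.wav", f"{h}.wav"]
    if minute == 0:
        clips.append("oclock.wav")
    elif minute < 10:
        clips += ["oh.wav", f"{minute}.wav"]   # e.g. "two oh five"
    else:
        clips.append(f"{minute}.wav")
    clips.append(suffix)
    return clips

# Playing the clips back to back yields the announcement:
# clock_clips(14, 5) -> ['it_is.wav', '2.wav', 'oh.wav', '5.wav', 'pm.wav']
</syntaxhighlight>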