==== Unit selection synthesis ====

Unit selection synthesis uses large databases of recorded speech. During database creation, each recorded utterance is segmented into some or all of the following: individual [[phone (phonetics)|phones]], [[diphone]]s, half-phones, [[syllable]]s, [[morpheme]]s, [[word]]s, [[phrase]]s, and [[sentence (linguistics)|sentence]]s. Typically, the division into segments is done using a specially modified [[speech recognition|speech recognizer]] set to a "forced alignment" mode with some manual correction afterward, using visual representations such as the [[waveform]] and [[spectrogram]].<ref>[[Alan W. Black]], [https://www.cs.cmu.edu/~awb/papers/IEEE2002/allthetime/allthetime.html Perfect synthesis for all of the people all of the time.] IEEE TTS Workshop 2002.</ref> An [[index (database)|index]] of the units in the speech database is then created based on the segmentation and acoustic parameters like the [[fundamental frequency]] ([[pitch (music)|pitch]]), duration, position in the syllable, and neighboring phones. At [[Run time (program lifecycle phase)|run time]], the desired target utterance is created by determining the best chain of candidate units from the database (unit selection). This process is typically achieved using a specially weighted [[decision tree]].

Unit selection provides the greatest naturalness, because it applies only a small amount of [[digital signal processing]] (DSP) to the recorded speech. DSP often makes recorded speech sound less natural, although some systems use a small amount of signal processing at the point of concatenation to smooth the waveform. The output from the best unit-selection systems is often indistinguishable from real human voices, especially in contexts for which the TTS system has been tuned. However, maximum naturalness typically requires unit-selection speech databases to be very large, in some systems ranging into the [[gigabyte]]s of recorded data, representing dozens of hours of speech.<ref>John Kominek and [[Alan W. Black]]. (2003). CMU ARCTIC databases for speech synthesis. CMU-LTI-03-177. Language Technologies Institute, School of Computer Science, Carnegie Mellon University.</ref> Also, unit selection algorithms have been known to select segments from a place that results in less than ideal synthesis (e.g. minor words become unclear) even when a better choice exists in the database.<ref>Julia Zhang. [http://groups.csail.mit.edu/sls/publications/2004/zhang_thesis.pdf Language Generation and Speech Synthesis in Dialogues for Language Learning], master's thesis, Section 5.6 on page 54.</ref> Recently, researchers have proposed various automated methods to detect unnatural segments in unit-selection speech synthesis systems.<ref>William Yang Wang and Kallirroi Georgila. (2011). [https://www.cs.cmu.edu/~yww/papers/asru2011.pdf Automatic Detection of Unnatural Word-Level Segments in Unit-Selection Speech Synthesis], IEEE ASRU 2011.</ref>
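The search for the best chain of candidate units is commonly formulated as minimizing a combined cost: a ''target cost'' measuring how well each candidate matches the desired phonetic and prosodic specification, plus a ''join cost'' measuring how smoothly adjacent units concatenate, solved with a Viterbi-style dynamic-programming search over the candidate lattice. The sketch below illustrates that search under stated assumptions; the functions <code>target_cost</code> and <code>join_cost</code> are hypothetical placeholders, not the API of any particular system.

<syntaxhighlight lang="python">
# Minimal sketch of Viterbi-style unit selection over a candidate lattice.
# Assumptions (not from any specific system): target_cost(spec, unit) scores
# how well a database unit matches the desired specification; join_cost(a, b)
# scores how smoothly unit a concatenates with unit b. Lower is better.

def select_units(targets, candidates, target_cost, join_cost):
    """targets: desired unit specs; candidates[i]: database units for targets[i]."""
    # best[i][j] = (cumulative cost of the cheapest chain ending at
    #               candidates[i][j], back-pointer into row i-1)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            tc = target_cost(targets[i], c)
            # Cheapest way to reach this candidate from any previous candidate.
            cost, prev = min(
                (best[i - 1][k][0] + join_cost(p, c) + tc, k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost, prev))
        best.append(row)
    # Trace back the lowest-cost chain of units.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))
</syntaxhighlight>

In practice the cost functions weight features such as fundamental frequency, duration, and phonetic context, and the weighted decision tree mentioned above is typically used to cluster and index the database so that each target position draws on a manageable set of candidates rather than the whole inventory.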