
=== Text normalization challenges ===

The process of normalizing text is rarely straightforward. Texts are full of [[heteronym (linguistics)|heteronym]]s, [[number]]s, and [[abbreviation]]s that all require expansion into a phonetic representation. Many English spellings are pronounced differently depending on context. For example, "My latest project is to learn how to better project my voice" contains two pronunciations of "project".

Most text-to-speech (TTS) systems do not generate [[Semantics#Linguistics|semantic]] representations of their input texts, as processes for doing so are unreliable, poorly understood, and computationally inefficient. As a result, various [[heuristic]] techniques are used to guess the proper way to disambiguate [[homograph]]s, such as examining neighboring words and using statistics about frequency of occurrence.
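The neighboring-word heuristic can be illustrated with a minimal Python sketch. The cue-word sets, the assumed default reading, and the informal respellings are all hypothetical values chosen for this example; they are not taken from any particular TTS system.

```python
import re

# Hypothetical cue words; a real system would derive these from corpus statistics.
NOUN_CUES = {"the", "a", "an", "my", "this", "latest", "new"}
VERB_CUES = {"to", "will", "can", "must", "should"}

# Informal respellings, used here only to tell the two readings apart.
READINGS = {"noun": "PROJ-ekt", "verb": "pruh-JEKT"}

# Fallback when the context gives no evidence (assumed to be the more frequent reading).
DEFAULT_TAG = "noun"


def disambiguate(sentence, target="project", window=2):
    """Return one (tag, reading) pair for each occurrence of target."""
    words = re.findall(r"[a-z']+", sentence.lower())
    results = []
    for i, word in enumerate(words):
        if word != target:
            continue
        # Look only at the few words immediately preceding the homograph.
        context = words[max(0, i - window):i]
        noun_score = sum(w in NOUN_CUES for w in context)
        verb_score = sum(w in VERB_CUES for w in context)
        if verb_score > noun_score:
            tag = "verb"
        elif noun_score > verb_score:
            tag = "noun"
        else:
            tag = DEFAULT_TAG
        results.append((tag, READINGS[tag]))
    return results


sentence = "My latest project is to learn how to better project my voice"
for tag, reading in disambiguate(sentence):
    print(tag, reading)
```

For the example sentence, the first occurrence is preceded by "my latest" and is tagged as a noun, while the second is preceded by "to better" and is tagged as a verb. A heuristic this small fails easily on sentences outside its cue lists.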

Recently TTS systems have begun to use [[hidden Markov model]]s (HMMs, discussed [[Speech synthesis#HMM-based synthesis|above]]) to generate "[[Part-of-speech tagging|parts of speech]]" to aid in disambiguating homographs. This technique is quite successful in many cases, such as deciding whether "read" should be pronounced as "red", implying past tense, or as "reed", implying present tense. Error rates when using HMMs in this fashion are typically below five percent. These techniques also work well for most European languages, although the required training [[Text corpus|corpora]] are often difficult to obtain for these languages.
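As a rough illustration of how an HMM tagger could choose between the two readings of "read", the following Python sketch decodes a toy model with the Viterbi algorithm, a standard way to find the most probable tag sequence in an HMM. The tag set, vocabulary, and all probabilities are invented for this example; a real tagger would estimate them from a tagged corpus.

```python
# Toy tag set: pronoun, modal, auxiliary "have", base-form verb, past participle.
STATES = ["PRP", "MD", "HV", "VB", "VBN"]

# All probabilities below are made up for illustration.
START = {"PRP": 1.0}
TRANS = {
    "PRP": {"MD": 0.5, "HV": 0.5},
    "MD": {"VB": 0.9, "VBN": 0.1},   # "will" is usually followed by a base form
    "HV": {"VBN": 0.9, "VB": 0.1},   # "have" is usually followed by a participle
}
EMIT = {
    "PRP": {"i": 1.0},
    "MD": {"will": 1.0},
    "HV": {"have": 1.0},
    "VB": {"read": 1.0},             # both verb tags can emit the same spelling
    "VBN": {"read": 1.0},
}
PRONUNCIATION = {"VB": "reed", "VBN": "red"}


def viterbi(words, floor=1e-6):
    """Return the most probable tag sequence; floor stands in for unseen events."""
    # best[s] = (probability, path) of the best path that ends in state s
    best = {
        s: (START.get(s, floor) * EMIT[s].get(words[0], floor), [s])
        for s in STATES
    }
    for word in words[1:]:
        new_best = {}
        for s in STATES:
            prob, path = max(
                ((best[p][0] * TRANS.get(p, {}).get(s, floor), best[p][1])
                 for p in STATES),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * EMIT[s].get(word, floor), path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


def pronounce_read(sentence):
    """Tag the sentence and return the reading chosen for its final word."""
    tags = viterbi(sentence.lower().split())
    return PRONUNCIATION[tags[-1]]


print(pronounce_read("I will read"))
print(pronounce_read("I have read"))
```

With these invented probabilities, "read" after "will" is tagged as a base-form verb and "read" after "have" as a past participle, so the two sentences receive different pronunciations even though the spelling is identical.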

Deciding how to convert numbers is another problem that TTS systems have to address. Converting a number into words is a simple programming task (at least in English), as in "1325" becoming "one thousand three hundred twenty-five". However, numbers occur in many different contexts; "1325" may also be read as "one three two five", "thirteen twenty-five" or "thirteen hundred and twenty-five". A TTS system can often infer how to expand a number based on surrounding words, numbers, and punctuation, and sometimes the system provides a way to specify the context if it is ambiguous.<ref>{{cite web | title = Speech synthesis | publisher = World Wide Web Consortium | url = http://www.w3.org/TR/speech-synthesis/#S3.1.8}}</ref> Roman numerals can also be read differently depending on context. For example, "Henry VIII" reads as "Henry the Eighth", while "Chapter VIII" reads as "Chapter Eight".
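The different readings of "1325" can be sketched in Python as follows. The function names and the cue-word sets used to pick a reading are assumptions made for this example, and the sketch only handles numbers from 0 to 9999.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Hypothetical context cues: the word just before the number selects a reading.
DIGIT_CUES = {"pin", "code"}
YEAR_CUES = {"in", "since", "year"}


def two_digits(n):
    """Spell out a number from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def as_cardinal(n):
    """Spell out 0..9999 as a cardinal number."""
    thousands, rest = divmod(n, 1000)
    hundreds, last = divmod(rest, 100)
    parts = []
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if last or not parts:
        parts.append(two_digits(last))
    return " ".join(parts)


def as_digits(token):
    """Read a digit string one digit at a time."""
    return " ".join(ONES[int(ch)] for ch in token)


def as_pairs(n):
    """Read a four-digit number as two pairs, as is common for years."""
    high, low = divmod(n, 100)
    if low == 0:
        return two_digits(high) + " hundred"
    if low < 10:
        return two_digits(high) + " oh " + ONES[low]
    return two_digits(high) + " " + two_digits(low)


def expand_number(token, prev_word=""):
    """Choose a reading for token based on the preceding word."""
    cue = prev_word.lower()
    if cue in DIGIT_CUES:
        return as_digits(token)
    if cue in YEAR_CUES and len(token) == 4:
        return as_pairs(int(token))
    return as_cardinal(int(token))


print(expand_number("1325"))
print(expand_number("1325", "PIN"))
print(expand_number("1325", "in"))
```

The three calls produce "one thousand three hundred twenty-five", "one three two five", and "thirteen twenty-five" respectively. A single preceding word is a weak signal ("in 1325 cases" would be misread as a year), which is why practical systems consider more context or accept explicit markup.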

Similarly, abbreviations can be ambiguous. For example, the abbreviation "in" for "inches" must be differentiated from the word "in", and the address "12 St John St." uses the same abbreviation for both "Saint" and "Street". TTS systems with intelligent front ends can make educated guesses about ambiguous abbreviations, while others provide the same result in all cases, resulting in nonsensical (and sometimes comical) outputs, such as "[[Ulysses S. Grant]]" being rendered as "Ulysses South Grant".
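A front end's educated guess about the abbreviations mentioned above might look like the following Python sketch. The two rules (a capitalized word after "St" suggests "Saint"; a number before "in" suggests "inches") are simplifying assumptions for this example, not the method of any specific system.

```python
def expand_abbreviations(text):
    """Expand "St" and "in" using only the immediately adjacent tokens."""
    tokens = text.split()
    expanded = []
    for i, token in enumerate(tokens):
        prev_token = tokens[i - 1] if i > 0 else ""
        next_token = tokens[i + 1] if i + 1 < len(tokens) else ""
        bare = token.rstrip(".")
        if bare == "St":
            # Assumed rule: "St" before a capitalized name is "Saint";
            # otherwise (for example at the end of an address) it is "Street".
            expanded.append("Saint" if next_token[:1].isupper() else "Street")
        elif bare == "in" and prev_token.replace(".", "", 1).isdigit():
            # Assumed rule: "in" directly after a number is a unit of length.
            expanded.append("inches")
        else:
            expanded.append(token)
    return " ".join(expanded)


print(expand_abbreviations("12 St John St."))
print(expand_abbreviations("The board is 12 in wide"))
print(expand_abbreviations("She lives in Paris"))
```

The rules remain guesses: "We arrived at 5 in the morning" would be wrongly expanded to "5 inches the morning", which shows why adjacent-token heuristics alone can produce the kind of nonsensical output described above.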