Open main menu
Home
Random
Recent changes
Special pages
Community portal
Preferences
About Wikipedia
Disclaimers
Incubator escapee wiki
Search
User menu
Talk
Dark mode
Contributions
Create account
Log in
Editing
Speech synthesis
(section)
Warning:
You are not logged in. Your IP address will be publicly visible if you make any edits. If you
log in
or
create an account
, your edits will be attributed to your username, along with other benefits.
Anti-spam check. Do
not
fill this in!
=== Text normalization challenges === The process of normalizing text is rarely straightforward. Texts are full of [[heteronym (linguistics)|heteronym]]s, [[number]]s, and [[abbreviation]]s that all require expansion into a phonetic representation. There are many spellings in English which are pronounced differently based on context. For example, "My latest project is to learn how to better project my voice" contains two pronunciations of "project". Most text-to-speech (TTS) systems do not generate [[Semantics#Linguistics|semantic]] representations of their input texts, as processes for doing so are unreliable, poorly understood, and computationally ineffective. As a result, various [[heuristic]] techniques are used to guess the proper way to disambiguate [[homograph]]s, like examining neighboring words and using statistics about frequency of occurrence. Recently TTS systems have begun to use HMMs (discussed [[Speech synthesis#HMM-based synthesis|above]]) to generate "[[Part-of-speech tagging|parts of speech]]" to aid in disambiguating homographs. This technique is quite successful for many cases such as whether "read" should be pronounced as "red" implying past tense, or as "reed" implying present tense. Typical error rates when using HMMs in this fashion are usually below five percent. These techniques also work well for most European languages, although access to required training [[Text corpus|corpora]] is frequently difficult in these languages. Deciding how to convert numbers is another problem that TTS systems have to address. It is a simple programming challenge to convert a number into words (at least in English), like "1325" becoming "one thousand three hundred twenty-five". However, numbers occur in many different contexts; "1325" may also be read as "one three two five", "thirteen twenty-five" or "thirteen hundred and twenty five". A TTS system can often infer how to expand a number based on surrounding words, numbers, and punctuation, and sometimes the system provides a way to specify the context if it is ambiguous.<ref>{{cite web | title = Speech synthesis | publisher = World Wide Web Organization | url = http://www.w3.org/TR/speech-synthesis/#S3.1.8}}</ref> Roman numerals can also be read differently depending on context. For example, "Henry VIII" reads as "Henry the Eighth", while "Chapter VIII" reads as "Chapter Eight". Similarly, abbreviations can be ambiguous. For example, the abbreviation "in" for "inches" must be differentiated from the word "in", and the address "12 St John St." uses the same abbreviation for both "Saint" and "Street". TTS systems with intelligent front ends can make educated guesses about ambiguous abbreviations, while others provide the same result in all cases, resulting in nonsensical (and sometimes comical) outputs, such as "[[Ulysses S. Grant]]" being rendered as "Ulysses South Grant".
Edit summary
(Briefly describe your changes)
By publishing changes, you agree to the
Terms of Use
, and you irrevocably agree to release your contribution under the
CC BY-SA 4.0 License
and the
GFDL
. You agree that a hyperlink or URL is sufficient attribution under the Creative Commons license.
Cancel
Editing help
(opens in new window)