== Challenges ==

=== Text normalization challenges ===
The process of normalizing text is rarely straightforward. Texts are full of [[heteronym (linguistics)|heteronym]]s, [[number]]s, and [[abbreviation]]s that all require expansion into a phonetic representation. There are many spellings in English which are pronounced differently based on context. For example, "My latest project is to learn how to better project my voice" contains two pronunciations of "project".

Most text-to-speech (TTS) systems do not generate [[Semantics#Linguistics|semantic]] representations of their input texts, as processes for doing so are unreliable, poorly understood, and computationally ineffective. As a result, various [[heuristic]] techniques are used to guess the proper way to disambiguate [[homograph]]s, such as examining neighboring words and using statistics about frequency of occurrence. Recently, TTS systems have begun to use HMMs (discussed [[Speech synthesis#HMM-based synthesis|above]]) to generate "[[Part-of-speech tagging|parts of speech]]" to aid in disambiguating homographs. This technique is quite successful in many cases, such as deciding whether "read" should be pronounced as "red" (implying past tense) or as "reed" (implying present tense). Typical error rates when using HMMs in this fashion are usually below five percent. These techniques also work well for most European languages, although access to the required training [[Text corpus|corpora]] is frequently difficult in these languages.

Deciding how to convert numbers is another problem that TTS systems have to address. It is a simple programming challenge to convert a number into words (at least in English), like "1325" becoming "one thousand three hundred twenty-five". However, numbers occur in many different contexts; "1325" may also be read as "one three two five", "thirteen twenty-five" or "thirteen hundred and twenty-five". A TTS system can often infer how to expand a number based on surrounding words, numbers, and punctuation, and sometimes the system provides a way to specify the context if it is ambiguous.<ref>{{cite web |title=Speech synthesis |publisher=World Wide Web Consortium |url=http://www.w3.org/TR/speech-synthesis/#S3.1.8}}</ref> Roman numerals can also be read differently depending on context. For example, "Henry VIII" reads as "Henry the Eighth", while "Chapter VIII" reads as "Chapter Eight".

Similarly, abbreviations can be ambiguous. For example, the abbreviation "in" for "inches" must be differentiated from the word "in", and the address "12 St John St." uses the same abbreviation for both "Saint" and "Street". TTS systems with intelligent front ends can make educated guesses about ambiguous abbreviations, while others provide the same result in all cases, resulting in nonsensical (and sometimes comical) outputs, such as "[[Ulysses S. Grant]]" being rendered as "Ulysses South Grant".
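The number-expansion problem described above can be illustrated with a minimal sketch. The context rule and helper names below (<code>expand_number</code>, <code>cardinal</code>) are invented for this example; a real TTS front end uses far richer lexical, numeric, and punctuation context than a single preceding word.

<syntaxhighlight lang="python">
# Toy sketch of context-dependent number expansion. The year heuristic
# and word lists are illustrative only, not a production algorithm.

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen",
        "fourteen", "fifteen", "sixteen", "seventeen", "eighteen",
        "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty",
        "seventy", "eighty", "ninety"]

def two_digits(n: int) -> str:
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def cardinal(n: int) -> str:
    # Full cardinal reading; handles 0-9999, enough for this example.
    parts = []
    if n >= 1000:
        parts.append(ONES[n // 1000] + " thousand")
        n %= 1000
    if n >= 100:
        parts.append(ONES[n // 100] + " hundred")
        n %= 100
    if n or not parts:
        parts.append(two_digits(n))
    return " ".join(parts)

def expand_number(token: str, prev_word: str) -> str:
    n = int(token)
    # Heuristic: after "in" or "since", a four-digit number is probably
    # a year and is read in pairs ("thirteen twenty-five").
    if prev_word.lower() in {"in", "since"} and 1000 <= n <= 2999:
        if n % 100 == 0:
            return two_digits(n // 100) + " hundred"
        return two_digits(n // 100) + " " + two_digits(n % 100)
    return cardinal(n)

print(expand_number("1325", "in"))    # thirteen twenty-five
print(expand_number("1325", "cost"))  # one thousand three hundred twenty-five
</syntaxhighlight>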
=== Text-to-phoneme challenges ===
{{Unreferenced section|date=April 2023}}
Speech synthesis systems use two basic approaches to determine the pronunciation of a word based on its [[spelling]], a process which is often called text-to-phoneme or [[grapheme]]-to-phoneme conversion ([[phoneme]] is the term used by [[Linguistics|linguists]] to describe distinctive sounds in a [[language]]). The simplest approach to text-to-phoneme conversion is the dictionary-based approach, where a large dictionary containing all the words of a language and their correct [[pronunciation]]s is stored by the program. Determining the correct pronunciation of each word is then a matter of looking up each word in the dictionary and replacing the spelling with the pronunciation specified there. The other approach is rule-based, in which pronunciation rules are applied to words to determine their pronunciations based on their spellings. This is similar to the "sounding out", or [[synthetic phonics]], approach to learning reading.

Each approach has advantages and drawbacks. The dictionary-based approach is quick and accurate, but completely fails if it is given a word which is not in its dictionary. As dictionary size grows, so too do the memory requirements of the synthesis system. On the other hand, the rule-based approach works on any input, but the complexity of the rules grows substantially as the system takes into account irregular spellings or pronunciations. (Consider that the word "of" is very common in English, yet is the only word in which the letter "f" is pronounced {{IPA|[v]}}.) As a result, nearly all speech synthesis systems use a combination of these approaches.

Languages with a [[phonemic orthography]] have a very regular writing system, and the prediction of the pronunciation of words based on their spellings is quite successful. Speech synthesis systems for such languages often use the rule-based method extensively, resorting to dictionaries only for those few words, like foreign names and loanwords, whose pronunciations are not obvious from their spellings. On the other hand, speech synthesis systems for languages like English, which have extremely irregular spelling systems, are more likely to rely on dictionaries, and to use rule-based methods only for unusual words, or words that are not in their dictionaries.
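The combination of the two approaches can be sketched in a few lines: look the word up in the lexicon first, and fall back to letter-to-sound rules only on a miss. The tiny lexicon and rule table below are invented for illustration; real systems use large pronunciation lexicons (such as CMUdict for English) and hundreds of context-sensitive rules.

<syntaxhighlight lang="python">
# Toy grapheme-to-phoneme converter: dictionary first, rules as fallback.

LEXICON = {                    # exception words the rules would get wrong
    "of": ["AH", "V"],         # the lone English word where "f" is /v/
}

RULES = {                      # grossly simplified letter-to-sound rules
    "sh": ["SH"], "ch": ["CH"], "ee": ["IY"],
    "f": ["F"], "o": ["AA"], "t": ["T"], "v": ["V"],
}

def rule_based(word: str) -> list[str]:
    phones, i = [], 0
    while i < len(word):
        # Greedy longest match, so digraphs win over single letters.
        for length in (2, 1):
            chunk = word[i:i + length]
            if chunk in RULES:
                phones += RULES[chunk]
                i += length
                break
        else:
            i += 1             # no rule for this letter; skip it
    return phones

def to_phonemes(word: str) -> list[str]:
    word = word.lower()
    return LEXICON.get(word, rule_based(word))

print(to_phonemes("of"))     # ['AH', 'V']        (dictionary hit)
print(to_phonemes("sheet"))  # ['SH', 'IY', 'T']  (rules fallback)
</syntaxhighlight>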
=== Evaluation challenges ===
The consistent evaluation of speech synthesis systems may be difficult because of a lack of universally agreed objective evaluation criteria. Different organizations often use different speech data. The quality of speech synthesis systems also depends on the quality of the production technique (which may involve analogue or digital recording) and on the facilities used to replay the speech. Evaluating speech synthesis systems has therefore often been compromised by differences between production techniques and replay facilities. Since 2005, however, some researchers have started to evaluate speech synthesis systems using a common speech dataset.<ref>{{cite web |url=http://festvox.org/blizzard |title=Blizzard Challenge |publisher=Festvox.org |access-date=2012-02-22}}</ref>

=== Prosodics and emotional content ===
{{See also|Emotional speech recognition|Prosody (linguistics)}}
A study in the journal ''Speech Communication'' by Amy Drahota and colleagues at the [[University of Portsmouth]], [[UK]], reported that listeners to voice recordings could determine, at better than chance levels, whether or not the speaker was smiling.<ref>{{Cite news |title=Smile – and the world can hear you |date=January 9, 2008 |url=http://www.port.ac.uk/aboutus/newsandevents/news/title,74220,en.html |archive-date=May 17, 2008 |archive-url=https://web.archive.org/web/20080517102201/http://www.port.ac.uk/aboutus/newsandevents/news/title%2C74220%2Cen.html |publisher=University of Portsmouth |url-status=dead}}</ref><ref>{{Cite news |title=Smile – And The World Can Hear You, Even If You Hide |work=Science Daily |date=January 2008 |url=https://www.sciencedaily.com/releases/2008/01/080111224745.htm}}</ref><ref>{{Cite journal |last1=Drahota |first1=A. |title=The vocal communication of different kinds of smile |doi=10.1016/j.specom.2007.10.001 |journal=Speech Communication |volume=50 |issue=4 |pages=278–287 |year=2008 |s2cid=46693018 |url=http://peer.ccsd.cnrs.fr/docs/00/49/91/97/PDF/PEER_stage2_10.1016%252Fj.specom.2007.10.001.pdf |url-status=dead |archive-url=https://web.archive.org/web/20130703062330/https://peer.ccsd.cnrs.fr/docs/00/49/91/97/PDF/PEER_stage2_10.1016/j.specom.2007.10.001.pdf |archive-date=2013-07-03}}</ref> It was suggested that identification of the vocal features that signal emotional content may be used to help make synthesized speech sound more natural.

One related issue is modification of the [[pitch contour]] of a sentence, depending on whether it is an affirmative, interrogative or exclamatory sentence. One technique for pitch modification<ref name="Muralishankar2004" /> uses the [[discrete cosine transform]] in the source domain (the [[linear prediction]] residual). Such pitch-synchronous pitch-modification techniques require a priori pitch marking of the synthesis speech database, using techniques such as epoch extraction based on a dynamic [[Plosive|plosion]] index applied to the integrated linear prediction residual of the [[Voice (phonetics)|voiced]] regions of speech.<ref>{{cite journal |last1=Prathosh |first1=A. P. |last2=Ramakrishnan |first2=A. G. |last3=Ananthapadmanabha |first3=T. V. |title=Epoch extraction based on integrated linear prediction residual using plosion index |journal=IEEE Trans. Audio Speech Language Processing |date=December 2013 |volume=21 |issue=12 |pages=2471–2480 |doi=10.1109/TASL.2013.2273717 |s2cid=10491251}}</ref> In general, prosody remains a challenge for speech synthesizers and is an active research topic.
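A minimal sketch of the core operation in such DCT-domain pitch modification follows. It assumes the pitch cycles of the residual have already been segmented by epoch marking, resamples only a single cycle, and uses a simplified energy normalization; a full system works pitch-synchronously over all epochs and re-applies the linear prediction synthesis filter afterwards.

<syntaxhighlight lang="python">
import numpy as np
from scipy.fft import dct, idct

def modify_cycle(residual_cycle: np.ndarray, pitch_scale: float) -> np.ndarray:
    """Resample one pitch cycle of the LP residual in the DCT domain.

    pitch_scale > 1 raises the pitch (shorter cycle); < 1 lowers it.
    Truncating or zero-padding the DCT coefficients changes the cycle
    length, and hence the local pitch period, while roughly preserving
    the cycle's spectral shape.
    """
    old_len = len(residual_cycle)
    new_len = max(1, int(round(old_len / pitch_scale)))
    coeffs = dct(residual_cycle, norm="ortho")
    if new_len < old_len:
        coeffs = coeffs[:new_len]                        # truncate: shorter cycle
    else:
        coeffs = np.pad(coeffs, (0, new_len - old_len))  # zero-pad: longer cycle
    # The orthonormal DCT changes amplitude with length; rescale to compensate.
    return idct(coeffs, norm="ortho") * np.sqrt(new_len / old_len)

# Example: shorten a 100-sample cycle to 80 samples (pitch up by 25%).
cycle = np.random.randn(100)
shorter = modify_cycle(cycle, pitch_scale=1.25)
print(len(shorter))  # 80
</syntaxhighlight>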