Speech synthesis
==Text-to-speech systems==
Text-to-speech (TTS) refers to the ability of computers to read text aloud. A TTS engine converts written text to a phonemic representation, then converts the phonemic representation to waveforms that can be output as sound. TTS engines with different languages, dialects and specialized vocabularies are available through third-party publishers.<ref name="microsoft.com">{{cite web|url=http://support.microsoft.com/kb/306902 |title=How to configure and use Text-to-Speech in Windows XP and in Windows Vista |publisher=Microsoft |date=2007-05-07 |access-date=2010-02-17}}</ref>

===Android===
Version 1.6 of [[Android (operating system)|Android]] added support for speech synthesis (TTS).<ref>{{cite web |author=Jean-Michel Trivi |date=2009-09-23 |url=http://android-developers.blogspot.com/2009/09/introduction-to-text-to-speech-in.html |title=An introduction to Text-To-Speech in Android |publisher=Android-developers.blogspot.com |access-date=2010-02-17}}</ref>

===Internet===
A number of [[application software|applications]], [[Plug-in (computing)|plugins]] and gadgets can read messages directly from an [[e-mail client]] and web pages from a [[web browser]] or [[Google Toolbar]]. Some specialized software can narrate [[RSS|RSS feeds]]. Online RSS narrators simplify information delivery by allowing users to listen to their favourite news sources and to convert them to [[podcast]]s, and online RSS readers are available on almost any personal computer connected to the Internet. Users can download the generated audio files to portable devices, e.g. with the help of a [[podcast]] receiver, and listen to them while walking, jogging or commuting to work.

A growing field in Internet-based TTS is web-based [[assistive technology]], e.g. 'Browsealoud' from a UK company and [[Readspeaker]].
Web-based assistive technology can deliver TTS functionality to anyone (for reasons of accessibility, convenience, entertainment or information) with access to a web browser. The non-profit project [[Wikipedia:WikiProject Spoken Wikipedia/Pediaphon|Pediaphon]] was created in 2006 to provide a similar web-based TTS interface to Wikipedia.<ref>Andreas Bischoff, [http://www.dr-bischoff.de/research/pdf/bischoff_pediaphon_uwsi2007_final.pdf The Pediaphon – Speech Interface to the free Wikipedia Encyclopedia for Mobile Phones], PDA's and MP3-Players, Proceedings of the 18th International Conference on Database and Expert Systems Applications, pp. 575–579, {{ISBN|0-7695-2932-1}}, 2007</ref> Other work is being done in the context of the [[W3C]] through the W3C Audio Incubator Group with the involvement of the BBC and Google Inc.

===Open source===
Some [[open-source software]] systems are available, such as:
* [[eSpeak]], which supports a broad range of languages.
* [[Festival Speech Synthesis System]], which uses diphone-based synthesis as well as more modern and contemporary-sounding techniques.
* [[gnuspeech]], from the [[Free Software Foundation]], which uses articulatory synthesis.<ref>{{cite web|url=https://www.gnu.org/software/gnuspeech/ |title=gnuspeech |publisher=Gnu.org |access-date=2010-02-17}}</ref>

===Others===
* Following the commercial failure of the hardware-based Intellivoice, game developers sparingly used software synthesis in later games.{{Citation needed|date=April 2020}} Earlier systems from Atari, such as the [[Atari 5200]] (''Baseball'') and the [[Atari 2600]] (''[[Quadrun]]'' and ''Open Sesame''), also had games utilizing software synthesis.{{Citation needed|date=April 2020}}
* Some [[e-book readers]] offer TTS, such as the [[Amazon Kindle]], [[Samsung]] E6, [[PocketBook eReader]] Pro, [[enTourage eDGe]] and the Bebook Neo.
* The [[BBC Micro]] incorporated the Texas Instruments TMS5220 speech synthesis chip.
* Some models of Texas Instruments home computers produced in 1979 and 1981 ([[TI-99/4A|Texas Instruments TI-99/4 and TI-99/4A]]) were capable of text-to-phoneme synthesis or of reciting complete words and phrases (text-to-dictionary), using a very popular Speech Synthesizer peripheral. TI used a proprietary [[codec]] to embed complete spoken phrases into applications, primarily video games.<ref>{{cite web |url=http://www.mindspring.com/~ssshp/ssshp_cd/ss_home.htm |title=Smithsonian Speech Synthesis History Project (SSSHP) 1986–2002 |publisher=Mindspring.com |access-date=2010-02-17 |archive-url=https://web.archive.org/web/20131003104852/http://amhistory.si.edu/archives/speechsynthesis/ss_home.htm |archive-date=2013-10-03 |url-status=dead }}</ref>
* [[IBM]]'s [[OS/2 Warp|OS/2 Warp 4]] included VoiceType, a precursor to [[IBM ViaVoice]].
* [[Global Positioning System|GPS]] navigation units produced by [[Garmin]], [[Magellan Navigation|Magellan]], [[TomTom]] and others use speech synthesis for automobile navigation.
* [[Yamaha Corporation|Yamaha]] produced a music synthesizer in 1999, the [[Yamaha FS1R]], which included a formant synthesis capability. Sequences of up to 512 individual vowel and consonant formants could be stored and replayed, allowing short vocal phrases to be synthesized.

===Digital sound-alikes===
At the 2018 [[Conference on Neural Information Processing Systems]] (NeurIPS), researchers from [[Google]] presented the work 'Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis', which applies [[transfer learning]] from [[speaker recognition|speaker verification]] to text-to-speech synthesis, so that the synthesized voice can be made to sound almost like anybody from a speech sample of only five seconds.<ref name="GoogleLearningTransferToTTS2018">{{Citation | last1 = Jia | first1 = Ye | last2 = Zhang | first2 = Yu | last3 = Weiss | first3 = Ron J.
| title = Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis | journal = [[Advances in Neural Information Processing Systems]] | volume = 31 | pages = 4485–4495 | date = 2018-06-12 | language = en | arxiv = 1806.04558 }}</ref> Researchers from [[Baidu Research]] presented a [[voice cloning]] system with similar aims at the same conference,<ref name="Baidu2018">{{Citation | last1 = Arık | first1 = Sercan Ö. | last2 = Chen | first2 = Jitong | last3 = Peng | first3 = Kainan | last4 = Ping | first4 = Wei | last5 = Zhou | first5 = Yanqi | title = Neural Voice Cloning with a Few Samples | journal = [[Advances in Neural Information Processing Systems]] | volume = 31 | year = 2018 | url = http://papers.nips.cc/paper/8206-neural-voice-cloning-with-a-few-samples | arxiv = 1802.06006 }}</ref> though the result was rather unconvincing.

By 2019, digital sound-alikes had found their way into the hands of criminals: [[NortonLifeLock|Symantec]] researchers know of three cases where digital sound-alike technology has been used for crime.<ref name="BBC2019">{{cite web |url=https://www.bbc.com/news/technology-48908736 |title=Fake voices 'help cyber-crooks steal cash' |date=2019-07-08 |website=[[bbc.com]] |publisher=[[BBC]] |access-date=2019-09-11}}</ref><ref name="WaPo2019">{{cite news |url=https://www.washingtonpost.com/technology/2019/09/04/an-artificial-intelligence-first-voice-mimicking-software-reportedly-used-major-theft/ |title=An artificial-intelligence first: Voice-mimicking software reportedly used in a major theft |last=Harwell |first=Drew |date=2019-09-04 |newspaper=Washington Post |access-date=2019-09-08}}</ref>

This compounds the disinformation problem, given that:
* [[Human image synthesis]] has, since the early 2000s, improved to the point that humans are unable to tell a real human imaged with a real camera from a simulation of a human imaged with a simulation of a camera.
* 2D video forgery techniques were presented in 2016 that allow [[Real-time computing#Near real-time|near real-time]] counterfeiting of [[facial expressions]] in existing 2D video.<ref name="Thi2016">{{cite web | last = Thies | first = Justus | title = Face2Face: Real-time Face Capture and Reenactment of RGB Videos | publisher = Proc. Computer Vision and Pattern Recognition (CVPR), IEEE | year = 2016 | url = http://www.graphics.stanford.edu/~niessner/thies2016face.html | access-date = 2016-06-18}}</ref>
* At [[SIGGRAPH]] 2017, researchers from the [[University of Washington]] presented an audio-driven digital look-alike of the upper torso of Barack Obama. After a training phase that acquired [[lip sync]] and wider facial information from training material consisting of 2D videos with audio, it was driven only by a voice track as source data for the animation.<ref name="Suw2017">{{Citation | last1 = Suwajanakorn | first1 = Supasorn | last2 = Seitz | first2 = Steven | last3 = Kemelmacher-Shlizerman | first3 = Ira | title = Synthesizing Obama: Learning Lip Sync from Audio | publisher = [[University of Washington]] | year = 2017 | url = http://grail.cs.washington.edu/projects/AudioToObama/ | access-date = 2018-03-02}}</ref>

In March 2020, a [[freeware]] web application called 15.ai that generates high-quality voices for an assortment of fictional characters from a variety of media sources was released.<ref name="Batch042020">{{cite web|last=Ng|first=Andrew|date=2020-04-01|title=Voice Cloning for the Masses|url=https://blog.deeplearning.ai/blog/the-batch-ai-against-coronavirus-datasets-voice-cloning-for-the-masses-finding-unexploded-bombs-seeing-see-through-objects-optimizing-training-parameters|url-status=dead|archive-url=https://web.archive.org/web/20200807111844/https://blog.deeplearning.ai/blog/the-batch-ai-against-coronavirus-datasets-voice-cloning-for-the-masses-finding-unexploded-bombs-seeing-see-through-objects-optimizing-training-parameters|archive-date=2020-08-07|access-date=2020-04-02|website=deeplearning.ai|publisher=The Batch}}</ref> Initial characters included [[GLaDOS]] from ''[[Portal (series)|Portal]]'', [[Twilight Sparkle]] and [[Fluttershy]] from the show ''[[My Little Pony: Friendship Is Magic]]'', and the [[Tenth Doctor]] from ''[[Doctor Who]]''.
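The two-stage engine structure described at the start of this section (written text → phonemic representation → waveform) can be illustrated with a deliberately tiny sketch. The grapheme-to-phoneme table and the per-phoneme frequencies below are invented for illustration only; a real engine uses a pronunciation dictionary with letter-to-sound rules and far richer waveform generation:

```python
import math

# Toy grapheme-to-phoneme table (illustrative, not a real lexicon).
G2P = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}

# Illustrative stand-in frequency (Hz) for each phoneme.
PHONE_FREQ = {"HH": 180.0, "AH": 220.0, "L": 260.0, "OW": 300.0,
              "W": 200.0, "ER": 240.0, "D": 160.0}

SAMPLE_RATE = 8000
PHONE_SECONDS = 0.05  # 50 ms per phoneme

def text_to_phonemes(text):
    """Stage 1: map written words to a phonemic representation."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(G2P.get(word, []))
    return phonemes

def phonemes_to_waveform(phonemes):
    """Stage 2: render each phoneme as a short sine burst, a crude stand-in
    for real waveform generation (concatenative, formant, or neural)."""
    samples = []
    n = int(SAMPLE_RATE * PHONE_SECONDS)
    for ph in phonemes:
        freq = PHONE_FREQ[ph]
        for i in range(n):
            samples.append(math.sin(2 * math.pi * freq * i / SAMPLE_RATE))
    return samples

phones = text_to_phonemes("hello world")
wave = phonemes_to_waveform(phones)
```

Writing `wave` out as 16-bit PCM would make the two stages audible; the point here is only the separation between the linguistic front end and the acoustic back end.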
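The RSS-narration workflow mentioned in the "Internet" subsection (read a feed, speak each headline, save the audio as podcast episodes) can be sketched with the Python standard library. The feed content below is a made-up example, and the narration step assumes the eSpeak command-line tool (see "Open source") is installed:

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 feed standing in for a real news source.
RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example feed</title>
  <item><title>First headline</title></item>
  <item><title>Second headline</title></item>
</channel></rss>"""

def feed_headlines(rss_xml):
    """Extract the item titles from an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    return [item.findtext("title") for item in root.iter("item")]

def espeak_command(text, wav_path):
    """Build the argv for the eSpeak CLI (assumed installed): -w writes the
    synthesized speech to a WAV file that could be served as a podcast."""
    return ["espeak", "-w", wav_path, text]

headlines = feed_headlines(RSS)
# To actually narrate, run e.g.:
#   subprocess.run(espeak_command(h, f"headline_{i}.wav"), check=True)
# for each headline h.
```

A real narrator would fetch the feed over HTTP and loop on a schedule; only the parse-then-speak split matters here.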
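The idea behind the sound-alike systems above can be caricatured as: derive a compact "speaker embedding" from a short reference clip, then condition synthesis on that embedding. This sketch is purely illustrative and not the neural method from either paper; the real systems use a trained speaker-verification encoder and a neural synthesizer, whereas here the embedding is just signal energy plus a zero-crossing pitch proxy, and "synthesis" is a sine tone:

```python
import math

SAMPLE_RATE = 8000

def speaker_embedding(samples):
    """Toy speaker encoder: reduce a reference clip to (energy, pitch proxy)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    crossings = sum(1 for a, b in zip(samples, samples[1:])
                    if (a < 0) != (b < 0))
    zcr = crossings / len(samples)  # zero-crossing rate, a crude pitch proxy
    return (rms, zcr)

def synthesize(duration_s, embedding):
    """Toy conditioned synthesizer: generate new audio that reuses the
    reference speaker's energy and pitch proxy."""
    rms, zcr = embedding
    freq = zcr * SAMPLE_RATE / 2   # frequency implied by the crossing rate
    amp = rms * math.sqrt(2)       # sine amplitude giving the same RMS
    n = int(duration_s * SAMPLE_RATE)
    return [amp * math.sin(2 * math.pi * freq * i / SAMPLE_RATE)
            for i in range(n)]

# Five-second "reference recording": a 200 Hz tone standing in for speech.
ref = [0.5 * math.sin(2 * math.pi * 200 * i / SAMPLE_RATE)
       for i in range(5 * SAMPLE_RATE)]
emb = speaker_embedding(ref)
clone = synthesize(1.0, emb)
```

The design point carried over from the real systems is only this: the synthesizer never sees the reference audio itself, only a fixed-size summary of it.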