===Accuracy===
As mentioned earlier in this article, the accuracy of speech recognition may vary depending on the following factors:
* Error rates increase as the vocabulary size grows: 
::e.g. the 10 digits "zero" to "nine" can be recognized essentially perfectly, but vocabulary sizes of 200, 5000 or 100000 may have error rates of 3%, 7%, or 45% respectively.
* A vocabulary is hard to recognize if it contains confusable words:
::e.g. the 26 letters of the English alphabet are difficult to discriminate because many of their names sound alike (most notoriously the E-set: "B, C, D, E, G, P, T, V, Z", where "Z" belongs to the set only when it is pronounced "zee" rather than "zed", depending on the English region); an 8% error rate is considered good for this vocabulary.<ref>{{Cite web |title=Letter Names Can Cause Confusion and Other Things to Know About Letter–Sound Relationships |url=https://www.naeyc.org/resources/pubs/yc/mar2015/letter-sound-relationships |access-date=2023-10-27 |website=NAEYC |language=en |archive-date=9 September 2024 |archive-url=https://web.archive.org/web/20240909054452/https://www.naeyc.org/resources/pubs/yc/mar2015/letter-sound-relationships |url-status=live }}</ref>
* Speaker dependence vs. independence:
:: A speaker-dependent system is intended for use by a single speaker.
:: A speaker-independent system is intended for use by any speaker (more difficult).
* Isolated, discontinuous, or continuous speech:
:: With isolated speech, single words are used, which makes the speech easier to recognize.
:: With discontinuous speech, full sentences separated by silence are used, which makes the speech about as easy to recognize as isolated speech.
:: With continuous speech, naturally spoken sentences are used, which makes the speech harder to recognize than either isolated or discontinuous speech.

* Task and language constraints
**e.g. A querying application may dismiss the hypothesis "The apple is red."
**e.g. Constraints may be semantic, rejecting "The apple is angry."
**e.g. Constraints may be syntactic, rejecting "Red is apple the."

Constraints are often represented by grammar. 
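The syntactic and semantic examples above can be sketched as a toy constraint checker. This is only an illustration: the lexicon, the single sentence pattern, and the table of allowed adjectives are assumptions made up for this example and are not taken from any real recognizer.

```python
# Toy constraint checker; the lexicon and rules are illustrative assumptions.
LEXICON = {
    "the": "DET",
    "apple": "NOUN",
    "is": "VERB",
    "red": "ADJ",
    "angry": "ADJ",
}

# Syntactic constraint: the only sentence pattern this toy grammar accepts.
PATTERN = ["DET", "NOUN", "VERB", "ADJ"]

# Semantic constraint: adjectives that may describe each noun.
ALLOWED_ADJECTIVES = {"apple": {"red"}}


def tokenize(sentence):
    """Lower-case the sentence, strip the final period, and split into words."""
    return sentence.lower().rstrip(".").split()


def is_syntactic(words):
    """True if every word is known and the tag sequence matches PATTERN."""
    tags = [LEXICON.get(word) for word in words]
    return tags == PATTERN


def is_semantic(words):
    """True if the adjective is allowed for the noun (assumes valid syntax)."""
    noun, adjective = words[1], words[3]
    return adjective in ALLOWED_ADJECTIVES.get(noun, set())


def accept(sentence):
    """Accept a hypothesis only if it passes both kinds of constraint."""
    words = tokenize(sentence)
    return is_syntactic(words) and is_semantic(words)


for hypothesis in ["The apple is red.", "The apple is angry.", "Red is apple the."]:
    print(hypothesis, accept(hypothesis))
```

A recognizer that applies such constraints can discard hypotheses that match the audio well but violate the grammar.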
* Read vs. spontaneous speech – When a person reads, the text has usually been prepared in advance; spontaneous speech is harder to recognize because of disfluencies (such as "uh" and "um", false starts, incomplete sentences, stuttering, coughing, and laughter) and limited vocabulary.
* Adverse conditions – Environmental noise (e.g. noise in a car or a factory) and acoustical distortions (e.g. echoes, room acoustics).
Speech recognition is a multi-leveled pattern recognition task.
* Acoustical signals are structured into a hierarchy of units, e.g. [[phoneme]]s, words, phrases, and sentences;
* Each level provides additional constraints;
::e.g. known word pronunciations or legal word sequences, which can compensate for errors or uncertainties at a lower level;
* This hierarchy of constraints is exploited by combining decisions probabilistically at all lower levels and making more deterministic decisions only at the highest level.

Speech recognition by a machine is a process broken into several phases. Computationally, it is a problem in which a sound pattern has to be recognized or classified into a category that represents a meaning to a human. Every acoustic signal can be broken into smaller, more basic sub-signals. As a complex sound signal is broken into smaller sub-sounds, different levels are created: the top level holds complex sounds, which are made of simpler sounds on the level below, and each further level down holds shorter and more basic sounds. At the lowest level, where the sounds are the most fundamental, a machine checks simple, more probabilistic rules for what each sound should represent. Once these sounds are put together into more complex sounds on an upper level, a new set of more deterministic rules predicts what the new complex sound should represent. The uppermost level of deterministic rules works out the meaning of complex expressions.

To go further, neural network approaches need to be considered. There are four steps of neural network approaches:
* Digitize the speech that we want to recognize;
::for telephone speech the sampling rate is 8000 samples per second;
* Compute spectral-domain features of the speech (with a Fourier transform);
::these are computed every 10&nbsp;ms, with one 10&nbsp;ms section called a frame;
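The two steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: a synthetic 1000&nbsp;Hz sine tone stands in for recorded speech, the frames do not overlap, and a naive discrete Fourier transform is used for clarity where a real system would use a fast Fourier transform. All function names are hypothetical.

```python
import cmath
import math

SAMPLE_RATE = 8000                           # samples per second (telephone speech)
FRAME_MS = 10                                # frame length in milliseconds
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000   # 80 samples per frame


def digitize(tone_hz, duration_s):
    """Sample a pure sine tone; an assumed stand-in for real recorded speech."""
    n = int(SAMPLE_RATE * duration_s)
    return [math.sin(2 * math.pi * tone_hz * t / SAMPLE_RATE) for t in range(n)]


def split_frames(samples):
    """Cut the signal into non-overlapping 10 ms frames, dropping any remainder."""
    count = len(samples) // FRAME_LEN
    return [samples[k * FRAME_LEN:(k + 1) * FRAME_LEN] for k in range(count)]


def magnitude_spectrum(frame):
    """Naive discrete Fourier transform; returns magnitudes for bins 0..N/2."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(total))
    return spectrum


samples = digitize(tone_hz=1000, duration_s=0.05)
frames = split_frames(samples)
features = [magnitude_spectrum(frame) for frame in frames]

# Each bin spans SAMPLE_RATE / FRAME_LEN = 100 Hz; find the strongest bin of frame 0.
peak_bin = max(range(len(features[0])), key=lambda k: features[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LEN
print(len(frames), "frames; strongest component near", peak_hz, "Hz")
```

Each frame yields one feature vector, so 50&nbsp;ms of audio at this frame length gives five vectors for the later steps to work on.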

The four-step neural network approach can be explained with some further background. Sound is produced by the vibration of air (or some other medium), which humans register with their ears and machines register with receivers. A basic sound creates a wave that has two descriptions: [[amplitude]] (how strong it is) and [[frequency]] (how often it vibrates per second).
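The two descriptions of a basic sound wave can be illustrated with a synthetic tone. The sketch below is an assumption-laden example, not a method from the article: it generates a sine wave with a chosen amplitude and frequency, then estimates both from the samples, using the peak absolute value for amplitude and a count of upward zero crossings for frequency. The function names and the 440&nbsp;Hz test tone are hypothetical.

```python
import math


def sine_wave(amplitude, frequency_hz, sample_rate, duration_s):
    """Return samples of a pure tone with the given amplitude and frequency."""
    n = int(sample_rate * duration_s)
    return [
        amplitude * math.sin(2 * math.pi * frequency_hz * t / sample_rate)
        for t in range(n)
    ]


def peak_amplitude(samples):
    """Amplitude estimate: the largest absolute sample value."""
    return max(abs(s) for s in samples)


def estimate_frequency(samples, sample_rate):
    """Frequency estimate: upward zero crossings counted per second of signal."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings * sample_rate / len(samples)


tone = sine_wave(amplitude=0.5, frequency_hz=440, sample_rate=8000, duration_s=1.0)
print("amplitude is about", round(peak_amplitude(tone), 3))
print("frequency is about", estimate_frequency(tone, 8000), "Hz")
```

Zero-crossing counting only works for a clean single tone; speech contains many frequencies at once, which is why spectral analysis is used in practice.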
Accuracy can be measured with the word error rate (WER). The WER is calculated by aligning the recognized word sequence with the reference word sequence using dynamic string alignment. A difficulty in computing the word error rate is that the recognized sequence and the reference sequence can differ in length.

The formula to compute the word error rate (WER) is:

: <math>WER = {(s+d+i) \over n}</math>

where ''s'' is the number of substitutions, ''d'' is the number of deletions, ''i'' is the number of insertions, and ''n'' is the number of words in the reference.
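A minimal sketch of this computation follows, assuming a standard minimum-edit-distance (Levenshtein) alignment over whitespace-separated words. The function names and the example sentences are hypothetical, and when several alignments have the same total cost the split between substitutions, deletions, and insertions depends on the tie-breaking used here.

```python
def edit_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions) of a minimum-edit alignment."""
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    # cost[r][c] = (total_edits, s, d, i) for reference[:r] versus hypothesis[:c]
    cost = [[None] * cols for _ in range(rows)]
    cost[0][0] = (0, 0, 0, 0)
    for r in range(1, rows):
        cost[r][0] = (r, 0, r, 0)      # every reference word deleted
    for c in range(1, cols):
        cost[0][c] = (c, 0, 0, c)      # every hypothesis word inserted
    for r in range(1, rows):
        for c in range(1, cols):
            if reference[r - 1] == hypothesis[c - 1]:
                cost[r][c] = cost[r - 1][c - 1]   # match: no new edit
                continue
            total, s, d, i = cost[r - 1][c - 1]
            substitution = (total + 1, s + 1, d, i)
            total, s, d, i = cost[r - 1][c]
            deletion = (total + 1, s, d + 1, i)
            total, s, d, i = cost[r][c - 1]
            insertion = (total + 1, s, d, i + 1)
            cost[r][c] = min(substitution, deletion, insertion)
    _, s, d, i = cost[-1][-1]
    return s, d, i


def word_error_rate(reference_text, hypothesis_text):
    """WER = (s + d + i) / n, with n the number of reference words."""
    reference = reference_text.split()
    hypothesis = hypothesis_text.split()
    if not reference:
        raise ValueError("reference must contain at least one word")
    s, d, i = edit_counts(reference, hypothesis)
    return (s + d + i) / len(reference)


print(word_error_rate("the apple is red", "the apple is read"))
print(word_error_rate("the apple is red", "the apple red"))
```

Because the alignment allows deletions and insertions, the two sequences do not need to have the same length.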

The word recognition rate (WRR) is sometimes reported instead. The formula is:

: <math>WRR = 1 - WER = {(n-s-d-i) \over n} = {h-i \over n}</math>

where ''h'' is the number of correctly recognized words:

: <math>h = n -(s+d).</math>
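The relationship between the two rates can be checked numerically. The sketch below uses exact fractions to show that both forms of the WRR formula agree; the function names and the edit counts are hypothetical values chosen for illustration.

```python
from fractions import Fraction


def wer(s, d, i, n):
    """Word error rate from edit counts; n is the reference length in words."""
    if n <= 0:
        raise ValueError("reference must contain at least one word")
    return Fraction(s + d + i, n)


def wrr(s, d, i, n):
    """Word recognition rate, computed as 1 - WER."""
    return 1 - wer(s, d, i, n)


def wrr_from_hits(s, d, i, n):
    """The same rate via h = n - (s + d), the number of correctly recognized words."""
    h = n - (s + d)
    return Fraction(h - i, n)


# Hypothetical counts: 10 reference words with one error of each kind.
print(wrr(1, 1, 1, 10), wrr_from_hits(1, 1, 1, 10))

# Insertions are not bounded by n, so WER can exceed 1 and WRR can go negative.
print(wer(0, 0, 3, 2), wrr(0, 0, 3, 2))
```

The second case shows why WRR is not a percentage of correct words: a hypothesis with many inserted words is penalized even if every reference word was recognized.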