
==Issues==
[[File:Stir Fried Wikipedia.jpg|thumb|right|250px|Machine translation can produce incomprehensible phrases, such as "{{lang|zh|鸡枞}}" (''[[Macrolepiota albuminosa]]'') being rendered as "wikipedia".]]
[[File:Machine translation in Bali.jpg|thumb|right|250px|Broken Chinese "{{lang|zh|沒有進入}}" produced by machine translation in [[Bali, Indonesia]]. The phrase reads like "there does not exist an entry" or "have not entered yet".]]
Studies using human evaluation (e.g. by professional literary translators or human readers) have [[problem-solving|systematically identified various issues]] with the latest advanced MT outputs.<ref name="arxiv221014250"/> Common issues include the translation of ambiguous passages whose correct translation requires common-sense-like semantic language processing or context.<ref name="arxiv221014250"/> There can also be errors in the source texts and a lack of high-quality training data, and the severity or frequency of several types of problems may not be reduced by the techniques used to date, so some level of active human participation is still required.

===Disambiguation===
{{Main|Word-sense disambiguation|Syntactic disambiguation}}
Word-sense disambiguation concerns finding a suitable translation when a word can have more than one meaning. The problem was first raised in the 1950s by [[Yehoshua Bar-Hillel]].<ref>[http://ourworld.compuserve.com/homepages/WJHutchins/Miles-6.htm Milestones in machine translation – No.6: Bar-Hillel and the nonfeasibility of FAHQT] {{webarchive|url=https://web.archive.org/web/20070312062051/http://ourworld.compuserve.com/homepages/WJHutchins/Miles-6.htm |date=12 March 2007 }} by John Hutchins</ref> He pointed out that without a "universal encyclopedia", a machine would never be able to distinguish between the two meanings of a word.<ref>Bar-Hillel (1960), "Automatic Translation of Languages". Available online at http://www.mt-archive.info/Bar-Hillel-1960.pdf {{Webarchive|url=https://web.archive.org/web/20110928112348/http://www.mt-archive.info/Bar-Hillel-1960.pdf |date=28 September 2011 }}</ref> Today there are numerous approaches designed to overcome this problem. They can be approximately divided into "shallow" approaches and "deep" approaches.

Shallow approaches assume no knowledge of the text. They simply apply statistical methods to the words surrounding the ambiguous word. Deep approaches presume a comprehensive knowledge of the word. So far, shallow approaches have been more successful.<ref>{{Cite book|title=Hybrid approaches to machine translation|others=Costa-jussà, Marta R., Rapp, Reinhard, Lambert, Patrik, Eberle, Kurt, Banchs, Rafael E., Babych, Bogdan|date=21 July 2016|isbn=9783319213101|location=Switzerland|oclc=953581497}}</ref>
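
As a rough illustration of the shallow idea described above, the following Python sketch picks a sense for the ambiguous English word "bank" by counting which words surround it. The training sentences, the sense labels, the stop-word list and the function names are all hypothetical and chosen only for illustration; practical systems use far larger corpora and more refined statistics.

```python
from collections import Counter

# Hypothetical toy data: (sense label, example context) pairs for "bank".
TRAINING = [
    ("finance", "she deposited money at the bank and took a loan"),
    ("finance", "the bank raised its interest rate on the loan"),
    ("river", "they fished from the bank of the river"),
    ("river", "the river overflowed its muddy bank after the rain"),
]

# Words ignored when scoring: common function words and the target word itself.
IGNORED = {"the", "a", "an", "of", "at", "and", "its", "on", "for", "from", "bank"}


def train(examples):
    """Count how often each context word co-occurs with each sense."""
    counts = {}
    for sense, text in examples:
        words = [w for w in text.lower().split() if w not in IGNORED]
        counts.setdefault(sense, Counter()).update(words)
    return counts


def disambiguate(sentence, counts):
    """Return the sense whose training contexts overlap most with the sentence."""
    words = [w for w in sentence.lower().split() if w not in IGNORED]
    scores = {sense: sum(c[w] for w in words) for sense, c in counts.items()}
    return max(scores, key=scores.get)


counts = train(TRAINING)
print(disambiguate("he asked the bank for a loan", counts))           # finance
print(disambiguate("we walked along the bank of the river", counts))  # river
```

The sketch uses no knowledge of what a bank is; it relies only on co-occurrence counts, which is why it fails whenever the surrounding words were not seen in training.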

[[Claude Piron]], a long-time translator for the United Nations and the [[World Health Organization]], wrote that machine translation, at its best, automates the easier part of a translator's job; the harder and more time-consuming part usually involves doing extensive research to resolve [[ambiguity|ambiguities]] in the [[source text]], which the [[grammatical]] and [[Lexical (semiotics)|lexical]] exigencies of the [[Translation|target language]] require to be resolved:

{{Blockquote|Why does a translator need a whole workday to translate five pages, and not an hour or two? ..... About 90% of an average text corresponds to these simple conditions.  But unfortunately, there's the other 10%.  It's that part that requires six [more] hours of work.  There are ambiguities one has to resolve.  For instance, the author of the source text, an Australian physician, cited the example of an epidemic which was declared during World War II in a "Japanese prisoners of war camp".  Was he talking about an American camp with Japanese prisoners or a Japanese camp with American prisoners?  The English has two senses.  It's necessary therefore to do research, maybe to the extent of a phone call to Australia.<ref name="piron">[[Claude Piron]], ''Le défi des langues'' (The Language Challenge), Paris, L'Harmattan, 1994. <!-- GFDL translation by Jim Henry --></ref>
}}

The ideal deep approach would require the translation software to do all the research necessary for this kind of disambiguation on its own; but this would require a higher degree of [[AI]] than has yet been attained.  A shallow approach which simply guessed at the sense of the ambiguous English phrase that Piron mentions (based, perhaps, on which kind of prisoner-of-war camp is more often mentioned in a given corpus) would guess wrong fairly often.  A shallow approach that involves "ask the user about each ambiguity" would, by Piron's estimate (an hour or two of straightforward work against six more hours spent on ambiguities), only automate about 25% of a professional translator's job, leaving the harder 75% still to be done by a human.

===Non-standard speech===
One of the major pitfalls of MT is its inability to translate non-standard language with the same accuracy as standard language. Heuristic or statistics-based MT is trained on input from various sources in the standard form of a language. Rule-based translation, by nature, does not include common non-standard usages. This causes errors when translating from a vernacular source or into colloquial language. Limitations on translation from casual speech present issues for the use of machine translation on mobile devices.

===Named entities===
{{main|Named entity}}
In [[information extraction]], named entities, in a narrow sense, refer to concrete or abstract entities in the real world such as people, organizations, companies, and places that have a proper name: George Washington, Chicago, Microsoft.  The term also covers expressions of time, space and quantity such as 1 July 2011 or $500.

In the sentence "Smith is the president of Fabrionix" both ''Smith'' and ''Fabrionix'' are named entities, and can be further qualified via first name or other information; "president" is not, since Smith could have earlier held another position at Fabrionix, e.g. Vice President.
The term [[rigid designator]] defines these usages for analysis in statistical machine translation.

Named entities must first be identified in the text; otherwise, they may be erroneously translated as common nouns, which would most likely not affect the [[Bilingual evaluation understudy|BLEU]] rating of the translation but would reduce the text's human readability.<ref>{{Cite conference |last1=Babych |first1=Bogdan |last2=Hartley |first2=Anthony |date=2003 |title=Improving Machine Translation Quality with Automatic Named Entity Recognition |url=http://www.cl.cam.ac.uk/~ar283/eacl03/workshops03/W03-w1_eacl03babych.local.pdf |conference=Paper presented at the 7th International EAMT Workshop on MT and Other Language Technology Tools... |archive-url=https://web.archive.org/web/20060514031411/http://www.cl.cam.ac.uk/~ar283/eacl03/workshops03/W03-w1_eacl03babych.local.pdf |archive-date=14 May 2006 |access-date=4 November 2013 |url-status=dead}}</ref> They may also be omitted from the output translation, which would likewise have implications for the text's readability and message.

[[Transliteration]] involves finding the letters in the target language that most closely correspond to the name in the source language.  This, however, has been cited as sometimes worsening the quality of translation.<ref>Hermjakob, U., Knight, K., & Daumé III, H. (2008).  [http://www.aclweb.org/old_anthology/P/P08/P08-1.pdf#page=433 Name Translation in Statistical Machine Translation: Learning When to Transliterate] {{Webarchive|url=https://web.archive.org/web/20180104073326/http://www.aclweb.org/old_anthology/P/P08/P08-1.pdf#page=433 |date=4 January 2018 }}.  Association for Computational Linguistics.  389–397.</ref> For "Southern California" the first word should be translated directly, while the second word should be transliterated.  Machines often transliterate both because they treat them as one entity.  Phrases like these are hard for machine translators to process, even those with a transliteration component.

A second approach is the use of a "do-not-translate" list, which has the same end goal (transliteration as opposed to translation)<ref name="singla">{{Citation |last1=Neeraj Agrawal |title=Using Named Entity Recognition to improve Machine Translation |url=http://nlp.stanford.edu/courses/cs224n/2010/reports/singla-nirajuec.pdf |archive-url=https://web.archive.org/web/20130521075940/http://nlp.stanford.edu/courses/cs224n/2010/reports/singla-nirajuec.pdf |access-date=4 November 2013 |archive-date=21 May 2013 |last2=Ankush Singla |mode=cs1 |url-status=live}}</ref> but still relies on correct identification of named entities.
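
One way a "do-not-translate" list might be applied is to mask the listed names with placeholders before translation and restore them afterwards. The Python sketch below is only an illustration under stated assumptions: the list contents, the placeholder format and the word-for-word <code>toy_translate</code> function (a stand-in for a real MT engine) are all hypothetical, and source and target are assumed to share a script, so the protected names are copied through unchanged.

```python
import re

# Hypothetical do-not-translate list; in practice it might come from a
# named-entity recognizer or a curated glossary.
DO_NOT_TRANSLATE = ["Fabrionix", "Smith"]


def protect(text, terms):
    """Replace each listed term with an opaque placeholder.

    Returns the masked text and a placeholder-to-term mapping.
    """
    mapping = {}
    for i, term in enumerate(terms):
        placeholder = f"__DNT{i}__"
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text):
            text = re.sub(pattern, placeholder, text)
            mapping[placeholder] = term
    return text, mapping


def restore(text, mapping):
    """Put the original terms back in place of their placeholders."""
    for placeholder, term in mapping.items():
        text = text.replace(placeholder, term)
    return text


def toy_translate(text):
    """Stand-in for an MT engine: word-for-word English-to-Spanish lookup."""
    lexicon = {"is": "es", "the": "el", "president": "presidente", "of": "de"}
    return " ".join(lexicon.get(word, word) for word in text.split())


masked, mapping = protect("Smith is the president of Fabrionix", DO_NOT_TRANSLATE)
translated = restore(toy_translate(masked), mapping)
print(translated)  # Smith es el presidente de Fabrionix
```

The masking step only helps if the names are on the list in the first place, which is the dependence on named-entity identification noted above.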

A third approach is a class-based model. Named entities are replaced with a token representing their "class"; "Ted" and "Erica" would both be replaced with a "person" class token. The statistical distribution and use of person names in general can then be analyzed, instead of the distributions of "Ted" and "Erica" individually, so that the probability of a given name in a specific language does not affect the assigned probability of a translation. A study at Stanford on improving this area of translation gives the example that different probabilities will be assigned to "David is going for a walk" and "Ankit is going for a walk" with English as the target language, because the two names occur a different number of times in the training data. A discouraging outcome of the same study (and of other attempts to improve named-entity translation) is that including methods for named-entity translation often lowers the [[Bilingual evaluation understudy|BLEU]] score of the translation.<ref name="singla" />
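
The effect of class tokens on name statistics can be sketched with simple bigram counts. The following Python example is a minimal illustration and not the method of the cited study: the name list, the four-sentence corpus and the <code>@person@</code> token are hypothetical.

```python
from collections import Counter

# Hypothetical name list; a real system would use a named-entity recognizer.
PERSON_NAMES = {"David", "Ankit", "Ted", "Erica"}
PERSON_TOKEN = "@person@"

# Hypothetical training corpus in which "David" is more frequent than "Ankit".
CORPUS = [
    "David is going for a walk",
    "David is reading",
    "David is late",
    "Ankit is going for a walk",
]


def to_classes(sentence):
    """Replace every known person name with the shared class token."""
    return [PERSON_TOKEN if w in PERSON_NAMES else w for w in sentence.split()]


def bigram_counts(corpus, use_classes):
    """Count adjacent word pairs, optionally after class replacement."""
    counts = Counter()
    for sentence in corpus:
        tokens = to_classes(sentence) if use_classes else sentence.split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


raw = bigram_counts(CORPUS, use_classes=False)
by_class = bigram_counts(CORPUS, use_classes=True)

print(raw[("David", "is")], raw[("Ankit", "is")])  # 3 1
print(by_class[(PERSON_TOKEN, "is")])              # 4
```

Without class tokens the two sentences receive different counts purely because of name frequency; with them, both map to the same token sequence and therefore share the same statistics.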