==Evaluation==
As in other scientific fields, NLG researchers need to test how well their systems, modules, and algorithms work. This is called ''evaluation''. There are three basic techniques for evaluating NLG systems:

* ''Task-based (extrinsic) evaluation'': give the generated text to a person, and assess how well it helps them perform a task (or otherwise achieves its communicative goal). For example, a system which generates summaries of medical data can be evaluated by giving these summaries to doctors and assessing whether the summaries help doctors make better decisions.<ref name=portet/>
* ''Human ratings'': give the generated text to a person, and ask them to rate its quality and usefulness.
* ''Metrics'': compare generated texts to texts written by people from the same input data, using an automatic metric such as [[BLEU]], [[METEOR]], [[ROUGE (metric)|ROUGE]] and [[LEPOR]] (a brief illustration of metric-based scoring appears at the end of this section).

The ultimate measure is how useful NLG systems are at helping people, which is what the first of these techniques assesses. However, task-based evaluations are time-consuming and expensive, and can be difficult to carry out (especially if they require subjects with specialised expertise, such as doctors). Hence, as in other areas of NLP, task-based evaluations are the exception rather than the norm.

Researchers have recently begun assessing how well human ratings and metrics correlate with (that is, predict) task-based evaluations. This work is being conducted in the context of Generation Challenges<ref>[https://sites.google.com/view/genchalrepository/home Generation Challenges]</ref> shared-task events. Initial results suggest that human ratings are much better than metrics in this regard: human ratings usually predict task-effectiveness at least to some degree (although there are exceptions), while ratings produced by metrics often do not. These results are preliminary. In any case, human ratings are the most popular evaluation technique in NLG; this is in contrast to [[Evaluation of machine translation|machine translation]], where metrics are widely used.

An AI can be graded on ''faithfulness'' to its training data or, alternatively, on ''factuality''. A response that reflects the training data but not reality is faithful but not factual. A confident but unfaithful response is a ''[[hallucination (artificial intelligence)|hallucination]]''. In Natural Language Processing, a hallucination is often defined as "generated content that is nonsensical or unfaithful to the provided source content".<ref>{{cite journal |last1=Ji |first1=Ziwei |last2=Lee |first2=Nayeon |last3=Frieske |first3=Rita |last4=Yu |first4=Tiezheng |last5=Su |first5=Dan |last6=Xu |first6=Yan |last7=Ishii |first7=Etsuko |last8=Bang |first8=Yejin |last9=Madotto |first9=Andrea |last10=Fung |first10=Pascale |title=Survey of Hallucination in Natural Language Generation |journal=ACM Computing Surveys |date=17 November 2022 |volume=55 |issue=12 |pages=3571730 |doi=10.1145/3571730 |s2cid=246652372 |doi-access=free |arxiv=2202.03629 }}</ref>
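As an illustration of the metric-based approach, the following is a minimal sketch (not drawn from the cited work) of scoring one generated sentence against a human-written reference with BLEU, assuming the <code>nltk</code> Python library is available; the example texts are hypothetical.

<syntaxhighlight lang="python">
# Minimal sketch: score a generated sentence against a human-written
# reference with BLEU, using the nltk library (assumed to be installed).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical texts; in practice the reference is written by a person
# from the same input data that was given to the NLG system.
reference = "blood pressure remained stable throughout the night".split()
generated = "blood pressure was stable overnight".split()

# sentence_bleu takes a list of tokenised references and one tokenised
# hypothesis; smoothing avoids zero scores on short sentences.
score = sentence_bleu([reference], generated,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")  # higher means more n-gram overlap with the reference
</syntaxhighlight>

Such scores are cheap to compute over large test sets, but, as noted above, they often correlate poorly with task effectiveness.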