==Evaluation==
As in other scientific fields, NLG researchers need to test how well their systems, modules, and algorithms work. This is called ''evaluation''. There are three basic techniques for evaluating NLG systems:

* ''Task-based (extrinsic) evaluation'': give the generated text to a person, and assess how well it helps them perform a task (or otherwise achieves its communicative goal). For example, a system which generates summaries of medical data can be evaluated by giving these summaries to doctors and assessing whether the summaries help doctors make better decisions.<ref name=portet/>
* ''Human ratings'': give the generated text to a person, and ask them to rate its quality and usefulness.
* ''Metrics'': compare generated texts to texts written by people from the same input data, using an automatic metric such as [[BLEU]], [[METEOR]], [[ROUGE (metric)|ROUGE]] and [[LEPOR]] (a brief illustration of metric-based scoring appears at the end of this section).

The ultimate measure is how useful NLG systems are at helping people, which is what the first of these techniques assesses. However, task-based evaluations are time-consuming and expensive, and can be difficult to carry out (especially if they require subjects with specialised expertise, such as doctors). Hence, as in other areas of NLP, task-based evaluations are the exception rather than the norm.

Researchers have recently begun assessing how well human ratings and metrics correlate with (that is, predict) task-based evaluations. This work is being conducted in the context of Generation Challenges<ref>[https://sites.google.com/view/genchalrepository/home Generation Challenges]</ref> shared-task events. Initial results suggest that human ratings are much better than metrics in this regard: human ratings usually predict task-effectiveness at least to some degree (although there are exceptions), while ratings produced by metrics often do not. These results are preliminary. In any case, human ratings are the most popular evaluation technique in NLG; this is in contrast to [[Evaluation of machine translation|machine translation]], where metrics are widely used.

An AI can be graded on ''faithfulness'' to its training data or, alternatively, on ''factuality''. A response that reflects the training data but not reality is faithful but not factual. A confident but unfaithful response is a ''[[hallucination (artificial intelligence)|hallucination]]''. In Natural Language Processing, a hallucination is often defined as "generated content that is nonsensical or unfaithful to the provided source content".<ref>{{cite journal |last1=Ji |first1=Ziwei |last2=Lee |first2=Nayeon |last3=Frieske |first3=Rita |last4=Yu |first4=Tiezheng |last5=Su |first5=Dan |last6=Xu |first6=Yan |last7=Ishii |first7=Etsuko |last8=Bang |first8=Yejin |last9=Madotto |first9=Andrea |last10=Fung |first10=Pascale |title=Survey of Hallucination in Natural Language Generation |journal=ACM Computing Surveys |date=17 November 2022 |volume=55 |issue=12 |pages=3571730 |doi=10.1145/3571730 |s2cid=246652372 |doi-access=free |arxiv=2202.03629 }}</ref>
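As an illustration of the metric-based approach, the following is a minimal sketch (not drawn from the cited work) of scoring one generated sentence against a human-written reference with BLEU, assuming the <code>nltk</code> Python library is available; the example texts are hypothetical.

<syntaxhighlight lang="python">
# Minimal sketch: score a generated sentence against a human-written
# reference with BLEU, using the nltk library (assumed to be installed).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical texts; in practice the reference is written by a person
# from the same input data that was given to the NLG system.
reference = "blood pressure remained stable throughout the night".split()
generated = "blood pressure was stable overnight".split()

# sentence_bleu takes a list of tokenised references and one tokenised
# hypothesis; smoothing avoids zero scores on short sentences.
score = sentence_bleu([reference], generated,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")  # higher means more n-gram overlap with the reference
</syntaxhighlight>

Such scores are cheap to compute over large test sets, but, as noted above, they often correlate poorly with task effectiveness.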