==Evaluation==
<!-- IMPORTANT: This section needs to be tied in to the above article so it fits in. Currently, it is not clear what the relation of evaluation is to any of the above topics. The following questions need to be answered: First, in the context of automatic summarization, what is evaluation? Second, what is the significance of evaluation? That is, what is evaluation used for? -->
The most common way to evaluate the informativeness of automatic summaries is to compare them with human-made model summaries.

Evaluation can be intrinsic or extrinsic,<ref>[http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings2/sum-mani.pdf Mani, I. Summarization evaluation: an overview]</ref> and inter-textual or intra-textual.<ref>{{Cite journal |doi=10.3103/S0005105507030041 |title=A method for evaluating modern systems of automatic text summarization |journal=Automatic Documentation and Mathematical Linguistics |volume=41 |issue=3 |pages=93–103 |year=2007 |last1=Yatsko |first1=V. A. |last2=Vishnyakov |first2=T. N. |s2cid=7853204}}</ref>

=== Intrinsic versus extrinsic ===
Intrinsic evaluation assesses a summary directly, while extrinsic evaluation assesses how the summarization system affects the completion of some other task. Intrinsic evaluations have mainly assessed the coherence and informativeness of summaries. Extrinsic evaluations, on the other hand, have tested the impact of summarization on tasks such as relevance assessment and reading comprehension.

=== Inter-textual versus intra-textual ===
Intra-textual evaluation assesses the output of a single summarization system, while inter-textual evaluation focuses on contrastive analysis of the outputs of several summarization systems.

Human judgement often varies greatly as to what constitutes a "good" summary, so creating an automatic evaluation process is particularly difficult. Manual evaluation can be used, but it is both time- and labor-intensive, as it requires humans to read not only the summaries but also the source documents. Other issues concern [[coherence (linguistics)|coherence]] and coverage.

The most common way to evaluate summaries is [[ROUGE (metric)|ROUGE]] (Recall-Oriented Understudy for Gisting Evaluation), which is widely used for summarization and translation systems in [[NIST]]'s Document Understanding Conferences.[https://web.archive.org/web/20060408135021/http://haydn.isi.edu/ROUGE/] ROUGE is a recall-based measure of how well a summary covers the content of human-generated summaries, known as references. It calculates [[n-gram]] overlaps between automatically generated summaries and previously written human summaries, and it is recall-based to encourage inclusion of all important topics in the summary. Recall can be computed with respect to unigram, bigram, trigram, or 4-gram matching. For example, ROUGE-1 is the number of unigrams that appear in both the reference summary and the automatic summary, divided by the total number of unigrams in the reference summary. If there are multiple reference summaries, their scores are averaged. A high level of overlap should indicate a high degree of shared concepts between the two summaries.

ROUGE cannot determine whether the result is coherent, that is, whether the sentences flow together sensibly. Higher-order n-gram ROUGE measures help to some degree. Another unsolved problem is [[anaphora (linguistics)|anaphora resolution]].
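As a concrete illustration of the ROUGE-N recall described above, the following sketch computes unigram and bigram recall against one or more references. It is a minimal illustration, not the official ROUGE toolkit: it assumes simple lowercased whitespace tokenization and averages the per-reference scores when several references are given, as described above.

<syntaxhighlight lang="python">
from collections import Counter

def ngrams(tokens, n):
    """Return the multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall against a single reference: the fraction of the
    reference's n-grams also found in the candidate (counts are clipped,
    so a repeated n-gram is not credited more often than it occurs)."""
    cand = ngrams(candidate.lower().split(), n)  # naive whitespace tokenization (assumption)
    ref = ngrams(reference.lower().split(), n)
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

def rouge_n(candidate, references, n=1):
    """Average the per-reference recall when several references are given."""
    return sum(rouge_n_recall(candidate, r, n) for r in references) / len(references)

references = ["the cat sat on the mat", "a cat was sitting on the mat"]
candidate = "the cat sat on a mat"
print(rouge_n(candidate, references, n=1))  # ROUGE-1 (unigram recall)
print(rouge_n(candidate, references, n=2))  # ROUGE-2 (bigram recall)
</syntaxhighlight>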
For image summarization, Tschiatschek et al. developed an analogous Visual-ROUGE score, which judges the performance of image summarization algorithms.<ref>Sebastian Tschiatschek, Rishabh Iyer, Hoachen Wei and Jeff Bilmes, [http://papers.nips.cc/paper/5415-learning-mixtures-of-submodular-functions-for-image-collection-summarization.pdf Learning Mixtures of Submodular Functions for Image Collection Summarization], in Advances in Neural Information Processing Systems (NIPS), Montreal, Canada, December 2014.</ref>

===Domain-specific versus domain-independent summarization===
Domain-independent summarization techniques apply sets of general features to identify information-rich text segments. Recent research focuses on domain-specific summarization, which uses knowledge specific to the text's domain, such as medical knowledge and ontologies for summarizing medical texts.<ref>{{Cite book |last1=Sarker |first1=Abeed |last2=Molla |first2=Diego |last3=Paris |first3=Cecile |title=Artificial Intelligence in Medicine |chapter=An Approach for Query-Focused Text Summarisation for Evidence Based Medicine |date=2013 |volume=7885 |pages=295–304 |doi=10.1007/978-3-642-38326-7_41 |series=Lecture Notes in Computer Science |isbn=978-3-642-38325-0}}</ref>

===Qualitative===
The main drawback of the evaluation methods so far is that a reference summary (for some methods, more than one) is needed in order to compare automatic summaries against a model. This is a hard and expensive task: much effort is required to create corpora of texts and their corresponding summaries. Furthermore, some methods require manual annotation of the summaries (e.g., SCU annotation in the Pyramid Method). Moreover, they all perform a quantitative evaluation with regard to different similarity metrics.