=====Diversity=====
Multi-document extractive summarization faces a problem of redundancy. Ideally, we want to extract sentences that are both "central" (i.e., contain the main ideas) and "diverse" (i.e., differ from one another). For example, in a set of news articles about the same event, each article is likely to contain many similar sentences. To address this issue, LexRank applies a heuristic post-processing step that adds sentences in rank order but discards sentences that are too similar to ones already in the summary. This method is called Cross-Sentence Information Subsumption (CSIS).

Graph-based methods such as LexRank and TextRank work on the idea that sentences "recommend" other similar sentences to the reader. Thus, if one sentence is very similar to many others, it is likely to be a sentence of great importance. Its importance also stems from the importance of the sentences "recommending" it. To be ranked highly and placed in a summary, a sentence must therefore be similar to many sentences that are in turn similar to many other sentences. This makes intuitive sense and allows the algorithms to be applied to an arbitrary new text. The methods are domain-independent and easily portable. The features that indicate important sentences in the news domain might vary considerably from those in the biomedical domain, but the unsupervised "recommendation"-based approach applies to any domain.
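The rank-order post-processing step described above can be sketched as a simple similarity filter. This is a minimal illustration, not the LexRank implementation: the bag-of-words cosine similarity, the function name <code>filter_redundant</code>, and the parameters <code>threshold</code> and <code>max_sentences</code> are assumptions chosen for the example, and the input is assumed to be already sorted by rank.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_redundant(ranked_sentences, max_sentences=3, threshold=0.5):
    """Walk sentences in rank order, skipping near-duplicates of chosen ones.

    `threshold` and `max_sentences` are illustrative values, not taken
    from any published system.
    """
    summary, vectors = [], []
    for sentence in ranked_sentences:
        vec = Counter(sentence.lower().split())
        # Keep the sentence only if it is dissimilar to everything kept so far.
        if all(cosine(vec, chosen) < threshold for chosen in vectors):
            summary.append(sentence)
            vectors.append(vec)
        if len(summary) == max_sentences:
            break
    return summary


# Hypothetical ranked input: best-ranked sentence first.
ranked = [
    "The storm hit the coast on Monday",
    "On Monday the storm hit the coast",
    "Thousands of homes lost power",
    "Power was lost in thousands of homes",
    "Officials opened emergency shelters",
]
selected = filter_redundant(ranked)
```

In this toy input the second and fourth sentences are reorderings or close paraphrases of higher-ranked ones, so the filter skips them and keeps the first, third, and fifth.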

A related method is Maximal Marginal Relevance (MMR),<ref>Carbonell, Jaime, and Jade Goldstein. "[https://www.cs.cmu.edu/afs/.cs.cmu.edu/Web/People/jgc/publication/MMR_DiversityBased_Reranking_SIGIR_1998.pdf The use of MMR, diversity-based reranking for reordering documents and producing summaries]." Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 1998.</ref> which greedily re-ranks candidates by trading off their relevance against their similarity to items already selected. A more unified approach is a general-purpose graph-based ranking algorithm, like Page/Lex/TextRank, that handles both "centrality" and "diversity" in a single mathematical framework based on [[absorbing Markov chain]] random walks (random walks in which certain states end the walk). The algorithm is called GRASSHOPPER.<ref>Zhu, Xiaojin, et al. "[http://www.aclweb.org/anthology/N07-1013 Improving Diversity in Ranking using Absorbing Random Walks]." HLT-NAACL. 2007.</ref> In addition to explicitly promoting diversity during the ranking process, GRASSHOPPER incorporates a prior ranking (based on sentence position in the case of summarization).
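As a rough illustration of the MMR criterion, each remaining candidate can be scored as λ times its relevance to a query minus (1 − λ) times its maximum similarity to any sentence already selected, with the highest-scoring candidate chosen at each step. The sketch below assumes bag-of-words cosine similarity for both terms. The names <code>mmr_select</code>, <code>k</code>, and <code>lam</code> are hypothetical and chosen only for this example.

```python
import math
from collections import Counter


def bow(text: str) -> Counter:
    """Lowercased bag-of-words vector for a piece of text."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(sentences, query, k=2, lam=0.5):
    """Greedily pick k sentences, balancing relevance against redundancy.

    lam=1.0 ranks by relevance only; smaller values penalize similarity
    to sentences that were already selected.
    """
    vectors = [bow(s) for s in sentences]
    q = bow(query)
    selected, candidates = [], list(range(len(sentences)))

    def score(i):
        relevance = cosine(vectors[i], q)
        redundancy = max(
            (cosine(vectors[i], vectors[j]) for j in selected), default=0.0
        )
        return lam * relevance - (1 - lam) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]


# Hypothetical example data.
sentences = [
    "the storm caused damage along the coast",
    "storm damage along the coast was severe",
    "coast residents moved to shelters",
]
query = "storm damage coast"
balanced = mmr_select(sentences, query, k=2, lam=0.5)
relevance_only = mmr_select(sentences, query, k=2, lam=1.0)
```

With <code>lam=1.0</code> the two near-duplicate sentences about storm damage are both selected, whereas <code>lam=0.5</code> replaces the second of them with the less relevant but more novel sentence about shelters.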

State-of-the-art results for multi-document summarization have been obtained using mixtures of submodular functions, evaluated on the DUC-04 to DUC-07 document summarization corpora.<ref>Hui Lin, Jeff Bilmes. "[https://arxiv.org/abs/1210.4871 Learning mixtures of submodular shells with application to document summarization]"</ref> Similar results were achieved on DUC-04 using determinantal point processes (which are a special case of submodular functions).<ref>Alex Kulesza and Ben Taskar, [http://www.nowpublishers.com/article/DownloadSummary/MAL-044 Determinantal point processes for machine learning]. Foundations and Trends in Machine Learning, December 2012.</ref>
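To give a sense of why submodular objectives suit this task, the following sketch greedily maximizes a single weighted word-coverage function under a sentence budget. It is a deliberate simplification and not the learned mixture from the cited work: the objective, the weighting by sentence frequency, and the name <code>greedy_summary</code> are assumptions made for illustration. Because covering an already-covered word adds nothing, the marginal gain of a redundant sentence shrinks as the summary grows, which is the diminishing-returns property that defines submodularity.

```python
from collections import Counter


def greedy_summary(sentences, budget=2):
    """Greedy maximization of a weighted word-coverage objective.

    Objective (illustrative): f(S) = sum of weights of the distinct words
    covered by the sentences in S, where a word's weight is the number of
    input sentences containing it. This f is monotone and submodular.
    """
    word_sets = [set(s.lower().split()) for s in sentences]
    weights = Counter(w for ws in word_sets for w in ws)

    chosen, covered = [], set()
    while len(chosen) < budget:
        best, best_gain = None, 0
        for i, ws in enumerate(word_sets):
            if i in chosen:
                continue
            # Marginal gain: weight of words this sentence newly covers.
            gain = sum(weights[w] for w in ws - covered)
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:  # no remaining sentence adds new coverage
            break
        chosen.append(best)
        covered |= word_sets[best]
    return [sentences[i] for i in chosen]


# Hypothetical example data.
docs = [
    "the storm hit the coast on monday",
    "on monday the storm hit the coast",
    "thousands of homes lost power",
    "officials opened shelters",
]
summary = greedy_summary(docs, budget=2)
```

After the first sentence is chosen, the second sentence contributes no new words, so its marginal gain is zero and the greedy step moves on to the sentence about power outages.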

Another method for multilingual multi-document summarization avoids redundancy by generating ideograms that represent the meaning of each sentence in each document, then evaluating similarity by comparing ideogram shape and position. It does not use word frequency, training, or preprocessing. It takes two user-supplied parameters: equivalence (the criterion for treating two sentences as equivalent) and relevance (the desired length of the summary).