=== Early approaches and some sample heuristics ===

Seminal papers which laid the foundations for many techniques used today were published by [[Hans Peter Luhn]] in 1958<ref>{{Cite journal | author = [[Hans Peter Luhn]] | title = The Automatic Creation of Literature Abstracts | journal = [[IBM Journal]] |date=April 1958 | pages = 159–165 | url = http://www.research.ibm.com/journal/rd/022/luhn.pdf }}</ref> and [[H. P. Edmundson]] in 1969.<ref>{{Cite journal | author = [[H. P. Edmundson]] | year = 1969 | title = New Methods in Automatic Extracting | journal = [[Journal of the ACM]] | volume = 16 | issue = 2 | pages = 264–285 | doi = 10.1145/321510.321519 | s2cid = 1177942 | url = http://courses.ischool.berkeley.edu/i256/f06/papers/edmonson69.pdf }}</ref> Luhn proposed assigning more weight to sentences at the beginning of the document or of a paragraph. Edmundson stressed the importance of title words for summarization and was the first to employ stop-lists to filter out uninformative words of low semantic content (e.g. most grammatical words such as ''of'', ''the'', ''a''). He also distinguished between bonus words and stigma words, i.e. words that tend to occur together with important information (e.g. the word form ''significant'') or with unimportant information. His idea of using key words, i.e. words that occur conspicuously often in the document, is still one of the core heuristics of today's summarizers. With the large linguistic corpora available today, the [[tf–idf]] value, which originated in [[information retrieval]], can be applied to identify the key words of a text: if, for example, the word ''cat'' occurs significantly more often in the text to be summarized (TF = "term frequency") than in the corpus (IDF = "inverse document frequency"; here ''document'' refers to the corpus), then ''cat'' is likely to be an important word of the text; the text may in fact be a text about cats.
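The key-word heuristic can be illustrated with a minimal sketch (not taken from the cited papers; the function names, the tiny stop-list, and the smoothing choice are illustrative assumptions): words are weighted by a tf–idf-style score against a background corpus, and sentences are ranked by the summed weights of their words, in the spirit of Luhn's frequency heuristic and Edmundson's stop-lists.

<syntaxhighlight lang="python">
import math
import re
from collections import Counter

# Tiny illustrative stop-list in the spirit of Edmundson's stop-lists.
STOP_WORDS = {"of", "the", "a", "an", "and", "in", "is", "to"}


def tokenize(text):
    """Lowercased word tokens with stop words filtered out."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def tf_idf_weights(document, corpus):
    """Weight each word of `document` by its frequency in the document
    times its rarity in the background corpus (treated as one large document)."""
    doc_counts = Counter(tokenize(document))
    corpus_counts = Counter(tokenize(corpus))
    corpus_size = sum(corpus_counts.values()) or 1
    weights = {}
    for word, tf in doc_counts.items():
        # Add-one smoothing (an assumption) avoids division by zero for
        # words that never occur in the corpus.
        idf = math.log(corpus_size / (1 + corpus_counts[word]))
        weights[word] = tf * idf
    return weights


def summarize(document, corpus, n_sentences=2):
    """Return the n highest-scoring sentences, each scored by the summed
    tf-idf weights of its words (a Luhn-style key-word heuristic)."""
    weights = tf_idf_weights(document, corpus)
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    scored = sorted(
        sentences,
        key=lambda s: sum(weights.get(w, 0.0) for w in tokenize(s)),
        reverse=True,
    )
    return scored[:n_sentences]
</syntaxhighlight>

Under these assumptions, calling <code>summarize</code> on a text about cats with a general-purpose background corpus would tend to select the sentences containing ''cat'', since that word is frequent in the text but comparatively rare in the corpus.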