=== Early approaches and some sample heuristics ===

Seminal papers which laid the foundations for many techniques used today were published by [[Hans Peter Luhn]] in 1958<ref>{{Cite journal | author = [[Hans Peter Luhn]] | title = The Automatic Creation of Literature Abstracts | journal = [[IBM Journal]] |date=April 1958 | pages = 159–165 | url = http://www.research.ibm.com/journal/rd/022/luhn.pdf }}</ref> and [[H. P. Edmundson]] in 1969.<ref>{{Cite journal | author = [[H. P. Edmundson]] | year = 1969 | title = New Methods in Automatic Extracting | journal = [[Journal of the ACM]] | volume = 16 | issue = 2 | pages = 264–285 | doi = 10.1145/321510.321519 | s2cid = 1177942 | url = http://courses.ischool.berkeley.edu/i256/f06/papers/edmonson69.pdf }}</ref> Luhn proposed assigning more weight to sentences at the beginning of the document or of a paragraph. Edmundson stressed the importance of title words for summarization and was the first to employ stop-lists to filter out uninformative words of low semantic content (e.g. most grammatical words such as ''of'', ''the'', ''a''). He also distinguished between bonus words and stigma words, i.e. words that tend to occur together with important information (e.g. the word form ''significant'') or with unimportant information. His idea of using key words, i.e. words that occur conspicuously often in the document, is still one of the core heuristics of today's summarizers. With the large linguistic corpora available today, the [[tf–idf]] value, which originated in [[information retrieval]], can be applied to identify the key words of a text: if, for example, the word ''cat'' occurs significantly more often in the text to be summarized (TF = "term frequency") than in the corpus (IDF = "inverse document frequency"; here ''document'' refers to the corpus), then ''cat'' is likely to be an important word of the text; the text may in fact be a text about cats.
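The key-word heuristic can be illustrated with a minimal sketch (not taken from the cited papers; the function names, the tiny stop-list, and the smoothing choice are illustrative assumptions): words are weighted by a tf–idf-style score against a background corpus, and sentences are ranked by the summed weights of their words, in the spirit of Luhn's frequency heuristic and Edmundson's stop-lists.

<syntaxhighlight lang="python">
import math
import re
from collections import Counter

# Tiny illustrative stop-list in the spirit of Edmundson's stop-lists.
STOP_WORDS = {"of", "the", "a", "an", "and", "in", "is", "to"}


def tokenize(text):
    """Lowercased word tokens with stop words filtered out."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def tf_idf_weights(document, corpus):
    """Weight each word of `document` by its frequency in the document
    times its rarity in the background corpus (treated as one large document)."""
    doc_counts = Counter(tokenize(document))
    corpus_counts = Counter(tokenize(corpus))
    corpus_size = sum(corpus_counts.values()) or 1
    weights = {}
    for word, tf in doc_counts.items():
        # Add-one smoothing (an assumption) avoids division by zero for
        # words that never occur in the corpus.
        idf = math.log(corpus_size / (1 + corpus_counts[word]))
        weights[word] = tf * idf
    return weights


def summarize(document, corpus, n_sentences=2):
    """Return the n highest-scoring sentences, each scored by the summed
    tf-idf weights of its words (a Luhn-style key-word heuristic)."""
    weights = tf_idf_weights(document, corpus)
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    scored = sorted(
        sentences,
        key=lambda s: sum(weights.get(w, 0.0) for w in tokenize(s)),
        reverse=True,
    )
    return scored[:n_sentences]
</syntaxhighlight>

Under these assumptions, calling <code>summarize</code> on a text about cats with a general-purpose background corpus would tend to select the sentences containing ''cat'', since that word is frequent in the text but comparatively rare in the corpus.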