{{Short description|Text summarization technique}}
'''Sentence extraction''' is a technique used for [[automatic summarization]] of a text. In this shallow approach, [[heuristic (computer science)|statistical heuristics]] are used to identify the most salient sentences of a text. Sentence extraction is a low-cost approach compared to more knowledge-intensive, deeper approaches, which require additional knowledge bases such as [[ontology (computer science)|ontologies]] or [[linguistics|linguistic knowledge]]. In short, sentence extraction works as a filter that allows only meaningful sentences to pass.

The major downside of applying sentence-extraction techniques to the task of summarization is the loss of [[coherence (linguistics)|coherence]] in the resulting summary. Nevertheless, sentence-extraction summaries can give valuable clues to the main points of a document and are frequently sufficiently intelligible to human readers.

== Procedure ==
Usually, a combination of heuristics is used to determine the most important sentences within the document. Each heuristic assigns a (positive or negative) score to each sentence, and the individual heuristics are weighted according to their importance. After all heuristics have been applied, the highest-scoring sentences are included in the summary.

=== Early approaches and some sample heuristics ===
Seminal papers that laid the foundations for many techniques used today were published by [[Hans Peter Luhn]] in 1958<ref>{{Cite journal | author = [[Hans Peter Luhn]] | title = The Automatic Creation of Literature Abstracts | journal = [[IBM Journal]] |date=April 1958 | pages = 159–165 | url = http://www.research.ibm.com/journal/rd/022/luhn.pdf }}</ref> and [[H. P. Edmundson]] in 1969.<ref>{{Cite journal | author = [[H. P. Edmundson]] | year = 1969 | title = New Methods in Automatic Extracting | journal = [[Journal of the ACM]] | volume = 16 | issue = 2 | pages = 264–285 | doi = 10.1145/321510.321519 | s2cid = 1177942 | url = http://courses.ischool.berkeley.edu/i256/f06/papers/edmonson69.pdf }}</ref>

Luhn proposed to assign more weight to sentences at the beginning of the document or a paragraph. Edmundson stressed the importance of title-words for summarization and was the first to employ stop-lists in order to filter out uninformative words of low semantic content (e.g. most grammatical words such as ''of'', ''the'', ''a''). He also distinguished between bonus words, which tend to occur together with important information (e.g. the word form ''significant''), and stigma words, which tend to occur together with unimportant information. His idea of using key-words, i.e. words that occur with significant frequency in the document, is still one of the core heuristics of today's summarizers.

With large linguistic corpora available today, the [[tf–idf]] value, which originated in [[information retrieval]], can be successfully applied to identify the key words of a text. If, for example, the word ''cat'' occurs significantly more often in the text to be summarized (TF = "term frequency") than in the corpus (IDF = "inverse document frequency"; here ''document'' refers to the corpus), then ''cat'' is likely to be an important word of the text; the text may in fact be a text about cats.

== See also ==
* [[Sentence boundary disambiguation]]
* [[Text segmentation]]

== References ==
{{Reflist}}

{{Natural Language Processing}}

[[Category:Computational linguistics]]
[[Category:Natural language processing]]
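
== Illustrative sketch ==
The procedure described in this article (score each sentence with several weighted heuristics, then keep the highest-scoring sentences) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of either cited paper: the stop list, the bonus and stigma word lists, the weights, the smoothed idf formula, and the function names <code>tokenize</code>, <code>tf_idf</code> and <code>summarize</code> are all hypothetical choices made for the example.

```python
import math
import re
from collections import Counter

# Hypothetical word lists; real systems use much larger ones.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}
BONUS_WORDS = {"significant", "important"}
STIGMA_WORDS = {"perhaps", "incidentally"}

# Hypothetical weights, one per heuristic (assumed values, not from the papers).
WEIGHTS = {"position": 1.0, "title": 2.0, "keyword": 1.5, "cue": 1.0}


def tokenize(text):
    """Lowercase the text and return its words with stop words removed."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def tf_idf(document, corpus):
    """Map each content word of `document` to a tf-idf weight.

    `corpus` is a list of background texts. The smoothed idf
    log((1 + N) / (1 + df)) is one common variant, chosen here as an assumption.
    """
    tf = Counter(tokenize(document))
    corpus_sets = [set(tokenize(text)) for text in corpus]
    n = len(corpus_sets)
    weights = {}
    for word, count in tf.items():
        df = sum(1 for words in corpus_sets if word in words)
        weights[word] = count * math.log((1 + n) / (1 + df))
    return weights


def summarize(title, document, corpus, n_sentences=2):
    """Return the `n_sentences` highest-scoring sentences in document order."""
    # Naive sentence splitting on end punctuation followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", document)
    sentences = [s.strip() for s in parts if s.strip()]
    keyword_weights = tf_idf(document, corpus)
    title_words = set(tokenize(title))

    scored = []
    for index, sentence in enumerate(sentences):
        words = tokenize(sentence)
        bonus = sum(1 for w in words if w in BONUS_WORDS)
        stigma = sum(1 for w in words if w in STIGMA_WORDS)
        features = {
            # Position heuristic: favour the first sentence.
            "position": 1.0 if index == 0 else 0.0,
            # Title heuristic: count words shared with the title.
            "title": sum(1 for w in words if w in title_words),
            # Key-word heuristic: mean tf-idf weight of the sentence's words.
            "keyword": sum(keyword_weights.get(w, 0.0) for w in words)
            / max(len(words), 1),
            # Cue heuristic: bonus words add to the score, stigma words subtract.
            "cue": bonus - stigma,
        }
        score = sum(WEIGHTS[name] * value for name, value in features.items())
        scored.append((score, index))

    # Pick the best-scoring sentences, then restore their original order.
    best = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:n_sentences]
    return [sentences[index] for _, index in sorted(best, key=lambda pair: pair[1])]


if __name__ == "__main__":
    text = (
        "Cats are popular pets. Perhaps the weather is nice. "
        "A significant number of cats sleep most of the day."
    )
    background = [
        "The weather is nice today.",
        "Dogs are popular pets.",
        "Most of the day was sunny.",
    ]
    for sentence in summarize("Cats at home", text, background):
        print(sentence)
```

Returning the selected sentences in their original order is a design choice that limits, but does not remove, the loss of coherence noted above. In practice the weights would be tuned on sample documents and their summaries rather than fixed by hand.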