===Keyphrase extraction===
The task is the following. You are given a piece of text, such as a journal article, and you must produce a list of keywords or keyphrases that capture the primary topics discussed in the text.<ref>{{Cite book |doi = 10.1007/978-3-319-66939-7_19|chapter = SemCluster: Unsupervised Automatic Keyphrase Extraction Using Affinity Propagation|title = Advances in Computational Intelligence Systems|volume = 650|pages = 222–235|series = Advances in Intelligent Systems and Computing|year = 2018|last1 = Alrehamy|first1 = Hassan H|last2 = Walker|first2 = Coral|isbn = 978-3-319-66938-0}}</ref> In the case of [[research article]]s, many authors provide manually assigned keywords, but most text lacks pre-existing keyphrases. For example, news articles rarely have keyphrases attached, but it would be useful to be able to assign them automatically for a number of applications discussed below. Consider the example text from a news article:

:"The Army Corps of Engineers, rushing to meet President Bush's promise to protect New Orleans by the start of the 2006 hurricane season, installed defective flood-control pumps last year despite warnings from its own expert that the equipment would fail during a storm, according to documents obtained by The Associated Press".

A keyphrase extractor might select "Army Corps of Engineers", "President Bush", "New Orleans", and "defective flood-control pumps" as keyphrases. These are pulled directly from the text. In contrast, an abstractive keyphrase system would somehow internalize the content and generate keyphrases that do not appear in the text, but more closely resemble what a human might produce, such as "political negligence" or "inadequate protection from floods". Abstraction requires a deep [[natural-language understanding|understanding of the text]], which makes it difficult for a computer system.

Keyphrases have many applications. They can enable document browsing by providing a short summary, improve [[information retrieval]] (if documents have keyphrases assigned, a user could search by keyphrase to produce more reliable hits than a [[full-text search]]), and be employed in generating index entries for a large text corpus. Depending on the literature and the definition of key terms, words, or phrases, [[keyword extraction]] is a closely related topic.

====Supervised learning approaches====
Beginning with the work of Turney,<ref>{{Cite journal |arxiv = cs/0212020|last1 = Turney|first1 = Peter D|title = Learning Algorithms for Keyphrase Extraction|journal = Information Retrieval|volume = 2|issue = 4|pages = 303–336|year = 2002|doi = 10.1023/A:1009976227802|bibcode = 2002cs.......12020T|s2cid = 7007323}}</ref> many researchers have approached keyphrase extraction as a [[supervised machine learning]] problem. Given a document, we construct an example for each [[unigram]], [[bigram]], and trigram found in the text (though other text units are also possible, as discussed below). We then compute various features describing each example (e.g., does the phrase begin with an upper-case letter?). We assume there are known keyphrases available for a set of training documents. Using the known keyphrases, we can assign positive or negative labels to the examples. Then we learn a classifier that can discriminate between positive and negative examples as a function of the features. Some classifiers make a [[binary classification]] for a test example, while others assign a probability of being a keyphrase.
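
As a rough illustration of this pipeline, the following Python sketch generates unigram, bigram, and trigram candidates, computes a few simple features, and labels each candidate against a set of known keyphrases. It is a minimal, hypothetical example of the general scheme described above, not Turney's or Hulth's actual feature set; the stopword list and feature names are placeholders.

<syntaxhighlight lang="python">
import re

# Illustrative stopword list; a real system would use a much larger one.
STOPWORDS = {"the", "of", "to", "a", "an", "in", "by", "its", "from", "that", "and"}

def candidate_examples(text, max_len=3):
    """Generate unigram, bigram and trigram candidates that contain no
    stopwords and do not cross punctuation boundaries."""
    candidates = set()
    for clause in re.split(r'[.,;:!?"()]', text):   # never span punctuation
        tokens = clause.split()
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                phrase = tokens[i:i + n]
                if not any(t.lower() in STOPWORDS for t in phrase):
                    candidates.add(" ".join(phrase))
    return candidates

def features(phrase, text):
    """A few simple features of the kind a supervised extractor might use."""
    return {
        "term_frequency": text.lower().count(phrase.lower()),
        "num_tokens": len(phrase.split()),
        "relative_first_position": text.lower().find(phrase.lower()) / len(text),
        "starts_with_capital": phrase[0].isupper(),
    }

def label(phrase, known_keyphrases):
    """Positive example if the candidate matches a known keyphrase."""
    return phrase.lower() in {k.lower() for k in known_keyphrases}
</syntaxhighlight>

The labeled feature vectors produced this way could then be fed to essentially any off-the-shelf classifier.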
For instance, in the above text, we might learn a rule that says phrases with initial capital letters are likely to be keyphrases. After training a learner, we can select keyphrases for test documents in the following manner. We apply the same example-generation strategy to the test documents, then run each example through the learner. We can determine the keyphrases by looking at binary classification decisions or probabilities returned from our learned model. If probabilities are given, a threshold is used to select the keyphrases.

Keyphrase extractors are generally evaluated using [[precision and recall]]. Precision measures how many of the proposed keyphrases are actually correct. Recall measures how many of the true keyphrases the system proposed. The two measures can be combined in an F-score, which is the harmonic mean of the two (''F'' = 2''PR''/(''P'' + ''R'')). Matches between the proposed keyphrases and the known keyphrases can be checked after stemming or applying some other text normalization.

Designing a supervised keyphrase extraction system involves several choices (some of these apply to unsupervised systems, too). The first choice is exactly how to generate examples. Turney and others have used all possible unigrams, bigrams, and trigrams without intervening punctuation and after removing stopwords. Hulth showed that some improvement can be gained by selecting examples to be sequences of tokens that match certain patterns of part-of-speech tags. Ideally, the mechanism for generating examples produces all the known labeled keyphrases as candidates, though this is often not the case. For example, if we use only unigrams, bigrams, and trigrams, then we will never be able to extract a known keyphrase containing four words. Thus, recall may suffer. However, generating too many examples can also lead to low precision.

We also need to create features that describe the examples and are informative enough to allow a learning algorithm to discriminate keyphrases from non-keyphrases. Typically features involve various term frequencies (how many times a phrase appears in the current text or in a larger corpus), the length of the example, the relative position of its first occurrence, various Boolean syntactic features (e.g., contains all capitals), etc. The Turney paper used about 12 such features. Hulth used a reduced set of features, which were found most successful in the KEA (Keyphrase Extraction Algorithm) work derived from Turney's seminal paper.

In the end, the system will need to return a list of keyphrases for a test document, so we need a way to limit the number. Ensemble methods (i.e., using votes from several classifiers) have been used to produce numeric scores that can be thresholded to yield a user-specified number of keyphrases. This is the technique used by Turney with C4.5 decision trees. Hulth used a single binary classifier so the learning algorithm implicitly determines the appropriate number.

Once examples and features are created, we need a way to learn to predict keyphrases. Virtually any supervised learning algorithm could be used, such as decision trees, [[Naive Bayes]], and rule induction. In the case of Turney's GenEx algorithm, a [[genetic algorithm]] is used to learn parameters for a domain-specific keyphrase extraction algorithm. The extractor follows a series of heuristics to identify keyphrases. The genetic algorithm optimizes parameters for these heuristics with respect to performance on training documents with known keyphrases.
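
The evaluation described above can be sketched directly in code. The snippet below computes precision, recall, and the F-score ''F'' = 2''PR''/(''P'' + ''R'') for a set of proposed keyphrases against the known keyphrases; the <code>normalize</code> function here only lower-cases, but a stemmer could be substituted.

<syntaxhighlight lang="python">
def evaluate(proposed, gold, normalize=lambda s: s.lower().strip()):
    """Precision, recall and F-score of proposed keyphrases against the
    known (gold) keyphrases, after simple text normalization."""
    proposed_set = {normalize(p) for p in proposed}
    gold_set = {normalize(g) for g in gold}
    correct = len(proposed_set & gold_set)
    precision = correct / len(proposed_set) if proposed_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall > 0 else 0.0)
    return precision, recall, f_score

# Example: 2 of 3 proposals are correct, and 2 of 4 gold keyphrases are found.
p, r, f = evaluate(
    ["New Orleans", "President Bush", "storm"],
    ["New Orleans", "President Bush", "Army Corps of Engineers",
     "defective flood-control pumps"])
# p = 2/3, r = 1/2, f = 2*(2/3)*(1/2) / (2/3 + 1/2) ≈ 0.571
</syntaxhighlight>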
====Unsupervised approach: TextRank====
Another keyphrase extraction algorithm is TextRank. While supervised methods have some nice properties, like being able to produce interpretable rules for what features characterize a keyphrase, they also require a large amount of [[training set|training data]]. Many documents with known keyphrases are needed. Furthermore, training on a specific domain tends to customize the extraction process to that domain, so the resulting classifier is not necessarily portable, as some of Turney's results demonstrate. Unsupervised keyphrase extraction removes the need for training data. It approaches the problem from a different angle. Instead of trying to learn explicit features that characterize keyphrases, the TextRank algorithm<ref>Rada Mihalcea and Paul Tarau, 2004: ''TextRank: Bringing Order into Texts'', Department of Computer Science, University of North Texas {{Cite web |url=http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Mihalcea.pdf |title=Archived copy |access-date=2012-07-20 |archive-date=2012-06-17 |archive-url=https://web.archive.org/web/20120617170501/http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Mihalcea.pdf |url-status=bot: unknown }}</ref> exploits the structure of the text itself to determine keyphrases that appear "central" to the text in the same way that [[PageRank]] selects important Web pages. Recall this is based on the notion of "prestige" or "recommendation" from [[social network]]s. In this way, TextRank does not rely on any previous training data at all, but rather can be run on any arbitrary piece of text, and it can produce output simply based on the text's intrinsic properties. Thus the algorithm is easily portable to new domains and languages.

TextRank is a general-purpose [[Graph (abstract data type)|graph]]-based ranking algorithm for [[natural language processing|NLP]]. Essentially, it runs PageRank on a graph specially designed for a particular NLP task. For keyphrase extraction, it builds a graph using some set of text units as vertices. Edges are based on some measure of semantic or [[lexical (semiotics)|lexical]] [[semantic similarity|similarity]] between the text unit vertices. Unlike PageRank, the edges are typically undirected and can be weighted to reflect a degree of similarity. Once the graph is constructed, it is used to form a stochastic matrix, combined with a damping factor (as in the "random surfer model"), and the ranking over vertices is obtained by finding the eigenvector corresponding to [[eigenvalue]] 1 (i.e., the [[stationary distribution]] of the [[random walk]] on the graph).

The vertices should correspond to what we want to rank. Potentially, we could do something similar to the supervised methods and create a vertex for each unigram, bigram, trigram, etc. However, to keep the graph small, the authors decided to rank individual unigrams in a first step, and then include a second step that merges highly ranked adjacent unigrams to form multi-word phrases. This has a nice side effect of allowing us to produce keyphrases of arbitrary length. For example, if we rank unigrams and find that "advanced", "natural", "language", and "processing" all get high ranks, then we would look at the original text and see that these words appear consecutively and create a final keyphrase using all four together. Note that the unigrams placed in the graph can be filtered by part of speech. The authors found that adjectives and nouns were the best to include. Thus, some linguistic knowledge comes into play in this step.
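
The ranking step can be sketched as a simple power iteration. The function below is a generic damped random-walk ranker over an undirected, weighted graph, approximating the stationary distribution described above; it is a minimal sketch rather than the exact formulation of the TextRank paper (which normalizes the teleportation term slightly differently).

<syntaxhighlight lang="python">
def pagerank(graph, damping=0.85, iterations=50):
    """Rank the vertices of an undirected, weighted graph by power iteration.

    `graph` maps each vertex to a dict {neighbour: edge_weight}. The result
    approximates the stationary distribution of the damped random walk,
    i.e. the eigenvector for eigenvalue 1 of the stochastic matrix.
    """
    vertices = list(graph)
    n = len(vertices)
    scores = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        new_scores = {}
        for v in vertices:
            # Weighted contribution from every neighbour u of v.
            rank = sum(scores[u] * graph[u][v] / sum(graph[u].values())
                       for u in graph if v in graph[u])
            new_scores[v] = (1 - damping) / n + damping * rank
        scores = new_scores
    return scores
</syntaxhighlight>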
Edges are created based on word [[co-occurrence]] in this application of TextRank. Two vertices are connected by an edge if the [[unigram]]s appear within a window of size N in the original text. N is typically around 2–10. Thus, "natural" and "language" might be linked in a text about NLP. "Natural" and "processing" would also be linked because they would both appear in the same string of N words. These edges build on the notion of "text [[Cohesion (linguistics)|cohesion]]" and the idea that words that appear near each other are likely related in a meaningful way and "recommend" each other to the reader.

Since this method simply ranks the individual vertices, we need a way to threshold or produce a limited number of keyphrases. The technique chosen is to set a count T to be a user-specified fraction of the total number of vertices in the graph. Then the top T vertices/unigrams are selected based on their stationary probabilities. A post-processing step is then applied to merge adjacent instances of these T unigrams. As a result, potentially more or fewer than T final keyphrases will be produced, but the number should be roughly proportional to the length of the original text. A sketch of this whole pipeline is given after this section.

It is not initially clear why applying PageRank to a co-occurrence graph would produce useful keyphrases. One way to think about it is the following. A word that appears multiple times throughout a text may have many different co-occurring neighbors. For example, in a text about machine learning, the unigram "learning" might co-occur with "machine", "supervised", "un-supervised", and "semi-supervised" in four different sentences. Thus, the "learning" vertex would be a central "hub" that connects to these other modifying words. Running PageRank/TextRank on the graph is likely to rank "learning" highly. Similarly, if the text contains the phrase "supervised classification", then there would be an edge between "supervised" and "classification". If "classification" appears several other places and thus has many neighbors, its importance would contribute to the importance of "supervised". If it ends up with a high rank, it will be selected as one of the top T unigrams, along with "learning" and probably "classification". In the final post-processing step, we would then end up with keyphrases "supervised learning" and "supervised classification".

In short, the co-occurrence graph will contain densely connected regions for terms that appear often and in different contexts. A random walk on this graph will have a stationary distribution that assigns large probabilities to the terms in the centers of the clusters. This is similar to densely connected Web pages getting ranked highly by PageRank. This approach has also been used in document summarization, considered below.
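
Putting the pieces together, a TextRank-style extractor can be sketched as follows. It reuses the hypothetical <code>pagerank</code> function from the previous sketch, builds co-occurrence edges within a window, keeps the top fraction of vertices, and merges adjacent selected words into phrases; the part-of-speech filtering of candidate words that the authors found important is omitted to keep the sketch short.

<syntaxhighlight lang="python">
import re

def textrank_keyphrases(text, window=2, keep_fraction=0.33):
    """Minimal TextRank-style sketch: co-occurrence graph over word tokens,
    ranking via the pagerank() sketch above, selection of the top fraction
    of vertices, and merging of adjacent selected words into phrases."""
    # Tokens of two or more letters; real systems would also filter by POS.
    tokens = re.findall(r"[a-z][a-z-]+", text.lower())

    # Vertices are word types; an edge links two words that co-occur
    # within `window` tokens (window=2 links adjacent words).
    graph = {t: {} for t in tokens}
    for i, u in enumerate(tokens):
        for v in tokens[i + 1:i + window]:
            if u != v:
                graph[u][v] = graph[u].get(v, 0) + 1
                graph[v][u] = graph[v].get(u, 0) + 1

    scores = pagerank(graph)                      # stationary probabilities
    top_n = max(1, int(keep_fraction * len(graph)))
    selected = set(sorted(scores, key=scores.get, reverse=True)[:top_n])

    # Post-processing: merge runs of adjacent selected words into phrases.
    keyphrases, phrase = set(), []
    for t in tokens + [""]:                       # sentinel flushes the last run
        if t in selected:
            phrase.append(t)
        else:
            if phrase:
                keyphrases.add(" ".join(phrase))
            phrase = []
    return keyphrases
</syntaxhighlight>

Here <code>keep_fraction</code> plays the role of the user-specified fraction that determines the count T described above.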