
==Tasks and subtasks==
Applying information extraction to text is linked to the problem of [[text simplification]], in that both seek to create a structured view of the information present in free text. The overall goal is to create text that is more easily machine-readable, so that its sentences can be processed. Typical IE tasks and subtasks include:

* Template filling: Extracting a fixed set of fields from a document, e.g. extract perpetrators, victims, time, etc. from a newspaper article about a terrorist attack.
** Event extraction: Given an input document, output zero or more event templates. For instance, a newspaper article might describe multiple terrorist attacks.
* [[Knowledge Base]] Population: Fill a database of facts given a set of documents. Typically the database is in the form of triplets (entity 1, relation, entity 2), e.g. ([[Barack Obama]], Spouse, [[Michelle Obama]]).
** [[Named entity recognition]]: recognition of known entity names (for people and organizations), place names, temporal expressions, and certain types of numerical expressions, by employing existing knowledge of the domain or information extracted from other sentences.<ref name="ecir2019">{{cite conference |author=Nguyen |first=Dat Quoc |last2=Verspoor |first2=Karin |author-link2=Karin Verspoor |year=2019 |title=End-to-end neural relation extraction using deep biaffine attention |conference= |arxiv=1812.11275 |doi=10.1007/978-3-030-15712-8_47 |book-title=Proceedings of the 41st European Conference on Information Retrieval (ECIR)}}</ref> Typically the recognition task involves assigning a unique identifier to the extracted entity. A simpler task is ''named entity detection'', which aims at detecting entities without having any existing knowledge about the entity instances. For example, in processing the sentence "M. Smith likes fishing", ''named entity detection'' would mean '''detecting''' that the phrase "M. Smith" refers to a person, without necessarily having (or using) any knowledge about the particular ''M. Smith'' that the sentence is (or might be) talking about.
** [[Coreference]] resolution: detection of [[coreference]] and [[Anaphora (linguistics)|anaphoric]] links between text entities. In IE tasks, this is typically restricted to finding links between previously extracted named entities. For example, "International Business Machines" and "IBM" refer to the same real-world entity. If we take the two sentences "M. Smith likes fishing. But he doesn't like biking", it would be beneficial to detect that "he" is referring to the previously detected person "M. Smith".
** [[Relationship extraction]]: identification of relations between entities,<ref name="ecir2019" /> such as:
*** PERSON works for ORGANIZATION (extracted from the sentence "Bill works for IBM.")
*** PERSON located in LOCATION (extracted from the sentence "Bill is in France.")
* Semi-structured information extraction, which may refer to any IE that tries to restore some kind of information structure that has been lost through publication, such as:
** Table extraction: finding and extracting tables from documents.<ref name="A framework for information extract">{{cite journal | vauthors = Milosevic N, Gregson C, Hernandez R, Nenadic G | title = A framework for information extraction from tables in biomedical literature | journal = International Journal on Document Analysis and Recognition | volume = 22 | issue = 1 | pages = 55–78 | date = February 2019 | doi = 10.1007/s10032-019-00317-0 | arxiv = 1902.10031 | bibcode = 2019arXiv190210031M | s2cid = 62880746 }}</ref><ref>{{cite thesis |type=PhD |last=Milosevic |first=Nikola |date=2018 |title=A multi-layered approach to information extraction from tables in biomedical documents |publisher=University of Manchester | url=https://www.research.manchester.ac.uk/portal/files/70405100/FULL_TEXT.PDF}}</ref>
** Table information extraction: extracting information in a structured manner from tables. This task is more complex than table extraction, which is only the first step: understanding the roles of the cells, rows and columns, linking the information inside the table, and understanding the information presented in the table are additional tasks necessary for table information extraction.<ref name="A framework for information extract"/><ref>{{cite book | vauthors = Milosevic N, Gregson C, Hernandez R, Nenadic G | title = Natural Language Processing and Information Systems | chapter = Disentangling the Structure of Tables in Scientific Literature | series = Lecture Notes in Computer Science | volume = 21  | date = June 2016 | pages = 162–174 | doi = 10.1007/978-3-319-41754-7_14 | isbn = 978-3-319-41753-0 | s2cid = 19538141 | chapter-url = https://pure.manchester.ac.uk/ws/files/41051279/Disentangling_the_Structure_of_Tables_in_Scientific_Literature.pdf }}</ref><ref>{{cite thesis |type=PhD |last=Milosevic |first=Nikola |date=2018 |title=A multi-layered approach to information extraction from tables in biomedical documents |publisher=University of Manchester | url=https://www.research.manchester.ac.uk/portal/files/70405100/FULL_TEXT.PDF}}</ref>
** Comments extraction: extracting comments from the actual content of articles in order to restore the link between each sentence and its author
* Language and vocabulary analysis
** [[Terminology extraction]]: finding the relevant terms for a given [[text corpus|corpus]]
* Audio extraction
** Template-based music extraction: finding relevant characteristics in an audio signal taken from a given repertoire; for instance,<ref>A.Zils, F.Pachet, O.Delerue and F. Gouyon, [http://www.csl.sony.fr/downloads/papers/2002/ZilsMusic.pdf Automatic Extraction of Drum Tracks from Polyphonic Music Signals] {{Webarchive|url=https://web.archive.org/web/20170829163036/http://www.csl.sony.fr/downloads/papers/2002/ZilsMusic.pdf |date=2017-08-29 }}, Proceedings of WedelMusic, Darmstadt, Germany, 2002.</ref> time indexes of occurrences of percussive sounds can be extracted in order to represent the essential rhythmic component of a music piece.
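
As a minimal illustration of relationship extraction feeding knowledge base population, the following Python sketch turns the two example sentences from the list above into (entity 1, relation, entity 2) triplets. The hand-written patterns, the relation names <code>works_for</code> and <code>located_in</code>, and the function <code>extract_triples</code> are assumptions made for this example only; practical systems typically rely on trained models rather than fixed surface patterns.

```python
import re

# Hypothetical surface patterns, one per relation type. These are
# illustrative assumptions, not a standard inventory of IE patterns.
PATTERNS = [
    ("works_for", re.compile(r"(?P<e1>[A-Z]\w*) works for (?P<e2>[A-Z]\w*)")),
    ("located_in", re.compile(r"(?P<e1>[A-Z]\w*) is in (?P<e2>[A-Z]\w*)")),
]


def extract_triples(text):
    """Return (entity 1, relation, entity 2) triplets matched in text."""
    triples = []
    for relation, pattern in PATTERNS:
        for match in pattern.finditer(text):
            triples.append((match.group("e1"), relation, match.group("e2")))
    return triples


triples = extract_triples("Bill works for IBM. Bill is in France.")
print(triples)
```

Each pattern captures only single capitalised tokens, so multi-word names such as "International Business Machines" would need a separate named entity step before the relation patterns are applied.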

This list is not exhaustive. The exact meaning of IE activities is not commonly agreed upon, and many approaches combine multiple sub-tasks of IE in order to achieve a wider goal. Machine learning, statistical analysis and/or natural language processing are often used in IE.
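
How two sub-tasks can be combined may be sketched as follows: a naive named entity detection step finds person mentions, and a naive coreference step then links each pronoun to the nearest preceding mention, using the "M. Smith" sentences from the list above. The regular expressions and the function <code>detect_and_link</code> are hypothetical simplifications chosen for this example and do not represent any particular system.

```python
import re

# Assumed, deliberately naive rule: an initial followed by a capitalised
# surname ("M. Smith") is treated as a person mention.
PERSON = re.compile(r"\b[A-Z]\. [A-Z][a-z]+")
# Only two pronouns are handled, to keep the sketch short.
PRONOUN = re.compile(r"\b(?:he|she)\b", re.IGNORECASE)


def detect_and_link(text):
    """Detect person mentions, then link each pronoun to the nearest
    person mention that precedes it in the text."""
    persons = [(m.start(), m.group()) for m in PERSON.finditer(text)]
    links = []
    for pronoun in PRONOUN.finditer(text):
        earlier = [name for pos, name in persons if pos < pronoun.start()]
        if earlier:
            links.append((pronoun.group(), earlier[-1]))
    return [name for _, name in persons], links


persons, links = detect_and_link(
    "M. Smith likes fishing. But he doesn't like biking."
)
print(persons, links)
```

The detection step uses no knowledge about any specific M. Smith, and the linking step is restricted to previously detected entities, as described for coreference resolution in IE above.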

IE on non-text documents is becoming an increasingly interesting topic{{when|date=March 2017}} in research, and information extracted from multimedia documents can now{{when|date=March 2017}} be expressed in a high-level structure, as is done for text. This naturally leads to the fusion of extracted information from multiple kinds of documents and sources.