== Software and engineering ==

=== Text documents ===
{{Main|Text annotation}}
[[Markup language]]s like [[XML]] and [[HTML]] annotate text in a way that is syntactically distinguishable from that text. They can be used to add information about the desired visual presentation, or machine-readable semantic information, as in the [[semantic web]].<ref name="Web Annotation Data Model">{{cite web | title = Web Annotation Data Model | url =http://www.w3.org/TR/annotation-model/ | publisher = [[World Wide Web Consortium]] | date = 11 December 2014 | access-date = 25 August 2015}}</ref>

=== Tabular data ===
Tabular data includes formats such as [[Comma-separated values|CSV]] and [[Microsoft Excel|XLS]]. '''Semantic Labelling''' is the process of assigning semantic annotations from [[ontologies]] to tabular data.<ref name="auto12">{{Cite journal |last1=Alobaid |first1=Ahmad |last2=Kacprzak |first2=Emilia |last3=Corcho |first3=Oscar |date=January 1, 2021 |title=Typology-based semantic labeling of numeric tabular data |url=https://content.iospress.com/articles/semantic-web/sw200397 |journal=Semantic Web |volume=12 |issue=1 |pages=5–20 |doi=10.3233/SW-200397 |via=content.iospress.com |s2cid=224853014|url-access=subscription }}</ref><ref>{{Cite journal |last1=Taheriyan |first1=Mohsen |last2=Knoblock |first2=Craig A. 
|last3=Szekely |first3=Pedro |last4=Ambite |first4=José Luis |date=March 1, 2016 |title=Learning the semantics of structured data sources |url=https://doi.org/10.1016/j.websem.2015.12.003 |journal=Web Semantics: Science, Services and Agents on the World Wide Web |volume=37 |issue=C |pages=152–169 |arxiv=1601.04105 |doi=10.1016/j.websem.2015.12.003 |via=March 2016 |s2cid=7409058}}</ref><ref name="auto2">{{Cite book |last1=Alobaid |first1=Ahmad |last2=Corcho |first2=Oscar |title=Knowledge Engineering and Knowledge Management |chapter=Fuzzy Semantic Labeling of Semi-structured Numerical Datasets |date=2018 |editor-last=Faron Zucker |editor-first=Catherine |editor2-last=Ghidini |editor2-first=Chiara |editor3-last=Napoli |editor3-first=Amedeo |editor4-last=Toussaint |editor4-first=Yannick |chapter-url=http://oa.upm.es/56252/ |series=Lecture Notes in Computer Science |language=en |location=Cham |publisher=Springer International Publishing |volume=11313 |pages=19–33 |doi=10.1007/978-3-030-03667-6_2 |isbn=978-3-030-03667-6}}</ref><ref name=":02">{{Cite journal |last1=Alobaid |first1=Ahmad |last2=Corcho |first2=Oscar |date=2022-03-15 |title=Balancing coverage and specificity for semantic labelling of subject columns |url=https://www.sciencedirect.com/science/article/pii/S095070512101159X |journal=Knowledge-Based Systems |language=en |volume=240 |pages=108092 |doi=10.1016/j.knosys.2021.108092 |issn=0950-7051 |s2cid=245971543|url-access=subscription }}</ref> This process is also referred to as semantic annotation.<ref>{{Cite web <!-- Citation bot bypass--> |last1=Hassanzadeh |first1=O. |last2=Ward |first2=Michael J. 
|last3=Rodriguez-Muro |first3=Mariano |last4=Srinivas |first4=Kavitha |date=December 17, 2015|url=https://www.semanticscholar.org/paper/Understanding-a-large-corpus-of-web-tables-through-Hassanzadeh-Ward/f3d7550fcdf9c284874c05931ced2ffbcb2accc0 |title=Understanding a large corpus of web tables through matching with knowledge bases: an empirical study |s2cid=442374}}</ref><ref name=":02"/> Semantic Labelling is often done in a (semi-)automatic fashion. Semantic Labelling techniques work on entity columns,<ref name=":02"/> numeric columns,<ref name="auto12"/><ref name="auto2"/><ref>{{Cite book |last1=Neumaier |first1=Sebastian |last2=Umbrich |first2=Jürgen |last3=Parreira |first3=Josiane Xavier |last4=Polleres |first4=Axel |title=The Semantic Web – ISWC 2016 |chapter=Multi-level Semantic Labelling of Numerical Values |date=2016 |editor-last=Groth |editor-first=Paul |editor2-last=Simperl |editor2-first=Elena |editor3-last=Gray |editor3-first=Alasdair |editor4-last=Sabou |editor4-first=Marta |editor5-last=Krötzsch |editor5-first=Markus |editor6-last=Lecue |editor6-first=Freddy |editor7-last=Flöck |editor7-first=Fabian |editor8-last=Gil |editor8-first=Yolanda |chapter-url=https://link.springer.com/chapter/10.1007/978-3-319-46523-4_26 |series=Lecture Notes in Computer Science |language=en |location=Cham |publisher=Springer International Publishing |volume=9981 |pages=428–445 |doi=10.1007/978-3-319-46523-4_26 |isbn=978-3-319-46523-4}}</ref><ref name=":102">{{Cite book |last1=Zhang |first1=Meihui |last2=Chakrabarti |first2=Kaushik |title=Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data |chapter=InfoGather+ |date=2013-06-22 |chapter-url=https://doi.org/10.1145/2463676.2465276 |series=SIGMOD '13 |location=New York, NY, USA |publisher=Association for Computing Machinery |pages=145–156 |doi=10.1145/2463676.2465276 |isbn=978-1-4503-2037-5 |s2cid=15540847}}</ref> coordinates,<ref name=":1">{{Cite book |last1=Ritze |first1=Dominique 
|title=Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics |last2=Lehmberg |first2=Oliver |last3=Bizer |first3=Christian |date=July 13, 2015 |publisher=Association for Computing Machinery |isbn=9781450332934 |pages=1–6 |chapter=Matching HTML Tables to DBpedia |doi=10.1145/2797115.2797118 |chapter-url=https://doi.org/10.1145/2797115.2797118 |via=ACM Digital Library |s2cid=207228254}}</ref> and more.<ref name=":1" /><ref name=":102"/>

==== Semantic labelling techniques ====
Several semantic labelling approaches utilise machine learning techniques. These techniques can be categorised following the work of Flach<ref name=":2">{{Cite book |last=Flach |first=Peter |url=https://www.cambridge.org/core/books/machine-learning/621D3E616DF879E494B094CC93ED36A4 |title=Machine Learning: The Art and Science of Algorithms that Make Sense of Data |date=2012 |publisher=Cambridge University Press |isbn=978-1-107-09639-4 |location=Cambridge |doi=10.1017/cbo9780511973000}}</ref><ref name=":5">{{Cite thesis |title=Knowledge-Graph-Based Semantic Labeling of Tabular Data |url=https://oa.upm.es/64068/ |publisher=E.T.S. de Ingenieros Informáticos (UPM) |date=c. 2020 |degree=phd |doi=10.20868/upm.thesis.64068 |first=Ahmad |last=Alobaid}}</ref> as follows: geometric (using lines and planes, such as [[support-vector machine]]s and [[linear regression]]), probabilistic (e.g., [[conditional random field]]s), logical (e.g., [[decision tree learning]]), and non-ML techniques (e.g., balancing coverage and specificity<ref name=":02"/>). Note that the geometric, probabilistic, and logical machine learning models are not mutually exclusive.<ref name=":2" />

===== Geometric techniques =====
Pham et al.<ref name=":6">{{Cite book |last1=Pham |first1=Minh |last2=Alse |first2=Suresh |last3=Knoblock |first3=Craig A. 
|last4=Szekely |first4=Pedro |title=The Semantic Web – ISWC 2016 |chapter=Semantic Labeling: A Domain-Independent Approach |date=2016 |editor-last=Groth |editor-first=Paul |editor2-last=Simperl |editor2-first=Elena |editor3-last=Gray |editor3-first=Alasdair |editor4-last=Sabou |editor4-first=Marta |editor5-last=Krötzsch |editor5-first=Markus |editor6-last=Lecue |editor6-first=Freddy |editor7-last=Flöck |editor7-first=Fabian |editor8-last=Gil |editor8-first=Yolanda |chapter-url=https://link.springer.com/chapter/10.1007/978-3-319-46523-4_27 |series=Lecture Notes in Computer Science |language=en |location=Cham |publisher=Springer International Publishing |volume=9981 |pages=446–462 |doi=10.1007/978-3-319-46523-4_27 |isbn=978-3-319-46523-4|s2cid=37873758 }}</ref> use the [[Jaccard index]] and [[Tf–idf|TF-IDF]] similarity for textual data and the [[Kolmogorov–Smirnov test]] for numeric data. Alobaid and Corcho<ref name="auto2"/> use [[fuzzy clustering]] (c-means<ref>{{Citation |title=Fuzzy c-Means Library |date=2022-01-29 |url=https://github.com/oeg-upm/fcm-cpp |publisher=Ontology Engineering Group (UPM) |access-date=2023-01-04}}</ref><ref>{{Citation |title=fuzzy-c-means |date=2022-12-12 |url=https://github.com/oeg-upm/fuzzy-c-means |publisher=Ontology Engineering Group (UPM) |access-date=2023-01-04}}</ref>) to label numeric columns.

===== Probabilistic techniques =====
Limaye et al.<ref name=":7">{{Cite journal |last1=Limaye |first1=Girija |last2=Sarawagi |first2=Sunita |last3=Chakrabarti |first3=Soumen |date=2010-09-01 |title=Annotating and searching web tables using entities, types and relationships |url=https://doi.org/10.14778/1920841.1921005 |journal=Proceedings of the VLDB Endowment |volume=3 |issue=1–2 |pages=1338–1347 |doi=10.14778/1920841.1921005 |issn=2150-8097 |s2cid=9262964}}</ref> use [[Tf–idf|TF-IDF]] similarity and [[graphical model]]s. They also use a [[support-vector machine]] to compute the weights. 
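As a rough illustration of the TF-IDF similarity mentioned above, the following Python sketch compares short text values such as table cells. The function names (<code>tfidf_vectors</code>, <code>cosine</code>), the whitespace tokeniser, and the smoothed IDF formula are assumptions chosen for illustration; they are not taken from the cited papers.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict term -> weight) per document."""
    # Assumption: lower-cased whitespace tokenisation is good enough for a sketch.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        vectors.append({
            # Relative term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_counts.items()
        })
    return vectors


def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    cells = ["Richard Feynman", "Richard P. Feynman", "Marie Curie"]
    vectors = tfidf_vectors(cells)
    print(cosine(vectors[0], vectors[1]))  # shares two terms with the first cell
    print(cosine(vectors[0], vectors[2]))  # shares no terms with the first cell
```

Values that share no terms score zero, and identical values score one; a real system would typically add better tokenisation and a larger reference corpus for the document frequencies.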
Venetis et al.<ref name=":8">{{Cite journal |last1=Venetis |first1=Petros |last2=Halevy |first2=Alon |last3=Madhavan |first3=Jayant |last4=Paşca |first4=Marius |last5=Shen |first5=Warren |last6=Wu |first6=Fei |last7=Miao |first7=Gengxin |last8=Wu |first8=Chung |date=2011-06-01 |title=Recovering semantics of tables on the web |url=https://doi.org/10.14778/2002938.2002939 |journal=Proceedings of the VLDB Endowment |volume=4 |issue=9 |pages=528–538 |doi=10.14778/2002938.2002939 |issn=2150-8097 |s2cid=11359711}}</ref> construct an isA database consisting of (instance, class) pairs and then compute the maximum likelihood using these pairs. Alobaid and Corcho<ref>{{Cite journal |last=Alobaid |first=Ahmad |last2=Corcho |first2=Oscar |date=March 2024|title=Linear approximation of the quantile–quantile plot for semantic labelling of numeric columns in tabular data |url=https://linkinghub.elsevier.com/retrieve/pii/S0957417423026544 |journal=Expert Systems with Applications |language=en |volume=238 |pages=122152 |doi=10.1016/j.eswa.2023.122152|url-access=subscription }}</ref> approximated the q-q plot for predicting the properties of numeric columns.

===== Logical techniques =====
Syed et al.<ref name=":4">{{Cite journal |last1=Syed |first1=Zareen |last2=Finin |first2=Tim |last3=Mulwad |first3=Varish |last4=Joshi |first4=Anupam |date=2010-04-26 |title=Exploiting a Web of Semantic Data for Interpreting Tables |url=https://ebiquity.umbc.edu/paper/html/id/474 |journal=Proceedings of the Second Web Science Conference |language=en}}</ref> built Wikitology, which is "a hybrid knowledge base of structured and unstructured information extracted from Wikipedia augmented by RDF data from DBpedia and other Linked Data resources."<ref name=":4" /> For the Wikitology index, they use [[PageRank]] for [[entity linking]], a task often performed in semantic labelling. 
Since they were not able to query Google for all Wikipedia articles to get the [[PageRank]], they used a [[decision tree]] to approximate it.<ref name=":4" />

===== Non-ML techniques =====
Alobaid and Corcho<ref name=":02" /> presented an approach to annotate entity columns. The technique starts by annotating the cells in the entity column with the entities from the reference knowledge graph (e.g., [[DBpedia]]). The classes are then gathered, and each is scored using several formulas that take into account the frequency of the class and its depth in the subClass hierarchy.<ref>{{Cite web |title=OWL Web Ontology Language Reference |url=https://www.w3.org/TR/owl-ref/Overview.html |access-date=2022-09-22 |website=www.w3.org}}</ref>

==== Semantic labelling common tasks ====
Common semantic labelling tasks presented in the literature include the following.

===== Entity linking and disambiguation =====
This is the most common task in semantic labelling. Given the text of a cell and a data source, the approach predicts the entity and links it to the matching entity in the given data source. For example, if the input to the approach were the text "Richard Feynman" and a URL to the SPARQL endpoint of DBpedia, the approach would return "[https://dbpedia.org/resource/Richard_Feynman http://dbpedia.org/resource/Richard_Feynman]", which is the entity from DBpedia. 
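The Richard Feynman example above can be sketched as a simple label lookup. This is a minimal illustration under stated assumptions: instead of querying a SPARQL endpoint, it uses a small in-memory list of (label, URI) pairs, and the helper names <code>build_index</code> and <code>link_cell</code> as well as the case and whitespace normalisation are hypothetical choices, not part of any cited approach.

```python
def normalise(text):
    """Collapse whitespace and ignore case (an assumption of this sketch)."""
    return " ".join(text.split()).casefold()


def build_index(labelled_entities):
    """Map each normalised label to its entity URI."""
    return {normalise(label): uri for label, uri in labelled_entities}


def link_cell(cell_text, index):
    """Return the URI whose label matches the cell text, or None if no match."""
    return index.get(normalise(cell_text))


if __name__ == "__main__":
    # Hypothetical stand-in for labels fetched from a knowledge graph.
    entities = [
        ("Richard Feynman", "http://dbpedia.org/resource/Richard_Feynman"),
        ("Madrid", "http://dbpedia.org/resource/Madrid"),
    ]
    index = build_index(entities)
    print(link_cell("Richard Feynman", index))
    print(link_cell("Feynman", index))  # no exact label match, so None
```

A lookup of this kind cannot link partial or misspelled cell values, which is one reason some approaches rely on similarity metrics instead.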
Some approaches use exact match,<ref name=":02" /> while others use similarity metrics such as [[cosine similarity]].<ref name=":7" />

===== Subject column identification =====
The subject column of a table is the column that contains the main subjects/entities in the table.<ref name="auto12"/><ref name=":5" /><ref name=":8" /><ref>{{Citation |last1=Ermilov |first1=Ivan |date=2016 |url=http://dx.doi.org/10.1007/978-3-319-49004-5_11 |pages=163–179 |place=Cham |publisher=Springer International Publishing |doi=10.1007/978-3-319-49004-5_11 |isbn=978-3-319-49003-8 |access-date=2022-09-22 |last2=Ngomo |first2=Axel-Cyrille Ngonga|title=Knowledge Engineering and Knowledge Management |chapter=TAIPAN: Automatic Property Mapping for Tabular Data |series=Lecture Notes in Computer Science |volume=10024 |s2cid=37730677 |url-access=subscription }}</ref><ref name=":9">{{Cite journal |last=Zhang |first=Ziqi |date=2017-08-07 |editor-last=Hitzler |editor-first=Pascal |editor2-last=Cruz |editor2-first=Isabel |title=Effective and efficient Semantic Table Interpretation using TableMiner+ |url=https://www.medra.org/servlet/aliasResolver?alias=iospress&doi=10.3233/SW-160242 |journal=Semantic Web |volume=8 |issue=6 |pages=921–957 |doi=10.3233/SW-160242}}</ref> Some approaches expect the subject column as an input,<ref name=":02" /> while others, such as TableMiner+, predict it.<ref name=":9" />

===== Column data-type detection =====
Column types are divided differently by different approaches.<ref name=":5" /> Some divide them into strings/text and numbers<ref name="auto2" /><ref name=":6" /><ref>{{Cite book |last1=Ramnandan |first1=S.K. |last2=Mittal |first2=Amol |last3=Knoblock |first3=Craig A. |last4=Szekely |first4=Pedro |title=The Semantic Web. 
Latest Advances and New Domains |chapter=Assigning Semantic Labels to Data Sources |date=2015 |editor-last=Gandon |editor-first=Fabien |editor2-last=Sabou |editor2-first=Marta |editor3-last=Sack |editor3-first=Harald |editor4-last=d’Amato |editor4-first=Claudia |editor5-last=Cudré-Mauroux |editor5-first=Philippe |editor6-last=Zimmermann |editor6-first=Antoine |series=Lecture Notes in Computer Science |language=en |location=Cham |publisher=Springer International Publishing |volume=9088 |pages=403–417 |doi=10.1007/978-3-319-18818-8_25 |isbn=978-3-319-18818-8|s2cid=7040223 |doi-access=free }}</ref><ref name=":102"/> while others divide them further<ref name=":5" /> (e.g., Number Typology,<ref name="auto12" /> Date,<ref name=":4" /><ref name=":8" /> coordinates<ref>{{Cite book |last1=Quercini |first1=Gianluca |last2=Reynaud |first2=Chantal |title=Proceedings of the 16th International Conference on Extending Database Technology |chapter=Entity discovery and annotation in tables |date=2013 |chapter-url=http://dx.doi.org/10.1145/2452376.2452457 |location=New York, New York, USA |publisher=ACM Press |page=693 |doi=10.1145/2452376.2452457 |isbn=9781450315975 |s2cid=8252126|url=https://hal.inria.fr/hal-00832639/file/edbt2013.pdf }}</ref>).

===== Relation prediction =====
The relation between [[Madrid]] and [[Spain]] is "capitalOf".<ref>{{Cite web |title=About: capital of |url=https://dbpedia.org/property/capitalOf |access-date=2022-09-22 |website=dbpedia.org}}</ref> Such relations can easily be found in ontologies, such as [[DBpedia]]. Venetis et al.<ref name=":8" /> use TextRunner<ref>{{Cite journal |last1=Etzioni |first1=Oren |last2=Banko |first2=Michele |last3=Soderland |first3=Stephen |last4=Weld |first4=Daniel S. 
|date=2008-12-01 |title=Open information extraction from the web |url=https://doi.org/10.1145/1409360.1409378 |journal=Communications of the ACM |volume=51 |issue=12 |pages=68–74 |doi=10.1145/1409360.1409378 |issn=0001-0782 |s2cid=207169186}}</ref> to extract the relation between two columns. Syed et al.<ref name=":4" /> look up the relations between the entities of the two columns and select the most frequent one.

==== Gold standards ====
T2D<ref name=":3">{{Cite web |last=Bizer |first=Dominique Ritze, Oliver Lehmberg, Christian |title=Web Data Commons - T2Dv2 |url=http://webdatacommons.org/webtables/goldstandardV2.html |access-date=2022-07-18 |website=webdatacommons.org}}</ref> is the most common gold standard for semantic labelling. Two versions of T2D exist: T2Dv1 (sometimes referred to simply as T2D) and T2Dv2.<ref name=":3" /> Other known benchmarks are published with the SemTab Challenge.<ref>{{Cite web |title=Semantic Web Challenge on Tabular Data to Knowledge Graph Matching |url=https://www.cs.ox.ac.uk/isg/challenges/sem-tab |access-date=2022-09-30 |website=www.cs.ox.ac.uk}}</ref>

=== Source control ===
The "annotate" function (also known as "blame" or "praise") used in [[source control]] systems such as [[Git (software)|Git]], [[Team Foundation Server]] and [[Apache Subversion|Subversion]] determines who [[Revision control#Common vocabulary|committed]] changes to the source code into the repository. This outputs a copy of the source code where each line is annotated with the name of the last contributor to edit that line (and possibly a revision number). This can help establish blame in the event a change caused a malfunction, or identify the author of brilliant code. 
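The per-line attribution described above can be sketched in Python. This is a simplified illustration, not how Git or Subversion implement the feature: the <code>annotate</code> function and the in-memory revision history are hypothetical, and lines are matched between consecutive revisions with the standard-library <code>difflib.SequenceMatcher</code>.

```python
from difflib import SequenceMatcher


def annotate(revisions):
    """Attribute each line of the newest revision to its last editor.

    `revisions` is a list of (author, lines) tuples ordered oldest first.
    Returns a list of (author, line) tuples for the newest revision.
    """
    annotated = []
    for author, lines in revisions:
        previous_lines = [line for _, line in annotated]
        matcher = SequenceMatcher(a=previous_lines, b=lines, autojunk=False)
        updated = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Unchanged lines keep the author recorded earlier.
                updated.extend(annotated[i1:i2])
            else:
                # Inserted or replaced lines belong to the current author;
                # for deletions the range j1:j2 is empty, so nothing is added.
                updated.extend((author, line) for line in lines[j1:j2])
        annotated = updated
    return annotated


if __name__ == "__main__":
    history = [
        ("alice", ["import sys", "print('hi')", "sys.exit(0)"]),
        ("bob", ["import sys", "print('hello')", "sys.exit(0)", "# done"]),
    ]
    for author, line in annotate(history):
        print(f"{author:>6} | {line}")
```

Real systems also record the revision identifier for each line and handle moved or copied code, which this sketch does not attempt.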
=== Java annotations ===
{{Main|Java annotation}}
A special case is the [[Java (programming language)|Java programming language]], where annotations can be used as a special form of syntactic [[metadata]] in the source code.<ref>{{cite web|url = http://java.sun.com/j2se/1.5.0/docs/guide/language/annotations.html|title = JDK 5.0 Developer's Guide: Annotations|date = 2007-12-18|access-date = 2008-03-05|publisher = [[Sun Microsystems]]| archive-url= https://web.archive.org/web/20080306060928/http://java.sun.com/j2se/1.5.0/docs/guide/language/annotations.html| archive-date= 6 March 2008 | url-status= live}}.</ref> Classes, methods, variables, parameters and packages may be annotated. The annotations can be embedded in [[class (file format)|class files]] generated by the compiler and may be retained by the [[Java virtual machine]] and thus influence the [[Run time (program lifecycle phase)|run-time]] behaviour of an application. It is possible to create meta-annotations out of the existing ones in Java.<ref>{{Cite news|last=Characterizing the Usage, Evolution and Impact of Java Annotations in Practice|title=Characterizing the Usage, Evolution and Impact of Java Annotations in Practice|url=https://hal.inria.fr/hal-02091516/document}}</ref>

=== Image annotation ===
{{main|Automatic image annotation}}
Automatic image annotation is used to classify images for [[image retrieval]] systems.<ref>{{cite journal |last1=Zhang |first1=D. |first2=M.M. |last2=Islam |first3=G. |last3=Lu |title=A review on automatic image annotation techniques |journal=Pattern Recognition |volume=45 |issue=1 |pages=346–362 |date=2012 |doi=10.1016/j.patcog.2011.05.013 |bibcode=2012PatRe..45..346Z}}</ref>

=== Computational biology ===
{{main|DNA annotation}}
Since the 1980s, [[molecular biology]] and [[bioinformatics]] have created the need for [[DNA annotation]]. 
DNA annotation or genome annotation is the process of identifying the locations of genes and all of the coding regions in a genome and determining what those genes do. An annotation (irrespective of the context) is a note added by way of explanation or commentary. Once a genome is sequenced, it needs to be annotated to make sense of it.<ref>{{Cite web|title=Medical Definition of Genome annotation|url=https://www.medicinenet.com/genome_annotation/definition.htm|access-date=2021-09-09|website=MedicineNet|language=en}}</ref>

=== Digital imaging ===
In the [[digital imaging]] community the term annotation is commonly used for visible metadata superimposed on an [[digital image|image]] without changing the underlying master image, such as [[sticky note]]s, virtual laser pointers, circles, arrows, and black-outs (cf. [[redaction]]).<ref>{{Cite journal|last1=Pelka|first1=Obioma|last2=Nensa|first2=Felix|last3=Friedrich|first3=Christoph M.|date=2018-11-12|title=Annotation of enhanced radiographs for medical image retrieval with deep convolutional neural networks|journal=PLOS ONE|language=en|volume=13|issue=11|pages=e0206229|doi=10.1371/journal.pone.0206229|issn=1932-6203|pmc=6231616|pmid=30419028|bibcode=2018PLoSO..1306229P|doi-access=free}}</ref> In the [[medical imaging]] community, an annotation is often referred to as a [[region of interest]] and is encoded in [[DICOM]] format.