== Measures ==
=== Topological similarity ===
There are essentially two types of approaches that calculate topological similarity between ontological concepts:
* Edge-based: use the edges and their types as the data source;
* Node-based: use the nodes and their properties as the main data sources.
Other measures calculate the similarity between ontological instances:
* Pairwise: measure functional similarity between two instances by combining the semantic similarities of the concepts they represent;
* Groupwise: calculate the similarity directly, without combining the semantic similarities of the concepts they represent.

Some examples:

==== Edge-based ====
* Pekar et al.<ref>{{cite conference |last1=Pekar |first1=Viktor |conference=Proceedings of the 19th international conference on Computational linguistics – |last2=Staab |first2=Steffen |volume=1 |pages=1–7 |year=2002 |doi=10.3115/1072228.1072318|title=Taxonomy learning }}</ref>
* Cheng and Cline<ref>{{cite journal |doi=10.1081/BIP-200025659 |last1=Cheng |first1=J |last2=Cline |first2=M |last3=Martin |first3=J |last4=Finkelstein |first4=D |last5=Awad |first5=T |last6=Kulp |first6=D |last7=Siani-Rose |first7=MA |title=A knowledge-based clustering algorithm driven by Gene Ontology |journal=Journal of Biopharmaceutical Statistics |volume=14 |issue=3 |pages=687–700 |year=2004 |pmid=15468759|s2cid=25224811 }}</ref>
* Wu et al.<ref>{{cite journal |last1=Wu |first1=H |last2=Su |first2=Z |last3=Mao |first3=F |last4=Olman |first4=V |last5=Xu |first5=Y |title=Prediction of functional modules based on comparative genome analysis and Gene Ontology application |journal=Nucleic Acids Research |volume=33 |issue=9 |pages=2822–37 |year=2005 |pmid=15901854 |pmc=1130488 |doi=10.1093/nar/gki573}}</ref>
* Del Pozo et al.<ref>{{cite journal |last1=Del Pozo |first1=Angela |last2=Pazos |first2=Florencio |last3=Valencia |first3=Alfonso |title=Defining functional distances over Gene Ontology |journal=BMC Bioinformatics |volume=9 |pages=50 |year=2008 |pmid=18221506 |pmc=2375122 |doi=10.1186/1471-2105-9-50 |doi-access=free }}</ref>
* IntelliGO: Benabderrahmane et al.<ref name="ReferenceA" />

==== Node-based ====
* Resnik<ref>{{cite journal|author=Philip Resnik|year=1995|title=Using information content to evaluate semantic similarity in a taxonomy|journal=Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI'95)|editor=Chris S. Mellish|volume=1|pages=448–453|citeseerx=10.1.1.41.6956|bibcode=1995cmp.lg...11007R|arxiv=cmp-lg/9511007}}</ref>
** based on the notion of [[information content]]. The information content of a concept (term or word) is the negative logarithm of the probability of finding the concept in a given corpus.
** only considers the information content of the [[Lowest common ancestor|lowest common subsumer]] (lcs). A lowest common subsumer is a concept in a lexical taxonomy (e.g. WordNet) that subsumes both compared concepts and has the shortest distance from them. For example, animal and mammal are both subsumers of cat and dog, but mammal is a lower subsumer than animal for them.
* Lin<ref>Dekang Lin. 1998. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.55.1832&rep=rep1&type=pdf An Information-Theoretic Definition of Similarity]. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML '98), Jude W. Shavlik (Ed.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 296–304</ref>
** based on Resnik's similarity.
** considers the information content of lowest common subsumer (lcs) and the two compared concepts.
* Maguitman, [[Filippo Menczer|Menczer]], Roinestad and [[Alessandro Vespignani|Vespignani]]<ref>Ana Gabriela Maguitman, Filippo Menczer, Heather Roinestad, Alessandro Vespignani: [http://wwwconference.org/proceedings/www2005/docs/p107.pdf Algorithmic detection of semantic similarity]. WWW 2005: 107–116</ref>
** Generalizes Lin's similarity to arbitrary ontologies (graphs).
* Jiang and Conrath<ref>J. J. Jiang and D. W. Conrath. [https://arxiv.org/abs/cmp-lg/9709008 Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy]. In International Conference on Research on Computational Linguistics (ROCLING X), pages 9008+, September 1997</ref>
** based on Resnik's similarity.
** considers the information content of lowest common subsumer (lcs) and the two compared concepts to calculate the distance between the two concepts. The distance is later used in computing the similarity measure.
* [http://lcl.uniroma1.it/adw/ Align, Disambiguate, and Walk]: random walks on semantic networks<ref>M. T. Pilehvar, D. Jurgens and R. Navigli. [http://wwwusers.di.uniroma1.it/~navigli/pubs/ACL_2013_Pilehvar_Jurgens_Navigli.pdf Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity]. Proc. of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), Sofia, Bulgaria, August 4–9, 2013, pp. 1341–1351.</ref>
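The information-content measures listed above can be illustrated with a minimal sketch. It assumes information content is the negative logarithm of a concept's probability, and it uses the commonly cited forms of the Resnik, Lin and Jiang–Conrath formulas. The taxonomy, the corpus counts and all function names are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy taxonomy (child -> parent) and made-up corpus counts.
PARENT = {"cat": "mammal", "dog": "mammal", "mammal": "animal",
          "bird": "animal", "animal": None}
COUNTS = {"cat": 20, "dog": 30, "mammal": 10, "bird": 25, "animal": 15}


def ancestors(concept):
    """Return the concept and its subsumers, most specific first."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def probability(concept):
    """Probability of meeting the concept or anything it subsumes."""
    total = sum(COUNTS.values())
    subsumed = sum(n for c, n in COUNTS.items() if concept in ancestors(c))
    return subsumed / total


def ic(concept):
    """Information content: negative log probability."""
    return -math.log(probability(concept))


def lcs(a, b):
    """Lowest common subsumer: first ancestor of a that also subsumes b."""
    ancestors_b = set(ancestors(b))
    for candidate in ancestors(a):
        if candidate in ancestors_b:
            return candidate
    raise ValueError("concepts share no subsumer")


def resnik(a, b):
    """Resnik: information content of the lcs only."""
    return ic(lcs(a, b))


def lin(a, b):
    """Lin: lcs information content normalised by both concepts."""
    denominator = ic(a) + ic(b)
    if denominator == 0:  # both concepts are the root (assumed convention)
        return 1.0
    return 2 * ic(lcs(a, b)) / denominator


def jiang_conrath_distance(a, b):
    """Jiang-Conrath: a distance, to be converted into a similarity later."""
    return ic(a) + ic(b) - 2 * ic(lcs(a, b))
```

In this toy taxonomy, <code>lcs("cat", "dog")</code> is <code>"mammal"</code>, while cat and bird only share the root, so their Resnik similarity is zero.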

==== Node-and-relation-content-based ====
* applicable to ontology
* consider properties (content) of nodes
* consider types (content) of relations
* based on eTVSM<ref>{{cite book|last1=Dong|first1=Hai|series=Lecture Notes in Computer Science |title=On the Move to Meaningful Internet Systems: OTM 2009 Workshops|chapter=A Hybrid Concept Similarity Measure Model for Ontology Environment|date=2009|volume=5872|pages=848–857|url=https://www.researchgate.net/publication/44241193|bibcode=2009LNCS.5872..848D|doi=10.1007/978-3-642-05290-3_103|isbn=978-3-642-05289-7}}</ref>
* based on Resnik's similarity<ref>{{cite journal|last1=Dong|first1=Hai|title=A context-aware semantic similarity model for ontology environments|journal=Concurrency and Computation: Practice and Experience|date=2011|volume=23|issue=2|pages=505–524|url=https://www.researchgate.net/publication/220105255|doi=10.1002/cpe.1652|s2cid=412845}}</ref>

==== Pairwise ====
* maximum of the pairwise similarities
* composite average in which only the best-matching pairs are considered (best-match average)
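A minimal sketch of the two pairwise strategies above, assuming each instance is described by a list of concepts and that some concept-level similarity function is available. The best-match average shown here (mean of the two directional averages) is one common variant, not the only definition in use. The lookup table <code>SIM</code> and the concept names are made up.

```python
# Hypothetical concept-to-concept similarities (made-up values).
SIM = {("c1", "c3"): 0.9, ("c1", "c4"): 0.2,
       ("c2", "c3"): 0.4, ("c2", "c4"): 0.6}


def toy_sim(a, b):
    """Look up the similarity of a concept pair in the toy table."""
    return SIM[(a, b)]


def pairwise_max(concepts_a, concepts_b, sim):
    """Maximum similarity over all concept pairs of the two instances."""
    return max(sim(a, b) for a in concepts_a for b in concepts_b)


def best_match_average(concepts_a, concepts_b, sim):
    """Average the best match of every concept, in both directions."""
    best_a = [max(sim(a, b) for b in concepts_b) for a in concepts_a]
    best_b = [max(sim(a, b) for a in concepts_a) for b in concepts_b]
    mean_a = sum(best_a) / len(best_a)
    mean_b = sum(best_b) / len(best_b)
    return (mean_a + mean_b) / 2
```

The maximum rewards a single strong match, whereas the best-match average also reflects how well the remaining concepts are covered.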

==== Groupwise ====
* [[Jaccard index]]
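As a sketch of the groupwise idea, the Jaccard index can be applied directly to the sets of concepts attached to two instances, with no concept-level similarity involved. The function name and the convention for two empty sets are assumptions made for this example.

```python
def jaccard(concepts_a, concepts_b):
    """Jaccard index: size of the intersection over size of the union."""
    set_a, set_b = set(concepts_a), set(concepts_b)
    if not set_a and not set_b:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(set_a & set_b) / len(set_a | set_b)
```

For example, two instances annotated with <code>{"t1", "t2", "t3"}</code> and <code>{"t2", "t3", "t4"}</code> share two of four distinct concepts.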

=== Statistical similarity ===
Statistical similarity approaches can be learned from data, or predefined. [[Similarity learning]] can often outperform predefined similarity measures. Broadly speaking, these approaches build a statistical model of documents, and use it to estimate similarity.

* LSA ([[latent semantic analysis]]):<ref>{{cite journal | last1 = Landauer | first1 = T. K. | last2 = Dumais | first2 = S. T. | year = 1997 | title = A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge | url = http://www.stat.cmu.edu/%7Ecshalizi/350/2008/readings/Landauer-Dumais.pdf| journal = Psychological Review | volume = 104 | issue = 2| pages = 211–240 | doi=10.1037/0033-295x.104.2.211| citeseerx = 10.1.1.184.4759 | s2cid = 1144461 }}</ref><ref>{{cite journal |author=Landauer, T. K.|author2=Foltz, P. W.|author3=Laham, D.|name-list-style=amp|year=1998|title=Introduction to Latent Semantic Analysis|journal=Discourse Processes|volume=25|issue=2–3|pages=259–284|url=http://lsa.colorado.edu/papers/dp1.LSAintro.pdf|doi=10.1080/01638539809545028|citeseerx=10.1.1.125.109|s2cid=16625196 }}</ref> (+) vector-based, adds vectors to measure multi-word terms; (−) non-incremental vocabulary, long pre-processing times
* PMI ([[pointwise mutual information]]): (+) large vocab, because it uses any search engine (like Google); (−) cannot measure relatedness between whole sentences or documents
* SOC-PMI ([[second-order co-occurrence pointwise mutual information]]): (+) sort lists of important neighbor words from a large corpus; (−) cannot measure relatedness between whole sentences or documents
* GLSA (generalized latent semantic analysis): (+) vector-based, adds vectors to measure multi-word terms; (−) non-incremental vocabulary, long pre-processing times
* ICAN (incremental construction of an associative network): (+) incremental, network-based measure, good for spreading activation, accounts for second-order relatedness; (−) cannot measure relatedness between multi-word terms, long pre-processing times
* NGD ([[normalized Google distance]]): (+) large vocab, because it uses any search engine (like Google); (−) can measure relatedness between whole sentences or documents but the larger the sentence or document, the more ingenuity is required (Cilibrasi & Vitanyi, 2007).<ref>{{cite web | url = http://iknowate.blogspot.com/2011/10/google-similarity-distance.html | title = Google Similarity Distance }}</ref>
* TSS (Twitter semantic similarity):<ref>{{cite journal|author=Carrillo, F.|author2=Cecchi, G. A.|author3=Sigman, M.|author4=Slezak, D. F. |url=http://downloads.hindawi.com/journals/cin/2015/712835.pdf|title=Fast Distributed Dynamics of Semantic Networks via Social Media|journal=Computational Intelligence and Neuroscience|volume=2015|page=712835|date=2015|doi=10.1155/2015/712835|pmc=4449913|pmid=26074953|doi-access=free}}</ref> (+) large vocab, because it uses tweets from Twitter to compute the similarity; high temporal resolution, which allows it to capture high-frequency events; open source
* NCD ([[normalized compression distance]])
* ESA ([[explicit semantic analysis]]) based on [[Wikipedia]] and the [[Open Directory Project|ODP]]
* SSA (salient semantic analysis)<ref>{{cite web|url=http://www.samerhassan.com/images/4/48/Hassan.pdf|title=Samer Hassan}}{{dead link|date=December 2023}}</ref> which indexes terms using salient concepts found in their immediate context.
* n° of Wikipedia (noW),<ref>{{cite conference|author1=Wilson Wong|author2=Wei Liu|author3=Mohammed Bennamoun|url=http://doi.acm.org/10.1145/1232425.1232448|title=Featureless similarities for terms clustering using tree-traversing ants|conference=PCAR '06: Proceedings of the 2006 international symposium on Practical cognitive agents and robots|date=November 2006|pages= 177–191|doi=10.1145/1232425.1232448|url-access=subscription}}</ref> inspired by the game Six Degrees of Wikipedia,<ref>{{cite web|url=http://chronicle.com/wiredcampus/article/3041/six-degrees-of-wikipedia|title=6 Degrees of Wikipedia|date=May 28, 2008|website=The Chronicle of Higher Education|series=The Wired Campus|archive-url=https://web.archive.org/web/20080530043310/http://chronicle.com/wiredcampus/article/3041/six-degrees-of-wikipedia|archive-date=May 30, 2008|url-status=dead}}</ref> is a distance metric based on the hierarchical structure of Wikipedia. A directed-acyclic graph is first constructed and later, [[Dijkstra's algorithm|Dijkstra's shortest path algorithm]] is employed to determine the noW value between two terms as the geodesic distance between the corresponding topics (i.e. nodes) in the graph.
* VGEM (vector generation of an explicitly-defined multidimensional semantic space):<ref>{{cite web|title=Defining the Dimensions of the Human Semantic Space|url=https://raw.githubusercontent.com/lyoshenka/papers/master/pp718-veksler.pdf|author1=V. D. Veksler|author2=Ryan Z. Govostes|date= 2008}}</ref> (+) incremental vocab, can compare multi-word terms (−) performance depends on choosing specific dimensions
* [[SimRank]]
* NASARI:<ref>{{cite conference|author1=J. Camacho-Collados|author2= M. T. Pilehvar|author3=R. Navigli|url=http://aclweb.org/anthology/N/N15/N15-1059.pdf |title=NASARI: a Novel Approach to a Semantically-Aware Representation of Items|conference=Proceedings of the North American Chapter of the Association of Computational Linguistics (NAACL 2015)|location= Denver, US|pages=567–577|date=2015}}</ref> Sparse vector representations constructed by applying the hypergeometric distribution over the Wikipedia corpus in combination with BabelNet taxonomy. Cross-lingual similarity is currently also possible thanks to the multilingual and unified extension.<ref>{{cite conference|author1=J. Camacho-Collados|author2=M. T. Pilehvar|author3=R. Navigli|url=http://aclweb.org/anthology/P/P15/P15-1072.pdf|title= A Unified Multilingual Semantic Representation of Concepts|conference=Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL 2015)|location=Beijing, China|date=July 27–29, 2015|pages= 741–751}}</ref>
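Three of the count- and compression-based measures in the list above (PMI, NGD and NCD) can be sketched from their standard definitions. The hit counts are assumed to come from a search engine or corpus index that is not modelled here. The function names, the parameter names and the choice of zlib as the compressor are assumptions made for this example.

```python
import math
import zlib


def pmi(hits_x, hits_y, hits_xy, n):
    """Pointwise mutual information (base 2) from occurrence counts.

    hits_x, hits_y: documents containing each term; hits_xy: documents
    containing both; n: total number of indexed documents.
    """
    p_x, p_y, p_xy = hits_x / n, hits_y / n, hits_xy / n
    return math.log2(p_xy / (p_x * p_y))


def ngd(hits_x, hits_y, hits_xy, n):
    """Normalized Google distance from the same four counts."""
    log_x, log_y = math.log(hits_x), math.log(hits_y)
    log_xy = math.log(hits_xy)
    return (max(log_x, log_y) - log_xy) / (math.log(n) - min(log_x, log_y))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance, with zlib standing in for the
    ideal compressor."""
    c_x = len(zlib.compress(x))
    c_y = len(zlib.compress(y))
    c_xy = len(zlib.compress(x + y))
    return (c_xy - min(c_x, c_y)) / max(c_x, c_y)
```

PMI grows with relatedness, whereas NGD and NCD are distances, so smaller values indicate more closely related items.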

=== Semantics-based similarity ===
* Marker passing: Combining lexical decomposition for automated ontology creation and marker passing, the approach of Fähndrich et al. introduces a new type of semantic similarity measure.<ref>{{cite conference|author1=Fähndrich J.|author2= Weber S.|author3= Ahrndt S.|date=2016|chapter=Design and Use of a Semantic Similarity Measure for Interoperability Among Agents|editor1=Klusch M.|editor2= Unland R.|editor3=Shehory O.|editor4=Pokahr A.|editor5=Ahrndt S. |title=Multiagent System Technologies|conference= MATES 2016|series=Lecture Notes in Computer Science|volume=9872|publisher=Springer}} Available at [http://www.fähndrich.de author version]</ref> Here markers are passed from the two target concepts carrying an amount of activation. This activation might increase or decrease depending on the weight of the relations with which the concepts are connected. This combines edge- and node-based approaches and includes connectionist reasoning with symbolic information.
* Good common subsumer (GCS)-based semantic similarity measure<ref>{{cite conference|author1=C. d'Amato|author2=S. Staab|author3=N. Fanizzi|chapter=On the influence of description logics ontologies on conceptual similarity|title=Knowledge Engineering: Practice and Patterns| pages=48–63|date= 2008|doi=10.1007/978-3-540-87696-0_7}}</ref>

=== Semantic similarity networks ===
* A '''[[semantic similarity network]]''' (SSN) is a special form of [[semantic network]] designed to represent concepts and their semantic similarity. Its main contribution is reducing the complexity of calculating semantic distances. Bendeck (2004, 2008) introduced the concept of ''semantic similarity networks'' (SSN) as the specialization of a semantic network to measure semantic similarity from ontological representations.<ref name=bendeck>{{cite book|last=Bendeck|first=F.|year=2008|title=WSM-P Workflow Semantic Matching Platform, PhD dissertation, University of Trier, Germany |publisher=Verlag Dr. Hut|id={{ASIN|3899638549|country=de}}}}</ref> Implementations include genetic information handling.

=== Gold standards ===
Researchers have collected datasets with similarity judgements on pairs of words, which are used to evaluate the cognitive plausibility of computational measures. The long-standing gold standard is a list of 65 word pairs, published in 1965, for which human subjects judged word similarity.<ref>Rubenstein, Herbert, and John B. Goodenough. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.893.7406&rep=rep1&type=pdf Contextual correlates of synonymy]. Communications of the ACM, 8(10):627–633, 1965.</ref><ref>For a list of datasets, and an overview of the state of the art see [https://www.aclweb.org/aclwiki/index.php?title=Similarity_(State_of_the_art) https://www.aclweb.org/].</ref>

* RG65<ref>{{Cite journal|last1=Rubenstein|first1=Herbert|last2=Goodenough|first2=John B.|date=1965-10-01|title=Contextual correlates of synonymy|journal=Communications of the ACM|volume=8|issue=10|pages=627–633|doi=10.1145/365628.365657|s2cid=18309234|doi-access=free}}</ref>
* MC30<ref>{{Cite journal|last1=Miller|first1=George A.|last2=Charles|first2=Walter G.|date=1991-01-01|title=Contextual correlates of semantic similarity|journal=Language and Cognitive Processes|volume=6|issue=1|pages=1–28|doi=10.1080/01690969108406936|issn=0169-0965}}</ref>
* WordSim353<ref>{{Cite journal|date=2002-01-01|title=Placing search in context|journal=ACM Transactions on Information Systems |volume=20|pages=116–131|language=EN|doi=10.1145/503104.503110|s2cid=12956853|citeseerx=10.1.1.29.1912}}</ref>
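One way to evaluate a measure against such a dataset is to correlate its scores with the human judgements, for example using Spearman's rank correlation. The sketch below assumes that evaluation protocol, which the datasets themselves do not prescribe. All function names are hypothetical and the example scores further down are made up, not taken from any of the datasets listed.

```python
import math


def ranks(values):
    """Rank values from 1 upwards, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(human_scores, model_scores):
    """Spearman correlation: Pearson correlation of the two rankings."""
    return pearson(ranks(human_scores), ranks(model_scores))
```

With made-up human ratings <code>[3.9, 3.0, 1.2, 0.4]</code> and model scores <code>[0.95, 0.60, 0.70, 0.10]</code> for the same four word pairs, the model swaps the two middle pairs and <code>spearman</code> returns 0.8.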