{{Short description|Natural language processing}}
'''Semantic similarity''' is a [[Metric (mathematics)|metric]] defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or [[semantics|semantic content]]{{Citation needed|reason=What aspect of semantic content is related to distance?|date=March 2024}} as opposed to [[lexicographical]] similarity. Semantic similarity measures are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained by comparing the information that supports their meaning or describes their nature.<ref name=harispe2015>{{ cite journal | journal=Synthesis Lectures on Human Language Technologies |author1=Harispe S. |author2=Ranwez S.|author3= Janaqi S.|author4= Montmain J. | year=2015 | title=Semantic Similarity from Natural Language and Ontology Analysis| pages=1–254 | volume=8 |issue=1 | doi=10.2200/S00639ED1V01Y201504HLT027|arxiv=1704.05295 |s2cid=17428739 }}</ref><ref name=Feng2017>{{ cite journal | journal=Knowledge Engineering Review |volume=32 |author1=Feng Y. |author2=Bagheri E. |author3=Ensan F. |author4=Jovanovic J. |year=2017 |title=The state of the art in semantic relatedness: a framework for comparison| pages=1–30 | doi=10.1017/S0269888917000029|s2cid=52172371 }}</ref> The term semantic similarity is often confused with semantic relatedness. '''Semantic relatedness''' includes any relation between two terms, while semantic similarity only includes [[Is-a|"is a"]] relations.<ref>{{ cite journal | journal=GeoInformatica |author1=A. Ballatore |author2=M. Bertolotto |author3=D.C. 
Wilson | year=2014 | title=An evaluative baseline for geo-semantic relatedness and similarity| pages=747–767 | volume=18|issue=4 |arxiv=1402.3371 |doi=10.1007/s10707-013-0197-8 |bibcode=2014GInfo..18..747B |s2cid=17474023 }}</ref> For example, "car" is similar to "bus", but is also related to "road" and "driving".

Computationally, semantic similarity can be estimated by defining a [[topological]] similarity, by using [[Ontology (computer science)|ontologies]] to define the distance between terms/concepts. For example, a naive metric for the comparison of concepts ordered in a [[partially ordered set]] and represented as nodes of a [[directed acyclic graph]] (e.g., a [[Taxonomy (general)|taxonomy]]) would be the length of the shortest path linking the two concept nodes. Based on text analyses, semantic relatedness between units of language (e.g., words, sentences) can also be estimated using statistical means such as a [[vector space model]] to [[correlation|correlate]] words and textual contexts from a suitable [[text corpus]].

Proposed semantic similarity and relatedness measures are evaluated in two main ways. The first is based on datasets designed by experts and composed of word pairs with estimated degrees of semantic similarity or relatedness. The second is based on integrating the measures into specific applications such as information retrieval, recommender systems and natural language processing.
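The naive shortest-path metric mentioned above can be illustrated with a minimal sketch. The toy taxonomy, the function names and the conversion <code>1 / (1 + distance)</code> from a distance to a similarity score are assumptions made for illustration only; they are not taken from any of the cited works.

```python
from collections import deque

# Hypothetical toy taxonomy: each concept maps to its direct "is a" parents.
TAXONOMY = {
    "car": ["motor vehicle"],
    "bus": ["motor vehicle"],
    "motor vehicle": ["vehicle"],
    "bicycle": ["vehicle"],
    "vehicle": ["entity"],
    "road": ["way"],
    "way": ["entity"],
    "entity": [],
}


def build_undirected(taxonomy):
    """Turn the child -> parents mapping into an undirected adjacency map."""
    graph = {node: set() for node in taxonomy}
    for child, parents in taxonomy.items():
        for parent in parents:
            graph[child].add(parent)
            graph.setdefault(parent, set()).add(child)
    return graph


def shortest_path_length(taxonomy, source, target):
    """Breadth-first search; returns the number of edges, or None if unreachable."""
    graph = build_undirected(taxonomy)
    if source not in graph or target not in graph:
        return None
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def path_similarity(taxonomy, a, b):
    """Illustrative conversion of a path length into a score in (0, 1]."""
    dist = shortest_path_length(taxonomy, a, b)
    return None if dist is None else 1.0 / (1.0 + dist)


print(shortest_path_length(TAXONOMY, "car", "bus"))   # path via "motor vehicle"
print(shortest_path_length(TAXONOMY, "car", "road"))  # path via "entity"
```

In this toy taxonomy "car" and "bus" are two edges apart while "car" and "road" are five edges apart, which matches the intuition that "car" is similar to "bus" but only related to "road".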
== Terminology ==
The concept of '''semantic similarity''' is more specific than '''semantic relatedness''', as the latter includes relations such as [[antonymy]] and [[meronymy]], while similarity does not.<ref name="budanitsky2001">{{Cite journal |last1 = Budanitsky|first1 = Alexander|last2 = Hirst|first2 = Graeme|place = Pittsburgh|year = 2001|title = Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures|journal = Workshop on WordNet and Other Lexical Resources, Second Meeting of the North American Chapter of the Association for Computational Linguistics|url = https://ftp.cs.toronto.edu/pub/gh/Budanitsky+Hirst-2001.pdf}}</ref> However, much of the literature uses these terms interchangeably, along with terms like semantic distance. In essence, semantic similarity, semantic distance, and semantic relatedness all mean, "How much does term A have to do with term B?" The answer to this question is usually a number between −1 and 1, or between 0 and 1, where 1 signifies extremely high similarity.

== Visualization ==
An intuitive way of visualizing the semantic similarity of terms is by grouping together terms which are closely related and spacing wider apart the ones which are distantly related. This is also common in practice for [[mind maps]] and [[concept maps]].

A more direct way of visualizing the semantic similarity of two linguistic items can be seen with the [[Semantic folding|Semantic Folding]] approach. In this approach a linguistic item such as a term or a text can be represented by generating a [[pixel]] for each of its active semantic features in, e.g., a 128 × 128 grid. This allows for a direct visual comparison of the semantics of two items by comparing image representations of their respective feature sets.
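The grid-based comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the fingerprints are invented, and the overlap count and overlap ratio used here are simple set-based comparisons chosen for the example, not necessarily the comparison used by Semantic Folding itself.

```python
GRID_SIZE = 128  # side length taken from the 128 x 128 example in the text


def make_fingerprint(positions, size=GRID_SIZE):
    """Return a frozenset of active (row, col) cells, checking grid bounds."""
    cells = frozenset(positions)
    for row, col in cells:
        if not (0 <= row < size and 0 <= col < size):
            raise ValueError(f"cell {(row, col)} lies outside the {size} x {size} grid")
    return cells


def overlap(a, b):
    """Number of active cells shared by both fingerprints."""
    return len(a & b)


def overlap_ratio(a, b):
    """Shared cells divided by the size of the smaller fingerprint (0.0 if empty)."""
    if not a or not b:
        return 0.0
    return overlap(a, b) / min(len(a), len(b))


# Hypothetical fingerprints; real ones would have many more active cells.
dog = make_fingerprint([(0, 0), (0, 1), (5, 7), (10, 10)])
wolf = make_fingerprint([(0, 0), (0, 1), (5, 7), (90, 3)])
car = make_fingerprint([(100, 100), (101, 100), (90, 3)])

print(overlap(dog, wolf), overlap_ratio(dog, wolf))
print(overlap(dog, car), overlap_ratio(dog, car))
```

Storing only the active cells keeps the representation sparse; plotting the two sets as images on the same grid would give the visual comparison the section describes.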
== Applications ==
=== In biomedical informatics ===
Semantic similarity measures have been applied and developed in biomedical ontologies.<ref>{{cite journal|last1=Guzzi|first1=Pietro Hiram|first2=Marco |last2=Mina |first3=Mario|last3= Cannataro|first4= Concettina |last4=Guerra|title=Semantic similarity analysis of protein data: assessment with biological features and issues|journal=Briefings in Bioinformatics|year=2012|volume=13|pages=569–585|issue=5|doi=10.1093/bib/bbr066|pmid=22138322|doi-access=free}}</ref><ref name="ReferenceA">{{cite journal |last1=Benabderrahmane|first1=Sidahmed|last2=Smail Tabbone|first2=Malika|last3=Poch|first3=Olivier |last4=Napoli|first4=Amedeo|last5=Devignes|first5=Marie-Domonique. |title=IntelliGO: a new vector-based semantic similarity measure including annotation origin |journal=BMC Bioinformatics|volume=11 |pages=588 |year=2010 |pmid=21122125 |doi=10.1186/1471-2105-11-588 |pmc=3098105 |doi-access=free }}</ref> They are mainly used to compare [[genes]] and [[proteins]] based on the similarity of their functions<ref>{{cite journal | last1 = Chicco | first1 = D | last2 = Masseroli | first2 = M | year = 2015 | title = Software suite for gene and protein annotation prediction and similarity search | journal = IEEE/ACM Transactions on Computational Biology and Bioinformatics | volume = 12 | issue = 4 | pages = 837–843 | doi=10.1109/TCBB.2014.2382127 | pmid = 26357324 | hdl = 11311/959408 | s2cid = 14714823 | url = https://doi.org/10.1109/TCBB.2014.2382127 | hdl-access = free }} </ref> rather than on their [[sequence similarity]], but they are also being extended to other bioentities, such as diseases.<ref>{{cite journal |last1=Köhler |first1=S |last2=Schulz |first2=MH |last3=Krawitz |first3=P |last4=Bauer |first4=S |last5=Dolken |first5=S |last6=Ott |first6=CE |last7=Mundlos |first7=C |last8=Horn |first8=D |last9=Mundlos |first9=S |last10=Robinson |first10=Peter N. 
|title=Clinical diagnostics in human genetics with semantic similarity searches in ontologies |journal=American Journal of Human Genetics |volume=85 |issue=4 |pages=457–64 |year=2009 |pmid=19800049 |pmc=2756558 |doi=10.1016/j.ajhg.2009.09.003|display-authors=8 }}</ref>

These comparisons can be done using tools freely available on the web:
* ProteInOn can be used to find interacting proteins, find assigned GO terms and calculate the functional semantic similarity of [[UniProt]] proteins and to get the information content and calculate the functional semantic similarity of GO terms.<ref>{{cite web|url=http://xldb.fc.ul.pt/biotools/proteinon/|title=ProteInOn}}</ref>
* CMPSim provides a functional similarity measure between chemical compounds and metabolic pathways using [[ChEBI]] based semantic similarity measures.<ref>{{cite web|url=http://xldb.di.fc.ul.pt/biotools/cmpsim/|title=CMPSim}}</ref>
* CESSM provides a tool for the automated evaluation of GO-based semantic similarity measures.<ref>{{cite web|url=http://xldb.fc.ul.pt/biotools/cessm/|title=CESSM}}</ref>

=== In geoinformatics ===
Similarity is also applied in [[geoinformatics]] to find similar [[geographic feature]]s or feature types:<ref>{{cite journal|author1=Janowicz, K.|author2=Raubal, M.|author3=Kuhn, W.|title=The semantics of similarity in geographic information retrieval|journal=Journal of Spatial Information Science|volume=2|issue=2|year=2011|pages=29–57|doi=10.5311/josis.2011.2.3|doi-access=free|hdl=20.500.11850/41298|hdl-access=free}}</ref>
* SIM-DL similarity server<ref>{{cite conference | citeseerx = 10.1.1.172.5544 | title =Algorithm, implementation and application of the SIM-DL similarity server | pages = 128–145 | year = 2007 |series=Lecture Notes in Computer Science |number=4853 |conference=Second International Conference on Geospatial Semantics (GEOS 2007)}}</ref> can be used to compute similarities between concepts stored in geographic feature type ontologies.
* Similarity Calculator can be used to compute how well related two geographic concepts are in the Geo-Net-PT ontology.<ref>{{cite web|url=http://xldb.fc.ul.pt/wiki/Geographic_Similarity_calculator_GeoSSM|title=Geo-Net-PT Similarity Calculator}}</ref><ref>{{cite web|url=http://xldb.fc.ul.pt/wiki/Geo-Net-PT_02_in_English|title=Geo-Net-PT}}</ref>
* The OSM<ref>[https://wiki.openstreetmap.org/wiki/OSM_Semantic_Network "OSM Semantic Network"]. OSM Wiki.</ref> [[semantic network]] can be used to compute the semantic similarity of tags in [[OpenStreetMap]].<ref>{{cite journal|title=Geographic Knowledge Extraction and Semantic Similarity in OpenStreetMap|author1=A. Ballatore |author2=D.C. Wilson |author3=M. Bertolotto |journal=Knowledge and Information Systems|pages=61–81|url=http://irserver.ucd.ie/bitstream/handle/10197/3973/2012_-_Geographic_Knowledge_Extraction_and_Semantic_Similarity_in_OpenStreetMap_-_Ballatore_et_al.pdf?sequence=1}}</ref>

=== In computational linguistics ===
Several metrics use [[WordNet]], a manually constructed lexical database of English words. Despite the advantages of having human supervision in constructing the database, since the words are not automatically learned the database cannot measure relatedness between multi-word terms, and its vocabulary is non-incremental.<ref name=budanitsky2001 /><ref>{{cite book|author1=Kaur, I. |author2=Hornof, A.J. |title=Proceedings of the SIGCHI Conference on Human Factors in Computing Systems |chapter=A comparison of LSA, wordNet and PMI-IR for predicting user click behavior |name-list-style=amp |date=2005|pages=51–60|doi=10.1145/1054972.1054980|isbn=978-1-58113-998-3|s2cid=14347026 }}</ref>

=== In natural language processing ===
[[Natural language processing]] (NLP) is a field of computer science and linguistics. Sentiment analysis, natural language understanding and machine translation (automatically translating text from one human language to another) are a few of the major areas where semantic similarity is used.
For example, given one information resource on the internet, it is often of immediate interest to find similar resources. The [[Semantic Web]] provides semantic extensions to find similar data by content and not just by arbitrary descriptors.<ref>[http://www.di.uniba.it/~cdamato/PhDThesis_dAmato.pdf Similarity-based Learning Methods for the Semantic Web] (C. d'Amato, PhD Thesis)</ref><ref>{{cite journal|author1=Gracia, J. |author2=Mena, E. |name-list-style=amp |year=2008|url=http://disi.unitn.it/~p2p/RelatedWork/Matching/Gracia_wise08.pdf|title=Web-Based Measure of Semantic Relatedness|journal=Proceedings of the 9th International Conference on Web Information Systems Engineering (WISE '08)|pages=136–150}}</ref><ref>Raveendranathan, P. (2005). [http://www.d.umn.edu/~tpederse/Pubs/prath-thesis.pdf Identifying Sets of Related Words from the World Wide Web]. Master of Science Thesis, University of Minnesota Duluth.</ref><ref>Wubben, S. (2008). [http://ilk.uvt.nl/~swubben/publications/wubben2008-techrep.pdf Using free link structure to calculate semantic relatedness]. In ILK Research Group Technical Report Series, nr. 08-01, 2008.</ref><ref>Juvina, I., van Oostendorp, H., Karbor, P., & Pauw, B. (2005). [https://cloudfront.escholarship.org/dist/prd/content/qt0p7528tp/qt0p7528tp.pdf Towards modeling contextual information in web navigation]. In B. G. Bara & L. Barsalou & M. Bucciarelli (Eds.), 27th Annual Meeting of the Cognitive Science Society, CogSci2005 (pp. 1078–1083). Austin, Tx: The Cognitive Science Society, Inc.</ref><ref>Navigli, R., Lapata, M. (2007). [http://www.aaai.org/Papers/IJCAI/2007/IJCAI07-272.pdf Graph Connectivity Measures for Unsupervised Word Sense Disambiguation], Proc. of the 20th International Joint Conference on Artificial Intelligence (IJCAI 2007), Hyderabad, India, January 6–12th, 2007, pp. 
1683–1688.</ref><ref>{{cite journal|author=Pirolli, P.|year=2005|title=Rational analyses of information foraging on the Web|journal=Cognitive Science|volume=29|issue=3|pages=343–373|doi=10.1207/s15516709cog0000_20|pmid=21702778|doi-access=free}}</ref><ref>{{cite book|author=Pirolli, P.|author2=Fu, W.-T.|name-list-style=amp |year=2003|chapter=SNIF-ACT: A model of information foraging on the World Wide Web|title=Lecture Notes in Computer Science|volume=2702|pages=45–54|doi=10.1007/3-540-44963-9_8|isbn=978-3-540-40381-4|citeseerx=10.1.1.6.1506}}</ref><ref>Turney, P. (2001). [https://arxiv.org/abs/cs/0212033 Mining the Web for Synonyms: PMI versus LSA on TOEFL]. In L. De Raedt & P. Flach (Eds.), Proceedings of the Twelfth European Conference on Machine Learning (ECML-2001) (pp. 491–502). Freiburg, Germany.</ref>

[[Deep learning]] methods have become an accurate way to gauge semantic similarity between two text passages, in which each passage is first embedded into a continuous vector representation.<ref>{{Cite book|last1=Reimers|first1=Nils|last2=Gurevych|first2=Iryna|title=Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) |chapter=Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks |date=November 2019|chapter-url=https://www.aclweb.org/anthology/D19-1410|location=Hong Kong, China|publisher=Association for Computational Linguistics|pages=3982–3992|doi=10.18653/v1/D19-1410|arxiv=1908.10084|doi-access=free}}</ref><ref>{{Cite journal|last1=Mueller|first1=Jonas|last2=Thyagarajan|first2=Aditya|date=2016-03-05|title=Siamese Recurrent Architectures for Learning Sentence Similarity|url=https://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/12195|journal=Thirtieth AAAI Conference on Artificial Intelligence|volume=30 |doi=10.1609/aaai.v30i1.10350 |s2cid=16657628 
|language=en|doi-access=free}}</ref><ref>{{Citation|last1=Kiros|first1=Ryan|title=Skip-Thought Vectors|date=2015|url=http://papers.nips.cc/paper/5950-skip-thought-vectors.pdf|work=Advances in Neural Information Processing Systems 28|pages=3294–3302|editor-last=Cortes|editor-first=C.|publisher=Curran Associates, Inc.|access-date=2020-03-13|last2=Zhu|first2=Yukun|last3=Salakhutdinov|first3=Russ R|last4=Zemel|first4=Richard|last5=Urtasun|first5=Raquel|last6=Torralba|first6=Antonio|last7=Fidler|first7=Sanja|editor2-last=Lawrence|editor2-first=N. D.|editor3-last=Lee|editor3-first=D. D.|editor4-last=Sugiyama|editor4-first=M.}}</ref>

=== In ontology matching ===
Semantic similarity plays a crucial role in [[ontology alignment]], which aims to establish correspondences between [[Ontology components|entities]] from different ontologies. It involves quantifying the degree of similarity between concepts or terms using the information present in the ontology for each entity, such as labels, descriptions, and hierarchical relations to other entities.
Traditional metrics used in ontology matching are based on a lexical similarity between features of the entities, such as using the Levenshtein distance to measure the edit distance between entity labels.<ref>{{Cite conference|last1=Cheatham |first1=Michelle |last2=Hitzler |first2=Pascal |title=Advanced Information Systems Engineering |chapter=String Similarity Metrics for Ontology Alignment |date=2013 |editor-last=Alani |editor-first=Harith |editor2-last=Kagal |editor2-first=Lalana |editor3-last=Fokoue |editor3-first=Achille |editor4-last=Groth |editor4-first=Paul |editor5-last=Biemann |editor5-first=Chris |editor6-last=Parreira |editor6-first=Josiane Xavier |editor7-last=Aroyo |editor7-first=Lora |editor8-last=Noy |editor8-first=Natasha |editor9-last=Welty |editor9-first=Chris |conference =The Semantic Web – ISWC 2013 |series=Lecture Notes in Computer Science |volume=7908 |language=en |location=Berlin, Heidelberg |publisher=Springer |pages=294–309 |doi=10.1007/978-3-642-41338-4_19 |isbn=978-3-642-41338-4|s2cid=18372966 |doi-access=free }}</ref> However, it is difficult to capture the semantic similarity between entities using these metrics. For example, when comparing two ontologies describing conferences, the entities "Contribution" and "Paper" may have high semantic similarity since they share the same meaning. Nonetheless, due to their lexical differences, lexicographical similarity alone cannot establish this alignment. To capture these semantic similarities, [[Latent space|embeddings]] are being adopted in ontology matching.<ref name=":0">Sousa, G., Lima, R., & Trojahn, C. (2022). An eye on representation learning in ontology matching. ''OM@ISWC''.</ref> By encoding semantic relationships and contextual information, embeddings enable the calculation of similarity scores between entities based on the proximity of their vector representations in the embedding space.
This approach allows for efficient and accurate matching of ontologies since embeddings can model semantic differences in entity naming, such as homonymy, by assigning different embeddings to the same word based on different contexts.<ref name=":0" />

== Measures ==
=== Topological similarity ===
There are essentially two types of approaches that calculate topological similarity between ontological concepts:
* Edge-based: which use the edges and their types as the data source;
* Node-based: in which the main data sources are the nodes and their properties.
Other measures calculate the similarity between ontological instances:
* Pairwise: measure functional similarity between two instances by combining the semantic similarities of the concepts they represent
* Groupwise: calculate the similarity directly, without combining the semantic similarities of the concepts they represent
Some examples:
==== Edge-based ====
* Pekar et al.<ref>{{cite conference |last1=Pekar |first1=Viktor |conference=Proceedings of the 19th international conference on Computational linguistics – |last2=Staab |first2=Steffen |volume=1 |pages=1–7 |year=2002 |doi=10.3115/1072228.1072318|title=Taxonomy learning }}</ref>
* Cheng and Cline<ref>{{cite journal |doi=10.1081/BIP-200025659 |last1=Cheng |first1=J |last2=Cline |first2=M |last3=Martin |first3=J |last4=Finkelstein |first4=D |last5=Awad |first5=T |last6=Kulp |first6=D |last7=Siani-Rose |first7=MA |title=A knowledge-based clustering algorithm driven by Gene Ontology |journal=Journal of Biopharmaceutical Statistics |volume=14 |issue=3 |pages=687–700 |year=2004 |pmid=15468759|s2cid=25224811 }}</ref>
* Wu et al.<ref>{{cite journal |last1=Wu |first1=H |last2=Su |first2=Z |last3=Mao |first3=F |last4=Olman |first4=V |last5=Xu |first5=Y |title=Prediction of functional modules based on comparative genome analysis and Gene Ontology application |journal=Nucleic Acids Research |volume=33 |issue=9 |pages=2822–37 |year=2005 |pmid=15901854 |pmc=1130488 
|doi=10.1093/nar/gki573}}</ref>
* Del Pozo et al.<ref>{{cite journal |last1=Del Pozo |first1=Angela |last2=Pazos |first2=Florencio |last3=Valencia |first3=Alfonso |title=Defining functional distances over Gene Ontology |journal=BMC Bioinformatics |volume=9 |pages=50 |year=2008 |pmid=18221506 |pmc=2375122 |doi=10.1186/1471-2105-9-50 |doi-access=free }}</ref>
* IntelliGO: Benabderrahmane et al.<ref name="ReferenceA" />
==== Node-based ====
* Resnik<ref>{{cite journal|author=Philip Resnik|year=1995|title=Using information content to evaluate semantic similarity in a taxonomy|journal=Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI'95)|editor=Chris S. Mellish|volume=1|pages=448–453|citeseerx=10.1.1.41.6956|bibcode=1995cmp.lg...11007R|arxiv=cmp-lg/9511007}}</ref>
** based on the notion of [[information content]]. The information content of a concept (term or word) is the negative logarithm of the probability of finding the concept in a given corpus.
** only considers the information content of the [[Lowest common ancestor|lowest common subsumer]] (lcs). A lowest common subsumer is a concept in a lexical taxonomy (e.g. WordNet) which has the shortest distance from the two concepts compared. For example, animal and mammal are both subsumers of cat and dog, but mammal is a lower subsumer than animal for them.
* Lin<ref>Dekang Lin. 1998. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.55.1832&rep=rep1&type=pdf An Information-Theoretic Definition of Similarity]. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML '98), Jude W. Shavlik (Ed.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 296–304</ref>
** based on Resnik's similarity.
** considers the information content of the lowest common subsumer (lcs) and the two compared concepts.
* Maguitman, [[Filippo Menczer|Menczer]], Roinestad and [[Alessandro Vespignani|Vespignani]]<ref>Ana Gabriela Maguitman, Filippo Menczer, Heather Roinestad, Alessandro Vespignani: [http://wwwconference.org/proceedings/www2005/docs/p107.pdf Algorithmic detection of semantic similarity]. WWW 2005: 107–116</ref>
** Generalizes Lin's similarity to arbitrary ontologies (graphs).
* Jiang and Conrath<ref>J. J. Jiang and D. W. Conrath. [https://arxiv.org/abs/cmp-lg/9709008 Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy]. In International Conference on Research on Computational Linguistics (ROCLING X), pages 9008+, September 1997</ref>
** based on Resnik's similarity.
** considers the information content of lowest common subsumer (lcs) and the two compared concepts to calculate the distance between the two concepts. The distance is later used in computing the similarity measure.
* [http://lcl.uniroma1.it/adw/ Align, Disambiguate, and Walk]: Random walks on Semantic Networks<ref>M. T. Pilehvar, D. Jurgens and R. Navigli. [http://wwwusers.di.uniroma1.it/~navigli/pubs/ACL_2013_Pilehvar_Jurgens_Navigli.pdf Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity.]. Proc. of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), Sofia, Bulgaria, August 4–9, 2013, pp. 
1341–1351.</ref>
==== Node-and-relation-content-based ====
* applicable to ontology
* consider properties (content) of nodes
* consider types (content) of relations
* based on eTVSM<ref>{{cite book|last1=Dong|first1=Hai|series=Lecture Notes in Computer Science |title=On the Move to Meaningful Internet Systems: OTM 2009 Workshops|chapter=A Hybrid Concept Similarity Measure Model for Ontology Environment|date=2009|volume=5872|pages=848–857|url=https://www.researchgate.net/publication/44241193|bibcode=2009LNCS.5872..848D|doi=10.1007/978-3-642-05290-3_103|isbn=978-3-642-05289-7}}</ref>
* based on Resnik's similarity<ref>{{cite journal|last1=Dong|first1=Hai|title=A context-aware semantic similarity model for ontology environments|journal=Concurrency and Computation: Practice and Experience|date=2011|volume=23|issue=2|pages=505–524|url=https://www.researchgate.net/publication/220105255|doi=10.1002/cpe.1652|s2cid=412845}}</ref>
==== Pairwise ====
* maximum of the pairwise similarities
* composite average in which only the best-matching pairs are considered (best-match average)
==== Groupwise ====
* [[Jaccard index]]

=== Statistical similarity ===
Statistical similarity approaches can be learned from data, or predefined. [[Similarity learning]] can often outperform predefined similarity measures. Broadly speaking, these approaches build a statistical model of documents, and use it to estimate similarity.
* LSA ([[latent semantic analysis]]):<ref>{{cite journal | last1 = Landauer | first1 = T. K. | last2 = Dumais | first2 = S. T. | year = 1997 | title = A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge | url = http://www.stat.cmu.edu/%7Ecshalizi/350/2008/readings/Landauer-Dumais.pdf| journal = Psychological Review | volume = 104 | issue = 2| pages = 211–240 | doi=10.1037/0033-295x.104.2.211| citeseerx = 10.1.1.184.4759 | s2cid = 1144461 }}</ref><ref>{{cite journal |author=Landauer, T. 
K.|author2=Foltz, P. W.|author3=Laham, D.|name-list-style=amp|year=1998|title=Introduction to Latent Semantic Analysis|journal=Discourse Processes|volume=25|issue=2–3|pages=259–284|url=http://lsa.colorado.edu/papers/dp1.LSAintro.pdf|doi=10.1080/01638539809545028|citeseerx=10.1.1.125.109|s2cid=16625196 }}</ref> (+) vector-based, adds vectors to measure multi-word terms; (−) non-incremental vocabulary, long pre-processing times
* PMI ([[pointwise mutual information]]): (+) large vocab, because it uses any search engine (like Google); (−) cannot measure relatedness between whole sentences or documents
* SOC-PMI ([[second-order co-occurrence pointwise mutual information]]): (+) sort lists of important neighbor words from a large corpus; (−) cannot measure relatedness between whole sentences or documents
* GLSA (generalized latent semantic analysis): (+) vector-based, adds vectors to measure multi-word terms; (−) non-incremental vocabulary, long pre-processing times
* ICAN (incremental construction of an associative network): (+) incremental, network-based measure, good for spreading activation, accounts for second-order relatedness; (−) cannot measure relatedness between multi-word terms, long pre-processing times
* NGD ([[normalized Google distance]]): (+) large vocab, because it uses any search engine (like Google); (−) can measure relatedness between whole sentences or documents but the larger the sentence or document, the more ingenuity is required (Cilibrasi & Vitanyi, 2007).<ref>{{cite web | url = http://iknowate.blogspot.com/2011/10/google-similarity-distance.html | title = Google Similarity Distance }}</ref>
* TSS (Twitter semantic similarity):<ref>{{cite journal|author=Carrillo, F.|author2=Cecchi, G. A.|author3=Sigman, M.|author4=Slezak, D. F. 
|url=http://downloads.hindawi.com/journals/cin/2015/712835.pdf|title=Fast Distributed Dynamics of Semantic Networks via Social Media|journal=Computational Intelligence and Neuroscience|volume=2015|page=712835|date=2015|doi=10.1155/2015/712835|pmc=4449913|pmid=26074953|doi-access=free}}</ref> (+) large vocab, because it uses online tweets from Twitter to compute the similarity; high temporal resolution that allows capturing high-frequency events; open source
* NCD ([[normalized compression distance]])
* ESA ([[explicit semantic analysis]]) based on [[Wikipedia]] and the [[Open Directory Project|ODP]]
* SSA (salient semantic analysis)<ref>{{cite web|url=http://www.samerhassan.com/images/4/48/Hassan.pdf|title=Samer Hassan}}{{dead link|date=December 2023}}</ref> which indexes terms using salient concepts found in their immediate context.
* n° of Wikipedia (noW),<ref>{{cite conference|author1=Wilson Wong|author2=Wei Liu|author3=Mohammed Bennamoun|url=http://doi.acm.org/10.1145/1232425.1232448|title=Featureless similarities for terms clustering using tree-traversing ants|conference=PCAR '06: Proceedings of the 2006 international symposium on Practical cognitive agents and robots|date=November 2006|pages= 177–191|doi=10.1145/1232425.1232448|url-access=subscription}}</ref> inspired by the game Six Degrees of Wikipedia,<ref>{{cite web|url=http://chronicle.com/wiredcampus/article/3041/six-degrees-of-wikipedia|title=6 Degrees of Wikipedia|date=May 28, 2008|website=The Chronicle of Higher Education|series=The Wired Campus|archive-url=https://web.archive.org/web/20080530043310/http://chronicle.com/wiredcampus/article/3041/six-degrees-of-wikipedia|archive-date=May 30, 2008|url-status=dead}}</ref> is a distance metric based on the hierarchical structure of Wikipedia. 
A directed acyclic graph is first constructed and later, [[Dijkstra's algorithm|Dijkstra's shortest path algorithm]] is employed to determine the noW value between two terms as the geodesic distance between the corresponding topics (i.e. nodes) in the graph.
* VGEM (vector generation of an explicitly-defined multidimensional semantic space):<ref>{{cite web|title=Defining the Dimensions of the Human Semantic Space|url=https://raw.githubusercontent.com/lyoshenka/papers/master/pp718-veksler.pdf|author1=V. D. Veksler|author2=Ryan Z. Govostes|date= 2008}}</ref> (+) incremental vocab, can compare multi-word terms; (−) performance depends on choosing specific dimensions
* [[SimRank]]
* NASARI:<ref>{{cite conference|author1=J. Camacho-Collados|author2= M. T. Pilehvar|author3=R. Navigli|url=http://aclweb.org/anthology/N/N15/N15-1059.pdf |title=NASARI: a Novel Approach to a Semantically-Aware Representation of Items|conference=Proceedings of the North American Chapter of the Association of Computational Linguistics (NAACL 2015)|location= Denver, US|pages=567–577|date=2015}}</ref> Sparse vector representations constructed by applying the hypergeometric distribution over the Wikipedia corpus in combination with BabelNet taxonomy. Cross-lingual similarity is currently also possible thanks to the multilingual and unified extension.<ref>{{cite conference|author1=J. Camacho-Collados|author2=M. T. Pilehvar|author3=R. Navigli|url=http://aclweb.org/anthology/P/P15/P15-1072.pdf|title= A Unified Multilingual Semantic Representation of Concepts|conference= Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL 2015)|location=Beijing, China|date=July 27–29, 2015|pages= 741–751}}</ref>

=== Semantics-based similarity ===
* Marker passing: Combining lexical decomposition for automated ontology creation and marker passing, the approach of Fähndrich et al. 
introduces a new type of semantic similarity measure.<ref>{{cite conference|author1=Fähndrich J.|author2= Weber S.|author3= Ahrndt S.|date=2016|chapter=Design and Use of a Semantic Similarity Measure for Interoperability Among Agents|editor1=Klusch M.|editor2= Unland R.|editor3=Shehory O.|editor4=Pokahr A.|editor5=Ahrndt S. |title=Multiagent System Technologies|conference= MATES 2016|series=Lecture Notes in Computer Science|volume=9872|publisher=Springer}} Available at [http://www.fähndrich.de author version]</ref> Here markers are passed from the two target concepts carrying an amount of activation. This activation might increase or decrease depending on the weight of the relations with which the concepts are connected. This combines edge-based and node-based approaches and includes connectionist reasoning with symbolic information.
* Good common subsumer (GCS)-based semantic similarity measure<ref>{{cite conference|author1=C. d'Amato|author2=S. Staab|author3=N. Fanizzi|chapter=On the influence of description logics ontologies on conceptual similarity|title=Knowledge Engineering: Practice and Patterns| pages=48–63|date= 2008|doi=10.1007/978-3-540-87696-0_7}}</ref>

=== Semantics similarity networks ===
* A '''[[semantic similarity network]]''' (SSN) is a special form of [[semantic network]] designed to represent concepts and their semantic similarity. Its main contribution is reducing the complexity of calculating semantic distances. Bendeck (2004, 2008) introduced the concept of ''semantic similarity networks'' (SSN) as the specialization of a semantic network to measure semantic similarity from ontological representations.<ref name=bendeck>{{cite book|last=Bendeck|first=F.|year=2008|title=WSM-P Workflow Semantic Matching Platform, PhD dissertation, University of Trier, Germany |publisher=Verlag Dr. Hut|id={{ASIN|3899638549|country=de}}}}</ref> Implementations include genetic information handling.
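To make the node-based measures listed under Topological similarity more concrete, the following minimal sketch computes Resnik's similarity, Lin's similarity and the Jiang and Conrath distance on a toy taxonomy. The taxonomy, the concept probabilities and all function names are hypothetical and chosen only for illustration. The lowest common subsumer is picked here as the shared subsumer with the highest information content, which on this small tree coincides with the closest shared ancestor.

```python
import math

# Hypothetical toy taxonomy: each concept maps to its single parent (None for the root).
PARENT = {
    "cat": "mammal",
    "dog": "mammal",
    "mammal": "animal",
    "bird": "animal",
    "animal": None,
}

# Hypothetical probabilities of encountering each concept (or a descendant) in a corpus.
PROB = {"animal": 1.0, "mammal": 0.5, "bird": 0.5, "cat": 0.25, "dog": 0.25}


def information_content(concept):
    """IC(c) = -log p(c); rarer (more specific) concepts carry more information."""
    return -math.log(PROB[concept])


def subsumers(concept):
    """The concept itself followed by all of its ancestors up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def lowest_common_subsumer(a, b):
    """Shared subsumer with the highest information content."""
    common = set(subsumers(a)) & set(subsumers(b))
    return max(common, key=information_content)


def resnik(a, b):
    """Resnik: information content of the lowest common subsumer only."""
    return information_content(lowest_common_subsumer(a, b))


def lin(a, b):
    """Lin: 2 * IC(lcs) / (IC(a) + IC(b)), a score between 0 and 1."""
    denominator = information_content(a) + information_content(b)
    if denominator == 0:  # both concepts are the root
        return 1.0
    return 2 * resnik(a, b) / denominator


def jiang_conrath_distance(a, b):
    """Jiang and Conrath: IC(a) + IC(b) - 2 * IC(lcs), a distance (0 means identical)."""
    return information_content(a) + information_content(b) - 2 * resnik(a, b)


print(lowest_common_subsumer("cat", "dog"))
print(resnik("cat", "dog"), lin("cat", "dog"), jiang_conrath_distance("cat", "dog"))
```

With these made-up probabilities, "cat" and "dog" share the subsumer "mammal", whereas "cat" and "bird" only share the root "animal", whose information content is zero, so their Resnik and Lin similarities are both zero.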
=== Gold standards ===
Researchers have collected datasets with similarity judgements on pairs of words, which are used to evaluate the cognitive plausibility of computational measures. The gold standard to date is an old list of 65 word pairs for which humans have judged the word similarity.<ref>Rubenstein, Herbert, and John B. Goodenough. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.893.7406&rep=rep1&type=pdf Contextual correlates of synonymy]. Communications of the ACM, 8(10):627–633, 1965.</ref><ref>For a list of datasets, and an overview of the state of the art see [https://www.aclweb.org/aclwiki/index.php?title=Similarity_(State_of_the_art) https://www.aclweb.org/].</ref>
* RG65<ref>{{Cite journal|last1=Rubenstein|first1=Herbert|last2=Goodenough|first2=John B.|date=1965-10-01|title=Contextual correlates of synonymy|journal=Communications of the ACM|volume=8|issue=10|pages=627–633|doi=10.1145/365628.365657|s2cid=18309234|doi-access=free}}</ref>
* MC30<ref>{{Cite journal|last1=Miller|first1=George A.|last2=Charles|first2=Walter G.|date=1991-01-01|title=Contextual correlates of semantic similarity|journal=Language and Cognitive Processes|volume=6|issue=1|pages=1–28|doi=10.1080/01690969108406936|issn=0169-0965}}</ref>
* WordSim353<ref>{{Cite journal|date=2002-01-01|title=Placing search in context|journal=ACM Transactions on Information Systems |volume=20|pages=116–131|language=EN|doi=10.1145/503104.503110|s2cid=12956853|citeseerx=10.1.1.29.1912}}</ref>

== See also ==
{{Portal|Linguistics}}
* [[Analogy]]
* [[Componential analysis]]
* [[Coherence (linguistics)]]
* [[Levenshtein distance]]
* [[Semantic differential]]
* [[Semantic similarity network]]
* [[Terminology extraction]]
* [[Word2vec]]
* {{annotated link|tf-idf}}

== References ==
{{Reflist}}

== Sources ==
* {{cite journal | last1 = Chicco | first1 = D | last2 = Masseroli | first2 = M | year = 2015 | title = Software suite for gene and protein annotation prediction and similarity search | journal =
IEEE/ACM Transactions on Computational Biology and Bioinformatics | volume = 12 | issue = 4 | pages = 837–843 | doi=10.1109/TCBB.2014.2382127 | pmid = 26357324 | hdl = 11311/959408 | s2cid = 14714823 | url = https://doi.org/10.1109/TCBB.2014.2382127 | hdl-access = free }}
* {{cite journal|author1=Cilibrasi, R.L. |author2=Vitanyi, P.M.B. |name-list-style=amp |year=2007|title=The Google Similarity Distance|journal=IEEE Trans. Knowledge and Data Engineering|volume=19|issue=3|pages=370–383|doi=10.1109/TKDE.2007.48|arxiv=cs/0412098|s2cid=59777 }}
* {{cite journal | last1 = Dumais | first1 = S | year = 2003 | title = Data-driven approaches to information access | journal = Cognitive Science | volume = 27 | issue = 3| pages = 491–524 | doi=10.1207/s15516709cog2703_7| doi-access = free }}
* Gabrilovich, E. and Markovitch, S. (2007). [https://www.cs.technion.ac.il/~gabr/papers/ijcai-2007-sim.pdf Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis], Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), Hyderabad, India, January 2007.
* Lee, M. D., Pincombe, B., & Welsh, M. (2005). [https://cloudfront.escholarship.org/dist/prd/content/qt48g155nq/qt48g155nq.pdf An empirical evaluation of models of text document similarity]. In B. G. Bara & L. Barsalou & M. Bucciarelli (Eds.), 27th Annual Meeting of the Cognitive Science Society, CogSci2005 (pp. 1254–1259). Austin, TX: The Cognitive Science Society, Inc.
* Lemaire, B., & Denhière, G. (2004). [http://cogprints.org/3779/01/cogsci04_2.pdf Incremental construction of an associative network from a corpus]. In K. D. Forbus & D. Gentner & T. Regier (Eds.), 26th Annual Meeting of the Cognitive Science Society, CogSci2004. Hillsdale, NJ: Lawrence Erlbaum Publisher.
* {{cite journal|author=Lindsey, R.
|author2=Veksler, V.D.|author3=Grintsvayg, A.|author4=Gray, W.D.|year=2007|title=The Effects of Corpus Selection on Measuring Semantic Relatedness|journal=Proceedings of the 8th International Conference on Cognitive Modeling, Ann Arbor, MI|url=http://sitemaker.umich.edu/iccm2007.org/files/lindsey__veksler__grintsvayg____gray.pdf}}
* Navigli, R., Lapata, M. (2010). [http://www.dsi.uniroma1.it/~navigli/pubs/PAMI_2010_Navigli_Lapata.pdf "An Experimental Study of Graph Connectivity for Unsupervised Word Sense Disambiguation"]. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 32(4), IEEE Press, 2010, pp. 678–692.
* {{cite journal|author1=Veksler, V.D. |author2=Gray, W.D. |name-list-style=amp |year=2006|title=Test Case Selection for Evaluating Measures of Semantic Distance|journal=Proceedings of the 28th Annual Meeting of the Cognitive Science Society, CogSci2006|url=http://csjarchive.cogsci.rpi.edu/Proceedings/2006/docs/p2624.pdf}}
* Wong, W., Liu, W. & Bennamoun, M. (2008) Featureless Data Clustering. In: M. Song and Y. Wu; Handbook of Research on Text and Web Mining Technologies; IGI Global. {{ISBN|978-1-59904-990-8}} (the use of NGD and noW for term and URI clustering)

== External links ==
* [http://www.similarity-blog.de/?page_id=3 List of related literature]

=== Survey articles ===
* ''Conference article'': C. d'Amato, S. Staab, N. Fanizzi. 2008. [https://dl.acm.org/citation.cfm?id=1434078 On the Influence of Description Logics Ontologies on Conceptual Similarity]. In Proceedings of the 16th international conference on Knowledge Engineering: Practice and Patterns, pages 48–63. Acitrezza, Italy, Springer-Verlag
* ''Journal article'' on the more general topic of relatedness, also including similarity: Z. Zhang, A. Gentile, F. Ciravegna. 2013. [https://www.cambridge.org/core/journals/natural-language-engineering/article/recent-advances-in-methods-of-lexical-semantic-relatedness-a-survey/35BA94697B86B4B797FCF3ACCDE24FBD Recent advances in methods of lexical semantic relatedness – a survey]. Natural Language Engineering 19 (4), 411–479, Cambridge University Press
* ''Book'': S. Harispe, S. Ranwez, S. Janaqi, J. Montmain. 2015. [http://www.morganclaypool.com/doi/10.2200/S00639ED1V01Y201504HLT027 Semantic Similarity from Natural Language and Ontology Analysis], Morgan & Claypool Publishers.

{{Natural language processing}}

[[Category:Computational linguistics]]
[[Category:Statistical distance]]
[[Category:Semantic relations| ]]