==Wiktionary data in natural language processing==
Wiktionary has [[semi-structured data]].{{sfn|Meyer|Gurevych|2012|p=140}} Wiktionary [[Lexicography|lexicographic]] data can be converted to a [[Machine-readable data|machine-readable format]] for use in [[natural language processing]] tasks.{{sfn|Zesch|Müller|Gurevych|2008|p=4|loc=Figure 1}}{{sfn|Meyer|Gurevych|2010|p=40}}{{sfn|Krizhanovsky, Transformation|2010|p=1}}

Mining Wiktionary data is a complex task. The main difficulties are:{{sfn|Hellmann|Auer|2013|p=302|loc=p. 16 in PDF}}
* (1) the constant and frequent changes to data and schemata,
* (2) the heterogeneity of schemata across Wiktionary language editions,{{efn|E.g. compare the entry structure and formatting rules in [[wikt:Wiktionary:Entry layout explained|English Wiktionary]] and [[wikt:ru:Викисловарь:Правила оформления статей|Russian Wiktionary]].}} and
* (3) the human-centric nature of a [[wiki]].

There are several [[Parsing|parsers]] for different Wiktionary language editions:{{sfn|Hellmann|Brekle|Auer|2012|p=3|loc=Table 1}}
* DBpedia Wiktionary:<ref>{{Cite web|url=http://dbpedia.org/Wiktionary|archiveurl=https://web.archive.org/web/20130504235547/http://dbpedia.org/Wiktionary|url-status=dead|title=DBpedia Wiktionary|archivedate=May 4, 2013}}</ref> a subproject of [[DBpedia]] that extracts data from the English, French, German, and Russian Wiktionaries; the data includes language, [[Part of speech|parts of speech]], definitions, [[Semantic relationship|semantic relations]] and translations. A declarative description of the [[Page schematic|page schema]],{{sfn|Hellmann|Brekle|Auer|2012|pp=8–9}} [[regular expression]]s{{sfn|Hellmann|Brekle|Auer|2012|p=10}} and a [[finite state transducer]]{{sfn|Hellmann|Brekle|Auer|2012|p=11}} are used to extract information.
* JWKTL ([[Java (programming language)|Java]] Wiktionary Library):<ref>{{Cite web|url=https://dkpro.github.io/dkpro-jwktl/|title=Welcome|website=DKPro JWKTL|access-date=June 23, 2019|archive-date=January 23, 2021|archive-url=https://web.archive.org/web/20210123133521/https://dkpro.github.io/dkpro-jwktl/|url-status=live}}</ref> provides access to English Wiktionary and German Wiktionary dumps via a Java [[Ubiquitous Knowledge Processing Lab#Wiktionary API|Wiktionary API]].{{sfn|Zesch|Müller|Gurevych|2008}} The data includes language, parts of speech, definitions, quotations, semantic relations, etymologies and translations. JWKTL is distributed under the [[Apache License]].
* wikokit:<ref>{{Cite web|url=https://github.com/componavt/wikokit|title=Wikokit – Machine-readable Wiktionary|date=December 19, 2022|via=GitHub|access-date=November 7, 2015|archive-date=October 2, 2020|archive-url=https://web.archive.org/web/20201002225056/https://github.com/componavt/wikokit|url-status=live}}</ref> a [[parser]] for English Wiktionary and Russian Wiktionary.{{sfn|Krizhanovsky, Transformation|2010}} The parsed data includes language, parts of speech, definitions, quotations,{{sfn|Smirnov et al.|2012}}{{efn|Quotations are extracted only from Russian Wiktionary.{{sfn|Smirnov et al.|2012}}}} semantic relations{{sfn|Krizhanovsky, Comparison|2010}} and translations. It is [[Multi-licensing#License compatibility|multi-licensed]] [[Open source|open-source]] software.
* [[Etymology|Etymological]] entries have been parsed in the Etymological [[WordNet]] project.<ref>{{Cite web|url=http://gerard.demelo.org/berkeley/|title=Gerard de Melo's Research at ICSI, Berkeley|website=gerard.demelo.org|access-date=March 6, 2023|archive-date=March 27, 2023|archive-url=https://web.archive.org/web/20230327013529/http://gerard.demelo.org/berkeley/|url-status=live}}</ref>

Examples of [[natural language processing]] tasks that have been solved with the help of Wiktionary data include:
* [[Rule-based machine translation]] between [[Dutch language|Dutch]] and [[Afrikaans]]; data from English Wiktionary, Dutch Wiktionary and Wikipedia were used with the [[Apertium]] [[machine translation]] platform.{{sfn|Otte|Tyers|2011}}
* Construction of a [[machine-readable dictionary]] by the parser NULEX, which integrates open linguistic resources: English Wiktionary, [[WordNet]], and [[VerbNet]].{{sfn|McFate|Forbus|2011}} NULEX [[Web scraping|scrapes]] English Wiktionary for tense information (verbs), and for plural forms and parts of speech (nouns).
* [[Speech recognition]] and [[Speech synthesis|synthesis]], where Wiktionary was used to automatically create pronunciation dictionaries.{{sfn|Schlippe|Ochs|Schultz|2012}} Word-pronunciation pairs were retrieved from six Wiktionary language editions ([[Czech language|Czech]], English, French, [[Spanish language|Spanish]], Polish, and German). Pronunciations are given in the [[International Phonetic Alphabet]].{{efn|If there are several IPA notations on a Wiktionary page – either for different languages or for pronunciation variants, then the first pronunciation was extracted.{{sfn|Schlippe|Ochs|Schultz|2012|p=4802}}}} The [[Speech recognition|ASR]] system based on English Wiktionary has the highest word error rate, where every third [[phoneme]] has to be changed.{{sfn|Schlippe|Ochs|Schultz|2012|p=4804}}
* [[Ontology engineering]]{{sfn|Meyer|Gurevych|2012}} and [[semantic network]] construction.<ref>{{Cite web |title=ConceptNet 5 |url=http://conceptnet5.media.mit.edu/ |url-status=dead |archive-url=https://web.archive.org/web/20111019152920/http://conceptnet5.media.mit.edu/ |archive-date=2011-10-19 |access-date=2023-09-23 |website=conceptnet5.media.mit.edu}}</ref>
* [[Ontology alignment|Ontology matching]].{{sfn|Lin|Krizhanovsky|2011}}
* [[Text simplification]]. Medero & [[Mari Ostendorf|Ostendorf]]{{sfn|Medero|Ostendorf|2009}} assessed vocabulary difficulty ([[Readability|reading level]] detection) with the help of Wiktionary data. They investigated properties of words extracted from Wiktionary entries: definition length and the counts of [[Part of speech|parts of speech]], senses, and translations. Medero & Ostendorf expected that
** (1) very common words will be more likely to have multiple parts of speech,
** (2) common words will be more likely to have multiple senses,
** (3) common words will be more likely to have been translated into multiple languages.
*: These features extracted from Wiktionary entries were useful in distinguishing word types that appear in [[Simple English Wikipedia]] articles from words that appear only in the comparable Standard English articles.
* [[Part-of-speech tagging]]. Li et al. (2012){{sfn|Li|Graça|Taskar|2012}} built multilingual POS-taggers for eight resource-poor languages on the basis of English Wiktionary and [[Part-of-speech tagging#Use of hidden Markov models|hidden Markov models]].{{efn|The source code and the results of POS-tagging are available at https://code.google.com/p/wikily-supervised-pos-tagger}}
* [[Sentiment analysis]].{{sfn|Chesley|Vincent|Xu|Srihari|2006}}

"[[Wikidata]]:Lexicographical data" was started in 2018 to provide structured data support to Wiktionaries. It stores word data for all languages in a machine-readable data model, under a dedicated "[[Lexeme]]" namespace in Wikidata. As of October 2021, the project had amassed over 600,000 lexeme entries in various languages.<ref>{{cite web|url=https://www.wikidata.org/w/index.php?title=Wikidata:Wiktionary&oldid=1510363143|title=Wikidata:Wiktionary|access-date=12 October 2012|archive-date=January 3, 2023|archive-url=https://web.archive.org/web/20230103132433/https://www.wikidata.org/w/index.php?title=Wikidata:Wiktionary&oldid=1510363143|url-status=live}}</ref>
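
The "first pronunciation" rule noted above for building pronunciation dictionaries can be illustrated with a minimal sketch. It is not the method of the cited study: the pattern <code>IPA_TEMPLATE</code>, the function names, and the sample pages are hypothetical, and the sketch assumes that pronunciations appear in the page source in the form <code><nowiki>{{IPA|en|/.../}}</nowiki></code>, which may differ between language editions.

```python
import re

# Assumption: pronunciations are written as {{IPA|<lang>|/.../}} in the page source.
# Real Wiktionary editions may use other template names or parameter orders.
IPA_TEMPLATE = re.compile(r"\{\{IPA\|(?P<lang>[a-z-]+)\|(?P<ipa>/[^/|}]+/)")


def first_pronunciation(wikitext):
    """Return the first IPA transcription found in the wikitext, or None."""
    match = IPA_TEMPLATE.search(wikitext)
    return match.group("ipa") if match else None


def build_pronunciation_dictionary(pages):
    """Map each headword to its first IPA transcription, skipping pages without one."""
    dictionary = {}
    for word, wikitext in pages.items():
        ipa = first_pronunciation(wikitext)
        if ipa is not None:
            dictionary[word] = ipa
    return dictionary


# Hypothetical sample pages; the second IPA variant for "cat" is ignored.
sample_pages = {
    "cat": "===Pronunciation===\n* {{IPA|en|/kæt/}}\n* {{IPA|en|/kat/}}",
    "dog": "===Noun===\n# An animal.",
}
pronunciations = build_pronunciation_dictionary(sample_pages)
```

Keeping only the first match makes the extraction simple, but it discards pronunciation variants and entries for other languages on the same page.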