Moby Project
{{short description|Collection of public-domain lexical resources}}
{{About|the public-domain lexical resource from books|the community-driven software containerization project created by [[Docker, Inc.]]|Moby (software)}}
{{multiple issues|
{{more footnotes needed|date=January 2016}}
{{one source|date=January 2016}}
{{more citations needed|date=January 2016}}
}}
The '''Moby Project''' is a collection of public-domain lexical resources created by [[Grady Ward]]. The resources were dedicated to the public domain, and are now mirrored at [[Project Gutenberg]]. {{As of|2007}}, it contains the largest free phonetic database, with 177,267 words and corresponding pronunciations.<ref name=acl_siglex />

== Hyphenator ==
The '''Moby Hyphenator II''' contains [[Hyphenation algorithm|hyphenations]] of 187,175 words and phrases (including 9,752 entries where no hyphenations are given, such as ''through'' and ''avoir''). The character encoding appears to be [[Mac OS Roman|MacRoman]], and hyphenation is indicated by a bullet ({{angbr|•}}, character value 165 decimal, or A5 hexadecimal). Some entries, however, combine actual hyphens with character 165, such as "{{not a typo|bar•ber-sur•geon}}".

The hyphenation choices are largely undocumented; the following examples give some flavour of the style used: {{not a typo|at•mos•phere; at•tend•ant; ca•pac•i•ty; un•col•or•a•ble}}.

== Languages ==
'''Moby Language II''' contains wordlists of five languages: [[French language|French]], [[German language|German]], [[Italian language|Italian]], [[Japanese language|Japanese]], and [[Spanish language|Spanish]]. Their statistics are:

{| class="wikitable"
|-
! Language
! Words
! Size (in [[byte]]s)
|-
! French
|align="right"| 138,257
|align="right"| 1,524,757
|-
! German
|align="right"| 159,809
|align="right"| 2,055,986
|-
! Italian
|align="right"| 60,453
|align="right"| 561,981
|-
! Japanese
|align="right"| 115,523
|align="right"| 934,783
|-
! Spanish
|align="right"| 86,059
|align="right"| 850,523
|-
! Total
|align="right"| 560,101
! 5,928,030
|}

However, some of the lists are contaminated: for example, the Japanese list contains English words such as ''abnormal'' and non-words such as ''{{not a typo|abcdefgh}}'' and ''{{not a typo|m,./}}''. The lists are also sorted inconsistently: the French list is in straight alphabetical order, while the German list gives the traditionally capitalized words in alphabetical order, followed by the traditionally lower-cased words in alphabetical order. The Italian list contains no capitalized words at all. The lists do not use accented characters, so "{{not a typo|e^tre}}" is how a user would look up the French word ''{{lang|fr|être}}'' ("to be").

== Part-of-Speech ==
'''Moby Part-of-Speech''' contains 233,356 words fully described by [[Lexical category|part(s) of speech]], listed in priority order. The format of the file is ''word\parts-of-speech'', with the following parts of speech being identified:

{| class="wikitable"
|-
! Part-of-speech
! Code
|-
| [[Noun]]
| N
|-
| [[Plural]]
| p
|-
| [[Noun phrase]]
| h
|-
| [[Verb]] (usually [[participle]])
| V
|-
| [[Transitive verb]]
| t
|-
| [[Intransitive verb]]
| i
|-
| [[Adjective]]
| A
|-
| [[Adverb]]
| v
|-
| [[Grammatical conjunction|Conjunction]]
| C
|-
| [[Preposition]]
| P
|-
| [[Interjection]]
| !
|-
| [[Pronoun]]
| r
|-
| [[Article (grammar)|Definite article]]
| D
|-
| [[Article (grammar)|Indefinite article]]
| I
|-
| [[Nominative]]
| o
|}

== Pronunciator ==
The '''Moby Pronunciator II''' contains 177,267 entries with corresponding pronunciations.
Most of the entries describe a single word, but approximately 79,000<ref>Obtained by running the UNIX command ''grep '.*[-_].* .*' mobypron.unc | wc -l'' after converting the line endings and correcting some encoding errors.</ref> contain hyphenated or multiple-word phrases, names, or [[lexemes]]. The Project Gutenberg distribution also contains a copy of the [[cmudict]] v0.3.

The file contains lines of the format ''word[/part-of-speech] pronunciation''. Each line ends with the ASCII [[carriage return]] character (CR, '\r', 0x0D, 13 in decimal).

The ''word'' field can include apostrophes (e.g. ''isn't''), hyphens (e.g. ''able-bodied''), and multiple words separated by underscores (e.g. ''{{not a typo|monkey_wrench}}''). Non-English words are generally rendered, as stated in the documentation, without accents or other diacritical marks. However, in 36 entries (e.g. ''{{not a typo|São_Miguel}}''), some non-ASCII accented characters remain, represented using [[Mac OS Roman]] encoding.

The part-of-speech field is used to disambiguate 770 words whose pronunciation differs depending on their part of speech. For example, for the words spelled ''close,'' the verb has the pronunciation {{IPAc-en|ˈ|k|l|oʊ|z}}, whereas the adjective is {{IPAc-en|ˈ|k|l|oʊ|s}}. The parts of speech have been assigned the following codes:

{| class="wikitable"
|-
! Part-of-speech
! Code
|-
| [[Noun]]
| n
|-
| [[Verb]]
| v
|-
| [[Adjective]]
| aj
|-
| [[Adverb]]
| av
|-
| [[Interjection]]
| interj
|}

Following this is the pronunciation. Several special symbols are present:

{| class="wikitable"
|-
! Symbol
! Meaning
|-
| _
| Used to separate words
|-
| '
| [[Primary stress]] on the following syllable
|-
| ,
| [[Secondary stress]] on the following syllable
|}

The rest of the symbols are used to represent [[International Phonetic Alphabet|IPA]] characters.
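
As an illustration only, the line format described above could be read with a short Python sketch such as the following. The names <code>Entry</code>, <code>parse_line</code> and <code>parse_file</code> are hypothetical and are not part of the Moby distribution, and the sample pronunciations are placeholders rather than real entries. The sketch assumes that the word field never contains a space or a slash and that the whole file can be decoded as Mac OS Roman.

```python
from typing import List, NamedTuple, Optional

# Part-of-speech codes as listed in the table above.
POS_CODES = {
    "n": "noun",
    "v": "verb",
    "aj": "adjective",
    "av": "adverb",
    "interj": "interjection",
}


class Entry(NamedTuple):
    """One parsed line (hypothetical container, not defined by the project)."""
    word: str
    pos: Optional[str]
    pronunciation: str


def parse_line(line: str) -> Entry:
    """Split 'word[/part-of-speech] pronunciation' into its fields."""
    # Assumption: the first space separates the word field from the pronunciation.
    head, _, pronunciation = line.strip("\r\n").partition(" ")
    # Assumption: a slash in the word field only ever introduces the POS code.
    word, _, code = head.partition("/")
    # Unknown codes are passed through unchanged rather than rejected.
    pos = POS_CODES.get(code, code) if code else None
    # Underscores join the words of a multi-word entry.
    return Entry(word.replace("_", " "), pos, pronunciation)


def parse_file(data: bytes) -> List[Entry]:
    """Decode the raw bytes as Mac OS Roman and parse every non-empty line."""
    text = data.decode("mac_roman")
    # splitlines() accepts the bare CR line endings used by the file.
    return [parse_line(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    # Placeholder pronunciations; real entries use the symbols described here.
    sample = "close/v PRON-A\rmonkey_wrench PRON-B\r".encode("mac_roman")
    for entry in parse_file(sample):
        print(entry)
```

The pronunciation field is left as an opaque string here; interpreting the stress marks and phoneme sequences would be a separate step.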
The pronunciations are generally consistent with a [[General American]] dialect of English which exhibits the [[father-bother merger]], [[hurry-furry merger]] and [[lot-cloth split]], but does not exhibit the [[cot-caught merger]] or [[wine-whine merger]].

Each phoneme is represented by a sequence of one or more characters. Some of the sequences are delimited with a slash character "/", as shown in the following table, but note that the sequence for {{IPAc-en|ɔɪ}} is delimited by ''two'' slash characters at either end:

{| class="wikitable"
|-
! Symbol
! [[Help:IPA/English|IPA]]
|-
| /&/
| æ
|-
| /-/
| ə
|-
| /@/
| ʌ, ə
|-
| /[@]/r
| ɜr, ər
|-
| /A/
| ɑ, ɑː
|-
| /aI/
| aɪ
|-
| /AU/
| aʊ
|-
| b
| b
|-
| d
| d
|-
| /D/
| ð
|-
| /dZ/
| dʒ
|-
| /E/
| ɛ
|-
| /eI/
| eɪ
|-
| f
| f
|-
| g
| ɡ
|-
| h
| h
|-
| hw
| hw
|-
| /i/
| iː
|-
| /I/
| ɪ
|-
| /j/
| j
|-
| /ju/
| juː
|-
| k
| k
|-
| l
| l
|-
| m
| m
|-
| n
| n
|-
| /N/
| ŋ
|-
| /O/
| ɔ, ɔː
|-
| //Oi//
| ɔɪ
|-
| /oU/
| oʊ
|-
| p
| p
|-
| r
| r
|-
| s
| s
|-
| /S/
| ʃ
|-
| t
| t
|-
| /T/
| θ
|-
| /tS/
| tʃ
|-
| /u/
| uː
|-
| /U/
| ʊ
|-
| v
| v
|-
| w
| w
|-
| z
| z
|-
| /Z/
| ʒ
|}

To this collection are added a number of extra sequences representing phonemes found in several other languages. These are used to encode the non-English words, phrases and names that are included in the database. The following table lists these extra phonemes, although it is unclear to what extent some of them result from encoding errors.

{| class="wikitable"
|-
! Symbol
! [[Help:IPA/English|IPA]]
|-
| A
| a
|-
| e
| e, ɛ
|-
| i
| i, ɪ
|-
| N
| [[Nasalisation]] of preceding vowel
|-
| o
| o
|-
| O
| [intent not clear]
|-
| R
| ʁ
|-
| S
| s
|-
| u
| u
|-
| V
| v, β, ʋ
|-
| W
| w
|-
| /x/
| x
|-
| /y/
| ø
|-
| Y
| y
|-
| /z/
| ts
|-
| Z
| z
|}

== Shakespeare ==
'''Moby Shakespeare''' contains the complete unabridged works of [[Shakespeare]].
This specific resource is not available from Project Gutenberg, but it is available in a 1993 version on the web.<ref>[http://shakespearereadingsociety.co.uk/texts/1993originals/mobyshak.txt mobyshak.txt 1993 version]</ref>

== Thesaurus ==
The '''Moby Thesaurus II''' contains 30,260 root words, with 2,520,264 [[synonym]]s and related terms – an average of 83.3 per root word. Each line consists of a list of [[comma-separated values]], with the first term being the root word, and all following words being related terms.

[[Grady Ward]] placed this thesaurus in the [[public domain]] in 1996. It is also available as a [[Debian]] package, although the package has been discontinued starting with [[Debian version history#Debian 11 (Bullseye)|Bullseye]].<ref>{{cite web |url=https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=964991 |title=RM: dict-moby-thesaurus -- RoQA; dead upstream (10+ years); python2-only; no extrenal {{sic|nolink=y}} deps; extremely low popcon |last=Tosi |first=Sandro |date=July 13, 2020 |website=Debian Bug report logs |access-date=May 10, 2022 |quote=}}</ref>

== Words ==
'''Moby Words II''' has been described as the largest wordlist in the world.<ref name=acl_siglex>{{cite web |url=https://www.clres.com/dict.html |title=ACL SIGLEX Resource Links |author=<!--Not stated--> |date=August 13, 2004 |website= |publisher=Special Interest Group on the Lexicon of the Association for Computational Linguistics |access-date=May 9, 2022 |quote=Moby Words: 610,000+ words and phrases. The largest word list in the world |archive-url=https://web.archive.org/web/20181215174820/https://www.clres.com/dict.html |archive-date=December 15, 2018}}</ref>{{additional citation needed|date=September 2016}} The distribution consists of the following 16 files:

{| class="wikitable"
|-
! Filename
! Words
! Description
|-
| ACRONYMS.TXT
| 6,213
| Common [[acronym]]s and [[abbreviation]]s
|-
| COMMON.TXT
| 74,550
| Common words present in two or more published dictionaries
|-
| COMPOUND.TXT
| 256,772
| Phrases, [[proper noun]]s, and [[acronym]]s not included in the common words file
|-
| CROSSWD.TXT
| 113,809
| Words included in the first edition of the [[Official Scrabble Players Dictionary]]
|-
| CRSWD-D.TXT
| 4,160
| Additions to the Official Scrabble Players Dictionary in the second edition
|-
| FICTION.TXT
| 467
| A list of the most commonly occurring [[substring]]s in the book ''[[The Joy Luck Club (novel)|The Joy Luck Club]]''
|-
| FREQ.TXT
| 1,000
| Most frequently occurring words in the [[English language]], listed in descending order
|-
| FREQ-INT.TXT
| 1,000
| Most frequently occurring words on [[Usenet]] in 1992, listed with corresponding percentage in decreasing order
|-
| KJVFREQ.TXT
| 1,185
| Most frequently occurring [[substring]]s in the [[King James Version of the Bible]], listed in descending order
|-
| NAMES.TXT
| 21,986
| Most common [[name]]s used in the United States and [[Great Britain]]
|-
| NAMES-F.TXT
| 4,946
| Common English [[female]] names
|-
| NAMES-M.TXT
| 3,897
| Common English [[male]] names
|-
| OFTENMIS.TXT
| 366
| Most commonly misspelled English words
|-
| PLACES.TXT
| 10,196
| Place names in the United States
|-
| SINGLE.TXT
| 354,984
| Single words excluding proper nouns, acronyms, compound words and phrases, but including [[Archaism|archaic]] words and significant [[variant spellings]]
|-
| USACONST.TXT
| 7,618
| [[United States Constitution]] including all amendments current to 1993
|-
! Total
! 863,149
| Not the total of unique words.
|-
! Total unique
! 639,995
| Total of single words, proper nouns, acronyms, and compound words and phrases (all of the files that contain unique words).
|}

== References ==
{{reflist}}

== External links ==
* Former Moby Project site (icon.shef.ac.uk/Moby/) – No longer accessible. View a [https://web.archive.org/web/20170930060409/http://icon.shef.ac.uk/Moby/ copy] made by the [[Wayback Machine]], as it was on 30 September 2017. ("Last modified: October 24, 2000") [http://ai1.ai.uga.edu/ftplib/natural-language/moby/ working download site].
* [http://www.gutenberg.org/ebooks/3201 Project Gutenberg downloads]
* ''[http://www.foo.be/docs/tpj/issues/vol4_4/tpj0404-0003.html Searching for Rhymes with Perl]''; [http://interglacial.com/~sburke/mpron/ corresponding code]
* [[Wiktionary:Appendix:Moby Thesaurus II]]
* http://digital.library.upenn.edu/webbin/gutbook/lookup?num=3201

[[Category:Public domain databases]]
[[Category:Corpora]]
[[Category:Linguistic research]]