{{Short description|Method of comparative linguistics}}
{{more footnotes|date=August 2014}}
'''Lexicostatistics''' is a method of [[comparative linguistics]] that involves comparing the percentage of [[lexical cognates]] between languages to determine their relationship. Lexicostatistics is related to the [[comparative method]] but does not reconstruct a [[proto-language]]. It is to be distinguished from [[glottochronology]], which attempts to use lexicostatistical methods to estimate the length of time since two or more languages diverged from a common earlier proto-language. Glottochronology is only one application of lexicostatistics, however; other applications need not share its assumption of a constant rate of change for basic lexical items.

The term "lexicostatistics" is misleading in that mathematical equations are used but not statistics. Features of a language other than the lexicon may be used, though this is unusual. Whereas the comparative method uses identified shared innovations to determine sub-groups, lexicostatistics does not identify these. Lexicostatistics is a distance-based method, whereas the comparative method considers language characters directly. The lexicostatistical method is simple and fast relative to the comparative method but has limitations (discussed below). It can be validated by cross-checking the trees produced by both methods.

==History==
Lexicostatistics was developed by [[Morris Swadesh]] in a series of articles in the 1950s, based on earlier ideas.<ref>{{cite journal |last1=Swadesh |first1=Morris |title=Towards greater accuracy in lexicostatistical dating |journal=International Journal of American Linguistics |date=1955 |volume=21 |issue=2 |pages=121–137|doi=10.1086/464321 |s2cid=144581963 |url=https://www.journals.uchicago.edu/doi/abs/10.1086/464321|url-access=subscription }}</ref><ref>{{cite journal |last1=Swadesh |first1=Morris |title=Lexicostatistical dating of prehistoric ethnic contacts |journal=Proceedings of the American Philosophical Society |date=1952 |volume=96 |pages=452–463}}</ref><ref>{{cite journal |last1=Swadesh |first1=Morris |title=Salish internal relationships |journal=International Journal of American Linguistics |date=1950 |volume=16 |issue=4 |pages=157–167|doi=10.1086/464084 |s2cid=145122561 }}</ref> The first known use of the concept was in 1834 by [[Dumont d'Urville]], who compared various "Oceanic" languages and proposed a method for calculating a coefficient of relationship. [[Dell Hymes|Hymes]] (1960) and Embleton (1986) both review the history of lexicostatistics.<ref>{{cite journal |last1=Hymes |first1=Dell |title=Lexicostatistics so far |journal=Current Anthropology |date=1960 |volume=1 |issue=1 |pages=3–44|doi=10.1086/200074 |s2cid=144569209 }}</ref><ref>{{cite book |last1=Embleton |first1=Sheila |title=Statistics in Historical Linguistics |date=1986 |publisher=Bochum}}</ref>

==Method==

===Create word list===
The aim is to generate a list of universally used meanings (for example ''hand'', ''mouth'', ''sky'', ''I''). Words are then collected for these meaning slots for each language being considered. Swadesh originally reduced a larger set of meanings to 200. He later found that it was necessary to reduce the list further but that he could include some meanings that were not in his original list, giving his later 100-item list. The [[Swadesh list]] in [[Wiktionary]] gives the full set of 207 meanings in a number of languages. Alternative lists that apply more rigorous criteria have been generated, e.g. the [[Dolgopolsky list]] and the [[Leipzig–Jakarta list]], as well as lists with a more specific scope; for example, [[Isidore Dyen|Dyen]], [[Joseph Kruskal|Kruskal]] and Black have 200 meanings for 84 [[Indo-European languages]] in digital form.<ref name=Dyen&al1992>{{cite journal |last1=Dyen |first1=Isidore |last2=Kruskal |first2=Joseph |last3=Black |first3=Paul |title=An Indoeuropean Classification, a Lexicostatistical Experiment |journal=Transactions of the American Philosophical Society |date=1992 |volume=82 |issue=5|pages=iii–132 |doi=10.2307/1006517 |jstor=1006517 }}</ref>

===Determine cognacies===
A trained and experienced linguist is needed to make cognacy decisions. The decisions may need to be refined as the state of knowledge increases, but lexicostatistics does not rely on all of them being correct. For each pair of words (in different languages) in the list, the cognacy judgement can be positive, negative or indeterminate. Sometimes a language has multiple words for one meaning, e.g. ''small'' and ''little'' for ''not big''.
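
As an illustration only, the three possible outcomes and the handling of multiple words per meaning can be sketched in Python. The names <code>Cognacy</code> and <code>judge_slot</code> are hypothetical, and the rule used to combine synonyms (positive if any pair of forms is judged cognate, negative only if every pair is judged non-cognate, otherwise indeterminate) is an assumption made for this sketch, not a convention taken from the sources. The English–German judgements are sample data.

```python
from enum import Enum
from itertools import product


class Cognacy(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    INDETERMINATE = "indeterminate"


def judge_slot(forms_a, forms_b, word_judgements):
    """Combine word-level judgements into one judgement for a meaning slot.

    word_judgements maps (form_a, form_b) to a Cognacy value supplied by a
    linguist; pairs with no recorded judgement count as indeterminate.
    The combination rule below is an assumption of this sketch.
    """
    results = [
        word_judgements.get((a, b), Cognacy.INDETERMINATE)
        for a, b in product(forms_a, forms_b)
    ]
    if Cognacy.POSITIVE in results:
        return Cognacy.POSITIVE
    if results and all(r is Cognacy.NEGATIVE for r in results):
        return Cognacy.NEGATIVE
    return Cognacy.INDETERMINATE


# Sample word-level judgements for English (first) and German (second).
word_judgements = {
    ("hand", "Hand"): Cognacy.POSITIVE,
    ("small", "klein"): Cognacy.NEGATIVE,
    ("little", "klein"): Cognacy.NEGATIVE,
}

hand = judge_slot(["hand"], ["Hand"], word_judgements)
not_big = judge_slot(["small", "little"], ["klein"], word_judgements)
unknown = judge_slot(["sky"], ["Himmel"], {})  # no judgement recorded yet
```

Keeping the word-level judgements separate from the slot-level result means a decision can be revised later without touching the rest of the data.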

===Calculate lexicostatistic percentages===
The lexicostatistic percentage for a language pair is the proportion of meanings for which the pair's words are cognate, calculated relative to the total number of meanings excluding indeterminate cases. This value is entered into an [[distance matrix|{{math|''N''×''N''}} table of distances]], where {{math|''N''}} is the number of languages being compared. When completed, this table is half-filled in [[triangular matrix|triangular]] form. The higher the proportion of cognacy, the more closely the languages are related.
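
The calculation can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names, the string labels for the three judgement outcomes, and the languages <code>A</code>, <code>B</code> and <code>C</code> with their five-meaning judgement lists are all invented for the example.

```python
def cognate_percentage(judgements):
    """Percentage of cognate meanings among the decided (non-indeterminate) ones."""
    positive = judgements.count("positive")
    negative = judgements.count("negative")
    decided = positive + negative
    if decided == 0:
        return None  # nothing to base a percentage on
    return 100.0 * positive / decided


def percentage_matrix(languages, pair_judgements):
    """Fill the lower triangle of an N x N table; other cells stay None."""
    n = len(languages)
    matrix = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            key = (languages[j], languages[i])
            matrix[i][j] = cognate_percentage(pair_judgements[key])
    return matrix


# Invented judgements for five meaning slots per language pair.
languages = ["A", "B", "C"]
pair_judgements = {
    ("A", "B"): ["positive", "positive", "positive", "negative", "indeterminate"],
    ("A", "C"): ["positive", "negative", "negative", "negative", "indeterminate"],
    ("B", "C"): ["positive", "positive", "negative", "negative", "indeterminate"],
}

matrix = percentage_matrix(languages, pair_judgements)
# matrix[1][0] is the A-B percentage: 3 cognate out of 4 decided meanings.
```

Only the cells below the diagonal are filled, because the percentage for a pair does not depend on the order of the two languages.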

===Create family tree===
Creation of the language tree is based solely on the table described above. Various sub-grouping methods can be used; the one adopted by Dyen, Kruskal and Black was:
* all lists are placed in a [[Pool (computer science)|pool]]
* the two closest members are removed and form a nucleus which is placed in the pool
* this step is repeated
* under certain conditions a nucleus becomes a group
* this is repeated until the pool only contains one group.

This requires lexical percentages to be calculated for nuclei and groups as well as for individual lists.
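
The pooling procedure can be approximated with a short Python sketch. It is a simplification, not the procedure of Dyen, Kruskal and Black: it omits the conditions under which a nucleus becomes a group, and it assumes that the percentage between two clusters is the average of the percentages between their members (average-linkage agglomerative clustering). The function name <code>build_tree</code> and the percentages for the languages <code>A</code> to <code>D</code> are invented for the example.

```python
def build_tree(percentages):
    """Repeatedly merge the two closest clusters until one remains.

    percentages maps frozenset({a, b}) to the cognate percentage of a and b.
    Returns a nested tuple of language names.
    """
    leaves = sorted({name for pair in percentages for name in pair})
    # Each pool entry is (subtree, member names of that subtree).
    pool = [(name, (name,)) for name in leaves]

    def similarity(x, y):
        # Assumption: average percentage over all cross-cluster member pairs.
        values = [percentages[frozenset((a, b))] for a in x[1] for b in y[1]]
        return sum(values) / len(values)

    while len(pool) > 1:
        # Pick the pair of pool entries with the highest percentage.
        i, j = max(
            ((i, j) for i in range(len(pool)) for j in range(i)),
            key=lambda ij: similarity(pool[ij[0]], pool[ij[1]]),
        )
        x = pool.pop(i)  # i > j, so removing i first keeps j valid
        y = pool.pop(j)
        pool.append(((y[0], x[0]), y[1] + x[1]))
    return pool[0][0]


# Invented percentages for four languages.
percentages = {
    frozenset(("A", "B")): 80.0,
    frozenset(("C", "D")): 70.0,
    frozenset(("A", "C")): 30.0,
    frozenset(("A", "D")): 25.0,
    frozenset(("B", "C")): 35.0,
    frozenset(("B", "D")): 20.0,
}

tree = build_tree(percentages)
# tree == (("A", "B"), ("C", "D"))
```

Other linkage rules, such as taking the highest or lowest member percentage, would fit the same loop and can yield different trees from the same table.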

==Applications==
A leading exponent of the application of lexicostatistics has been [[Isidore Dyen]].<ref>{{cite journal |last1=Dyen |first1=Isidore |title=The lexicostatistically determined relationship of a language group |journal=International Journal of American Linguistics |date=1962 |volume=28 |issue=3|pages=153–161 |doi=10.1086/464687 |s2cid=143070513 }}</ref><ref>{{cite journal |last1=Dyen |first1=Isidore |title=Lexicostatistically determined borrowing and taboo |journal=Language |date=1963 |volume=39 |issue=1 |pages=60–66|doi=10.2307/410762 |jstor=410762 }}</ref><ref>{{cite book |editor-last1=Dyen |editor-first1=Isidore  |title=Lexicostatistics in Genetic Linguistics |date=1973 |publisher=Mouton |location=The Hague}}</ref><ref>{{cite book |last1=Dyen |first1=Isidore |title=Linguistic Subgrouping and Lexicostatistics |date=1975 |publisher=Mouton |location=The Hague}}</ref> He used lexicostatistics to classify [[Austronesian languages]]<ref>{{cite journal |last1=Dyen |first1=Isidore |title=A lexicostatistical classification of the Austronesian languages. |journal=International Journal of American Linguistics |date=1965 |volume=19}}</ref> as well as [[Indo-European languages|Indo-European]] ones.<ref name=Dyen&al1992 /> A major study of the latter was reported by Dyen, Kruskal and Black (1992).<ref name=Dyen&al1992 /> Studies have also been carried out on [[Amerindian]] and [[African languages]].

===Pama-Nyungan===
The internal branching of the [[Pama-Nyungan languages|Pama-Nyungan]] language family has been a long-standing problem for Australianist linguistics, and the general consensus held either that internal connections between the 25+ different subgroups of Pama-Nyungan were impossible to reconstruct or that the subgroups were not in fact genetically related at all.<ref name = Dixon2002>{{cite book | quote = ''Australia provides a prototypical instance of a linguistic area. It has considerable time-depth, fairly uniform terrain leading to ease of interaction and communication, a fair proportion of reciprocal exogamous marriages, rampant multilingualism, and an open attitude to borrowing ... There is a basic uniformity to Australian languages which is the natural result of a long period of diffusion. Although no justification had been provided for 'Pama-Nyungan', it came to be accepted. People accepted it because it was accepted—as a species of belief. ... It is clear that 'Pama-Nyungan' cannot be supported as a genetic group. Nor is it a useful typological grouping.'' | first = Robert M.W. 
| last = Dixon | title = Australian languages: their nature and development | year = 2002 | pages = 48, 53 | publisher = Cambridge University Press}}</ref> In 2012, Claire Bowern and Quentin Atkinson published the results of applying computational [[Phylogenetics|phylogenetic]] methods to 194 [https://en.wiktionary.org/wiki/doculect doculects] representing all major subgroups and isolates of Pama-Nyungan.<ref name = Bowern&Atkinson2012>{{cite journal | last1 = Bowern | first1 = Claire | last2 = Atkinson | first2 = Quentin | title  = Computational phylogenetics and the internal structure of Pama-Nyungan | journal = Language | volume = 88 | issue = 4 | year = 2012 | pages = 817–845| doi = 10.1353/lan.2012.0081 | hdl = 1885/61360 | s2cid = 4375648 | hdl-access = free }}</ref> Their model "recovered" many of the branches and divisions that had previously been proposed and accepted by many other Australianists, while also providing some insight into the more problematic branches, such as [[Paman languages|Paman]] (which is complicated by the lack of data) and [[Ngumpin-Yapa languages|Ngumpin-Yapa]] (where the genetic picture is obscured by very high rates of borrowing between languages). Their dataset forms the largest of its kind for a [[hunter-gatherer]] language family, and the second largest overall after [[Austronesian languages|Austronesian]] ([https://abvd.shh.mpg.de/austronesian/ Greenhill et al. 2008] {{Webarchive|url=https://web.archive.org/web/20181219182305/https://abvd.shh.mpg.de/austronesian/ |date=2018-12-19 }}). They conclude that Pama-Nyungan languages are in fact not an exception to lexicostatistical methods, which have been applied successfully to other language families of the world.

==Criticisms==
Critics such as [[Harry Hoijer|Hoijer]] (1956) have shown that there are difficulties in finding equivalents for the meaning items, and many have found it necessary to modify Swadesh's lists.<ref>{{cite journal |last1=Hoijer |first1=Harry |title=Lexicostatistics: a critique |journal=Language |date=1956 |volume=32 |issue=1 |pages=49–60|doi=10.2307/410652 |jstor=410652 }}</ref> Gudschinsky (1956) questioned whether it was possible to obtain a universal list.<ref>{{cite journal |last1=Gudschinsky |first1=Sarah |title=The ABCs of lexicostatistics (glottochronology) |journal=Word |date=1956 |volume=12 |issue=2 |pages=175–210|doi=10.1080/00437956.1956.11659599 |doi-access=free }}</ref>

Factors such as [[loanword|borrowing]], tradition and [[taboo]] can skew the results, as with other methods. Lexicostatistics has sometimes been applied using [[lexical similarity]] rather than cognacy to find resemblances; it is then equivalent to [[mass comparison]].

The choice of meaning slots is subjective, as is the choice of [[synonym]]s.

==Improved methods==
Some modern computational methods of statistical hypothesis testing can be regarded as improvements on lexicostatistics in that they use similar word lists and distance measures.{{Citation needed|date=January 2025}}

==See also==
{{div col|colwidth=22em}}
*[[Basic English]]
*[[Cognate]]
*[[Comparative linguistics]]
*[[Comparative method]]
*[[Global Lexicostatistical Database]]
*[[Glottochronology]]
*[[Historical linguistics]]
*[[Indo-European studies]]
*[[Intercontinental Dictionary Series]]
*[[Linguistic distance]]
*[[Mass lexical comparison]]
*[[Proto-language]]
*[[Swadesh list]]
*[[Word list]]
{{div col end}}

==References==
{{Reflist}}

==Further reading==
* Dobson, Annette (1969). Lexicostatistical Grouping. Anthropological Linguistics 7, 216-221.
* Dobson, Annette and Black, Paul (1979). Multidimensional Scaling of some Lexicostatistical Data. Mathematical Scientist 1979/4, 55-61.
* McMahon, April and McMahon, Robert (2005). Language Classification by Numbers. Oxford University Press.
* Sankoff, David (1970). "On the Rate of Replacement of Word-Meaning Relationships." ''Language'' 46.564-569.
* Wittmann, Henri (1969). "A lexico-statistic inquiry into the diachrony of Hittite." ''Indogermanische Forschungen'' 74.1-10.[http://www.nou-la.org/ling/1969a-lexstatHitt.pdf]
* Wittmann, Henri (1973). "The lexicostatistical classification of the French-based Creole languages." ''Lexicostatistics in genetic linguistics: Proceedings of the Yale conference, April 3–4, 1971'', dir. Isidore Dyen, 89-99. La Haye: Mouton.[http://www.nou-la.org/ling/1973f-lexstatFC.pdf]

==External links==
{{Wiktionary|lexicostatistics}}
* [https://starling.rinet.ru/new100/ The Global Lexicostatistical Database], part of the [[Evolution of Human Languages]] project
* [https://web.archive.org/web/19991013184530/http://www.ntu.edu.au/education/langs/ielex/ IE database]
* [http://www.specgram.com/CLIV.1/08.phlogiston.cartoon.jiu.html A simplified explanation of the difference between glottochronology and lexicostatistics.]

{{Long-range comparative linguistics}}

[[Category:Historical linguistics]]
[[Category:Comparative linguistics]]
[[Category:Quantitative linguistics]]
[[Category:Mathematical linguistics]]