
==Examples==

===Illustration of calculation methods===

Suppose that [[Aesop]] is dissatisfied with his [[The Tortoise and the Hare|classic experiment]] in which one [[tortoise]] was found to beat one [[hare]] in a race, and decides to carry out a significance test to discover whether the results could be extended to tortoises and hares in general. He collects a sample of 6 tortoises and 6 hares, and makes them all run his race at once. The order in which they reach the finishing post (their rank order, from first to last crossing the finish line) is as follows, writing T for a tortoise and H for a hare:
:T H H H H H T T T T T H
What is the value of ''U''?
* Using the direct method, we take each tortoise in turn and count the number of hares it beats, getting 6, 1, 1, 1, 1, 1, which means that {{math|1=''U<sub>T</sub>'' = 11}}. Alternatively, we could take each hare in turn and count the number of tortoises it beats. In this case, we get 5, 5, 5, 5, 5, 0, so {{math|1=''U<sub>H</sub>'' = 25}}. Note that these two values sum to {{math|1=''U<sub>T</sub>'' + ''U<sub>H</sub>'' = 36}}, which is {{math|6×6}}, the product of the two sample sizes.
* Using the indirect method:
: Rank the animals by the time they take to complete the course, giving the first animal home rank 12, the second rank 11, and so forth, down to rank 1 for the last.
: The sum of the ranks achieved by the tortoises is {{math|1=12 + 6 + 5 + 4 + 3 + 2 = 32}}.
:: Therefore {{math|1=''U<sub>T</sub>'' = 32 − (6×7)/2 = 32 − 21 = 11}} (the same value as given by the direct method).
:: The sum of the ranks achieved by the hares is {{math|1=11 + 10 + 9 + 8 + 7 + 1 = 46}}, leading to {{math|1=''U<sub>H</sub>'' = 46 − 21 = 25}}.
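Both calculation methods can be checked with a short script. The following is a minimal sketch rather than a reference implementation: the function names <code>u_direct</code> and <code>u_from_ranks</code> are illustrative, and the code assumes a finishing order with no ties, as in the race above.

```python
# Finishing order from the example, first to last (T = tortoise, H = hare).
order = "THHHHHTTTTTH"

def u_direct(order, group):
    """Direct method: for each member of `group`, count the rivals finishing after it."""
    return sum(
        sum(1 for later in order[i + 1:] if later != group)
        for i, animal in enumerate(order)
        if animal == group
    )

def u_from_ranks(order, group):
    """Indirect method: rank sum minus n(n + 1)/2, with the winner given the top rank."""
    total = len(order)
    # First home gets rank `total`, last home gets rank 1 (assumes no ties).
    rank_sum = sum(total - i for i, animal in enumerate(order) if animal == group)
    n = order.count(group)
    return rank_sum - n * (n + 1) // 2

u_t, u_h = u_direct(order, "T"), u_direct(order, "H")
print(u_t, u_h, u_t + u_h)  # 11 25 36
assert (u_t, u_h) == (u_from_ranks(order, "T"), u_from_ranks(order, "H"))
```

The final assertion confirms that the two methods agree for both species, and the printed sum equals the product of the sample sizes.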
<!--
===Illustration of object of test===
A second example race illustrates the point that the Mann–Whitney ''U'' test does not test for inequality of [[median]]s, but rather for difference of distributions. Consider another hare and tortoise race, with 19 participants of each species, in which the outcomes are as follows, from first to last past the finishing post:

:H H H H H H H H H T T T T T T T T T '''T''' '''H''' H H H H H H H H H T T T T T T T T T

If we simply compared medians, we would conclude that the median time for tortoises is less than the median time for hares, because the median tortoise here (in bold) comes in at position 19, and thus actually beats the median hare (in bold), which comes in at position 20. However, the value of ''U'' is 100 (using the quick method of calculation described above, we see that each of 10 tortoises beats each of 10 hares, so {{math|1=''U'' = 10×10}}). Consulting tables, or using the approximation below, we find that this ''U'' value gives significant evidence that hares tend to have lower completion times than tortoises ({{math|''p'' < 0.05}}, two-tailed). Obviously these are extreme distributions that would be spotted easily, but in larger samples something similar could happen without it being so apparent. Notice that the problem here is not that the two distributions of ranks have different [[variance]]s; they are mirror images of each other, so their variances are the same, but they have very different [[skewness]].
-->

===Example statement of results===
In reporting the results of a Mann–Whitney ''U'' test, it is important to state:<ref>{{Cite journal |last1=Fritz |first1=Catherine O. |last2=Morris |first2=Peter E. |last3=Richler |first3=Jennifer J. |date=2012 |title=Effect size estimates: Current use, calculations, and interpretation. |url=http://doi.apa.org/getdoi.cfm?doi=10.1037/a0024338 |journal=Journal of Experimental Psychology: General |language=en |volume=141 |issue=1 |pages=2–18 |doi=10.1037/a0024338 |pmid=21823805 |issn=1939-2222}}</ref>
*A measure of the central tendencies of the two groups (means or medians; since the Mann–Whitney ''U'' test is an ordinal test, medians are usually recommended).
*The value of ''U'' (perhaps with some measure of effect size, such as [[#Common language effect size|common language effect size]] or [[#Rank-biserial correlation|rank-biserial correlation]]).
*The sample sizes.
*The significance level.
In practice, some of this information may already have been supplied, and common sense should be used in deciding whether to repeat it. A typical report might run,
:"Median latencies in groups E and C were 153 and 247 ms; the distributions in the two groups differed significantly (Mann–Whitney {{math|1=''U'' = 10.5}}, {{math|1=''n''<sub>1</sub> = ''n''<sub>2</sub> = 8}}, {{math|1=''P'' < 0.05}} two-tailed)."
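The effect sizes mentioned in the list above can be derived from the figures in such a report. The sketch below assumes that the reported ''U'' is the smaller of the two possible statistics (the sample report does not say which one it is) and uses the form {{math|1=''r'' = 1 − 2''U''/(''n''<sub>1</sub>''n''<sub>2</sub>)}} for the rank-biserial correlation; the variable names are illustrative.

```python
# Values taken from the sample report above.
u, n1, n2 = 10.5, 8, 8

# Common language effect size: the fraction of cross-group pairs counted in U
# (a tied pair contributes one half).
cles = u / (n1 * n2)

# Rank-biserial correlation in the form r = 1 - 2U / (n1 * n2).
# Assumption: u is the smaller of the two U statistics, so r comes out positive.
rank_biserial = 1 - 2 * u / (n1 * n2)

print(cles, rank_biserial)  # 0.1640625 0.671875
```

If the larger ''U'' had been reported instead, the same formula would give the same magnitude with the opposite sign.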
A statement that does full justice to the statistical status of the test might run,
:"Outcomes of the two treatments were compared using the Wilcoxon–Mann–Whitney two-sample rank-sum test. The treatment effect (difference between treatments) was quantified using the Hodges–Lehmann (HL) estimator, which is consistent with the Wilcoxon test.<ref>{{cite book |title= Nonparametric Statistical Methods |author1=Myles Hollander |author2=Douglas A. Wolfe |publisher= Wiley-Interscience |edition=2 |year=1999 |isbn= 978-0471190455}}</ref> This estimator (HLΔ) is the median of all possible differences in outcomes between a subject in group B and a subject in group&nbsp;A. A non-parametric 0.95 confidence interval for HLΔ accompanies these estimates, as does ρ, an estimate of the probability that a randomly chosen subject from population B has a higher weight than a randomly chosen subject from population&nbsp;A. The median [quartiles] weights for subjects on treatments A and B are, respectively, 147 [121, 177] and 151 [130, 180] kg. Treatment A decreased weight by HLΔ = 5 kg (0.95 CL [2, 9] kg, {{math|1=2''P'' = 0.02}}, {{math|1=''ρ'' = 0.58}})."

However, it would be rare to find such an extensive report in a document whose major topic was not statistical inference.
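The two summary quantities in the fuller statement, HLΔ and ρ, both come from the set of all pairwise differences between the groups. The following is a minimal sketch under stated assumptions: the weights in <code>group_a</code> and <code>group_b</code> are invented purely for illustration and are not the data behind the quoted report, and ties (absent here) are counted as one half when estimating ρ. The confidence interval is not computed.

```python
from itertools import product
from statistics import median

# Hypothetical weights in kg, invented purely for illustration.
group_a = [121, 140, 147, 160, 177]
group_b = [130, 145, 151, 166, 180]

# Hodges-Lehmann shift estimate: median of all pairwise differences (B minus A).
differences = [b - a for a, b in product(group_a, group_b)]
hl_delta = median(differences)

# rho: estimated probability that a random B subject outweighs a random A subject,
# counting a tie as one half.
wins = sum(1.0 if d > 0 else 0.5 if d == 0 else 0.0 for d in differences)
rho = wins / len(differences)

print(hl_delta, rho)  # 5 0.6
```

With these invented data, 15 of the 25 pairwise differences are positive, so ρ is 0.6, and the middle value of the sorted differences is 5 kg.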