==Assumptions==
The analysis of variance has been studied from several approaches, the most common of which uses a [[linear model]] that relates the response to the treatments and blocks. Note that the model is linear in parameters but may be nonlinear across factor levels. Interpretation is easy when data is balanced across factors but much deeper understanding is needed for unbalanced data.

===Textbook analysis using a normal distribution===
The analysis of variance can be presented in terms of a [[linear model]], which makes the following assumptions about the [[probability distribution]] of the responses:<ref>{{cite book |title = Statistical Methods |last1 = Snedecor |first1 = George W. |last2 = Cochran |first2 = William G. |year = 1967 |edition = 6th |page = 321}}</ref><ref>Cochran & Cox (1992, p 48)</ref><ref>Howell (2002, p 323)</ref><ref>{{cite book |last1 = Anderson |first1 = David R. |last2 = Sweeney |first2 = Dennis J. |last3 = Williams |first3 = Thomas A. |title = Statistics for business and economics |publisher = West Pub. Co |location = Minneapolis/St. Paul |year = 1996 |edition = 6th |isbn = 978-0-314-06378-6 |pages = 452–453}}</ref>
* [[Statistical independence|Independence]] of observations – this is an assumption of the model that simplifies the statistical analysis.
* [[normal distribution|Normality]] – the distributions of the [[Residual (statistics)|residuals]] are [[Normal distribution|normal]].
* Equality (or "homogeneity") of variances, called [[homoscedasticity]] – the variance of data in groups should be the same.

The separate assumptions of the textbook model imply that the [[errors and residuals in statistics|errors]] are independently, identically, and normally distributed for fixed effects models, that is, that the errors (<math>\varepsilon</math>) are independent and
<math display="block">\varepsilon \thicksim N(0, \sigma^2).</math>
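These assumptions are commonly checked on the residuals before the ''F''-test is interpreted (see the tests linked in the summary of assumptions below). The following sketch is illustrative only: the simulated data, group means, and sample sizes are assumptions made for the example, not part of any cited analysis. It uses SciPy's Shapiro–Wilk, Levene, and one-way ANOVA routines on three simulated groups.

<syntaxhighlight lang="python">
# Illustrative sketch (simulated data): checking the textbook ANOVA assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Three treatment groups of 20 observations each (illustrative values).
groups = [rng.normal(loc=mu, scale=1.0, size=20) for mu in (5.0, 5.5, 6.0)]

# Residuals of the one-way model: deviations from each group's mean.
residuals = np.concatenate([g - g.mean() for g in groups])

print("Shapiro-Wilk (normality of residuals):", stats.shapiro(residuals))
print("Levene (homogeneity of variances):", stats.levene(*groups))
print("One-way ANOVA F-test:", stats.f_oneway(*groups))
</syntaxhighlight>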
===Randomization-based analysis===
{{See also|Random assignment|Randomization test}}
In a [[Randomized controlled trial|randomized controlled experiment]], the treatments are randomly assigned to experimental units, following the experimental protocol. This randomization is objective and declared before the experiment is carried out. The objective random assignment is used to test the significance of the [[null hypothesis]], following the ideas of [[Charles Sanders Peirce|C. S. Peirce]] and [[Ronald Fisher]]. This design-based analysis was discussed and developed by [[Francis J. Anscombe]] at [[Rothamsted Experimental Station]] and by [[Oscar Kempthorne]] at [[Iowa State University]].<ref>Anscombe (1948)</ref> Kempthorne and his students make an assumption of ''unit treatment additivity'', which is discussed in the books of Kempthorne and [[David R. Cox]].<ref>{{cite book |last1=Hinkelmann |first1=Klaus |last2=Kempthorne |first2=Oscar |title=Design and Analysis of Experiments, Volume 2: Advanced Experimental Design |date=2005 |publisher=John Wiley |page=213 |isbn=978-0-471-70993-0 |url=https://books.google.com/books?id=GiYc5nRVKf8C&pg=PA213 |language=en}}</ref><ref>{{cite book |last1=Cox |first1=D. R. |title=Planning of Experiments |date=1992 |publisher=Wiley |isbn=978-0-471-57429-3 |language=en}}</ref>

====Unit-treatment additivity====
In its simplest form, the assumption of unit-treatment additivity<ref group="nb">Unit-treatment additivity is simply termed additivity in most texts. Hinkelmann and Kempthorne add adjectives and distinguish between additivity in the strict and broad senses. This allows a detailed consideration of multiple error sources (treatment, state, selection, measurement and sampling) on page 161.</ref> states that the observed response <math>y_{i,j}</math> from experimental unit <math>i</math> when receiving treatment <math>j</math> can be written as the sum of the unit's response <math>y_i</math> and the treatment effect <math>t_j</math>, that is<ref>Kempthorne (1979, p 30)</ref><ref name="Cox">Cox (1958, Chapter 2: Some Key Assumptions)</ref><ref>Hinkelmann and Kempthorne (2008, Volume 1, Throughout. Introduced in Section 2.3.3: Principles of experimental design; The linear model; Outline of a model)</ref>
<math display="block">y_{i,j}=y_i+t_j.</math>
The assumption of unit-treatment additivity implies that, for every treatment <math>j</math>, the <math>j</math>th treatment has exactly the same effect <math>t_j</math> on every experimental unit.

The assumption of unit-treatment additivity usually cannot be directly [[Falsifiability|falsified]], according to Cox and Kempthorne. However, many ''consequences'' of unit-treatment additivity can be falsified. For a randomized experiment, the assumption of unit-treatment additivity ''implies'' that the variance is constant for all treatments. Therefore, by [[contraposition]], a necessary condition for unit-treatment additivity is that the variance is constant.

The use of unit-treatment additivity and randomization is similar to the design-based inference that is standard in finite-population [[survey sampling]].

====Derived linear model====
Kempthorne uses the randomization distribution and the assumption of ''unit treatment additivity'' to produce a ''derived linear model'', very similar to the textbook model discussed previously.<ref>Hinkelmann and Kempthorne (2008, Volume 1, Section 6.3: Completely Randomized Design; Derived Linear Model)</ref> The test statistics of this derived linear model are closely approximated by the test statistics of an appropriate normal linear model, according to approximation theorems and simulation studies.<ref name="HinkelmannKempthorne">Hinkelmann and Kempthorne (2008, Volume 1, Section 6.6: Completely randomized design; Approximating the randomization test)</ref> However, there are differences. For example, the randomization-based analysis results in a small but (strictly) negative correlation between the observations.<ref>Bailey (2008, Chapter 2.14 "A More General Model", pp. 38–40)</ref><ref>Hinkelmann and Kempthorne (2008, Volume 1, Chapter 7: Comparison of Treatments)</ref> In the randomization-based analysis, there is ''no assumption'' of a ''normal'' distribution and certainly ''no assumption'' of ''independence''; on the contrary, ''the observations are dependent''.

The randomization-based analysis has the disadvantage that its exposition involves tedious algebra and extensive time. Since the randomization-based analysis is complicated and is closely approximated by the approach using a normal linear model, most teachers emphasize the normal linear model approach. Few statisticians object to model-based analysis of balanced randomized experiments.
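The closeness of the two approaches can be illustrated with a permutation (randomization) test of the ''F'' statistic for a completely randomized design. The sketch below is illustrative only: the simulated responses, the three equal-sized treatment groups, and the helper function <code>f_stat</code> are assumptions made for the example, not taken from the cited texts.

<syntaxhighlight lang="python">
# Illustrative sketch (simulated data): randomization distribution of the F statistic
# versus the normal-theory F-test for a completely randomized design.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.normal(size=30)            # responses of 30 experimental units
labels = np.repeat([0, 1, 2], 10)  # three treatments, 10 units each

def f_stat(y, labels):
    """F statistic of the one-way layout defined by the treatment labels."""
    return stats.f_oneway(*(y[labels == k] for k in np.unique(labels))).statistic

observed = f_stat(y, labels)
# Randomization distribution: recompute F under random re-assignments of treatments.
perm = np.array([f_stat(y, rng.permutation(labels)) for _ in range(5000)])
p_randomization = np.mean(perm >= observed)
p_normal_theory = stats.f_oneway(y[labels == 0], y[labels == 1], y[labels == 2]).pvalue

print(f"randomization p = {p_randomization:.3f}, normal-theory p = {p_normal_theory:.3f}")
</syntaxhighlight>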
====Statistical models for observational data====
However, when applied to data from non-randomized experiments or [[observational study|observational studies]], model-based analysis lacks the warrant of randomization.<ref>Kempthorne (1979, pp 125–126, "The experimenter must decide which of the various causes that he feels will produce variations in his results must be controlled experimentally. Those causes that he does not control experimentally, because he is not cognizant of them, he must control by the device of randomization." "[O]nly when the treatments in the experiment are applied by the experimenter using the full randomization procedure is the chain of inductive inference sound. It is ''only'' under these circumstances that the experimenter can attribute whatever effects he observes to the treatment and the treatment only. Under these circumstances his conclusions are reliable in the statistical sense.")</ref> For observational data, the derivation of confidence intervals must use ''subjective'' models, as emphasized by [[Ronald Fisher]] and his followers. In practice, the estimates of treatment effects from observational studies are often inconsistent. "Statistical models" and observational data are nonetheless useful for suggesting hypotheses, which should be treated very cautiously by the public.<ref>Freedman {{full citation needed|date=November 2012}}</ref>

===Summary of assumptions===
{{See also|Shapiro–Wilk test|Bartlett's test|Levene's test}}
The normal-model based ANOVA analysis assumes the independence, normality, and homogeneity of variances of the residuals. The randomization-based analysis assumes only the homogeneity of the variances of the residuals (as a consequence of unit-treatment additivity) and uses the randomization procedure of the experiment. Both these analyses require [[homoscedasticity]], as an assumption for the normal-model analysis and as a consequence of randomization and additivity for the randomization-based analysis.

However, studies of processes that change variances rather than means (called dispersion effects) have been successfully conducted using ANOVA.<ref>Montgomery (2001, Section 3.8: Discovering dispersion effects)</ref> There are ''no'' necessary assumptions for ANOVA in its full generality, but the ''F''-test used for ANOVA hypothesis testing has assumptions and practical limitations which are of continuing interest.

Problems which do not satisfy the assumptions of ANOVA can often be transformed to satisfy the assumptions. The property of unit-treatment additivity is not invariant under a "change of scale", so statisticians often use transformations to achieve unit-treatment additivity. If the response variable is expected to follow a parametric family of probability distributions, then the statistician may specify (in the protocol for the experiment or observational study) that the responses be transformed to stabilize the variance.<ref>Hinkelmann and Kempthorne (2008, Volume 1, Section 6.10: Completely randomized design; Transformations)</ref> Also, a statistician may specify that logarithmic transforms be applied to the responses which are believed to follow a multiplicative model.<ref name="Cox" /><ref>Bailey (2008)</ref> According to Cauchy's [[functional equation]] theorem, the [[logarithm]] is the only continuous transformation that transforms real multiplication to addition.{{citation needed|date=October 2013}}
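As an illustration of the last point, the sketch below (with simulated data chosen for the example, not taken from any cited source) shows how a logarithmic transformation converts a multiplicative treatment effect into an additive one and brings the group variances closer together before an ANOVA is applied.

<syntaxhighlight lang="python">
# Illustrative sketch (simulated data): a log transform turns a multiplicative
# treatment effect into an additive one and stabilizes the group variances.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
control = rng.lognormal(mean=0.0, sigma=0.5, size=50)
treated = 1.5 * rng.lognormal(mean=0.0, sigma=0.5, size=50)  # multiplicative effect of 1.5

print("Variances on the raw scale:", control.var(ddof=1), treated.var(ddof=1))
print("Variances on the log scale:", np.log(control).var(ddof=1), np.log(treated).var(ddof=1))
print("ANOVA on the log responses:", stats.f_oneway(np.log(control), np.log(treated)))
</syntaxhighlight>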