
==Multivariate analysis==
{{see also|Univariate analysis}}

'''Multivariate analysis''' ('''MVA''') is based on the principles of multivariate statistics. Typically, MVA is used to address situations where multiple measurements are made on each experimental unit and the relations among these measurements and their structures are important.<ref name=":0">{{Citation|last1=Olkin|first1=I.|title=Multivariate Analysis: Overview|date=2001-01-01|url=http://www.sciencedirect.com/science/article/pii/B0080430767004721|encyclopedia=International Encyclopedia of the Social & Behavioral Sciences|pages=10240–10247|editor-last=Smelser|editor-first=Neil J.|publisher=Pergamon|isbn=9780080430768|access-date=2019-09-02|last2=Sampson|first2=A. R.|editor2-last=Baltes|editor2-first=Paul B.}}</ref> A modern, overlapping categorization of MVA includes:<ref name=":0" /> 
* Normal and general multivariate models and distribution theory
* The study and measurement of relationships
* Probability computations of multidimensional regions
* The exploration of data structures and patterns

Multivariate analysis can be complicated by the desire to include physics-based analysis to calculate the effects of variables for a hierarchical "system-of-systems". Studies that wish to use multivariate analysis are often stalled by the dimensionality of the problem. These concerns are often eased through the use of [[surrogate model]]s, highly accurate approximations of the physics-based code. Because surrogate models take the form of an equation, they can be evaluated very quickly. This enables large-scale MVA studies: while a [[Monte Carlo simulation]] across the design space is difficult with physics-based codes, it becomes trivial when evaluating surrogate models, which often take the form of [[Response surface methodology|response-surface]] equations.
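
The following is a minimal sketch of why a Monte Carlo study is cheap once a surrogate is available. It assumes a second-order response-surface equation in two inputs; the coefficients, the input ranges, and the function names <code>surrogate</code> and <code>monte_carlo</code> are hypothetical and chosen only for illustration, not taken from any particular physics-based code.

```python
import random
import statistics

# Hypothetical second-order response-surface coefficients (illustrative only).
B0, B1, B2, B12, B11, B22 = 10.0, 2.0, -1.5, 0.5, 0.8, 0.3


def surrogate(x1, x2):
    """Evaluate the assumed response-surface equation at one design point."""
    return B0 + B1 * x1 + B2 * x2 + B12 * x1 * x2 + B11 * x1 ** 2 + B22 * x2 ** 2


def monte_carlo(n_samples, seed=0):
    """Sample both inputs uniformly on [-1, 1] and summarise the surrogate output."""
    rng = random.Random(seed)  # seeded so that the run is reproducible
    outputs = [
        surrogate(rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
        for _ in range(n_samples)
    ]
    return statistics.fmean(outputs), statistics.stdev(outputs)


mean, spread = monte_carlo(100_000)
print(f"mean response: {mean:.3f}, standard deviation: {spread:.3f}")
```

Each evaluation is a handful of arithmetic operations, so a hundred thousand samples are practical here, whereas the same number of runs of a physics-based code would typically not be.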

===Types of analysis===
Many different models are used in MVA, each with its own type of analysis:
# [[Multivariate analysis of variance]] (MANOVA) extends the [[analysis of variance]] to cover cases where there is more than one dependent variable to be analyzed simultaneously; see also [[Multivariate analysis of covariance]] (MANCOVA).
# Multivariate regression attempts to determine a formula that can describe how elements in a vector of variables respond simultaneously to changes in others. For linear relations, regression analyses here are based on forms of the [[general linear model]]. Some suggest that multivariate regression is distinct from multivariable regression; however, this distinction is debated and is not applied consistently across scientific fields.<ref>{{cite journal | pmc = 3518362 | pmid=23153131 | doi=10.2105/AJPH.2012.300897 | volume=103 | title=Multivariate or multivariable regression? | year=2013 | journal=Am J Public Health | pages=39–40 | last1 = Hidalgo | first1 = B | last2 = Goodman | first2 = M| issue=1 }}</ref>
# [[Principal components analysis]] (PCA) creates a new set of orthogonal variables that contain the same information as the original set. It rotates the axes of variation to give a new set of orthogonal axes, ordered so that they summarize decreasing proportions of the variation.
# [[Factor analysis]] is similar to PCA but allows the user to extract a specified number of synthetic variables, fewer than the original set, leaving the remaining unexplained variation as error. The extracted variables are known as latent variables or factors; each one may be supposed to account for covariation in a group of observed variables.
# [[Canonical correlation analysis]] finds linear relationships between two sets of variables; it is the generalised (i.e. canonical) version of bivariate<ref>Unsophisticated analysts of bivariate Gaussian problems may find useful a crude but accurate [http://www.dioi.org/sta.htm#sdsx method] of gauging probability by simply taking the sum ''S'' of the ''N'' residuals' squares, subtracting the sum ''Sm'' at minimum, dividing this difference by ''Sm'', multiplying the result by (''N'' - 2) and taking the inverse anti-ln of half that product.</ref> correlation.
# Redundancy analysis (RDA) is similar to canonical correlation analysis but allows the user to derive a specified number of synthetic variables from one set of (independent) variables that explain as much variance as possible in another (dependent) set. It is a multivariate analogue of [[Regression analysis|regression]].<ref>{{cite journal|last=Van Den Wollenberg|first=Arnold L.|title=Redundancy analysis an alternative for canonical correlation analysis|journal=Psychometrika|volume=42|issue=2|year=1977|pages=207–219|doi=10.1007/BF02294050}}</ref>
# [[Correspondence analysis]] (CA), or reciprocal averaging, finds (like PCA) a set of synthetic variables that summarise the original set. The underlying model assumes chi-squared dissimilarities among records (cases).
# [[Canonical correspondence analysis|Canonical (or "constrained") correspondence analysis]] (CCA) summarises the joint variation in two sets of variables (like redundancy analysis); it combines correspondence analysis and multivariate regression analysis. The underlying model assumes chi-squared dissimilarities among records (cases).
# [[Multidimensional scaling]] comprises various algorithms to determine a set of synthetic variables that best represent the pairwise distances between records. The original method is [[principal coordinates analysis]] (PCoA; based on PCA).
# [[Discriminant function|Discriminant analysis]], or canonical variate analysis, attempts to establish whether a set of variables can be used to distinguish between two or more groups of cases.
# [[Linear discriminant analysis]] (LDA) computes a linear predictor from two sets of normally distributed data to allow for classification of new observations.
# [[Cluster Analysis|Clustering systems]] assign objects into groups (called clusters) so that objects (cases) from the same cluster are more similar to each other than objects from different clusters.
# [[Recursive partitioning]] creates a decision tree that attempts to correctly classify members of the population based on a dichotomous dependent variable.
# [[Artificial neural networks]] extend regression and clustering methods to non-linear multivariate models.
# [[Statistical graphics]] such as tours, [[Parallel coordinates|parallel coordinate plots]], and scatterplot matrices can be used to explore multivariate data.
# [[Simultaneous equations model]]s involve more than one regression equation, with different dependent variables, estimated together.
# [[Vector autoregression]] involves simultaneous regressions of various [[time series]] variables on their own and each other's lagged values.
# [[Principal response curve]]s analysis (PRC) is a method based on RDA that allows the user to focus on treatment effects over time by correcting for changes in control treatments over time.<ref>ter Braak, Cajo J.F. & Šmilauer, Petr (2012). ''Canoco reference manual and user's guide: software for ordination (version 5.0)'', p292. Microcomputer Power, Ithaca, NY.</ref>
# [[Iconography of correlations]] consists of replacing a correlation matrix with a diagram in which the “remarkable” correlations are represented by a solid line (positive correlation) or a dotted line (negative correlation).
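
As an illustration of one entry in the list above, the following minimal sketch carries out principal components analysis on two variables by diagonalising their sample covariance matrix in closed form. The data values and the function name <code>pca_2d</code> are hypothetical; practical analyses usually involve more variables and rely on a numerical linear-algebra library rather than the two-variable formula assumed here.

```python
import math


def pca_2d(points):
    """Return eigenvalues (largest first) and unit principal axes for 2-D data."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sample covariance matrix [[a, b], [b, c]] of the centred data.
    a = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    c = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    b = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (a + c) / 2.0
    root = math.hypot((a - c) / 2.0, b)
    lam1, lam2 = half_trace + root, half_trace - root
    # Eigenvector belonging to the larger eigenvalue.
    if b != 0.0:
        v1 = (b, lam1 - a)
    elif a >= c:
        v1 = (1.0, 0.0)
    else:
        v1 = (0.0, 1.0)
    norm = math.hypot(v1[0], v1[1])
    v1 = (v1[0] / norm, v1[1] / norm)
    v2 = (-v1[1], v1[0])  # the second axis is orthogonal to the first
    return (lam1, lam2), (v1, v2)


# Made-up measurements of two correlated variables (illustrative only).
data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
        (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]

eigenvalues, axes = pca_2d(data)
total = sum(eigenvalues)
explained = [lam / total for lam in eigenvalues]
print("proportion of variation per component:", explained)
```

The eigenvalues are returned in decreasing order, so the entries of <code>explained</code> give the decreasing proportions of the variation summarised by each new axis.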

===Dealing with incomplete data===
In experimentally acquired data sets, the values of some components of a given data point are often [[Missing data|missing]]. Rather than discarding the whole data point, it is common to "fill in" values for the missing components, a process called "[[imputation (statistics)|imputation]]".<ref>{{cite book |title=Analysis of Incomplete Multivariate Data |author=J.L. Schafer |publisher=Chapman & Hall/CRC |year=1997 |isbn=978-1-4398-2186-2}}</ref>
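
A minimal sketch of the idea, assuming the simplest possible rule: each missing component is replaced by the mean of the observed values of the same variable. This is only one of many imputation schemes and is not necessarily the approach recommended in the cited reference; the function name <code>impute_column_means</code>, the use of <code>None</code> to mark a missing value, and the data are hypothetical.

```python
def impute_column_means(rows):
    """Replace each None by the mean of the observed values in its column.

    Assumes every column has at least one observed value.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed))
    # Build new rows so that the input data are left unchanged.
    return [
        [means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]


# Three data points with three components each; None marks a missing value.
data = [
    [1.0, 2.0, None],
    [2.0, None, 6.0],
    [3.0, 4.0, 9.0],
]

completed = impute_column_means(data)
print(completed)
```

All three data points are retained, whereas discarding incomplete points would leave only the last one.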