==Multiclass LDA==
[[File:4class3ddiscriminant.png|thumb|Visualisation for one-versus-all LDA axes for 4 classes in 3d]]
[[File:3dProjections.png|thumb|Projections along linear discriminant axes for 4 classes]]
In the case where there are more than two classes, the analysis used in the derivation of the Fisher discriminant can be extended to find a [[Linear subspace|subspace]] which appears to contain all of the class variability.<ref name="garson">Garson, G. D. (2008). Discriminant function analysis. {{cite web |url=http://www2.chass.ncsu.edu/garson/pa765/discrim.htm |title=PA 765: Discriminant Function Analysis |access-date=2008-03-04 |url-status=dead |archive-url=https://web.archive.org/web/20080312065328/http://www2.chass.ncsu.edu/garson/pA765/discrim.htm |archive-date=2008-03-12 }}</ref> This generalization is due to [[C. R. Rao]].<ref name="Rao:1948">{{cite journal |last=Rao |first=R. C. |author-link=Calyampudi Radhakrishna Rao |title=The utilization of multiple measurements in problems of biological classification |journal=Journal of the Royal Statistical Society, Series B |volume=10 |issue=2 |pages=159–203 |year=1948 |doi=10.1111/j.2517-6161.1948.tb00008.x |jstor=2983775}}</ref> Suppose that each of ''C'' classes has a mean <math> \mu_i </math> and the same covariance <math> \Sigma </math>. Then the between-class scatter may be defined by the sample covariance of the class means

:<math> \Sigma_b = \frac{1}{C} \sum_{i=1}^C (\mu_i-\mu) (\mu_i-\mu)^\mathrm{T} </math>

where <math> \mu </math> is the mean of the class means. The class separation in a direction <math> \vec w </math> is then given by

:<math> S = \frac{{\vec w}^\mathrm{T} \Sigma_b \vec w}{{\vec w}^\mathrm{T} \Sigma \vec w}. </math>

This means that when <math> \vec w </math> is an [[eigenvector]] of <math> \Sigma^{-1} \Sigma_b </math>, the separation equals the corresponding [[eigenvalue]].

If <math> \Sigma^{-1} \Sigma_b </math> is diagonalizable, the variability between features will be contained in the subspace spanned by the eigenvectors corresponding to the ''C'' − 1 largest eigenvalues (since <math> \Sigma_b </math> is of rank ''C'' − 1 at most). These eigenvectors are primarily used in feature reduction, as in PCA. The eigenvectors corresponding to the smaller eigenvalues tend to be very sensitive to the exact choice of training data, and it is often necessary to use regularisation as described in the next section.

If classification is required, instead of [[dimension reduction]], there are a number of alternative techniques available. For instance, the classes may be partitioned, and a standard Fisher discriminant or LDA used to classify each partition. A common example of this is "one against the rest", where the points from one class are put in one group and everything else in the other, and then LDA is applied. This results in ''C'' classifiers, whose results are combined. Another common method is pairwise classification, where a new classifier is created for each pair of classes (giving ''C''(''C'' − 1)/2 classifiers in total), with the individual classifiers combined to produce a final classification.
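As a concrete illustration of the eigenvector construction above, the following is a minimal sketch (not drawn from the cited sources) of computing the discriminant axes from the eigenvectors of <math> \Sigma^{-1} \Sigma_b </math> using NumPy. The function name <code>lda_axes</code> and the synthetic four-class data are illustrative assumptions, and the sketch assumes the pooled within-class covariance is invertible; in practice the regularisation mentioned above may be needed.

<syntaxhighlight lang="python">
import numpy as np

def lda_axes(X, y):
    """Sketch: discriminant axes as eigenvectors of inv(Sigma) @ Sigma_b.

    Assumes equal class covariances and an invertible pooled covariance.
    """
    classes = np.unique(y)
    C = len(classes)
    class_means = np.array([X[y == c].mean(axis=0) for c in classes])
    mu = class_means.mean(axis=0)            # mean of the class means

    # Between-class scatter: sample covariance of the class means
    diffs = class_means - mu
    Sigma_b = diffs.T @ diffs / C

    # Pooled within-class covariance (shared Sigma assumption)
    Sigma = sum(np.cov(X[y == c], rowvar=False) * (np.sum(y == c) - 1)
                for c in classes) / (len(X) - C)

    # Eigenvectors of inv(Sigma) @ Sigma_b; keep the C - 1 largest eigenvalues
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sigma) @ Sigma_b)
    order = np.argsort(eigvals.real)[::-1][:C - 1]
    return eigvecs.real[:, order]             # columns are discriminant axes

# Example: 4 classes in 3 dimensions, projected onto at most C - 1 = 3 axes
rng = np.random.default_rng(0)
centres = rng.normal(size=(4, 3)) * 5
X = np.vstack([rng.normal(m, 1.0, size=(50, 3)) for m in centres])
y = np.repeat(np.arange(4), 50)
W = lda_axes(X, y)
X_proj = X @ W                                # reduced representation
</syntaxhighlight>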