==LDA for two classes==

Consider a set of observations <math> { \vec x } </math> (also called features, attributes, variables or measurements) for each sample of an object or event with known class <math>y</math>. This set of samples is called the [[training set]] in a [[supervised learning]] context. The classification problem is then to find a good predictor for the class <math>y</math> of any sample of the same distribution (not necessarily from the training set) given only an observation <math> \vec x </math>.<ref name="Venables:2002">{{cite book |title=Modern Applied Statistics with S |first1=W. N. |last1=Venables |first2=B. D. |last2=Ripley |author-link2=Brian Ripley |publisher=Springer Verlag |isbn=978-0-387-95457-8 |year=2002 |edition=4th}}</ref>{{rp|338}}

LDA approaches the problem by assuming that the conditional [[probability density function]]s <math>p(\vec x|y=0)</math> and <math>p(\vec x|y=1)</math> are both [[Multivariate normal distribution|normal distributions]] with mean and [[covariance]] parameters <math>\left(\vec \mu_0, \Sigma_0\right)</math> and <math>\left(\vec \mu_1, \Sigma_1\right)</math>, respectively. Under this assumption, the [[Bayes classifier|Bayes-optimal solution]] is to predict points as being from the second class if the log of the likelihood ratio exceeds some threshold ''T'', so that:

: <math> \frac{1}{2} (\vec x - \vec \mu_0)^\mathrm{T} \Sigma_0^{-1} ( \vec x - \vec \mu_0) + \frac{1}{2} \ln|\Sigma_0| - \frac{1}{2} (\vec x - \vec \mu_1)^\mathrm{T} \Sigma_1^{-1} ( \vec x - \vec \mu_1) - \frac{1}{2} \ln|\Sigma_1| \ > \ T </math>

Without any further assumptions, the resulting classifier is referred to as [[quadratic classifier|quadratic discriminant analysis]] (QDA). LDA instead makes the additional simplifying [[homoscedastic]]ity assumption (''i.e.'' that the class covariances are identical, so <math>\Sigma_0 = \Sigma_1 = \Sigma</math>) and assumes that the covariances have full rank.
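The decision rule above can be checked numerically. The following is a minimal sketch (not part of the article) that evaluates the log-likelihood-ratio score for two illustrative 2-D Gaussian class densities; the parameter values and the threshold <math>T = 0</math> are arbitrary choices for demonstration.

```python
# Sketch of the Bayes-optimal decision rule for two Gaussian classes.
# All parameter values below are illustrative assumptions, not from the text.
import numpy as np

def log_likelihood_ratio(x, mu0, Sigma0, mu1, Sigma1):
    """Log of p(x | y=1) / p(x | y=0) for two Gaussian class densities.

    Matches the displayed inequality term by term: the quadratic form and
    log-determinant for class 0 enter with a plus sign, those for class 1
    with a minus sign (the shared normalizing constants cancel).
    """
    d0 = x - mu0
    d1 = x - mu1
    return (0.5 * d0 @ np.linalg.inv(Sigma0) @ d0
            + 0.5 * np.log(np.linalg.det(Sigma0))
            - 0.5 * d1 @ np.linalg.inv(Sigma1) @ d1
            - 0.5 * np.log(np.linalg.det(Sigma1)))

mu0 = np.array([0.0, 0.0]); Sigma0 = np.eye(2)
mu1 = np.array([2.0, 2.0]); Sigma1 = np.array([[1.0, 0.3], [0.3, 1.0]])
T = 0.0  # decision threshold (equal priors, equal misclassification costs)

x = np.array([1.8, 1.9])  # a test point near mu1
predict_class_1 = log_likelihood_ratio(x, mu0, Sigma0, mu1, Sigma1) > T
```

Because <math>\Sigma_0 \neq \Sigma_1</math> here, this score is quadratic in <math>\vec x</math>, i.e. the QDA case; the homoscedastic simplification below reduces it to a linear rule.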
In this case, several terms cancel:

: <math> {\vec x}^\mathrm{T} \Sigma_0^{-1} \vec x = {\vec x}^\mathrm{T} \Sigma_1^{-1} \vec x</math>
: <math>{\vec x}^\mathrm{T} {\Sigma_i}^{-1} \vec{\mu}_i = {\vec{\mu}_i}^\mathrm{T}{\Sigma_i}^{-1} \vec x</math>

because <math>\Sigma_i</math> is [[Hermitian matrix|Hermitian]], and the above decision criterion becomes a threshold on the [[dot product]]

: <math> {\vec w}^\mathrm{T} \vec x > c </math>

for some threshold constant ''c'', where

: <math>\vec w = \Sigma^{-1} (\vec \mu_1 - \vec \mu_0)</math>
: <math> c = \frac12 \, {\vec w}^\mathrm{T} (\vec \mu_1 + \vec \mu_0)</math>

This means that the criterion of an input <math> \vec{ x }</math> being in a class <math>y</math> is purely a function of this linear combination of the known observations.

It is often useful to see this conclusion in geometrical terms: the criterion of an input <math> \vec{ x }</math> being in a class <math>y</math> is purely a function of the projection of the multidimensional-space point <math> \vec{ x }</math> onto the vector <math> \vec{ w }</math> (thus, we only consider its direction). In other words, the observation belongs to <math>y</math> if the corresponding <math> \vec{ x }</math> is located on a certain side of a hyperplane perpendicular to <math> \vec{ w }</math>. The location of the plane is defined by the threshold <math>c</math>.
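The linear rule above can be sketched in a few lines. This is an illustrative example (not part of the article): the means and the shared covariance are arbitrary assumed values, and the classifier simply thresholds the projection <math>{\vec w}^\mathrm{T} \vec x</math> at <math>c</math>.

```python
# Sketch of the LDA decision rule under the homoscedasticity assumption.
# The means and shared covariance below are illustrative assumptions.
import numpy as np

mu0 = np.array([0.0, 0.0])
mu1 = np.array([2.0, 2.0])
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])  # shared class covariance

w = np.linalg.inv(Sigma) @ (mu1 - mu0)  # w = Sigma^{-1} (mu1 - mu0)
c = 0.5 * w @ (mu1 + mu0)               # c = (1/2) w^T (mu1 + mu0)

def predict(x):
    """Return 1 if x projects onto w past the threshold c, else 0."""
    return int(w @ x > c)

# Points near mu1 land on the class-1 side of the hyperplane w^T x = c,
# points near mu0 on the class-0 side.
label_a = predict(np.array([1.9, 1.9]))  # near mu1
label_b = predict(np.array([0.1, 0.1]))  # near mu0
```

Note that the hyperplane passes through the midpoint <math>\tfrac12(\vec \mu_0 + \vec \mu_1)</math>, which projects onto <math>\vec w</math> at exactly <math>c</math>, consistent with the geometric picture above.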