==Fisher's linear discriminant==
The terms ''Fisher's linear discriminant'' and ''LDA'' are often used interchangeably, although [[Ronald A. Fisher|Fisher's]] original article<ref name="Fisher:1936" /> actually describes a slightly different discriminant, which does not make some of the assumptions of LDA, such as [[normal distribution|normally distributed]] classes or equal class [[covariance]]s.

Suppose two classes of observations have [[mean]]s <math> \vec \mu_0, \vec \mu_1 </math> and covariances <math>\Sigma_0,\Sigma_1 </math>. Then the linear combination of features <math> {\vec w}^\mathrm{T} \vec x </math> has [[mean]]s <math> {\vec w}^\mathrm{T} \vec \mu_i </math> and [[variance]]s <math> {\vec w}^\mathrm{T} \Sigma_i \vec w </math> for <math> i=0,1 </math>. Fisher defined the separation between these two [[probability distribution|distributions]] to be the ratio of the variance between the classes to the variance within the classes:

:<math>S=\frac{\sigma_{\text{between}}^2}{\sigma_{\text{within}}^2}= \frac{(\vec w \cdot \vec \mu_1 - \vec w \cdot \vec \mu_0)^2}{{\vec w}^\mathrm{T} \Sigma_1 \vec w + {\vec w}^\mathrm{T} \Sigma_0 \vec w} = \frac{(\vec w \cdot (\vec \mu_1 - \vec \mu_0))^2}{{\vec w}^\mathrm{T} (\Sigma_0+\Sigma_1) \vec w}</math>

This measure is, in some sense, a measure of the [[signal-to-noise ratio]] for the class labelling. It can be shown that the maximum separation occurs when

:<math> \vec w \propto (\Sigma_0+\Sigma_1)^{-1}(\vec \mu_1 - \vec \mu_0) </math>

When the assumptions of LDA are satisfied, the above equation is equivalent to LDA.

[[File:Fisher2classes.png|thumb|Fisher's linear discriminant visualised as an axis]]

Note that the vector <math>\vec w</math> is the [[surface normal|normal]] to the discriminant [[hyperplane]]. For example, in a two-dimensional problem, the line that best divides the two groups is perpendicular to <math>\vec w</math>.

Generally, the data points to be discriminated are projected onto <math>\vec w</math>; the threshold that best separates the data is then chosen from analysis of the resulting one-dimensional distribution. There is no general rule for the threshold. However, if projections of points from the two classes exhibit approximately the same distributions, a good choice is the point midway between the projections of the two means, <math>\vec w \cdot \vec \mu_0 </math> and <math>\vec w \cdot \vec \mu_1 </math>. In this case the parameter c in the threshold condition <math> \vec w \cdot \vec x > c </math> can be found explicitly (see the numerical sketch at the end of this section). If the class covariances are equal, <math>\Sigma_0 = \Sigma_1 = \Sigma</math>, and <math>\vec w = \Sigma^{-1} (\vec \mu_1 - \vec \mu_0)</math>, the cross terms cancel and

:<math> c = \vec w \cdot \frac{1}{2} (\vec \mu_0 + \vec \mu_1) = \frac{1}{2} \vec\mu_1^\mathrm{T} \Sigma^{-1} \vec\mu_1 - \frac{1}{2} \vec\mu_0^\mathrm{T} \Sigma^{-1} \vec\mu_0 .</math>

[[Otsu's method]] is related to Fisher's linear discriminant; it was created to binarize the histogram of pixels in a grayscale image by optimally picking the black/white threshold that minimizes within-class variance and maximizes between-class variance of the grayscale values assigned to the black and white pixel classes.
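The construction above can be illustrated with a minimal numerical sketch in Python with NumPy. The synthetic two-class data, sample sizes, and random seed are illustrative assumptions, not part of Fisher's derivation; the sketch computes the discriminant direction <math>\vec w \propto (\Sigma_0+\Sigma_1)^{-1}(\vec \mu_1 - \vec \mu_0)</math>, the midpoint threshold <math>c</math>, and the separation ratio <math>S</math>:

<syntaxhighlight lang="python">
# Minimal sketch of Fisher's linear discriminant (illustrative data, not from the article).
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic 2-D classes (assumed example data).
X0 = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.2], [0.2, 1.0]], size=200)
X1 = rng.multivariate_normal(mean=[2.0, 2.0], cov=[[1.0, -0.3], [-0.3, 1.0]], size=200)

# Sample class means and covariances.
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S0 = np.cov(X0, rowvar=False)
S1 = np.cov(X1, rowvar=False)

# Fisher's direction: w proportional to (Sigma0 + Sigma1)^{-1} (mu1 - mu0).
w = np.linalg.solve(S0 + S1, mu1 - mu0)

# Midpoint threshold between the projected means; the article notes this is a
# good choice when the two projected classes have similar distributions.
c = w @ (mu0 + mu1) / 2

# Classify: predict class 1 when w . x > c.
pred = (np.vstack([X0, X1]) @ w > c).astype(int)
true = np.r_[np.zeros(len(X0)), np.ones(len(X1))]
print("training accuracy:", (pred == true).mean())

# Fisher's separation ratio S for this w (between-class over within-class variance).
S = (w @ (mu1 - mu0)) ** 2 / (w @ (S0 + S1) @ w)
print("separation S:", S)
</syntaxhighlight>

Because <math>\vec w</math> is defined only up to a positive scale factor, any positive multiple of the computed direction yields the same classifier once the threshold <math>c</math> is rescaled by the same factor.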