== Methods for blind source separation ==

=== Projection pursuit ===
Signal mixtures tend to have Gaussian probability density functions, and source signals tend to have non-Gaussian probability density functions. Each source signal can be extracted from a set of signal mixtures by taking the inner product of a weight vector with those signal mixtures, where this inner product provides an orthogonal projection of the signal mixtures. The remaining challenge is finding such a weight vector. One type of method for doing so is [[projection pursuit]].<ref name="James V. Stone 2004">James V. Stone (2004); "Independent Component Analysis: A Tutorial Introduction", The MIT Press, Cambridge, Massachusetts, London, England; {{ISBN|0-262-69315-1}}</ref><ref>Kruskal, JB. 1969; "Toward a practical method which helps uncover the structure of a set of observations by finding the line transformation which optimizes a new "index of condensation", Pages 427–440 of: Milton, RC, & Nelder, JA (eds), Statistical computation; New York, Academic Press</ref>

Projection pursuit seeks one projection at a time such that the extracted signal is as non-Gaussian as possible. This contrasts with ICA, which typically extracts ''M'' signals simultaneously from ''M'' signal mixtures, which requires estimating an ''M'' × ''M'' unmixing matrix. One practical advantage of projection pursuit over ICA is that fewer than ''M'' signals can be extracted if required, where each source signal is extracted from ''M'' signal mixtures using an ''M''-element weight vector.

We can use [[kurtosis]] to recover the multiple source signals by finding the correct weight vectors with the use of projection pursuit. The kurtosis of the probability density function of a signal, for a finite sample, is computed as

:<math>
K=\frac{\operatorname{E}[(\mathbf{y}-\mathbf{\overline{y}})^4]}{(\operatorname{E}[(\mathbf{y}-\mathbf{\overline{y}})^2])^2}-3
</math>

where <math>\mathbf{\overline{y}}</math> is the [[sample mean]] of <math>\mathbf{y}</math>, the extracted signals. The constant 3 ensures that Gaussian signals have zero kurtosis, super-Gaussian signals have positive kurtosis, and sub-Gaussian signals have negative kurtosis. The denominator is the [[variance]] of <math>\mathbf{y}</math>, and ensures that the measured kurtosis takes account of signal variance. The goal of projection pursuit is to maximize the kurtosis, and make the extracted signal as non-normal as possible.

Using kurtosis as a measure of non-normality, we can now examine how the kurtosis of a signal <math>\mathbf{y} = \mathbf{w}^T \mathbf{x}</math> extracted from a set of ''M'' mixtures <math>\mathbf{x}=(x_1,x_2,\ldots,x_M)^T</math> varies as the weight vector <math>\mathbf{w}</math> is rotated around the origin. Given our assumption that each source signal <math>\mathbf{s}</math> is super-Gaussian, we would expect:
#the kurtosis of the extracted signal <math>\mathbf{y}</math> to be maximal precisely when <math>\mathbf{y} = \mathbf{s}</math>.
#the kurtosis of the extracted signal <math>\mathbf{y}</math> to be maximal when <math>\mathbf{w}</math> is orthogonal to the projected axes <math>S_1</math> or <math>S_2</math>, because we know the optimal weight vector should be orthogonal to a transformed axis <math>S_1</math> or <math>S_2</math>.
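As a concrete illustration of this kurtosis measure (a minimal sketch, not code from the cited sources; the function name <code>excess_kurtosis</code> is chosen here purely for illustration), the following Python fragment computes the finite-sample excess kurtosis of a signal:

<syntaxhighlight lang="python">
import numpy as np

def excess_kurtosis(y):
    """Finite-sample excess kurtosis: E[(y - mean)^4] / (E[(y - mean)^2])^2 - 3."""
    d = y - np.mean(y)
    return np.mean(d**4) / np.mean(d**2)**2 - 3

rng = np.random.default_rng(0)
print(excess_kurtosis(rng.normal(size=100_000)))   # Gaussian: close to 0
print(excess_kurtosis(rng.laplace(size=100_000)))  # super-Gaussian: positive (about +3)
print(excess_kurtosis(rng.uniform(size=100_000)))  # sub-Gaussian: negative (about -1.2)
</syntaxhighlight>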
For multiple source mixture signals, we can use kurtosis and [[Gram-Schmidt]] orthogonalization (GSO) to recover the signals. Given ''M'' signal mixtures in an ''M''-dimensional space, GSO projects these data points onto an (''M'' − 1)-dimensional space by using the weight vector. We can guarantee the independence of the extracted signals with the use of GSO.

To find the correct value of <math>\mathbf{w}</math>, we can use the [[gradient descent]] method. We first whiten the data, transforming <math>\mathbf{x}</math> into a new mixture <math>\mathbf{z}=(z_1,z_2,\ldots,z_M)^T</math> which has unit variance. This can be achieved by applying a [[singular value decomposition]] to <math>\mathbf{x}</math>,

:<math>\mathbf{x} = \mathbf{U} \mathbf{D} \mathbf{V}^T</math>

rescaling each vector <math>U_i=U_i/\sqrt{\operatorname{E}(U_i^2)}</math>, and letting <math>\mathbf{z} = \mathbf{U}</math>. The signal extracted by a weight vector <math>\mathbf{w}</math> is <math>\mathbf{y} = \mathbf{w}^T \mathbf{z}</math>. If the weight vector <math>\mathbf{w}</math> has unit length, then the variance of <math>\mathbf{y}</math> is also 1, that is <math>\operatorname{E}[(\mathbf{w}^T \mathbf{z})^2]=1</math>. The kurtosis can thus be written as:

:<math>
K=\frac{\operatorname{E}[\mathbf{y}^4]}{(\operatorname{E}[\mathbf{y}^2])^2}-3=\operatorname{E}[(\mathbf{w}^T \mathbf{z})^4]-3.
</math>

The updating process for <math>\mathbf{w}</math> is:

:<math>\mathbf{w}_{new}=\mathbf{w}_{old}-\eta\operatorname{E}[\mathbf{z}(\mathbf{w}_{old}^T \mathbf{z})^3 ].</math>

where <math>\eta</math> is a small constant chosen to guarantee that <math>\mathbf{w}</math> converges to the optimal solution. After each update, we normalize <math>\mathbf{w}_{new}=\frac{\mathbf{w}_{new}}{|\mathbf{w}_{new}|}</math>, set <math>\mathbf{w}_{old}=\mathbf{w}_{new}</math>, and repeat the updating process until convergence, as sketched below.
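The following is an illustrative Python sketch of this whitening and weight-update iteration for a single signal; it is not code from the cited sources, and the step size <code>eta</code>, the iteration count, the random initialisation and the function name are assumptions made for the example:

<syntaxhighlight lang="python">
import numpy as np

def extract_one_signal(x, eta=0.1, n_iter=500, seed=0):
    """Kurtosis-based projection pursuit: extract one signal from the mixtures x (M x N)."""
    M, N = x.shape
    # Whiten the mixtures via SVD so that the rows of z are uncorrelated with unit variance.
    xc = x - x.mean(axis=1, keepdims=True)
    U, S, Vt = np.linalg.svd(xc, full_matrices=False)
    z = np.sqrt(N) * Vt
    rng = np.random.default_rng(seed)
    w = rng.normal(size=M)
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        y = w @ z
        # Update rule from above: w_new = w_old - eta * E[z (w_old^T z)^3].
        # (The sign of eta controls whether the kurtosis of y is decreased or increased;
        # for super-Gaussian sources a negative eta, i.e. gradient ascent, would be used.)
        w = w - eta * np.mean(z * y**3, axis=1)
        w /= np.linalg.norm(w)   # renormalise so that |w| = 1 and hence var(y) = 1
    return w @ z, w
</syntaxhighlight>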
We can also use other algorithms to update the weight vector <math>\mathbf{w}</math>. Another approach is using [[negentropy]]<ref name=comon94/><ref>{{cite journal|last=Hyvärinen|first=Aapo|author2=Erkki Oja|title=Independent Component Analysis: Algorithms and Applications|journal=Neural Networks|year=2000|volume=13|issue=4–5|series=4-5|pages=411–430|doi=10.1016/s0893-6080(00)00026-5|pmid=10946390|citeseerx=10.1.1.79.7003|s2cid=11959218 }}</ref> instead of kurtosis. Using negentropy is more robust than using kurtosis, as kurtosis is very sensitive to outliers. The negentropy methods are based on an important property of the Gaussian distribution: a Gaussian variable has the largest entropy among all continuous random variables of equal variance. This is also the reason why we want to find the most non-Gaussian variables. A simple proof can be found in [[Differential entropy]]. The negentropy of <math>x</math> is defined as

:<math>J(x) = S(y) - S(x)\,</math>

where <math>y</math> is a Gaussian random variable with the same covariance matrix as <math>x</math>, and

:<math>S(x) = - \int p_x(u) \log p_x(u) \, du</math>

is the differential entropy. An approximation for negentropy is

:<math>J(x)=\frac{1}{12}(E(x^3))^2 + \frac{1}{48}(\operatorname{kurt}(x))^2</math>

A proof can be found in the original papers of Comon;<ref name=pc91/><ref name=comon94/> it has been reproduced in the book ''Independent Component Analysis'' by Aapo Hyvärinen, Juha Karhunen, and [[Erkki Oja]].<ref>{{cite book|first1=Aapo |last1=Hyvärinen |first2=Juha |last2=Karhunen |first3=Erkki |last3=Oja|title=Independent component analysis|year=2001|publisher=Wiley|location=New York, NY |isbn=978-0-471-40540-5|edition=Reprint}}</ref> This approximation also suffers from the same problem as kurtosis (sensitivity to outliers). Other approaches have been developed,<ref>{{cite journal|last=Hyvärinen|first=Aapo|title=New approximations of differential entropy for independent component analysis and projection pursuit.|journal=Advances in Neural Information Processing Systems|year=1998|volume=10|pages=273–279}}</ref> for example

:<math>J(y) = k_1(E(G_1(y)))^2 + k_2(E(G_2(y)) - E(G_2(v)))^2</math>

where <math>v</math> is a standard Gaussian variable. A common choice of <math>G_1</math> and <math>G_2</math> is

:<math>G_1 = \frac{1}{a_1}\log(\cosh(a_1u))</math> and <math>G_2 = -\exp\left(-\frac{u^2}{2}\right)</math>
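The sketch below transcribes this approximation into Python for a unit-variance signal. It is illustrative only: the constants <code>k1</code>, <code>k2</code> and <code>a1</code>, the Monte Carlo estimate of <math>E(G_2(v))</math>, and the function names are assumptions rather than values prescribed by the sources.

<syntaxhighlight lang="python">
import numpy as np

def G1(u, a1=1.0):
    """Contrast function G1(u) = (1/a1) * log(cosh(a1 * u))."""
    return np.log(np.cosh(a1 * u)) / a1

def G2(u):
    """Contrast function G2(u) = -exp(-u^2 / 2)."""
    return -np.exp(-u**2 / 2)

def negentropy_approx(y, k1=1.0, k2=1.0, n_gauss=200_000, seed=0):
    """J(y) ~= k1*(E[G1(y)])^2 + k2*(E[G2(y)] - E[G2(v)])^2, with v a standard Gaussian variable."""
    v = np.random.default_rng(seed).normal(size=n_gauss)  # Monte Carlo estimate of E[G2(v)]
    return k1 * np.mean(G1(y))**2 + k2 * (np.mean(G2(y)) - np.mean(G2(v)))**2
</syntaxhighlight>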
=== Based on infomax ===
Infomax ICA<ref name="Bell-Sejnowski">Bell, A. J.; Sejnowski, T. J. (1995). "An Information-Maximization Approach to Blind Separation and Blind Deconvolution", Neural Computation, 7, 1129–1159</ref> is essentially a multivariate, parallel version of projection pursuit. Whereas projection pursuit extracts a series of signals one at a time from a set of ''M'' signal mixtures, ICA extracts ''M'' signals in parallel. This tends to make ICA more robust than projection pursuit.<ref name="ReferenceA">James V. Stone (2004). "Independent Component Analysis: A Tutorial Introduction", The MIT Press, Cambridge, Massachusetts, London, England; {{ISBN|0-262-69315-1}}</ref>

The projection pursuit method uses [[Gram-Schmidt]] orthogonalization to ensure the independence of the extracted signals, while ICA uses [[infomax]] and [[maximum likelihood]] estimation to ensure their independence. The non-normality of the extracted signals is achieved by assigning an appropriate model, or prior, for the signals.

In short, the process of ICA based on [[infomax]] is: given a set of signal mixtures <math>\mathbf{x}</math> and a set of identical independent model [[cumulative distribution functions]] (cdfs) <math>g</math>, we seek the unmixing matrix <math>\mathbf{W}</math> which maximizes the joint [[entropy]] of the signals <math>\mathbf{Y}=g(\mathbf{y})</math>, where <math>\mathbf{y}=\mathbf{Wx}</math> are the signals extracted by <math>\mathbf{W}</math>. Given the optimal <math>\mathbf{W}</math>, the signals <math>\mathbf{Y}</math> have maximum entropy and are therefore independent, which ensures that the extracted signals <math>\mathbf{y}=g^{-1}(\mathbf{Y})</math> are also independent. <math>g</math> is an invertible function, and is the signal model. Note that if the source signal model [[probability density function]] <math>p_s</math> matches the [[probability density function]] of the extracted signal <math>p_{\mathbf{y}}</math>, then maximizing the joint entropy of <math>Y</math> also maximizes the amount of [[mutual information]] between <math>\mathbf{x}</math> and <math>\mathbf{Y}</math>. For this reason, using entropy to extract independent signals is known as [[infomax]].

Consider the entropy of the vector variable <math>\mathbf{Y}=g(\mathbf{y})</math>, where <math>\mathbf{y}=\mathbf{Wx}</math> is the set of signals extracted by the unmixing matrix <math>\mathbf{W}</math>. For a finite set of values sampled from a distribution with pdf <math>p_{\mathbf{y}}</math>, the entropy of <math>\mathbf{Y}</math> can be estimated as:

:<math>
H(\mathbf{Y})=-\frac{1}{N}\sum_{t=1}^N \ln p_{\mathbf{Y}}(\mathbf{Y}^t)
</math>

The joint pdf <math>p_{\mathbf{Y}}</math> can be shown to be related to the joint pdf <math>p_{\mathbf{y}}</math> of the extracted signals by the multivariate form:

:<math>
p_{\mathbf{Y}}(Y)=\frac{p_{\mathbf{y}}(\mathbf{y})}{|\frac{\partial\mathbf{Y}}{\partial \mathbf{y}}|}
</math>

where <math>\mathbf{J}=\frac{\partial\mathbf{Y}}{\partial \mathbf{y}}</math> is the [[Jacobian matrix]]. We have <math>|\mathbf{J}|=g'(\mathbf{y})</math>, where <math>g'</math> is the pdf assumed for the source signals, <math>g'=p_s</math>; therefore,

:<math>
p_{\mathbf{Y}}(Y)=\frac{p_{\mathbf{y}}(\mathbf{y})}{|\frac{\partial\mathbf{Y}}{\partial \mathbf{y}}|}=\frac{p_\mathbf{y}(\mathbf{y})}{p_\mathbf{s}(\mathbf{y})}
</math>

and therefore,

:<math>
H(\mathbf{Y})=-\frac{1}{N}\sum_{t=1}^N \ln\frac{p_\mathbf{y}(\mathbf{y}^t)}{p_\mathbf{s}(\mathbf{y}^t)}
</math>

We know that when <math>p_{\mathbf{y}}=p_s</math>, <math>p_{\mathbf{Y}}</math> is a uniform distribution and <math>H({\mathbf{Y}})</math> is maximized. Since

:<math>
p_{\mathbf{y}}(\mathbf{y})=\frac{p_\mathbf{x}(\mathbf{x})}{|\frac{\partial\mathbf{y}}{\partial\mathbf{x}}|}=\frac{p_\mathbf{x}(\mathbf{x})}{|\mathbf{W}|}
</math>

where <math>|\mathbf{W}|</math> is the absolute value of the determinant of the unmixing matrix <math>\mathbf{W}</math>, it follows that

:<math>
H(\mathbf{Y})=-\frac{1}{N}\sum_{t=1}^N \ln\frac{p_\mathbf{x}(\mathbf{x}^t)}{|\mathbf{W}|p_\mathbf{s}(\mathbf{y}^t)}
</math>

so

:<math>
H(\mathbf{Y})=\frac{1}{N}\sum_{t=1}^N \ln p_\mathbf{s}(\mathbf{y}^t)+\ln|\mathbf{W}|+H(\mathbf{x})
</math>

Since <math>H(\mathbf{x})=-\frac{1}{N}\sum_{t=1}^N\ln p_\mathbf{x}(\mathbf{x}^t)</math> does not depend on <math>\mathbf{W}</math>, we can instead maximize the function

:<math>
h(\mathbf{Y})=\frac{1}{N}\sum_{t=1}^N \ln p_\mathbf{s}(\mathbf{y}^t)+\ln|\mathbf{W}|
</math>

to achieve the independence of the extracted signals. If the ''M'' marginal pdfs of the model joint pdf <math>p_{\mathbf{s}}</math> are independent, and we use the commonly adopted super-Gaussian model pdf for the source signals <math>p_{\mathbf{s}}=(1-\tanh(\mathbf{s})^2)</math>, then we have

:<math>
h(\mathbf{Y})=\frac{1}{N}\sum_{i=1}^M\sum_{t=1}^N \ln (1-\tanh(\mathbf{w}_i^\mathsf{T}\mathbf{x}^t)^2)+\ln|\mathbf{W}|
</math>

In summary, given an observed signal mixture <math>\mathbf{x}</math>, the corresponding set of extracted signals <math>\mathbf{y}</math> and a source signal model <math>p_{\mathbf{s}}=g'</math>, we can find the optimal unmixing matrix <math>\mathbf{W}</math> and make the extracted signals independent and non-Gaussian. As in the projection pursuit situation, we can use the gradient descent method to find the optimal solution of the unmixing matrix, as sketched below.
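A minimal sketch of such a gradient ascent on <math>h(\mathbf{Y})</math> in Python follows. It is illustrative rather than code from the cited sources; the learning rate, iteration count and initialisation are assumptions, and the mixtures are assumed to be centred (zero mean).

<syntaxhighlight lang="python">
import numpy as np

def infomax_ica(x, eta=0.01, n_iter=2000, seed=0):
    """Gradient ascent on h(Y) = (1/N) sum_i sum_t ln(1 - tanh(w_i^T x^t)^2) + ln|det W|.

    x is an M x N array of centred signal mixtures (M signals, N samples).
    """
    M, N = x.shape
    rng = np.random.default_rng(seed)
    W = np.eye(M) + 0.1 * rng.normal(size=(M, M))   # start near the identity
    for _ in range(n_iter):
        Y = W @ x
        # Gradient of h with respect to W: (W^-1)^T - (2/N) * tanh(Wx) x^T,
        # using the super-Gaussian model pdf p_s(s) = 1 - tanh(s)^2.
        grad = np.linalg.inv(W).T - (2.0 / N) * np.tanh(Y) @ x.T
        W += eta * grad
    return W @ x, W
</syntaxhighlight>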
=== Based on maximum likelihood estimation ===
'''[[Maximum likelihood]] estimation (MLE)''' is a standard statistical tool for finding parameter values (e.g. the unmixing matrix <math>\mathbf{W}</math>) that provide the best fit of some data (e.g., the extracted signals <math>y</math>) to a given model (e.g., the assumed joint probability density function (pdf) <math>p_s</math> of the source signals).<ref name="ReferenceA"/> The '''ML''' "model" includes a specification of a pdf, which in this case is the pdf <math>p_s</math> of the unknown source signals <math>s</math>.

Using '''ML ICA''', the objective is to find an unmixing matrix that yields extracted signals <math>y = \mathbf{W}x</math> with a joint pdf as similar as possible to the joint pdf <math>p_s</math> of the unknown source signals <math>s</math>. '''MLE''' is thus based on the assumption that if the model pdf <math>p_s</math> and the model parameters <math>\mathbf{A}</math> are correct, then a high probability should be obtained for the data <math>x</math> that were actually observed. Conversely, if <math>\mathbf{A}</math> is far from the correct parameter values, then a low probability of the observed data would be expected.

Using '''MLE''', we call the probability of the observed data for a given set of model parameter values (e.g., a pdf <math>p_s</math> and a matrix <math>\mathbf{A}</math>) the ''likelihood'' of the model parameter values given the observed data. We define a ''likelihood'' function <math>\mathbf{L(W)}</math> of <math>\mathbf{W}</math>:

:<math>\mathbf{L(W)} = p_s (\mathbf{W}x)|\det \mathbf{W}|.</math>

This equals the probability density at <math>x</math>, since <math>s = \mathbf{W}x</math>.

Thus, if we wish to find a <math>\mathbf{W}</math> that is most likely to have generated the observed mixtures <math>x</math> from the unknown source signals <math>s</math> with pdf <math>p_s</math>, then we need only find that <math>\mathbf{W}</math> which maximizes the ''likelihood'' <math>\mathbf{L(W)}</math>. The unmixing matrix that maximizes this equation is known as the '''MLE''' of the optimal unmixing matrix.

It is common practice to use the log ''likelihood'', because this is easier to evaluate. As the logarithm is a monotonic function, the <math>\mathbf{W}</math> that maximizes the function <math>\mathbf{L(W)}</math> also maximizes its logarithm <math>\ln \mathbf{L(W)}</math>. This allows us to take the logarithm of the equation above, which yields the log ''likelihood'' function

:<math>\ln \mathbf{L(W)} =\sum_{i}\sum_{t} \ln p_s(w^T_ix_t) + N\ln|\det \mathbf{W}|</math>

If we substitute a commonly used high-[[kurtosis]] model pdf for the source signals <math>p_s = (1-\tanh(s)^2)</math>, then (after scaling by <math>1/N</math>) we have

:<math>\ln \mathbf{L(W)} ={1 \over N}\sum_{i}^{M} \sum_{t}^{N}\ln(1-\tanh(w^T_i x_t )^2) + \ln |\det \mathbf{W}|</math>

The matrix <math>\mathbf{W}</math> that maximizes this function is the '''''[[maximum likelihood]] estimate''''' of the optimal unmixing matrix.
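As an illustration of this log-likelihood, the sketch below evaluates <math>\ln \mathbf{L(W)}</math> (in the <math>1/N</math>-scaled form above) for a toy two-source example. The Laplacian sources, the mixing matrix and the function name are assumptions made purely for the example; the true unmixing matrix should typically score higher than a random one.

<syntaxhighlight lang="python">
import numpy as np

def log_likelihood(W, x):
    """(1/N) * sum_i sum_t ln(1 - tanh(w_i^T x_t)^2) + ln|det W|."""
    N = x.shape[1]
    y = W @ x
    return np.sum(np.log(1 - np.tanh(y)**2)) / N + np.log(abs(np.linalg.det(W)))

rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 10_000))            # two super-Gaussian (Laplacian) sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])       # toy mixing matrix
x = A @ s                                    # observed signal mixtures
print(log_likelihood(np.linalg.inv(A), x))   # log likelihood of the true unmixing matrix
print(log_likelihood(rng.normal(size=(2, 2)), x))  # typically lower for a random W
</syntaxhighlight>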