===Mathematical model of the same example===
In the following, matrices will be indicated by indexed variables. "Academic subject" indices will be indicated using letters <math>a</math>, <math>b</math> and <math>c</math>, with values running from <math>1</math> to <math>p</math>, which is equal to <math>10</math> in the above example. "Factor" indices will be indicated using letters <math>p</math>, <math>q</math> and <math>r</math>, with values running from <math>1</math> to <math>k</math>, which is equal to <math>2</math> in the above example. "Instance" or "sample" indices will be indicated using letters <math>i</math>, <math>j</math> and <math>k</math>, with values running from <math>1</math> to <math>N</math>.

In the example above, if a sample of <math>N=1000</math> students participated in the <math>p=10</math> exams, the <math>i</math>th student's score for the <math>a</math>th exam is given by <math>x_{ai}</math>. The purpose of factor analysis is to characterize the correlations between the variables <math>x_a</math>, of which the <math>x_{ai}</math> are a particular instance, or set of observations. In order for the variables to be on an equal footing, they are [[normalization (statistics)|normalized]] into standard scores <math>z</math>:
:<math>z_{ai}=\frac{x_{ai}-\hat\mu_a}{\hat\sigma_a}</math>
where the sample mean is:
:<math>\hat\mu_a=\tfrac{1}{N}\sum_i x_{ai}</math>
and the sample variance is given by:
:<math>\hat\sigma_a^2=\tfrac{1}{N-1}\sum_i (x_{ai}-\hat\mu_a)^2</math>

The factor analysis model for this particular sample is then:
:<math>\begin{matrix}z_{1,i} & = & \ell_{1,1}F_{1,i} & + & \ell_{1,2}F_{2,i} & + & \varepsilon_{1,i} \\ \vdots & & \vdots & & \vdots & & \vdots \\ z_{10,i} & = & \ell_{10,1}F_{1,i} & + & \ell_{10,2}F_{2,i} & + & \varepsilon_{10,i} \end{matrix}</math>
or, more succinctly:
:<math>z_{ai}=\sum_p \ell_{ap}F_{pi}+\varepsilon_{ai}</math>
where
* <math>F_{1i}</math> is the <math>i</math>th student's "verbal intelligence",
* <math>F_{2i}</math> is the <math>i</math>th student's "mathematical intelligence",
* <math>\ell_{ap}</math> are the factor loadings for the <math>a</math>th subject, for <math>p=1,2</math>.

In [[Matrix (mathematics)|matrix]] notation, we have
:<math>Z=LF+\varepsilon</math>

Observe that doubling the scale on which "verbal intelligence" (the first component in each column of <math>F</math>) is measured, while simultaneously halving the factor loadings for verbal intelligence, makes no difference to the model. Thus, no generality is lost by assuming that the standard deviation of the factors for verbal intelligence is <math>1</math>. Likewise for mathematical intelligence. Moreover, for similar reasons, no generality is lost by assuming the two factors are [[uncorrelated]] with each other. In other words:
:<math>\sum_i F_{pi}F_{qi}=\delta_{pq}</math>
where <math>\delta_{pq}</math> is the [[Kronecker delta]] (<math>0</math> when <math>p \ne q</math> and <math>1</math> when <math>p=q</math>). The errors are assumed to be independent of the factors:
:<math>\sum_i F_{pi}\varepsilon_{ai}=0</math>
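The following is a minimal numerical sketch of this setup (it is not part of the derivation above; the loading values, noise level, and score scale are made-up assumptions for illustration). It simulates scores for <math>N=1000</math> students on <math>p=10</math> exams from <math>k=2</math> uncorrelated, unit-variance factors, then standardizes the observed scores:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
p, k, N = 10, 2, 1000                          # subjects, factors, students

L = rng.uniform(0.3, 0.8, size=(p, k))         # hypothetical factor loadings l_ap (illustrative values)
F = rng.standard_normal(size=(k, N))           # factor scores F_pi: roughly uncorrelated, unit variance
eps = 0.4 * rng.standard_normal(size=(p, N))   # error terms, independent of the factors

x = 60 + 10 * (L @ F + eps)                    # raw exam scores x_ai on an arbitrary scale

# Standardization into z-scores, as in the text:
mu_hat = x.mean(axis=1, keepdims=True)             # sample means mu_a
sigma_hat = x.std(axis=1, ddof=1, keepdims=True)   # sample standard deviations sigma_a
z = (x - mu_hat) / sigma_hat                       # z_ai = (x_ai - mu_a) / sigma_a
</syntaxhighlight>

After standardization, the effective loadings are the rows of the assumed <math>L</math> divided by each variable's standard deviation, so each simulated variable has unit variance as the model requires.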
Since any rotation of a solution is also a solution, this makes interpreting the factors difficult; see the disadvantages below. In this particular example, if we do not know beforehand that the two types of intelligence are uncorrelated, then we cannot interpret the two factors as the two different types of intelligence. Even if they are uncorrelated, we cannot tell which factor corresponds to verbal intelligence and which corresponds to mathematical intelligence without an outside argument.

The values of the loadings <math>L</math>, the averages <math>\mu</math>, and the [[variance]]s of the "errors" <math>\varepsilon</math> must be estimated given the observed data <math>X</math> and <math>F</math> (the assumption about the levels of the factors is fixed for a given <math>F</math>). The "fundamental theorem" may be derived from the above conditions:
:<math>\sum_i z_{ai}z_{bi}=\sum_j \ell_{aj}\ell_{bj}+\sum_i \varepsilon_{ai}\varepsilon_{bi}</math>
The term on the left is the <math>(a,b)</math>-term of the correlation matrix (a <math>p \times p</math> matrix derived as the product of the <math>p \times N</math> matrix of standardized observations with its transpose) of the observed data, and its <math>p</math> diagonal elements will be <math>1</math>s. The second term on the right will be a diagonal matrix with terms less than unity. The first term on the right is the "reduced correlation matrix" and will be equal to the correlation matrix except for its diagonal values, which will be less than unity. These diagonal elements of the reduced correlation matrix are called "communalities", and they represent the fraction of the variance in the observed variable that is accounted for by the factors:
:<math>h_a^2=1-\psi_a=\sum_j \ell_{aj}\ell_{aj}</math>

The sample data <math>z_{ai}</math> will not exactly obey the fundamental equation given above because of sampling errors, inadequacy of the model, and so on. The goal of any analysis of the above model is to find the factors <math>F_{pi}</math> and loadings <math>\ell_{ap}</math> which give a "best fit" to the data. In factor analysis, the best fit is defined as the minimum of the mean square error in the off-diagonal residuals of the correlation matrix:<ref name="Harman">{{cite book |last=Harman |first=Harry H. |year=1976 |title=Modern Factor Analysis |publisher=University of Chicago Press |pages=175, 176 |isbn=978-0-226-31652-9 }}</ref>
:<math>\varepsilon^2 = \sum_{a\ne b} \left[\sum_i z_{ai}z_{bi}-\sum_j \ell_{aj}\ell_{bj}\right]^2</math>
This is equivalent to minimizing the off-diagonal components of the error covariance which, in the model equations, have expected values of zero. This is to be contrasted with principal component analysis, which seeks to minimize the mean square error of all residuals.<ref name="Harman"/>

Before the advent of high-speed computers, considerable effort was devoted to finding approximate solutions to the problem, particularly by estimating the communalities by other means, which simplifies the problem considerably by yielding a known reduced correlation matrix. This was then used to estimate the factors and the loadings. With the advent of high-speed computers, the minimization problem can be solved iteratively with adequate speed, and the communalities are calculated in the process, rather than being needed beforehand. The [[Generalized minimal residual method|MinRes]] algorithm is particularly suited to this problem, but is hardly the only iterative means of finding a solution.

If the solution factors are allowed to be correlated (as in "oblimin" rotation, for example), then the corresponding mathematical model uses [[skew coordinates]] rather than orthogonal coordinates.
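As an illustration of the fitting criterion above, the off-diagonal residuals can be minimized directly with a general-purpose optimizer. This is only a sketch, not the MinRes algorithm itself, and the variable names are assumptions carried over from the earlier sketch, in which <math>z</math> is the <math>p \times N</math> array of standardized scores:

<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import minimize

# z is the p x N array of standardized scores from the previous sketch.
p, N = z.shape
k = 2                                       # number of factors assumed

R = np.corrcoef(z)                          # p x p sample correlation matrix
off_diag = ~np.eye(p, dtype=bool)           # mask selecting the off-diagonal entries

def objective(l_flat):
    L_hat = l_flat.reshape(p, k)
    resid = R - L_hat @ L_hat.T             # residual (reduced) correlation matrix
    return np.sum(resid[off_diag] ** 2)     # sum of squared off-diagonal residuals

result = minimize(objective, x0=np.full(p * k, 0.5), method="L-BFGS-B")
L_hat = result.x.reshape(p, k)              # estimated loadings (determined only up to rotation)
communalities = np.sum(L_hat ** 2, axis=1)  # h_a^2 = sum_j l_aj^2
</syntaxhighlight>

As noted above, any rotation of the estimated loadings is an equally good solution, so the columns of the fitted matrix cannot be identified with particular factors without further assumptions or a rotation criterion.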