Editing Cooley–Tukey FFT algorithm (section)

== The radix-2 DIT case ==
A '''radix-2''' decimation-in-time ('''DIT''') FFT is the simplest and most common form of the Cooley–Tukey algorithm, although highly optimized Cooley–Tukey implementations typically use other forms of the algorithm as described below. Radix-2 DIT divides a DFT of size ''N'' into two interleaved DFTs (hence the name "radix-2") of size ''N''/2 with each recursive stage.

The discrete Fourier transform (DFT) is defined by the formula:
:<math>      X_k = \sum_{n=0}^{N-1} x_n e^{-\frac{2\pi i}{N} nk},</math>
where <math>k</math> is an integer ranging from 0 to <math>N-1</math>.

Radix-2 DIT first computes the DFTs of the even-indexed inputs
<math>(x_{2m}=x_0, x_2, \ldots, x_{N-2})</math>
and of the odd-indexed inputs <math>(x_{2m+1}=x_1, x_3, \ldots, x_{N-1})</math>, and then combines those two results to produce the DFT of the whole sequence. This idea can then be performed [[recursion|recursively]] to reduce the overall runtime to O(''N'' log ''N'').  This simplified form assumes that ''N'' is a [[power of two]]; since the number of sample points ''N'' can usually be chosen freely by the application (e.g. by changing the sample rate or window, zero-padding, etcetera), this is often not an important restriction.

The radix-2 DIT algorithm rearranges the DFT of the function <math>x_n</math> into two parts: a sum over the even-numbered indices <math>n={2m}</math> and a sum over the odd-numbered indices <math>n={2m+1}</math>:

:<math>
X_k = \sum_{m=0}^{N/2-1} x_{2m}e^{-\frac{2\pi i}{N} (2m)k}   +   \sum_{m=0}^{N/2-1} x_{2m+1} e^{-\frac{2\pi i}{N} (2m+1)k}
</math>

One can factor a common multiplier <math>e^{-\frac{2\pi i}{N}k}</math> out of the second sum, as shown in the equation below. It is then clear that the two sums are the DFT of the even-indexed part <math>x_{2m}</math> and the DFT of odd-indexed part <math>x_{2m+1}</math> of the function <math>x_n</math>. Denote the DFT of the '''''E'''''ven-indexed inputs <math>x_{2m}</math> by <math>E_k</math> and the DFT of the '''''O'''''dd-indexed inputs <math>x_{2m + 1}</math> by <math>O_k</math> and we obtain:

:<math>
X_k= \underbrace{\sum \limits_{m=0}^{N/2-1} x_{2m}   e^{-\frac{2\pi i}{N/2} mk}}_{\text{DFT of even-indexed part of } x_n} {} +  e^{-\frac{2\pi i}{N}k}
 \underbrace{\sum \limits_{m=0}^{N/2-1} x_{2m+1} e^{-\frac{2\pi i}{N/2} mk}}_{\text{DFT of odd-indexed part of } x_n} =  E_k + e^{-\frac{2\pi i}{N}k} O_k\qquad\text{ for }k=0,\dots,\frac N2-1.
</math>

Note that the equalities hold for <math>k=0,\dots,N-1</math>, 
but the crux is that <math>E_k</math> and <math>O_k</math> are calculated in this way for <math>k=0,\dots,\frac N2-1</math> only.
Thanks to the [[List of trigonometric identities#Shifts and periodicity|periodicity of the complex exponential]], <math>X_{k+\frac{N}{2}}</math> is also obtained from <math>E_k</math> and <math>O_k</math>:

:<math>
\begin{align} X_{k + \frac{N}{2}} & =  \sum \limits_{m=0}^{N/2-1} x_{2m}   e^{-\frac{2\pi i}{N/2} m(k + \frac{N}{2})} +  e^{-\frac{2\pi i}{N}(k + \frac{N}{2})}
 \sum_{m=0}^{N/2-1} x_{2m+1} e^{-\frac{2\pi i}{N/2} m(k + \frac{N}{2} )} \\
 & =  \sum_{m=0}^{N/2-1} x_{2m}   e^{-\frac{2\pi i}{N/2} mk} e^{-2\pi m i} +  e^{-\frac{2\pi i}{N}k}e^{-\pi i}
 \sum_{m=0}^{N/2-1} x_{2m+1} e^{-\frac{2\pi i}{N/2} mk} e^{-2\pi m i} \\
 & =  \sum_{m=0}^{N/2-1} x_{2m}   e^{-\frac{2\pi i}{N/2} mk} - e^{-\frac{2\pi i}{N}k}
 \sum_{m=0}^{N/2-1} x_{2m+1} e^{-\frac{2\pi i}{N/2} mk} \\
& =  E_k - e^{-\frac{2\pi i}{N}k} O_k
\end{align}
</math>

We can rewrite <math> X_k </math> and <math> X_{k+\frac{N}{2}} </math> as:
:<math>
\begin{align}
X_k & = E_k + e^{-\frac{2\pi i}{N}{k}} O_k \\
X_{k+\frac{N}{2}} & = E_k - e^{-\frac{2\pi i}{N}{k}} O_k
\end{align}
</math>

This result, expressing the DFT of length ''N'' recursively in terms of two DFTs of size ''N''/2, is the core of the radix-2 DIT fast Fourier transform. The algorithm gains its speed by re-using the results of intermediate computations to compute multiple DFT outputs.  Note that final outputs are obtained by a +/− combination of <math>E_k</math> and <math>O_k \exp(-2\pi i k/N)</math>, which is simply a size-2 DFT (sometimes called a [[butterfly diagram|butterfly]] in this context); when this is generalized to larger radices below, the size-2 DFT is replaced by a larger DFT (which itself can be evaluated with an FFT).
[[File:DIT-FFT-butterfly.svg|thumb|300px|right|Data flow diagram for ''N''=8: a decimation-in-time radix-2 FFT breaks a length-''N'' DFT into two length-''N''/2 DFTs followed by a combining stage consisting of ''N''/2 size-2 DFTs called "butterfly" operations (so called because of the shape of the data-flow diagrams).]]

This process is an example of the general technique of [[divide and conquer algorithm]]s; in many conventional implementations, however, the explicit recursion is avoided, and instead one traverses the computational tree in [[breadth-first search|breadth-first]] fashion.

The above re-expression of a size-''N'' DFT as two size-''N''/2 DFTs is sometimes called the '''[[G. C. Danielson|Danielson]]–[[Cornelius Lanczos|Lanczos]]''' [[lemma (mathematics)|lemma]], since the identity was noted by those two authors in 1942<ref>Danielson, G. C., and C. Lanczos, "Some improvements in practical Fourier analysis and their application to X-ray scattering from liquids," ''J. Franklin Inst.'' '''233''', 365–380 and 435–452 (1942).</ref> (influenced by [[Carl David Tolmé Runge|Runge's]] 1903 work<ref name=Heideman84/>).  They applied their lemma in a "backwards" recursive fashion, repeatedly ''doubling'' the DFT size until the transform spectrum converged (although they apparently didn't realize the [[linearithmic]] [i.e., order ''N''&nbsp;log&nbsp;''N''] asymptotic complexity they had achieved).  The Danielson–Lanczos work predated widespread availability of [[History of computing hardware|mechanical or electronic computer]]s and required [[Human computer|manual calculation]] (possibly with mechanical aids such as [[adding machine]]s); they reported a computation time of 140 minutes for a size-64 DFT operating on [[Fast Fourier transform#FFT algorithms specialized for real or symmetric data|real inputs]] to 3–5 significant digits.  Cooley and Tukey's 1965 paper reported a running time of 0.02 minutes for a size-2048 complex DFT on an [[IBM 7094]] (probably in 36-bit [[floating point|single precision]], ~8 digits).<ref name=CooleyTukey65/>  Rescaling the time by the number of operations, this corresponds roughly to a speedup factor of around 800,000.  (To put the time for the hand calculation in perspective, 140 minutes for size 64 corresponds to an average of at most 16 seconds per floating-point operation, around 20% of which are multiplications.)

=== Pseudocode ===
In [[pseudocode]], the below procedure could be written:<ref name="Johnson08"/>

 ''X''<sub>0,...,''N''−1</sub> ← '''ditfft2'''(''x'', ''N'', ''s''):             ''DFT of (x''<sub>0</sub>, ''x<sub>s</sub>'', ''x''<sub>2''s''</sub>, ..., ''x''<sub>(''N''-1)''s''</sub>):
     '''if''' ''N'' = 1 '''then'''
         ''X''<sub>0</sub> ← ''x''<sub>0</sub>                                     ''trivial size-1 DFT base case''
     '''else'''
         ''X''<sub>0,...,''N''/2−1</sub> ← '''ditfft2'''(''x'', ''N''/2, 2''s'')             ''DFT of (x''<sub>0</sub>, ''x''<sub>2''s''</sub>, ''x''<sub>4''s''</sub>, ..., ''x''<sub>(''N''-2)''s''</sub>)
         ''X<sub>N</sub>''<sub>/2,...,''N''−1</sub> ← '''ditfft2'''(''x''+s, ''N''/2, 2''s'')           ''DFT of (x<sub>s</sub>'', ''x<sub>s</sub>''<sub>+2''s''</sub>, ''x<sub>s</sub>''<sub>+4''s''</sub>, ..., ''x''<sub>(''N''-1)''s''</sub>)
         '''for''' k = 0 to (N/2)-1 '''do'''                      ''combine DFTs of two halves:''
             p ← ''X<sub>k</sub>''
             q ← exp(−2π''i''/''N'' ''k'') ''X<sub>k</sub>''<sub>+''N''/2</sub>
             ''X<sub>k</sub>'' ← p + q 
             ''X<sub>k</sub>''<sub>+''N''/2</sub> ← p − q
         '''end for'''
     '''end if'''

Here, <code>'''ditfft2'''</code>(''x'',''N'',1), computes ''X''=DFT(''x'') [[In-place algorithm|out-of-place]] by a radix-2 DIT FFT, where ''N'' is an integer power of 2 and ''s''=1 is the [[stride of an array|stride]] of the input ''x'' [[Array data structure|array]].  ''x''+''s'' denotes the array starting with ''x<sub>s</sub>''.

(The results are in the correct order in ''X'' and no further [[bit-reversal permutation]] is required; the often-mentioned necessity of a separate bit-reversal stage only arises for certain in-place algorithms, as described below.)

High-performance FFT implementations make many modifications to the implementation of such an algorithm compared to this simple pseudocode.  For example, one can use a larger base case than ''N''=1 to amortize the overhead of recursion, the [[twiddle factor]]s <math>\exp[-2\pi i k/ N]</math> can be precomputed, and larger radices are often used for [[cache (computing)|cache]] reasons; these and other optimizations together can improve the performance by an order of magnitude or more.<ref name="Johnson08">S. G. Johnson and M. Frigo, "[http://cnx.org/content/m16336/ Implementing FFTs in practice]," in ''Fast Fourier Transforms'' (C. S. Burrus, ed.), ch. 11, Rice University, Houston TX: Connexions, September 2008.</ref>  (In many textbook implementations the [[depth-first]] recursion is eliminated in favor of a nonrecursive [[breadth-first]] approach, although depth-first recursion has been argued to have better [[memory locality]].<ref name="Johnson08"/><ref name=Singleton67/>) Several of these ideas are described in further detail below.