Cooley–Tukey FFT algorithm
== Variations ==

There are many other variations on the Cooley–Tukey algorithm. '''Mixed-radix''' implementations handle composite sizes with a variety of (typically small) factors in addition to two, usually (but not always) employing the O(''N''<sup>2</sup>) algorithm for the prime base cases of the recursion <nowiki>(</nowiki>it is also possible to employ an ''N'' log ''N'' algorithm for the prime base cases, such as [[Rader's FFT algorithm|Rader]]'s or [[Bluestein's FFT algorithm|Bluestein]]'s algorithm<nowiki>)</nowiki>. [[Split-radix FFT algorithm|Split radix]] merges radices 2 and 4, exploiting the fact that the first transform of radix 2 requires no twiddle factor, in order to achieve what was long the lowest known arithmetic operation count for power-of-two sizes,<ref name=DuhamelVe90/> although recent variations achieve an even lower count.<ref>Lundy, T., and J. Van Buskirk, "A new matrix approach to real FFTs and convolutions of length 2<sup>''k''</sup>," ''Computing'' '''80''', 23–45 (2007).</ref><ref>Johnson, S. G., and M. Frigo, "[http://www.fftw.org/newsplit.pdf A modified split-radix FFT with fewer arithmetic operations]," ''IEEE Trans. Signal Process.'' '''55''' (1), 111–119 (2007).</ref> (On present-day computers, performance is determined more by [[CPU cache|cache]] and [[CPU pipeline]] considerations than by strict operation counts; well-optimized FFT implementations often employ larger radices and/or hard-coded base-case transforms of significant size.<ref name=FrigoJohnson05/>)

Another way of looking at the Cooley–Tukey algorithm is that it re-expresses a size-''N'' one-dimensional DFT as an ''N''<sub>1</sub> by ''N''<sub>2</sub> two-dimensional DFT (plus twiddles), where the output matrix is [[transpose]]d. The net result of all of these transpositions, for a radix-2 algorithm, corresponds to a bit reversal of the input (DIF) or output (DIT) indices.
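For a radix-2 algorithm, the accumulated transpositions amount to the bit-reversal permutation just described. A minimal sketch in Python (the function name is illustrative, not from any particular library), assuming the transform length is a power of two:

```python
def bit_reverse_indices(n):
    """Return the bit-reversed index permutation for a length-n transform.

    n must be a power of two; index i maps to the integer whose binary
    digits, written in log2(n) bits, are those of i read in reverse order.
    """
    bits = n.bit_length() - 1  # log2(n) for a power of two
    return [int(format(i, "0{}b".format(bits))[::-1], 2) for i in range(n)]

# A decimation-in-time radix-2 FFT consumes its input in this order:
print(bit_reverse_indices(8))  # -> [0, 4, 2, 6, 1, 5, 3, 7]
```

Applying the permutation twice returns the natural order, which is why in-place radix-2 FFTs can perform the reordering with simple swaps.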
If, instead of using a small radix, one employs a radix of roughly {{radic|''N''}} and explicit input/output matrix transpositions, it is called a [[four-step FFT]] algorithm (or ''six-step'', depending on the number of transpositions). It was initially proposed to improve memory locality,<ref name=GenSande66>Gentleman W. M., and G. Sande, "Fast Fourier transforms—for fun and profit," ''Proc. AFIPS'' '''29''', 563–578 (1966).</ref><ref name=Bailey90>Bailey, David H., "FFTs in external or hierarchical memory," ''J. Supercomputing'' '''4''' (1), 23–35 (1990).</ref> e.g. for cache optimization or [[out-of-core]] operation, and was later shown to be an optimal [[cache-oblivious algorithm]].<ref name=Frigo99>M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran, "Cache-oblivious algorithms," in ''Proceedings of the 40th IEEE Symposium on Foundations of Computer Science'' (FOCS 99), pp. 285–297 (1999). [https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=814600 Extended abstract at IEEE], [http://citeseer.ist.psu.edu/307799.html at Citeseer].</ref>

The general Cooley–Tukey factorization rewrites the indices ''k'' and ''n'' as <math>k = N_2 k_1 + k_2</math> and <math>n = N_1 n_2 + n_1</math>, respectively, where the indices ''k''<sub>''a''</sub> and ''n''<sub>''a''</sub> run from 0 to ''N''<sub>''a''</sub>−1 (for ''a'' of 1 or 2). That is, it re-indexes the input (''n'') and output (''k'') as ''N''<sub>1</sub> by ''N''<sub>2</sub> two-dimensional arrays in [[column-major order|column-major]] and [[row-major order]], respectively; the difference between these indexings is a transposition, as mentioned above.
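The two index maps can be made concrete with a toy size, say ''N''<sub>1</sub> = 3 and ''N''<sub>2</sub> = 4 (values chosen purely for illustration):

```python
# Cooley-Tukey re-indexing for N = N1*N2 = 12 (sizes chosen for illustration).
N1, N2 = 3, 4

# Input n = N1*n2 + n1: the input is viewed as an N1-by-N2 array
# filled in column-major order (consecutive n runs down each column).
n_grid = [[N1 * n2 + n1 for n2 in range(N2)] for n1 in range(N1)]

# Output k = N2*k1 + k2: the output is viewed as an N1-by-N2 array
# filled in row-major order (consecutive k runs along each row).
k_grid = [[N2 * k1 + k2 for k2 in range(N2)] for k1 in range(N1)]

for row in n_grid:
    print(row)  # [0, 3, 6, 9] / [1, 4, 7, 10] / [2, 5, 8, 11]
for row in k_grid:
    print(row)  # [0, 1, 2, 3] / [4, 5, 6, 7] / [8, 9, 10, 11]
```

Both grids contain the indices 0..11 exactly once; the two fill orders differ by a transposition, which is the transposition referred to in the text.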
When this re-indexing is substituted into the DFT formula for ''nk'', the <math>N_1 n_2 N_2 k_1</math> cross term vanishes (its exponential is unity), and the remaining terms give

:<math>X_{N_2 k_1 + k_2} = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} x_{N_1 n_2 + n_1} e^{-\frac{2\pi i}{N_1 N_2} \cdot (N_1 n_2 + n_1) \cdot (N_2 k_1 + k_2) }</math>
::<math>= \sum_{n_1=0}^{N_1-1} \left[ e^{-\frac{2\pi i}{N_1 N_2} n_1 k_2 } \right] \left( \sum_{n_2=0}^{N_2-1} x_{N_1 n_2 + n_1} e^{-\frac{2\pi i}{N_2} n_2 k_2 } \right) e^{-\frac{2\pi i}{N_1} n_1 k_1 } </math>
::<math>= \sum_{n_1=0}^{N_1-1} \left( \sum_{n_2=0}^{N_2-1} x_{N_1 n_2 + n_1} e^{-\frac{2\pi i}{N_2} n_2 k_2 } \right) e^{-\frac{2\pi i}{N_1 N_2} n_1 (N_2 k_1 + k_2) } </math>

where each inner sum is a DFT of size ''N''<sub>2</sub>, each outer sum is a DFT of size ''N''<sub>1</sub>, and the <nowiki>[...]</nowiki> bracketed term is the twiddle factor.

An arbitrary radix ''r'' (as well as mixed radices) can be employed, as was shown by both Cooley and Tukey<ref name=CooleyTukey65/> as well as Gauss (who gave examples of radix-3 and radix-6 steps).<ref name=Heideman84/> Cooley and Tukey originally assumed that the radix butterfly required O(''r''<sup>2</sup>) work and hence reckoned the complexity for a radix ''r'' to be O(''r''<sup>2</sup> ''N''/''r'' log<sub>''r''</sub>''N'') = O(''N'' log<sub>2</sub>(''N'') ''r''/log<sub>2</sub>''r''); from calculating ''r''/log<sub>2</sub>''r'' for integer values of ''r'' from 2 to 12, the optimal radix is found to be 3 (the closest integer to ''[[e (mathematical constant)|e]]'', which minimizes ''r''/log<sub>2</sub>''r'').<ref name=CooleyTukey65/><ref>Cooley, J. W., P. Lewis and P. Welch, "The Fast Fourier Transform and its Applications," ''IEEE Trans. on Education'' '''12''' (1), 28–34 (1969).</ref>

This analysis was erroneous, however: the radix butterfly is also a DFT and can be performed via an FFT algorithm in O(''r'' log ''r'') operations, hence the radix ''r'' actually cancels in the complexity O(''r'' log(''r'') ''N''/''r'' log<sub>''r''</sub>''N''), and the optimal ''r'' is determined by more complicated considerations. In practice, quite large values of ''r'' (32 or 64) are important in order to effectively exploit e.g. the large number of [[processor register]]s on modern processors,<ref name=FrigoJohnson05/> and even an unbounded radix ''r'' = {{radic|''N''}} also achieves O(''N'' log ''N'') complexity and has theoretical and practical advantages for large ''N'', as mentioned above.<ref name=GenSande66/><ref name=Bailey90/><ref name=Frigo99/>
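The factorization can be checked numerically. The sketch below (function names are illustrative) computes one Cooley–Tukey decomposition step exactly as in the displayed equation, inner size-''N''<sub>2</sub> DFTs over ''n''<sub>2</sub>, twiddle factors, then outer size-''N''<sub>1</sub> DFTs over ''n''<sub>1</sub>, and compares the result against a naive DFT:

```python
import cmath

def naive_dft(x):
    """Reference O(N^2) DFT: X_k = sum_n x_n * exp(-2*pi*i*n*k/N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def cooley_tukey_step(x, N1, N2):
    """One Cooley-Tukey step for N = N1*N2: inner size-N2 DFTs,
    twiddle factors exp(-2*pi*i*n1*k2/N), then outer size-N1 DFTs."""
    N = N1 * N2
    X = [0j] * N
    for k1 in range(N1):
        for k2 in range(N2):
            outer = 0j
            for n1 in range(N1):
                # inner sum: size-N2 DFT over n2 of the n1-th "column" of the input
                inner = sum(x[N1 * n2 + n1] * cmath.exp(-2j * cmath.pi * n2 * k2 / N2)
                            for n2 in range(N2))
                twiddle = cmath.exp(-2j * cmath.pi * n1 * k2 / N)
                # outer sum: size-N1 DFT over n1, after multiplying by the twiddle
                outer += twiddle * inner * cmath.exp(-2j * cmath.pi * n1 * k1 / N1)
            X[N2 * k1 + k2] = outer  # output stored in row-major order
    return X

x = [complex(j % 5, -j % 3) for j in range(12)]
err = max(abs(a - b) for a, b in zip(cooley_tukey_step(x, 3, 4), naive_dft(x)))
print(err < 1e-9)  # True: the decomposition agrees with the DFT to rounding error
```

Applying `cooley_tukey_step` recursively to the inner and outer sums (instead of evaluating them directly) yields the full O(''N'' log ''N'') algorithm for composite ''N''.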