Editing Cooley–Tukey FFT algorithm (section)

== Data reordering, bit reversal, and in-place algorithms ==
Although the abstract Cooley–Tukey factorization of the DFT, above, applies in some form to all implementations of the algorithm, much greater diversity exists in the techniques for ordering and accessing the data at each stage of the FFT. Of special interest is the problem of devising an [[in-place algorithm]] that overwrites its input with its output data using only O(1) auxiliary storage.

The best-known reordering technique involves explicit '''bit reversal''' for in-place radix-2 algorithms.  [[Bit-reversal permutation|Bit reversal]] is the [[permutation]] where the data at an index ''n'', written in [[binary numeral system|binary]] with digits ''b''<sub>4</sub>''b''<sub>3</sub>''b''<sub>2</sub>''b''<sub>1</sub>''b''<sub>0</sub> (e.g. 5 digits for ''N''=32 inputs), is transferred to the index with reversed digits ''b''<sub>0</sub>''b''<sub>1</sub>''b''<sub>2</sub>''b''<sub>3</sub>''b''<sub>4</sub> . Consider the last stage of a radix-2 DIT algorithm like the one presented above, where the output is written in-place over the input: when <math>E_k</math> and <math>O_k</math> are combined with a size-2 DFT, those two values are overwritten by the outputs.  However, the two output values should go in the first and second ''halves'' of the output array, corresponding to the ''most'' significant bit ''b''<sub>4</sub> (for ''N''=32); whereas the two inputs <math>E_k</math> and <math>O_k</math> are interleaved in the even and odd elements, corresponding to the ''least'' significant bit ''b''<sub>0</sub>.  Thus, in order to get the output in the correct place, ''b''<sub>0</sub> should take the place of ''b''<sub>4</sub> and the index becomes ''b''<sub>0</sub>''b''<sub>4</sub>''b''<sub>3</sub>''b''<sub>2</sub>''b''<sub>1</sub>. And for next recursive stage, those 4 least significant bits will become ''b''<sub>1</sub>''b''<sub>4</sub>''b''<sub>3</sub>''b''<sub>2</sub>, If you include all of the recursive stages of a radix-2 DIT algorithm, ''all'' the bits must be reversed and thus one must pre-process the input (''or'' post-process the output) with a bit reversal to get in-order output. (If each size-''N''/2 subtransform is to operate on contiguous data, the DIT ''input'' is pre-processed by bit-reversal.) Correspondingly, if you perform all of the steps in reverse order, you obtain a radix-2 DIF algorithm with bit reversal in post-processing (or pre-processing, respectively).

The logarithm (log) used in this algorithm is a base 2 logarithm.

The following is pseudocode for iterative radix-2 FFT algorithm implemented using bit-reversal permutation.<ref name="clrs">{{cite book|last1=Cormen|first1=Thomas H. |last2=Leiserson|first2=Charles|last3=Rivest|first3=Ronald|last4=Stein|first4=Clifford|title=Introduction to algorithms|date=2009|publisher=MIT Press|location=Cambridge, Mass.|isbn=978-0-262-03384-8|pages=915–918|edition=3rd}}</ref>
 
 '''algorithm''' iterative-fft '''is'''
     '''input:''' Array ''a'' of ''n'' complex values where n is a power of 2.
     '''output:''' Array ''A'' the DFT of a.
  
     bit-reverse-copy(a, A)
     ''n'' ← ''a''.length 
     '''for''' ''s'' = 1 '''to''' log(''n'') '''do'''
         ''m'' ← 2<sup>''s''</sup>
         ''ω''<sub>''m''</sub> ← exp(−2π''i''/''m'') 
         '''for''' ''k'' = 0 '''to''' ''n''-1 '''by''' ''m'' '''do'''
             ''ω'' ← 1
             '''for''' ''j'' = 0 '''to''' ''m''/''2'' – 1 '''do'''
                 ''t'' ← ''ω'' ''A''[''k'' + ''j'' + ''m''/2]
                 ''u'' ← ''A''[''k'' + ''j'']
                 ''A''[''k'' + ''j''] ← ''u'' + ''t''
                 ''A''[''k'' + ''j'' + ''m''/2] ← ''u'' – ''t''
                 ''ω'' ← ''ω'' ''ω''<sub>''m''</sub>
    
     '''return''' ''A''

The bit-reverse-copy procedure can be implemented as follows.
 
 '''algorithm''' bit-reverse-copy(''a'',''A'') '''is'''
     '''input:''' Array ''a'' of ''n'' complex values where n is a power of 2.
     '''output:''' Array ''A'' of size ''n''.
 
     ''n'' ← ''a''.length
     '''for''' ''k'' = 0 ''to'' ''n'' – 1 '''do'''
         ''A''[rev(k)] := ''a''[k]

Alternatively, some applications (such as convolution) work equally well on bit-reversed data, so one can perform forward transforms, processing, and then inverse transforms all without bit reversal to produce final results in the natural order.

Many FFT users, however, prefer natural-order outputs, and a separate, explicit bit-reversal stage can have a non-negligible impact on the computation time,<ref name=FrigoJohnson05/> even though bit reversal can be done in O(''N'') time and has been the subject of much research.<ref>{{cite journal |last=Karp |first=Alan H. |title=Bit reversal on uniprocessors |journal=SIAM Review |volume=38 |issue=1 |pages=1–26 |year=1996 |jstor=2132972 |doi=10.1137/1038001|citeseerx=10.1.1.24.2913 }}</ref><ref>{{Cite book|last1=Carter |first1=Larry |first2=Kang Su |last2=Gatlin |title=Proceedings 39th Annual Symposium on Foundations of Computer Science (Cat. No.98CB36280) |chapter=Towards an optimal bit-reversal permutation program |pages=544–553 |year=1998 |doi=10.1109/SFCS.1998.743505 |isbn=978-0-8186-9172-0 |citeseerx=10.1.1.46.9319 |s2cid=14307262 }}</ref><ref>{{cite journal |last1=Rubio |first1=M. |first2=P. |last2=Gómez |first3=K. |last3=Drouiche |title=A new superfast bit reversal algorithm |journal= International Journal of Adaptive Control and Signal Processing|volume=16 |issue=10 |pages=703–707 |year=2002 |doi=10.1002/acs.718 |s2cid=62201722 }}</ref> Also, while the permutation is a bit reversal in the radix-2 case, it is more generally an arbitrary (mixed-base) digit reversal for the mixed-radix case, and the permutation algorithms become more complicated to implement. Moreover, it is desirable on many hardware architectures to re-order intermediate stages of the FFT algorithm so that they operate on consecutive (or at least more localized) data elements. To these ends, a number of alternative implementation schemes have been devised for the Cooley–Tukey algorithm that do not require separate bit reversal and/or involve additional permutations at intermediate stages.

The problem is greatly simplified if it is '''out-of-place''': the output array is distinct from the input array or, equivalently, an equal-size auxiliary array is available.  The {{vanchor|Stockham FFT|text='''Stockham auto-sort'''}} algorithm<ref>Originally attributed to Stockham in W. T. Cochran ''et al.'', [https://dx.doi.org/10.1109/PROC.1967.5957 What is the fast Fourier transform?], ''Proc. IEEE'' vol. 55, 1664–1674 (1967).</ref><ref name=Swarztrauber84/> performs every stage of the FFT out-of-place, typically writing back and forth between two arrays, transposing one "digit" of the indices with each stage, and has been especially popular on [[SIMD]] architectures.<ref name=Swarztrauber84>P. N. Swarztrauber, [https://dx.doi.org/10.1016/S0167-8191(84)90413-7 FFT algorithms for vector computers], ''Parallel Computing'' vol. 1, 45–63 (1984).</ref><ref>{{cite book |last=Swarztrauber |first=P. N. |chapter=Vectorizing the FFTs |editor-first=G. |editor-last=Rodrigue |title=Parallel Computations |publisher=Academic Press |location=New York |year=1982 |pages=[https://archive.org/details/parallelcomputat0000unse/page/51 51–83] |isbn=978-0-12-592101-5 |chapter-url=https://archive.org/details/parallelcomputat0000unse/page/51 }}</ref>  Even greater potential SIMD advantages (more consecutive accesses) have been proposed for the '''Pease''' algorithm,<ref>{{cite journal |last=Pease |first=M. C. |title=An adaptation of the fast Fourier transform for parallel processing |journal=J. ACM |volume=15 |issue=2 |pages=252–264 |year=1968 |doi=10.1145/321450.321457 |s2cid=14610645 |doi-access=free }}</ref> which also reorders out-of-place with each stage, but this method requires separate bit/digit reversal and O(''N'' log ''N'') storage.  One can also directly apply the Cooley–Tukey factorization definition with explicit ([[depth-first search|depth-first]]) recursion and small radices, which produces natural-order out-of-place output with no separate permutation step (as in the pseudocode above) and can be argued to have [[cache-oblivious algorithm|cache-oblivious]] locality benefits on systems with [[cache (computing)|hierarchical memory]].<ref name=Singleton67>{{cite journal |last=Singleton |first=Richard C. |title=On computing the fast Fourier transform |journal=Commun. ACM |volume=10 |issue=10 |year=1967 |pages=647–654 |doi=10.1145/363717.363771 |s2cid=6287781 |doi-access=free }}</ref><ref name=FrigoJohnson05>{{cite journal |last1=Frigo |first1=M. |first2=S. G. |last2=Johnson |url=http://www.fftw.org/fftw-paper-ieee.pdf |title=The Design and Implementation of FFTW3 |journal=Proceedings of the IEEE |volume=93 |issue=2 |pages=216–231 |year=2005 |doi=10.1109/JPROC.2004.840301|bibcode=2005IEEEP..93..216F |citeseerx=10.1.1.66.3097 |s2cid=6644892 }}</ref><ref>{{cite web |last1=Frigo |first1=Matteo |first2=Steven G. |last2=Johnson |title=FFTW |url=http://www.fftw.org/ }} A free ([[GNU General Public License|GPL]]) C library for computing discrete Fourier transforms in one or more dimensions, of arbitrary size, using the Cooley–Tukey algorithm</ref>

A typical strategy for in-place algorithms without auxiliary storage and without separate digit-reversal passes involves small matrix transpositions (which swap individual pairs of digits) at intermediate stages, which can be combined with the radix butterflies to reduce the number of passes over the data.<ref name=FrigoJohnson05/><ref>{{cite journal |last1=Johnson |first1=H. W. |first2=C. S. |last2=Burrus |title=An in-place in-order radix-2 FFT |journal=Proc. ICASSP |pages=28A.2.1–28A.2.4 |year=1984 }}</ref><ref>{{cite journal |last=Temperton |first=C. |title=Self-sorting in-place fast Fourier transform |journal=SIAM Journal on Scientific and Statistical Computing |volume=12 |issue=4 |pages=808–823 |year=1991 |doi=10.1137/0912043 }}</ref><ref>{{cite journal |last1=Qian |first1=Z. |first2=C. |last2=Lu |first3=M. |last3=An |first4=R. |last4=Tolimieri |title=Self-sorting in-place FFT algorithm with minimum working space |journal=IEEE Trans. ASSP |volume=52 |issue=10 |pages=2835–2836 |year=1994 |doi=10.1109/78.324749 |bibcode=1994ITSP...42.2835Q }}</ref><ref>{{cite journal |last=Hegland |first=M. |title=A self-sorting in-place fast Fourier transform algorithm suitable for vector and parallel processing |journal=Numerische Mathematik |volume=68 |issue=4 |pages=507–547 |year=1994 |doi=10.1007/s002110050074 |citeseerx=10.1.1.54.5659 |s2cid=121258187 }}</ref>