== Computation ==

Although the direct application of these formulas would require <math>~ \mathcal{O}(N^2) ~</math> operations, it is possible to compute the same thing with only <math>~ \mathcal{O}(N \log N ) ~</math> complexity by factorizing the computation similarly to the [[fast Fourier transform]] (FFT). One can also compute DCTs via FFTs combined with <math>~\mathcal{O}(N)~</math> pre- and post-processing steps. In general, <math>~\mathcal{O}(N \log N )~</math> methods to compute DCTs are known as fast cosine transform (FCT) algorithms.

The most efficient algorithms, in principle, are usually those that are specialized directly for the DCT, as opposed to using an ordinary FFT plus <math>~ \mathcal{O}(N) ~</math> extra operations (see below for an exception). However, even "specialized" DCT algorithms (including all of those that achieve the lowest known arithmetic counts, at least for [[power of two|power-of-two]] sizes) are typically closely related to FFT algorithms – since DCTs are essentially DFTs of real-even data, one can design a fast DCT algorithm by taking an FFT and eliminating the redundant operations due to this symmetry. This can even be done automatically {{harv|Frigo|Johnson|2005}}. Algorithms based on the [[Cooley–Tukey FFT algorithm]] are most common, but any other FFT algorithm is also applicable. For example, the [[Winograd FFT algorithm]] leads to minimal-multiplication algorithms for the DFT, albeit generally at the cost of more additions, and a similar algorithm was proposed by {{harv|Feig|Winograd|1992a}} for the DCT. Because the algorithms for DFTs, DCTs, and similar transforms are all so closely related, any improvement in algorithms for one transform will theoretically lead to immediate gains for the other transforms as well {{harv|Duhamel|Vetterli|1990}}.

While DCT algorithms that employ an unmodified FFT often have some theoretical overhead compared to the best specialized DCT algorithms, the former also have a distinct advantage: highly optimized FFT programs are widely available. Thus, in practice, it is often easier to obtain high performance for general lengths {{mvar|N}} with FFT-based algorithms.{{efn| Algorithmic performance on modern hardware is typically not principally determined by simple arithmetic counts, and optimization requires substantial engineering effort to make best use, within its intrinsic limits, of available built-in hardware optimization. }} Specialized DCT algorithms, on the other hand, see widespread use for transforms of small, fixed sizes such as the {{nobr| 8 × 8 }} DCT-II used in [[JPEG]] compression, or the small DCTs (or MDCTs) typically used in audio compression. (Reduced code size may also be a reason to use a specialized DCT for embedded-device applications.)

In fact, even the DCT algorithms using an ordinary FFT are sometimes equivalent to pruning the redundant operations from a larger FFT of real-symmetric data, and they can even be optimal from the perspective of arithmetic counts. For example, a type-II DCT is equivalent to a DFT of size <math>~ 4N ~</math> with real-even symmetry whose even-indexed elements are zero.
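This equivalence can be checked numerically. The following sketch (a minimal illustration assuming [[NumPy]]; the function names are not from any particular library) embeds a length-{{mvar|N}} input into a length-<math>4N</math> real-even sequence whose even-indexed elements are zero, takes an ordinary complex FFT, and compares the result against the direct <math>\mathcal{O}(N^2)</math> formula for the DCT-II as defined above:

<syntaxhighlight lang="python">
import numpy as np

def dct2_direct(x):
    """Direct O(N^2) evaluation of the unnormalized DCT-II:
    X_k = sum_n x_n cos[pi/N (n + 1/2) k]."""
    N = len(x)
    n = np.arange(N)
    return np.array([np.sum(x * np.cos(np.pi / N * (n + 0.5) * k)) for k in range(N)])

def dct2_via_4n_fft(x):
    """DCT-II as a size-4N DFT of real-even data whose even-indexed elements are zero."""
    N = len(x)
    y = np.zeros(4 * N)
    y[1:2 * N:2] = x            # odd indices 1, 3, ..., 2N-1 hold the input
    y[2 * N + 1::2] = x[::-1]   # even symmetry: y[4N - j] = y[j]
    Y = np.fft.fft(y)           # ordinary complex FFT of length 4N
    return Y[:N].real / 2       # first N outputs, halved by the symmetric doubling

x = np.random.rand(8)
assert np.allclose(dct2_direct(x), dct2_via_4n_fft(x))
</syntaxhighlight>

This brute-force construction wastes a factor of four in transform length; the specialized and pruned algorithms discussed in this section remove exactly that redundancy.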
One of the most common methods for computing this real-even DFT via an FFT (e.g. the method used in [[FFTPACK]] and [[FFTW]]) was described by {{harvtxt|Narasimha|Peterson|1978}} and {{harvtxt|Makhoul|1980}}, and in hindsight this method can be seen as one step of a radix-4 decimation-in-time Cooley–Tukey algorithm applied to the "logical" real-even DFT corresponding to the DCT-II.{{efn| The radix-4 step reduces the size <math>~ 4N ~</math> DFT to four size <math>~ N ~</math> DFTs of real data, two of which are zero and two of which are equal to one another by the even symmetry. This gives a single size <math>~ N ~</math> FFT of real data plus <math>~ \mathcal{O}(N) ~</math> [[butterfly (FFT algorithm)|butterflies]], once the trivial and/or duplicate parts are eliminated and/or merged. }} Because the even-indexed elements are zero, this radix-4 step is exactly the same as a split-radix step. If the subsequent size <math>~ N ~</math> real-data FFT is also performed by a real-data [[split-radix FFT algorithm|split-radix algorithm]] (as in {{harvtxt|Sorensen|Jones|Heideman|Burrus|1987}}), then the resulting algorithm actually matches what was long the lowest published arithmetic count for the power-of-two DCT-II (<math>~ 2 N \log_2 N - N + 2 ~</math> real-arithmetic operations{{efn| The precise count of real arithmetic operations, and in particular the count of real multiplications, depends somewhat on the scaling of the transform definition. The <math>~ 2 N \log_2 N - N + 2 ~</math> count is for the DCT-II definition shown here; two multiplications can be saved if the transform is scaled by an overall <math>\sqrt2</math> factor. Additional multiplications can be saved if one permits the outputs of the transform to be rescaled individually, as was shown by {{harvtxt|Arai|Agui|Nakajima|1988}} for the size-8 case used in JPEG. }}). A more recent reduction in the operation count to <math>~ \tfrac{17}{9} N \log_2 N + \mathcal{O}(N)</math> also uses a real-data FFT.<ref>{{cite journal |doi=10.1016/j.sigpro.2008.01.004 |title=Type-II/III DCT/DST algorithms with reduced number of arithmetic operations |journal=Signal Processing |volume=88 |issue=6 |pages=1553–1564 |year=2008 |last1=Shao |first1=Xuancheng |last2=Johnson |first2=Steven G. |arxiv=cs/0703150 |bibcode=2008SigPr..88.1553S |s2cid=986733}}</ref> So there is nothing intrinsically bad about computing the DCT via an FFT from an arithmetic perspective – it is sometimes merely a question of whether the corresponding FFT algorithm is optimal. (As a practical matter, the function-call overhead in invoking a separate FFT routine might be significant for small <math>~ N ~,</math> but this is an implementation rather than an algorithmic question, since it can be solved by unrolling or inlining.)
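To make the <math>\mathcal{O}(N)</math> pre- and post-processing concrete, a formulation along the lines of {{harvtxt|Makhoul|1980}} can be sketched as follows (assuming [[NumPy]], and using a complex length-{{mvar|N}} FFT for simplicity rather than a real-data FFT; the function name is illustrative, and this is not the actual FFTPACK or FFTW code): the input is permuted, a single length-{{mvar|N}} FFT is taken, and the DCT-II is recovered from the real part after multiplying by unit-magnitude twiddle factors.

<syntaxhighlight lang="python">
import numpy as np

def dct2_via_n_fft(x):
    """DCT-II of length N via one complex FFT of length N plus O(N) pre/post-processing."""
    N = len(x)
    # Pre-processing: even-indexed samples in order, then odd-indexed samples reversed.
    v = np.concatenate((x[0::2], x[1::2][::-1]))
    V = np.fft.fft(v)                      # single length-N FFT
    k = np.arange(N)
    # Post-processing: twiddle by exp(-i*pi*k/(2N)) and keep the real part.
    return (np.exp(-1j * np.pi * k / (2 * N)) * V).real

# Check against the direct O(N^2) formula for the DCT-II definition used above.
x = np.random.rand(8)
N = len(x)
n = np.arange(N)
direct = np.array([np.sum(x * np.cos(np.pi / N * (n + 0.5) * k)) for k in range(N)])
assert np.allclose(dct2_via_n_fft(x), direct)
</syntaxhighlight>

In this sketch the permutation and the twiddle multiplication are the <math>\mathcal{O}(N)</math> pre- and post-processing steps; all of the <math>\mathcal{O}(N \log N)</math> work is in the single FFT call.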