== Rate–distortion functions ==

The functions that relate the rate and distortion are found as the solution of the following minimization problem:

:<math>\inf_{Q_{Y\mid X}(y\mid x)} I_Q(Y;X) \text{ subject to } D_Q \le D^*.</math>

Here <math>Q_{Y\mid X}(y\mid x)</math>, sometimes called a test channel, is the [[conditional probability|conditional]] [[probability density function]] (PDF) of the communication channel output (compressed signal) <math>Y</math> for a given input (original signal) <math>X</math>, and <math>I_Q(Y;X)</math> is the '''[[mutual information]]''' between <math>Y</math> and <math>X</math>, defined as

:<math>I(Y;X) = H(Y) - H(Y\mid X), \, </math>

where <math>H(Y)</math> and <math>H(Y\mid X)</math> are the entropy of the output signal ''Y'' and the [[conditional entropy]] of the output signal given the input signal, respectively:

:<math> H(Y) = - \int_{-\infty}^\infty P_Y (y) \log_{2} (P_Y (y))\,dy </math>

:<math> H(Y\mid X) = - \int_{-\infty}^\infty \int_{-\infty}^\infty Q_{Y\mid X}(y\mid x) P_X (x) \log_2 (Q_{Y\mid X} (y\mid x))\, dx\, dy. </math>

The problem can also be formulated as a distortion–rate function, where we find the [[Infimum and supremum|infimum]] over achievable distortions for a given rate constraint. The relevant expression is

:<math>\inf_{Q_{Y\mid X}(y\mid x)} E[D_Q[X,Y]] \text{ subject to } I_Q(Y;X)\leq R. </math>

The two formulations lead to functions which are inverses of each other.

The mutual information can be understood as a measure of the 'prior' uncertainty the receiver has about the sender's signal (''H''(''Y'')), diminished by the uncertainty that remains after receiving information about the sender's signal (<math>H(Y\mid X)</math>). Of course the decrease in uncertainty is due to the communicated amount of information, which is <math>I \left(Y;X \right)</math>.

As an example, if there is ''no'' communication at all, then <math>H(Y\mid X) = H (Y)</math> and <math>I(Y;X) = 0</math>. Alternatively, if the communication channel is perfect and the received signal <math>Y</math> is identical to the signal <math>X</math> at the sender, then <math>H(Y\mid X) = 0</math> and <math>I(Y;X) = H(X) = H(Y)</math>.

In the definition of the rate–distortion function, <math>D_Q</math> and <math>D^{*}</math> are the distortion between <math>X</math> and <math>Y</math> for a given <math>Q_{Y\mid X}(y\mid x)</math> and the prescribed maximum distortion, respectively. When we use the [[mean squared error]] as the distortion measure, we have (for [[amplitude]]-[[continuous signal]]s):

:<math>D_Q = \int_{-\infty}^\infty \int_{-\infty}^\infty P_{X,Y}(x,y) (x-y)^2\, dx\, dy = \int_{-\infty}^\infty \int_{-\infty}^\infty Q_{Y\mid X}(y\mid x)P_{X}(x) (x-y)^2\, dx\, dy. </math>

As the above equations show, calculating a rate–distortion function requires the stochastic description of the input <math>X</math> in terms of the PDF <math>P_X (x)</math>, and then aims at finding the conditional PDF <math>Q_{Y\mid X}(y\mid x)</math> that minimizes the rate for a given distortion <math>D^{*}</math>. These definitions can be formulated measure-theoretically to account for discrete and mixed random variables as well.

An [[Analytical expression|analytical]] solution to this [[Optimization problem|minimization problem]] is often difficult to obtain, except in some instances, two of the best-known of which are offered below.
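To make the quantities in the minimization concrete, the following Python sketch evaluates the distortion <math>D_Q</math> and the mutual information <math>I_Q(Y;X)</math> for a finite-alphabet source and one fixed candidate test channel; the alphabets, distributions and channel used here are illustrative assumptions, not part of the general theory.

<syntaxhighlight lang="python">
import numpy as np

# Illustrative (hypothetical) discrete example: the two quantities D_Q and
# I_Q(Y;X) appearing in the rate-distortion minimization, evaluated for one
# fixed candidate test channel Q(y|x).
x_vals = np.array([-1.0, 0.0, 1.0])   # source alphabet
y_vals = np.array([-0.5, 0.5])        # reproduction alphabet
p_x = np.array([0.25, 0.5, 0.25])     # source distribution P_X(x)

# Candidate test channel Q(y|x); rows indexed by x, columns by y (rows sum to 1).
Q = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.1, 0.9]])

# Squared-error distortion: D_Q = sum_{x,y} P_X(x) Q(y|x) (x - y)^2
d = (x_vals[:, None] - y_vals[None, :]) ** 2
D_Q = np.sum(p_x[:, None] * Q * d)

# Output marginal P_Y(y) = sum_x P_X(x) Q(y|x)
p_y = p_x @ Q

# Mutual information I_Q(Y;X) = sum_{x,y} P_X(x) Q(y|x) log2(Q(y|x) / P_Y(y))
with np.errstate(divide="ignore", invalid="ignore"):
    ratio = np.where(Q > 0, Q / p_y[None, :], 1.0)
I_Q = np.sum(p_x[:, None] * Q * np.log2(ratio))

print(f"distortion D_Q = {D_Q:.4f}, rate I_Q(Y;X) = {I_Q:.4f} bits")
</syntaxhighlight>

The rate–distortion function is then obtained by searching over all such test channels for the one of smallest mutual information whose distortion does not exceed <math>D^*</math>.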
The rate–distortion function of any source is known to obey several fundamental properties, the most important ones being that it is a [[Continuous function|continuous]], [[monotonically decreasing]] [[Convex function|convex]] (U) [[function (mathematics)|function]]; thus the shape of the function in the examples is typical (even measured rate–distortion functions in real life tend to have very similar forms).

Although analytical solutions to this problem are scarce, there are upper and lower bounds to these functions, including the famous [[Shannon lower bound]] (SLB), which in the case of squared error and memoryless sources states that, for arbitrary sources with finite differential entropy,

:<math> R(D) \ge h(X) - h(D) \, </math>

where ''h''(''D'') is the differential entropy of a Gaussian random variable with variance ''D''. This lower bound is extensible to sources with memory and other distortion measures. One important feature of the SLB is that it is asymptotically tight in the low-distortion regime for a wide class of sources, and on some occasions it actually coincides with the rate–distortion function. Shannon lower bounds can generally be found if the distortion between any two numbers can be expressed as a function of the difference between the values of these two numbers.

The [[Blahut–Arimoto algorithm]], co-invented by [[Richard Blahut]], is an elegant iterative technique for numerically obtaining rate–distortion functions of arbitrary finite input/output alphabet sources, and much work has been done to extend it to more general problem instances (a minimal numerical sketch of the iteration appears below).

The computation of the rate–distortion function requires knowledge of the underlying distribution, which is often unavailable in contemporary applications in data science and machine learning. However, this challenge can be addressed using deep-learning-based estimators of the rate–distortion function.<ref>{{Cite journal |last=Tsur |first=Dor |last2=Huleihel |first2=Bashar |last3=Permuter |first3=Haim H. |date=2024 |title=On Rate Distortion via Constrained Optimization of Estimated Mutual Information |url=https://ieeexplore.ieee.org/document/10681556/ |journal=IEEE Access |volume=12 |pages=137970–137987 |doi=10.1109/ACCESS.2024.3462853 |issn=2169-3536|doi-access=free }}</ref> These estimators are typically referred to as 'neural estimators', and involve the optimization of a parametrized variational form of the rate–distortion objective.

When working with stationary sources with memory, it is necessary to modify the definition of the rate–distortion function, and it must be understood in the sense of a limit taken over sequences of increasing length:

:<math> R(D) = \lim_{n \rightarrow \infty} R_n(D) </math>

where

:<math> R_n(D) = \frac{1}{n} \inf_{Q_{Y^n\mid X^n} \in \mathcal{Q}} I(Y^n; X^n) </math>

and

:<math> \mathcal{Q} = \{ Q_{Y^n\mid X^n}(Y^n\mid X^n,X_0): E[d(X^n,Y^n)] \leq D \}, </math>

where superscripts denote a complete sequence up to that time and the subscript 0 indicates the initial state.
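As a rough illustration of the Blahut–Arimoto algorithm mentioned above, the following Python sketch traces points on the rate–distortion curve of a finite-alphabet source. The source distribution, distortion matrix, and values of the slope parameter <math>\beta</math> are assumptions chosen for illustration; practical implementations typically sweep <math>\beta</math> to trace out the whole curve and use a convergence test rather than a fixed iteration count.

<syntaxhighlight lang="python">
import numpy as np

def blahut_arimoto(p_x, dist, beta, n_iter=200):
    """Blahut-Arimoto iteration for a finite-alphabet source (illustrative sketch).

    p_x  : source distribution P_X(x), shape (nx,)
    dist : distortion matrix d(x, y), shape (nx, ny)
    beta : Lagrange (slope) parameter; larger beta favors lower distortion
    Returns one (distortion, rate-in-bits) point on the R(D) curve.
    """
    ny = dist.shape[1]
    q_y = np.full(ny, 1.0 / ny)                       # initial output marginal
    for _ in range(n_iter):
        # Update the test channel Q(y|x) for the current output marginal.
        Q = q_y[None, :] * np.exp(-beta * dist)
        Q /= Q.sum(axis=1, keepdims=True)
        # Update the output marginal induced by Q and P_X.
        q_y = p_x @ Q
    D = np.sum(p_x[:, None] * Q * dist)                        # expected distortion
    R = np.sum(p_x[:, None] * Q * np.log2(Q / q_y[None, :]))   # mutual information (bits)
    return D, R

# Example: Bernoulli(0.5) source with Hamming distortion (values are assumptions).
p_x = np.array([0.5, 0.5])
dist = 1.0 - np.eye(2)                                # Hamming distortion d(x, y)
for beta in (1.0, 2.0, 4.0):
    D, R = blahut_arimoto(p_x, dist, beta)
    closed_form = 1.0 - (-D * np.log2(D) - (1 - D) * np.log2(1 - D))
    print(f"beta={beta:.1f}: D={D:.3f}, R={R:.3f} bits (closed form: {closed_form:.3f})")
</syntaxhighlight>

For this symmetric Bernoulli example the iteration reproduces the closed-form curve <math>R(D)=1-H_b(D)</math> given in the example further below.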
===Memoryless (independent) Gaussian source with squared-error distortion===

If we assume that <math>X</math> is a [[normal distribution|Gaussian]] random variable with [[variance]] <math>\sigma_x^2</math>, and if we assume that successive samples of the signal <math>X</math> are [[stochastically independent]] (or equivalently, the source is ''[[memorylessness|memoryless]]'', or the signal is ''uncorrelated''), we find the following [[analytical expression]] for the rate–distortion function:

:<math> R(D) = \begin{cases} \frac{1}{2}\log_2(\sigma_x^2/D ), & \text{if } 0 \le D \le \sigma_x^2 \\ 0, & \text{if } D > \sigma_x^2. \end{cases} </math> <ref>{{harvnb|Cover|Thomas|2012|p=310}}</ref>

The following figure shows what this function looks like:

[[File:Rate distortion function.png|400px]]

Rate–distortion theory tells us that 'no compression system exists that performs outside the gray area'. The closer a practical compression system is to the red (lower) bound, the better it performs. As a general rule, this bound can only be attained by increasing the coding block length parameter. Nevertheless, even at unit blocklengths one can often find good (scalar) [[Quantization (signal processing)|quantizers]] that operate at distances from the rate–distortion function that are practically relevant.<ref>{{cite book| first = Thomas M. |last=Cover |first2=Joy A. |last2=Thomas |chapter=10. Rate Distortion Theory | title = Elements of Information Theory | publisher = Wiley |orig-year=2006 |year=2012 |chapter-url={{GBurl|VWq5GG6ycxMC|p=301}} |isbn=978-1-118-58577-1 |edition=2nd}}</ref>

This rate–distortion function holds only for Gaussian memoryless sources. It is known that the Gaussian source is the most "difficult" source to encode: for a given mean squared error, it requires the greatest number of bits. The performance of a practical compression system working on, say, images may therefore well be below the <math>R \left(D \right)</math> lower bound shown, because real image sources are not memoryless Gaussian.

===Memoryless (independent) Bernoulli source with Hamming distortion===

The rate–distortion function of a [[Bernoulli random variable]] with Hamming distortion is given by:

:<math> R(D) = \begin{cases} H_b(p)-H_b(D), & 0 \le D \le \min{(p,1-p)} \\ 0, & D > \min{(p,1-p)} \end{cases} </math>

where <math>H_b</math> denotes the [[binary entropy function]].

Plot of the rate–distortion function for <math>p=0.5</math>:

[[File:Rate distortion function Bernoulli.png|400px]]
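Both closed-form expressions above can be evaluated directly. The following Python sketch implements them; the parameter values in the usage lines are illustrative assumptions, not taken from the text.

<syntaxhighlight lang="python">
import numpy as np

def rd_gaussian(D, sigma2):
    """R(D) of a memoryless Gaussian source with variance sigma2, squared-error distortion."""
    return 0.5 * np.log2(sigma2 / D) if 0 < D <= sigma2 else 0.0

def binary_entropy(p):
    """Binary entropy H_b(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def rd_bernoulli(D, p):
    """R(D) of a Bernoulli(p) source with Hamming distortion."""
    return binary_entropy(p) - binary_entropy(D) if 0 <= D <= min(p, 1 - p) else 0.0

# Illustrative values (assumptions):
print(rd_gaussian(D=0.25, sigma2=1.0))   # 1.0 bit per sample
print(rd_bernoulli(D=0.1, p=0.5))        # about 0.531 bits per symbol
</syntaxhighlight>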