== Theory ==
The theoretical basis for compression is provided by [[information theory]] and, more specifically, [[Shannon's source coding theorem]]; domain-specific theories include [[algorithmic information theory]] for lossless compression and [[rate–distortion theory]] for lossy compression. These areas of study were essentially created by [[Claude Shannon]], who published fundamental papers on the topic in the late 1940s and early 1950s. Other topics associated with compression include [[coding theory]] and [[statistical inference]].<ref name="Marak"/>

=== Machine learning ===
There is a close connection between [[machine learning]] and compression. A system that predicts the [[posterior probabilities]] of a sequence given its entire history can be used for optimal data compression (by using [[arithmetic coding]] on the output distribution). Conversely, an optimal compressor can be used for prediction (by finding the symbol that compresses best, given the previous history). This equivalence has been used as a justification for using data compression as a benchmark for "general intelligence".<ref name="Mahoney"/><ref name="Market Efficiency"/><ref name="Ben-Gal"/>

An alternative view holds that compression algorithms implicitly map strings into implicit [[feature space vector]]s, and that compression-based similarity measures compute similarity within these feature spaces. For each compressor C(.) an associated vector space ℵ is defined, such that C(.) maps an input string x to the vector norm ||~x||. Rather than exhaustively examining the feature spaces underlying every compression algorithm, the analysis focuses on three representative lossless compression methods: LZW, LZ77, and PPM.<ref name="ScullyBrodley"/>

According to [[AIXI]] theory, a connection more directly explained in [[Hutter Prize]], the best possible compression of x is the smallest possible software that generates x. For example, in that model, a zip file's compressed size includes both the zip file and the unzipping software, since the file cannot be unzipped without both, but there may be an even smaller combined form.

Examples of AI-powered audio/video compression software include [[NVIDIA Maxine]] and AIVC.<ref>{{cite web |author1=Gary Adcock |title=What Is AI Video Compression? |url=https://massive.io/file-transfer/what-is-ai-video-compression/ |website=massive.io |access-date=6 April 2023 |date=January 5, 2023}}</ref> Examples of software that can perform AI-powered image compression include [[OpenCV]], [[TensorFlow]], [[MATLAB]]'s Image Processing Toolbox (IPT) and High-Fidelity Generative Image Compression.<ref>{{cite arXiv |last1=Mentzer |first1=Fabian |last2=Toderici |first2=George |last3=Tschannen |first3=Michael |last4=Agustsson |first4=Eirikur |title=High-Fidelity Generative Image Compression |year=2020 |class=eess.IV |eprint=2006.09965}}</ref>

In [[unsupervised machine learning]], [[k-means clustering]] can be utilized to compress data by grouping similar data points into clusters. This technique simplifies handling extensive datasets that lack predefined labels and finds widespread use in fields such as [[image compression]].<ref>{{Cite web |title=What is Unsupervised Learning? {{!}} IBM |url=https://www.ibm.com/topics/unsupervised-learning |access-date=2024-02-05 |website=www.ibm.com |date=23 September 2021 |language=en-us}}</ref> Data compression aims to reduce the size of data files, enhancing storage efficiency and speeding up data transmission.
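This use of clustering for compression (described in more detail in the next paragraph) can be sketched in a few lines. The following is a minimal, illustrative example of k-means colour quantization, assuming [[NumPy]] and [[scikit-learn]] are available; the image data, cluster count and variable names are hypothetical and are not drawn from the cited sources.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical example: an RGB image as an (H, W, 3) array of bytes.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

# Treat every pixel as a 3-dimensional data point.
pixels = image.reshape(-1, 3).astype(np.float64)

# Cluster the pixels into k representative colours (the centroids).
k = 16
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)

# "Compressed" representation: a small palette of k colours plus,
# for every pixel, the index of its nearest centroid.
palette = kmeans.cluster_centers_.astype(np.uint8)   # k x 3 bytes
indices = kmeans.labels_.astype(np.uint8)            # one index per pixel

# Lossy reconstruction: each pixel is replaced by its centroid colour.
reconstructed = palette[indices].reshape(image.shape)
</syntaxhighlight>

In this sketch the stored data are the small colour palette and one centroid index per pixel, which take considerably less space than three bytes per pixel; the scheme is lossy, since reconstruction replaces every pixel with its nearest centroid.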
K-means clustering, an unsupervised machine learning algorithm, is employed to partition a dataset into a specified number of clusters, k, each represented by the [[centroid]] of its points. This process condenses extensive datasets into a more compact set of representative points. Particularly beneficial in [[Image processing|image]] and [[signal processing]], k-means clustering aids in data reduction by replacing groups of data points with their centroids, thereby preserving the core information of the original data while significantly decreasing the required storage space.<ref>{{Cite web |date=2023-05-25 |title=Differentially private clustering for large-scale datasets |url=https://blog.research.google/2023/05/differentially-private-clustering-for.html |access-date=2024-03-16 |website=blog.research.google |language=en}}</ref>

[[Large language model]]s (LLMs) are also efficient lossless data compressors on some data sets, as demonstrated by [[DeepMind]]'s research with the Chinchilla 70B model. Developed by DeepMind, Chinchilla 70B effectively compressed data, outperforming conventional methods such as [[Portable Network Graphics]] (PNG) for images and [[Free Lossless Audio Codec]] (FLAC) for audio. It achieved compression of image and audio data to 43.4% and 16.4% of their original sizes, respectively. There is, however, some reason to be concerned that the data set used for testing overlaps the LLM training data set, making it possible that the Chinchilla 70B model is only an efficient compression tool on data it has already been trained on.<ref>{{Cite web |last=Edwards |first=Benj |date=2023-09-28 |title=AI language models can exceed PNG and FLAC in lossless compression, says study |url=https://arstechnica.com/information-technology/2023/09/ai-language-models-can-exceed-png-and-flac-in-lossless-compression-says-study/ |access-date=2024-03-07 |website=Ars Technica |language=en-us}}</ref><ref>{{Cite arXiv |eprint=2309.10668 |last1=Delétang |first1=Grégoire |last2=Ruoss |first2=Anian |last3=Duquenne |first3=Paul-Ambroise |last4=Catt |first4=Elliot |last5=Genewein |first5=Tim |last6=Mattern |first6=Christopher |last7=Grau-Moya |first7=Jordi |author8=Li Kevin Wenliang |last9=Aitchison |first9=Matthew |last10=Orseau |first10=Laurent |last11=Hutter |first11=Marcus |last12=Veness |first12=Joel |title=Language Modeling is Compression |date=2023 |class=cs.LG }}</ref>

=== Data differencing ===
[[File:Nubio Diff Screenshot3.png|thumb|[[File comparison|Comparison]] of two revisions of a file]]
Data compression can be viewed as a special case of [[data differencing]].<ref name="RFC 3284"/><ref name="Vdelta"/> Data differencing consists of producing a ''difference'' given a ''source'' and a ''target,'' with patching reproducing the ''target'' given a ''source'' and a ''difference.'' Since there is no separate source and target in data compression, one can consider data compression as data differencing with empty source data, the compressed file corresponding to a difference from nothing. This is the same as considering absolute [[entropy (information theory)|entropy]] (corresponding to data compression) as a special case of [[relative entropy]] (corresponding to data differencing) with no initial data. The term ''differential compression'' is used to emphasize the data differencing connection.
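This view can be made concrete with a compressor that accepts a preset dictionary, such as zlib. The following is a minimal, illustrative sketch (the example strings are hypothetical and not drawn from the cited sources): compressing the ''target'' against the ''source'' supplied as a dictionary plays the role of data differencing, while compressing with no dictionary, i.e. an empty source, is ordinary compression.

<syntaxhighlight lang="python">
import zlib

source = b"Data differencing produces a difference from a source and a target."
target = b"Data differencing produces a difference from a source and a target; patching rebuilds the target from the source and the difference."

def deflate(data, dictionary=None):
    # With a dictionary, the output is a "difference" relative to that source;
    # without one, it is plain compression, a difference from nothing.
    if dictionary is None:
        co = zlib.compressobj(level=9)
    else:
        co = zlib.compressobj(level=9, zdict=dictionary)
    return co.compress(data) + co.flush()

difference = deflate(target, dictionary=source)  # data differencing
compressed = deflate(target)                     # plain data compression

# Patching: the target is reproduced from the source plus the difference.
do = zlib.decompressobj(zdict=source)
assert do.decompress(difference) + do.flush() == target

print(len(target), len(compressed), len(difference))
</syntaxhighlight>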