==Approximate quantiles from a stream==
Computing approximate quantiles from data arriving from a stream can be done efficiently using compressed data structures. The most popular methods are t-digest<ref name="Dunning2019">{{cite arXiv |author=Dunning, Ted |author2=Ertl, Otmar |title=Computing Extremely Accurate Quantiles Using t-Digests |date=February 2019 |eprint=1902.04023 |class=stat.CO}}</ref> and KLL.<ref name="Karnin2016">{{cite arXiv |author1=Zohar Karnin |author2=Kevin Lang |author3=Edo Liberty |title=Optimal Quantile Approximation in Streams |date=2016 |class=cs.DS |eprint=1603.05346}}</ref> These methods read a stream of values continuously and can, at any time, be queried for the approximate value of a specified quantile.

Both algorithms are based on a similar idea: compressing the stream of values by summarizing identical or similar values with a weight. If the stream consists of 100 repetitions of v1 followed by 100 repetitions of v2, there is no reason to keep a sorted list of 200 elements; it is enough to keep two elements and two counts to recover the quantiles. With more values, these algorithms maintain a trade-off between the number of unique values stored and the precision of the resulting quantiles: some values may be discarded from the stream and contribute to the weight of a nearby value without changing the quantile results too much.

The t-digest maintains a [[data structure]] of bounded size using an approach motivated by ''k''-means clustering to group similar values. The KLL algorithm uses a more sophisticated "compactor" method that leads to better control of the error bounds, at the cost of requiring an unbounded size if errors must be bounded relative to {{mvar|p}}. Both methods belong to the family of ''data sketches'', a subset of [[Streaming algorithm|streaming algorithms]] with useful properties: two t-digest or KLL sketches can be combined into one.
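The weight-based compression idea can be illustrated with a toy sketch. This is not t-digest or KLL: the class name <code>ToySketch</code>, the capacity bound, and the rule of merging the two closest values are simplifications chosen for illustration; the real algorithms use much more careful compression rules to control error.

```python
import bisect

class ToySketch:
    """Toy quantile sketch: at most `capacity` (value, weight) pairs.

    Illustrative only -- t-digest and KLL use more careful compression
    rules; this shows only the shared idea of weighted summarization.
    """
    def __init__(self, capacity=100):
        self.capacity = capacity
        self.pairs = []  # sorted list of [value, weight]

    def add(self, v, w=1):
        i = bisect.bisect_left(self.pairs, [v, 0])
        if i < len(self.pairs) and self.pairs[i][0] == v:
            self.pairs[i][1] += w          # identical value: grow its weight
        else:
            self.pairs.insert(i, [v, w])
        if len(self.pairs) > self.capacity:
            self._compress()

    def _compress(self):
        # Merge the two closest neighbours, pooling their weight at the
        # weighted mean -- this is where approximation error enters.
        gaps = [self.pairs[i + 1][0] - self.pairs[i][0]
                for i in range(len(self.pairs) - 1)]
        i = gaps.index(min(gaps))
        (v1, w1), (v2, w2) = self.pairs[i], self.pairs[i + 1]
        self.pairs[i:i + 2] = [[(v1 * w1 + v2 * w2) / (w1 + w2), w1 + w2]]

    def quantile(self, p):
        # Walk the cumulative weights until a fraction p is covered.
        total = sum(w for _, w in self.pairs)
        cum = 0
        for v, w in self.pairs:
            cum += w
            if cum >= p * total:
                return v
        return self.pairs[-1][0]

    def merge(self, other):
        # Sketches can be combined: replay the other's weighted pairs.
        for v, w in other.pairs:
            self.add(v, w)
```

For the stream of 100 copies of v1 and 100 copies of v2 mentioned above, this sketch stores exactly two weighted pairs and still answers quantile queries correctly; the <code>merge</code> method shows why sketches computed on partitions of the data can be combined afterwards.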
Computing the sketch for a very large vector of values can be split into trivially parallel processes, where sketches are computed for partitions of the vector in parallel and merged later.

The algorithms described so far directly approximate the empirical quantiles without any particular assumptions on the data; in essence, the data are simply numbers or, more generally, a set of items that can be ordered. These are computer-science-derived methods. Another class of algorithms assumes that the data are realizations of a random process. These are statistics-derived methods, in particular sequential nonparametric estimation algorithms. There are a number of such algorithms, such as those based on stochastic approximation<ref name="tierney1983">{{cite journal |last1=Tierney |first1=Luke |title=A space-efficient recursive procedure for estimating a quantile of an unknown distribution |journal=SIAM Journal on Scientific and Statistical Computing |date=1983 |volume=4 |issue=4 |pages=706–711 |doi=10.1137/0904048 |url=https://doi.org/10.1137/0904048 |url-access=subscription}}</ref><ref name="chen2000">{{cite book |last1=Chen |first1=Fei |last2=Lambert |first2=Diane |last3=Pinheiro |first3=Jose |chapter=Incremental quantile estimation for massive tracking |title=Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining |date=2000 |pages=516–522 |doi=10.1145/347090.347195 |isbn=1-58113-233-6 |chapter-url=https://doi.org/10.1145/347090.347195}}</ref> or Hermite series estimators.<ref name="stephanou2017">{{cite journal |last1=Stephanou |first1=Michael |last2=Varughese |first2=Melvin |last3=Macdonald |first3=Iain |title=Sequential quantiles via Hermite series density estimation |journal=Electronic Journal of Statistics |date=2017 |volume=11 |issue=1 |pages=570–607 |doi=10.1214/17-EJS1245 |url=https://doi.org/10.1214/17-EJS1245 |arxiv=1507.05073}}</ref> These statistics-based algorithms typically have constant update time and space complexity, but they offer different error bound guarantees than the computer-science-type methods and make more assumptions. The statistics-based algorithms do, however, present certain advantages, particularly in the non-stationary streaming setting, i.e. with time-varying data. The algorithms of both classes, along with some of their respective advantages and disadvantages, have recently been surveyed.<ref name="StephanouHermiter2022">{{Cite journal |author=Stephanou, M. and Varughese, M. |title=Hermiter: R package for sequential nonparametric estimation |journal=Computational Statistics |year=2023 |volume=39 |issue=3 |pages=1127–1163 |doi=10.1007/s00180-023-01382-0 |arxiv=2111.14091 |s2cid=244715035}}</ref>