License: Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.ICALP.2021.104
URN: urn:nbn:de:0030-drops-141731
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2021/14173/
Pettie, Seth ;
Wang, Dingyu ;
Yin, Longhui
Non-Mergeable Sketching for Cardinality Estimation
Abstract
Cardinality estimation is perhaps the simplest non-trivial statistical problem that can be solved via sketching. Industrially-deployed sketches like HyperLogLog, MinHash, and PCSA are mergeable, which means that large data sets can be sketched in a distributed environment, and then merged into a single sketch of the whole data set. In the last decade a variety of sketches have been developed that are non-mergeable, but attractive for other reasons. They are simpler, their cardinality estimates are strictly unbiased, and they have substantially lower variance.
We evaluate sketching schemes on a reasonably level playing field, in terms of their memory-variance product (MVP). E.g., a sketch that occupies 5m bits and whose relative variance is 2/m (standard error √{2/m}) has an MVP of 10. Our contributions are as follows.
- Cohen [Edith Cohen, 2015] and Ting [Daniel Ting, 2014] independently discovered what we call the {Martingale transform} for converting a mergeable sketch into a non-mergeable sketch. We present a simpler way to analyze the limiting MVP of Martingale-type sketches.
- Pettie and Wang proved that the Fishmonger sketch [Seth Pettie and Dingyu Wang, 2021] has the best MVP, H₀/I₀ ≈ 1.98, among a class of mergeable sketches called "linearizable" sketches. (H₀ and I₀ are precisely defined constants.) We prove that the Martingale transform is optimal in the non-mergeable world, and that Martingale Fishmonger in particular is optimal among linearizable sketches, with an MVP of H₀/2 ≈ 1.63. E.g., this is circumstantial evidence that to achieve 1% standard error, we cannot do better than a 2 kilobyte sketch.
- Martingale Fishmonger is neither simple nor practical. We develop a new mergeable sketch called Curtain that strikes a nice balance between simplicity and efficiency, and prove that Martingale Curtain has limiting MVP≈ 2.31. It can be updated with O(1) memory accesses and it has lower empirical variance than Martingale LogLog, a practical non-mergeable version of HyperLogLog.
BibTeX - Entry
@InProceedings{pettie_et_al:LIPIcs.ICALP.2021.104,
author = {Pettie, Seth and Wang, Dingyu and Yin, Longhui},
title = {{Non-Mergeable Sketching for Cardinality Estimation}},
booktitle = {48th International Colloquium on Automata, Languages, and Programming (ICALP 2021)},
pages = {104:1--104:20},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {978-3-95977-195-5},
ISSN = {1868-8969},
year = {2021},
volume = {198},
editor = {Bansal, Nikhil and Merelli, Emanuela and Worrell, James},
publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/opus/volltexte/2021/14173},
URN = {urn:nbn:de:0030-drops-141731},
doi = {10.4230/LIPIcs.ICALP.2021.104},
annote = {Keywords: Cardinality Estimation, Sketching}
}
Keywords: |
|
Cardinality Estimation, Sketching |
Collection: |
|
48th International Colloquium on Automata, Languages, and Programming (ICALP 2021) |
Issue Date: |
|
2021 |
Date of publication: |
|
02.07.2021 |