License: Creative Commons Attribution 3.0 Unported license (CC BY 3.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.ITCS.2018.47
URN: urn:nbn:de:0030-drops-83215
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2018/8321/
Qiao, Mingda ;
Valiant, Gregory
Learning Discrete Distributions from Untrusted Batches
Abstract
We consider the problem of learning a discrete distribution in the presence of an epsilon fraction of malicious data sources. Specifically, we consider the setting where there is some underlying distribution, p, and each data source provides a batch of >= k samples, with the guarantee that at least a (1 - epsilon) fraction of the sources draw their samples from a distribution with total variation distance at most \eta from p. We make no assumptions on the data provided by the remaining epsilon fraction of sources--this data can even be chosen as an adversarial function of the (1 - epsilon) fraction of "good" batches. We provide two algorithms: one with runtime exponential in the support size, n, but polynomial in k, 1/epsilon and 1/eta that takes O((n + k)/epsilon^2) batches and recovers p to error O(eta + epsilon/sqrt(k)). This recovery accuracy is information theoretically optimal, to constant factors, even given an infinite number of data sources. Our second algorithm applies to the eta = 0 setting and also achieves an O(epsilon/sqrt(k)) recover guarantee, though it runs in poly((nk)^k) time. This second algorithm, which approximates a certain tensor via a rank-1 tensor minimizing l_1 distance, is surprising in light of the hardness of many low-rank tensor approximation problems, and may be of independent interest.
BibTeX - Entry
@InProceedings{qiao_et_al:LIPIcs:2018:8321,
author = {Mingda Qiao and Gregory Valiant},
title = {{Learning Discrete Distributions from Untrusted Batches}},
booktitle = {9th Innovations in Theoretical Computer Science Conference (ITCS 2018)},
pages = {47:1--47:20},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {978-3-95977-060-6},
ISSN = {1868-8969},
year = {2018},
volume = {94},
editor = {Anna R. Karlin},
publisher = {Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik},
address = {Dagstuhl, Germany},
URL = {http://drops.dagstuhl.de/opus/volltexte/2018/8321},
URN = {urn:nbn:de:0030-drops-83215},
doi = {10.4230/LIPIcs.ITCS.2018.47},
annote = {Keywords: robust statistics, information-theoretic optimality}
}
Keywords: |
|
robust statistics, information-theoretic optimality |
Collection: |
|
9th Innovations in Theoretical Computer Science Conference (ITCS 2018) |
Issue Date: |
|
2018 |
Date of publication: |
|
12.01.2018 |