License: Creative Commons Attribution 3.0 Unported license (CC BY 3.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.APPROX-RANDOM.2019.62
URN: urn:nbn:de:0030-drops-112778
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2019/11277/
Go to the corresponding LIPIcs Volume Portal


Braverman, Vladimir ; Feldman, Dan ; Lang, Harry ; Rus, Daniela

Streaming Coreset Constructions for M-Estimators

pdf-format:
LIPIcs-APPROX-RANDOM-2019-62.pdf (0.5 MB)


Abstract

We introduce a new method of maintaining a (k,epsilon)-coreset for clustering M-estimators over insertion-only streams. Let (P,w) be a weighted set (where w : P - > [0,infty) is the weight function) of points in a rho-metric space (meaning a set X equipped with a positive-semidefinite symmetric function D such that D(x,z) <=rho(D(x,y) + D(y,z)) for all x,y,z in X). For any set of points C, we define COST(P,w,C) = sum_{p in P} w(p) min_{c in C} D(p,c). A (k,epsilon)-coreset for (P,w) is a weighted set (Q,v) such that for every set C of k points, (1-epsilon)COST(P,w,C) <= COST(Q,v,C) <= (1+epsilon)COST(P,w,C). Essentially, the coreset (Q,v) can be used in place of (P,w) for all operations concerning the COST function. Coresets, as a method of data reduction, are used to solve fundamental problems in machine learning of streaming and distributed data.
M-estimators are functions D(x,y) that can be written as psi(d(x,y)) where ({X}, d) is a true metric (i.e. 1-metric) space. Special cases of M-estimators include the well-known k-median (psi(x) =x) and k-means (psi(x) = x^2) functions. Our technique takes an existing offline construction for an M-estimator coreset and converts it into the streaming setting, where n data points arrive sequentially. To our knowledge, this is the first streaming construction for any M-estimator that does not rely on the merge-and-reduce tree. For example, our coreset for streaming metric k-means uses O(epsilon^{-2} k log k log n) points of storage. The previous state-of-the-art required storing at least O(epsilon^{-2} k log k log^{4} n) points.

BibTeX - Entry

@InProceedings{braverman_et_al:LIPIcs:2019:11277,
  author =	{Vladimir Braverman and Dan Feldman and Harry Lang and Daniela Rus},
  title =	{{Streaming Coreset Constructions for M-Estimators}},
  booktitle =	{Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019)},
  pages =	{62:1--62:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-125-2},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{145},
  editor =	{Dimitris Achlioptas and L{\'a}szl{\'o} A. V{\'e}gh},
  publisher =	{Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{http://drops.dagstuhl.de/opus/volltexte/2019/11277},
  URN =		{urn:nbn:de:0030-drops-112778},
  doi =		{10.4230/LIPIcs.APPROX-RANDOM.2019.62},
  annote =	{Keywords: Streaming, Clustering, Coresets}
}

Keywords: Streaming, Clustering, Coresets
Collection: Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019)
Issue Date: 2019
Date of publication: 17.09.2019


DROPS-Home | Fulltext Search | Imprint | Privacy Published by LZI