Abstract
We introduce a new method of maintaining a (k,epsilon)coreset for clustering Mestimators over insertiononly streams. Let (P,w) be a weighted set (where w : P  > [0,infty) is the weight function) of points in a rhometric space (meaning a set X equipped with a positivesemidefinite symmetric function D such that D(x,z) <=rho(D(x,y) + D(y,z)) for all x,y,z in X). For any set of points C, we define COST(P,w,C) = sum_{p in P} w(p) min_{c in C} D(p,c). A (k,epsilon)coreset for (P,w) is a weighted set (Q,v) such that for every set C of k points, (1epsilon)COST(P,w,C) <= COST(Q,v,C) <= (1+epsilon)COST(P,w,C). Essentially, the coreset (Q,v) can be used in place of (P,w) for all operations concerning the COST function. Coresets, as a method of data reduction, are used to solve fundamental problems in machine learning of streaming and distributed data.
Mestimators are functions D(x,y) that can be written as psi(d(x,y)) where ({X}, d) is a true metric (i.e. 1metric) space. Special cases of Mestimators include the wellknown kmedian (psi(x) =x) and kmeans (psi(x) = x^2) functions. Our technique takes an existing offline construction for an Mestimator coreset and converts it into the streaming setting, where n data points arrive sequentially. To our knowledge, this is the first streaming construction for any Mestimator that does not rely on the mergeandreduce tree. For example, our coreset for streaming metric kmeans uses O(epsilon^{2} k log k log n) points of storage. The previous stateoftheart required storing at least O(epsilon^{2} k log k log^{4} n) points.
BibTeX  Entry
@InProceedings{braverman_et_al:LIPIcs:2019:11277,
author = {Vladimir Braverman and Dan Feldman and Harry Lang and Daniela Rus},
title = {{Streaming Coreset Constructions for MEstimators}},
booktitle = {Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019)},
pages = {62:162:15},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {9783959771252},
ISSN = {18688969},
year = {2019},
volume = {145},
editor = {Dimitris Achlioptas and L{\'a}szl{\'o} A. V{\'e}gh},
publisher = {Schloss DagstuhlLeibnizZentrum fuer Informatik},
address = {Dagstuhl, Germany},
URL = {http://drops.dagstuhl.de/opus/volltexte/2019/11277},
URN = {urn:nbn:de:0030drops112778},
doi = {10.4230/LIPIcs.APPROXRANDOM.2019.62},
annote = {Keywords: Streaming, Clustering, Coresets}
}
Keywords: 

Streaming, Clustering, Coresets 
Collection: 

Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019) 
Issue Date: 

2019 
Date of publication: 

17.09.2019 