License: Creative Commons Attribution 3.0 Unported license (CC BY 3.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.STACS.2016.16
URN: urn:nbn:de:0030-drops-57179
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2016/5717/
Bhattacharya, Anup ;
Jaiswal, Ragesh ;
Kumar, Amit
Faster Algorithms for the Constrained k-Means Problem
Abstract
The classical center based clustering problems such as k-means/median/center assume that the optimal clusters satisfy the locality property that the points in the same cluster are close to each other. A number of clustering problems arise in machine learning where the optimal clusters do not follow such a locality property. For instance, consider the r-gather clustering problem where there is an additional constraint that each of the clusters should have at least r points or the capacitated clustering problem where there is an upper bound on the cluster sizes. Consider a variant of the k-means problem that may be regarded as a general version of such problems. Here, the optimal clusters O_1, ..., O_k are an arbitrary partition of the dataset and the goal is to output k-centers c_1, ..., c_k such that the objective function sum_{i=1}^{k} sum_{x in O_{i}} ||x - c_{i}||^2 is minimized. It is not difficult to argue that any algorithm (without knowing the optimal clusters) that outputs a single set of k centers, will not behave well as far as optimizing the above objective function is concerned. However, this does not rule out the existence of algorithms that output a list of such k centers such that at least one of these k centers behaves well. Given an error parameter epsilon > 0, let l denote the size of the smallest list of k-centers such that at least one of the k-centers gives a (1+epsilon) approximation w.r.t. the objective function above. In this paper, we show an upper bound on l by giving a randomized algorithm that outputs a list of 2^{~O(k/epsilon)} k-centers. We also give a closely matching lower bound of 2^{~Omega(k/sqrt{epsilon})}. Moreover, our algorithm runs in time O(n * d * 2^{~O(k/epsilon)}). This is a significant improvement over the previous result of Ding and Xu who gave an algorithm with running time O(n * d * (log{n})^{k} * 2^{poly(k/epsilon)}) and output a list of size O((log{n})^k * 2^{poly(k/epsilon)}). Our techniques generalize for the k-median problem and for many other settings where non-Euclidean distance measures are involved.
BibTeX - Entry
@InProceedings{bhattacharya_et_al:LIPIcs:2016:5717,
author = {Anup Bhattacharya and Ragesh Jaiswal and Amit Kumar},
title = {{Faster Algorithms for the Constrained k-Means Problem}},
booktitle = {33rd Symposium on Theoretical Aspects of Computer Science (STACS 2016)},
pages = {16:1--16:13},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {978-3-95977-001-9},
ISSN = {1868-8969},
year = {2016},
volume = {47},
editor = {Nicolas Ollinger and Heribert Vollmer},
publisher = {Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik},
address = {Dagstuhl, Germany},
URL = {http://drops.dagstuhl.de/opus/volltexte/2016/5717},
URN = {urn:nbn:de:0030-drops-57179},
doi = {10.4230/LIPIcs.STACS.2016.16},
annote = {Keywords: k-means, k-median, approximation algorithm, sampling}
}
Keywords: |
|
k-means, k-median, approximation algorithm, sampling |
Collection: |
|
33rd Symposium on Theoretical Aspects of Computer Science (STACS 2016) |
Issue Date: |
|
2016 |
Date of publication: |
|
16.02.2016 |