License: Creative Commons Attribution 3.0 Unported license (CC BY 3.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.ICALP.2018.57
URN: urn:nbn:de:0030-drops-90612
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2018/9061/
Go to the corresponding LIPIcs Volume Portal


Gamlath, Buddhima ; Huang, Sangxia ; Svensson, Ola

Semi-Supervised Algorithms for Approximately Optimal and Accurate Clustering

pdf-format:
LIPIcs-ICALP-2018-57.pdf (0.5 MB)


Abstract

We study k-means clustering in a semi-supervised setting. Given an oracle that returns whether two given points belong to the same cluster in a fixed optimal clustering, we investigate the following question: how many oracle queries are sufficient to efficiently recover a clustering that, with probability at least (1 - delta), simultaneously has a cost of at most (1 + epsilon) times the optimal cost and an accuracy of at least (1 - epsilon)?
We show how to achieve such a clustering on n points with O{((k^2 log n) * m{(Q, epsilon^4, delta / (k log n))})} oracle queries, when the k clusters can be learned with an epsilon' error and a failure probability delta' using m(Q, epsilon',delta') labeled samples in the supervised setting, where Q is the set of candidate cluster centers. We show that m(Q, epsilon', delta') is small both for k-means instances in Euclidean space and for those in finite metric spaces. We further show that, for the Euclidean k-means instances, we can avoid the dependency on n in the query complexity at the expense of an increased dependency on k: specifically, we give a slightly more involved algorithm that uses O{(k^4/(epsilon^2 delta) + (k^{9}/epsilon^4) log(1/delta) + k * m{({R}^r, epsilon^4/k, delta)})} oracle queries.
We also show that the number of queries needed for (1 - epsilon)-accuracy in Euclidean k-means must linearly depend on the dimension of the underlying Euclidean space, and for finite metric space k-means, we show that it must at least be logarithmic in the number of candidate centers. This shows that our query complexities capture the right dependencies on the respective parameters.

BibTeX - Entry

@InProceedings{gamlath_et_al:LIPIcs:2018:9061,
  author =	{Buddhima Gamlath and Sangxia Huang and Ola Svensson},
  title =	{{Semi-Supervised Algorithms for Approximately Optimal and Accurate Clustering}},
  booktitle =	{45th International Colloquium on Automata, Languages, and  Programming (ICALP 2018)},
  pages =	{57:1--57:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-076-7},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{107},
  editor =	{Ioannis Chatzigiannakis and Christos Kaklamanis and D{\'a}niel Marx and Donald Sannella},
  publisher =	{Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{http://drops.dagstuhl.de/opus/volltexte/2018/9061},
  URN =		{urn:nbn:de:0030-drops-90612},
  doi =		{10.4230/LIPIcs.ICALP.2018.57},
  annote =	{Keywords: Clustering, Semi-supervised Learning, Approximation Algorithms, k-Means, k-Median}
}

Keywords: Clustering, Semi-supervised Learning, Approximation Algorithms, k-Means, k-Median
Collection: 45th International Colloquium on Automata, Languages, and Programming (ICALP 2018)
Issue Date: 2018
Date of publication: 04.07.2018


DROPS-Home | Fulltext Search | Imprint | Privacy Published by LZI