License: Creative Commons Attribution 3.0 Unported license (CC BY 3.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.ESA.2020.18
URN: urn:nbn:de:0030-drops-128848
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2020/12884/
Bhattacharya, Anup ;
Eube, Jan ;
Röglin, Heiko ;
Schmidt, Melanie
Noisy, Greedy and Not so Greedy k-Means++
Abstract
The k-means++ algorithm due to Arthur and Vassilvitskii [David Arthur and Sergei Vassilvitskii, 2007] has become the most popular seeding method for Lloyd’s algorithm. It samples the first center uniformly at random from the data set and the other k-1 centers iteratively according to D²-sampling, i.e., the probability that a data point becomes the next center is proportional to its squared distance to the closest center chosen so far. k-means++ is known to achieve an approximation factor of ?(log k) in expectation.
Already in the original paper on k-means++, Arthur and Vassilvitskii suggested a variation called greedy k-means++ algorithm in which in each iteration multiple possible centers are sampled according to D²-sampling and only the one that decreases the objective the most is chosen as a center for that iteration. It is stated as an open question whether this also leads to an ?(log k)-approximation (or even better). We show that this is not the case by presenting a family of instances on which greedy k-means++ yields only an Ω(?⋅log k)-approximation in expectation where ? is the number of possible centers that are sampled in each iteration.
Inspired by the negative results, we study a variation of greedy k-means++ which we call noisy k-means++ algorithm. In this variation only one center is sampled in every iteration but not exactly by D²-sampling. Instead in each iteration an adversary is allowed to change the probabilities arising from D²-sampling individually for each point by a factor between 1-ε₁ and 1+ε₂ for parameters ε₁ ∈ [0,1) and ε₂ ≥ 0. We prove that noisy k-means++ computes an ?(log² k)-approximation in expectation. We use the analysis of noisy k-means++ to design a moderately greedy k-means++ algorithm.
BibTeX - Entry
@InProceedings{bhattacharya_et_al:LIPIcs:2020:12884,
author = {Anup Bhattacharya and Jan Eube and Heiko R{\"o}glin and Melanie Schmidt},
title = {{Noisy, Greedy and Not so Greedy k-Means++}},
booktitle = {28th Annual European Symposium on Algorithms (ESA 2020)},
pages = {18:1--18:21},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {978-3-95977-162-7},
ISSN = {1868-8969},
year = {2020},
volume = {173},
editor = {Fabrizio Grandoni and Grzegorz Herman and Peter Sanders},
publisher = {Schloss Dagstuhl--Leibniz-Zentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/opus/volltexte/2020/12884},
URN = {urn:nbn:de:0030-drops-128848},
doi = {10.4230/LIPIcs.ESA.2020.18},
annote = {Keywords: k-means++, greedy, adaptive sampling}
}
Keywords: |
|
k-means++, greedy, adaptive sampling |
Collection: |
|
28th Annual European Symposium on Algorithms (ESA 2020) |
Issue Date: |
|
2020 |
Date of publication: |
|
26.08.2020 |