Abstract
We generalise the results of Bhattacharya et al. [Bhattacharya et al., 2018] for the listkmeans problem defined as  for a (unknown) partition X₁, ..., X_k of the dataset X ⊆ ℝ^d, find a list of kcentersets (each element in the list is a set of k centers) such that at least one of kcentersets {c₁, ..., c_k} in the list gives an (1+ε)approximation with respect to the cost function min_{permutation π} [∑_{i = 1}^{k} ∑_{x ∈ X_i} x  c_{π(i)}²]. The listkmeans problem is important for the constrained kmeans problem since algorithms for the former can be converted to {PTAS} for various versions of the latter. The algorithm for the listkmeans problem by Bhattacharya et al. is a D²sampling based algorithm that runs in k iterations. Making use of a constant factor solution for the (classical or unconstrained) kmeans problem, we generalise the algorithm of Bhattacharya et al. in two ways  (i) for any fixed set X_{j₁}, ..., X_{j_t} of t ≤ k clusters, the algorithm produces a list of (k/(ε))^{O(t/(ε))} tcenter sets such that (w.h.p.) at least one of them is good for X_{j₁}, ..., X_{j_t}, and (ii) the algorithm runs in a single iteration. Following are the consequences of our generalisations:
1) Faster PTAS under stability and a parameterised reduction: Property (i) of our generalisation is useful in scenarios where finding good centers becomes easier once good centers for a few "bad" clusters have been chosen. One such case is clustering under stability of Awasthi et al. [Awasthi et al., 2010] where the number of such bad clusters is a constant. Using property (i), we significantly improve the running time of their algorithm from O(dn³) (k log{n})^{poly(1/(β), 1/(ε))} to O (dn³ (k/(ε)) ^{O(1/βε²)}). Another application is a parameterised reduction from the outlier version of kmeans to the classical one where the bad clusters are the outliers.
2) Streaming algorithms: The sampling algorithm running in a single iteration (i.e., property (ii)) allows us to design a constantpass, logspace streaming algorithm for the listkmeans problem. This can be converted to a constantpass, logspace streaming PTAS for various constrained versions of the kmeans problem. In particular, this gives a 3pass, polylogspace streaming PTAS for the constrained binary kmeans problem which in turn gives a 4pass, polylogspace streaming PTAS for the generalised binary ?₀rankr approximation problem. This is the first constant pass, polylogspace streaming algorithm for either of the two problems. Coreset based techniques, which is another approach for designing streaming algorithms in general, is not known to work for the constrained binary kmeans problem to the best of our knowledge.
BibTeX  Entry
@InProceedings{bhattacharya_et_al:LIPIcs:2020:13254,
author = {Anup Bhattacharya and Dishant Goyal and Ragesh Jaiswal and Amit Kumar},
title = {{On Sampling Based Algorithms for kMeans}},
booktitle = {40th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2020)},
pages = {13:113:17},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {9783959771740},
ISSN = {18688969},
year = {2020},
volume = {182},
editor = {Nitin Saxena and Sunil Simon},
publisher = {Schloss DagstuhlLeibnizZentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/opus/volltexte/2020/13254},
URN = {urn:nbn:de:0030drops132549},
doi = {10.4230/LIPIcs.FSTTCS.2020.13},
annote = {Keywords: kmeans, low rank approximation}
}
Keywords: 

kmeans, low rank approximation 
Collection: 

40th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2020) 
Issue Date: 

2020 
Date of publication: 

04.12.2020 