A simple D 2-sampling based PTAS for k-means and other clustering problems

R. Jaiswal; A. Kumar; Sandeep Sen

doi:10.1007/978-3-642-32241-9_2

Given a set of points P ⊂ ℝ d, the k-means clustering problem is to find a set of k centers C = {c 1,...,c k}, c i ∈ ℝ d, such that the objective function Σ x∈P d(x,C) 2, where d(x,C) denotes the distance between x and the closest center in C, is minimized. This is one of the most prominent objective functions that have been studied with respect to clustering. D 2-sampling [1] is a simple non-uniform sampling technique for choosing points from a set of points. It works as follows: given a set of points P ⊆ ℝ d, the first point is chosen uniformly at random from P. Subsequently, a point from P is chosen as the next sample with probability proportional to the square of the distance of this point to the nearest previously sampled points. D 2-sampling has been shown to have nice properties with respect to the k-means clustering problem. Arthur and Vassilvitskii [1] show that k points chosen as centers from P using D 2-sampling gives an O(logk) approximation in expectation. Ailon et. al. [2] and Aggarwal et. al. [3] extended results of [1] to show that O(k) points chosen as centers using D 2-sampling give O(1) approximation to the k-means objective function with high probability. In this paper, we further demonstrate the power of D 2-sampling by giving a simple randomized (1 + ∈)-approximation algorithm that uses the D 2-sampling in its core. © 2012 Springer-Verlag.