**WARNING!** Sorry, my previous implementation was wrong. Thanks a lot to Jan Schlüter for the correction!

It's fairly easy to run k-means clustering in Python; see `pydoc scipy.cluster.vq.kmeans` (or `kmeans2`). However, the initially selected centers affect the performance a lot. Thanks to Feng Zhu, who introduced k-means++ to us; it is a very good and effective way to select the initial centers.
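
For example, a minimal run might look like this (a sketch with made-up random data; the variable names are mine):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# illustrative data: 200 points in 2-D; any (n_points, n_features) array works
data = np.random.rand(200, 2)

# k = 5 clusters, initial centers drawn from the data points themselves
centroids, labels = kmeans2(data, 5, minit='points')
```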

I was totally confused about step (1b) in the paper: selecting `c_i = x'` with probability `\frac{D(x')^2}{\sum_{x \in X} D(x)^2}`.
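
(As a quick worked example: if three points currently have squared distances D(x)^2 = 1, 4, and 5 to their nearest centers, the third point is chosen as the next center with probability 5 / (1 + 4 + 5) = 0.5.)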

I referred to the author's C++ implementation and thought the sampling step (Utils.cpp, lines 299-305) was just an optimization, so I simply minimized `\sum_{x \in X} \min(D(x)^2, ||x - x_i||^2)` directly.

Thanks a lot to Jan Schlüter, who pointed out that my previous implementation was wrong: the sampling step (1b) is a crucial part of the k-means++ algorithm. He has posted an optimized implementation here.

Here is my revised Python code (unoptimized):

```python
from math import log

from numpy import array
from numpy.linalg import norm
from numpy.random import randint, random
from scipy.cluster.vq import kmeans2

def kinit(X, k, ntries=None):
    'init k seeds according to kmeans++'
    if not ntries:
        ntries = int(2 + log(k))
    n = X.shape[0]
    # choose the 1st seed randomly, and store D(x)^2 in D[]
    centers = [X[randint(n)]]
    D = [norm(x - centers[0]) ** 2 for x in X]
    Dsum = sum(D)
    for _ in range(k - 1):
        bestDsum = bestIdx = -1
        for _ in range(ntries):
            # sample a candidate index i with probability D(x_i)^2 / Dsum
            randVal = random() * Dsum
            for i in range(n):
                if randVal <= D[i]:
                    break
                else:
                    randVal -= D[i]
            # tmpDsum = sum_{x in X} min(D(x)^2, ||x - x_i||^2)
            tmpDsum = sum(min(D[j], norm(X[j] - X[i]) ** 2) for j in range(n))
            if bestDsum < 0 or tmpDsum < bestDsum:
                bestDsum, bestIdx = tmpDsum, i
        Dsum = bestDsum
        centers.append(X[bestIdx])
        D = [min(D[i], norm(X[i] - X[bestIdx]) ** 2) for i in range(n)]
    return array(centers)

# to use kinit() with kmeans2(); the ndarray passed as k is used as the initial centroids
res, idx = kmeans2(Y, kinit(Y, 3), minit='points')
```
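
As a quick usage sketch (the synthetic blobs below are made up for illustration, not from the original post):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# three synthetic 2-D Gaussian blobs (illustrative only)
np.random.seed(0)
Y = np.vstack([np.random.randn(100, 2) + c for c in ([0, 0], [5, 5], [0, 5])])

res, idx = kmeans2(Y, kinit(Y, 3), minit='points')
```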

Hi,

If you still remember the k-means++ concept, can you please tell me what the author of KMeans++ was doing in this code:

```cpp
//
// calculate P (non-cumulative)
//
s = 0;
for(i = 0; i <= npoints-1; i++)
{
    s = s + d2(i);
}
if( s==0 )
{
    result = false;
    return result;
}
s = 1/s;
ap::vmove(&p(0), &d2(0), ap::vlen(0,npoints-1), s);

//
// choose one of points with probability P
// random number within (0,1) is generated and
// inverse empirical CDF is used to randomly choose a point.
//
s = 0;
v = ap::randomreal();
for(i = 0; i <= npoints-1; i++)
{
    s = s + p(i);
    if( v<=s || i==npoints-1 )
    {
        ap::vmove(&centers(cc,0), &xy(i,0), ap::vlen(0,nvars-1));
        busycenters(cc) = true;
        break;
    }
}
```

1- What does his P contain? Is it the same D^2 distance measure (a normalized version)?

2- What did he mean by "inverse empirical CDF"?

In my application, my P is an array of the shortest distances calculated with the existing centroids.

Can I just use the below-mentioned code to find the next centroid?

```cpp
s = 0;
v = ap::randomreal();
for(i = 0; i <= npoints-1; i++)
{
    s = s + p(i);
    if( v<=s || i==npoints-1 )
    {
        ap::vmove(&centers(cc,0), &xy(i,0), ap::vlen(0,nvars-1));
        busycenters(cc) = true;
        break;
    }
}
```

Sorry, I did not dig too much into their C++ implementation 🙁 There is another C++ implementation, http://www.stanford.edu/~darthur/kmpp.zip, which has a good README.
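
That said, going by the comments in the quoted snippet itself: p is just d2 normalized to sum to 1, and the loop draws a uniform v and returns the first index whose running sum of p reaches v; that running-sum walk is the "inverse empirical CDF" sampling. A rough Python sketch of the same idea (my paraphrase, not the kmpp author's code):

```python
import numpy as np

def sample_by_d2(d2):
    'pick an index i with probability d2[i] / sum(d2) via inverse-CDF sampling'
    d2 = np.asarray(d2, dtype=float)
    p = d2 / d2.sum()              # the P array: normalized squared distances
    v = np.random.random_sample()  # uniform draw in [0, 1)
    s = 0.0
    for i, pi in enumerate(p):
        s += pi                    # running sum = empirical CDF at index i
        if v <= s:
            return i
    return len(p) - 1              # fall back to last index (round-off guard)
```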

What is `norm` in your script? Is it `np.linalg.norm`?

IIRC, yes 🙂