WARNING! Sorry, my previous implementation was wrong, and thanks a lot for Jan Schlüter's correction!
It's fairly easy to run k-means clustering in python, refer to $pydoc scipy.cluster.vq.kmeans (or kmeans2). While the initial selected centers affect the performance a lot. Thanks Feng Zhu, that introduced k-means++ to us, which is a very good and effective way to select the initial centers.
I was totally confused about the step (1b) in paper, selecting ci=x', with probability frac {D(x')^2} {sum_{x in X} D(x)^2}. I referred to author's C++ implementation, thought the sampling step (Utils.cpp, lines 299-305) is just for optimizing. So I simply minimized the sum_{x in X} min (D(x)^2, ||x-xi||^2).
Thanks a lot to Jan Schlüter, he pointed out that my previous implementation was wrong, the sampling step (1b) is one crucial part of K-means++ algorithm. And he had posted an optimized implementation here,
Here comes my revised python code (unoptimized):
def kinit (X, k, ntries=None):
'init k seeds according to kmeans++'
if not ntries: ntries = int (2 + log(k))
n = X.shape[0]
'choose the 1st seed randomly, and store D(x)^2 in D[]'
centers = [X[randint(n)]]
D = [norm(x-centers[0])**2 for x in X]
Dsum = reduce (lambda x,y:x+y, D)
for _ in range(k-1):
bestDsum = bestIdx = -1
for _ in range(ntries):
randVal = random() * Dsum
for i in range(n):
if randVal <= D[i]:
break
else:
randVal -= D[i]
'tmpDsum = sum_{x in X} min(D(x)^2,||x-xi||^2)'
tmpDsum = reduce(lambda x,y:x+y,
(min(D[j], norm(X[j]-X[i])**2) for j in xrange(n)))
if bestDsum < 0 or tmpDsum < bestDsum:
bestDsum, bestIdx = tmpDsum, i
Dsum = bestDsum
centers.append (X[bestIdx])
D = [min(D[i], norm(X[i]-X[bestIdx])**2) for i in xrange(n)]
return array (centers)
'to use kinit() with kmeans2()'
res,idx = kmeans2(Y, kinit(Y,3), minit='points')
Hi,
I you are still remembered KMean++ concept, can you pls tell me what the
author of KMeans++ was doing in the code:
// calculate P (non-cumulative)
//
s = 0;
for(i = 0; i <= npoints-1; i++)
{
s = s+d2(i);
}
if( s==0 )
{
result = false;
return result;
}
s = 1/s;
ap::vmove(&p(0), &d2(0), ap::vlen(0,npoints-1), s);
//
// choose one of points with probability P
// random number within (0,1) is generated and
// inverse empirical CDF is used to randomly choose a point.
//
s = 0;
v = ap::randomreal();
for(i = 0; i <= npoints-1; i++)
{
s = s+p(i);
if( v<=s||i==npoints-1 )
{
ap::vmove(¢ers(cc, 0), &xy(i, 0), ap::vlen(0,nvars-1))
;
busycenters(cc) = true;
break;
}
}
1- What his P contains? Is this the same D^2 distance measure (normalized
version)
2- What he meant by inverse empirical CDF
In my application my P is an array of shortest distances calculated with the
existing centroids.
Can I just use the code the below mentioned code to find the next centroid?
s = 0;
v = ap::randomreal();
for(i = 0; i <= npoints-1; i++)
{
s = s+p(i);
if( v<=s||i==npoints-1 )
{
ap::vmove(¢ers(cc, 0), &xy(i, 0), ap::vlen(0,nvars-1))
;
busycenters(cc) = true;
break;
}
}
sorry, I did not dig too much in their C++ implementation
There is another C++ implementation, http://www.stanford.edu/~darthur/kmpp.zip, which has a good README.