# K-means and K-means++ with Python

WARNING! Sorry, my previous implementation was wrong; thanks a lot to Jan Schlüter for the correction!

It's fairly easy to run k-means clustering in Python; see `pydoc scipy.cluster.vq.kmeans` (or `kmeans2`). However, the choice of initial centers affects the result a lot. Thanks to Feng Zhu, who introduced k-means++ to us, a very simple and effective way to select the initial centers.
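For reference, a minimal sketch (not from the original post) of running k-means with SciPy's `kmeans2`, which returns the final centroids and a cluster label for each sample; the data here is made up for illustration:

```
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
# two tight, well-separated 2-D blobs around (0, 0) and (5, 5)
Y = np.vstack([rng.normal(0.0, 0.1, (50, 2)),
               rng.normal(5.0, 0.1, (50, 2))])

# minit='++' asks SciPy for its built-in k-means++ initialization
centroids, labels = kmeans2(Y, 2, minit='++')
# centroids has shape (2, 2); labels assigns each of the 100 rows a cluster id
```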

I was totally confused by step (1b) in the paper: select ci = x' with probability `D(x')^2 / sum_{x in X} D(x)^2`. I referred to the authors' C++ implementation and thought the sampling step (Utils.cpp, lines 299-305) was just an optimization, so I simply minimized `sum_{x in X} min(D(x)^2, ||x-xi||^2)`.
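Step (1b) can be sketched in a few lines of NumPy (this is my illustration, not the authors' code): normalize the squared distances into probabilities and draw an index from them.

```
import numpy as np

def sample_next_center(X, D2, rng=None):
    """Pick an index of X with probability D2[i] / sum(D2)."""
    rng = rng or np.random.default_rng()
    p = D2 / D2.sum()          # normalize D(x)^2 into a distribution
    return rng.choice(len(X), p=p)

# toy data: the point far from the existing seed has a huge D(x)^2,
# so it is almost always picked as the next center
X  = np.array([[0.0], [1.0], [10.0]])
D2 = np.array([0.0, 1.0, 100.0])   # squared distances to the nearest seed
idx = sample_next_center(X, D2)
```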

Thanks a lot to Jan Schlüter, who pointed out that my previous implementation was wrong: the sampling step (1b) is a crucial part of the k-means++ algorithm. He has also posted an optimized implementation here.

Here is my revised Python code (unoptimized):

```
from math import log

from numpy import array
from numpy.linalg import norm
from numpy.random import randint, random

def kinit(X, k, ntries=None):
    'init k seeds according to kmeans++'
    if not ntries: ntries = int(2 + log(k))
    n = X.shape[0]

    # choose the 1st seed randomly, and store D(x)^2 in D[]
    centers = [X[randint(n)]]
    D       = [norm(x - centers[0])**2 for x in X]
    Dsum    = sum(D)

    for _ in range(k - 1):
        bestDsum = bestIdx = -1

        for _ in range(ntries):
            # sample a candidate index i with probability D[i] / Dsum
            randVal = random() * Dsum
            for i in range(n):
                if randVal <= D[i]:
                    break
                else:
                    randVal -= D[i]

            # tmpDsum = sum_{x in X} min(D(x)^2, ||x-xi||^2)
            tmpDsum = sum(min(D[j], norm(X[j] - X[i])**2) for j in range(n))

            if bestDsum < 0 or tmpDsum < bestDsum:
                bestDsum, bestIdx = tmpDsum, i

        Dsum = bestDsum
        centers.append(X[bestIdx])
        D = [min(D[i], norm(X[i] - X[bestIdx])**2) for i in range(n)]

    return array(centers)

# to use kinit() with kmeans2() (Y is your data array):
from scipy.cluster.vq import kmeans2
res, idx = kmeans2(Y, kinit(Y, 3), minit='matrix')
```

## 4 thoughts on “K-means and K-means++ with Python”

1. musi on said:

Hi,
If you still remember the k-means++ concept, can you please tell me what the
author of KMeans++ was doing in this code:
```
// calculate P (non-cumulative)
//
s = 0;
for(i = 0; i <= npoints-1; i++)
{
    s = s+d2(i);
}
if( s==0 )
{
    result = false;
    return result;
}
s = 1/s;
ap::vmove(&p(0), &d2(0), ap::vlen(0,npoints-1), s);

//
// choose one of points with probability P
// random number within (0,1) is generated and
// inverse empirical CDF is used to randomly choose a point.
//
s = 0;
v = ap::randomreal();
for(i = 0; i <= npoints-1; i++)
{
    s = s+p(i);
    if( v<=s||i==npoints-1 )
    {
        ap::vmove(&centers(cc, 0), &xy(i, 0), ap::vlen(0,nvars-1));
        busycenters(cc) = true;
        break;
    }
}
```

1- What does his P contain? Is it the same D^2 distance measure (a normalized
version)?
2- What did he mean by "inverse empirical CDF"?

In my application, my P is an array of the shortest distances calculated with the
existing centroids.
Can I just use the code below to find the next centroid?

```
s = 0;
v = ap::randomreal();
for(i = 0; i <= npoints-1; i++)
{
    s = s+p(i);
    if( v<=s||i==npoints-1 )
    {
        ap::vmove(&centers(cc, 0), &xy(i, 0), ap::vlen(0,nvars-1));
        busycenters(cc) = true;
        break;
    }
}
```
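(For what it's worth, that loop is the standard inverse-CDF trick: normalize d2 into probabilities p, draw v uniform in (0,1), and walk the running sum of p until it first reaches v. A rough Python sketch of the same idea, with made-up numbers:)

```
import numpy as np

def inverse_cdf_pick(d2, v):
    """Return the first index i where the cumulative sum of p = d2/sum(d2)
    reaches v, i.e. sample from the distribution p by inverting its CDF."""
    p = d2 / d2.sum()
    s = 0.0
    for i, pi in enumerate(p):
        s += pi
        if v <= s:
            return i
    return len(p) - 1   # guard against floating-point round-off, like i==npoints-1

d2 = np.array([1.0, 3.0, 6.0])   # p = [0.1, 0.3, 0.6], CDF = [0.1, 0.4, 1.0]
inverse_cdf_pick(d2, 0.05)   # -> 0
inverse_cdf_pick(d2, 0.35)   # -> 1
inverse_cdf_pick(d2, 0.95)   # -> 2
```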

2. riyaz on said:

What is `norm` in your script? Is it `np.linalg.norm`?

3. yongsun on said:

IIRC, yes 🙂