K-means and K-means++ with Python

WARNING! Sorry, my previous implementation was wrong; thanks a lot to Jan Schlüter for the correction!

It's fairly easy to run k-means clustering in Python; refer to $ pydoc scipy.cluster.vq.kmeans (or kmeans2). However, the initially selected centers affect the result a lot. Thanks to Feng Zhu, who introduced k-means++ to us; it is a simple and effective way to select the initial centers.
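For a quick start, a minimal run of scipy's built-in k-means could look like this (just a sketch; the random data is made up purely for illustration):

import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

data = whiten(np.random.rand(200, 2))    # illustrative data only
centroids, distortion = kmeans(data, 3)  # 3 clusters, random init
labels, _ = vq(data, centroids)          # assign points to centroids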

I was totally confused by step (1b) in the paper: select c_i = x' with probability D(x')^2 / sum_{x in X} D(x)^2. I referred to the author's C++ implementation and thought the sampling step (Utils.cpp, lines 299-305) was just an optimization, so I simply minimized sum_{x in X} min(D(x)^2, ||x - x_i||^2).

Thanks a lot to Jan Schlüter, who pointed out that my previous implementation was wrong: the sampling step (1b) is a crucial part of the k-means++ algorithm. He has also posted an optimized implementation here.
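To make step (1b) concrete: each point x' is drawn with probability proportional to D(x')^2. A minimal numpy sketch of that weighted draw (the name sample_by_dsq is mine, not from the paper or from Jan's code) walks the cumulative sums, i.e. inverts the empirical CDF:

import numpy as np

def sample_by_dsq(D):
    'draw index i with probability D[i] / sum(D), where D[i] = D(x_i)^2'
    cumD = np.cumsum(D)                   # unnormalized empirical CDF
    r = np.random.random() * cumD[-1]     # uniform draw in [0, Dsum)
    return int(np.searchsorted(cumD, r))  # first i with cumD[i] >= r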

Here comes my revised Python code (unoptimized):

from math import log
from random import random

from numpy import array
from numpy.linalg import norm
from numpy.random import randint

def kinit(X, k, ntries=None):
    'init k seeds according to k-means++'
    if not ntries:
        ntries = int(2 + log(k))
    n = X.shape[0]

    # choose the 1st seed randomly, and store D(x)^2 in D[]
    centers = [X[randint(n)]]
    D       = [norm(x - centers[0])**2 for x in X]
    Dsum    = sum(D)

    for _ in range(k - 1):
        bestDsum = bestIdx = -1

        for _ in range(ntries):
            # sample a point x_i with probability D(x_i)^2 / Dsum
            randVal = random() * Dsum
            for i in range(n):
                if randVal <= D[i]:
                    break
                randVal -= D[i]

            # tmpDsum = sum_{x in X} min(D(x)^2, ||x - x_i||^2)
            tmpDsum = sum(min(D[j], norm(X[j] - X[i])**2)
                          for j in range(n))

            if bestDsum < 0 or tmpDsum < bestDsum:
                bestDsum, bestIdx = tmpDsum, i

        Dsum = bestDsum
        centers.append(X[bestIdx])
        D = [min(D[i], norm(X[i] - X[bestIdx])**2) for i in range(n)]

    return array(centers)

# to use kinit() with kmeans2(); minit='matrix' makes kmeans2 treat
# the passed array as the initial centroids
from scipy.cluster.vq import kmeans2
res, idx = kmeans2(Y, kinit(Y, 3), minit='matrix')
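For an end-to-end check, here is a self-contained example (the three Gaussian blobs are made up purely for illustration):

import numpy as np
from scipy.cluster.vq import kmeans2

np.random.seed(0)
# three synthetic 2-D blobs (illustrative data only)
Y = np.vstack([np.random.randn(100, 2) + c
               for c in ([0, 0], [5, 5], [0, 5])])

res, idx = kmeans2(Y, kinit(Y, 3), minit='matrix')
print(res)   # the three refined centroids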

4 thoughts on “K-means and K-means++ with Python”

  1. Hi,
    If you still remember the k-means++ concept, can you please tell me what
    the author of k-means++ was doing in this code:
    // calculate P (non-cumulative)
    //
    s = 0;
    for(i = 0; i <= npoints-1; i++)
    {
        s = s + d2(i);
    }
    if( s==0 )
    {
        result = false;
        return result;
    }
    s = 1/s;
    ap::vmove(&p(0), &d2(0), ap::vlen(0,npoints-1), s);

    //
    // choose one of the points with probability P:
    // a random number within (0,1) is generated and the
    // inverse empirical CDF is used to randomly choose a point.
    //
    s = 0;
    v = ap::randomreal();
    for(i = 0; i <= npoints-1; i++)
    {
        s = s + p(i);
        if( v <= s || i == npoints-1 )
        {
            ap::vmove(&centers(cc, 0), &xy(i, 0), ap::vlen(0,nvars-1));
            busycenters(cc) = true;
            break;
        }
    }

    1. What does his P contain? Is it the same D^2 distance measure (a
    normalized version)?
    2. What did he mean by "inverse empirical CDF"?

    In my application, my P is an array of the shortest distances to the
    existing centroids. Can I just use the "choose one of the points" loop
    above to find the next centroid?

