Mastering Python in 2 hours

These are the slides I prepared for introducing Python v2.x to my colleagues. Sorry for the clickbait title; "mastering" is really an overstatement. I just collected some common Python usages, along with things like decorators and metaclasses that are not used every day but are seen fairly often. Overall it is still rather rough, and there are bound to be many omissions and mistakes, so corrections are very welcome ...

double-array trie, and constructing one with python

The double-array trie is widely used in dictionary construction, and there are several implementations out there, such as darts (a plain implementation without a tail pool), darts-clone, dastrie, libdatrie, etc. For a small trie (like a pinyin table), an implementation without a tail pool is good enough. I wrote a simple construction program in Python; you can have a look at the full source here.

It's basically a breadth-first traversal: pop the head node from the queue, find a proper base for it, mark the check array for its child nodes, then append the child nodes to the queue, and repeat until the queue is empty.

class DATrie (object):
    ... ...

    def find_base (self, s, children, i=1):
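        # find the smallest base value i such that, for every child character, slot i + code is still free and is not s itself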
        if s == 0:
            return 0

        if not children:
            return s

        while True:
            for ch in children:
                k = i + self.encode_character (ch)
                if self.base[k] or self.check[k] or k == s:
                    i += 1
                    break
            else:
                break

        return i
    ... ...

    def construct_from_trie (self, trie, with_value=True):
        nodes = [(trie.root, 0)]

        while nodes:
            trienode, s = nodes.pop(0)
            b = self.find_base (s, trienode.trans)
            self.base[s] = -b if trienode.val else b
            if with_value: self.value[s] = trienode.val

            for ch in trienode.trans:
                c = self.encode_character (ch)
                t = abs(self.base[s]) + c
                self.check[t] = s if s else -1

                nodes.append ((trienode.trans[ch], t))

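        # children of the root were marked with check = -1 above (0 would look like a free slot to find_base); reset them to 0, the root state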
        for i in xrange (self.encode_character (max(trie.root.trans))+1):
            if self.check[i] == -1:
                self.check[i] = 0
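
For completeness, here is a minimal sketch of how a lookup would walk the base/check arrays built above; it assumes the same encode_character()/base/check/value members, and that a negative base marks a node carrying a value, as construct_from_trie does:

def lookup (datrie, word):
    s = 0
    for ch in word:
        t = abs (datrie.base[s]) + datrie.encode_character (ch)
        if datrie.check[t] != s:
            return None                 # no such transition, word is not in the trie
        s = t
    return datrie.value[s] if datrie.base[s] < 0 else None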

For constructing larger dictionaries, I'd recommend using darts-clone, dastrie (only for a static trie), or libdatrie (which only supports 16-bit value data).

References:

  1. An Implementation of Double-Array Trie
  2. An Efficient Digital Search Algorithm by Using a Double-Array Structure

build matplotlib-0.98.3 on Solaris with SunStudio 12

1. Set the CBE environment (/opt/jdsbld/bin/env.sh) and export CC=$CXX, so that /usr/lib/python2.4/pycc ends up using the C++ compiler from SunStudio.

2. Download the source code from http://matplotlib.sourceforge.net, then apply the following patch,

--- setupext.py	2008-08-04 02:15:22.000000000 +0800
+++ setupext.py	2008-11-03 17:08:41.387138200 +0800
@@ -218,6 +218,8 @@ def get_win32_compiler():
 win32_compiler = get_win32_compiler()
 if sys.platform == 'win32' and win32_compiler == 'msvc':
     std_libs = []
+elif sys.platform == 'sunos5':
+    std_libs = ['Crun', 'Cstd']
 else:
     std_libs = ['stdc++', 'm']

@@ -298,6 +300,7 @@ def check_for_freetype():
         for d in basedirs:
             module.include_dirs.append(os.path.join(d, 'freetype2'))

+    module.include_dirs.append('/usr/include')
     print_status("freetype2", get_pkgconfig_version('freetype2'))
     if not find_include_file(module.include_dirs, 'ft2build.h'):
         print_message(
--- src/_backend_gdk.c	2008-08-04 02:14:18.000000000 +0800
+++ src/_backend_gdk.c	2008-11-03 17:31:32.896210699 +0800
@@ -62,6 +62,7 @@ static PyMethodDef _backend_gdk_function
     { NULL, NULL, 0 }
 };

+extern "C"
 DL_EXPORT(void)
 init_backend_gdk(void)
 {

3. Then build and install,

$ python setup.py build
$ pfexec python setup.py install

Note: this version of matplotlib requires numpy 1.1 or higher; you need to build and install a newer one, since Solaris only ships 1.0.4.
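
To check which numpy your Python picks up before building, the usual interactive check works:

$ python
>>> import numpy
>>> print numpy.__version__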

build scipy-0.6.0 on Solaris with SunStudio 12

I finally got some clues about how to build scipy with SunStudio and the Sun Performance Library from this discussion thread. Here are the steps I summarized:

1. Download the SciPy 0.6.0 source tarball and extract it anywhere. Apply the following change to /usr/lib/python2.4/vendor-packages/numpy/distutils/fcompiler/sun.py, since the f77compat library is not available on the x86/x64 platform,

--- sun.py.orig	2008-10-31 18:30:17.519425733 +0800
+++ sun.py	2008-10-31 18:30:23.842495794 +0800
@@ -38,7 +38,7 @@
         return ['-xtarget=generic']
     def get_libraries(self):
         opt = []
-        opt.extend(['fsu','sunmath','mvec','f77compat'])
+        opt.extend(['fsu','sunmath','mvec'])
         return opt

 if __name__ == '__main__':

and apply the following patch,


--- scipy/sparse/sparsetools/sparsetools.h    2007-09-22 15:55:25.000000000 +0800
+++ scipy/sparse/sparsetools/sparsetools.h    2008-10-31 17:54:47.317379521 +0800
@@ -22,6 +22,7 @@

 #include <vector>
 #include <algorithm>
+#include <functional>

 /*

2. Set the CBE environment (/opt/jdsbld/bin/env.sh), and set the following environment variables,

$ export LDFLAGS="-lCrun -lCstd"
$ export LAPACK=/opt/SUNWspro/lib/libsunmath.so
$ export BLAS=/opt/SUNWspro/lib/libsunperf.so

3. Then build, install, and test

$ python setup.py build
$ pfexec python setup.py install
$ cd ~ # don't run the test under the source directory
$ python
>>> import scipy
>>> scipy.test()

Though you will see several errors, most of the tests pass.

Spectral Clustering with Python

Spectral clustering is the last topic of our NLP learning group activity, hosted by Feng. Here is my homework; you may refer to this tutorial for the symbols used in this simple program, though I still don't really understand the underlying principles of the algorithm.

#!/usr/bin/python
# copyright (c) 2008 Feng Zhu, Yong Sun

import heapq
from functools import partial
from numpy import *
from scipy.linalg import *
from scipy.cluster.vq import *
import pylab

def line_samples ():
    vecs = random.rand (120, 2)
    vecs [:,0] *= 3
    vecs [0:40,1] = 1
    vecs [40:80,1] = 2
    vecs [80:120,1] = 3
    return vecs

def gaussian_simfunc (v1, v2, sigma=1):
    tee = (-norm(v1-v2)**2)/(2*(sigma**2))
    return exp (tee)

def construct_W (vecs, simfunc=gaussian_simfunc):
    n = len (vecs)
    W = zeros ((n, n))
    for i in xrange(n):
        for j in xrange(i,n):
            W[i,j] = W[j,i] = simfunc (vecs[i], vecs[j])
    return W

def knn (W, k, mutual=False):
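    'sparsify W in place into a k-NN graph: an edge is kept if it is within the k nearest of either endpoint (or of both endpoints, when mutual=True)'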
    n = W.shape[0]
    assert (k>0 and k<(n-1))

    for i in xrange(n):
        thr = heapq.nlargest(k+1, W[i])[-1]
        for j in xrange(n):
            if W[i,j] < thr:
                W[i,j] = -W[i,j]

    for i in xrange(n):
        for j in xrange(i, n):
            if W[i,j] + W[j,i] < 0:
                W[i,j] = W[j,i] = 0
            elif W[i,j] + W[j,i] == 0:
                W[i,j] = W[j,i] = 0 if mutual else abs(W[i,j])

vecs = line_samples()
W = construct_W (vecs, simfunc=partial(gaussian_simfunc, sigma=2))
knn (W, 10)
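# degree matrix D and unnormalized graph Laplacian L = D - W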
D = diag([reduce(lambda x,y:x+y, Wi) for Wi in W])
L = D - W

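# solve the generalized eigenproblem L v = lambda D v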
evals, evcts = eig(L,D)
vals = dict (zip(evals, evcts.transpose()))
keys = vals.keys()
keys.sort()

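# embed each sample with the eigenvectors of the 3 smallest eigenvalues, then cluster with k-means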
Y = array ([vals[k] for k in keys[:3]]).transpose()
res,idx = kmeans2(Y, 3, minit='points')

colors = [(1,2,3)[i] for i in idx]
pylab.scatter(vecs[:,0],vecs[:,1],c=colors)
pylab.show()

K-means and K-means++ with Python

WARNING! Sorry, my previous implementation was wrong; thanks a lot to Jan Schlüter for the correction!

It's fairly easy to run k-means clustering in Python; refer to $ pydoc scipy.cluster.vq.kmeans (or kmeans2). However, the initially selected centers affect the result a lot. Thanks to Feng Zhu, who introduced k-means++ to us, a very good and effective way to select the initial centers.

I was totally confused by step (1b) in the paper: select c_i = x', with probability D(x')^2 / sum_{x in X} D(x)^2. I referred to the author's C++ implementation and thought the sampling step (Utils.cpp, lines 299-305) was just an optimization, so I simply minimized sum_{x in X} min(D(x)^2, ||x-xi||^2).

Thanks a lot to Jan Schlüter, who pointed out that my previous implementation was wrong: the sampling step (1b) is a crucial part of the k-means++ algorithm. He has also posted an optimized implementation here.

Here comes my revised python code (unoptimized):

# imports needed by this snippet
from math import log
from numpy import array
from numpy.random import randint, random
from numpy.linalg import norm

def kinit (X, k, ntries=None):
    'init k seeds according to kmeans++'
    if not ntries: ntries = int (2 + log(k))
    n = X.shape[0]

    'choose the 1st seed randomly, and store D(x)^2 in D[]'
    centers = [X[randint(n)]]
    D       = [norm(x-centers[0])**2 for x in X]
    Dsum    = reduce (lambda x,y:x+y, D)

    for _ in range(k-1):
        bestDsum = bestIdx = -1

        for _ in range(ntries):
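            'draw index i with probability D[i]/Dsum, i.e. the D(x)^2 weighting of step (1b)'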
            randVal = random() * Dsum
            for i in range(n):
                if randVal <= D[i]:
                    break
                else:
                    randVal -= D[i]

            'tmpDsum = sum_{x in X} min(D(x)^2,||x-xi||^2)'
            tmpDsum = reduce(lambda x,y:x+y,
                             (min(D[j], norm(X[j]-X[i])**2) for j in xrange(n)))

            if bestDsum < 0 or tmpDsum < bestDsum:
                bestDsum, bestIdx  = tmpDsum, i

        Dsum = bestDsum
        centers.append (X[bestIdx])
        D = [min(D[i], norm(X[i]-X[bestIdx])**2) for i in xrange(n)]

    return array (centers)

'to use kinit() with kmeans2()'
res,idx = kmeans2(Y, kinit(Y,3), minit='matrix')

python and Gnu-libreadline

You may know that the interactive Python shell uses GNU readline (which is GPL-licensed) to implement history retrieval. But does this affect your own Python program, or an application with the Python interpreter embedded in it?

I did a little study of the Python interpreter, libraries and built-in modules on Linux (which were built with libreadline); it looks like only one extension module links to libreadline: /usr/lib/python2.5/lib-dynload/readline.so. This module is only loaded by the interactive Python shell. When I launch an ordinary Python script directly, or embed the Python interpreter in a C/C++ application, the module is not loaded, unless the script imports the readline module explicitly. (P.S. I used pmap(1) to verify whether libreadline.so is loaded at runtime.)
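
For instance, you can see the same thing from inside Python itself (sys.modules only lists the modules that have actually been imported):

import sys
print 'readline' in sys.modules    # False for an ordinary script
import readline                    # only this import pulls libreadline in
print 'readline' in sys.modules    # True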

I think only the readline extension module falls under the GPL terms, so it should be safe to build Python itself with libreadline. :)

Python and UTF32/UCS4

Your Python interpreter may be compiled with --enable-unicode=ucs2, in which case the built-in unichr(i) function raises an exception if the given value is larger than 0xFFFF.

To detect the Unicode support in your Python build,

import sys
print sys.maxunicode
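# 65535 (0xffff) on a ucs2/narrow build, 1114111 (0x10ffff) on a ucs4/wide build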

The 'ucs2' here actually means UTF-16, which is a variable-length encoding, so you need a simple function to convert UTF-32/UCS4 code points to UTF-16. Here is an example code snippet,

def ucs4chr(codepoint):
    try:
        return unichr(codepoint)
    except ValueError:
        hi, lo = divmod (codepoint-0x10000, 0x400)
        return unichr(0xd800+hi) + unichr(0xdc00+lo)

def ucs4ord(str):
    if len(str)==1:
        return ord(str)
    if len(str)==2:
        hi, lo = ord(str[0])-0xd800, ord(str[1])-0xdc00
        return hi*0x400+0x10000+lo
    raise TypeError("ucs4ord() expected a valid ucs4 character")
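
For example, on a narrow (ucs2) build, a code point outside the BMP such as U+1D11E round-trips through a surrogate pair:

>>> ucs4chr(0x1D11E)
u'\ud834\udd1e'
>>> ucs4ord(ucs4chr(0x1D11E)) == 0x1D11E
True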

Y combinator in Python

To implement recursion in the lambda calculus, you need to borrow the Y combinator. Below is an implementation in Python.

Y = lambda F: (lambda f: F(lambda x: f(f)(x))) (lambda f: F(lambda x: f(f)(x)))
F = lambda f: (lambda x: 1 if x==0 else x*f(x-1))

fact = Y(F)
print fact(10)

I find this Y combinator truly dizzying; I can barely tell what is going on. For the detailed derivation and the related theory, please see Liu Weipeng (刘未鹏)'s excellent article, as well as the related posts on the 负暄琐话 blog (1, 2). After reading them a few times I still only understand a little, hazily, especially the diagonalization argument near the end. I'll read them again, and again ...