A follow-up note on maxent word segmentation

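While discussing maximum entropy with colleagues, I described an earlier experiment that used maximum entropy for word segmentation, and suddenly found myself puzzled about why each event needs the three items U03~U05. I had not thought it through at the time.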

... ...
E U00-人 U01-们 U02-常 U03-人/们 U04-们/常 U05-人/常 B
... ...

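After sorting out my thoughts again: in Zhang Le's (张乐) toolkit, an event is not the sample; the sample should be the three-character window. For example, the sample “人们常” produces 7 features, namely (U00-人, E), (U01-们, E), (U02-常, E), (U03-人/们, E), (U04-们/常, E), (U05-人/常, E) and (B, E), and these features together make up one event. (U00-人, E) describes the case where the three-character window starts with “人” and the middle character is tagged “E”; (U05-人/常, E) describes the case where the left and right characters of the window are “人” and “常” and the middle character is tagged “E”; (B, E) describes the case where the first character of the window (that is, the previous observation) is tagged B and the middle character is tagged E.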
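The mapping from a three-character window to one event line can be sketched as follows. This is only an illustration: the helper names `window_features` and `event_line` are hypothetical and not part of any toolkit, and the line layout simply copies the example event shown above.

```python
def window_features(left, mid, right, prev_tag):
    """Context predicates for one three-character window (hypothetical helper)."""
    return [
        "U00-" + left,                 # first character of the window
        "U01-" + mid,                  # middle character (the one being tagged)
        "U02-" + right,                # last character of the window
        "U03-" + left + "/" + mid,     # left + middle
        "U04-" + mid + "/" + right,    # middle + right
        "U05-" + left + "/" + right,   # left + right
        prev_tag,                      # tag of the previous character
    ]


def event_line(tag, left, mid, right, prev_tag):
    # One event: the outcome (tag of the middle character) followed by its predicates.
    return " ".join([tag] + window_features(left, mid, right, prev_tag))


print(event_line("E", "人", "们", "常", "B"))
```

Each predicate paired with the outcome "E" gives one of the seven features listed above.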

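Seen this way, what we originally trained was an ME model with a state-transition constraint added, not an MEMM. The features of an MEMM take each ME feature and additionally condition it on the previous state. So the events used to train an MEMM should be written like this: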

... ...
E U00-人-B U01-们-B U02-常-B U03-人/们-B U04-们/常-B U05-人/常-B
... ...
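A minimal sketch of how such an MEMM-style event line could be generated, assuming the same three-character window and that the previous tag is simply appended to every predicate. The function name `memm_event_line` is hypothetical.

```python
def memm_event_line(tag, left, mid, right, prev_tag):
    """Build one MEMM-style event line (hypothetical helper).

    Every ME predicate is conjoined with the previous tag, so the previous
    state becomes a condition of each feature rather than a feature of its own.
    """
    predicates = [
        "U00-" + left,
        "U01-" + mid,
        "U02-" + right,
        "U03-" + left + "/" + mid,
        "U04-" + mid + "/" + right,
        "U05-" + left + "/" + right,
    ]
    return " ".join([tag] + [p + "-" + prev_tag for p in predicates])


print(memm_event_line("E", "人", "们", "常", "B"))
```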

In the experiment, accuracy improved slightly on the msr data set but dropped slightly on the pku data set.

If your app starts earlier than gnome-session

We recently found a serious problem: if your application starts earlier than gnome-session and accesses gconf services, as iiim-panel does, you will see two gconfd daemons after you log in to GNOME. This might be due to the introduction of gnome-keyring (with ssh-agent), which perhaps keeps gnome-session from recognizing and reusing the existing gconfd daemon.

Then, when the user later invokes iiim-properties(1), it actually talks to the second gconfd, while iiim-panel is still connected to the first one. That is why iiim-panel does not update its language list after input methods are added or removed with iiim-properties. [OS.o bug #4024]

A workaround is to terminate your gconfd daemon after iiim-panel has started. There should be no explicit impact on existing clients; they will try to connect to the new gconfd. A better solution is to defer starting iiim-panel until after gnome-session.

P.S. It turns out that killing gconfd does not work perfectly: iiim-properties(1) launched from a terminal and the one launched by iiim-panel(1) would still start two different gconf daemons.

Changing UID and GID on Mac OS X Leopard


$ id
uid=501(yongsun) gid=501(yongsun) groups=501(yongsun),98(_lpadmin),81(_appserveradm),79(_appserverusr),80(admin)

$ sudo dscl . -change /Users/yongsun UniqueID 501 1000
$ sudo dscl . -change /Users/yongsun PrimaryGroupID 501 1000
$ sudo dscl . -change /Groups/yongsun PrimaryGroupID 501 1000

$ id
uid=1000(yongsun) gid=1000(yongsun) groups=1000(yongsun),98(_lpadmin),81(_appserveradm),79(_appserverusr),80(admin)

$ sudo chown -R yongsun:yongsun /Users/yongsun
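After the chown, it may be worth checking that nothing under the home directory still belongs to the old numeric UID. The following is a rough sketch using only the Python standard library; the function name `files_owned_by` is hypothetical, and the path and UID in the usage part are the ones from the session above.

```python
import os


def files_owned_by(root, uid):
    """Yield paths under root whose owner is the given numeric uid."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat reports a symlink itself rather than its target
            if os.lstat(path).st_uid == uid:
                yield path


if __name__ == "__main__":
    # list anything still owned by the old UID 501
    for path in files_owned_by("/Users/yongsun", 501):
        print(path)
```

If this prints nothing, no file or directory below the home directory is still owned by UID 501.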

Spectral Clustering with Python

Spectral clustering is the last topic of our NLP learning group activity, hosted by Feng. Here is my homework; you may refer to this tutorial for the symbols used in this simple program. I still have no idea about the principles underlying the algorithm, though.

#!/usr/bin/python
# copyright (c) 2008 Feng Zhu, Yong Sun
import heapq
from functools import partial

import numpy as np
from scipy.linalg import eigh, norm
from scipy.cluster.vq import kmeans2
import pylab


def line_samples():
    # 120 random points on three horizontal lines (y = 1, 2, 3), x in [0, 3)
    vecs = np.random.rand(120, 2)
    vecs[:, 0] *= 3
    vecs[0:40, 1] = 1
    vecs[40:80, 1] = 2
    vecs[80:120, 1] = 3
    return vecs


def gaussian_simfunc(v1, v2, sigma=1):
    tee = (-norm(v1 - v2)**2) / (2 * (sigma**2))
    return np.exp(tee)


def construct_W(vecs, simfunc=gaussian_simfunc):
    # full symmetric similarity matrix
    n = len(vecs)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            W[i, j] = W[j, i] = simfunc(vecs[i], vecs[j])
    return W


def knn(W, k, mutual=False):
    # sparsify W in place into a (mutual) k-nearest-neighbor graph
    n = W.shape[0]
    assert (k > 0 and k < (n - 1))
    # flip the sign of every entry below the row's (k+1)-th largest value
    # (k+1 because each point is its own nearest neighbor)
    for i in range(n):
        thr = heapq.nlargest(k + 1, W[i])[-1]
        for j in range(n):
            if W[i, j] < thr:
                W[i, j] = -W[i, j]
    # symmetrize: drop edges that neither side kept; keep one-sided edges
    # unless a mutual kNN graph is requested
    for i in range(n):
        for j in range(i, n):
            if W[i, j] + W[j, i] < 0:
                W[i, j] = W[j, i] = 0
            elif W[i, j] + W[j, i] == 0:
                W[i, j] = W[j, i] = 0 if mutual else abs(W[i, j])


vecs = line_samples()
W = construct_W(vecs, simfunc=partial(gaussian_simfunc, sigma=2))
knn(W, 10)
D = np.diag(W.sum(axis=1))
L = D - W
# generalized symmetric problem L v = lambda D v; eigh returns the
# eigenvalues in ascending order, with eigenvectors as columns
evals, evcts = eigh(L, D)
Y = evcts[:, :3]
res, idx = kmeans2(Y, 3, minit='points')
colors = [(1, 2, 3)[i] for i in idx]
pylab.scatter(vecs[:, 0], vecs[:, 1], c=colors)
pylab.show()
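One property that may help with the intuition: every row of L = D - W sums to zero, so the generalized problem L v = lambda D v always has eigenvalue 0 with a constant eigenvector, and the next eigenvectors are the ones that carry the cluster structure. Below is a small check on a toy graph. The weights are made up for illustration only (two tightly linked pairs with weak links between them).

```python
import numpy as np
from scipy.linalg import eigh

# made-up symmetric weights: nodes {0, 1} and {2, 3} are strongly linked
# inside each pair (weight 1.0) and weakly linked across pairs (weight 0.1)
W = np.array([[0.0, 1.0, 0.1, 0.0],
              [1.0, 0.0, 0.0, 0.1],
              [0.1, 0.0, 0.0, 1.0],
              [0.0, 0.1, 1.0, 0.0]])
D = np.diag(W.sum(axis=1))
L = D - W

# eigenvalues come back in ascending order, eigenvectors as columns
evals, evcts = eigh(L, D)

print(evals[0])         # numerically zero
print(evcts[:, 0])      # constant vector
labels = evcts[:, 1] > 0
print(labels)           # the sign of the second eigenvector splits the two pairs
```

Which pair ends up labelled True depends on the arbitrary sign of the eigenvector; only the grouping is meaningful.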

K-means and K-means++ with Python

WARNING! Sorry, my previous implementation was wrong; many thanks to Jan Schlüter for the correction!

It's fairly easy to run k-means clustering in Python; refer to $pydoc scipy.cluster.vq.kmeans (or kmeans2). However, the initially selected centers affect the performance a lot. Thanks to Feng Zhu, who introduced k-means++ to us; it is a very good and effective way to select the initial centers.

I was totally confused about step (1b) in the paper: selecting $c_i = x'$ with probability $\frac{D(x')^2}{\sum_{x \in X} D(x)^2}$. I referred to the authors' C++ implementation and thought the sampling step (Utils.cpp, lines 299-305) was just an optimization. So I simply minimized $\sum_{x \in X} \min(D(x)^2, \|x - x_i\|^2)$.

Many thanks to Jan Schlüter, who pointed out that my previous implementation was wrong: the sampling step (1b) is a crucial part of the k-means++ algorithm. He has also posted an optimized implementation here,

Here comes my revised python code (unoptimized):

from math import fsum, log

from numpy import array
from numpy.random import randint, random
from scipy.linalg import norm
from scipy.cluster.vq import kmeans2


def kinit(X, k, ntries=None):
    'init k seeds according to kmeans++'
    if not ntries:
        ntries = int(2 + log(k))
    n = X.shape[0]
    # choose the 1st seed randomly, and store D(x)^2 in D[]
    centers = [X[randint(n)]]
    D = [norm(x - centers[0])**2 for x in X]
    Dsum = fsum(D)
    for _ in range(k - 1):
        bestDsum = bestIdx = -1
        for _ in range(ntries):
            # step (1b): draw index i with probability D[i] / Dsum
            randVal = random() * Dsum
            for i in range(n):
                if randVal <= D[i]:
                    break
                else:
                    randVal -= D[i]
            # tmpDsum = sum_{x in X} min(D(x)^2,||x-xi||^2)
            tmpDsum = fsum(min(D[j], norm(X[j] - X[i])**2) for j in range(n))
            # among the ntries sampled candidates, keep the best one
            if bestDsum < 0 or tmpDsum < bestDsum:
                bestDsum, bestIdx = tmpDsum, i
        Dsum = bestDsum
        centers.append(X[bestIdx])
        D = [min(D[i], norm(X[i] - X[bestIdx])**2) for i in range(n)]
    return array(centers)


# to use kinit() with kmeans2(): Y is the data matrix, one sample per row;
# minit='matrix' tells kmeans2 to treat its second argument as initial centers
res, idx = kmeans2(Y, kinit(Y, 3), minit='matrix')
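The inner loop that walks through D[] to draw an index with probability proportional to D(x)^2 can also be written with a cumulative sum and a binary search. This is only a sketch of the sampling step on its own, not the authors' code; the name `sample_d2` and the toy weights are hypothetical.

```python
import numpy as np


def sample_d2(D2, rng):
    """Draw one index i with probability D2[i] / sum(D2) (hypothetical helper)."""
    cum = np.cumsum(D2)
    r = rng.random() * cum[-1]
    # first position whose cumulative sum is strictly greater than r
    i = int(np.searchsorted(cum, r, side="right"))
    # guard against floating-point round-off at the upper end
    return min(i, len(cum) - 1)


rng = np.random.default_rng(0)
D2 = np.array([0.0, 1.0, 3.0, 0.0])   # toy squared distances
draws = [sample_d2(D2, rng) for _ in range(10000)]
counts = np.bincount(draws, minlength=len(D2))
# indices 0 and 3 have zero weight and are never drawn;
# index 2 should come up about three times as often as index 1
print(counts / counts.sum())
```

Points that already coincide with a chosen center have D(x)^2 = 0 and therefore cannot be picked again, which is the property the sampling step relies on.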

5 Years @ Sun

5 years ago, on Oct/8/2003, I joined Sun China ERI and started working in the Asian Globalization Center. It has been a really amazing time for me at Sun and AGC. I love the open culture, the awesome teammates and colleagues, the comfortable and beautiful working environment, coffee and tea-times ...

It's a new start for me; I'm looking forward to my 10-year, 15-year and 20-year anniversaries :)

My take on Fan Paopao (范跑跑)

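Actually, I understand Fan Paopao quite well: he loves his infant daughter at home very, very much. So young a life makes parents feel they must give up anything to protect him or her, and that includes their own lives, and perhaps also their commitment to professional ethics. If his daughter were already 8 or 9 and starting to grow up, perhaps he could have gone willingly to his death in that moment of crisis.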

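After Xiaoxiao was born, I too thought back to that classic question: if your child, your wife and your parents all fell into the water from the same boat, whom would you save first? My wife and I gave the same answer: the child. When I used to travel abroad on business, the beneficiary of my insurance was my wife; now it is Xiaoxiao (and I thank my wife for understanding). If some great disaster befell me, I would do everything I could to stay alive. That conviction to stay alive makes you strong, and it also makes you fragile.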

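I cannot imagine, and will not allow, Xiaoxiao being away from his parents for a long time before he turns 6. Although my parents and my parents-in-law are confident that they could take good care of him, and we have no great objection to that, we also firmly believe that we can do better.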

Who is lying?

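CCTV said that melamine was detected in Yili products sold on the market, but that the dairy products supplied to the Olympic Games were fine. The implication is that the products supplied to the Olympics were different, which also hints that Yili actually knew the products it sold on the market carried quality risks. Yili, for its part, stated that the products supplied to the Olympics and those sold on the market were of the same quality. And the test results, covering 69 batches of products from 22 companies, tell us that the dairy companies actually knew the milk supply in some regions might be problematic, and that this has more or less become an unwritten rule of the industry.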

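I hope the whole affair is investigated thoroughly, and that it does not end with just pulling products from the shelves, destroying them and paying compensation. There must be severe punishment!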

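Copied from Phoenix TV: the Chinese people have completed their chemistry literacy course through food:

  • From rice we learned about paraffin wax
  • From ham we learned about dichlorvos
  • From salted duck eggs and chili sauce we learned about Sudan Red
  • From turbot we learned about malachite green
  • From hot pot we learned about formalin
  • From white fungus and candied dates we learned about sulfur
  • From wood ear fungus we learned about copper sulfate
  • Today Sanlu has taught our compatriots the chemical effects of melamine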