Thinking about a new IM platform, python+dbus

Recently I've been thinking about a new input-method (IM) platform built on Python and D-Bus.

The advantages: D-Bus has a Python binding, which makes it easy to write an IM server daemon in Python. On the client side, D-Bus also has GLib and Qt bindings, which makes it easier to write the input method modules for GTK and Qt. For input method developers, being able to write an input method in Python (optionally with C/C++ Python extensions) is also nice.

How about the performance? I have no idea, but I think it's worth a try to build a prototype. :)
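
To make the idea a bit more concrete, here is a minimal sketch of what such a daemon might look like with the dbus-python binding. Everything specific in it is an assumption for illustration: the bus name, object path, interface name, the ProcessKeyEvent/GetPreedit methods, and the toy engine are all hypothetical and not part of any existing IM framework. The D-Bus imports are done lazily inside serve() so the engine logic can be tried without D-Bus installed.

```python
# Hypothetical names, chosen only for this sketch.
BUS_NAME = "org.example.InputMethod"
OBJECT_PATH = "/org/example/InputMethod"
IFACE = "org.example.InputMethod"


class ToyEngine:
    """Toy engine: buffers lowercase ASCII letters, commits them on space."""

    def __init__(self):
        self.preedit = ""
        self.committed = []

    def process_key(self, keyval):
        """Return True if the key was consumed by the engine."""
        if not 0 <= keyval < 0x110000:
            return False
        ch = chr(keyval)
        if "a" <= ch <= "z":
            self.preedit += ch
            return True
        if ch == " " and self.preedit:
            self.committed.append(self.preedit)
            self.preedit = ""
            return True
        return False


def serve(engine):
    """Expose the engine on the session bus (requires dbus-python and PyGObject)."""
    import dbus
    import dbus.service
    from dbus.mainloop.glib import DBusGMainLoop
    from gi.repository import GLib

    DBusGMainLoop(set_as_default=True)

    class Service(dbus.service.Object):
        @dbus.service.method(IFACE, in_signature="u", out_signature="b")
        def ProcessKeyEvent(self, keyval):
            return engine.process_key(int(keyval))

        @dbus.service.method(IFACE, in_signature="", out_signature="s")
        def GetPreedit(self):
            return engine.preedit

    bus = dbus.SessionBus()
    name = dbus.service.BusName(BUS_NAME, bus)  # keep a reference alive
    service = Service(bus, OBJECT_PATH)         # keep a reference alive
    GLib.MainLoop().run()


if __name__ == "__main__":
    serve(ToyEngine())
```

A GTK or Qt input method module would then be a thin client that forwards key events to ProcessKeyEvent and displays the preedit string; the round trip over the bus for every key press is exactly where the performance question comes in.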

Adventure at the Wildlife Park

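On Friday, on a morning after rain in Beijing, a group of us from AGC took a bus to the Badaling Wildlife Park to plant trees and to watch the animals (or be watched by them). The forecast called for light rain and low temperatures around Badaling, so I had put on an extra fleece. As the bus neared the park, I was startled to see streaks of rain flying sideways across the sky; a closer look showed they were actually snow pellets. I groaned inwardly: my shell jacket was waterproof, but my backpack and camera were not.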

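At the park we went to the foot of Monkey Hill to plant trees. The pits had already been dug; we set in saplings a bit over half a person's height and watered them. By then the snow was getting heavier. Then everyone boarded a tour bus to see the park. The first half went smoothly, but as we were about to leave one of the tiger enclosures, the bus "drifted" on a fairly steep uphill bend because of the heavy snow and slippery road. The left rear wheel dropped into the "drainage ditch" beside the road, and one window on the left side was pushed in and somewhat bent out of shape by a thick protruding tree branch. We were stuck: we could not go up, and backing down risked tilting the bus further so that the branch would break through the window. The park sent out patrol cars to try to drive the three tigers in the enclosure into their house, so that we could walk to a relief bus waiting ahead. One tiger, however, seemed to have gone missing. The road ahead was probably also steep and slippery, and the relief bus apparently never got into position. Aaron handed out the last of the rations (chocolate), and we prepared for a long wait. Everyone joked and laughed, but we were still uneasy.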

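After a bit more than half an hour, the driver decided to back out after all. Luckily the window held. The bus reversed along the narrow, snowy road with a steep, deep ravine on the right. It was quite hair-raising. At a wider fork the driver turned the bus around and took another branch that led out of the tiger enclosure. Once we were out, everyone finally breathed a sigh of relief. But no sooner had we escaped the tiger's mouth than we entered the lions' den. The exit from the lion enclosure has a very steep downhill bend, and the driver did not dare to take it rashly. A loader was called in to spread sand on the road for us. The bus sat in the lion enclosure for nearly an hour. Although the lions kept pacing around the bus, nobody was as tense as in the tiger enclosure; we were happily busy taking pictures of the lions, though we were getting hungry. It was past 1:30 p.m. when we finally got out of the lion enclosure. After wolfing down lunch we walked through the small-animal area, heading for the main gate to catch the bus out of the park. On the way an iron gate blocked our path; presumably staff had locked it because the park had been closed for the heavy snow. Most of us men could have climbed over, but that would have been hard on our female colleagues. If the guide had not reached a staff member to come and open it, we would probably have had to form a human ladder.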

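We left the park, reached the parking lot, and rode back to the city. The expressway was fairly slippery but had not been closed. The trip back went smoothly, and in town we found it was still raining. Everyone was tired, so we went our separate ways.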

I took a few photos, and I regret not bringing a telephoto lens...

Donkey on Leopard

The latest official build of aMule is 2.1.3. Although it can connect to the ED2K/Kad network, you may find the download speed very slow. aMule-cvs is the solution; fortunately, a Chinese developer, HDFreeleader, builds aMule-cvs snapshots weekly.

The newest build is amule-cvs-20080322. It requires libjpeg.62.dylib in /usr/local/lib. I recommend installing libjpeg via MacPorts and then creating a symbolic link to it there.

Experimenting with CRF++

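Using Dr. Zhao Hai's 6-tag, 6-template method, I ran experiments on the publicly available Bakeoff 2005 corpora. I wrote a simple Python script to convert the UTF-8 training corpora into the format CRF++ accepts. The converted MSR corpus is 24 MB; training took about 26 hours and produced a 25 MB model, which reaches an F-score of 96% on the MSR test data (measured with a Python evaluation script) but only a little over 82% on the PKU test data. The converted PKU corpus is 11 MB; training took nearly 13 hours and produced a 14 MB model, which scores a little over 92% F on the PKU test data and only about 82% on the MSR test data. It appears that the MSR and PKU training corpora follow quite different segmentation conventions, which explains the low cross-corpus scores.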
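
The conversion step can be sketched roughly as follows. This is not the original script; it is a minimal reconstruction that assumes the 6-tag set is B, B2, B3, M, E, S (first, second, and third character of a word, any further middle character, last character, and single-character word), and that the input has one sentence per line with words separated by whitespace, as in the Bakeoff training files. The output follows the CRF++ convention of one token per line with tab-separated columns and an empty line after each sentence.

```python
def tag_word(word):
    """Return one tag per character of `word` under the assumed 6-tag scheme."""
    n = len(word)
    if n == 1:
        return ["S"]
    head = ["B", "B2", "B3"][: n - 1]       # leading positions, leaving room for E
    middle = ["M"] * (n - 1 - len(head))    # anything between B3 and E
    return head + middle + ["E"]


def to_crfpp(lines):
    """Convert whitespace-segmented sentences to CRF++ training text."""
    out = []
    for line in lines:
        words = line.split()
        if not words:
            continue                        # skip empty input lines
        for word in words:
            for ch, tag in zip(word, tag_word(word)):
                out.append(f"{ch}\t{tag}")
        out.append("")                      # empty line ends the sentence
    return "".join(row + "\n" for row in out)


if __name__ == "__main__":
    print(to_crfpp(["我们 喜欢 猫"]), end="")
```

For a real corpus the function would be fed the lines of the UTF-8 training file and the result written to the file passed to crf_learn.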

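Also, probably because of thread-safety problems in the C++ STL, multi-threaded training segfaulted on Linux, Solaris, and Mac OS, so all training was single-threaded. I dare not imagine how long it would take, or how much memory it would use, with a corpus of several hundred megabytes...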
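
For reference, the F-scores quoted above are word-level scores. A minimal sketch of how such an evaluation might be computed is given below; it is not the evaluation script actually used, and it assumes the gold and predicted segmentations cover the same character sequence. A predicted word counts as correct only if both of its boundaries match a gold word.

```python
def spans(words):
    """Map a list of words to the set of (start, end) character spans."""
    result, start = set(), 0
    for word in words:
        result.add((start, start + len(word)))
        start += len(word)
    return result


def prf(gold_sentences, pred_sentences):
    """Return word-level (precision, recall, F-score) over parallel sentences."""
    correct = gold_total = pred_total = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        g, p = spans(gold), spans(pred)
        correct += len(g & p)
        gold_total += len(g)
        pred_total += len(p)
    precision = correct / pred_total if pred_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = [["我们", "喜欢", "猫"]]
    pred = [["我", "们", "喜欢", "猫"]]
    print(prf(gold, pred))
```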

The feature templates are defined as follows:

# Unigram
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
U03:%x[-1,0]/%x[0,0]
U04:%x[0,0]/%x[1,0]
U05:%x[-1,0]/%x[1,0]
# Bigram
B
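
To show what these templates mean, here is a small sketch that expands the %x[row,col] macros for one position in a sentence. The row is an offset relative to the current token and the column selects the input column (only column 0, the character itself, exists here). Positions that fall outside the sentence are replaced by the boundary markers _B-1, _B-2, ... and _B+1, _B+2, ..., as CRF++ does. The function name expand is made up for this illustration; the real expansion happens inside CRF++.

```python
import re

TEMPLATES = [
    "U00:%x[-1,0]",
    "U01:%x[0,0]",
    "U02:%x[1,0]",
    "U03:%x[-1,0]/%x[0,0]",
    "U04:%x[0,0]/%x[1,0]",
    "U05:%x[-1,0]/%x[1,0]",
]

MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand(template, rows, pos):
    """Expand one template at token index `pos`; `rows` is a list of column lists."""
    def lookup(match):
        row = pos + int(match.group(1))
        col = int(match.group(2))
        if row < 0:
            return "_B%d" % row                      # -1 -> "_B-1", -2 -> "_B-2"
        if row >= len(rows):
            return "_B+%d" % (row - len(rows) + 1)   # first past the end -> "_B+1"
        return rows[row][col]
    return MACRO.sub(lookup, template)


if __name__ == "__main__":
    sentence = [["我"], ["们"], ["喜"], ["欢"], ["猫"]]
    for template in TEMPLATES:
        print(expand(template, sentence, 0))
```

Each distinct expanded string becomes one observation, and CRF++ pairs it with every output tag to form the unigram feature functions.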

A Beginner's Notes on CRF++

Thanks to Yandong's help and guidance, I now have some basic ideas about CRF (Conditional Random Field) and what a CRF model looks like. The encoder of CRF++, crf_learn, can generate a model in text format with the '-t' option. Taking the Japanese word segmentation demonstration (example/seg) as an example, the following is the model in text format:

version: 100
cost-factor: 1
maxid: 1386      /* the number of feature functions */
xsize: 1
B                /* the tag lists, in this case, we have two tags */
I
U00:%x[-2,0]     /* unigram feature templates */
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-2,0]/%x[-1,0]/%x[0,0]
U06:%x[-1,0]/%x[0,0]/%x[1,0]
U07:%x[0,0]/%x[1,0]/%x[2,0]
U08:%x[-1,0]/%x[0,0]
U09:%x[0,0]/%x[1,0]
B               /* bigram feature template */
0 B             /* bigram of the tags for C_{-1} and C_0,  */
                /* number of features is (# of tags)^2.    */
4 U00:_B-1      /* _B-1 is the starting of a sentence */
                /* _B+1 is the ending of a sentence   */
6 U00:_B-2      /* _B-2 is the pre-token of _B-1  */
                /* _B+2 is the post-token of _B+1 */
8 U00:
10 U00:、       /* feature function id, template id, and observation */
12 U00:〇       /* since we only have two tags, each entry could     */
14 U00:「       /* be expanded to 2 feature functions                */
20 U00:う
... ...
... ...
1382 U09:3/年
1384 U09:9/3
-0.0799963416235706     /* the weight for each feature function */
0.4346315510326526      /* the negative value indicates the     */
-0.1044728887459596     /* feature is rarely seen, and we have  */
-0.2501623206703318     /* 1386 weights in total.               */
... ...
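
The id column in the dump can be reproduced with a few lines of code. This is only a sketch of the numbering as I understand it from the dump above: a bigram entry reserves (number of tags)^2 consecutive ids, a unigram entry reserves one id per tag, and the weight for a given tag (or tag pair) sits at the entry's id plus an offset. The function names are made up for this illustration.

```python
def assign_ids(entries, num_tags):
    """Assign the starting feature-function id to each entry, in order."""
    ids, next_id = {}, 0
    for entry in entries:
        ids[entry] = next_id
        # Bigram templates start with "B", unigram templates with "U".
        next_id += num_tags ** 2 if entry.startswith("B") else num_tags
    return ids, next_id


def unigram_weight_index(entry_id, tag_index):
    """Index into the weight list for a unigram entry and one tag."""
    return entry_id + tag_index


def bigram_weight_index(entry_id, prev_tag_index, tag_index, num_tags):
    """Index into the weight list for a bigram entry and a tag pair."""
    return entry_id + prev_tag_index * num_tags + tag_index


if __name__ == "__main__":
    tags = ["B", "I"]
    ids, total = assign_ids(["B", "U00:_B-1", "U00:_B-2"], len(tags))
    print(ids, total)
```

With two tags this gives ids 0, 4, and 6 for the first three entries, matching the dump, and the running total after the last entry is what maxid reports.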

My new desktop

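The living room is where our son spends most of his time, and to keep him from watching TV, the television there has been sitting mostly idle. My wife and I talked about buying another LCD TV of about 20 inches for the bedroom. The old desktop PC in the bedroom has a very noisy fan, so I was planning to replace it too. I had my eye on the MacMini, and my wife liked it as well, so I brought one back from a business trip. Its hard disk is rather small; once we add a Time Capsule at home, the setup will be complete. Then I went to "the village" and bought a 21.6-inch BenQ LCD that supports TV, VGA, and HDMI inputs, with a resolution of up to 1680x1050. The picture above shows the MacMini and the LCD together.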

I've realized that I have turned into a complete Apple fan...

How do tagging and filtering work for localization installation?

As far as I know, IPS supports tagging on files (presumably as an attribute of the 'file' action?), and from what I can tell from the documentation, a tag is a key-value pair, e.g., arch=i386. The Indiana team may prefer to bundle localization content, such as messages and online help, into the base package. For example, the French messages and docs of openoffice would not be shipped as individual packages but inside the openoffice@2.3.1 package, and you would set a filter to install them.

What I'm curious about is whether I can specify package patterns in a filter, such as

(arch=i386 | arch=generic) &
    (packages=openoffice & locale=fr & (doc=true | message=true) |
     packages=all-installed & locale=fr & (doc=true | message=true) |
     packages=lang-french-support,ttf-french-fonts,...)

With this filter, I could install the docs and messages for openoffice and for all already-installed packages (which may already include openoffice), as well as the specified packages 'lang-french-support' and 'ttf-french-fonts'.

Certainly, lang-french-support and ttf-french-fonts should also carry the locale=fr tag, but I may not want to install all the fr l10n content for packages that are not installed.
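
To spell out the semantics I have in mind, here is the filter above written as a plain Python predicate. It has nothing to do with the real IPS code or API: the function name, the tag dictionary, and the treatment of a missing arch tag as "generic" are all assumptions made for this illustration.

```python
def wanted(package, tags, installed, extra_packages):
    """Decide whether a tagged file in `package` matches the example filter.

    package         -- name of the package the file belongs to
    tags            -- dict of tag key/value pairs on the file
    installed       -- set of names of already-installed packages
    extra_packages  -- set of explicitly requested packages
    """
    # (arch=i386 | arch=generic); an untagged file is assumed to be generic.
    if tags.get("arch", "generic") not in ("i386", "generic"):
        return False

    # packages=lang-french-support,ttf-french-fonts,...
    if package in extra_packages:
        return True

    # locale=fr & (doc=true | message=true)
    french_l10n = tags.get("locale") == "fr" and (
        tags.get("doc") == "true" or tags.get("message") == "true"
    )

    # packages=openoffice, or packages=all-installed
    return french_l10n and (package == "openoffice" or package in installed)


if __name__ == "__main__":
    extra = {"lang-french-support", "ttf-french-fonts"}
    print(wanted("openoffice", {"locale": "fr", "doc": "true"}, set(), extra))
    print(wanted("gimp", {"locale": "fr", "message": "true"}, {"openoffice"}, extra))
```

The second call shows the point of the last paragraph: French content of a package that is neither installed nor requested is left out.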

I also assume that the sections or groups in the IPS GUI are actually filters. Could filters be installed or updated through IPS updates, or could an end user import a filter?

I sent the above questions to pkg-discuss, but there has been no response yet :(

Notes on Renting a Car and Driving in the US

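A colleague is going to the US on a business trip and, knowing that I had rented a car there, asked me what to watch out for when renting and driving. I promised to write it up on my blog, but last week was too busy, and only today did I find time to put a few notes together.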

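  1. To rent a car you need your passport, driver's license, and a credit card. At AVIS my ICBC VISA card was rejected; luckily I had also brought a China Merchants Bank MasterCard. My American colleague suggested that if you are only staying a week or two, you consider buying a tank of fuel from the rental company, so that you don't have to fill the tank before returning the car. If you don't know the roads and are travelling alone, rent a GPS as well. The thing I hated hearing most was the GPS telling me "recalculating".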

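  2. On most freeways the innermost lane is the "carpool lane", which you may not use when driving alone. You can exceed the posted speed limit by about 10 mph; that is, where the limit is 60 you can drive at 70 without getting a ticket. According to a colleague, driving below 40 on a freeway can also get you fined. There do not seem to be speed cameras or other automated enforcement, though; it is all police officers with handheld speed guns.
  3. When I first arrived, Americans struck me as aggressive drivers: they drive very fast even on narrow roads, pull away hard at intersections, and brake sharply. On US 101 I was doing over 60 mph in the outermost lane, and cars on my left were still whizzing past. My European colleagues, on the other hand, find US speed limits low and feel they can't really open up, haha. After a while I got used to it.
  4. Stop sign: when you reach one, you must come to a stop first. At an intersection without traffic lights, cars from each direction take turns going through.
  5. If you didn't buy fuel when you rented the car, fill the tank before returning it. It doesn't have to be exact; the gauge just needs to read full (8/8). Filling up and then driving another twenty miles or so should be fine. Some gas stations ask for the ZIP code of the credit card's billing address, so your card may not work there. Gas stations usually have a small shop; you can go in and tell the clerk you want to pay cash, for example by leaving a 50-dollar deposit, and he will give you the change.
  6. US freeways get quite congested during the commute rush as well, with traffic moving at only about 20 mph. My impression is that there are a lot of cars between 8 and 9 a.m. Plan your departure time accordingly.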