SunPinyin Code Walkthrough (Part 1)

First, check out the source code of the input method project:
$ hg clone ssh://anon@hg.opensolaris.org/hg/nv-g11n/inputmethod

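Alternatively, download the latest repository snapshot from the input-method project on www.opensolaris.org. (The snapshot does not include the larger data files under ime/data; those have to be downloaded separately.)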

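The inputmethod/sunpinyin directory holds the SunPinyin code, which has two parts: the slm directory contains the statistical language model code (slm: statistical language model), and the ime directory contains the input-method interfaces (ime: input method engine).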

The code in slm builds a static n-gram statistical language model that supports back-off.
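
As a reminder of what back-off means for a trigram model, here is the usual Katz-style formulation. This is an assumption for illustration only; the text does not say which discounting scheme the slm tools actually use.

```latex
P(w_3 \mid w_1 w_2) =
\begin{cases}
P^{*}(w_3 \mid w_1 w_2) & \text{if } C(w_1 w_2 w_3) > 0 \\
\alpha(w_1 w_2)\, P(w_3 \mid w_2) & \text{otherwise}
\end{cases}
```

Here C is the count observed in the corpus, P* is the discounted trigram probability, and α(w1 w2) is the back-off weight that keeps the distribution normalized. The bigram term P(w3 | w2) backs off to the unigram in the same way.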

First, run the following steps to build the tools:

$ ./autogen.sh --prefix=/usr
$ make

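build: this directory holds the compiled tools (and their object files), such as ids2ngram and slmbuild. The Makefile.am in this directory shows, step by step, how the lexicon and the binary language model files are built.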

$ export LC_ALL=zh_CN.UTF-8
Because the corpus and the word list are both UTF-8 encoded, set the current locale to any UTF-8 locale before continuing with the following steps.

$ make test_corpus (or real_corpus)
This step creates a symbolic link named corpus.utf8 in the raw directory. You can just as well create the link by hand.

$ make trigram
This step builds a trigram language model.

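  1. First, mmseg (forward maximum matching segmentation) segments the corpus according to the word list and writes the sequence of word IDs to a file (../swap/lm_sc.ids).
  2. Then ids2ngram counts the occurrences of every trigram and writes the counts to a file. For example, for the sentence "ABCEBCA。" the trigrams are: (<S> A B), (A B C), (B C E) ... (B C A), (C A '。'), (A '。' </S>). <S> and </S> mark the beginning and the end of the sentence.
  3. slmbuild constructs a back-off trigram language model, which is not yet pruned.
  4. slmprune prunes the trigram model just produced, using an entropy-based pruning algorithm.
  5. slmthread threads the pruned model to speed up back-off lookups, and writes the final result to ../data/lm_sc.t3g.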
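
The trigram extraction in step 2 can be pictured with a small sketch. This is not the ids2ngram source: the function name, the Trigram type, and the sentinel IDs for <S> and </S> are all hypothetical, chosen only to show how the sliding window and the sentence padding work. The actual counting (a map from trigram to frequency) is left out.

```c
#include <stddef.h>

/* Hypothetical sentinel IDs standing in for <S> and </S>; the real
 * IDs used by ids2ngram are not given in the text. */
enum { SENT_BEGIN = -1, SENT_END = -2 };

typedef struct { int w[3]; } Trigram;

/* Slide a window of three over the padded sequence
 *   <S> ids[0] ... ids[n-1] </S>
 * and store each trigram in out[]. Returns the number of trigrams
 * written, which equals n. out must have room for n entries. */
size_t extract_trigrams(const int *ids, size_t n, Trigram *out)
{
    size_t count = 0;
    /* The padded sequence has n + 2 positions, so there are n windows. */
    for (size_t p = 0; p + 2 < n + 2; ++p) {
        for (size_t k = 0; k < 3; ++k) {
            size_t q = p + k;              /* index into padded sequence */
            if (q == 0)
                out[count].w[k] = SENT_BEGIN;
            else if (q == n + 1)
                out[count].w[k] = SENT_END;
            else
                out[count].w[k] = ids[q - 1];
        }
        ++count;
    }
    return count;
}
```

For the eight-token example sentence above (seven letters plus the full stop), this yields eight trigrams, starting with (<S> A B) and ending with (A '。' </S>).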

$ make bs_trigram
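This target is similar to the previous one, except that it uses slmseg for segmentation. slmseg re-segments the corpus with the help of the language model just built from the forward-maximum-matching segmentation, which further improves the accuracy of the language model.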
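
For contrast, the forward maximum matching used in the first pass can be sketched as a greedy loop. This is a minimal illustration, not mmseg's actual code: the function names are hypothetical, the dictionary is a plain array searched linearly, and single bytes stand in for characters (real UTF-8 Chinese text would need multi-byte handling).

```c
#include <stddef.h>
#include <string.h>

/* Return the length of the longest dictionary word that is a prefix of
 * text, or 1 if none matches (an unknown character becomes its own token). */
static size_t longest_match(const char *text, const char *const *dict,
                            size_t ndict)
{
    size_t best = 1;
    for (size_t i = 0; i < ndict; ++i) {
        size_t len = strlen(dict[i]);
        if (len > best && strncmp(text, dict[i], len) == 0)
            best = len;
    }
    return best;
}

/* Greedy forward maximum matching: repeatedly cut off the longest
 * dictionary word at the current position. Writes the token lengths to
 * lens[] (capacity cap) and returns the number of tokens found. */
size_t fmm_segment(const char *text, const char *const *dict, size_t ndict,
                   size_t *lens, size_t cap)
{
    size_t count = 0;
    while (*text != '\0' && count < cap) {
        size_t len = longest_match(text, dict, ndict);
        lens[count++] = len;
        text += len;
    }
    return count;
}
```

Because the choice at each position is purely greedy, a long match can swallow characters that belong to the next word. Re-segmenting with a language model, as slmseg does, is one way to correct such cases.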

$ make lexicon
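Using the unigram data from the generated language model, this step processes the word list to produce a lexicon file that supports incomplete pinyin.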

We will go into the details of the code later.

sunpinyin-gtk on Mac (Intel)

With a little hacking, the GTK front end of SunPinyin now runs on Mac OS X (Intel).

To build the code, you need to install the X11 package (available on the second Tiger install DVD) and gtk2 (via MacPorts), and make a small tweak to configure.ac. You can use the language model and dictionary files built for Linux as they are.

Hopefully, we will eventually have a full port to the Mac. There is a sample on developer.apple.com about writing a basic input method service. Is anyone interested?

Other references:

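1. How to write an input method on Mac OS X today, Part 1 (现在如何在 Mac OS X 中写一套输入法 (一))
2. How do you write an input method on Mac OS X? Parts 1, 2 and 3 (怎樣在 Mac OS X 下寫一套輸入法?(一)(二)(三))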

Adding a new session in dtlogin

Pradhap posted an entry about adding a new session in dtlogin. I would just add a few notes:

  1. You need to copy Xresources.<new> to every locale (or at least to the locale you use daily). For example, if your default login language (set in /etc/default/init) or your preferred locale is en_US.UTF-8, copy the file to /usr/dt/config/en_US.UTF-8/Xresources.d (I have to say, this is really bad :( ).
  2. The Dtlogin*altDtKey resource in Xresources.<new> must point to a valid executable file. Otherwise, your new session will not appear in the dtlogin session list.
  3. Xinitrc.<new> is not mandatory; you can start your session directly in Xsession.<new>.
  4. The new X session does not run the initial session scripts in /usr/dt/bin/Xsession.d. For a session that does, see Xsession.jds: it sets SDT_ALT_SESSION to Xsession2.jds and executes /usr/dt/bin/Xsession, which first runs these scripts and then starts $SDT_ALT_SESSION.

Configure X-Chat Aqua to notify you on highlight words

Unlike X-Chat on X11, X-Chat Aqua (v0.16.0) does not, by default, notify you when somebody talks to you or when one of your keywords is mentioned in the chat. To enable notifications, go to [Preferences]->[Chatting]->[Events/Sounds] and set:

Event                      Sound File   G B S
Channel Msg Hilight        any.aiff     --
Private Message            any.aiff     --
Private Message to Dialog  any.aiff     --

Another tip: enable "Tab-key completion" in [Preferences]->[Interface]->[Input box] so that you can use the Tab key to complete nicknames.

Use cscope to index your code

If you use cscope with Emacs, you may have noticed a utility named cscope-indexer in cscope-15.6/contrib/xcscope. I find it less convenient to use than ctags-exuberant, though.

So, I created my own indexer:

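#!/bin/bash
# Build a cscope database from the C/C++/lex/yacc sources in the given folders.

LIST_FILE='cscope.files'
DATABASE_FILE='cscope.out'

if [ $# -eq 0 ]; then
        echo ""
        echo "  cscope-indexer: please specify the folders you want to index!"
        echo "  E.g., cscope-indexer . ../include ../lib"
        echo ""
        exit 1
fi

# Start from a clean file list and database.
rm -f "$LIST_FILE" "$DATABASE_FILE"

for i in "$@"; do
        # Collect regular files and symlinks, keep source files only,
        # skip CVS/RCS folders, and strip a leading "./".
        find "$i" \( -type f -o -type l \) | \
                grep -E -i '\.([chly](xx|pp)*|cc|hh)$' | \
                sed -e '/\/CVS\//d' -e '/\/RCS\//d' -e 's/^\.\///' | \
                sort >> "$LIST_FILE"
done

# -b: build the database only; -i: read the file list; -f: database file name.
cscope -b -i "$LIST_FILE" -f "$DATABASE_FILE"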

Tail-Recursion in C

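In yesterday afternoon's TOI, I gave my colleagues a brief introduction to Erlang, including tail recursion, which is common (and commonly used) in functional programming languages. Take the following code, for example: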

factorial(N) -> fac_i(N, 1).
fac_i(1, P) -> P;
fac_i(N, P) -> fac_i (N-1, N*P).

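This code looks recursive but is in fact an iterative process. I had always assumed that tail recursion was specific to functional programming languages, but yesterday I read online that many C compilers, such as Sun Studio and GCC, can also eliminate tail recursion when optimization is enabled. Consider the following code: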

static int fac_i (int N, int P) {
    return N==1? P: fac_i(N-1, P*N);
}
int factorial (int N) {return fac_i (N, 1);}

Compiling the test C file to assembly with cc -O -S test.c shows clearly that the compiler has unrolled the tail recursion:

.CG0.67:
        cmpl       $4,%esi         ;/ if parameter N is less than 4, jump to .LU0.68 for single-step iteration
        jl         .LU0.68
.LP2.74:
        movl       12(%ebp),%eax
.CG2.14:
        imul       %ecx,%eax       ;/ run 4 iterations back to back, avoiding the cost of jumps where possible
        decl       %ecx
        imul       %ecx,%eax
        decl       %ecx
        imul       %ecx,%eax
        decl       %ecx
        imul       %ecx,%eax
        decl       %ecx
.LU3.71:
        cmpl       $4,%ecx         ;/ if N is greater than 4, jump back and run another 4 iterations
        jg         .CG2.14
.LX2.75:
        movl       %eax,12(%ebp)
.LE2.76:
        cmpl       $1,%ecx         ;/ if N<=1, jump to the end of the function and prepare to return
        jle        .LX0.56
.LU0.68:
        movl       12(%ebp),%eax
.LU4.72:
        imul       %ecx,%eax       ;/ single-step iteration
        decl       %ecx
        cmpl       $1,%ecx         ;/ if 1<N<4, jump back and continue the single-step iteration
        jg         .LU4.72
.LX3.77:
        movl       %eax,12(%ebp)

Compiling with gcc -O2 -S test.c gives a similar result.

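The criterion for tail recursion should be this: the recursive call to the function itself is the last step executed in the function body (that is, it is in tail position). If the call returns a result, that value must not take part in any further computation. Suppose we change fac_i() to: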

static int fac_i (int N, int P) {
    return N==1? P: 1+fac_i(N-1, P*N);
}

The assembly output then contains the following:

.CG3.15:
        subl       $8,%esp
        movl       12(%ebp),%ecx
        imul       %eax,%ecx
        push       %ecx
        decl       %eax
        push       %eax
        call       fac_i           ;/ recursive call to fac_i
        addl       $16,%esp
        incl       %eax

fac_i() has become a genuine recursive call.

Although you can write tail-recursive code in C, it depends on the compiler's ability to optimize, so use it with caution!
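
If you would rather not rely on the optimizer, the transformation it performs on the tail-recursive fac_i() can be written out by hand. The sketch below is an illustration with hypothetical names (fac_loop, factorial_iter): the tail call fac_i(N-1, P*N) becomes an update of both parameters followed by a jump back to the top of the loop.

```c
/* Hand-written loop equivalent of the tail-recursive fac_i().
 * Like the original, it assumes N >= 1. */
static int fac_loop(int N, int P)
{
    while (N != 1) {
        P = P * N;   /* new accumulator, as in fac_i(N-1, P*N) */
        N = N - 1;   /* new counter */
    }
    return P;
}

/* Hypothetical wrapper mirroring factorial() from the text. */
int factorial_iter(int N) { return fac_loop(N, 1); }
```

This version uses constant stack space at every optimization level, whereas the recursive form does so only when the compiler eliminates the tail call.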

iTerm on Mac OS

Are you looking for a terminal emulator for Mac OS that supports tabs?

Try iTerm, an open-source terminal program for Mac OS. What? The arrow keys do not work in vi/vim, and Solaris does not recognize the terminal type? Add the following environment variable to your ~/.bash_profile (or ~/.bashrc):

TERM=ansi