SunPinyin Code Tour (1)

0. Overview

First, let's check out the sources from the inputmethod repository:
$ hg clone ssh://anon@hg.opensolaris.org/hg/nv-g11n/inputmethod

Alternatively, you can navigate to the input-method project on www.opensolaris.org and download the latest repository snapshot. (The snapshot does not include the large binary data files in ime/data; those are available as a separate download.)

SunPinyin's source is located under inputmethod/sunpinyin and has two parts: the sources under the slm directory are the training utilities for the statistical language model; those under ime implement the input method engine, including the core logic and the ports to various input method frameworks.

The code in the slm directory is used to build a back-off n-gram statistical language model.
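To make "back-off" concrete, here is a minimal Python sketch of how a back-off model is usually queried: if the full n-gram is in the table, use its probability; otherwise pay the back-off weight of the history and retry with a shorter history. The dict layout and the name backoff_logprob are my own assumptions for illustration, not the format or API used by the slm utilities.

```python
import math

def backoff_logprob(model, history, word):
    """Return log P(word | history) from a toy back-off n-gram table.

    `model` is a hypothetical dict mapping a tuple of words to a
    (log_prob, log_backoff_weight) pair; it is not the SunPinyin format.
    """
    penalty = 0.0
    history = tuple(history)
    while True:
        entry = model.get(history + (word,))
        if entry is not None:
            return penalty + entry[0]
        if not history:
            return float("-inf")  # the word is unseen even as a unigram
        # Pay the back-off weight of the current history (0.0 if the
        # history itself is unseen), then drop its oldest word.
        penalty += model.get(history, (0.0, 0.0))[1]
        history = history[1:]

# A tiny hand-made model: two unigrams and one bigram.
toy_model = {
    ("a",): (math.log(0.5), math.log(0.4)),
    ("b",): (math.log(0.5), 0.0),
    ("a", "b"): (math.log(0.9), 0.0),
}
```

With this toy table, looking up "b" after "a" hits the bigram directly, while looking up "a" after "a" falls back to the unigram and is scaled by the back-off weight of "a".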

You can build the utilities with the following steps:

$ ./autogen.sh --prefix=/usr
$ make

build: all compiled utilities, along with their object files, are located in this directory, e.g. ids2ngram, slmbuild, etc. Reading the Makefile.am in this directory shows how to build the slm and lexicon files step by step.

$ export LC_ALL=zh_CN.UTF-8
The dictionaries and the raw corpus are UTF-8 encoded, so please set your locale to a UTF-8 locale before proceeding.

$ make test_corpus (or real_corpus)
This target creates a symbolic link to corpus.utf8. You could also create the link manually.

$ make trigram
This target produces a trigram language model, in the following steps:

  1. Use mmseg (forward maximum matching segmentation) to tokenize the raw corpus based on the dictionary, and output the word IDs to a file (../swap/lm_sc.ids).
  2. Use ids2ngram to count the occurrences of every trigram, and output the result to a file. Take the sentence 'ABCEBCA。' as an example: the trigrams we get are (<S> A B), (A B C), (B C E) ... (B C A), (C A 。), (A 。 </S>). (<S> and </S> stand for the start and end of the sentence, respectively.)
  3. Use slmbuild to construct a back-off trigram language model, which is not pruned yet.
  4. Use slmprune to prune the trigram model with an entropy-based algorithm.
  5. Use slmthread to thread the language model, which speeds up locating the back-off states, and save the final output to ../data/lm_sc.t3g.
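The first two steps above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names fmm_segment and count_trigrams are hypothetical, the sketch keeps words as strings instead of word IDs, and the real mmseg and ids2ngram work on files and are implemented differently.

```python
from collections import Counter

def fmm_segment(text, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest lexicon word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when no lexicon word matches.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words

def count_trigrams(sentences):
    """Count trigrams over tokenized sentences, padded with <S> and </S>."""
    counts = Counter()
    for tokens in sentences:
        padded = ["<S>"] + list(tokens) + ["</S>"]
        for j in range(len(padded) - 2):
            counts[tuple(padded[j:j + 3])] += 1
    return counts

# The example sentence from step 2, treating every character as one word.
example_counts = count_trigrams([list("ABCEBCA。")])
```

For the example sentence, example_counts holds eight distinct trigrams, starting with ("<S>", "A", "B") and ending with ("A", "。", "</S>"), each seen once.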

$ make bs_trigram
This target is similar to the one above. The only difference is that it uses slmseg for the segmentation: slmseg uses the language model we just built (from the FMM-segmented corpus) to re-segment the raw corpus, which can improve the accuracy of the resulting language model.

$ make lexicon
This target processes the dictionary file according to the unigram information in the generated language model, and produces a lexicon file (a trie indexed by pinyin characters) that supports incomplete pinyin.
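To give a feel for what a pinyin-indexed trie with incomplete pinyin support can look like, here is a toy Python sketch. It is entirely my own simplification: the class name PinyinTrie is hypothetical, the trie is keyed by whole syllables, and "incomplete pinyin" is modeled as plain prefix matching on each syllable. The real lexicon file is a binary format and is organized differently.

```python
class PinyinTrie:
    """Toy trie keyed by pinyin syllables; NOT the SunPinyin lexicon format."""

    def __init__(self):
        self.children = {}  # syllable -> PinyinTrie
        self.words = []     # words whose pinyin ends at this node

    def insert(self, syllables, word):
        node = self
        for syllable in syllables:
            node = node.children.setdefault(syllable, PinyinTrie())
        node.words.append(word)

    def lookup(self, keys):
        """Return words matching `keys`; each key may be a syllable prefix."""
        nodes = [self]
        for key in keys:
            # Follow every child whose syllable starts with the typed key,
            # so "zh" matches both "zhong" and "zheng".
            nodes = [child
                     for node in nodes
                     for syllable, child in node.children.items()
                     if syllable.startswith(key)]
        return sorted(word for node in nodes for word in node.words)

trie = PinyinTrie()
trie.insert(["zhong", "guo"], "中国")
trie.insert(["zhong", "wen"], "中文")
trie.insert(["zheng", "que"], "正确")
```

Here trie.lookup(["zh", "g"]) follows both "zhong" and "zheng" on the first key and keeps only "guo" on the second, so full syllables and bare initials can be mixed freely in one query.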

We will go through the code in detail later.

ibus: An input method framework based on dbus+python

As I blogged before, dbus+python may be a wonderful technical choice for an input method framework. I discussed the idea with Huang Peng (the author of scim-python), and we both thought it was worth a try. Huang Peng is an expert in dbus and python, and within a short period the project had grown to a remarkable level. That is the ibus project, licensed under LGPLv2.1.

Huang Peng contributed the python binding for the dbus server API to the dbus community, implemented gtk and qt input method modules based on glib-dbus and qt-dbus, implemented the input method bus system with python-dbus, ported the pinyin IME from scim-python, and wrote python bindings for libanthy and libm17n, adding these two IMEs to the ibus platform. Perhaps the only missing part now is an XIM frontend. I only contributed some design ideas :$. ibus borrows many good designs from scim and imbus; it is a project with huge potential, and it deserves to be called "the next generation input method framework".

You can check out the latest code from http://github.com/phuang, then follow the instructions at http://code.google.com/p/ibus/wiki/ReadMe to build the project. BTW, the gnome.asia summit will be held in October in Beijing. We may have a session about input methods, and we will invite active input method developers to a forum. Looking forward to seeing you! :)
