First, check out the sources from the inputmethod repository:
$ hg clone ssh://email@example.com/hg/nv-g11n/inputmethod
Alternatively, navigate to the input-method project on www.opensolaris.org and download the latest repository snapshot. (The snapshot does not include the large binary data files in ime/data; you can download them from here.)
SunPinyin's source is located under inputmethod/sunpinyin and consists of two parts: the slm directory contains the training utilities for the statistical language model, and the ime directory contains the input method engine, including the core logic and the ports to different input method frameworks.
The code in the slm directory builds a back-off n-gram statistical language model.
You can build the utilities with the following steps:
$ ./autogen.sh --prefix=/usr
build: all compiled utilities (along with their object files) are placed in this directory, e.g., ids2ngram, slmbuild, etc. The Makefile.am in this directory shows how the slm and lexicon files are built step by step.
$ export LC_ALL=zh_CN.UTF-8
The dictionaries and the raw corpus are encoded in UTF-8, so set your locale to a UTF-8 locale before proceeding.
$ make test_corpus (or real_corpus)
This target creates a symbolic link to corpus.utf8. You can also create the link manually.
$ make trigram
This target produces a trigram language model. It performs the following steps:
- Use mmseg (forward maximum matching segmentation) to tokenize the raw corpus based on the dictionary, and write the word IDs to a file (../swap/lm_sc.ids).
- Use ids2ngram to count the occurrences of every trigram and write the result to a file. Take the sentence 'ABCEBCA。' as an example: the trigrams are (<S> A B), (A B C), (B C E) ... (B C A), (C A 。), (A 。 </S>), where <S> and </S> mark the start and the end of the sentence, respectively.
- Use slmbuild to construct a back-off trigram language model, which is not pruned yet.
- Use slmprune to prune the trigram model with an entropy-based algorithm.
- Use slmthread to thread the language model, which speeds up locating the back-off states, and save the final output to ../data/lm_sc.t3g.
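The first two steps can be illustrated with a minimal Python sketch. It is not the code of mmseg or ids2ngram: the function names, the `max_len` parameter, and the use of plain strings instead of word IDs are assumptions made for illustration only.

```python
from collections import Counter


def fmm_segment(text, lexicon, max_len=4):
    """Forward maximum matching: at each position, take the longest lexicon word.

    `lexicon` is a set of known words and `max_len` the longest word length
    to try; both are hypothetical stand-ins for the real dictionary.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in lexicon:
                tokens.append(piece)
                i += n
                break
    return tokens


def count_trigrams(tokens):
    """Count the trigrams of one sentence padded with <S> and </S>."""
    padded = ["<S>"] + tokens + ["</S>"]
    return Counter(zip(padded, padded[1:], padded[2:]))


# With an empty lexicon every character becomes its own token,
# which reproduces the 'ABCEBCA。' example from the list above.
tokens = fmm_segment("ABCEBCA。", set())
for trigram, count in count_trigrams(tokens).items():
    print(trigram, count)
```

Because the matching is greedy, a lexicon containing "ab", "abc" and "cd" segments "abcd" as "abc" + "d" rather than "ab" + "cd"; this kind of error is what a model-based re-segmentation pass can correct.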
$ make bs_trigram
This target is similar to the one above. The only difference is that it uses slmseg for segmentation: slmseg uses the language model we just built (from the FMM-segmented corpus) to re-segment the raw corpus, which can improve the accuracy of the language model.
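To show why a model-based pass can segment better than FMM, here is a simplified sketch. It is not slmseg's actual algorithm: it scores candidate segmentations with a toy unigram model rather than the trigram model, and the function name, the `unk_logp` penalty, and the probabilities are all made up for illustration.

```python
import math


def lm_segment(text, unigram_logp, max_len=4, unk_logp=-20.0):
    """Return the segmentation with the highest total log-probability.

    `unigram_logp` maps words to log-probabilities (hypothetical values);
    unknown single characters get the assumed penalty `unk_logp`.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in unigram_logp:
                score = unigram_logp[word]
            elif end - start == 1:
                score = unk_logp  # unknown single character
            else:
                continue  # unknown multi-character strings are not words
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    # Walk the back-pointers from the end to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


toy_model = {
    "ab": math.log(0.2),
    "abc": math.log(0.1),
    "cd": math.log(0.2),
    "d": math.log(0.001),
}
print(lm_segment("abcd", toy_model))
```

With this toy model, the greedy longest match would pick "abc" + "d", while the scored search prefers "ab" + "cd" because both words are far more probable than "d".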
$ make lexicon
This target processes the dictionary file using the unigram information in the generated language model, and produces a lexicon file (a trie indexed by pinyin characters) that supports incomplete pinyin.
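The idea of a pinyin-indexed trie that tolerates incomplete input can be sketched as follows. This is only an illustration, not SunPinyin's lexicon format: the `PinyinTrie` class is hypothetical, it branches on whole syllables rather than single characters, and it assumes that "incomplete pinyin" means each typed syllable may be a prefix of the full one (e.g. "zh" for "zhong").

```python
class PinyinTrie:
    """A toy trie keyed by pinyin syllables (hypothetical, for illustration)."""

    def __init__(self):
        self.children = {}  # syllable -> PinyinTrie
        self.words = []     # words whose pinyin ends at this node

    def insert(self, syllables, word):
        node = self
        for syl in syllables:
            node = node.children.setdefault(syl, PinyinTrie())
        node.words.append(word)

    def lookup(self, partial_syllables):
        """Return words whose syllables each start with the given prefixes."""
        nodes = [self]
        for prefix in partial_syllables:
            # Follow every child whose syllable begins with the typed prefix.
            nodes = [child
                     for node in nodes
                     for syl, child in node.children.items()
                     if syl.startswith(prefix)]
        return [w for node in nodes for w in node.words]


trie = PinyinTrie()
trie.insert(["zhong", "guo"], "中国")
trie.insert(["zhong", "wen"], "中文")
trie.insert(["zi", "ran"], "自然")

print(trie.lookup(["zhong", "guo"]))  # complete pinyin
print(trie.lookup(["zh", "g"]))       # incomplete pinyin
```

In a real lexicon the candidates would also carry their unigram information so that they can be ranked; the sketch leaves that out.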
We will introduce the code in detail later.