SunPinyin Code Tour (3)

2. Counting Trigrams

After we tokenized the corpus, the task is to count the occurrence numbers of each trigram:

$ make m3_idngram
./ids2ngram -n 3 -s ../swap/swap -o ../swap/lm_sc.id3gram -p 5000000 ../swap/lm_sc.ids

ids2ngram.cpp

ProcessingRead()

Read the ID stream file generated in previous step. Read N-1 (in trigram case, it's 2) IDs firstly, save to the ids array (a member inngram object). Then read another ID, save it as the third element (.e., ids[N-1]) in ids array. Invoke the operator [] of std::map, by using array ids[0..2] as the key, retrieve the occurrence number for this trigram, (if never seen in the map, operator [] will insert the <key,0> pair to map), increase the number. Shift the ids array to left by one cell, continue the above processing.

In the iterations, we also monitor if the size of map reaches the maximum number (paraMax). It's to prevent the map from occupying too much memory space. If reaches the maximum number, output the map to a swap file, record the offset for this paragraph, then clear the map, continue the counting. Note, the std::map is an ordered container, the inner structure is a balanced-ordered-binary-tree (usually is a red-black tree). We then need perform a merge sort for all paragraphs.

ProcessingIdngramMerge()

Merging sort. To perform a merging sort for several ordered data sources, is to find the minimal (or maximum) one in the head elements and output it, then the head of the affected data source moves to its next element, repeat the above processing, till the set of head elements is empty. If we have many data sources to be sorted, sorting the head elements becomes a critical step for performance. InSunPinyin, there is a template class to deal the multiple-way merging, i.e., slm/sim_fmerge.h. It uses the heap sorting algorithm to sort the head elements. getBest() returns the paragraph who has the minimal head element, next() adds the next element in this paragraph to heap.

Finally, we got a raw trigram model without any smoothing, i.e., all trigrams with their occurrence numbers.

3. Build the back-off based n-gram language model

Next step is to build a back-off based n-gram model:

$ make m3_slm
./slmbuild -n 3 -o ../swap/lm_sc.3gram  -w 120000 -c 0,2,2 -d ABS,0.0005 -d ABS -d ABS,0.6 -b 10 -e 9 ../swap/lm_sc.id3gram

Firstly, let's have a look at the definition of n-gram.

For the statistical based NLP (natural language processing), the probability of a sentence (S=W1,W2,...Wn), according to the chain rule, is:

P(S) = P(W1).P(W2|W1).P(W3|W1,W2).P(W4|W1,W2,W3)...P(Wn|W1,W2,...Wn-1)
     = P(W1).prod_i^n (P(Wi|W1,W2,...Wi-1))

To express mathematics equations in text mode, we use Latex-like language here. To express the above equation with Latex, it's P(S) = P(W_1)\prod_i^nP(W_i|W_1,W_2,...W_{i-1}), the visualizing form is:

While in reality, due to the data sparseness, it's impossible to calculate the probability in such a way. A particle method is, to assume the P(Wi|W1,W2,...Wi-1) only depends on the previous N words, i.e., Wi-N+1,Wi-N+2,...Wi-1. In particular, we have unigram (N=0,context-free grammar), bigram (N=1), trigram (N=2), and fourgram (N=3). The most commonly used is trigram.

Let's have a look at several useful terms:

  • types: the size of vocabulary (or dictionary)
  • tokens: the size of corpus
  • vector/solution space: for a N-gram model, V = types^N
  • data sparseness: tokens << V, tokens is much smaller than V
  • Maximum likelihood: Pml(Wi|Wi-2,Wi-1) = C(Wi-2,Wi-1,Wi)/C(Wi-2,Wi-1), C stands for the occurrence numbers. This means the maximum likelihood estimation of P(Wi|Wi-2,Wi-1) equals to the ratio of occurrence numbers of (Wi-2,Wi-1,Wi) and (Wi-2,Wi-1).

Many possible word sequences, may not be collected in the training corpus. If Pml(Wi|h) is 0, then the probability of entire sentence becomes 0. E.g., if we only see "Tom read a book" in corpus, while we don't see anything like "Yong read xxx", then Pml(Yong read a book) = 0. So, we need to smooth the model. Smoothing is to allocate some probabilities from known events to unknown events (i.e., the events whose occurrence number is 0).

Some smoothing methodologies:

  • Simple Smoothing: add-one (or add delta), poor performance if used alone
  • Discounting smoothing: E.g., Absolute Discounting, liner smoothing, Witten-Bell, Good-Turing
  • Composite smoothing: back-off smoothing, and interpolated smoothing

A general back-off model could be expressed as following:

Ps(Wi|h) = Pd(Wi|h)         -- C(h,Wi) > 0
           bow(h).Ps(Wi|h') -- C(h,Wi) == 0

  • h' is the history h truncated by the first word. For trigram (A,B,Wi), h is (A,B), so h' is B.
  • Pd(Wi|h) < Pml(Wi|h), to discount the ML estimation, many discounting methods could be used
  • bow(h) is the back-off weight, to a given h, bow(h) is a const number, and could be determined by Ps(W1|h)+Ps(W2|h)+...+Ps(Wn|h) = sum_i (Ps(Wi|h)) = 1
  • this is a recursive expression, if the occurrence number of (h',Wi) is 0, then back-off continues, it may back-off to Wi, even to a average distribution (if Wi is not seen).
  • if h is not seen in training stage, P(Wi|h) = P(Wi|h')

Katz's back-off model used Good-Turing. Kneser-Ney is another back-off model, whose performance is better than Katz's model. Stanley Chen and Joshua Goodman once wrote a thesis, discussed the pros and cons of different smoothing methods, you could refer to it for more details.

Let's look at the code.

sim_slmbuilder.[h|cpp]:

CSlmBuilder::Create(n)

To initialize the level array according to parameter n, level[0] is pseudo root, used for average distribution. level[1] is forunigram,level[2] is for bigram , and level[3] is for trigram... For trigram, the nodes on level[3] are leaf nodes. The leaf nodes do not have bow information, since they belong to circumstances that C(h,Wi)>0.

CSlmBuilder::AddNGram(ngram, fr)

Call isExcludeId() to check if the first word of ngram is an excluded word (in this case it's 9, the AMBI-ID). For every vector in array level[1..n], check if its reserved space is used up, if true, allocate more memory for it. If the 1st word is not a excluded word, add its occurrence number to level[0][0] (i.e., pseudo root). Add each word id inngram to level[i] (0<i<=n), and accumulate the occurrence number in parent node. For level[n] (i.e., the leaf level), only the ngram whose occurrence number is bigger than the specified cut number, would be added; it's to avoid the leaf node numbers are too high (maybe tens of millions), so that we could not save them in memory.

In this function, we check, if ngram[i] (0<i<=n-1) is an excluded word, then only the ngram[0..i-1] would be counted; if ngram[i] (0<=i<=n-1) is a sentence stop (as we specified, it's 10), then only ngram[0..i] is counted. E.g., for trigram (9, x, y), it's ignored directly; for (x, 9, y), only unigram (x) is counted; for (10, x, y), unigram (10) is counted; for (x, 10, y), unigram (x) and bigram (x, 10) are counted.

CSlmBuilder::Build()

The entry point to build the back-off based n-gram model.

CSlmBuilder::CountNr()

Counting the total occurrence numbers of ngrams whose occurence number is less than SLM_MAX_R (i.e., 16). E.g., nr[3][0] is the tatal number of trigrams, nr[3][10] is the total occurrence numbers for the trigrams which occurs 10 times (can only be times of 10, e.g., 500). These data would be used when initializing discounters.

CSlmBuilder::AppendTails()

To add a tail node for each level, just for convenience in iterating.

CSlmBuilder::Cut()

Delete the ngrams whose occurrence number is less than specified threshold. The thresholds we specified is (-c 0,2,2), which means we will ignore thebigrams and trigrams whose occurrence number is less than 2, but do not cut any unigrams.

CSlmBuilder::Discount()

Initialize the discounter for each level. Call DiscountOneLevel() to perform discounting for each level, from higher to lower. And set the level[0] (pseudo root) as average distribution, its probability is 1 divided by the number of types (specified by -w flag, in this example is 120000).

CSlmBuilder::DiscountOneLevel(v, ch, disc ...)

v is the previous level, e.g., if we are discounting level[3], then the 1st argument is level[2]. To discount level[m], iterate every node in level[m-1], then discount its child nodes ([it->child,itnext ->child)). The number subtracted here are frequencies, not probabilities. Then divided by the frequency in parent node, get the conditional probability, save it in the original node.

CSlmBuilder::CalcBOW()

Calculate back-off weights, from higher to lower level. base[i] refers to the 1st element in level[i], idx[i] is the cursor in level[i], so (base[i])[idx[i]) is the visiting node. (Here we rely on the fact that the memory space allocated in std::vector is continuous, we could use pointer or array to access the elements.)

Take calculating BOWs on level[2] as an example. Then, lvl is 2, and we are going to iterate every word in base[2]. At first, try to find the parent node for this word with a for loop, .e.g,idx [2] refers to the level[2][1] (C) in above diagram, then its parent is level[1][0] (A), put them in the word[0..4] array, as a result it's {0, A, C, .}. Then callCalcNodeBow () to get the BOW for this node, the last two parameters in this function, is the range of children nodes [ch+node.child, ch+nodenext.child).

CalcNodeBow(...)

Iterate the node in [chh, cht), accumulate every probability to variable sumnext; and call builder->getPr(lvl, words+2) to get a probability, accumulate to variable sum. The actual effect of words+2, is to truncate the 1st word in history. Use above example, on the 1st iteration, words[0.4]={0, A, C, D}, so words+2 is {C, D}, the probability returned bygetPr() is P s(D|C) (Note, getPr() itself is a recursive function, when lvl is 0, return the average distribution probability). BOW is then calculated as (1.0-sumnext)/(1.0-sum).

As we explained, BOW is determined by equation sum_i (Ps(Wi|h)) = 1. From the following transformations, you could see the meanings of sumnext and sum.

SunPinyin Code Tour (2)

1. Dictionary and Word Segmentation

As we described previously, Sunpinyin uses FMM (forward maximum matching) for the 1 pass word segmentation. The dictionary are loaded to a trie structure, the key is the Chinese character in UCS-4. Some nodes come into valid words, e.g., on the most left path in following diagram, "中国" is a valid word; while on the path of "中华人民共和国", the node "共" could not stands for a valid word. (This structure is similar with the one in reference 3.)

We could see, all child nodes of root are valid words, since every character could be considered as "single character word".

sim_dict.[h|cpp]:

parseText ()   : load the dictionary in text file, and construct a trie.
matchLongest (): to do the longest match, and return the length.
step ()        : by given a character, transfer to a child node, and return it.

mmseg/mmseg.cpp:

In function processSingleFile(), read the corpus sentence by sentence. For example, "为人民办实事的精神", and then call SIMDict::matchLongest() to do the longest match.

Because there is the entry "为人民" in dictionary,so the 1st segmented word we got is "为人民",its length is 3, then call getAmbiLen() to see if there are crossing-ambiguities, and return the maximum length of crossing-ambiguities. The following is the result for each iteration:

  1. 人民, 办实事的精神 -> i=1, len=2, word_len=3, do FMM for "人民办实事", get the matched word "人民", its length is 2.
  2. 民办, 实事的精神 -> i=2, len=2, word_len=4, do FMM for "民办实事",get the matched word "民办",its length is 2. Since the i+len is longer than the original parameter word_len, set the word_len to i+len (ie. 4), continue the iteration. You could see, we have detect the ambiguity.
  3. 办实事, 的精神 -> i=3, len=3, word_len=6
  4. 实事, 的精神 -> i=4, len=2, word_len=6
  5. 事, 的精神 -> i=5, len=1, word_len=6, the ambiguity length is now determined as 6.
  6. break the loop, and return the ambiguity length, ie. 6

If an ambiguous-id is specified (by the option -a), these 6 characters would be ignored, and output the specified AMBI-ID to the result, then continue the processing. When reach the end of a sentence, if we require the binary output format, a sentence ending ID (specified by option -s) would be output. Here shows a result in text format:

$ echo "为人民办实事的精神" | ./mmseg -d ../raw/dict.utf8 -f text -s 10  -a 9
<ambi>为人民办实事</ambi> 的 精神

So, in our segmentation result, all crossing-ambiguities are marked as single one word ID (AMBI-ID). That means we ignored that part of information (the ratio is considerable). Thus in our later processing, most of the trigrams are meaningful and valuable. The language model we got, could avoid the cross-ambiguities' impact somehow. Then, useslmseg on this language model, to re-segment the raw corpus and get the new language model. This time, the original ambigious information is also leveraged.

$ echo "为人民办实事的精神" | ./slmseg -d ../raw/dict.utf8 -f text -s 10 -m ../data/lm_sc.t3g
为人民 办实事 的 精神

Note, dueto the copyright of raw corpus, we may not be able to open it, however, to use slmseg, you could create a symbol link to ime/data/lm_sc.t3g.<arch> in slm/data directory.

References:

  1. 数学之美 系列二 -- 谈谈中文分词
  2. 中文搜索引擎技术揭密:中文分词
  3. “天堂的阶梯”设计的中文分词算法

SunPinyin Code Tour (1)

0. Overview

At first, let's check out the sources from inputmethod repository,
$ hg clone ssh://anon@hg.opensolaris.org/hg/nv-g11n/inputmethod

Or you could navigate to the input-method project on www.opensolaris.org, and download the latest repository snapshot. (While it does not include the large binary data files in ime/data, you could download them from here.)

SunPinyin's source is located under inputmethod/sunpinyin, which contains two parts: the sources under slm directory are the training utilities for statistical language model; the ones under ime are related to input method engine usage, including the core logic and various portings for different input method frameworks.

The code in slm directory is used to build a back-off based n-gram statistical language model,

You could build the utilities as following steps:

$ ./autogen.sh --prefix=/usr
$ make

build: all compiled utilities (as their object files) are located in this directory, e.g., ids2ngram, slmbuild etc. By checking the Makefile.am in this folder, you could know how to build the slm and lexicon files step by step.

$ export LC_ALL=zh_CN.UTF-8
Since the dictionaries and raw corpus are in UTF-8 encoding, so please set your locale to any UTF-8 locale before we proceed.

$ make test_corpus (or real_corpus)
This target is to create a symbol link to corpus.utf8. You could manually create it.

$ make trigram
This target is to produce a trigram language model.

  1. Use mmseg (the forward maximum matching segmenting), to tokenize the raw corpus basing the dictionary, and output the word IDs to a file (../swap/lm_sc.ids).
  2. Use ids2ngram to count all occurrence numbers for every trigram, and output the result to a file. Take the following sentence as an example, 'ABCEBCA。', the trigrams we could get are (<S> A B), (A B C), (B C E) ... ( B C A), (C A 。), (A '。' </S>). (<S> and </S> stand for start and end of sentence separately.)
  3. Use slmbuild to construct a back-off trigram language model, but not pruned yet.
  4. Use slmprune to prune the trigram model, with an Entropy-based algorithm.
  5. Use slmthread to thread the language model, to speed up the locating of back-off states. And save the final output to ../data/lm_sc.t3g。

$ make bs_trigram
This target is similar with the above one, the only difference is it use slmseg to do the segmentation, it utilizes the language model we just got (segmented by FMM), to re-segment the raw corpus. That could improve the accuracy of language model.

$ make lexicon
According to the unigram information in the generated language model, process the dictionary file, and get a lexicon file (a trie indexed by pinyin characters) that supports incomplete pinyin.

We will introduce the code in details later.

ibus: An input method framework basing on dbus+python

As I blog'd before, dbus+python maybe a wonderful technical choice for input method framework. I discussed with Huang Peng (the author of scim-python), we both thought it's worth to have a try. Huang Peng is an expert in dbus and python, after a short period, the project has grown to a remarkable level. That's the ibus project, licensed in LGPLv2.1.

Huang Peng contributed the dbus server API python binding to dbus community, and implemented gtk and qt input method modules basing on glib-dbus and qt-dbus, then implemented the input method bus system with python-dbus, and ported the pinyin IME from scim-python, added python binding for libanthy and libm17n, then added this two IMEs to ibus platform. Maybe the only missing part now, is an XIM frontend. I only contributed some design ideas :$. ibus leveraged many good designs from scim and imbus, it's a project which has huge potentials, and it's deserved to be named as "the next generation input method framework".

You could check out the latest code from http://github.com/phuang, then follow the instructions on http://code.google.com/p/ibus/wiki/ReadMe to build the project. BTW, the gnome.asia submit would be held in Oct. in Beijing, we may have a session about input methods, and we would invite the active input method developers to have a forum, looking forward to see you! :)

ibus: An input method framework basing on dbus+python

之前提到过,dbus+python可能是实现输入法框架一个很好的技术选择。和scim-python的作者Huang Peng也交流了这个想法,大家都觉得值得一试。Huang Peng兄对dbus和python都有深入的掌握,开始动手实现不久,就已经颇具规模。这就是现在的ibus项目,采用的开放协议为LGPLv2.1。

Huang Peng为dbus社区贡献了dbus server API的python binding,基于glib-dbus和qt-dbus实现了gtk和qt的input method module,用python-dbus实现了输入法BUS平台,将scim-python中的pinyin输入法移植过来,编写了anthy和m17n的python binding、并将这两个输入法加入到ibus平台中。目前所缺的也许只有一个XIM的前端了。而我只是偶尔提供一些意见以供参详,惭愧惭愧啊。ibus借鉴了许多scim和imbus的设计思想,是一个非常有潜力的开源项目。称之为“next gernation input method framework”也毫不过分。

你可以从http://github.com/phuang下载最新的源代码,再按照http://code.google.com/p/ibus/wiki/ReadMe的指示来build这个项目。另外,今年10月的gnome.asia峰会将在北京召开,到时候可能会有一个关于输入法的session,我们邀请了许多活跃在输入法开发社区的开发者和大家进行交流,希望大家到时候踊跃参加哦!:)

Exchange changesets of mercurial between repositories/branches

There are several ways to exchange changesets of mercurial between repositories (branches).

  1. hg bundle/unbundle
  2. Just like an offline hg pull/update, which means you could not apply the changesets from another repository if they do not share the same changesets.

  3. hg export/import
  4. hg export -o changesets/patch%r REV1:REV2. It could work for the repositories that does not share the same changesets (repositories may come from different ancestors), but it does not work so well for merged revisions (the changesets have multiple parents). Even with --switch-parent option, you still need to manually edit the patch to eliminate the additional parent declaration in comments.

  5. hg transplant extension
  6. So far the best solution, though I still met a problem when transplanting a specific merged revision (for other merged revisions, it worked smoothly). So I use hg cat -r REV to output the relevant files, copy them to the new repository and commit. Then skip the specific revision.

The only 64bits GtkApp on Solairs

I tried to test the scim/scim-bridge gtk-im-modules for 64bits applications, however, the gnome applications on Solaris are 32bits, except this one,

/usr/demo/bin/{amd64,sparcv9}/gtkdemo.

Unlike Linux, which has different distributions for 32bits and 64bits (for both kernel and userland), Solaris ships them together (the default loaded kernel is 64bits).

D700真的来了

我等N粉终于盼到这一天了,不过刚发布价格还比较高,估计大量铺货后能降1成。当然如果能降2成就比较完美了。希望C家的“无敌兔”赶紧发布,多给Nikon些压力,呵呵 ... 为那些近期入手了D300的人感到惋惜,再忍一忍就好了。同时发布的还有SB900和两支移轴的镜头。