An SMF service for querying/updating gtk-im-modules

Just found that from opensolaris_b93, an SMF service for querying/updating gtk-im-modules is available, svc:/application/desktop-cache/input-method-cache:default. This service is online by default, and would be executed on every booting. However,
it's not available on Nevada yet.


刚刚发现,从opensolaris_b93开始,已经含有一个用来查询/更新gtk输入法模块的SMF服务了,svc:/application/desktop-cache/input-method-cache:default。这个服务缺省就是打开的,并且在每次启动时都会被执行。不过,在Nevada中还没有集成这个服务。

The latest status of ibus

Just got the existing news from Huang Peng, ibus platform is approaching to be finished. Huang added XIM server agent recently, and also implemented ibus-chewing, ibus-hangul. Really impressive and rapid progress!


刚刚从Huang Peng那里听到一个好消息,ibus已经接近完成了。Huang Peng最近刚刚为ibus添加了XIM服务器的agent模块,并实现了ibus-chewing和ibus-hangul。真是进展神速啊!

Using Gnu-iconv on Solaris and OpenSolaris

Gnu-iconv supports more encoding conversions and provides better performances for some conversions over the Solaris-iconv. E.g., currently, Solaris-iconv does not support the conversions between GB18030 and UCS-2BE, UCS-4LE/BE, UTF-16LE/BE; and the conversions of GB18030<->UTF-8 (UCS-2LE) in Gnu-iconv is two times faster. And Gnu-iconv has a star-shaped structure with some exceptions, which uses UCS-4 as the intermediary encoding. While Solaris-iconv has a peer-2-peer structure (with alias), it's really painful to add a new encoding.

So, you may want to use Gnu-iconv library. For Solaris 10/Nevada, you could download&install gnu-libiconv from www.sunfreeware.com,  for opensolaris, you could install SUNWgnu-libiconv from pkg.opensolaris.org,  but the OS.o package does not contain the header files.

You may notice that, the function symboles in gnu-libiconv, had been added the prefix of "lib", e.g., iconv_open -> libiconv_open. So, LD_PRELOAD and RUNPATH are not sufficient for replacing iconv(3) routines in libc. You need to make sure to include the "iconv.h" from gnu-libiconv.


和Solaris的iconv相比较,Gnu-iconv支持更多的编码转换,并且在某些编码转换上有更好的性能。例如,目前Solaris-iconv不支持从GB18030到UCS-2BE、UCS-4LE/BE和UTF-16LE/BE之间的转换;而GB18030<->UTF-8 (UCS-2LE)在Gnu-iconv中的转换速度,是Solaris-iconv的两倍。并且Gnu-iconv是一种星型结构(也有某些点到点的例外情况),它使用UCS-4作为中间转换的介质。而Solaris-iconv是一种点到点的结构(支持别名),因此添加一个新的编码实在是有些痛苦。

因此,你可能希望使用Gnu-iconv程序库。对Solaris 10/Nevada来说,你可以从www.sunfreeware.com下载并安装gnu-libiconv,对opensolaris你可以用pkg(1)从pkg.opensolaris.org上安装SUNWgnu-libiconv的程序包,不过这个包没有包括头文件。

你可能已经注意到了,gnu-libiconv中的符号名,都被加上了"lib"的前缀,例如iconv_open->libiconv_open。因此LD_PRELOAD和RUNPATH并不能替换libc中的iconv(3)调用。你必须确保include gnu-libiconv中的"iconv.h"头文件。

C++ exceptions and shared objects

In many application systems, shared objects are loaded as sub-modules via dlopen(3C). In case it's written by C++, you need to be careful about throwing exceptions. The client may try unload this module (by dlclose(3C)) in exception handling block, so that the application may crash when C++ runtime routines try to clean the exception (by calling its destructor) before leaving current catching function. In practice, exception class is usually defined in header file, so that its code body maybe included both in app and sub-module. Even in this case, linker or loader may choose the copy in sub-module. (E.g., the -Bdirect flag of ld(1)).

The conclusion is, do not unload a shared object when catching an exception thrown from it, or not to throw exceptions from a shared object. This is the root cause that SCIM x11 frontend would crash for the 2nd launching.

SunPinyin Code Tour (6)

6. Using the generated language model

In previous step, we generated the threaded language model. The next question is how to use this model? Let's look at the header file of CThreadSlm class, slm.h, here are the public methods:

  1. load(): Load the language model to memory
  2. free(): Release the memory allocated for the language model
  3. isUseLogPr(): If the language model use log values for probabilities
  4. transfer(history, wid, result): By given a word id wid, transfer from history state (history) to new state (result), and return P(wid|history).
  5. transferNegLog(history, wid, result): Similar with above method, but returns -log(p(wid|history).
  6. history_state_of(st): Get the h' of st, and return.
  7. historify(st): Set st to its h'.
  8. lastWordId(st): Return the last word id of state st. Assuming st is (lvl, idx), if lvl>=N, then return m_Levels[N][idx]; if 0<lvl<N, return m_level[lvl][idx]; if level==0 and idx==0, return the word id on pseudo root (i.e., 0), if idx>0 just return idx (we will discuss this situation when introducing history cache.

Now, let's look at slmseg to see how to searching by leveraging CThreadSlm class.

$ make slmids3
./slmseg -d ../raw/dict.utf8 -f bin -s 10 -m ../data/lm_sc.t3g ../raw/corpus.utf8 >../swap/lm_sc.ids

The source code of slmseg/slmseg.cpp:

struct TLatticeWord {
  int m_left;
  int m_right;   //[m_left, m_right) is the index range on lattice for this word
  int m_wordId;  //word id
};

struct TLatticeStateValue{
  double             m_pr;       //the probability on this state node
  TLatticeWord*      mp_btword;  //the back-trace word, i.e., the given word causing the transformation
  CThreadSlm::TState m_btstate;  //the back-trace state, i.e., where the current state comes from
};

/* The mapping of SLM state node, and its Lattice state information */
typedef std::map<CThreadSlm::TState, TLatticeStateValue> TLatticeColumnStates;

/* represent a column of lattice */
struct TLatticeColumn {
  TLatticeWordVec m_wordstarting;  //
The words starting from this column, these are actually the input words when expending this column,
                                   //the newly transferred state node, located at lattice[word.m_right].
  TLatticeColumnStates m_states;   //the SLM state nodes and their corresponding Lattice states
};

processSingleFile():

Read the corpus sentence by sentence. For each sentence, call buildLattice() to build the searching lattice, and call searchBest() to search on the lattice, and get the best segmentation by getBestPath(), and call output() to write out the best segmentation.

buildLattice(sentence, lattice):

Set the SLM state on lattice[0] as pseudo root. Initialize i as 0, perform FMM for sntnc[i..n], get the word length -- len. And call getAmbiLen() to get the length of maximum length of crossing-ambiguities -- ambilen. If ambilen <= len, then there is no crossing-ambiguities, then call insertLatticeWord() to insert ([i, i+len), Wid) to the word list of lattice[i]; set i=i+len, continue the loop. If there is crossing-ambiguities, call fullSegbuildLattice() to perform a full segmentation for sub-sentence sntnc[i..i+ambilen), for all segmented words (including all single-character word), assuming one's location is [left, right), insert it to lattice[left]; and set i=i+ambilen, continue the loop. After the iteration is done, add an end-of-sentence id (we use "。" for this id in following examples) at the end.

Use the example in section 2,

Perform the FMM for this sample sentence, the first word we get is "发扬", and no ambiguity, so insert "发扬" to lattice[0]. Then i=2, continue the loop. The FMM result is "为人民", but the ambilen is 6, so we need perform full segmentation for sntnc[2..7]. Insert all possible segments on idx (2<=idx<7) to latttice[idx]. Then i=2+6=8, continue the loop. The FMM result is "的", and no ambiguity. Then i=9, the FMM result is "精神" and no ambiguity. After the loop is finished, add an end-of-sentence ID at the tail. We could see, there is no word on lattice[1], and the words on lattice[2] are "为", "为人", and "为人民". And so on...

searchBest(lattice):

Iterate the lattice from 0. For all lattice state nodes on lattcie[i], use the words starting from this lattice ([left, right), Wid) to expand, save the newly transferred state node h' (CThreadSlm::historify(his)) to lattice[word.m_right].m_states. Note, given a word D, two different state nodes (A, C) and (B, C) may result in the same h' (i.e., (C, D)), so we only need to keep the path who has bigger probability.

Let's start from lattice[0], we have set it as pseudo root (0, 0) in buildLatice() method. There is only one word on lattice[0], "发扬"; call CThreadSlm::trnasferNegLog((0,0), wid_of("发扬")), get the new SLM state node and the probability P("发扬"), save its h' (lvl=1, idx) and its lattice state information (probability and back-trace information) to lattice[2]. Then look at lattice[1], there is no state node to be expanded. Then going forward to lattice[2], there is one state node, and three words ("为", "为人", "为人民"), save the newly transferred three nodes to lattice[3], lattice[4] and lattice[5]. And so on...

This is a dynamic programming problem, and what we used here is Viterbi Lattice algorithm.

getBestPath(lattice, segResult):

From end to start, back-trace the lattice, save the word id on best path to segResult. By reversing segResult, we got the best segmentation. For our example, it's the blue path on above figure.

完美中的小小遗憾

总体来说,北京奥运会的开幕式很完美,一开始有些拙劣的山水画作,在中途由孩子们添上了色彩和活力(特别是太阳变笑脸的细节),再最后由各国运动员的脚踩踏出绚烂的彩虹,的确是很好的创意。主题曲节奏悠缓,但是给人的印象不深,未能使人过耳不忘。中国代表团入场的时候,我一直期待姚明会将四川的那个小朋友抱起来放到肩膀上,不过未能看到。另外关于主火炬点燃,我本来猜想会是由海峡两岸三地的中国运动员共同点燃,那样也许更有意义且震撼人心。李宁空中跑动的姿态很潇洒,只是最后点燃的当口不知什么原因停顿了一小段时间。

中国,加油!北京,加油!!

SunPinyin Code Tour (5)

5. Threading the language model

Let's look at the last step, threading (or adding back-off
indices) to the pruned language model.

$ make m3_thread
./slmthread ../swap/lm_sc.3gm ../data/lm_sc.t3g

The goal of threading is to accelerate the searching in language model. Take trigram as an example, if we want to calculate the
probability of sentence "ABCDE。", P(S) = P(A).P(B|A).P(C|AB).P(D|BC).P(E|CD).P(<EOS>|DE), and assume
no back-off needed when evaluating all these conditional probabilities.

First, we get P(A) from level[1] (unigram) via a binary search; and find B from A's children via a binary search, to get P(B|A); and then find C from B's children via another binary search, to get P(C|AB). The next step is to calculate P(D|BC). Then we need go back to unigram to find B, and then get C from B's children, then get P(D|BC), all these done by binary searches. Obviously, the performance is low. When we evaluating P(C|AB), if we could directly locate to (B, C) from node C, we could get the search much faster.

We use (level, idx) to denote a node, the information on this node is described as (W, h', p, b, c):

  • W:word id. The location of this node implies its history information.
  • h':The back-off node, described as (level', idx').
  • p:p(W|h)
  • b:bow(h)
  • c:The start index of child node in next level.

Certainly, there is no b and c on leaf node. Now, the back-off structure becomes a graph from a tree. Its basic operation is: on
current history state hi(level,idx), by given an input word W, it transfer to another history state hj(level',
idx'), and return P(W|h). The level' on hj, not always equals to level-1. E.g., C(ABC)>0, but (B, C) is not seen in
training, then the h' on this node is the C node on level[1]. If C is not seen either, h' is then the pseudo root (level[0][0]).

Let's look at the following snippet,

                                  level+1
h (level,idx)    ----------->   W1,p1,h'1,[b,c]
    h',bow       \              ... ...
                  \             Wi,pi,h'i,[b,c]
                   \            ... ...
                    -------->   Wk,pk,h'k,[b,c]

Given an input word W. If W is one child of h, denoted as N (level+1, idx'), its p value is just P(W|h).

  • If N has children, then N is the new history node;
  • If not, the h' on N becomes the new history node.

If W is not child of h, execute the same processing from h' on (level, idx), find the new history node, and times the probability
with bow(h), return it as P(W|h). Note, it's a recursive process.

Let's look at the code:

slmthread.cpp:

CSIMSlmWithIteration::getIdString(it, history):

The m_history member (std::vector<int>) holds the indices for each level, save the word ID on this node to passed in argument history. Method next() increase the last element in m_history (i.e., the index in current level), and call adjustIterator() to find the preceding nodes, save the indices to m_history.

CSIMSlmWithIteration::findBackOffState(n, hw, bol, bon):

Find the back-off node (h') of a specified n-gram (held in hw[1..n]), and return its level and idx. Call findState() to get the idx of hw[2..n] on level[n-1], if the index is not less than 0 then this node (n-1, idx) does have child node, return the location of h'. Otherwise, call findState() to get the idx of hw[3..n] on level[n-2], ... If the loop reach hw[n], return the pseudo root.

E.g., find the back-off node for trigram (A, B, C). Find out if (B, C) exists, if so, return (2, idx_BC). Otherwise, find out if (C) exists, if so return (1, idx_C). Otherwise, return (0, 0).

CSIMSlmWithIteration::findState(n, hw):

Find the index of specified n-gram (held in hw[1..n]) on level[n]. If it does not exist, return -1.

We also perform the compression for 32-bits floating numbers when threading. bow values are compressed to 14 bits, pr values are compressed to 16 bits. The basic idea is to count all bow (or pr) values, combine the clustering floating numbers, make the total number is under 1<<BITS_BOW (or 1<<BITS_PR); in the generated language model binary file, there are two floating
tables, we need to search the table to get the original value. I'm not so clear about this algorithm, hopes the original author
Phill Zhang to introduce more details.

Now, we get the final language model -- lm_sc.t3g. You could use tslminfo to look at the data in plain text format:

$ make m3_tslminfo

./tslminfo -p -v -l ../raw/dict.utf8 ../data/lm_sc.t3g
>../swap/lm_sc.t3g.arpa

$ tail ../swap/lm_sc.t3g.arpa

点击 率 也   0.036363638937 (2,839922)
点击 率 最高   0.081818044186 (2,840054)
点击 率 的   0.096969485283 (2,840080)
点击 率 达到   0.036363638937 (2,840122)
点击 量 达   0.485714286566 (2,1132198)
点击 鼠标 ,   0.400000005960 (1,1)
点击 鼠标 左   0.309090882540 (2,1186378)
率队 取得 <unknown>   0.479999989271 (2,366031)
率队 在 <unknown>   0.130769222975 (2,431213)
率队 打 了   0.479999989271 (2,642183)

We could use  CThreadSlm to access or use the language model. In next section, we will use slmseg as an example, to see how we construct the lattice and search by leveraging the generated language model.

SunPinyin Code Tour (4)

4. Entropy-based Pruning

In this section, we will discuss the pruning for back-off based n-gram language model.

$ make m3_prune
./slmprune ../swap/lm_sc.3gram ../swap/lm_sc.3gm R 100000 1250000 1000000

Let's look at the definition of Entropy, assume p(x) is the probability density function for random variable X, then the entropy of X is:

Note: the logarithm base is 2, and 0log 0 = 0.

Entropy is used to measure the determinability of random events. The entropy value is larger, the determinability is lower. The last part of the above equation indicates that entropy is the expectation of log(1/p(X)). If p(X=xi) equals to 1/256, then log(1/p(xi))=log(256)=8, which means it requires 8 bits to encode the event xi. And H(p) is its expectation (weighted-average), i.e., the average information length.

Let's look at the relative entropy (i.e., Kullback-Leibler distance):

It means if the distribution of random variable X is p(x), while we use q(x) to encode X, the extra bits we are going to use.

The equation to calculate conditional relative entropy is:

Note: for the proving of this equation, please refer to "Elements of Information Theory".

Let's look at how to use relative entropy to prune back-off language model. The following analysis comes from "Entropy-based Pruning of Backoff Language Models".

Recall the general form of a back-off model,

Ps(Wi|h) = Pd(Wi|h)         -- C(h,Wi) > 0
           bow(h).Ps(Wi|h') -- C(h,Wi) == 0

The goal of pruning, is to remove some n-grams which have explicit estimation (i.e., C(h,Wi)>0), to reduce the size of parameters, while minimizing the performance loss. (Note, after the pruning, the probabilities of left ones which have explicit estimations do not change).

Pruning Steps:

  1. Select a threshold,
  2. Calculate the relative entropy by pruning each n-gram individually,
  3. Remove all N-grams that raise the relative entropy by less than the threshold, then re-calculate the back-off weights.

Assume we removed (h, w) from the model, when we want to calculate p(w|h), it requires backed-off (or implicit) estimation. To guarantee the formality of the model, i.e., to make sure sum_wi p(wi|h) = 1, bow(h) must be re-calculated, named as bow'(h), so p'(w|h) = bow'(h).p(w|h'). While the meantime, all backed-off estimations involving h are impacted, we use notation BO(wi,h) to denote this case (it's actually the circumstances that C(h,wi) == 0). We could get:

Note, h and w are given and concrete. And besides (h, w), the probabilities of left n-grams which have explicit estimations do not change, so some parts of the summation are counteracted.

Finally (please refer to the quoted paper for details), we could get:

In above equation, p(h={h1,h2...}) = p(h1)p(h2|h1)..., and we also know:

Next step is to calculate bow'(h):

To calculate bow'(h), is to drop the term for the pruned N-gram (w, h) from the summation in both numerator and denominator. Since bow(h) is known, if we got the numerator, then we could get the denominator, then add p(w|h) and p(w|h') separately. Then, we could get bow'(h).

Let's look at the code:

slmprune.cpp:

CSlmPruner::Prune()
:

Call PrunLevel(lvl) from level[N] to level[1], to prune each level. After the pruning, call CalcBOW() to re-calculate the back-off weights for each node in level[0..N-1].

CSlmPruner::PruneLevel(lvl):

sz[lvl] holds the node numbers in level[lvl], which include the pseudo tail, so the actually node number is sz[lvl]-1, assign it to n. cut[lvl] is the amount to be removed for this level. Allocate TNodeInfo[n] array, and assign to pbuf. Iterate each node in level[lvl], get its preceding nodes by a for loop, and put this n-gram to hw[0..lvl], keep the indices for each level to idx[0..lvl]. Figure out if this node has children ((pn+1)->child > pn->child), if yes, this node could not be pruned. If no, call CalcDistance(lvl, idx, hw) to calculate the increased relative entropy by removing this node. Save the information to pbuf. Continue to the next node in level[lvl].

After the iteration, perform a heap sorting to TNodeInfo array. Then iterate the top cut[lvl] elements in TNodeInfo array, set the probability in level[lvl][pinfo->idx] to 1.0. Then call CutLevel(), clean up the nodes whose probabilities are 1.0 from level[lvl], and change the child index of its parent node in level[lvl-1]. Assign the new size of level[lvl] to sz[lvl]. Clean up the allocated memory then return.

CSlmPruner::CalcDistance(lvl, idx[], hw[0..lvl]):

Firstly, get the bow(h) from its parent node, and save it in variable BOW. Then calculate p(h), p(h) = p(hw1).p(hw2|hw1)...p(hwlvl-1|hw1hw2...hwlvl-2), save it in variable PH. Then the probability on node level[lvl][idx[lvl]] is just p(w|h), save it to PHW. Call getPr(lvl-1, hw+2) to get the probability of p(w|h'), save it in variable PH_W.

If cache_level is not lvl-1 (then is -1), or cache_idx is not idx[lvl-1] (then is -1), then assign the proper indices to cache_level and cache_idx separately, and initialize cache_PA and cache_PB to 1.0. Iterate its child nodes [parent->child, (parent+1)->child), get the probability on each child node (i.e., p(wi|h)), and subtracted from cache_PA; and set hw[lvl] with the word id on current child node, call getPr(lvl-1, hw+2) to get probability p_r (i.e., p(wi|h')), and subtracted from cache_PB. After iteration, PA = sum_{w_i:BO(w_i,h)} P(w_i|h), and PB = sum_{w_i:BO(w_i,h)} P(w_i|h'). Add p(w|h) and p(w|h') to PA and PB separately, then get the quotient of PA/PB, i.e., bow'(h), save it in variable _BOW.

At last, follow the above equation to calculate D(p||p'), and return it.

CSlmPruner::CalcBOW(),CalcNodeBow(...):

It's similar with the CSlmBuilder::CalcBOW() and CalcNodeBow(...) in previous section. Not repeat here.