SunPinyin Code Tour (6)

6. Using the generated language model

In the previous step, we generated the threaded language model. The next question is how to use this model. Let's look at the header file of the CThreadSlm class, slm.h; here are the public methods:

  1. load(): Load the language model into memory
  2. free(): Release the memory allocated for the language model
  3. isUseLogPr(): Whether the language model uses -log values for probabilities
  4. transfer(history, wid, result): Given a word id wid, transfer from the history state (history) to a new state (result), and return P(wid|history).
  5. transferNegLog(history, wid, result): Similar to the above method, but returns -log(P(wid|history)).
  6. history_state_of(st): Get the h' of st, and return it.
  7. historify(st): Set st to its h'.
  8. lastWordId(st): Return the last word id of state st. Assuming st is (lvl, idx): if lvl>=N, return m_Levels[N][idx]; if 0<lvl<N, return m_Levels[lvl][idx]; if lvl==0 and idx==0, return the word id of the pseudo root (i.e., 0); if lvl==0 and idx>0, just return idx (we will discuss this situation when introducing the history cache).
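To make the call pattern concrete, here is a minimal sketch of how a caller chains transfer calls to score a word-id sequence. ToySlm is a hypothetical stand-in, not the real class: it keys a toy table on (last word id, next word id), while the real TState encodes a (level, idx) pair.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-in for CThreadSlm, with a simplified TState.
class ToySlm {
public:
    typedef unsigned TState;
    ToySlm() {
        // -log probabilities for a tiny vocabulary: 1 = "发扬", 2 = "精神"
        m_pr[std::make_pair(0u, 1u)] = 1.0;  // -log P(1 | <root>)
        m_pr[std::make_pair(1u, 2u)] = 0.5;  // -log P(2 | 1)
    }
    // Mirrors transferNegLog(history, wid, result): move to the new history
    // state and return -log P(wid | history).
    double transferNegLog(TState his, unsigned wid, TState& result) const {
        result = wid;
        std::map<std::pair<TState, unsigned>, double>::const_iterator it =
            m_pr.find(std::make_pair(his, wid));
        return it != m_pr.end() ? it->second : 10.0;  // large cost if unseen
    }
private:
    std::map<std::pair<TState, unsigned>, double> m_pr;
};

// Score a word-id sequence by chaining transferNegLog from the pseudo root:
// the sum of the returned values is -log P(sentence).
double negLogProb(const ToySlm& slm, const std::vector<unsigned>& ids) {
    ToySlm::TState st = 0, next = 0;
    double sum = 0.0;
    for (size_t i = 0; i < ids.size(); ++i) {
        sum += slm.transferNegLog(st, ids[i], next);
        st = next;
    }
    return sum;
}
```

This is exactly the loop slmseg runs over each candidate path, except that there the intermediate states live on the lattice columns.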

Now, let's look at slmseg to see how to search by leveraging the CThreadSlm class.

$ make slmids3
./slmseg -d ../raw/dict.utf8 -f bin -s 10 -m ../data/lm_sc.t3g ../raw/corpus.utf8 >../swap/lm_sc.ids

The source code of slmseg/slmseg.cpp:

struct TLatticeWord {
  int m_left;
  int m_right;   //[m_left, m_right) is the index range on lattice for this word
  int m_wordId;  //word id
};

struct TLatticeStateValue{
  double             m_pr;       //the probability on this state node
  TLatticeWord*      mp_btword;  //the back-trace word, i.e., the given word causing the transformation
  CThreadSlm::TState m_btstate;  //the back-trace state, i.e., where the current state comes from
};

/* The mapping of SLM state node, and its Lattice state information */
typedef std::map<CThreadSlm::TState, TLatticeStateValue> TLatticeColumnStates;

/* represent a column of lattice */
struct TLatticeColumn {
  TLatticeWordVec m_wordstarting;  //the words starting from this column; these are actually
                                   //the input words when expanding this column. The newly
                                   //transferred state nodes are located at lattice[word.m_right].
  TLatticeColumnStates m_states;   //the SLM state nodes and their corresponding Lattice states
};


Read the corpus sentence by sentence. For each sentence, call buildLattice() to build the search lattice, call searchBest() to search on the lattice, get the best segmentation with getBestPath(), and call output() to write out the best segmentation.

buildLattice(sentence, lattice):

Set the SLM state on lattice[0] to the pseudo root. Initialize i as 0, and perform FMM on sntnc[i..n] to get the word length, len. Then call getAmbiLen() to get the maximum length of the crossing ambiguities, ambilen. If ambilen <= len, there is no crossing ambiguity: call insertLatticeWord() to insert ([i, i+len), Wid) into the word list of lattice[i], set i=i+len, and continue the loop. If there is a crossing ambiguity, call fullSegbuildLattice() to perform a full segmentation of the sub-sentence sntnc[i..i+ambilen); for every segmented word (including all single-character words), assuming its location is [left, right), insert it into lattice[left]; then set i=i+ambilen and continue the loop. After the iteration is done, add an end-of-sentence id (we use "。" for this id in the following examples) at the end.

Use the example in section 2,

Perform FMM on this sample sentence. The first word we get is "发扬", with no ambiguity, so insert "发扬" into lattice[0]. Then i=2, continue the loop. The FMM result is "为人民", but ambilen is 6, so we need to perform a full segmentation of sntnc[2..7] (6 characters). Insert all possible segments starting at idx (2<=idx<=7) into lattice[idx]. Then i=2+6=8, continue the loop. The FMM result is "的", with no ambiguity. Then i=9, and the FMM result is "精神", again with no ambiguity. After the loop finishes, add an end-of-sentence ID at the tail. We can see that there is no word on lattice[1], and that the words on lattice[2] are "为", "为人", and "为人民". And so on...


searchBest(lattice):

Iterate the lattice from 0. For each lattice state node on lattice[i], expand it with the words starting from this column ([left, right), Wid), and save the newly transferred state node h' (CThreadSlm::historify(his)) to lattice[word.m_right].m_states. Note that, given a word D, two different state nodes (A, C) and (B, C) may result in the same h' (i.e., (C, D)), so we only need to keep the path with the higher probability.

Let's start from lattice[0]; we set it to the pseudo root (0, 0) in the buildLattice() method. There is only one word on lattice[0], "发扬"; call CThreadSlm::transferNegLog((0,0), wid_of("发扬")), get the new SLM state node and the probability P("发扬"), and save its h' (lvl=1, idx) and its lattice state information (probability and back-trace information) to lattice[2]. Then look at lattice[1]: there is no state node to be expanded. Then go forward to lattice[2]: there is one state node and three words ("为", "为人", "为人民"), so save the three newly transferred nodes to lattice[3], lattice[4] and lattice[5]. And so on...

This is a dynamic programming problem, and what we use here is the Viterbi algorithm over a lattice.

getBestPath(lattice, segResult):

From end to start, back-trace the lattice, saving the word ids on the best path to segResult. By reversing segResult, we get the best segmentation. For our example, it's the blue path in the figure above.

SunPinyin Code Tour (5)

5. Threading the language model

Let's look at the last step, threading (or adding back-off
indices) to the pruned language model.

$ make m3_thread
./slmthread ../swap/lm_sc.3gm ../data/lm_sc.t3g

The goal of threading is to accelerate searching in the language model. Take trigram as an example: if we want to calculate the probability of the sentence "ABCDE。", P(S) = P(A).P(B|A).P(C|AB).P(D|BC).P(E|CD).P(<EOS>|DE). Assume no back-off is needed when evaluating these conditional probabilities.

First, we get P(A) from level[1] (unigram) via a binary search; then find B among A's children via a binary search, to get P(B|A); and then find C among B's children via another binary search, to get P(C|AB). The next step is to calculate P(D|BC). For that we need to go back to the unigram level to find B, then get C from B's children, then get P(D|BC), all by binary searches. Obviously, the performance is low. If, after evaluating P(C|AB) at node C, we could directly locate the (B, C) node, the search would be much faster.

We use (level, idx) to denote a node, the information on this node is described as (W, h', p, b, c):

  • W: word id. The location of this node implies its history information.
  • h': the back-off node, described as (level', idx').
  • p: P(W|h)
  • b: bow(h)
  • c: the start index of its child nodes in the next level.

Certainly, there are no b and c on leaf nodes. Now, the back-off structure becomes a graph instead of a tree. Its basic operation is: on the current history state hi(level, idx), given an input word W, transfer to another history state hj(level', idx'), and return P(W|h). The level' of hj does not always equal level-1. E.g., if C(ABC)>0 but (B, C) was not seen in training, then the h' of this node is the C node on level[1]. If C was not seen either, h' is the pseudo root (level[0][0]).

Let's look at the following snippet,

h (level,idx)    ----------->   W1,p1,h'1,[b,c]
    h',bow       \              ... ...
                  \             Wi,pi,h'i,[b,c]
                   \            ... ...
                    -------->   Wk,pk,h'k,[b,c]

Given an input word W: if W is a child of h, denoted as N (level+1, idx'), its p value is just P(W|h).

  • If N has children, then N is the new history node;
  • If not, the h' of N becomes the new history node.

If W is not a child of h, perform the same processing from the h' of (level, idx) to find the new history node, multiply the probability by bow(h), and return it as P(W|h). Note that it's a recursive process.
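A toy version of this transfer, assuming each node explicitly stores its children, its h' link, and bow (all names illustrative; the real model packs nodes into per-level arrays, finds children by binary search, and a leaf's h' points into a real lower level rather than the root):

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <vector>

struct Node {
    std::map<int, int> child;   // word id -> index of the child node
    std::map<int, double> pr;   // word id -> p(w|h) on the child edge
    int backoff;                // index of h'
    double bow;                 // bow(h) of this history
};

// Transfer from history node `his` with input word w: return P(w|h) and set
// `result` to the new history node, multiplying in bow on every back-off.
double transfer(const std::vector<Node>& g, int his, int w, int& result) {
    double bow = 1.0;
    for (;;) {
        const Node& n = g[his];
        std::map<int, int>::const_iterator it = n.child.find(w);
        if (it != n.child.end()) {
            int nxt = it->second;
            // historify: if the child has no children, fall to its h'
            result = g[nxt].child.empty() ? g[nxt].backoff : nxt;
            return bow * n.pr.find(w)->second;
        }
        if (his == 0) {                  // pseudo root: average distribution
            result = 0;
            return bow * (1.0 / 120000);
        }
        bow *= n.bow;                    // back off: times bow(h), follow h'
        his = n.backoff;
    }
}

// A tiny graph: node 1 is history "A" (word id 1), node 2 the leaf for "A B".
std::vector<Node> toyGraph() {
    std::vector<Node> g(3);
    g[0].backoff = 0; g[0].bow = 1.0;
    g[0].child[1] = 1; g[0].pr[1] = 0.2;   // P(A) = 0.2
    g[1].backoff = 0; g[1].bow = 0.4;      // h' of "A" simplified to root
    g[1].child[2] = 2; g[1].pr[2] = 0.5;   // P(B|A) = 0.5
    g[2].backoff = 0; g[2].bow = 1.0;      // leaf; h' simplified to root
    return g;
}
```

From history "A", input B is found among the children (probability 0.5), while input A is not, so the lookup backs off to the root and returns bow("A") times P(A).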

Let's look at the code:


CSIMSlmWithIteration::getIdString(it, history):

The m_history member (std::vector<int>) holds the indices for each level; save the word IDs on these nodes to the passed-in argument history. The method next() increases the last element of m_history (i.e., the index in the current level), and calls adjustIterator() to find the preceding nodes, saving their indices to m_history.

CSIMSlmWithIteration::findBackOffState(n, hw, bol, bon):

Find the back-off node (h') of a specified n-gram (held in hw[1..n]), and return its level and idx. Call findState() to get the idx of hw[2..n] on level[n-1]; if the index is not less than 0 and the node (n-1, idx) does have child nodes, return it as the location of h'. Otherwise, call findState() to get the idx of hw[3..n] on level[n-2], and so on. If the loop reaches hw[n], return the pseudo root.

E.g., find the back-off node for trigram (A, B, C). Find out if (B, C) exists, if so, return (2, idx_BC). Otherwise, find out if (C) exists, if so return (1, idx_C). Otherwise, return (0, 0).

CSIMSlmWithIteration::findState(n, hw):

Find the index of specified n-gram (held in hw[1..n]) on level[n]. If it does not exist, return -1.

We also compress the 32-bit floating point numbers when threading: bow values are compressed to 14 bits, and pr values to 16 bits. The basic idea is to collect all bow (or pr) values and merge clusters of close values, so that the number of distinct values stays under 1<<BITS_BOW (or 1<<BITS_PR); the generated language model binary file then contains two float tables, and we look up these tables to recover the original values. I'm not so clear about this algorithm; hopefully the original author, Phill Zhang, can introduce more details.

Now, we get the final language model -- lm_sc.t3g. You could use tslminfo to look at the data in plain text format:

$ make m3_tslminfo

./tslminfo -p -v -l ../raw/dict.utf8 ../data/lm_sc.t3g

$ tail ../swap/

点击 率 也   0.036363638937 (2,839922)
点击 率 最高   0.081818044186 (2,840054)
点击 率 的   0.096969485283 (2,840080)
点击 率 达到   0.036363638937 (2,840122)
点击 量 达   0.485714286566 (2,1132198)
点击 鼠标 ,   0.400000005960 (1,1)
点击 鼠标 左   0.309090882540 (2,1186378)
率队 取得 <unknown>   0.479999989271 (2,366031)
率队 在 <unknown>   0.130769222975 (2,431213)
率队 打 了   0.479999989271 (2,642183)

We can use CThreadSlm to access and use the language model. In the next section, we will use slmseg as an example, to see how to construct the lattice and search by leveraging the generated language model.

SunPinyin Code Tour (4)

4. Entropy-based Pruning

In this section, we will discuss the pruning for back-off based n-gram language model.

$ make m3_prune
./slmprune ../swap/lm_sc.3gram ../swap/lm_sc.3gm R 100000 1250000 1000000

Let's look at the definition of entropy. Assume p(x) is the probability density function of random variable X; then the entropy of X is:

H(X) = -\sum_x p(x)\log p(x) = E[\log(1/p(X))]

Note: the logarithm base is 2, and 0log 0 = 0.

Entropy is used to measure the determinability of random events: the larger the entropy, the lower the determinability. The last part of the above equation indicates that entropy is the expectation of log(1/p(X)). If p(X=xi) equals 1/256, then log(1/p(xi))=log(256)=8, which means it requires 8 bits to encode the event xi. And H(p) is its expectation (weighted average), i.e., the average information length.

Let's look at the relative entropy (i.e., Kullback-Leibler distance):

D(p||q) = \sum_x p(x)\log(p(x)/q(x))

It means: if the distribution of random variable X is p(x), while we use q(x) to encode X, how many extra bits we are going to use.

The equation to calculate conditional relative entropy is:

D(p(y|x)||q(y|x)) = \sum_x p(x)\sum_y p(y|x)\log(p(y|x)/q(y|x))

Note: for the proving of this equation, please refer to "Elements of Information Theory".

Let's look at how to use relative entropy to prune back-off language model. The following analysis comes from "Entropy-based Pruning of Backoff Language Models".

Recall the general form of a back-off model,

Ps(Wi|h) = Pd(Wi|h)         -- C(h,Wi) > 0
           bow(h).Ps(Wi|h') -- C(h,Wi) == 0

The goal of pruning is to remove some n-grams which have explicit estimations (i.e., C(h,Wi)>0), to reduce the number of parameters while minimizing the performance loss. (Note: after the pruning, the probabilities of the remaining n-grams with explicit estimations do not change.)

Pruning Steps:

  1. Select a threshold,
  2. Calculate the relative entropy by pruning each n-gram individually,
  3. Remove all N-grams that raise the relative entropy by less than the threshold, then re-calculate the back-off weights.

Assume we removed (h, w) from the model. When we want to calculate p(w|h), it now requires the backed-off (or implicit) estimation. To guarantee the formality of the model, i.e., to make sure sum_wi p(wi|h) = 1, bow(h) must be re-calculated, named bow'(h), so that p'(w|h) = bow'(h).p(w|h'). In the meantime, all backed-off estimations involving h are impacted; we use the notation BO(wi,h) to denote this case (it's actually the circumstance that C(h,wi) == 0). We can get:

D(p||p') = -\sum_wi p(h,wi) [\log p'(wi|h) - \log p(wi|h)]

Note: h and w are given and concrete. Besides (h, w), the probabilities of the remaining n-grams with explicit estimations do not change, so those parts of the summation cancel out.

Finally (please refer to the quoted paper for details), we can get:

D(p||p') = -p(h) { p(w|h) [\log p(w|h') + \log bow'(h) - \log p(w|h)] + [\sum_{wi:BO(wi,h)} p(wi|h)] [\log bow'(h) - \log bow(h)] }

In the above equation, p(h={h1,h2...}) = p(h1)p(h2|h1)..., and we also know:

\sum_{wi:BO(wi,h)} p(wi|h) = 1 - \sum_{wi:C(h,wi)>0} p(wi|h)

The next step is to calculate bow'(h):

bow'(h) = (1 - \sum_{wi:C(h,wi)>0, wi!=w} p(wi|h)) / (1 - \sum_{wi:C(h,wi)>0, wi!=w} p(wi|h'))

To calculate bow'(h) is to drop the term for the pruned n-gram (h, w) from the summation in both the numerator and the denominator. Since bow(h) is known, once we get the numerator we can get the denominator; then add p(w|h) and p(w|h') to them respectively, and we get bow'(h).
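The computation can be sketched in toy form. pruneDistance and its arguments are illustrative names (the real CalcDistance walks node ranges instead of vectors), and natural log is used here; the relative ordering of distances is the same regardless of the log base.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct PruneResult { double bow2; double dist; };

// Compute bow'(h) and the relative-entropy increase for pruning one n-gram
// (h, w0).  ph_w / phq_w hold p(wi|h) and p(wi|h') for the explicitly
// estimated wi of history h; i0 indexes the pruned w0; ph is p(h).
PruneResult pruneDistance(const std::vector<double>& ph_w,
                          const std::vector<double>& phq_w,
                          std::size_t i0, double ph) {
    double sumNext = 0.0, sum = 0.0;
    for (std::size_t i = 0; i < ph_w.size(); ++i) {
        sumNext += ph_w[i];                    // sum of p(wi|h), explicit wi
        sum += phq_w[i];                       // sum of p(wi|h') over the same wi
    }
    double bow  = (1.0 - sumNext) / (1.0 - sum);   // bow(h) before pruning
    double bow2 = (1.0 - sumNext + ph_w[i0])       // drop (h, w0) from both
                  / (1.0 - sum + phq_w[i0]);       // sums -> bow'(h)
    double pBO  = 1.0 - sumNext;               // sum of p(wi|h) over backed-off wi
    double dist = -ph * (ph_w[i0] * (std::log(phq_w[i0]) + std::log(bow2)
                                     - std::log(ph_w[i0]))
                         + pBO * (std::log(bow2) - std::log(bow)));
    PruneResult r; r.bow2 = bow2; r.dist = dist; return r;
}
```

Pruning then keeps the n-grams whose dist exceeds the chosen threshold and removes the rest.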

Let's look at the code:



Call PrunLevel(lvl) from level[N] down to level[1], to prune each level. After the pruning, call CalcBOW() to re-calculate the back-off weights for each node in level[0..N-1].


CSlmPruner::PrunLevel(lvl):

sz[lvl] holds the number of nodes in level[lvl], which includes the pseudo tail, so the actual node number is sz[lvl]-1; assign it to n. cut[lvl] is the amount to be removed from this level. Allocate a TNodeInfo[n] array, and assign it to pbuf. Iterate over each node in level[lvl]: get its preceding nodes with a for loop, put this n-gram into hw[0..lvl], and keep the indices for each level in idx[0..lvl]. Check whether this node has children ((pn+1)->child > pn->child); if yes, this node cannot be pruned. If no, call CalcDistance(lvl, idx, hw) to calculate the increase in relative entropy from removing this node. Save the information to pbuf. Continue to the next node in level[lvl].

After the iteration, heap-sort the TNodeInfo array. Then iterate over the top cut[lvl] elements of the TNodeInfo array, setting the probability in level[lvl][pinfo->idx] to 1.0. Then call CutLevel() to clean up the nodes whose probabilities are 1.0 from level[lvl], and adjust the child indices of their parent nodes in level[lvl-1]. Assign the new size of level[lvl] to sz[lvl]. Clean up the allocated memory, then return.

CSlmPruner::CalcDistance(lvl, idx[], hw[0..lvl]):

Firstly, get bow(h) from the parent node, and save it in variable BOW. Then calculate p(h) = p(hw1).p(hw2|hw1)...p(hwlvl-1|hw1hw2...hwlvl-2), and save it in variable PH. The probability on node level[lvl][idx[lvl]] is just p(w|h); save it to PHW. Call getPr(lvl-1, hw+2) to get the probability p(w|h'), and save it in variable PH_W.

If cache_level is not lvl-1 (initially it is -1), or cache_idx is not idx[lvl-1] (initially -1), assign the proper indices to cache_level and cache_idx respectively, and initialize cache_PA and cache_PB to 1.0. Iterate over the child nodes [parent->child, (parent+1)->child): get the probability on each child node (i.e., p(wi|h)) and subtract it from cache_PA; set hw[lvl] to the word id of the current child node, call getPr(lvl-1, hw+2) to get the probability p_r (i.e., p(wi|h')), and subtract it from cache_PB. After the iteration, PA = sum_{wi:BO(wi,h)} P(wi|h), and PB = sum_{wi:BO(wi,h)} P(wi|h'). Add p(w|h) and p(w|h') to PA and PB respectively, then take the quotient PA/PB, i.e., bow'(h), and save it in variable _BOW.

At last, follow the above equation to calculate D(p||p'), and return it.


CSlmPruner::CalcBOW():

It's similar to CSlmBuilder::CalcBOW() and CalcNodeBow(...) in the previous section; we won't repeat it here.

SunPinyin Code Tour (3)

2. Counting Trigrams

After we have tokenized the corpus, the next task is to count the occurrences of each trigram:

$ make m3_idngram
./ids2ngram -n 3 -s ../swap/swap -o ../swap/lm_sc.id3gram -p 5000000 ../swap/lm_sc.ids



Read the ID stream file generated in the previous step. Read N-1 (in the trigram case, 2) IDs first, and save them to the ids array (a member of the ngram object). Then read another ID, and save it as the third element (i.e., ids[N-1]) of the ids array. Invoke operator [] of std::map with the array ids[0..2] as the key to retrieve the occurrence count of this trigram (if it was never seen, operator [] inserts the <key,0> pair into the map), and increase the count. Shift the ids array left by one cell, and continue the above processing.

During the iteration, we also monitor whether the size of the map reaches the maximum number (paraMax). This is to prevent the map from occupying too much memory. If it reaches the maximum, output the map to a swap file, record the offset of this paragraph, then clear the map and continue the counting. Note that std::map is an ordered container, whose inner structure is a balanced ordered binary tree (usually a red-black tree). We then need to perform a merge sort over all paragraphs.


Merge sort. To merge several ordered data sources, find the minimal (or maximal) element among the head elements and output it; the head of the affected data source then moves to its next element; repeat until the set of head elements is empty. If we have many data sources to be sorted, sorting the head elements becomes critical for performance. In SunPinyin, there is a template class to deal with multi-way merging, i.e., slm/sim_fmerge.h. It uses the heap sort algorithm to sort the head elements: getBest() returns the paragraph with the minimal head element, and next() adds the next element of that paragraph to the heap.
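The same idea can be sketched with std::priority_queue (the real sim_fmerge.h implements its own heap over file-backed paragraphs; mergeParagraphs is an illustrative name):

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Multi-way merge with a heap, in the spirit of slm/sim_fmerge.h: pop the
// paragraph whose head element is smallest, then push its next element.
std::vector<int> mergeParagraphs(const std::vector<std::vector<int> >& paras) {
    typedef std::pair<int, std::pair<size_t, size_t> > Head; // (value, (para, pos))
    std::priority_queue<Head, std::vector<Head>, std::greater<Head> > heap;
    for (size_t p = 0; p < paras.size(); ++p)
        if (!paras[p].empty())
            heap.push(std::make_pair(paras[p][0], std::make_pair(p, (size_t)0)));
    std::vector<int> out;
    while (!heap.empty()) {
        Head h = heap.top(); heap.pop();
        out.push_back(h.first);                  // getBest(): minimal head element
        size_t p = h.second.first, i = h.second.second + 1;
        if (i < paras[p].size())                 // next(): advance this paragraph
            heap.push(std::make_pair(paras[p][i], std::make_pair(p, i)));
    }
    return out;
}
```

With k paragraphs, each output element costs O(log k), instead of O(k) for a linear scan of the heads.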

Finally, we get a raw trigram model without any smoothing, i.e., all trigrams with their occurrence counts.

3. Build the back-off based n-gram language model

Next step is to build a back-off based n-gram model:

$ make m3_slm
./slmbuild -n 3 -o ../swap/lm_sc.3gram  -w 120000 -c 0,2,2 -d ABS,0.0005 -d ABS -d ABS,0.6 -b 10 -e 9 ../swap/lm_sc.id3gram

Firstly, let's have a look at the definition of n-gram.

For the statistical based NLP (natural language processing), the probability of a sentence (S=W1,W2,...Wn), according to the chain rule, is:

P(S) = P(W1).P(W2|W1).P(W3|W1,W2).P(W4|W1,W2,W3)...P(Wn|W1,W2,...Wn-1)
     = P(W1).prod_{i=2}^n P(Wi|W1,W2,...Wi-1)

To express mathematics equations in text mode, we use Latex-like language here. To express the above equation with Latex, it's P(S) = P(W_1)\prod_i^nP(W_i|W_1,W_2,...W_{i-1}), the visualizing form is:

In reality, due to data sparseness, it's impossible to calculate the probability in such a way. A practical method is to assume that P(Wi|W1,W2,...Wi-1) only depends on the previous N words, i.e., Wi-N,...,Wi-1. In particular, we have unigram (N=0, context-free grammar), bigram (N=1), trigram (N=2), and fourgram (N=3). The most commonly used is trigram.

Let's have a look at several useful terms:

  • types: the size of vocabulary (or dictionary)
  • tokens: the size of corpus
  • vector/solution space: for a N-gram model, V = types^N
  • data sparseness: tokens << V, tokens is much smaller than V
  • Maximum likelihood: Pml(Wi|Wi-2,Wi-1) = C(Wi-2,Wi-1,Wi)/C(Wi-2,Wi-1), where C stands for the occurrence count. The maximum likelihood estimation of P(Wi|Wi-2,Wi-1) equals the ratio between the occurrence counts of (Wi-2,Wi-1,Wi) and (Wi-2,Wi-1).

Many possible word sequences may not appear in the training corpus. If Pml(Wi|h) is 0, the probability of the entire sentence becomes 0. E.g., if we only see "Tom read a book" in the corpus, and never anything like "Yong read xxx", then Pml(Yong read a book) = 0. So we need to smooth the model. Smoothing re-allocates some probability mass from known events to unknown events (i.e., the events whose occurrence count is 0).

Some smoothing methodologies:

  • Simple Smoothing: add-one (or add delta), poor performance if used alone
  • Discounting smoothing: e.g., Absolute Discounting, linear smoothing, Witten-Bell, Good-Turing
  • Composite smoothing: back-off smoothing, and interpolated smoothing

A general back-off model could be expressed as following:

Ps(Wi|h) = Pd(Wi|h)         -- C(h,Wi) > 0
           bow(h).Ps(Wi|h') -- C(h,Wi) == 0

  • h' is the history h truncated by its first word. For trigram (A,B,Wi), h is (A,B), so h' is (B).
  • Pd(Wi|h) < Pml(Wi|h), to discount the ML estimation; many discounting methods can be used
  • bow(h) is the back-off weight; for a given h, bow(h) is a constant, determined by Ps(W1|h)+Ps(W2|h)+...+Ps(Wn|h) = sum_i (Ps(Wi|h)) = 1
  • this is a recursive expression: if the occurrence count of (h',Wi) is 0, the back-off continues; it may back off to the unigram Wi, or even to the average distribution (if Wi was not seen).
  • if h was not seen in the training stage, P(Wi|h) = P(Wi|h')
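The recursion can be sketched with plain maps. Everything here is illustrative: g_pr holds the explicitly estimated Ps(w|h) (with unigrams stored under an empty history), g_bow holds the back-off weights, and 1/120000 stands in for the average distribution at the pseudo root.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>
#include <utility>
#include <vector>

typedef std::vector<std::string> Hist;
std::map<std::pair<Hist, std::string>, double> g_pr;  // Ps(w|h), C(h,w) > 0
std::map<Hist, double> g_bow;                         // bow(h)

double getPr(const Hist& h, const std::string& w) {
    std::map<std::pair<Hist, std::string>, double>::const_iterator it =
        g_pr.find(std::make_pair(h, w));
    if (it != g_pr.end()) return it->second;    // C(h,w) > 0: explicit estimate
    if (h.empty()) return 1.0 / 120000;         // unseen even as a unigram
    std::map<Hist, double>::const_iterator bit = g_bow.find(h);
    double bow = (bit != g_bow.end()) ? bit->second : 1.0;  // unseen h: P=P(w|h')
    Hist hp(h.begin() + 1, h.end());            // h' drops the first word of h
    return bow * getPr(hp, w);                  // recursive back-off
}
```

With unigram P(C)=0.1 and bow(A,B)=0.5, looking up P(C|A,B) backs off twice: through (B) (whose bow defaults to 1.0) down to the unigram, giving 0.5*0.1.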

Katz's back-off model uses Good-Turing discounting. Kneser-Ney is another back-off model, whose performance is better than Katz's. Stanley Chen and Joshua Goodman wrote a technical report discussing the pros and cons of different smoothing methods; you can refer to it for more details.

Let's look at the code.



Initialize the level array according to parameter n: level[0] is the pseudo root, used for the average distribution; level[1] is for unigrams, level[2] is for bigrams, and level[3] is for trigrams... For a trigram model, the nodes on level[3] are leaf nodes. The leaf nodes do not carry bow information, since they belong to the circumstance that C(h,Wi)>0.

CSlmBuilder::AddNGram(ngram, fr)

Call isExcludeId() to check whether the first word of the ngram is an excluded word (in this case it's 9, the AMBI-ID). For every vector in array level[1..n], check whether its reserved space is used up; if so, allocate more memory for it. If the 1st word is not an excluded word, add its occurrence count to level[0][0] (i.e., the pseudo root). Add each word id in ngram to level[i] (0<i<=n), and accumulate the occurrence count in the parent node. For level[n] (i.e., the leaf level), only the ngrams whose occurrence counts are bigger than the specified cut number are added; this avoids having too many leaf nodes (maybe tens of millions) to hold in memory.

In this function, we also check: if ngram[i] (0<i<=n-1) is an excluded word, then only ngram[0..i-1] is counted; if ngram[i] (0<=i<=n-1) is a sentence stop (as we specified, it's 10), then only ngram[0..i] is counted. E.g., the trigram (9, x, y) is ignored directly; for (x, 9, y), only the unigram (x) is counted; for (10, x, y), the unigram (10) is counted; for (x, 10, y), the unigram (x) and the bigram (x, 10) are counted.


The entry point to build the back-off based n-gram model.


Count the total occurrences of the ngrams whose occurrence counts are less than SLM_MAX_R (i.e., 16). E.g., nr[3][0] is the total number of trigrams, and nr[3][10] is the total occurrence count of the trigrams which occur 10 times (necessarily a multiple of 10, e.g., 500). These data are used when initializing the discounters.


To add a tail node for each level, just for convenience in iterating.


Delete the ngrams whose occurrence counts are less than the specified thresholds. The thresholds we specified are (-c 0,2,2), which means we ignore the bigrams and trigrams whose occurrence counts are less than 2, but do not cut any unigrams.


Initialize the discounter for each level. Call DiscountOneLevel() to perform discounting for each level, from higher to lower. And set level[0] (the pseudo root) to the average distribution; its probability is 1 divided by the number of types (specified by the -w flag, in this example 120000).

CSlmBuilder::DiscountOneLevel(v, ch, disc ...)

v is the previous level, e.g., if we are discounting level[3], the 1st argument is level[2]. To discount level[m], iterate over every node in level[m-1], then discount its child nodes ([it->child, it_next->child)). The numbers subtracted here are frequencies, not probabilities. Then divide by the frequency in the parent node to get the conditional probability, and save it in the original node.


Calculate the back-off weights, from higher to lower level. base[i] refers to the 1st element of level[i], and idx[i] is the cursor in level[i], so (base[i])[idx[i]] is the node being visited. (Here we rely on the fact that the memory space allocated by std::vector is contiguous, so we can use pointers or array indexing to access the elements.)

Take calculating the BOWs on level[2] as an example. Then lvl is 2, and we are going to iterate over every word in base[2]. At first, try to find the parent node of this word with a for loop, e.g., idx[2] refers to level[2][1] (C) in the above diagram, so its parent is level[1][0] (A); put them in the word[0..4] array, resulting in {0, A, C, .}. Then call CalcNodeBow() to get the BOW for this node; the last two parameters of this function are the range of the child nodes [ch+node.child, ch+node_next.child).


Iterate over the nodes in [chh, cht), accumulating each probability into variable sumnext; and call builder->getPr(lvl, words+2) to get a probability, accumulated into variable sum. The actual effect of words+2 is to truncate the 1st word of the history. Using the above example, on the 1st iteration words[0..4]={0, A, C, D}, so words+2 is {C, D}, and the probability returned by getPr() is Ps(D|C). (Note: getPr() itself is a recursive function; when lvl is 0, it returns the average distribution probability.) BOW is then calculated as (1.0-sumnext)/(1.0-sum).

As we explained, BOW is determined by the equation sum_i (Ps(Wi|h)) = 1. From the following transformation, you can see the meanings of sumnext and sum:

1 = sum_{C(h,Wi)>0} Ps(Wi|h) + bow(h).sum_{C(h,Wi)==0} Ps(Wi|h')
  = sumnext + bow(h).(1 - sum)

so bow(h) = (1 - sumnext)/(1 - sum), where sumnext = sum_{C(h,Wi)>0} Ps(Wi|h), and sum = sum_{C(h,Wi)>0} Ps(Wi|h').
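A direct transcription of that formula (calcNodeBow here is a toy re-implementation, taking the child probabilities as plain vectors rather than node ranges):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// bow(h) = (1 - sumnext) / (1 - sum), where sumnext sums Ps(Wi|h) over the
// explicitly-estimated children of h, and sum sums Ps(Wi|h') over the same Wi.
double calcNodeBow(const std::vector<double>& prChild,   // Ps(Wi|h)
                   const std::vector<double>& prLower) { // Ps(Wi|h')
    double sumnext = 0.0, sum = 0.0;
    for (size_t i = 0; i < prChild.size(); ++i) {
        sumnext += prChild[i];
        sum += prLower[i];
    }
    return (1.0 - sumnext) / (1.0 - sum);
}
```

For example, with two explicit children where Ps(Wi|h) = {0.5, 0.3} and Ps(Wi|h') = {0.4, 0.2}, bow(h) = 0.2/0.4 = 0.5; a third vocabulary word with Ps(w3|h') = 0.4 then receives the backed-off mass 0.5*0.4 = 0.2, and the three probabilities sum to 1, as the constraint requires.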

SunPinyin Code Tour (2)

1. Dictionary and Word Segmentation

As we described previously, SunPinyin uses FMM (forward maximum matching) for the 1st-pass word segmentation. The dictionary is loaded into a trie structure, keyed by the Chinese characters in UCS-4. Some nodes correspond to valid words, e.g., on the left-most path in the following diagram, "中国" is a valid word; while on the path of "中华人民共和国", the node "共" does not stand for a valid word. (This structure is similar to the one in reference 3.)

We can see that all child nodes of the root are valid words, since every character can be considered a "single character word".


parseText()   : load the dictionary from a text file, and construct a trie.
matchLongest(): do the longest match, and return the matched length.
step()        : given a character, transfer to a child node, and return it.


In function processSingleFile(), read the corpus sentence by sentence. For example, "为人民办实事的精神", and then call SIMDict::matchLongest() to do the longest match.

Because there is an entry "为人民" in the dictionary, the 1st segmented word we get is "为人民", whose length is 3. Then call getAmbiLen() to see if there are crossing ambiguities, and return the maximum length of the crossing ambiguities. The following is the result of each iteration:

  1. 人民, 办实事的精神 -> i=1, len=2, word_len=3: do FMM for "人民办实事", get the matched word "人民", whose length is 2.
  2. 民办, 实事的精神 -> i=2, len=2, word_len=4: do FMM for "民办实事", get the matched word "民办", whose length is 2. Since i+len is longer than the original parameter word_len, set word_len to i+len (i.e., 4) and continue the iteration. You can see, we have detected the ambiguity.
  3. 办实事, 的精神 -> i=3, len=3, word_len=6
  4. 实事, 的精神 -> i=4, len=2, word_len=6
  5. 事, 的精神 -> i=5, len=1, word_len=6, the ambiguity length is now determined as 6.
  6. break the loop, and return the ambiguity length, i.e., 6

If an ambiguous-id is specified (by the option -a), these 6 characters are ignored, and the specified AMBI-ID is output to the result instead; then continue the processing. When reaching the end of a sentence, if the binary output format is required, a sentence-ending ID (specified by option -s) is output. Here is a result in text format:

$ echo "为人民办实事的精神" | ./mmseg -d ../raw/dict.utf8 -f text -s 10  -a 9
<ambi>为人民办实事</ambi> 的 精神

So, in our segmentation result, every crossing ambiguity is marked as a single word ID (the AMBI-ID). That means we discard that part of the information (and its ratio is considerable). Thus, in our later processing, most of the trigrams are meaningful and valuable. The language model we get can somewhat avoid the impact of the crossing ambiguities. Then, use slmseg with this language model to re-segment the raw corpus and get a new language model. This time, the originally ambiguous information is also leveraged.

$ echo "为人民办实事的精神" | ./slmseg -d ../raw/dict.utf8 -f text -s 10 -m ../data/lm_sc.t3g
为人民 办实事 的 精神

Note: due to the copyright of the raw corpus, we may not be able to release it; however, to use slmseg, you could create a symbolic link to ime/data/lm_sc.t3g.<arch> in the slm/data directory.


References:

  1. 数学之美 系列二 -- 谈谈中文分词 (The Beauty of Mathematics, Series 2: On Chinese Word Segmentation)
  2. 中文搜索引擎技术揭密:中文分词 (Chinese Search Engine Techniques Revealed: Chinese Word Segmentation)
  3. "天堂的阶梯"设计的中文分词算法 (the Chinese word segmentation algorithm designed by "Stairway to Heaven")

SunPinyin Code Tour (1)

0. Overview

First, let's check out the sources from the inputmethod repository:
$ hg clone ssh://

Or you could navigate to the input-method project site, and download the latest repository snapshot. (While it does not include the large binary data files in ime/data, you could download them separately.)

SunPinyin's source is located under inputmethod/sunpinyin, which contains two parts: the sources under the slm directory are the training utilities for the statistical language model; the ones under ime are related to the input method engine, including the core logic and the ports to various input method frameworks.

The code in slm directory is used to build a back-off based n-gram statistical language model,

You could build the utilities as following steps:

$ ./ --prefix=/usr
$ make

build: all compiled utilities (as their object files) are located in this directory, e.g., ids2ngram, slmbuild etc. By checking the in this folder, you could know how to build the slm and lexicon files step by step.

$ export LC_ALL=zh_CN.UTF-8
Since the dictionaries and raw corpus are in UTF-8 encoding, please set your locale to a UTF-8 locale before proceeding.

$ make test_corpus (or real_corpus)
This target creates a symbolic link to corpus.utf8. You could also create it manually.

$ make trigram
This target is to produce a trigram language model.

  1. Use mmseg (the forward maximum matching segmenter) to tokenize the raw corpus based on the dictionary, and output the word IDs to a file (../swap/lm_sc.ids).
  2. Use ids2ngram to count the occurrences of every trigram, and output the result to a file. Take the sentence 'ABCEBCA。' as an example; the trigrams we get are (<S> A B), (A B C), (B C E) ... (B C A), (C A 。), (A 。 </S>). (<S> and </S> stand for the start and end of a sentence respectively.)
  3. Use slmbuild to construct a back-off trigram language model, not yet pruned.
  4. Use slmprune to prune the trigram model, with an entropy-based algorithm.
  5. Use slmthread to thread the language model, to speed up locating the back-off states. Save the final output to ../data/lm_sc.t3g.

$ make bs_trigram
This target is similar to the above one; the only difference is that it uses slmseg to do the segmentation, which utilizes the language model we just got (trained on the FMM-segmented corpus) to re-segment the raw corpus. That improves the accuracy of the language model.

$ make lexicon
According to the unigram information in the generated language model, process the dictionary file and get a lexicon file (a trie indexed by pinyin characters) that supports incomplete pinyin.

We will introduce the code in details later.