SunPinyin Code Tour (2)

1. Dictionary and Word Segmentation

As we described previously, Sunpinyin uses FMM (forward maximum matching) for the 1 pass word segmentation. The dictionary are loaded to a trie structure, the key is the Chinese character in UCS-4. Some nodes come into valid words, e.g., on the most left path in following diagram, "中国" is a valid word; while on the path of "中华人民共和国", the node "共" could not stands for a valid word. (This structure is similar with the one in reference 3.)

We could see, all child nodes of root are valid words, since every character could be considered as "single character word".


parseText ()   : load the dictionary in text file, and construct a trie.
matchLongest (): to do the longest match, and return the length.
step ()        : by given a character, transfer to a child node, and return it.


In function processSingleFile(), read the corpus sentence by sentence. For example, "为人民办实事的精神", and then call SIMDict::matchLongest() to do the longest match.

Because there is the entry "为人民" in dictionary,so the 1st segmented word we got is "为人民",its length is 3, then call getAmbiLen() to see if there are crossing-ambiguities, and return the maximum length of crossing-ambiguities. The following is the result for each iteration:

  1. 人民, 办实事的精神 -> i=1, len=2, word_len=3, do FMM for "人民办实事", get the matched word "人民", its length is 2.
  2. 民办, 实事的精神 -> i=2, len=2, word_len=4, do FMM for "民办实事",get the matched word "民办",its length is 2. Since the i+len is longer than the original parameter word_len, set the word_len to i+len (ie. 4), continue the iteration. You could see, we have detect the ambiguity.
  3. 办实事, 的精神 -> i=3, len=3, word_len=6
  4. 实事, 的精神 -> i=4, len=2, word_len=6
  5. 事, 的精神 -> i=5, len=1, word_len=6, the ambiguity length is now determined as 6.
  6. break the loop, and return the ambiguity length, ie. 6

If an ambiguous-id is specified (by the option -a), these 6 characters would be ignored, and output the specified AMBI-ID to the result, then continue the processing. When reach the end of a sentence, if we require the binary output format, a sentence ending ID (specified by option -s) would be output. Here shows a result in text format:

$ echo "为人民办实事的精神" | ./mmseg -d ../raw/dict.utf8 -f text -s 10  -a 9
<ambi>为人民办实事</ambi> 的 精神

So, in our segmentation result, all crossing-ambiguities are marked as single one word ID (AMBI-ID). That means we ignored that part of information (the ratio is considerable). Thus in our later processing, most of the trigrams are meaningful and valuable. The language model we got, could avoid the cross-ambiguities' impact somehow. Then, useslmseg on this language model, to re-segment the raw corpus and get the new language model. This time, the original ambigious information is also leveraged.

$ echo "为人民办实事的精神" | ./slmseg -d ../raw/dict.utf8 -f text -s 10 -m ../data/lm_sc.t3g
为人民 办实事 的 精神

Note, dueto the copyright of raw corpus, we may not be able to open it, however, to use slmseg, you could create a symbol link to ime/data/lm_sc.t3g.<arch> in slm/data directory.


  1. 数学之美 系列二 -- 谈谈中文分词
  2. 中文搜索引擎技术揭密:中文分词
  3. “天堂的阶梯”设计的中文分词算法

3 thoughts on “SunPinyin Code Tour (2)

  1. 我想我们要改善改善拼音切分了,像下面网址的图片中显示的结果依然需要倒回去人工切分,但其实所需只是一个拼音歧义切分而已。
    我没有认真研究底层的代码(只是粗略看了一下),虽然全切分会导致指数级的 Syllables Group 扩容,但应该可考虑 MM+RM 再结合递归方式适当增加切分的范畴。

  2. Hi, Anthony, 非常感谢,汉语拼音的切分歧义要比分词的歧义处理简单一些,而且Syllable Graph的厚度(或者宽度)不会超过2。我目前遇到的比较棘手的问题是,描述Syllable Graph的数据结构,以及采用Syllable Graph的方式后对用户的编辑处理会非常繁琐。有时间我们交流一下各自的想法 :)

  3. 因为上次和你说的那个实际处理上非常多的拼音切分歧义,除非是按罗马字的书写习惯进行输入。

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

To submit your comment, click the image below where it asks you to...
Clickcha - The One-Click Captcha