SunPinyin Code Tour (2)

1. Dictionary and Word Segmentation

As we described previously, Sunpinyin uses FMM (forward maximum matching) for the 1 pass word segmentation. The dictionary are loaded to a trie structure, the key is the Chinese character in UCS-4. Some nodes come into valid words, e.g., on the most left path in following diagram, "中国" is a valid word; while on the path of "中华人民共和国", the node "共" could not stands for a valid word. (This structure is similar with the one in reference 3.)

We could see, all child nodes of root are valid words, since every character could be considered as "single character word".

sim_dict.[h|cpp]:

parseText ()   : load the dictionary in text file, and construct a trie.
matchLongest (): to do the longest match, and return the length.
step ()        : by given a character, transfer to a child node, and return it.

mmseg/mmseg.cpp:

In function processSingleFile(), read the corpus sentence by sentence. For example, "为人民办实事的精神", and then call SIMDict::matchLongest() to do the longest match.

Because there is the entry "为人民" in dictionary,so the 1st segmented word we got is "为人民",its length is 3, then call getAmbiLen() to see if there are crossing-ambiguities, and return the maximum length of crossing-ambiguities. The following is the result for each iteration:

  1. 人民, 办实事的精神 -> i=1, len=2, word_len=3, do FMM for "人民办实事", get the matched word "人民", its length is 2.
  2. 民办, 实事的精神 -> i=2, len=2, word_len=4, do FMM for "民办实事",get the matched word "民办",its length is 2. Since the i+len is longer than the original parameter word_len, set the word_len to i+len (ie. 4), continue the iteration. You could see, we have detect the ambiguity.
  3. 办实事, 的精神 -> i=3, len=3, word_len=6
  4. 实事, 的精神 -> i=4, len=2, word_len=6
  5. 事, 的精神 -> i=5, len=1, word_len=6, the ambiguity length is now determined as 6.
  6. break the loop, and return the ambiguity length, ie. 6

If an ambiguous-id is specified (by the option -a), these 6 characters would be ignored, and output the specified AMBI-ID to the result, then continue the processing. When reach the end of a sentence, if we require the binary output format, a sentence ending ID (specified by option -s) would be output. Here shows a result in text format:

$ echo "为人民办实事的精神" | ./mmseg -d ../raw/dict.utf8 -f text -s 10  -a 9
<ambi>为人民办实事</ambi> 的 精神

So, in our segmentation result, all crossing-ambiguities are marked as single one word ID (AMBI-ID). That means we ignored that part of information (the ratio is considerable). Thus in our later processing, most of the trigrams are meaningful and valuable. The language model we got, could avoid the cross-ambiguities' impact somehow. Then, useslmseg on this language model, to re-segment the raw corpus and get the new language model. This time, the original ambigious information is also leveraged.

$ echo "为人民办实事的精神" | ./slmseg -d ../raw/dict.utf8 -f text -s 10 -m ../data/lm_sc.t3g
为人民 办实事 的 精神

Note, dueto the copyright of raw corpus, we may not be able to open it, however, to use slmseg, you could create a symbol link to ime/data/lm_sc.t3g.<arch> in slm/data directory.

References:

  1. 数学之美 系列二 -- 谈谈中文分词
  2. 中文搜索引擎技术揭密:中文分词
  3. “天堂的阶梯”设计的中文分词算法