Experimenting with CRF++

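Using Dr. Hai Zhao's 6-tag + 6-template method, I ran experiments on the publicly available bakeoff2005 corpora. I wrote a simple Python script to convert the UTF-8 training corpora into the format CRF++ expects. The converted MSR corpus is 24 MB; training took about 26 hours and produced a 25 MB model. On the MSR test data the F-score reaches 96% (measured with a Python evaluation script), but on the PKU test data it is only a little over 82%. The converted PKU corpus is 11 MB; training took nearly 13 hours and produced a 14 MB model. On the PKU test data the F-score is a little over 92%, while on the MSR test data it is again only about 82%. It appears that the MSR and PKU training corpora follow quite different segmentation conventions, which is why the cross-corpus scores are so low.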
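To make the conversion step concrete, here is a minimal sketch of what such a script might look like. It is not the script I actually used. It assumes the 6 tags are B, B2, B3, M, E and S (my understanding of Zhao's tag set), that the bakeoff corpus has one sentence per line with words separated by whitespace, and that the CRF++ input is one character per line followed by a tab and its tag, with a blank line between sentences. The function names are made up for illustration.

```python
def tag_word(word):
    """Return the assumed 6-tag labels (B, B2, B3, M, E, S) for one word."""
    n = len(word)
    if n == 1:
        return ["S"]                         # single-character word
    head = ["B", "B2", "B3"][: n - 1]        # up to three leading positions
    middle = ["M"] * (n - 1 - len(head))     # remaining inner characters
    return head + middle + ["E"]             # last character is always E


def convert_line(line):
    """Convert one segmented sentence into CRF++ rows (char<TAB>tag)."""
    rows = []
    for word in line.split():                # split() also handles U+3000
        for ch, tag in zip(word, tag_word(word)):
            rows.append(f"{ch}\t{tag}")
    # A blank line marks the end of the sentence for CRF++.
    return "\n".join(rows) + "\n\n" if rows else ""


def convert_file(src_path, dst_path):
    """Convert a whole UTF-8 corpus file, one sentence per input line."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(convert_line(line))
```

With this sketch, a two-character word becomes B E, a three-character word B B2 E, and anything longer than four characters gets M for every character between B3 and E.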

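Also, probably because of a thread-safety problem in the C++ STL, multi-threaded training segfaulted on Linux, Solaris and Mac OS, so every model was trained with a single thread. I dare not imagine how much time and memory a corpus of several hundred megabytes would take...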

Here is the feature template definition:

# Unigram
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
U03:%x[-1,0]/%x[0,0]
U04:%x[0,0]/%x[1,0]
U05:%x[-1,0]/%x[1,0]
# Bigram
B
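
To show what these templates do, here is a rough sketch of how each %x[row,col] macro could be expanded at a given position in a sentence. This only imitates the behaviour for illustration and is not CRF++'s own code. The expand function and the sample sentence are made up, and the _B-1 / _B+1 names for positions outside the sentence are taken from the text model dump of CRF++.

```python
import re

# The six unigram templates listed above.
TEMPLATES = [
    "U00:%x[-1,0]",
    "U01:%x[0,0]",
    "U02:%x[1,0]",
    "U03:%x[-1,0]/%x[0,0]",
    "U04:%x[0,0]/%x[1,0]",
    "U05:%x[-1,0]/%x[1,0]",
]

MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand(template, rows, pos):
    """Expand every %x[row,col] macro relative to position pos.

    rows is a list of token rows, each row a list of column values.
    """
    def lookup(match):
        offset, col = int(match.group(1)), int(match.group(2))
        i = pos + offset
        if i < 0:
            return f"_B{i}"                      # e.g. i == -1 gives "_B-1"
        if i >= len(rows):
            return f"_B+{i - len(rows) + 1}"     # first position past the end gives "_B+1"
        return rows[i][col]
    return MACRO.sub(lookup, template)


if __name__ == "__main__":
    sentence = [["我"], ["喜"], ["欢"]]          # one column: the character itself
    for t in TEMPLATES:
        print(expand(t, sentence, 0))
```

Each expanded string, such as U03:_B-1/我, is one observation; the trainer then pairs every observation with every output tag to form the actual feature functions.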

A Beginner's Note on CRF++

Thanks to Yandong's help and guidance, I now have some basic ideas about CRF (Conditional Random Field) and what a CRF model looks like. The encoder of CRF++, crf_learn, can generate a model in text format with the '-t' option. Taking the Japanese word segmentation demonstration (example/seg) as an example, the following is the model in text format:

version: 100
cost-factor: 1
maxid: 1386      /* the number of feature functions */
xsize: 1

B                /* the tag lists, in this case, we have two tags */
I

U00:%x[-2,0]     /* unigram feature templates */
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-2,0]/%x[-1,0]/%x[0,0]
U06:%x[-1,0]/%x[0,0]/%x[1,0]
U07:%x[0,0]/%x[1,0]/%x[2,0]
U08:%x[-1,0]/%x[0,0]
U09:%x[0,0]/%x[1,0]
B               /* bigram feature template */

0 B             /* bigram of the tags for C_{-1} and C_0;    */
                /* number of features is (# of tags)^2.      */

4 U00:_B-1      /* _B-1 is the padding token before the sentence start */
                /* _B+1 is the padding token after the sentence end    */

6 U00:_B-2      /* _B-2 is the token before _B-1 */
                /* _B+2 is the token after _B+1  */

8 U00:
10 U00:、       /* feature function id, template id, and observation */
12 U00:〇       /* since we only have two tags, each entry could     */
14 U00:「       /* be expanded to 2 feature functions                */
20 U00:う
... ...
... ...
1382 U09:3/年
1384 U09:9/3

-0.0799963416235706     /* the weight for each feature function; */
0.4346315510326526      /* a negative value lowers the score of  */
-0.1044728887459596     /* the associated tag. We have 1386      */
-0.2501623206703318     /* weights in total.                     */
... ...
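
To tie the feature ids to the weight list, here is a small sketch of how I believe the lookup works. It rests on two assumptions: a unigram entry with id f owns the weights f + (tag index), and the bigram entry owns f + (previous tag index) * (# of tags) + (current tag index). The helper names are made up, and only the first four weights shown above are used.

```python
TAGS = ["B", "I"]                      # tag list from the model header

# The first four weights from the dump; under the assumption above they
# belong to the bigram entry "0 B" (ids 0..3).
WEIGHTS_HEAD = [
    -0.0799963416235706,
    0.4346315510326526,
    -0.1044728887459596,
    -0.2501623206703318,
]


def unigram_index(feature_id, tag):
    """Weight index of a unigram feature for the given output tag."""
    return feature_id + TAGS.index(tag)


def bigram_index(feature_id, prev_tag, tag):
    """Weight index of a bigram feature for the transition prev_tag -> tag."""
    return feature_id + TAGS.index(prev_tag) * len(TAGS) + TAGS.index(tag)


if __name__ == "__main__":
    # "4 U00:_B-1" expands to two functions, one per tag: ids 4 and 5.
    print(unigram_index(4, "B"), unigram_index(4, "I"))
    # "0 B" expands to four functions, one per tag pair: ids 0..3.
    for prev in TAGS:
        for cur in TAGS:
            idx = bigram_index(0, prev, cur)
            print(prev, "->", cur, WEIGHTS_HEAD[idx])
```

This also explains the spacing of the ids in the dump: the bigram entry takes ids 0 to 3, so the first unigram entry starts at 4, and every unigram entry after that advances the id by 2.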