<?xml version="1.0" encoding="UTF-8"?> <rss
version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
> <channel><title>素心如何天上月 &#187; CRF</title> <atom:link href="http://yongsun.me/tag/crf/feed/" rel="self" type="application/rss+xml" /><link>http://yongsun.me</link> <description>Yong Sun&#039;s Blog</description> <lastBuildDate>Mon, 19 Mar 2012 02:29:22 +0000</lastBuildDate> <language>en</language> <sy:updatePeriod>hourly</sy:updatePeriod> <sy:updateFrequency>1</sy:updateFrequency> <generator>http://wordpress.org/?v=3.3.2</generator> <item><title>Experimenting with CRF++</title><link>http://yongsun.me/2008/03/%e5%ae%9e%e9%aa%8ccrf/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=%25e5%25ae%259e%25e9%25aa%258ccrf</link> <comments>http://yongsun.me/2008/03/%e5%ae%9e%e9%aa%8ccrf/#comments</comments> <pubDate>Thu, 20 Mar 2008 08:22:50 +0000</pubDate> <dc:creator>yongsun</dc:creator> <category><![CDATA[Input Method]]></category> <category><![CDATA[CRF]]></category> <category><![CDATA[NLP]]></category> <guid
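isPermaLink="false">http://yongsun.wordpress.com/2008/03/20/%e5%ae%9e%e9%aa%8ccrf/</guid> <description><![CDATA[Using Dr. Zhao Hai's 6 tags + 6 templates method, I ran experiments on the corpora released for bakeoff2005. I wrote a simple Python conversion script to convert the UTF-8 encoded training corpora into the format supported by CRF++. The MSR corpus is 24 MB after conversion; training took about 26 hours and produced a 25 MB model, which reaches an F-score of 96% on the MSR test data (measured with a Python evaluation script) but only a little over 82% on the PKU test data. The PKU corpus is 11 MB after conversion; training took nearly 13 hours and produced a 14 MB model, which scores a little over 92% on the PKU test data but only about 82% on the MSR test data. It seems the MSR and PKU training corpora differ considerably in segmentation style, which is why the cross-test scores are low. Also, probably because of a thread-safety problem in the C++ STL, multithreaded training SEGFAULTed on Linux, Solaris, and Mac OS, so all models were trained with a single thread. I dare not imagine how long it would take, or how much memory it would need, with a corpus of several hundred megabytes... Below is the definition of the feature templates: # Unigram U00:%x[-1,0] U01:%x[0,0] U02:%x[1,0] U03:%x[-1,0]/%x[0,0] U04:%x[0,0]/%x[1,0] U05:%x[-1,0]/%x[1,0] # Bigram B]]></description> <content:encoded><![CDATA[<p>Using <a
class="snap_shots" href="http://cwseg.spaces.live.com/blog/">Dr. Zhao Hai's</a> 6 tags + 6 templates method, I ran experiments on the corpora released for <a
class="snap_shots" href="http://www.sighan.org/bakeoff2005/">bakeoff2005</a>. I wrote a simple Python <a
href="http://yongsun.me/wp-content/uploads/2009/08/crfconv.py">conversion script</a> to convert the UTF-8 encoded training corpora into the format supported by <a
href="http://crfpp.sourceforge.net/">CRF++</a>. The MSR corpus is 24 MB after conversion; training took about 26 hours and produced a 25 MB model, which reaches an F-score of 96% on the MSR test data (measured with a Python <a
href="http://yongsun.me/wp-content/uploads/2009/08/crfeval.py">evaluation script</a>) but only a little over 82% on the PKU test data. The PKU corpus is 11 MB after conversion; training took nearly 13 hours and produced a 14 MB model, which scores a little over 92% on the PKU test data but only about 82% on the MSR test data. It seems the MSR and PKU training corpora differ considerably in segmentation style, which is why the cross-test scores are low.</p><p>Also, probably because of a thread-safety problem in the C++ STL, multithreaded training SEGFAULTed on Linux, Solaris, and Mac OS, so all models were trained with a single thread. I dare not imagine how long it would take, or how much memory it would need, with a corpus of several hundred megabytes...</p><p>Below is the definition of the feature templates:</p><p><code># Unigram<br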
/> U00:%x[-1,0]<br
/> U01:%x[0,0]<br
/> U02:%x[1,0]<br
/> U03:%x[-1,0]/%x[0,0]<br
/> U04:%x[0,0]/%x[1,0]<br
/> U05:%x[-1,0]/%x[1,0]<br
/> # Bigram<br
/> B<br
/> </code></p> ]]></content:encoded> <wfw:commentRss>http://yongsun.me/2008/03/%e5%ae%9e%e9%aa%8ccrf/feed/</wfw:commentRss> <slash:comments>13</slash:comments> </item> <item><title>A Beginner&#039;s Note of CRF++</title><link>http://yongsun.me/2008/03/a-beginners-note-of-crf/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=a-beginners-note-of-crf</link> <comments>http://yongsun.me/2008/03/a-beginners-note-of-crf/#comments</comments> <pubDate>Mon, 17 Mar 2008 14:02:05 +0000</pubDate> <dc:creator>yongsun</dc:creator> <category><![CDATA[Input Method]]></category> <category><![CDATA[CRF]]></category> <category><![CDATA[NLP]]></category> <guid
isPermaLink="false">http://yongsun.wordpress.com/2008/03/17/a-beginners-note-of-crf/</guid> <description><![CDATA[Thanks to Yandong's help and guidance, I now have some basic ideas about CRF (Conditional Random Field) and what a CRF model looks like. The encoder of CRF++, crf_learn, can generate a model in text format with the '-t' option. &#8230; <a
href="http://yongsun.me/2008/03/a-beginners-note-of-crf/">Continue reading <span
class="meta-nav">&#8594;</span></a>]]></description> <content:encoded><![CDATA[<p>Thanks to <a
class="snap_shots" href="http://yongsun.me/yydzero">Yandong's</a> help and guidance, I now have some basic ideas about CRF (Conditional Random Field) and what a CRF model looks like. The encoder of <a
class="snap_shots" href="http://crfpp.sourceforge.net/">CRF++</a>, crf_learn, can generate a model in text format with the '-t' option. Taking the Japanese word segmentation demonstration (example/seg) as an example, the following is the model in text format:</p><pre>version: 100
cost-factor: 1
maxid: 1386      /* the number of feature functions */
xsize: 1
B                /* the tag list; in this case, we have two tags */
I
U00:%x[-2,0]     /* unigram feature templates */
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-2,0]/%x[-1,0]/%x[0,0]
U06:%x[-1,0]/%x[0,0]/%x[1,0]
U07:%x[0,0]/%x[1,0]/%x[2,0]
U08:%x[-1,0]/%x[0,0]
U09:%x[0,0]/%x[1,0]
B               /* bigram feature template */
0 B             /* bigram of the tags for C_{-1} and C_0;  */
                /* number of features is (# of tags)^2.    */
4 U00:_B-1      /* _B-1 is the start of a sentence */
                /* _B+1 is the end of a sentence   */
6 U00:_B-2      /* _B-2 is the pre-token of _B-1  */
                /* _B+2 is the post-token of _B+1 */
8 U00:
10 U00:、       /* feature function id, template id, and observation */
12 U00:〇       /* since we only have two tags, each entry expands   */
14 U00:「       /* to 2 feature functions                            */
20 U00:う
... ...
... ...
1382 U09:３/年
1384 U09:９/３
-0.0799963416235706     /* the weight for each feature function; */
0.4346315510326526      /* a negative weight makes the tagging   */
-0.1044728887459596     /* less likely when the feature fires.   */
-0.2501623206703318     /* There are 1386 weights in total.      */
... ...</pre>]]></content:encoded> <wfw:commentRss>http://yongsun.me/2008/03/a-beginners-note-of-crf/feed/</wfw:commentRss> <slash:comments>6</slash:comments> </item> </channel> </rss>
