I wrote a simple Python script to extract the contents from the Sogou corpus.
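#!/usr/bin/python
# Python 2 script: prints the text of every <content> element as UTF-8.
import codecs
import sys

usage = """
Usage:
sogou_corpus_conv.py corpus_in_xml > contents_in_txt
"""

try:
    # Decode the corpus file as GB18030.
    file = codecs.open(sys.argv[1], "r", "GB18030")
except (IndexError, IOError):
    # Missing argument or unreadable file.
    print usage
    sys.exit(1)

for line in file:
    if line.startswith("<content>"):
        # Strip the opening tag, and the closing tag plus the trailing newline.
        start, end = len("<content>"), -len("</content>")-1
        # Drop the private-use character U+E525.
        line = line[start:end].replace(u'\ue525', '')
        print line.encode("UTF-8")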
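For readers on Python 3, the same extraction step could be sketched roughly as follows. This is not the original script: it assumes each article body sits on a single line wrapped in `<content>` and `</content>` tags, and the names `extract_contents` and `convert` are made up for illustration. Unlike the script above, it also checks for the closing tag before slicing.

```python
OPEN_TAG = "<content>"    # assumed tag names
CLOSE_TAG = "</content>"


def extract_contents(lines):
    """Yield the text between the content tags of each matching line."""
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(OPEN_TAG) and line.endswith(CLOSE_TAG):
            text = line[len(OPEN_TAG):-len(CLOSE_TAG)]
            # Drop the private-use character U+E525, as the script above does.
            yield text.replace("\ue525", "")


def convert(in_path, out_path):
    """Read a GB18030 corpus file and write one UTF-8 line per content element."""
    with open(in_path, "r", encoding="gb18030") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for text in extract_contents(src):
            dst.write(text + "\n")
```

Calling `convert("corpus_in_xml", "contents_in_txt")` would then play the same role as the shell redirection in the usage string.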
With the extracted contents, you can go on to build the SunPinyin SLM (statistical language model).
This input method seems more and more interesting...
How is the porting of SunPinyin to SCIM progressing?
Can't wait.
Hi, BlueF, the SCIM port is almost finished; the only missing feature is the configuration UI.