I wrote a simple Python script to extract the article contents from the Sogou corpus.
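#!/usr/bin/python
# Python 2 script: extracts article bodies from the Sogou corpus as UTF-8 text.
import codecs
import sys

usage = """
Usage:
sogou_corpus_conv.py corpus_in_xml > contents_in_txt
"""

try:
    # The corpus is read as GB18030.
    corpus = codecs.open(sys.argv[1], "r", "GB18030")
except (IndexError, IOError):
    print usage
    sys.exit(1)

for line in corpus:
    # Tag names are reconstructed on the assumption that each article body
    # sits on one line wrapped in <content>...</content>.
    if line.startswith("<content>"):
        # The extra -1 drops the trailing newline.
        start, end = len("<content>"), -len("</content>") - 1
        # Remove the private-use character U+E525 from the text.
        line = line[start:end].replace(u'\ue525', '')
        print line.encode("UTF-8")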
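For anyone on Python 3, the same extraction might look roughly like the sketch below. It is not part of the original script: the helper name `extract_content` is made up for illustration, and it assumes, as above, that each article body sits on a single line wrapped in `<content>...</content>`.

```python
#!/usr/bin/env python3
"""Python 3 sketch of the extraction; helper and tag names are assumptions."""
import sys

# Assumed tag names for the article body in the Sogou corpus.
OPEN_TAG, CLOSE_TAG = "<content>", "</content>"


def extract_content(line):
    """Return the text between the content tags, or None for other lines."""
    line = line.rstrip("\r\n")
    if not (line.startswith(OPEN_TAG) and line.endswith(CLOSE_TAG)):
        return None
    # Drop the private-use character U+E525, as the script above does.
    return line[len(OPEN_TAG):-len(CLOSE_TAG)].replace("\ue525", "")


def main(path):
    # Read the corpus as GB18030 and write the bodies out as UTF-8 bytes.
    with open(path, encoding="gb18030") as corpus:
        for line in corpus:
            text = extract_content(line)
            if text is not None:
                sys.stdout.buffer.write((text + "\n").encode("utf-8"))


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

Unlike the slicing in the script above, this version also checks for the closing tag, so a line that opens with `<content>` but is cut short is skipped rather than truncated.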
With the extracted contents, you can go on to build the SunPinyin SLM.