A simple script to extract the contents from the Sogou corpus

I wrote a simple Python script to extract the contents from the Sogou corpus.

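#!/usr/bin/python
# Python 2 script: extract the text of every <content> element from a
# Sogou corpus file (GB18030) and print it to stdout as UTF-8.
import codecs
import sys

usage = """
Usage:
    sogou_corpus_conv.py corpus_in_xml > contents_in_txt
"""

try:
    corpus = codecs.open(sys.argv[1], "r", "GB18030")
except (IndexError, IOError):
    # No argument given, or the file could not be opened.
    print usage
    sys.exit(1)

for line in corpus:
    if line.startswith("<content>"):
        # Slice off the opening tag, the closing tag and the trailing newline.
        start, end = len("<content>"), -len("</content>") - 1
        # Remove the private-use character U+E525 before printing.
        line = line[start:end].replace(u'\ue525', '')
        print line.encode("UTF-8")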

With the extracted contents, you can go on to build the SunPinyin SLM.
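
If you are on Python 3, the same extraction can be sketched roughly as follows. This is only a minimal sketch, not a drop-in replacement: it assumes that each document body sits on a single line wrapped in <content> and </content> tags, and the names extract_contents and convert are made up for illustration.

```python
import io

# Assumed tag names; adjust them if your corpus dump uses different ones.
OPEN_TAG = "<content>"
CLOSE_TAG = "</content>"


def extract_contents(lines):
    """Yield the text between the content tags for each matching line."""
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(OPEN_TAG) and line.endswith(CLOSE_TAG):
            text = line[len(OPEN_TAG):-len(CLOSE_TAG)]
            # Drop the private-use character U+E525, as the script above does.
            yield text.replace("\ue525", "")


def convert(in_path, out_path):
    """Read a GB18030 corpus file and write the extracted contents as UTF-8."""
    with io.open(in_path, "r", encoding="GB18030") as src, \
         io.open(out_path, "w", encoding="UTF-8") as dst:
        for text in extract_contents(src):
            dst.write(text + "\n")
```

Keeping the tag matching in extract_contents, separate from the file handling in convert, makes it easy to try the extraction on a few sample lines before running it over the whole corpus.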