A simple script to extract the contents from the Sogou corpus

I wrote a simple Python script to extract the contents from the Sogou corpus.

#!/usr/bin/python

import codecs
import sys

usage = """
Usage:
    sogou_corpus_conv.py corpus_in_xml > contents_in_txt
"""

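try:
    # The corpus is GB18030-encoded; decode it while reading.
    file = codecs.open(sys.argv[1], "r", "GB18030")
except (IndexError, IOError):
    # No argument given, or the file could not be opened.
    print usage
    sys.exit(1)

for line in file:
    # Each document body sits on one line: <content>...</content>
    if line.startswith("<content>"):
        # Cut off the opening tag, the closing tag and the trailing newline.
        start, end = len("<content>"), -len("</content>")-1
        # Drop the private-use character U+E525 before re-encoding as UTF-8.
        line = line[start:end].replace(u'\ue525', '')
        print line.encode("UTF-8")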
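
The script above is written for Python 2. For Python 3, the same extraction could be sketched roughly as follows. This is only a sketch under a few assumptions: the function names extract_contents and convert are made up for illustration, each document body is assumed to sit on a single line wrapped in <content> and </content>, and, unlike the script above, a line is skipped if the closing tag is missing.

```python
# Assumption: each document body is one line of the form <content>...</content>.
OPEN_TAG, CLOSE_TAG = "<content>", "</content>"


def extract_contents(lines):
    """Yield the text between <content> and </content> for each matching line."""
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(OPEN_TAG) and line.endswith(CLOSE_TAG):
            text = line[len(OPEN_TAG):-len(CLOSE_TAG)]
            # Drop the private-use character U+E525, as the original script does.
            yield text.replace("\ue525", "")


def convert(in_path, out_path):
    """Read a GB18030 corpus file and write the extracted contents as UTF-8."""
    with open(in_path, encoding="gb18030") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for text in extract_contents(src):
            dst.write(text + "\n")
```

Calling convert("corpus_in_xml", "contents_in_txt") would then play the role of the command line shown in the usage string, with both file names being placeholders.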

With the extracted contents, you can go on to build the SunPinyin SLM.
