SunPinyin Code Tour (6)

6. Using the generated language model

In previous step, we generated the threaded language model. The next question is how to use this model? Let's look at the header file of CThreadSlm class, slm.h, here are the public methods:

  1. load(): Load the language model to memory
  2. free(): Release the memory allocated for the language model
  3. isUseLogPr(): If the language model use log values for probabilities
  4. transfer(history, wid, result): By given a word id wid, transfer from history state (history) to new state (result), and return P(wid|history).
  5. transferNegLog(history, wid, result): Similar with above method, but returns -log(p(wid|history).
  6. history_state_of(st): Get the h' of st, and return.
  7. historify(st): Set st to its h'.
  8. lastWordId(st): Return the last word id of state st. Assuming st is (lvl, idx), if lvl>=N, then return m_Levels[N][idx]; if 0<lvl<N, return m_level[lvl][idx]; if level==0 and idx==0, return the word id on pseudo root (i.e., 0), if idx>0 just return idx (we will discuss this situation when introducing history cache.

Now, let's look at slmseg to see how to searching by leveraging CThreadSlm class.

$ make slmids3
./slmseg -d ../raw/dict.utf8 -f bin -s 10 -m ../data/lm_sc.t3g ../raw/corpus.utf8 >../swap/lm_sc.ids

The source code of slmseg/slmseg.cpp:

struct TLatticeWord {
  int m_left;
  int m_right;   //[m_left, m_right) is the index range on lattice for this word
  int m_wordId;  //word id
};

struct TLatticeStateValue{
  double             m_pr;       //the probability on this state node
  TLatticeWord*      mp_btword;  //the back-trace word, i.e., the given word causing the transformation
  CThreadSlm::TState m_btstate;  //the back-trace state, i.e., where the current state comes from
};

/* The mapping of SLM state node, and its Lattice state information */
typedef std::map<CThreadSlm::TState, TLatticeStateValue> TLatticeColumnStates;

/* represent a column of lattice */
struct TLatticeColumn {
  TLatticeWordVec m_wordstarting;  //
The words starting from this column, these are actually the input words when expending this column,
                                   //the newly transferred state node, located at lattice[word.m_right].
  TLatticeColumnStates m_states;   //the SLM state nodes and their corresponding Lattice states
};

processSingleFile():

Read the corpus sentence by sentence. For each sentence, call buildLattice() to build the searching lattice, and call searchBest() to search on the lattice, and get the best segmentation by getBestPath(), and call output() to write out the best segmentation.

buildLattice(sentence, lattice):

Set the SLM state on lattice[0] as pseudo root. Initialize i as 0, perform FMM for sntnc[i..n], get the word length -- len. And call getAmbiLen() to get the length of maximum length of crossing-ambiguities -- ambilen. If ambilen <= len, then there is no crossing-ambiguities, then call insertLatticeWord() to insert ([i, i+len), Wid) to the word list of lattice[i]; set i=i+len, continue the loop. If there is crossing-ambiguities, call fullSegbuildLattice() to perform a full segmentation for sub-sentence sntnc[i..i+ambilen), for all segmented words (including all single-character word), assuming one's location is [left, right), insert it to lattice[left]; and set i=i+ambilen, continue the loop. After the iteration is done, add an end-of-sentence id (we use "。" for this id in following examples) at the end.

Use the example in section 2,

Perform the FMM for this sample sentence, the first word we get is "发扬", and no ambiguity, so insert "发扬" to lattice[0]. Then i=2, continue the loop. The FMM result is "为人民", but the ambilen is 6, so we need perform full segmentation for sntnc[2..7]. Insert all possible segments on idx (2<=idx<7) to latttice[idx]. Then i=2+6=8, continue the loop. The FMM result is "的", and no ambiguity. Then i=9, the FMM result is "精神" and no ambiguity. After the loop is finished, add an end-of-sentence ID at the tail. We could see, there is no word on lattice[1], and the words on lattice[2] are "为", "为人", and "为人民". And so on...

searchBest(lattice):

Iterate the lattice from 0. For all lattice state nodes on lattcie[i], use the words starting from this lattice ([left, right), Wid) to expand, save the newly transferred state node h' (CThreadSlm::historify(his)) to lattice[word.m_right].m_states. Note, given a word D, two different state nodes (A, C) and (B, C) may result in the same h' (i.e., (C, D)), so we only need to keep the path who has bigger probability.

Let's start from lattice[0], we have set it as pseudo root (0, 0) in buildLatice() method. There is only one word on lattice[0], "发扬"; call CThreadSlm::trnasferNegLog((0,0), wid_of("发扬")), get the new SLM state node and the probability P("发扬"), save its h' (lvl=1, idx) and its lattice state information (probability and back-trace information) to lattice[2]. Then look at lattice[1], there is no state node to be expanded. Then going forward to lattice[2], there is one state node, and three words ("为", "为人", "为人民"), save the newly transferred three nodes to lattice[3], lattice[4] and lattice[5]. And so on...

This is a dynamic programming problem, and what we used here is Viterbi Lattice algorithm.

getBestPath(lattice, segResult):

From end to start, back-trace the lattice, save the word id on best path to segResult. By reversing segResult, we got the best segmentation. For our example, it's the blue path on above figure.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

To submit your comment, click the image below where it asks you to...
Clickcha - The One-Click Captcha