5. Threading the language model
Let's look at the last step, threading (or adding back-off
indices) to the pruned language model.
$ make m3_thread
./slmthread ../swap/lm_sc.3gm ../data/lm_sc.t3g
The goal of threading is to accelerate the searching in language model. Take trigram as an example, if we want to calculate the
probability of sentence "ABCDE。", P(S) = P(A).P(B|A).P(C|AB).P(D|BC).P(E|CD).P(<EOS>|DE), and assume
no back-off needed when evaluating all these conditional probabilities.
First, we get P(A) from level (unigram) via a binary search; and find B from A's children via a binary search, to get P(B|A); and then find C from B's children via another binary search, to get P(C|AB). The next step is to calculate P(D|BC). Then we need go back to unigram to find B, and then get C from B's children, then get P(D|BC), all these done by binary searches. Obviously, the performance is low. When we evaluating P(C|AB), if we could directly locate to (B, C) from node C, we could get the search much faster.
We use (level, idx) to denote a node, the information on this node is described as (W, h', p, b, c):
- W：word id. The location of this node implies its history information.
- h'：The back-off node, described as (level', idx').
- c：The start index of child node in next level.
Certainly, there is no b and c on leaf node. Now, the back-off structure becomes a graph from a tree. Its basic operation is: on
current history state hi(level,idx), by given an input word W, it transfer to another history state hj(level',
idx'), and return P(W|h). The level' on hj, not always equals to level-1. E.g., C(ABC)>0, but (B, C) is not seen in
training, then the h' on this node is the C node on level. If C is not seen either, h' is then the pseudo root (level).
Let's look at the following snippet,
level+1 h (level,idx) -----------> W1,p1,h'1,[b,c] h',bow \ ... ... \ Wi,pi,h'i,[b,c] \ ... ... --------> Wk,pk,h'k,[b,c]
Given an input word W. If W is one child of h, denoted as N (level+1, idx'), its p value is just P(W|h).
- If N has children, then N is the new history node;
- If not, the h' on N becomes the new history node.
If W is not child of h, execute the same processing from h' on (level, idx), find the new history node, and times the probability
with bow(h), return it as P(W|h). Note, it's a recursive process.
Let's look at the code:
The m_history member (std::vector<int>) holds the indices for each level, save the word ID on this node to passed in argument history. Method next() increase the last element in m_history (i.e., the index in current level), and call adjustIterator() to find the preceding nodes, save the indices to m_history.
CSIMSlmWithIteration::findBackOffState(n, hw, bol, bon):
Find the back-off node (h') of a specified n-gram (held in hw[1..n]), and return its level and idx. Call findState() to get the idx of hw[2..n] on level[n-1], if the index is not less than 0 then this node (n-1, idx) does have child node, return the location of h'. Otherwise, call findState() to get the idx of hw[3..n] on level[n-2], ... If the loop reach hw[n], return the pseudo root.
E.g., find the back-off node for trigram (A, B, C). Find out if (B, C) exists, if so, return (2, idx_BC). Otherwise, find out if (C) exists, if so return (1, idx_C). Otherwise, return (0, 0).
Find the index of specified n-gram (held in hw[1..n]) on level[n]. If it does not exist, return -1.
We also perform the compression for 32-bits floating numbers when threading. bow values are compressed to 14 bits, pr values are compressed to 16 bits. The basic idea is to count all bow (or pr) values, combine the clustering floating numbers, make the total number is under 1<<BITS_BOW (or 1<<BITS_PR); in the generated language model binary file, there are two floating
tables, we need to search the table to get the original value. I'm not so clear about this algorithm, hopes the original author
Phill Zhang to introduce more details.
Now, we get the final language model -- lm_sc.t3g. You could use tslminfo to look at the data in plain text format:
$ make m3_tslminfo
./tslminfo -p -v -l ../raw/dict.utf8 ../data/lm_sc.t3g
$ tail ../swap/lm_sc.t3g.arpa
点击 率 也 0.036363638937 (2,839922)
点击 率 最高 0.081818044186 (2,840054)
点击 率 的 0.096969485283 (2,840080)
点击 率 达到 0.036363638937 (2,840122)
点击 量 达 0.485714286566 (2,1132198)
点击 鼠标 ， 0.400000005960 (1,1)
点击 鼠标 左 0.309090882540 (2,1186378)
率队 取得 <unknown> 0.479999989271 (2,366031)
率队 在 <unknown> 0.130769222975 (2,431213)
率队 打 了 0.479999989271 (2,642183)
We could use CThreadSlm to access or use the language model. In next section, we will use slmseg as an example, to see how we construct the lattice and search by leveraging the generated language model.