### 2. Counting Trigrams

After tokenizing the corpus, the next task is to count the occurrences of each trigram:

`$ make m3_idngram`

`./ids2ngram -n 3 -s ../swap/swap -o ../swap/lm_sc.id3gram -p 5000000 ../swap/lm_sc.ids`

Read the ID stream file generated in the previous step. First read N-1 IDs (in the trigram case, 2) and save them to the ids array (a member of the ngram object). Then read one more ID and save it as the last element (i.e., ids[N-1]) of the ids array. Using ids[0..2] as the key, invoke operator [] of the std::map to retrieve the occurrence count for this trigram (if the key was never seen, operator [] inserts the <key, 0> pair into the map), and increment the count. Shift the ids array left by one cell, and repeat the above processing.

During the iterations, we also monitor whether the size of the map has reached the maximum number (paraMax); this prevents the map from occupying too much memory. When it reaches the maximum, we write the map out to a swap file, record the offset of this paragraph, clear the map, and continue counting. Note that std::map is an ordered container whose internal structure is a balanced binary search tree (usually a red-black tree), so each paragraph written out is already sorted. We then need to perform a merge sort over all the paragraphs.

Merge sort. To merge-sort several ordered data sources, find the minimal (or maximal) element among the heads of the sources and output it, then advance the head of the affected source to its next element; repeat until the set of head elements is empty. If there are many data sources to be merged, sorting the head elements becomes the critical step for performance. In SunPinyin, a template class handles this multi-way merging, i.e., slm/sim_fmerge.h. It uses a heap to sort the head elements: getBest() returns the paragraph with the minimal head element, and next() adds that paragraph's next element to the heap.
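The same idea can be shown with a minimal heap-based k-way merge. This is a simplified stand-in for slm/sim_fmerge.h, merging in-memory integer sequences instead of file-backed paragraphs:

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Multi-way merge: keep the head element of every sorted source in a
// min-heap. Popping the heap top plays the role of getBest(); pushing
// the source's following element plays the role of next().
std::vector<int> mergeParagraphs(const std::vector<std::vector<int>>& srcs) {
    using Head = std::pair<int, size_t>;  // (value, source index)
    std::priority_queue<Head, std::vector<Head>, std::greater<Head>> heap;
    std::vector<size_t> cursor(srcs.size(), 0);

    for (size_t s = 0; s < srcs.size(); ++s)  // seed with initial heads
        if (!srcs[s].empty()) heap.push({srcs[s][0], s});

    std::vector<int> merged;
    while (!heap.empty()) {
        auto [val, s] = heap.top();  // minimal head element
        heap.pop();
        merged.push_back(val);
        if (++cursor[s] < srcs[s].size())  // advance that source
            heap.push({srcs[s][cursor[s]], s});
    }
    return merged;
}
```

With k sources, each output element costs O(log k), instead of O(k) for a linear scan of the heads.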

Finally, we get a raw trigram model without any smoothing, i.e., all trigrams with their occurrence counts.

### 3. Build the back-off based n-gram language model

The next step is to build a back-off based n-gram model:

`$ make m3_slm`

`./slmbuild -n 3 -o ../swap/lm_sc.3gram -w 120000 -c 0,2,2 -d ABS,0.0005 -d ABS -d ABS,0.6 -b 10 -e 9 ../swap/lm_sc.id3gram`

First, let's look at the definition of an n-gram.

For statistics-based NLP (natural language processing), the probability of a sentence (S = W_{1},W_{2},...W_{n}), according to the chain rule, is:

P(S) = P(W_{1}) \cdot P(W_{2}|W_{1}) \cdot P(W_{3}|W_{1},W_{2}) \cdots P(W_{n}|W_{1},W_{2},...W_{n-1}) = P(W_{1}) \prod_{i=2}^{n} P(W_{i}|W_{1},W_{2},...W_{i-1})

(To express mathematical equations in text mode, we use a LaTeX-like notation here.)

In reality, however, due to data sparseness it is impossible to calculate the probability this way. A practical method is to assume that P(W_{i}|W_{1},W_{2},...W_{i-1}) depends only on the previous N words, i.e., W_{i-N+1},W_{i-N+2},...W_{i-1}. In particular, we have unigram (N=0, i.e., no context), bigram (N=1), trigram (N=2), and fourgram (N=3). The most commonly used is trigram.

Let's have a look at several useful terms:

- types: the size of the vocabulary (or dictionary)
- tokens: the size of the corpus
- vector/solution space: for an N-gram model, V = types^N
- data sparseness: tokens << V, i.e., tokens is much smaller than V
- maximum likelihood: P_{ml}(W_{i}|W_{i-2},W_{i-1}) = C(W_{i-2},W_{i-1},W_{i}) / C(W_{i-2},W_{i-1}), where C stands for the occurrence count. That is, the maximum likelihood estimate of P(W_{i}|W_{i-2},W_{i-1}) equals the ratio of the occurrence counts of (W_{i-2},W_{i-1},W_{i}) and (W_{i-2},W_{i-1}).
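The maximum-likelihood estimate above is just a ratio of raw counts. A toy illustration (the count maps here are hypothetical, not SunPinyin's data structures):

```cpp
#include <array>
#include <cassert>
#include <map>

// P_ml(w | u, v) = C(u, v, w) / C(u, v): ratio of the trigram count to
// the count of its two-word history.
double mlEstimate(const std::map<std::array<unsigned, 3>, unsigned>& tri,
                  const std::map<std::array<unsigned, 2>, unsigned>& bi,
                  unsigned u, unsigned v, unsigned w) {
    auto t = tri.find({u, v, w});
    auto b = bi.find({u, v});
    if (t == tri.end() || b == bi.end() || b->second == 0)
        return 0.0;  // unseen trigram or history: the ML estimate is 0
    return static_cast<double>(t->second) / b->second;
}
```

Note the hard zero for unseen events; this is exactly the problem that smoothing, discussed next, has to fix.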

Many possible word sequences may not appear in the training corpus. If P_{ml}(W_{i}|h) is 0, then the probability of the entire sentence becomes 0. E.g., if we only see "Tom read a book" in the corpus, and nothing like "Yong read xxx", then P_{ml}(Yong read a book) = 0. So we need to smooth the model. Smoothing reallocates some probability mass from known events to unknown events (i.e., events whose occurrence count is 0).

Some smoothing methodologies:

- Simple smoothing: add-one (or add-delta); poor performance if used alone
- Discounting smoothing: e.g., absolute discounting, linear discounting, Witten-Bell, Good-Turing
- Composite smoothing: back-off smoothing and interpolated smoothing

A general back-off model can be expressed as follows:

P_{s}(W_{i}|h) = P_{d}(W_{i}|h)                    if C(h,W_{i}) > 0
P_{s}(W_{i}|h) = bow(h) \cdot P_{s}(W_{i}|h')      if C(h,W_{i}) == 0

- h' is the history h truncated by its first word. For a trigram (A,B,W_{i}), h is (A,B), so h' is (B).
- P_{d}(W_{i}|h) < P_{ml}(W_{i}|h): the ML estimate is discounted; many discounting methods can be used.
- bow(h) is the back-off weight. For a given h, bow(h) is a **constant**, determined by the constraint P_{s}(W_{1}|h) + P_{s}(W_{2}|h) + ... + P_{s}(W_{n}|h) = \sum_i P_{s}(W_{i}|h) = 1.
- This is a recursive expression: if the occurrence count of (h',W_{i}) is also 0, the back-off continues, possibly down to the unigram W_{i}, or even to the uniform distribution (if W_{i} itself is unseen).
- If h was not seen in the training stage, P(W_{i}|h) = P(W_{i}|h').

Katz's back-off model uses Good-Turing discounting. Kneser-Ney is another back-off model, whose performance is better than Katz's. Stanley Chen and Joshua Goodman wrote a paper discussing the pros and cons of different smoothing methods; you can refer to it for more details.

Let's look at the code.

Initialize the level array according to parameter n: level[0] is the pseudo root, used for the uniform (average) distribution; level[1] is for unigram, level[2] for bigram, and level[3] for trigram... For a trigram model, the nodes on level[3] are leaf nodes. The leaf nodes carry no bow information, since they correspond to the case C(h,W_{i}) > 0.
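The layout can be pictured like this. The struct below is a hypothetical sketch, not the actual node type in slmbuild.h; it only shows the idea of per-level node vectors with child offsets into the next level:

```cpp
#include <cassert>
#include <vector>

// Illustrative node: each k-gram node records its word id, its
// accumulated count, and where its children start in level[k+1].
struct Node {
    unsigned id    = 0;  // word id
    unsigned freq  = 0;  // accumulated occurrence count
    unsigned child = 0;  // index of first child in the next level
};

// level[0] holds the single pseudo root (uniform distribution);
// level[k] holds all k-gram nodes, up to the leaf level level[n].
std::vector<std::vector<Node>> makeLevels(int n) {
    std::vector<std::vector<Node>> level(n + 1);
    level[0].push_back(Node{});  // the pseudo root
    return level;
}
```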

**CSlmBuilder::AddNGram(ngram, fr)**：

Call isExcludeId() to check whether the first word of ngram is an excluded word (in this case it's 9, the AMBI-ID). For every vector in the array level[1..n], check whether its reserved space is used up; if so, allocate more memory for it. If the first word is not an excluded word, add its occurrence count to level[0][0] (i.e., the pseudo root). Add each word id in ngram to level[i] (0 < i <= n), accumulating the occurrence count in the parent node. For level[n] (i.e., the leaf level), only the ngrams whose occurrence counts exceed the specified cut threshold are added; this avoids the number of leaf nodes growing so large (possibly tens of millions) that they cannot be kept in memory.

In this function we also check: if ngram[i] (0 < i <= n-1) is an excluded word, then only ngram[0..i-1] is counted; if ngram[i] (0 <= i <= n-1) is a sentence stop (as we specified, it's 10), then only ngram[0..i] is counted. E.g., the trigram (9, x, y) is ignored entirely; for (x, 9, y), only the unigram (x) is counted; for (10, x, y), the unigram (10) is counted; for (x, 10, y), the unigram (x) and the bigram (x, 10) are counted.

This is the entry point for building the back-off based n-gram model.

Count the total occurrences of the n-grams whose occurrence count is less than SLM_MAX_R (i.e., 16). E.g., nr[3][0] is the total number of trigrams, and nr[3][10] is the total occurrence count of the trigrams that occur exactly 10 times (so it is necessarily a multiple of 10, e.g., 500). These data are used when initializing the discounters.

Add a tail node to each level, purely for convenience when iterating.

Delete the n-grams whose occurrence counts are less than the specified thresholds. The thresholds we specified (-c 0,2,2) mean we ignore the bigrams and trigrams whose occurrence counts are less than 2, but do not cut any unigrams.

Initialize the discounter for each level. Call DiscountOneLevel() to perform discounting for each level, from higher to lower. Then set level[0] (the pseudo root) to the uniform distribution: its probability is 1 divided by the number of types (specified by the -w flag, 120000 in this example).

**CSlmBuilder::DiscountOneLevel(v, ch, disc ...)**：

v is the previous level; e.g., if we are discounting level[3], the first argument is level[2]. To discount level[m], iterate over every node in level[m-1] and discount its child nodes (the range [it->child, itnext->child)). The numbers subtracted here are frequencies, not probabilities. Then divide by the frequency of the parent node to obtain the conditional probability, and save it in the original node.
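For absolute discounting (the -d ABS flags), one level's worth of this can be sketched as follows. This is a simplified, hypothetical version of the per-node computation, not the actual CSlmBuilder code:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Node {
    unsigned freq;  // raw occurrence count
    double   pr;    // discounted conditional probability (output)
};

// Absolute discounting: subtract a constant d from each child's raw
// frequency, then divide by the parent's frequency to turn counts into
// conditional probabilities. The mass removed (d per seen child) is
// what later funds the back-off weight.
void discountChildren(std::vector<Node>& children,
                      unsigned parentFreq, double d) {
    for (Node& c : children) {
        double discounted = c.freq > d ? c.freq - d : 0.0;
        c.pr = discounted / parentFreq;
    }
}
```

After discounting, the children's probabilities sum to less than 1; the leftover mass is redistributed via bow(h).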

Calculate the back-off weights, from the higher levels to the lower. base[i] refers to the first element of level[i], and idx[i] is the cursor within level[i], so (base[i])[idx[i]] is the node being visited. (Here we rely on the fact that the memory allocated by std::vector is contiguous, so we can use a pointer or an array to access the elements.)

Take calculating the BOWs on level[2] as an example. Then lvl is 2, and we iterate over every word in base[2]. First, find the parent node of this word with a for loop; e.g., if idx[2] refers to level[2][1] (C) in the diagram above, then its parent is level[1][0] (A). Put them into the words[0..4] array, giving {0, A, C, .}. Then call CalcNodeBow() to get the BOW for this node; the last two parameters of this function delimit the range of child nodes, [ch+node.child, ch+nodenext.child).

**CalcNodeBow(...)**：

Iterate over the nodes in [chh, cht), accumulating each probability into the variable sumnext; and call builder->getPr(lvl, words+2) to get a probability, accumulating it into the variable sum. The actual effect of words+2 is to truncate the first word of the history. Using the example above, on the first iteration words[0..4] = {0, A, C, D}, so words+2 is {C, D}, and the probability returned by getPr() is P_{s}(D|C). (Note, getPr() itself is a recursive function; when lvl is 0, it returns the uniform distribution probability.) The BOW is then calculated as (1.0 - sumnext) / (1.0 - sum).

As we explained, the BOW is determined by the equation \sum_i P_{s}(W_{i}|h) = 1. From the following transformations, you can see the meanings of sumnext and sum.
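Spelling out those transformations (the standard derivation of the back-off weight from the normalization constraint, consistent with the back-off definition above):

```latex
1 = \sum_i P_s(W_i|h)
  = \sum_{i:\,C(h,W_i)>0} P_d(W_i|h)
    \;+\; \mathrm{bow}(h) \sum_{i:\,C(h,W_i)=0} P_s(W_i|h')

\Rightarrow\quad
\mathrm{bow}(h)
  = \frac{1 - \sum_{i:\,C(h,W_i)>0} P_d(W_i|h)}
         {\sum_{i:\,C(h,W_i)=0} P_s(W_i|h')}
  = \frac{1 - \sum_{i:\,C(h,W_i)>0} P_d(W_i|h)}
         {1 - \sum_{i:\,C(h,W_i)>0} P_s(W_i|h')}
```

The last step uses \sum_i P_s(W_i|h') = 1, so the sum over unseen words equals one minus the sum over seen words. Hence sumnext accumulates \sum_{C>0} P_d(W_i|h) and sum accumulates \sum_{C>0} P_s(W_i|h'), giving BOW = (1.0 - sumnext) / (1.0 - sum).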