### 4. Entropy-based Pruning

In this section, we discuss pruning for back-off n-gram language models.

`$ make m3_prune`

`$ ./slmprune ../swap/lm_sc.3gram ../swap/lm_sc.3gm R 100000 1250000 1000000`

Let's look at the definition of entropy. Assume p(x) is the probability density function of a random variable X; then the entropy of X is:

$$H(p) = -\sum_{x} p(x)\log p(x) = \sum_{x} p(x)\log\frac{1}{p(x)} = E\left[\log\frac{1}{p(X)}\right]$$

Note: the logarithm base is 2, and 0 log 0 = 0.

Entropy measures the determinability of random events: the larger the entropy, the lower the determinability. The last part of the above equation shows that entropy is the expectation of log(1/p(X)). If p(X=x_{i}) equals 1/256, then log(1/p(x_{i})) = log(256) = 8, which means it requires 8 bits to encode the event x_{i}. H(p) is the expectation (weighted average) of this quantity, i.e., the average information length.

Let's look at the relative entropy (i.e., the Kullback-Leibler distance):

$$D(p\|q) = \sum_{x} p(x)\log\frac{p(x)}{q(x)}$$

It means: if the true distribution of random variable X is p(x), but we encode X with a code optimized for q(x), D(p||q) is the average number of extra bits we will use.

The conditional relative entropy is calculated as:

$$D\big(p(y|x)\,\|\,q(y|x)\big) = \sum_{x} p(x) \sum_{y} p(y|x)\log\frac{p(y|x)}{q(y|x)}$$

Note: for a proof of this equation, please refer to "Elements of Information Theory".

Now let's look at how to use relative entropy to prune a back-off language model. The following analysis comes from "Entropy-based Pruning of Backoff Language Models".

Recall the general form of a back-off model:

$$P_s(w_i|h) = \begin{cases} P_d(w_i|h) & C(h,w_i) > 0 \\ \mathrm{bow}(h)\cdot P_s(w_i|h') & C(h,w_i) = 0 \end{cases}$$

where h' is the history h with its earliest word dropped.

The goal of pruning is to remove some n-grams that have explicit estimates (i.e., C(h,w_i) > 0), reducing the number of parameters while minimizing the performance loss. (Note: after pruning, the probabilities of the remaining explicitly estimated n-grams do not change.)

Pruning Steps:

- Select a threshold.
- Compute the relative entropy increase of pruning each n-gram individually.
- Remove all n-grams that raise the relative entropy by less than the threshold, then re-calculate the back-off weights.

Assume we remove (h, w) from the model. When we then want to calculate p(w|h), the backed-off (or implicit) estimate is required. To guarantee the model remains properly normalized, i.e., to make sure sum_{w_i} p(w_i|h) = 1, bow(h) must be re-calculated; we name the new value bow'(h), so that p'(w|h) = bow'(h)·p(w|h'). Meanwhile, all backed-off estimates involving h are also affected; we use the notation BO(w_i,h) to denote this case (it is exactly the circumstance where C(h,w_i) == 0). We can get:

$$D(p\|p') = -\sum_{h_j} p(h_j) \sum_{w_i} p(w_i|h_j)\log\frac{p'(w_i|h_j)}{p(w_i|h_j)} = -p(h)\sum_{w_i} p(w_i|h)\log\frac{p'(w_i|h)}{p(w_i|h)}$$

Note that h and w are given and concrete. Besides (h, w), the probabilities of the remaining explicitly estimated n-grams do not change, so those terms of the summation cancel out.

Finally (please refer to the cited paper for the details), we get:

$$D(p\|p') = -p(h)\left\{ p(w|h)\left[\log p(w|h') + \log \mathrm{bow}'(h) - \log p(w|h)\right] + \left[\log \mathrm{bow}'(h) - \log \mathrm{bow}(h)\right] \sum_{w_i:BO(w_i,h)} p(w_i|h) \right\}$$

In the above equation, p(h={h_{1},h_{2},...}) = p(h_{1})·p(h_{2}|h_{1})···, and we also know:

$$\mathrm{bow}(h) = \frac{1 - \sum_{w_i: C(h,w_i)>0} p(w_i|h)}{1 - \sum_{w_i: C(h,w_i)>0} p(w_i|h')} = \frac{\sum_{w_i: BO(w_i,h)} p(w_i|h)}{\sum_{w_i: BO(w_i,h)} p(w_i|h')}$$

The next step is to calculate bow'(h):

$$\mathrm{bow}'(h) = \frac{\sum_{w_i: BO(w_i,h)} p(w_i|h) + p(w|h)}{\sum_{w_i: BO(w_i,h)} p(w_i|h') + p(w|h')}$$

To calculate bow'(h), we drop the term for the pruned n-gram (h, w) from the summation in both the numerator and the denominator. Since bow(h) is known, once we have the numerator we can derive the denominator; then we add p(w|h) to the numerator and p(w|h') to the denominator. This gives bow'(h).

Let's look at the code:

slmprune.cpp:

**CSlmPruner::Prune()**:

Call PruneLevel(lvl) from level[N] down to level[1] to prune each level. After pruning, call CalcBOW() to re-calculate the back-off weights for each node in level[0..N-1].

**CSlmPruner::PruneLevel(lvl)**:

sz[lvl] holds the number of nodes in level[lvl], including the pseudo tail, so the actual node count is sz[lvl]-1; assign it to n. cut[lvl] is the number of nodes to remove from this level. Allocate a TNodeInfo[n] array and assign it to pbuf. Iterate over each node in level[lvl]: get its preceding nodes with a for loop, put this n-gram into hw[0..lvl], and keep the index at each level in idx[0..lvl]. Check whether this node has children ((pn+1)->child > pn->child); if it does, it cannot be pruned. If it does not, call CalcDistance(lvl, idx, hw) to calculate the relative entropy increase from removing this node, and save the information into pbuf. Continue with the next node in level[lvl].

After the iteration, heap-sort the TNodeInfo array. Then iterate over the top cut[lvl] elements of the array, setting the probability at level[lvl][pinfo->idx] to 1.0. Then call CutLevel() to remove the nodes whose probabilities are 1.0 from level[lvl] and update the child indices of their parent nodes in level[lvl-1]. Assign the new size of level[lvl] to sz[lvl]. Free the allocated memory and return.

**CSlmPruner::CalcDistance(lvl, idx[], hw[0..lvl])**:

Firstly, get bow(h) from the node's parent and save it in variable BOW. Then calculate p(h) = p(hw_{1})·p(hw_{2}|hw_{1})···p(hw_{lvl-1}|hw_{1}hw_{2}...hw_{lvl-2}), and save it in variable PH. The probability on node level[lvl][idx[lvl]] is just p(w|h); save it to PHW. Call getPr(lvl-1, hw+2) to get the probability p(w|h'), and save it in variable PH_W. If cache_level is not lvl-1 (e.g., it is -1), or cache_idx is not idx[lvl-1] (e.g., it is -1), assign the proper indices to cache_level and cache_idx, and initialize cache_PA and cache_PB to 1.0. Iterate over the child nodes [parent->child, (parent+1)->child): subtract the probability on each child node (i.e., p(w_{i}|h)) from cache_PA; set hw[lvl] to the word id of the current child node, call getPr(lvl-1, hw+2) to get the probability (i.e., p(w_{i}|h')), and subtract it from cache_PB. After the iteration, PA = sum_{w_i:BO(w_i,h)} p(w_i|h) and PB = sum_{w_i:BO(w_i,h)} p(w_i|h'). Add p(w|h) to PA and p(w|h') to PB, then take the quotient PA/PB, i.e., bow'(h), and save it in variable _BOW. Finally, follow the above equation to calculate D(p||p'), and return it.

**CSlmPruner::CalcBOW(), CalcNodeBow(...)**:

They are similar to CSlmBuilder::CalcBOW() and CalcNodeBow(...) in the previous section, and are not repeated here.