Dealing with underflow in joint probability calculations
Been meaning to post this for a while, but it kept dropping to the bottom of the to-do list. There is a subtle bug in the code I posted earlier for splitting long strings into words.
The problem is that for most words, their probability of occurrence is extremely small. This means that when we have a sequence of several words, the probability of all of them occurring is
P(a, b, c) = P(a) * P(b) * P(c)
which is three very small numbers multiplied together, giving a far smaller number. If we multiply enough of these, we quickly run into the underflow problem — the internal floating-point representation of these small numbers does not have enough bits to represent their enormous smallness, and rounds them to zero. This makes word sequences that have very different likelihoods appear equally likely, since the probabilities of almost all word sequences are vanishingly small due to the large size of the dictionary.
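Here is a quick sketch of the effect (the per-word probability of 1e-7 is a made-up but plausible figure for a large dictionary):

```python
# Multiplying many tiny per-word probabilities drives the product to
# exactly 0.0 in double-precision floating point.
p_word = 1e-7  # hypothetical probability of a single word

product = 1.0
for _ in range(50):  # a 50-word sequence
    product *= p_word

print(product)  # 0.0 -- the true value, 1e-350, underflows a float64
```

The true joint probability, 1e-350, is far below the smallest positive value a 64-bit float can hold (about 5e-324), so the product silently collapses to zero.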
How do we avoid this? The answer turns out to be very simple. We don’t multiply these probabilities at all. All we want is to find the combination of words that has the maximum likelihood, right? Well, what we can do instead is find the maximum of the sum of the logs of the probabilities. This is safe because log is monotonically increasing, so if
p(a) > p(b)
then
log(p(a)) > log(p(b))
Why a sum? Because we can reach back all the way to 9th grade and recall that
log(a*b) = log(a) + log(b)
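Both properties are easy to check numerically (the values of a and b here are arbitrary tiny probabilities chosen for illustration):

```python
import math

a, b = 1e-7, 3e-9  # hypothetical tiny probabilities

# The log of the product equals the sum of the logs:
assert math.isclose(math.log(a * b), math.log(a) + math.log(b))

# And log preserves ordering, so comparing summed-log scores
# picks the same winner as comparing the raw products would:
assert (a > b) == (math.log(a) > math.log(b))
```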
So in the previous code, all we need to do is make the find_word_seq_score() routine calculate the sum of logs of probabilities instead of the product of probabilities, and we are done!
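A minimal sketch of what the changed routine could look like — the word-probability table and its contents are hypothetical stand-ins, since the original dictionary-building code isn’t shown here:

```python
import math

# Hypothetical word-probability table; in the real code these values
# would come from word frequencies in a large corpus.
WORD_PROB = {"the": 0.05, "cat": 0.001, "sat": 0.0008}

def find_word_seq_score(words):
    # Sum of log-probabilities instead of a product of probabilities.
    # A larger (less negative) score means a more likely sequence,
    # and the sum never underflows the way the product does.
    return sum(math.log(WORD_PROB[w]) for w in words)
```

Scores are now negative numbers of modest magnitude, so even very long word sequences stay comparable instead of all rounding to zero.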
Photo by Flickr user springm used under CC license