SquareCog's SquareBlog

Dealing with underflow in joint probability calculations

Posted in programming by squarecog on January 10, 2009

Been meaning to post this for a while, but it kept dropping to the bottom of the to-do list. There is a subtle bug in the code I posted earlier for splitting long strings into words.

The problem is that for most words, their probability of occurrence is extremely small.  This means that when we have a sequence of several words, the probability of all of them occurring is

P(a, b, c) = P(a) * P(b) * P(c)

which is three very small numbers multiplied by each other, which is a far smaller number. If we get enough of those, we quickly encounter the undeflow problem — internal computer representation of these small numbers does not have enough bits to represent the enormous smallness of them, and rounds them to roughly zero. This makes word sequences that have very different likelihoods appear equally likely, since probailities of almost all word sequences are very small due to the large size of the dictionary.

How do we avoid this? The answer turns out to be very simple.  We don’t multiply these probabilities.  All we want is to find the combination of words that has the maximum likelyhood, right? Well, what we can do is find the maximum of the sum of the probabilities’ logs.  This is safe because we know that if

p(a) > p(b)

then

log(p(a)) > log(p(b))

Why sum? because we can reach back all the way to 9th grade and recall that

log(a*b) = log(a) + log(b)

So in the previous code, all we need to do is make the find_word_seq_score() routine calculate the sum of logs of probabilities instead of the product of probabilities, and we are done!

Photo by Flickr user springm used under CC license

One Response

Subscribe to comments with RSS.

  1. morgan said, on September 24, 2010 at 6:34 pm

    Thanks.. so would the change be something like this:

    $score += log($DICT{$_}/$TOTAL);

    ?


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: