SquareCog's SquareBlog

Building an Inverted Index with Hadoop and Pig

Posted in programming by squarecog on January 17, 2009

Note: For some reason, this post appears to be pretty popular. Here's the thing. This was the first thing I wrote when learning Pig. Literally -- I wrote it down the evening I sat down to play with Pig's syntax. You wouldn't really ever construct an inverted index this way. The point was that you can, not that you should. It is, however, kind of neat.

boarPig is a system for processing very large datasets, developed mostly at Yahoo and now an Apache Hadoop sub-project.  Pig aims to provide massive scalability by translating code written in a new data processing language called Pig Latin into Hadoop (map/reduce) plans.

In this post, I present a (very) brief description of the Pig project and demonstrate how one can construct an inverted index from a collection of text files using just a few lines of PigLatin. (more…)

Great post on databases and map/reduce

Posted in Uncategorized by squarecog on January 14, 2009

Anand Rajaraman has a great post on Datawocky with an overview of the various approaches to data analysis using Map/Reduce, and they ways in which this paradigm is bridged with RDBMSes by AsterData and Greenplum, and the Pig project. Don’t miss the comments from people directly responsible for these technologies, as well as Facebook’s Hive.

Tagged with: , , ,

Dealing with underflow in joint probability calculations

Posted in programming by squarecog on January 10, 2009

Been meaning to post this for a while, but it kept dropping to the bottom of the to-do list. There is a subtle bug in the code I posted earlier for splitting long strings into words.

The problem is that for most words, their probability of occurrence is extremely small.  This means that when we have a sequence of several words, the probability of all of them occurring is

P(a, b, c) = P(a) * P(b) * P(c)

which is three very small numbers multiplied by each other, which is a far smaller number. If we get enough of those, we quickly encounter the undeflow problem — internal computer representation of these small numbers does not have enough bits to represent the enormous smallness of them, and rounds them to roughly zero. (more…)

%d bloggers like this: