SquareCog's SquareBlog

Building an Inverted Index with Hadoop and Pig

Posted in programming by squarecog on January 17, 2009

Note: For some reason, this post appears to be pretty popular. Here's the thing. This was the first thing I wrote when learning Pig. Literally -- I wrote it down the evening I sat down to play with Pig's syntax. You wouldn't really ever construct an inverted index this way. The point was that you can, not that you should. It is, however, kind of neat.

boarPig is a system for processing very large datasets, developed mostly at Yahoo and now an Apache Hadoop sub-project.  Pig aims to provide massive scalability by translating code written in a new data processing language called Pig Latin into Hadoop (map/reduce) plans.

In this post, I present a (very) brief description of the Pig project and demonstrate how one can construct an inverted index from a collection of text files using just a few lines of PigLatin. (more…)

Advertisements

Great post on databases and map/reduce

Posted in Uncategorized by squarecog on January 14, 2009

Anand Rajaraman has a great post on Datawocky with an overview of the various approaches to data analysis using Map/Reduce, and they ways in which this paradigm is bridged with RDBMSes by AsterData and Greenplum, and the Pig project. Don’t miss the comments from people directly responsible for these technologies, as well as Facebook’s Hive.

Tagged with: , , ,
%d bloggers like this: