SquareCog's SquareBlog

Upcoming Features in Pig 0.8: Dynamic Invokers

Posted in programming by squarecog on August 20, 2010

Pig release 0.8 is scheduled to be feature-frozen and branched at the end of August 2010. This release has many, many useful new features, mostly addressing usability. In this series of posts, I will demonstrate some of my favorites from this release.

Pig 0.8 will have a family of built-in UDFs called Dynamic Invokers. The idea is simple: frequently, Pig users need to use a simple function that is already provided by standard Java libraries, but for which a UDF has not been written. Dynamic Invokers allow a Pig programmer to refer to Java functions without having to wrap them in custom Pig UDFs, at the cost of doing some Java reflection on every function call.
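As a sketch of what this looks like in practice: the invoker family includes UDFs such as InvokeForString, which takes the fully-qualified name of a static Java method and its parameter types. The relation and file name below are just for illustration.

```pig
-- Use java.net.URLDecoder.decode without writing a custom UDF.
DEFINE UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');

urls    = LOAD 'urls.txt' AS (encoded_url:chararray);
decoded = FOREACH urls GENERATE UrlDecode(encoded_url, 'UTF-8');
```

The first argument names the static method; the second lists its parameter types, which Pig uses to pick the right overload via reflection on each call.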


Pig, HBase, Hadoop, and Twitter: HUG talk slides

Posted in programming by squarecog on May 20, 2010

I presented tonight at the Bay Area Hadoop User Group, talking briefly about Twitter’s use of Hadoop and Pig. Here are the slides:


GROUP operator in Apache Pig

Posted in programming by squarecog on May 11, 2010

I’ve been doing a fair amount of helping people get started with Apache Pig. One common stumbling block is the GROUP operator. Although familiar, as it serves a similar function to SQL’s GROUP operator, it is just different enough in the Pig Latin language to be confusing. Hopefully this brief post will shed some light on what exactly is going on.
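To make the difference concrete, here is a minimal sketch (input file and schema are made up for illustration). Unlike SQL's GROUP BY, Pig's GROUP does not aggregate; it produces one tuple per key, containing the key and a bag of all matching input tuples.

```pig
logs    = LOAD 'logs.txt' AS (user:chararray, url:chararray);
grouped = GROUP logs BY user;
-- Each tuple in 'grouped' has two fields: 'group' (the key) and
-- 'logs' (a bag holding every input tuple with that key).
counts  = FOREACH grouped GENERATE group, COUNT(logs);
```

Aggregation, if you want it, happens in a separate FOREACH over the grouped relation, as in the last line.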


Presentation on Apache Pig at Pittsburgh Hadoop User Group

Posted in programming by squarecog on November 3, 2009

Ashutosh and I presented at the Pittsburgh Hadoop User Group on Apache Pig. The slide deck goes through a brief intro to Pig Latin, then jumps into an explanation of the different join algorithms, and finishes up with some research ideas. A pretty wide-ranging talk, for a diverse audience.
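For readers who haven't seen Pig's specialized joins: you pick an algorithm with the USING clause. A hedged sketch, with made-up relation names:

```pig
-- Fragment-replicated join: 'tiny' must fit in memory on each mapper,
-- so the join runs map-side with no reduce phase.
joined = JOIN big BY key, tiny BY key USING 'replicated';
```

Without a USING clause, Pig falls back to the default hash join in the reduce phase.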

Scribd messed up some of the colors, so if you can’t read some of the text, try downloading the original.


Free Hadoop Training

Posted in Uncategorized by squarecog on March 13, 2009

Cloudera has made training videos, screencasts, and exercises from their basic Hadoop training available on the web.

Check it out here: http://www.cloudera.com/hadoop-training-basic

They even include a VM image that you can use to get started without messing with installation details.  

But how is it different from the word count tutorial I went through on the Hadoop Wiki, you ask?  There is a section on “algorithms in Map-Reduce” and a screencast on using Hive.  More on this once I actually do the exercises…


Building an Inverted Index with Hadoop and Pig

Posted in programming by squarecog on January 17, 2009

Note: For some reason, this post appears to be pretty popular. Here's the thing. This was the first thing I wrote when learning Pig. Literally -- I wrote it down the evening I sat down to play with Pig's syntax. You wouldn't really ever construct an inverted index this way. The point was that you can, not that you should. It is, however, kind of neat.

Pig is a system for processing very large datasets, developed mostly at Yahoo and now an Apache Hadoop sub-project.  Pig aims to provide massive scalability by translating code written in a new data processing language called Pig Latin into Hadoop (map/reduce) plans.

In this post, I present a (very) brief description of the Pig project and demonstrate how one can construct an inverted index from a collection of text files using just a few lines of PigLatin. (more…)
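The full post is behind the link, but the core of the script looks roughly like this (file names and schema are illustrative, not the post's exact code):

```pig
-- Tokenize each document, then group word occurrences by word.
docs    = LOAD 'docs' USING PigStorage('\t') AS (doc_id:chararray, text:chararray);
words   = FOREACH docs GENERATE doc_id, FLATTEN(TOKENIZE(text)) AS word;
grouped = GROUP words BY word;
-- For each word, emit the bag of document ids it appears in.
index   = FOREACH grouped GENERATE group AS word, words.doc_id AS postings;
STORE index INTO 'inverted_index';
```

A handful of lines versus a full map/reduce program in Java, which is the point of the exercise.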

Great post on databases and map/reduce

Posted in Uncategorized by squarecog on January 14, 2009

Anand Rajaraman has a great post on Datawocky with an overview of the various approaches to data analysis using Map/Reduce, and the ways in which this paradigm is bridged with RDBMSes by AsterData and Greenplum, and the Pig project. Don’t miss the comments from people directly responsible for these technologies, as well as Facebook’s Hive.
