SquareCog's SquareBlog

Upcoming Features in Pig 0.8: Dynamic Invokers

Posted in programming by squarecog on August 20, 2010

Pig release 0.8 is scheduled to be feature-frozen and branched at the end of August 2010. This release has many, many useful new features, mostly addressing usability. In this series of posts, I will demonstrate some of my favorites from this release.

Pig 0.8 will have a family of built-in UDFs called Dynamic Invokers. The idea is simple: frequently, Pig users need to use a simple function that is already provided by standard Java libraries, but for which a UDF has not been written. Dynamic Invokers allow a Pig programmer to refer to Java functions without having to wrap them in custom Pig UDFs, at the cost of doing some Java reflection on every function call.

An example.

Let’s start off with a quick motivating example. Imagine we have a bunch of URL-encoded strings which we want to decode. In Java, this is done by simply calling:

String decoded = URLDecoder.decode(encoded, "UTF-8");

In Pig, there is no built-in function to do this, but it’s easy enough to write your own, wrapping the URLDecoder function:

package org.squarecog.pig;

import java.io.IOException;
import java.net.URLDecoder;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UrlDecode extends EvalFunc<String> {

    @Override
    public String exec(Tuple input) throws IOException {
        String encoded = (String) input.get(0);
        String encoding = (String) input.get(1);
        return URLDecoder.decode(encoded, encoding);
    }
}

This is about the least amount of code you can get away with. It doesn’t check for failing casts, non-existent fields, or all kinds of other problems, but it does the job most of the time. Having written this class, you still have to compile it, test it, and package it into a jar. Only then is the decoder ready to be used in Pig:

REGISTER squarecogs_pig_stuff.jar;

encoded_strings = LOAD 'encoded_strings.txt' as (encoded:chararray);
decoded_strings = FOREACH encoded_strings GENERATE org.squarecog.pig.UrlDecode(encoded, 'UTF-8');

What a pain. There must be an easier way, right? Well, now there is. With Pig 0.8 all you have to do is put this in your Pig script:

DEFINE UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');
encoded_strings = LOAD 'encoded_strings.txt' as (encoded:chararray);
decoded_strings = FOREACH encoded_strings GENERATE UrlDecode(encoded, 'UTF-8');

That’s it. No Java, no compilation. Just use it.

Usage

Currently, Dynamic Invokers can be used for any static function that accepts no arguments or some combination of Strings, ints, longs, doubles, floats, or arrays of same, and returns a String, an int, a long, a double, or a float. Numeric arguments must be primitives; the capital-letter numeric classes (Integer, Long, Double, Float) are not supported as argument types. Depending on the return type, a specific kind of Invoker must be used: InvokeForString, InvokeForInt, InvokeForLong, InvokeForDouble, or InvokeForFloat.

The DEFINE keyword is used to bind an alias to a Java method, as above. The first argument to the InvokeFor* constructor is the fully qualified name of the desired method (package, class, and method name). The second argument is a space-delimited ordered list of the classes of the method arguments. It can be omitted or left as an empty string if the method takes no arguments. Valid class names are String, Long, Float, Double, and Int. Invokers can also work with array arguments, represented in Pig as DataBags of single-field tuples. To declare one, simply refer to string[], for example. Class names are not case-sensitive.
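To make the two constructor arguments concrete, here is a minimal Java sketch of how a method path and a type list could be resolved with reflection. It is an illustration only, not Pig's actual implementation: the class name InvokerSketch and its methods are hypothetical, and only two array types are handled to keep it short.

```java
import java.lang.reflect.Method;
import java.util.Locale;

// Hypothetical helper for illustration; this is NOT the code Pig ships.
final class InvokerSketch {
    private final Method method;

    InvokerSketch(String fullMethodPath, String argTypeSpec) {
        // Split 'java.net.URLDecoder.decode' into class name and method name.
        int lastDot = fullMethodPath.lastIndexOf('.');
        if (lastDot < 0) {
            throw new IllegalArgumentException("Expected package.Class.method: " + fullMethodPath);
        }
        String className = fullMethodPath.substring(0, lastDot);
        String methodName = fullMethodPath.substring(lastDot + 1);

        // An omitted or empty spec means the method takes no arguments.
        String spec = argTypeSpec == null ? "" : argTypeSpec.trim();
        String[] names = spec.isEmpty() ? new String[0] : spec.split("\\s+");
        Class<?>[] types = new Class<?>[names.length];
        for (int i = 0; i < names.length; i++) {
            types[i] = toClass(names[i]);
        }

        Method resolved;
        try {
            resolved = Class.forName(className).getMethod(methodName, types);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot resolve " + fullMethodPath, e);
        }
        this.method = resolved;
    }

    // Type names are matched case-insensitively; numeric names map to primitives.
    static Class<?> toClass(String name) {
        switch (name.toLowerCase(Locale.ROOT)) {
            case "string":   return String.class;
            case "int":      return int.class;
            case "long":     return long.class;
            case "float":    return float.class;
            case "double":   return double.class;
            case "string[]": return String[].class;
            case "double[]": return double[].class;
            default:
                throw new IllegalArgumentException("Unsupported type: " + name);
        }
    }

    // The receiver is null because only static methods are supported.
    Object invoke(Object... args) {
        try {
            return method.invoke(null, args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Invocation failed", e);
        }
    }
}
```

With this sketch, new InvokerSketch("java.net.URLDecoder.decode", "String String").invoke("a%20b", "UTF-8") resolves the method once and then calls it reflectively, which mirrors what the DEFINE line above asks for.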

Speed

I tested the speed of these Invokers by using them to take the log of the numbers from 0 to 1,000,000 in a tight loop. For this experiment, using the dynamic InvokeForDouble UDF was about twice as slow as using the Log UDF directly. I find this to be an acceptable cost to pay for the speed and convenience of development when writing prototypes and one-off exploratory scripts. Naturally, if you are trying to squeeze all the performance that’s possible out of your scripts, you should use regular UDFs.
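For a feel of where the overhead comes from, the following plain-Java sketch compares a direct call to Math.log with the same call made through reflection. It is not the experiment described above: it runs outside Pig, the class and method names are made up for illustration, the loop starts at 1 to avoid log(0), and the timings it prints will vary by JVM and machine.

```java
import java.lang.reflect.Method;

// Hypothetical micro-benchmark sketch; it measures raw reflection only,
// not Pig's tuple handling or the actual InvokeForDouble UDF.
final class ReflectionCostSketch {

    // Baseline: call Math.log directly.
    static double sumDirect(int n) {
        double sum = 0;
        for (int i = 1; i <= n; i++) {
            sum += Math.log(i);
        }
        return sum;
    }

    // Same computation, but every call goes through Method.invoke,
    // which boxes the argument and the result.
    static double sumReflective(int n) {
        try {
            Method log = Math.class.getMethod("log", double.class);
            double sum = 0;
            for (int i = 1; i <= n; i++) {
                sum += (Double) log.invoke(null, (double) i);
            }
            return sum;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        long t0 = System.nanoTime();
        double direct = sumDirect(n);
        long t1 = System.nanoTime();
        double reflective = sumReflective(n);
        long t2 = System.nanoTime();
        System.out.printf("direct:     %.3f in %d ms%n", direct, (t1 - t0) / 1_000_000);
        System.out.printf("reflective: %.3f in %d ms%n", reflective, (t2 - t1) / 1_000_000);
    }
}
```

A single cold run like this is only a rough indicator; a serious comparison would need warm-up iterations and repeated measurements.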

Arrays

As mentioned, Pig 0.8 invokers will support array arguments. This makes methods like those in org.apache.commons.math.stat.StatUtils available for processing the results of grouping your datasets, for example. This is very nice, but a word of caution: the resulting UDF will of course not be optimized for Hadoop, and the very significant benefits one gains from implementing the Algebraic and Accumulator interfaces are lost here. Be careful with this one.
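The reason for the caution becomes clearer in a sketch. A method that takes an array needs the whole group in memory at once, so the values have to be copied into an array before the call. The Java below illustrates that shape under stated assumptions: ArrayArgSketch, mean, and invokeOnGroup are hypothetical names, the local mean method stands in for a library method such as those in StatUtils, and a plain List stands in for a Pig DataBag.

```java
import java.lang.reflect.Method;
import java.util.List;

// Hypothetical illustration of passing a grouped collection to a static
// method that expects a primitive array. Not Pig's actual code.
final class ArrayArgSketch {

    // Stand-in for a library method that accepts double[].
    static double mean(double[] values) {
        if (values.length == 0) {
            return Double.NaN;
        }
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    // The whole group is materialized as an array before the call;
    // nothing can be computed incrementally or combined map-side.
    static double[] toPrimitiveArray(List<? extends Number> group) {
        double[] out = new double[group.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = group.get(i).doubleValue();
        }
        return out;
    }

    static double invokeOnGroup(String methodName, List<? extends Number> group) {
        try {
            Method m = ArrayArgSketch.class.getDeclaredMethod(methodName, double[].class);
            // The cast keeps the array as one argument instead of varargs.
            return (Double) m.invoke(null, (Object) toPrimitiveArray(group));
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

For small groups this is perfectly fine; for very large groups the copy alone can become the bottleneck, which is where a purpose-built UDF earns its keep.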

Future Work

If people find these Invokers useful, more features can be added, such as support for booleans, bytes, and the various Number classes (rather than just primitives). Let me know what you would like to see, either in the comments, or, even better, on the Pig user mailing list.


2 Responses


  1. David Ciemiewicz said, on October 16, 2010 at 11:21 am

    Hey, thanks for writing this capability.

    Unfortunately, I need more than just static functions. For instance, I needed to compute the Incomplete Beta – P(x>value) given a Beta distribution with parameters of Beta(alpha, beta). And the Incomplete Beta computation parameters of x, alpha, and beta vary per record of computation.

    The way the Apache Commons library is implemented, first I must create a BetaDistributionImpl(alpha, beta), then I must invoke the dynamic (non-static) method cumulativeProbability(x).

    I submitted a JIRA proposal with some sketches of how I’d like this to work:

    https://issues.apache.org/jira/browse/PIG-1678

    I don’t know if the Invoker system can be easily extended to handle the case of first constructing the class object and then invoking the associated dynamic method, or not.

    Also, regarding the performance overhead, is the overhead of the Invoker the same as writing a native Pig wrapper eval function? It wasn’t clear. Would making invokers first-class implementations in Pig remove any of the overhead?

    • squarecog said, on October 16, 2010 at 11:30 am

      Hi David,
      I saw your ticket, it’s certainly the next logical thing to do with invokers. They can probably be extended to handle using a cached object to call methods on, though the syntax might get a little funky unless we modify Pig Latin to accommodate that sort of thing more easily. I’ll think about it, your request is a common one.

      As far as performance overhead, writing a native Pig wrapper eval will be faster.

