Incrementing Hadoop Counters in Apache Pig
Judging by the user list traffic, incrementing Hadoop counters from inside Pig UDFs is not well documented, so here is a brief note showing how to do it.
Hadoop counters are a way to report basic statistics about a job in Hadoop. I won’t go into a detailed discussion of what they are and when to use them here; there’s plenty of information about that on the internet (for starters, see the Cloud9 intro to Counters, and some guidelines for appropriate usage in “Apache Hadoop Best Practices and Anti-Patterns”).
Pig 0.6 and before
Counters were not explicitly supported in Pig 0.6 and before, but you could get at them with this hack (inside a UDF):
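// Unsupported hack: borrow the Hadoop Reporter from Pig's logger.
Reporter reporter = PigHadoopLogger.getInstance().getReporter();
// The reporter can be null, so check before using it.
if (reporter != null) {
    reporter.incrCounter(myEnum, 1L);
}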
Pig 0.8
Pig 0.8 has an “official” method for getting and incrementing counters from a UDF:
PigStatusReporter reporter = PigStatusReporter.getInstance();
// The reporter can be null (see "Watch out for nulls" below).
if (reporter != null) {
    reporter.getCounter(key).increment(incr);
}
You can also read counters programmatically if you invoke Pig through PigRunner, which returns a PigStats object on completion. It’s a bit involved:
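PigStats.JobGraph jobGraph = pigStats.getJobGraph();
// One JobStats per MapReduce job that the script ran.
for (JobStats jobStats : jobGraph) {
    Counters counters = jobStats.getHadoopCounters();
    // ... read whatever you need from counters here
}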
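A Pig script can compile to several MapReduce jobs, so the loop above hands you one Counters object per job. If you want script-wide totals, you have to add them up yourself. Here is a minimal sketch of that step, with plain maps standing in for Hadoop's Counters so that it runs without Pig on the classpath. The class name CounterTotals, the method sum, and the "group:name" key convention are all made up for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: adds up per-job counter values into script-wide totals.
final class CounterTotals {
    // Each input map is one job's counters, keyed by "group:name"
    // (an assumed convention, not something Pig or Hadoop defines).
    static Map<String, Long> sum(List<Map<String, Long>> perJob) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map<String, Long> job : perJob) {
            for (Map.Entry<String, Long> e : job.entrySet()) {
                // Add this job's value to the running total for the key.
                totals.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return totals;
    }
}
```

With the real types, Hadoop's old-API Counters is iterable over its groups and each group over its counters, so the inner loop would walk those instead of map entries.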
Pig 0.7
Unfortunately I don’t know of a way to do this in 0.7, as the old hack went away and the new PigStatusReporter hadn’t been added yet. If you have a trick, please comment.
Watch out for nulls
We’ve observed that sometimes the reporter is null for a bit even when a UDF is executing on the MR side. To deal with this, we added a little helper class PigCounterHelper to Elephant-Bird that buffers the writes in a Map, and flushes them when it gets a non-null counter.
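The buffering idea can be sketched without any Pig dependency. This is not the Elephant-Bird code, just an illustration of the approach: CounterSink and BufferedCounters are hypothetical names, and the Supplier stands in for a call like PigStatusReporter.getInstance() that may return null early in a task.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for the reporter: anything that can add to a named counter.
interface CounterSink {
    void increment(String group, String name, long delta);
}

// Buffers increments while no sink is available and replays them once one appears.
final class BufferedCounters {
    private final Supplier<CounterSink> sinkSupplier; // may return null
    private final Map<String, Map<String, Long>> pending = new HashMap<>();

    BufferedCounters(Supplier<CounterSink> sinkSupplier) {
        this.sinkSupplier = sinkSupplier;
    }

    void increment(String group, String name, long delta) {
        CounterSink sink = sinkSupplier.get();
        if (sink == null) {
            // No reporter yet: remember the increment for later.
            pending.computeIfAbsent(group, g -> new HashMap<>())
                   .merge(name, delta, Long::sum);
            return;
        }
        flush(sink);                        // replay anything buffered earlier
        sink.increment(group, name, delta); // then apply the current increment
    }

    boolean hasPending() {
        return !pending.isEmpty();
    }

    private void flush(CounterSink sink) {
        for (Map.Entry<String, Map<String, Long>> g : pending.entrySet()) {
            for (Map.Entry<String, Long> c : g.getValue().entrySet()) {
                sink.increment(g.getKey(), c.getKey(), c.getValue());
            }
        }
        pending.clear();
    }
}
```

Note that in this sketch a flush only happens on the next increment, so anything still buffered when the task ends is lost.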
So there. If someone asks about counters in Pig, send them here.