SquareCog's SquareBlog

Great Database Performance Presentations and Videos

Posted in programming by squarecog on September 11, 2009

Percona is posting the presentations from their Performance Conference, both the videos and the PPTs, at this blog: http://www.percona.tv/. There is some great info in there, not just on MySQL, but also on Hive, PostgreSQL, performance monitoring and more. Well worth checking out.


Jeeves and Diller — separated at birth?

Posted in Uncategorized by squarecog on April 20, 2009

IAC-owned Ask.com, formerly known as Ask Jeeves, is bringing Jeeves back in the UK (disclaimer: I used to work for Ask, and still moonlight there occasionally).

Jeeves is back after his 3-year leave of absence with a total makeover. Is it just me or does he now look a whole lot like his boss Barry Diller?

Jeeves and Diller


Free Hadoop Training

Posted in Uncategorized by squarecog on March 13, 2009

Cloudera has made training videos, screencasts, and exercises from their basic Hadoop training available on the web.

Check it out here: http://www.cloudera.com/hadoop-training-basic

They even include a VM image that you can use to get started without messing with installation details.  

But how is it different from the word count tutorial I went through on the Hadoop Wiki, you ask? There is a section on “algorithms in Map-Reduce” and a screencast on using Hive. More on this once I actually do the exercises…


Building an Inverted Index with Hadoop and Pig

Posted in programming by squarecog on January 17, 2009

Note: For some reason, this post appears to be pretty popular. Here's the thing. This was the first thing I wrote when learning Pig. Literally -- I wrote it down the evening I sat down to play with Pig's syntax. You wouldn't really ever construct an inverted index this way. The point was that you can, not that you should. It is, however, kind of neat.

Pig is a system for processing very large datasets, developed mostly at Yahoo and now an Apache Hadoop sub-project. Pig aims to provide massive scalability by translating code written in a new data processing language called Pig Latin into Hadoop (map/reduce) plans.

In this post, I present a (very) brief description of the Pig project and demonstrate how one can construct an inverted index from a collection of text files using just a few lines of Pig Latin. (more…)
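For readers unfamiliar with the term, an inverted index maps each word to the documents that contain it. The sketch below shows the idea in plain Python rather than Pig Latin, so it is not the code from the full post. The function name `build_inverted_index`, the sample file names, and the whitespace tokenization are all assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercased word to the sorted names of documents containing it.

    `documents` is a dict of {document name: document text}.
    Tokenization is a plain whitespace split, which is an assumption
    made to keep the sketch short.
    """
    index = defaultdict(set)
    for name, text in documents.items():
        for word in text.lower().split():
            index[word].add(name)
    # Convert the sets to sorted lists so the result is deterministic.
    return {word: sorted(names) for word, names in index.items()}


# Hypothetical two-document collection.
docs = {
    "a.txt": "the quick brown fox",
    "b.txt": "the lazy dog",
}
index = build_inverted_index(docs)
```

Looking up `index["the"]` returns both file names, while `index["fox"]` returns only `a.txt`. A map/reduce version does the same thing at scale: the map step emits (word, document) pairs and the reduce step groups them by word.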

Great post on databases and map/reduce

Posted in Uncategorized by squarecog on January 14, 2009

Anand Rajaraman has a great post on Datawocky with an overview of the various approaches to data analysis using Map/Reduce, and the ways in which this paradigm is bridged with RDBMSes by AsterData, Greenplum, and the Pig project. Don’t miss the comments from people directly responsible for these technologies, as well as Facebook’s Hive.


Dealing with underflow in joint probability calculations

Posted in programming by squarecog on January 10, 2009

Been meaning to post this for a while, but it kept dropping to the bottom of the to-do list. There is a subtle bug in the code I posted earlier for splitting long strings into words.

The problem is that for most words, their probability of occurrence is extremely small.  This means that when we have a sequence of several words, the probability of all of them occurring is

P(a, b, c) = P(a) * P(b) * P(c)

which is three very small numbers multiplied together, giving a far smaller number. If we get enough of those, we quickly encounter the underflow problem: the computer's internal floating-point representation does not have enough bits to represent numbers this small, and rounds them to zero. (more…)
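A common remedy for this kind of underflow, and not necessarily the exact fix from the full post, is to add logarithms of the probabilities instead of multiplying the probabilities themselves. Since log is monotonic, comparing sums of logs ranks candidates the same way as comparing products. The sketch below is in Python; the function names and the per-word probability of 1e-5 are assumptions chosen to make the effect visible.

```python
import math


def joint_probability(probs):
    """Naive product of probabilities; underflows to 0.0 for long sequences."""
    result = 1.0
    for p in probs:
        result *= p
    return result


def joint_log_probability(probs):
    """Sum of natural logs; stays finite where the plain product underflows."""
    return sum(math.log(p) for p in probs)


# Hypothetical sequence of 100 words, each with probability 1e-5.
word_probs = [1e-5] * 100

naive = joint_probability(word_probs)          # true value 1e-500 is not representable
log_score = joint_log_probability(word_probs)  # a finite negative number
```

Here `naive` collapses to `0.0`, so any two long candidates would tie, while `log_score` remains an ordinary negative float that can still be compared. The sketch assumes every probability is strictly positive, because `math.log(0)` raises `ValueError`.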

Splitting words joined into a single string (compound-splitter)

Posted in programming by squarecog on October 19, 2008


Someone posted a question on StackOverflow.com asking how to split words that have been concatenated together (likethis). This sounded like fun, so I spent an hour or two putting together a solution.

As it turns out, this is a common problem in Information Retrieval, where you might be dealing with, say, German (and Germansconcatenateeverything), and you need to split strings in order to get at your terms. So this is a naive “compound splitter” (that’s the technical term). For how the pros do it, consider reading the following description of a Compound Splitter for Swedish: http://www.nada.kth.se/theory/projects/xcheck/rapporter/sjoberghkann04.pdf

But for a quick and dirty “my evening with Perl” approach, read on.
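To give a flavor of the problem before the Perl, here is a minimal sketch of one naive approach in Python: try every way of cutting the string into dictionary words and keep the most probable split. It is not the Perl solution from the post. The function name `split_compound`, the tiny dictionary, and its probabilities are made up for illustration.

```python
import math


def split_compound(text, word_probs):
    """Return the most probable split of `text` into known words, or None.

    `word_probs` maps each known word to a probability greater than zero.
    Scores are summed log-probabilities, so long inputs do not underflow.
    """
    # best[i] holds (score, words) for the best split of text[:i],
    # or None when text[:i] cannot be split into known words.
    best = [None] * (len(text) + 1)
    best[0] = (0.0, [])
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            if best[start] is None or word not in word_probs:
                continue
            score = best[start][0] + math.log(word_probs[word])
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    return best[-1][1] if best[-1] is not None else None


# Hypothetical dictionary with made-up probabilities.
probs = {"like": 0.01, "this": 0.02, "is": 0.03, "th": 0.0001, "lik": 0.0001}
words = split_compound("likethis", probs)
```

With this dictionary, "likethis" comes back as "like" followed by "this", because that split scores higher than "like", "th", "is". A string that cannot be covered by dictionary words returns `None`.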

