SquareCog's SquareBlog

Splitting words joined into a single string (compound-splitter)

Posted in programming by squarecog on October 19, 2008

Chopping wood.

Someone posted a question on StackOverflow.com asking how to split words that have been concatenated together (likethis). This sounded like fun, so I spent an hour or two putting together a solution.

As it turns out, this is a common problem in Information Retrieval, where you might be dealing with, say, German (and Germansconcatenateeverything), and you need to split strings in order to extract your terms. So this is a naive “compound splitter” (that’s the technical term). For how the pros do it, consider reading the following description of a compound splitter for Swedish: http://www.nada.kth.se/theory/projects/xcheck/rapporter/sjoberghkann04.pdf

But for a quick and dirty “my evening with Perl” approach, read on.
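
To give a flavor of what a naive dictionary-based splitter does, here is a minimal sketch. It is written in Python for illustration and is not the Perl solution from the StackOverflow answer; the toy word list and the name `split_compound` are made up for this example, and a real splitter would load a proper dictionary and probably rank competing splits.

```python
from functools import lru_cache

# Hypothetical toy dictionary; a real splitter would load a full word list.
WORDS = {"like", "this", "germans", "concatenate", "everything"}


def split_compound(text, words=WORDS):
    """Return one way to split `text` into dictionary words, or None."""
    text = text.lower()

    @lru_cache(maxsize=None)
    def solve(start):
        # Reaching the end of the string means everything before it was split.
        if start == len(text):
            return ()
        # Try the longest candidate word first, then shorter ones.
        for end in range(len(text), start, -1):
            piece = text[start:end]
            if piece in words:
                rest = solve(end)
                if rest is not None:
                    return (piece,) + rest
        # No dictionary word starting here leads to a full split.
        return None

    result = solve(0)
    return list(result) if result is not None else None


print(split_compound("likethis"))
print(split_compound("Germansconcatenateeverything"))
```

The memoized recursion keeps the search from re-examining the same suffix more than once, and the function returns `None` when no complete split exists instead of guessing.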

