SquareCog's SquareBlog

Splitting words joined into a single string (compound-splitter)

Posted in programming by squarecog on October 19, 2008

[Image: Chopping wood]

Someone posted a question on StackOverflow.com asking how to split words that have been concatenated together (likethis).   This sounded like fun, so I spent an hour or two putting together a solution.

As it turns out, this is a common problem in Information Retrieval, where you might be dealing with, say, German (and Germansconcatenateeverything), and you need to split strings in order to get at your terms.  So this is a naive “compound splitter” (that’s the technical term). For how the pros do it, consider reading the following description of a compound splitter for Swedish: http://www.nada.kth.se/theory/projects/xcheck/rapporter/sjoberghkann04.pdf

But for a quick and dirty “my evening with Perl” approach, read on.

The tricky part, of course, is that multiple combinations of words may apply to any given string.  So we need to:

a) produce all valid combinations

b) rank them somehow

I chose to do the ranking based on the combined probability of the component words occurring in English.  I didn’t do anything clever like evaluating the likelihood of the words appearing next to each other, but it’s pretty clear how that could be added to the code below.
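
To make the ranking concrete, take “wickedweather” with some made-up frequencies (purely for illustration): if “wicked” occurs with probability 1e-5, “weather” 2e-5, “wick” 1e-6, and “ed” 5e-6, then “wicked weather” scores 1e-5 * 2e-5 = 2e-10, while “wick ed weather” scores 1e-6 * 5e-6 * 2e-5 = 1e-16. The parse built from fewer, more common words wins by several orders of magnitude, which is usually what you want.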

As an aside, I am using the WP code highlighter here, which doesn’t support Perl, so I am telling it that the code is PHP; that works halfway decently.

The resulting code seems to me to strike a balance between being easy enough to understand and remaining non-trivial, so I hope it will be of use to folks trying to get a little beyond simple maintenance scripts in Perl-land.

Let’s step through this at a leisurely pace.

#!/usr/bin/env perl

use strict;
use warnings;

my ($dict_file, $word_file) = @ARGV;
(defined $dict_file && defined $word_file) or die(Usage());

Nothing new here. We use strict and warnings to make sure our program is sane (always do this unless you have a real reason not to!).
Check that both of our parameters (a dictionary file and a merged-words file) were provided.
Note that we check them with “defined($foo)” rather than plain truthiness, just in case the value of $foo is 0 (a file named “0” is falsy in Perl, but perfectly legal). Now, on to the main body:

## being lazy, using globals
my($DICT, $TOTAL) = get_word_stats($dict_file);

open(my $words_fh, '<', $word_file) or die "unable to open $word_file\n";
foreach my $word (<$words_fh>) {
  chomp $word;
  my $arr = find_matches($word);
  my @sorted_arr = sort { find_word_seq_score($b) <=>
                            find_word_seq_score($a) } @$arr;
  print_results($word, @sorted_arr);
}
close $words_fh;

That wasn’t so bad, right?
We read in the dictionary, and derive some stats, using a function we’ll explore later on.
We then read the word file line by line, find all matching parses (valid splits of the string that form “real” words), and order them by a custom ordering function.
The resulting sorted array is printed out.

Now, on to the meaty bits.

sub find_matches {
    my $found_parses = [];
    find_matches_rec(shift, [], $found_parses);
    return $found_parses;
}

sub find_matches_rec {
    my ($string, $words_sofar, $found_parses) = @_;

    if ($string eq '') {
        push @$found_parses, $words_sofar;
        return;
    }

    my $length = length $string;
    # try every prefix of length 2..$length as a candidate word;
    # note that single-letter words like "a" are never matched
    foreach my $i (2..$length) {
        my $prefix = substr($string, 0, $i);
        my $suffix = substr($string, $i, $length-$i);
        if (exists $DICT->{$prefix}) {
            find_matches_rec($suffix, [@$words_sofar, $prefix], $found_parses);
        }
    }
}

This is the heart of the solution.
The problem naturally lends itself to a recursive approach — for a string, check all of its prefixes to see if they are words; for every prefix that forms a valid word, try to see if you can decompose the rest of the string into valid words, too. Rinse, repeat.
One can visualize the execution of this procedure as a traversal of a tree composed of all possible string decompositions, with the caveat that we don’t bother descending into branches that follow substrings that are not recognized as known words.
Every time we get to the bottom of the tree, we know that all the words so far are legitimate, so we add the breadcrumbs we collected to a list of all valid decompositions.
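
As an aside, note that the same suffix can be reached along many different branches, and the naive recursion re-parses it each time. A memoized variant (a sketch of the idea, not the code this post walks through) caches the parses for each suffix, so each suffix is decomposed only once:

my %MEMO;
sub find_matches_memo {
    my ($string) = @_;
    return [[]] if $string eq '';    # exactly one parse: the empty one
    return $MEMO{$string} if exists $MEMO{$string};

    my @parses;
    my $length = length $string;
    foreach my $i (2..$length) {
        my $prefix = substr($string, 0, $i);
        next unless exists $DICT->{$prefix};
        # every parse of the suffix extends to a parse of the whole string
        foreach my $rest (@{ find_matches_memo(substr($string, $i)) }) {
            push @parses, [$prefix, @$rest];
        }
    }
    return $MEMO{$string} = \@parses;
}

Since the cache depends only on the dictionary, it can safely be kept across input words.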

## Just a simple joint probability
## assumes independence between words, which is obviously untrue
## that's why this is broken out -- feel free to add better brains
sub find_word_seq_score {
    my ($words) = @_;
    my $score = 1;
    foreach (@$words) {
        $score = $score * $DICT->{$_}/$TOTAL;
    }
    return $score;
}

This code is used to rank different decompositions of the same string. We simply compute the joint probability of seeing all of the component words in real text (so the word ‘a’ has a high probability, while ‘obsequious’ is fairly low).
This is really quick and dirty — first, the total score gets very small very fast, so it might be better to transform the scores so that they are a bit more spread out (see the log-space sketch below).
More importantly, a better model would account not just for the probability of seeing a word anywhere in the text, but the probability of seeing the words together.
That’s called ‘n-grams’, and Google has made a large data set of n-grams available to the public — for details, go here: http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
If general English language data is not appropriate, you may want to compute your own counts from a corpus that better reflects the data you are working with. There is a Perl solution for this (there is a Perl solution to everything…): the module Text::Ngrams.
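
To make that “spread the scores out” idea concrete, here is a minimal sketch (my illustration, not part of the script above) that sums log-probabilities instead of multiplying raw ones, which sidesteps the underflow entirely:

use List::Util qw(sum);

## A sketch only: since log is monotonic, sorting by this score
## yields the same ranking as find_word_seq_score, but the values
## stay in a numerically comfortable range.
sub find_word_seq_log_score {
    my ($words) = @_;
    return sum(map { log($DICT->{$_} / $TOTAL) } @$words);
}

This drops straight into the sort comparator in the main loop in place of find_word_seq_score.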

sub get_word_stats {
  my ($dict_file) = @_;
  open(my $dict_fh, '<', $dict_file) or die "unable to open $dict_file\n";
  my %dict = ();
  my $total = 0;
  while (my $line = <$dict_fh>) {
    # grab just the words themselves; splitting on /\b/ would also
    # count the whitespace and punctuation between them
    foreach my $word ($line =~ /\w+/g) {
      $dict{$word} += 1;
      $total++;
    }
  }
  close $dict_fh;
  return (\%dict, $total);
}

This is about as simple as it gets — read the file in, pull the words out of each line, and build a hash with a count of all occurrences of every word.
We also count the total number of words while we are at it, since we’ll need it later for the probabilities, and this is a convenient place to do it.

All that's left is the window dressing:

sub print_results {
  my ($word,  @combos) = @_;
  print $word, ": ", scalar @combos, " possibilities\n";
  foreach (@combos) {
    print join(" ", " - ", @$_), "\n";
  }
  print "\n";
}

sub Usage {
  return "$0 /path/to/dictionary /path/to/your_words";
}

The output looks something like this:

perl splitwords.pl big.txt words
answerveal: 2 possibilities
 -  answer veal
 -  answer ve al

wickedweather: 4 possibilities
 -  wicked weather
 -  wicked we at her
 -  wick ed weather
 -  wick ed we at her

liquidweather: 6 possibilities
 -  liquid weather
 -  liquid we at her
 -  li quid weather
 -  li quid we at her
 -  li qu id weather
 -  li qu id we at her

driveourtrucks: 1 possibilities
 -  drive our trucks

gocompact: 1 possibilities
 -  go compact

slimprojector: 2 possibilities
 -  slim projector
 -  slim project or

orcore: 3 possibilities
 -  or core
 -  or co re
 -  orc ore

Image by Carol Esther, used under the Creative Commons License.


5 Responses


  1. […] There is a subtle bug in the code I posted earlier for splitting long strings into words. […]

  2. Rodrigo said, on February 20, 2009 at 11:40 am

    Nice article, I was looking for something like this.

  3. Kevin said, on August 27, 2009 at 6:46 pm

    Excellent work! I haven’t tested the script myself as I don’t have Perl installed, but what kind of results do the following strings produce?

    expertsexchange
    penisland
    choosespain
    kidsexpress
    childrenswear
    dicksonweb

    Here’s my StackOverflow thread outlining the challenge:
    http://stackoverflow.com/questions/1315373/programmatically-extract-keywords-from-domain-names

  4. Al said, on February 8, 2011 at 12:23 am

    If you add a scoring mechanism that squares the length of each word and sums the squares, it will favor solutions made of longer words. In most of the test cases this picks the desired word combination, except for the choosespain one.

    Soundex might be useful in word matching from a database.

  5. Nate said, on April 14, 2011 at 11:31 pm

    Here are some tools for those wanting to improve the relevance and accuracy of the parser:
    http://www.d.umn.edu/~tpederse/code.html

