NAME

Lingua::Anagrams - pure Perl anagram finder

VERSION

version 0.012

SYNOPSIS

use v5.10;
use Lingua::Anagrams;

open my $fh, '<', 'words.txt' or die "Aargh! $!";         # some 100,000 words
my @words = map { ( my $w = $_ ) =~ s/\W+//g; $w } <$fh>;
close $fh;

my @enormous = grep { length($_) > 6 } @words;
my @huge     = grep { length($_) == 6 } @words;
my @big      = grep { length($_) == 5 } @words;
my @medium   = grep { length($_) == 4 } @words;
my @small    = grep { length($_) == 3 } @words;
my @tiny     = grep { length($_) < 3 } @words;

my $anagramizer = Lingua::Anagrams->new(
    [ \@enormous, \@huge, \@big, \@medium, \@small, \@tiny ],
    limit => 30 );

my $t1 = time;
my @anagrams =
  $anagramizer->anagrams( 'Ada Hyacinth Melton-Houghton', sorted => 1, min => 100 );
my $t2 = time;

say join ' ', @$_ for @anagrams;
say '';
say scalar(@anagrams) . ' anagrams';
say 'it took ' . ( $t2 - $t1 ) . ' seconds';

Giving you

...
manned ohioan thatch toughly
menial noonday thatch though
monthly ohioan thatch unaged
moolah nighty notched utahan
moolah tannin though yachted
moolah thatch toeing unhandy

1582 anagrams
it took 229 seconds

DESCRIPTION

Lingua::Anagrams constructs tries out of a lists of words you give it. It then uses these tries to find all the anagrams of a phrase you give to its anagrams method. A dynamic programming algorithm is used to accelerate the search at the cost of memory. See new for how one may modify this algorithm.

Be aware that the anagram algorithm has been golfed down pretty far to squeeze more speed out of it. It isn't the prettiest.

METHODS

CLASS->new( $word_list, %params )

Construct a new anagram engine from a word list, or a list of word lists. If you provide multiple word lists, each successive list will be understood as an augmentation of those preceding it. If you search for the anagrams of a phrase, the algorithm will abandon one list and try the next if it is unable to find sufficient anagrams with the current list. You can use cascading word lists like this to find interesting anagrams of long phrases as well as short ones in a reasonable amount of time. If on the other hand you use only one comprehensive list you will find that long phrases have many millions of anagrams the calculation of which take vast amounts of memory and time. In particular you will want to limit the number of short words in the earlier lists as these multiply the possible anagrams much more quickly.

The optional construction parameters may be provided either as a list of key-value pairs or as a hash reference. The understood parameters are:

limit

The character count limit used by the dynamic programming algorithm to throttle memory consumption somewhat. If you wish to find the anagrams of a very long phrase you may find the caching in the dynamic programming algorithm consumes too much memory. Set this limit lower to protect yourself from memory exhaustion (and slow things down).

The default limit is set by the global $LIMIT variable. It will be 20 unless you tinker with it.

clean

A code reference specifying how text is to be cleaned of extraneous characters and normalized. The default cleaning function is

sub _clean {
    $_[0] =~ s/\W+//g;
    $_[0] = lc $_[0];
}

Note that this function, like _clean, must modify its argument directly.

sorted

A boolean. If true, the anagram list will be returned sorted.

min

The minimum number of anagrams to look for. This value is only consulted if the anagram engine has more than one word list. If the first word list returns too few anagrams, the second is applied. If no minimum is provided the effective minimum is one.

$self->anagrams( $phrase, %opts )

Returns a list of array references, each reference containing a list of words which together constitute an anagram of the phrase.

Options may be passed in as a list of key value pairs or as a hash reference. The following options are supported at this time:

sorted

As with the constructor option, this determines whether the anagrams are sorted internally and with respect to each other. It overrides the constructor parameter, which provides the default.

min

The minimum number of anagrams to look for. This value is only consulted if the anagram engine has more than one word list. This overrides any value from the constructor parameter min.

SOME CLEVER BITS

One trick I use to speed things up is to convert all characters to integers immediately. If you're using integers, you can treat arrays are really fast hashes.

Another is, in the trie, to build the trie only of arrays. The character identity is encoded in the array position, the distance from the start by depth. So the trie contains nothing but arrays. For a little memory efficiency the terminal symbols is always the same empty array.

The natural way to walk the trie is with recursion, but I use a stack and a loop to speed things up.

I use a jump table to keep track of the actual characters under consideration so when walking the trie I only consider characters that might be in anagrams.

A particular step of anagram generation consists of pulling out all words that can be formed with the current character counts. If a particular character count is not decremented in the formation of any word in a given step we know we've reached a dead end and we should give up.

Similarly, if we do touch every character in a particular step we can collect all the words extracted which touch that character and descend only into the remaining possibilities for those character counts because the other words one might exctract are necessarily contained in the remaining character counts.

The dynamic programming bit consists of memoizing the anagram lists keyed to the character counts so we never extract the anagrams for a particular set of counts twice (of course, we have to calculate this key many times, which is not free).

I localize a bunch of variables on the first method call so that thereafter these values can be treated as global. This saves a lot of copying.

After the initial method calls I use functions, which saves a lot of lookup time.

In stack operations I use push and pop in lieu of unshift and shift. The former are more efficient, especially with short arrays.

AUTHOR

David F. Houghton <dfhoughton@gmail.com>

COPYRIGHT AND LICENSE

This software is copyright (c) 2014 by David F. Houghton.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.