NAME
Lingua::Anagrams - pure Perl anagram finder
VERSION
version 0.009
SYNOPSIS
use Lingua::Anagrams;
open my $fh, '<', 'wordsEn.txt' or die "Aargh! $!";
my @words = <$fh>;
close $fh;
my $anagramizer = Lingua::Anagrams->new( \@words ); # NOT a good word list for this purpose
my $t1 = time;
my @anagrams = $anagramizer->anagrams('Find anagrams!');
my $t2 = time;
print join ' ', @$_ for @anagrams;
print "\n\n";
print scalar(@anagrams) , " anagrams\n"";
print 'it took ' , ( $t2 - $t1 ) , " seconds\n"";
Giving you
...
naif nm rag sad
naif nm raga sd
naif nm rd saga
naif ragman sd
20906 anagrams
it took 3 seconds
DESCRIPTION
Lingua::Anagrams constructs a trie out of a list of words you give it. It then uses this trie to find all the anagrams of a phrase you give to its anagrams
method. A dynamic programming algorithm is used to accelerate at the cost of memory. See new
for how one may modify this algorithm.
Be aware that the anagram algorithm has been golfed down pretty far to squeeze more speed out of it. It isn't the prettiest.
METHODS
CLASS->new( $word_list, %params )
Construct a new anagram engine from a word list. The parameters understood by the constructor are
- limit
-
The character count limit used by the dynamic programming algorithm to throttle memory consumption somewhat. If you wish to find the anagrams of a very long phrase you may find the caching in the dynamic programming algorithm consumes too much memory. Set this limit lower to protect yourself from memory exhaustion (and slow things down).
The default limit is set by the global
$LIMIT
variable. It will be 20 unless you tinker with it. - clean
-
A code reference specifying how text is to be cleaned of extraneous characters and normalized. The default cleaning function is
sub _clean { $_[0] =~ s/\W+//g; $_[0] = lc $_[0]; }
Note that this function, like
_clean
, must modify its argument directly. - sorted
-
A boolean. If true, the anagram list will be returned sorted.
$self->anagrams( $phrase, %opts )
Returns a list of array references, each reference containing a list of words which together constitute an anagram of the phrase.
Options may be passed in as a list of key value pairs or as a hash reference. The following options are supported at this time:
- sorted
-
As with the constructor option, this determines whether the anagrams are sorted internally and with respect to each other. It overrides the constructor parameter, which provides the default.
SOME CLEVER BITS
One trick I use to speed things up is to convert all characters to integers immediately. If you're using integers, you can treat arrays are really fast hashes.
Another is, in the trie, to build the trie only of arrays. The character identity is encoded in the array position, the distance from the start by depth. So the trie contains nothing but arrays. For a little memory efficiency the terminal symbols is always the same empty array.
The natural way to walk the trie is with recursion, but I use a stack and a loop to speed things up.
I use a jump table to keep track of the actual characters under consideration so when walking the trie I only consider characters that might be in anagrams.
A particular step of anagram generation consists of pulling out all words that can be formed with the current character counts. If a particular character count is not decremented in the formation of any word in a given step we know we've reached a dead end and we should give up.
Similarly, if we do touch every character in a particular step we can collect all the words extracted which touch that character and descend only into the remaining possibilities for those character counts because the other words one might exctract are necessarily contained in the remaining character counts.
The dynamic programming bit consists of memoizing the anagram lists keyed to the character counts so we never extract the anagrams for a particular set of counts twice (of course, we have to calculate this key many times, which is not free).
I localize a bunch of variables on the first method call so that thereafter these values can be treated as global. This saves a lot of copying.
After the initial method calls I use functions, which saves a lot of lookup time.
In stack operations I use push and pop in lieu of unshift and shift. The former are more efficient, especially with short arrays.
AUTHOR
David F. Houghton <dfhoughton@gmail.com>
COPYRIGHT AND LICENSE
This software is copyright (c) 2014 by David F. Houghton.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.