NAME
WordLists::Sort
SYNOPSIS
Provides a structure for comparison functions, generally for complex sort.
# The following sorts "No6" "No.7" "no 8" in that order - ignoring punctuation.
@sorted = sort { atomic_compare (
$a,$b,{ n => sub{ $_[0]=~s/[^[:alnum:]]//g; lc $_[0]; } }
) } @unsorted;
# The following sorts A9 before A10.
@sorted = sort { atomic_compare (
$a,$b,{ t => [ { re => qr/[0-9]+/, c => sub { $_[0] <=> $_[1]; } }, ], } }
) } @unsorted;
DESCRIPTION
This is by far and away the most evil member of the Wordlists family (it's also pretty much unrelated to all the others). It is basically a terse way of writing complex comparison/sort functions as one liners (if you want to).
The intention is to be able to sort by several different criteria, e.g. so "the UN" sorts after "un-" and before "unabashed", and/or so that "F\x{E9}" sorts after "Fe" but before "FE".
Once you've written/cribbed a sort algorithm, it's easy to use - just put it in a subroutine and call it. (Actually, what you're writing is a comparison algrithm, which perl's sort
then calls).
Writing it is a bit harder, though: the framework involves (potentially) anonymous coderefs sprinkled amidst the hashrefs - it's much easier with indentation.
FUNCTIONS
atomic_compare
atomic_compare
: This provides most of the functionality in the module. It allows normalisation of the arguments, tokenisation so that different sections can be compared with different criteria, and, if so desired, flipping of the result.
Function arguments
n
: Normalise. This should be a coderef. If present, runs the code on each argument before comparison. Note that this only happens locally to the function, so lowercasing in functions[1]{n} will not prevent functions[2] putting VAT before vat. (If you want to keep them, nest the original function in the c). If n
is an arrayref, it runs the first code on $a
, the second on $b
.
t
: Tokenize. An arrayref containing hashrefs, each of which is attempted in order. In each hashref should be a regex keyed to re
which will match in case you want do different comparisons on different types of data. Permitted values, other than coderefs, are 0 (e.g. {re=>qr/\d/, 'c'=>0}
means 1 and 9 are equivalent), -1 or 1 (meaning that if this token is discovered at the same location, $a or $b always wins - NB that this is to be avoided in sort functions).
f
: Flip. if set to 1, then the result is reversed (-1 becomes 1 and vice versa but 0 stays the same).
c
: Comparison to use for text which doesn't match a token. Default behaviour is to use a wrapper for cmp
.
Below is an atomic_compare
function for sorting names of units in a workbook. You want them to appear in the order Welcome, Unit 1, Unit 2, ... Unit 11, but perl's cmp
would put "Welcome" at the end and sorts "Unit 11" before "Unit 2". The normalisation is a hack to pretend 'Welcome' is equivalent to "Unit 0", and the tokenisation instructs that series of digits should be compared as numbers and not as strings, so 10 and 11 now sort after 9.
atomic_compare (
$a, $b,
{
n => sub{
$_[0] =~ s/^Welcome$/Unit 0/i;
lc $_[0];
},
t =>
[
{
re => qr/[0-9]+/,
c => sub { $_[0] <=> $_[1]; }
},
],
}
);
complex_compare
complex_compare
allows a user to perform several successive comparisons, returning a value as soon as but only when a nonzero result is achieved. This is useful for situations such as:
For user-facing sorting, such as dictionary headwords where "the Internet" should normally sort after "internet" and not after "theft".
For sorting which requires heavy processing only in some cases, e.g. an identifier
a_492
always sorts beforeb_1
, whatever the numerical values, but to comparea_491
anda_492
an external resource (lookup table, AJAX data, etc.) must be consulted.Where certain values have high priority, e.g. if you'd like to see the string 'undef' appear before the sting '0'.
complex_compare
is pretty much equivalent to atomic_compare(...) || atomic_compare(...) || ...
, but is potentially less confusing than using the ||
or or
operators, and may be easier to code than repeating the atomic_compare
, $a
, $b
.
Function arguments
override_eq
: Prevents the function returning 0 immediately when the strings are identical. (The arguments are ordinarily tested for string equality at the beginning, in order to prevent unnecessary processing.) Setting this flag is only necessary if you have a condition which forces $a
or $b
to win for certain strings which are equal.
functions
: In <complex_compare>, an arrayref containing a hashref equivalent to the third value in atomic_compare
. Each function performs a comparison which executes and returns 0, 1, or -1. If the result of any function is nonzero, that is the return value. If it is zero, the next function is tried.
complex_compare ($a, $b, {
functions =>
[
{
n => sub{lc $_[0];}, #is \&lc possible?
},
{
n => sub{$_[0] =~ s/^the\s//; $_[0];},
t =>
[
{ qr/[^[:alpha:]]/ => 0 },
],
f => 1,
c => sub { $_[0] cmp $ [1] },
},
]
})
TODO
Add lots more dwimmery so that
qr//
expressions can be used wherever they're likely to be useful and coderefs can be substituted in for regexes where their function is to test.Make priority lists a bit easier, e.g. by allowing regexes - or even strings - amongst the hashrefs in
t
.Write some good, sensible examples for
complex_compare
.Gather useful common comparison functions which can be imported, studied, borrowed, etc. and offer them as a module.
Write more test cases.
Possibly remove assumptions about the comparanda, i.e. permit comparison of objects, references, etc. (But then: how does tokenisation work? Maybe it only works if
n=>sub{$_[0]->toString}
? Wouldn't we want to compare hashrefs and arrayrefs more intelligently?)Figure out how to get this to work as a Schwartzian transform.
BUGS
Please use the Github issues tracker.
LICENSE
Copyright 2011-2012 © Cambridge University Press. This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.