NAME

Sort::External::Cookbook - Sort huge lists efficiently.

DESCRIPTION

Sort::External has a special affinity for the GRT or Guttman-Rosler Transform, also known as the "packed-default" sort. The algorithm is examined at length in "A Fresh Look at Efficient Perl Sorting", a paper by Uri Guttman and Larry Rosler, located as of this writing at http://www.sysarch.com/perl/sort_paper.html.

For many applications, the GRT is the most efficient Perl sorting transform. This document explores how to use it to solve common coding problems in conjunction with either Perl's built-in sort() function or Sort::External.

GRT Fundamentals

Eliminate the sortsub!

Perl's sortsubs are extremely flexible and powerful, but they are also computationally expensive. If you can eliminate the sortsub entirely from your algo, major optimizations kick in as Perl decides that it can perform the whole sort at the C level without leaving to execute Perl code.

Like Perl's native sort() function, Sort::External really, really likes to use the default ascending lexical sort. If you are chasing efficiency, you should do everything you can to avoid supplying Sort::External with a sortsub.

Encode, sort, decode.

The standard technique for sorting a hash by value in Perl is to use a sortsub:

my %colors = (
    "Granny Smith"     => "green",
    "Golden Delicious" => "yellow",
    "Pink Lady"        => "pink",
);
my @sorted_by_color = sort { $colors{$a} cmp $colors{$b} } keys %colors;

# Result:
# @sorted_by_color = ( "Granny Smith", "Pink Lady", "Golden Delicious" );

To accomplish the same goal with a GRT, we use a three-step process:

# 1. Encode
my @sortkeys;
while ( my ( $variety, $color ) = each %colors ) {
    push @sortkeys, sprintf( "%-6s%-16s", $color, $variety );
}
# @sortkeys = (
#     "green Granny Smith    ",
#     "yellowGolden Delicious",
#     "pink  Pink Lady       ",
# );

# 2. Sort
my @sorted_by_color = sort @sortkeys;    # no sortsub!

# 3. Decode
@sorted_by_color = map { substr( $_, 6 ) } @sorted_by_color;

When substituting Sort::External for sort(), steps 2 and 3 need to be modified:

# 1. Encode
my @sortkeys;
while ( my ( $variety, $color ) = each %colors ) {
    push @sortkeys, sprintf( "%-6s%-16s", $color, $variety );
}

# 2. Sort
my $sortex = Sort::External->new( mem_threshold => 2**24 );
$sortex->feed(@sortkeys);
$sortex->finish;

# 3. Decode
my @sorted_by_color;
while ( defined( $_ = $sortex->fetch ) ) {
    push @sorted_by_color, substr( $_, 6 );
}

For the sake of brevity, we'll use sort() instead of Sort::External for most of our examples.

Encoding techniques

sprintf()

The sprintf technique for encoding sortkeys is the most straightforward and flexible. In addition to preparing multi-field text data for a lexical sort, it can be used to put numerical data into a string form where the lexical sort doubles as a numerical sort.

my %prices = (
    "Granny Smith"     => 1.69,
    "Pink Lady"        => 1.99,
    "Golden Delicious" => .79,    # on sale!
);
my @sortkeys;
while ( my ( $variety, $price ) = each %prices ) {
    push @sortkeys, sprintf( "%1.2f%-16s", $price, $variety );
}
# @sortkeys = (
#     "1.69Granny Smith    ",
#     "0.79Golden Delicious",
#     "1.99Pink Lady       ",
# );

There are two potential problems with the sprintf technique.

First, you must know the maximum width of all fields in advance. If you don't, you'll have to run through your data in a wasteful "characterization pass" solely for the purpose of measuring field widths.

Second, if the fields have a long maximum width but the content is sparse, sprintf-generated sortkeys take up a lot of unnecessary space. With Sort::External, that means extra disk i/o and reduced performance.

null-termination

An alternative to using sprintf to create fixed-width data is to separate fields with null bytes.

while ( my ( $variety, $color ) = each %colors ) {
    push @sortkeys, "$color\0$variety";
}
my @sorted = sort @sortkeys;
s/.*?\0// for @sorted;

Null-termination often saves space over sprintf -- however, if your data may contain null bytes, you have to be very careful about how you use it.

pack

pack() can serve as a less-flexible alternative to sprintf() for encoding fixed-width strings. More usefully, it can also encode positive integers into a compact sortable form. The four integer pack formats which sort correctly are:

# c => signed char                    maximum value 127
# C => unsigned char                  maximum value 255
# n => 16-bit network short           maximum value 65_535
# N => 32-bit network unsigned long   maximum value 4_294_967_295

my %first_introduced = (
    "Granny Smith"     => 1868,
    "Pink Lady"        => 1970,
    "Golden Delicious" => 1890,
);
while ( my ( $variety, $year ) = each %first_introduced ) {
    push @sortkeys, ( pack( 'n', $year ) . $variety );
}
my $sortex = Sort::External->new( mem_threshold => 2**24 );
$sortex->feed(@sortkeys);
$sortex->finish;
my @sorted;
while ( defined( $_ = $sortex->fetch ) ) {
    push @sorted, substr( $_, 2 );
}

If you have a lot of large integers to sort, pack() is considerably faster and more space-efficient than sprintf().

Combining encoding techniques

It is often the case that a combination of techniques yields the most efficient sortkey encoding and decoding. Here is a contrived example:

my %apple_stats = (
    "Granny Smith" => {
        color => "green",
        price => 1.69,
        year  => 1868,
    },
    "Pink Lady" => {
        color => "pink",
        price => 1.99,
        year  => 1970,
    },
    "Golden Delicious" => {
        color => "yellow",
        price => .79,
        year  => 1890,
    },
);
my @sortkeys;
while ( my ( $variety, $stats ) = each %apple_stats ) {
    push @sortkeys,
        (     $stats->{color} . "\0\0"
            . sprintf( "%1.2f", $stats->{price} )
            . pack( 'n', $stats->{year} ) . "\0\0"
            . $variety );
}
my @sorted = sort @sortkeys;
my @sorted_varieties = map { s/.*\0\0//; $_ } @sorted;

sprintf() would be a much better choice in this particular case, since it makes it easier to recover the all the fields using unpack():

while ( my ( $variety, $stats ) = each %apple_stats ) {
    push @sortkeys,
        sprintf( "%-6s%1.2f%04d%-16s",
        @{$stats}{ 'color', 'price', 'year' }, $variety );
}
my @sorted = sort @sortkeys;
for (@sorted) {
    my ( %result, $variety );
    ( @result{ 'color', 'price', 'year' }, $variety )
        = unpack( 'a6 a4 a4 a16', $_ );
    $_ = { $variety => \%result };
}

Extended sorting techniques

Descending sort order using the "bitwise not" operator

You can reverse the order of a lexical sort by flipping all the bits on your sortkeys.

while ( my ( $variety, $color ) = each %colors ) {
    push @sortkeys, sprintf( "%-6s%-16s", $color, $variety );
}
@sortkeys = map { ~$_ } @sortkeys;
my @sorted = sort @sortkeys;
@sorted = map { ~$_ } @sorted;
@sorted = map { substr( $_, 5 ) } @sorted;

Multi-field sorting with mixed asc/desc order

Multi-field GRT sorts are usually as straightforward as concatenating the fields using either sprintf or null-termination. However, if you need both ascending and descending sub-sorts within the same sort, special care is required. Null-termination is generally not an option, because bit-flipped fields may contain null bytes.

# sort by:
#     color descending,
#     price ascending,
#     year  descending

while (my $variety, $stats) = each %apple_stats) {
    push @sortkeys, (
        (~ sprintf( "%-6s",  stats->{color}  ) ) .
           sprintf( "%1.2f", $stats->{price} )   .
        (~ sprintf( "%04d",  stats->{year}   ) ) .
           sprintf( "%-16s", $variety        )
    );
}
my @sorted = sort @sortkeys;

Case-insensitive sorting

Case-insensitive sorting is something of a weakness for the GRT. The only way to avoid a sortsub is to double-encode:

while ( my ( $variety, $color ) = each %colors ) {
    push @sortkeys, ( lc($variety) . "\0\0" . $variety );
    push @sortkeys, ( lc($color) . "\0\0" . $color );
}
my @sorted = sort @sortkeys;
@sorted = map { s/.*?\0\0//; $_ } @sorted;

# @sorted = ( 
#     "Golden Delicious",
#     "Granny Smith",
#     "green",
#     "pink",
#     "Pink Lady",
#     "yellow",
# );

A challenge

Ken Clarke has identified a case where there seems to be no elegant GRT solution:

# Each row contains multiple fields
# AND is impossible to know the field lengths in advance
# AND there are both ascending and descending subsorts

If you can think of a way to perform this sort without a characterization pass, please write the author.

ACKNOWLEDGEMENTS

This document arose out of discussions with Ken Clarke.

Uri Guttman and Larry Rosler, as mentioned above.

SEE ALSO

Sort::External

AUTHOR

Marvin Humphrey <marvin at rectangular dot com> http://www.rectangular.com

COPYRIGHT AND LICENSE

Copyright 2005-2008 Marvin Humphrey. All rights reserved. This module is free software. It may be used, redistributed and/or modified under the same terms as Perl itself.