NAME

File::MergeSort - Mergesort ordered files.

SYNOPSIS

use File::MergeSort;

# List of files to merge
my @files = qw( foo bar baz.gz );

# Optional hash with options to modify behaviour of File::MergeSort.
my %opts = ( skip_empty_lines => 1 );

# Function to extract merge keys from lines in files.
my $extract = sub { return substr( $_[0], 0, 3 ) };

# Create the MergeSort object.
my $ms = File::MergeSort->new( \@files, $extract, \%opts );

# Retrieve each line for processing
while ( my $line = $ms->next_line() ) {
    # process $line ... or just print
    print $line;
}

# Alternatively, dump all records in sorted order to a file.
$ms->dump( $file );    # Omit $file to default to STDOUT

DESCRIPTION

File::MergeSort merges data files whose records are already sorted, with the option to process each record as it is read. An example is a set of application server logs from a cluster, each of which records events chronologically.

Merge keys are extracted from the input lines using a user defined subroutine. Comparisons on the keys are done lexicographically.

If IO::Zlib is installed, both plaintext and compressed (.z or .gz) files are catered for.

POINTS TO NOTE

ASCII order merging

Comparisons on the merge keys are carried out lexicographically. The user should ensure that the subroutine used to extract merge keys formats the keys if required so that they sort correctly.

Note that earlier versions (< 1.06) of File::MergeSort performed numeric, not lexicographical comparisons.
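Because keys are compared as strings, numeric keys of varying width need padding before they sort as expected. The following is a minimal sketch, assuming a hypothetical record layout in which each line starts with an integer ID; the name $padded_key and the width of 10 digits are illustrative choices, not part of File::MergeSort.

```perl
use strict;
use warnings;

# Hypothetical extractor: the record ID is assumed to be the run of
# digits at the start of each line.
my $padded_key = sub {
    my ($line) = @_;
    my ($id) = $line =~ /^(\d+)/
        or die "No numeric ID at start of line: $line";

    # Zero-pad to a fixed width so string order matches numeric order.
    return sprintf '%010d', $id;
};

# Unpadded, "10" sorts before "9" as a string; padded keys sort as intended.
my @lines  = ( "10 ten\n", "9 nine\n" );
my @sorted = sort { $padded_key->($a) cmp $padded_key->($b) } @lines;
print @sorted;
```

A code reference like this can be passed as the second argument to the constructor, provided the input files are themselves already in ascending ID order.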

IO::Zlib is optional

If IO::Zlib is installed, File::MergeSort will use it to handle compressed input files, but it is not necessary to install it if you do not wish to process compressed files.

If IO::Zlib is not installed and compressed files are specified as input files, File::MergeSort will raise an exception.
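To avoid the exception, a caller can check for IO::Zlib up front and filter the file list. This is a sketch only: looks_compressed is a hypothetical helper, and its suffix pattern is an assumption based on the .z and .gz extensions described here, not a copy of the module's own test.

```perl
use strict;
use warnings;

# Check once whether IO::Zlib can be loaded on this system.
my $have_zlib = eval { require IO::Zlib; 1 } ? 1 : 0;

# Assumed suffix test; the module's own pattern may differ in detail.
sub looks_compressed {
    my ($path) = @_;
    return $path =~ /\.(?:gz|z)\z/ ? 1 : 0;
}

# Keep compressed files only when IO::Zlib is available.
my @files  = qw( foo bar baz.gz );
my @usable = grep { $have_zlib || !looks_compressed($_) } @files;

print "Will merge: @usable\n";
```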

DETAILS

The user is expected to supply a list of file pathnames and a function to extract an index value from each record line (the merge key).

As arguments, the constructor takes a reference to an array of file paths and a reference to a subroutine that extracts a merge key from a line. An optional third argument is a hash reference of options.

File::MergeSort opens each file using IO::File, or IO::Zlib for compressed files. Compressed and uncompressed files can be mixed freely; compressed files are detected by their .z or .gz extension.

When passed a line (a scalar, passed as the first and only argument, $_[0]) from one of the files, the user supplied subroutine must return the merge key for the line. An exception will be raised if no merge key is returned.
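An extraction subroutine can also validate its input so that a malformed line fails with a descriptive message. The sketch below assumes a hypothetical tab-delimited layout in which the merge key is the first field; the layout and the error text are illustrative.

```perl
use strict;
use warnings;

# Hypothetical extractor: the merge key is assumed to be the first
# tab-delimited field of the line.
my $extract = sub {
    my ($line) = @_;    # the line arrives as $_[0]
    chomp $line;        # modifies the local copy only

    my ($key) = split /\t/, $line, 2;
    die "Line has no merge key: '$line'\n"
        unless defined $key && length $key;

    return $key;
};

print $extract->("2010-01-31\tserver started\n"), "\n";
```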

When the next_line method is called, File::MergeSort returns the line with the lowest merge key/value.

Records are output in ascending order of the merge keys returned by the user-supplied subroutine. Where records with identical merge keys occur in multiple files, they are returned in the same order as the files were supplied to the constructor.
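The ordering and tie-breaking rules can be illustrated with a small stand-alone merge. This is not File::MergeSort's implementation: merge_sorted is a hypothetical function that works on in-memory arrays of pre-sorted lines and uses a simple linear scan to find the lowest key.

```perl
use strict;
use warnings;

# Illustrative only: arrays of pre-sorted lines stand in for open files.
sub merge_sorted {
    my ( $sources, $extract ) = @_;

    # One cursor per source, pointing at its next unread line.
    my @pos = (0) x @$sources;
    my @out;

    while (1) {
        my ( $best, $best_key );
        for my $i ( 0 .. $#$sources ) {
            next if $pos[$i] > $#{ $sources->[$i] };    # source exhausted
            my $key = $extract->( $sources->[$i][ $pos[$i] ] );

            # Strict "lt" keeps the earlier source when keys are equal.
            if ( !defined $best || $key lt $best_key ) {
                ( $best, $best_key ) = ( $i, $key );
            }
        }
        last unless defined $best;    # every source is exhausted
        push @out, $sources->[$best][ $pos[$best]++ ];
    }
    return @out;
}

my $extract = sub { substr( $_[0], 0, 3 ) };
my @merged  = merge_sorted(
    [ [ "aaa one\n", "ccc one\n" ], [ "aaa two\n", "bbb two\n" ] ],
    $extract,
);
print @merged;
```

Both sources contain a line with the key "aaa"; the line from the first source is emitted first because the comparison only replaces the current candidate when a later key is strictly lower.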

If a simple merge is required, without any user processing of each line read from the input files, the dump method can be used to read and merge the input files into the specified output file, or to STDOUT if no file is specified.

SUBROUTINES/METHODS

new( ARRAY_REF, CODE_REF [ , HASH_REF ] );

Create a new File::MergeSort object.

There are two required arguments and one optional argument:

A reference to an array of files to read from (required). These files can be either plaintext, or compressed. Any file with a .gz or .z suffix will be assumed to be compressed and will be opened using IO::Zlib.

A code reference (required). When called, the coderef should return the merge key for a line, which is given as the only argument to that subroutine.

A hash reference (optional). Supply additional options to modify the behaviour of File::MergeSort. Currently the only option is skip_empty_lines, which if true will cause File::MergeSort to silently skip over empty lines (those matching m/^$/). By default empty/blank lines will be processed no differently than any other. See EXAMPLES.
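The pattern m/^$/ is narrower than "blank": it matches an empty string or a lone newline, but not a line containing only whitespace. The sketch below demonstrates the pattern on sample strings; how the module applies it internally is not shown here.

```perl
use strict;
use warnings;

my @lines = ( "\n", "", " \n", "data\n" );

# m/^$/ matches the empty string and a lone newline, because $ can
# match just before a trailing newline. Whitespace-only lines do not match.
my @kept = grep { !m/^$/ } @lines;

printf "%d of %d lines kept\n", scalar @kept, scalar @lines;
```

If whitespace-only lines should also be ignored, the extraction subroutine or the loop around next_line has to handle them.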

next_line();

Returns the next line from the merged input files. Returns false once all files have been exhausted.

dump( [ FILENAME ] );

Reads and merges from the input files to FILENAME, or STDOUT if FILENAME is not given, until all files have been exhausted.

Returns the number of lines output.

EXAMPLES

# This program reads the log files found in logfiles/ and returns
# their records sorted by date. Dates appear in the input lines in
# mm-dd-yyyy format.
# Empty lines in the input files will be skipped.

use File::MergeSort;

my $files = [ qw( logfiles/log_server_1.log
                  logfiles/log_server_2.log
                  logfiles/log_server_3.log
              ) ];

my $opts = { skip_empty_lines => 1 }; 

my $sort = File::MergeSort->new( $files, \&index_sub, $opts );

while ( my $line = $sort->next_line() ) {
   # some operations on $line
}

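sub index_sub {
  # Extract a date of the form mm-dd-yyyy and return it as a merge key.
  my $line = shift;

  # The first mm-dd-yyyy match is used, so make sure nothing else on
  # the line can match this pattern before the date does.
  my ( $mm, $dd, $yyyy ) = $line =~ /(\d{2})-(\d{2})-(\d{4})/
    or die "No mm-dd-yyyy date found in line: $line";

  return "$yyyy$mm$dd";  # Fixed-width key, yyyymmdd: compares
                         # lexicographically in date order.
}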


# This slightly more compact example performs a simple merge of
# several input files with fixed width merge keys into a single
# output file.

use File::MergeSort;

my $files   = [ qw( input_1 input_2 input_3 ) ];
my $extract = sub { substr( $_[0], 15, 10 ) };  # Merge key: 10 characters from offset 15

my $sort = File::MergeSort->new( $files, $extract );

$sort->dump( "output_file" );

EXPORTS

Nothing: OO interface. See SUBROUTINES/METHODS.

AUTHOR

Original Author

Christopher Brown <ctbrown@cpan.org>.

Maintainer

Barrie Bremner http://barriebremner.com/.

Contributors

Laura Cooney.

LICENSE AND COPYRIGHT

Copyright (c) 2001-2003 Christopher Brown
              2003-2010 Barrie Bremner

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

perl, IO::File, IO::Zlib, Compress::Zlib.

File::Sort or Sort::Merge as possible alternatives.