NAME
File::MergeSort - Mergesort ordered files.
SYNOPSIS
use File::MergeSort;
# List of files to merge
my @files = qw( foo bar baz.gz );
# Optional hash with options to modify behaviour of File::MergeSort.
my %opts = ( skip_empty_lines => 1 );
# Function to extract merge keys from lines in files.
my $extract = sub { return substr( $_[0], 0, 3 ) };
# Create the MergeSort object.
my $ms = File::MergeSort->new( \@files, $extract, \%opts );
# Retrieve each line for processing
while ( my $line = $ms->next_line() ) {
    # process $line ... or just print
    print $line;
}
# Alternatively, dump all records in sorted order to a file.
$ms->dump( $file ); # Omit $file to default to STDOUT
DESCRIPTION
File::MergeSort merges data files whose records are already sorted, with the option to process each record as it is read. A typical use is merging application server logs in which each machine in a cluster records events chronologically.
Merge keys are extracted from the input lines using a user defined subroutine. Comparisons on the keys are done lexicographically.
If IO::Zlib is installed, both plaintext and compressed (.z or .gz) files are catered for.
POINTS TO NOTE
ASCII order merging
Comparisons on the merge keys are carried out lexicographically. The user should ensure that the subroutine used to extract merge keys formats the keys if required so that they sort correctly.
Note that earlier versions (< 1.06) of File::MergeSort performed numeric, not lexicographical comparisons.
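As an illustration of formatting keys so that they sort correctly, the following minimal sketch zero-pads a numeric key to a fixed width. The record layout (a numeric id at the start of each line) and the width of 10 digits are assumptions made for the example only.

```perl
use strict;
use warnings;

# Hypothetical record layout: "<numeric id> <payload>".
my $extract = sub {
    my ($line) = @_;
    my ($id) = $line =~ /^(\d+)/
        or die "No numeric id found in line: $line";
    # Zero-pad to a fixed width so string order matches numeric order.
    return sprintf '%010d', $id;
};

my @lines = ( "9 alpha\n", "10 beta\n", "100 gamma\n" );

# Unpadded ids compared as strings: "10" and "100" sort before "9".
my @raw = sort { $a cmp $b } ( 9, 10, 100 );

# Padded keys compared as strings sort in numeric order.
my @padded = sort { $a cmp $b } map { $extract->($_) } @lines;

print "raw:    @raw\n";
print "padded: @padded\n";
```

A subroutine such as $extract above could be passed as the second argument to new when the merge key is numeric.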
IO::Zlib is optional
If IO::Zlib is installed, File::MergeSort will use it to handle compressed input files, but it is not necessary to install it if you do not wish to process compressed files.
If IO::Zlib is not installed and compressed files are specified as input files, File::MergeSort will raise an exception.
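A script that may be given compressed input can check for IO::Zlib before it constructs the object. This is a sketch of one common way to test whether a module can be loaded; it is not taken from File::MergeSort itself.

```perl
use strict;
use warnings;

# 1 if IO::Zlib can be loaded by this perl, 0 otherwise.
my $have_zlib = eval { require IO::Zlib; 1 } ? 1 : 0;

print $have_zlib
    ? "IO::Zlib is available: compressed input can be merged.\n"
    : "IO::Zlib is not available: use plaintext input only.\n";
```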
DETAILS
The user is expected to supply a list of file pathnames and a function to extract an index value from each record line (the merge key).
As arguments, File::MergeSort takes a reference to an anonymous array of file paths/names and a reference to a subroutine that extracts a merge key from a line.
For each file, File::MergeSort opens the file using IO::File, or IO::Zlib for compressed files. Compressed and uncompressed files can be mixed freely: any file with a .z or .gz extension is treated as compressed.
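The extension test can be pictured with the sketch below. The helper name looks_compressed is hypothetical, and the exact pattern (in particular, that the match is case-sensitive) is an assumption; the module's own check may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the path ends in .z or .gz.
sub looks_compressed {
    my ($path) = @_;
    return $path =~ /\.(?:z|gz)\z/ ? 1 : 0;
}

for my $file (qw( foo bar baz.gz )) {
    printf "%-8s %s\n", $file,
        looks_compressed($file) ? 'compressed' : 'plaintext';
}
```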
When passed a line (a scalar, passed as the first and only argument, $_[0]) from one of the files, the user supplied subroutine must return the merge key for the line. An exception will be raised if no merge key is returned.
When the next_line method is called, File::MergeSort returns the line with the lowest merge key/value.
The records are therefore output in ascending order of the merge keys returned by the user supplied subroutine. Where records with identical merge keys occur in multiple files, they are returned in the same order as the files were supplied to the constructor.
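The ordering rules described above can be sketched in plain Perl over in-memory arrays. This is an illustration only: merge_sorted is a hypothetical function, it scans every source on each step for simplicity, and it is not the module's implementation.

```perl
use strict;
use warnings;

# Merge presorted sources; on equal keys the earlier source wins.
sub merge_sorted {
    my ( $sources, $extract ) = @_;
    my @pos = (0) x @$sources;    # read position within each source
    my @out;
    while (1) {
        my ( $best, $best_key );
        for my $i ( 0 .. $#$sources ) {
            next if $pos[$i] > $#{ $sources->[$i] };    # source exhausted
            my $key = $extract->( $sources->[$i][ $pos[$i] ] );
            # Strict "lt" keeps the earliest source when keys are equal.
            if ( !defined $best || $key lt $best_key ) {
                ( $best, $best_key ) = ( $i, $key );
            }
        }
        last unless defined $best;    # every source is exhausted
        push @out, $sources->[$best][ $pos[$best]++ ];
    }
    return @out;
}

my @first   = ( "aaa from first\n",  "ccc from first\n" );
my @second  = ( "aaa from second\n", "bbb from second\n" );
my $key_sub = sub { substr( $_[0], 0, 3 ) };

print merge_sorted( [ \@first, \@second ], $key_sub );
```

Both "aaa" records are emitted before "bbb", with the record from the first source ahead of the one from the second, because the sources are listed in that order.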
If a simple merge is required, without any user processing of each line read from the input files, the dump method can be used to read and merge the input files into the specified output file, or to STDOUT if no file is specified.
SUBROUTINES/METHODS
- new( ARRAY_REF, CODE_REF [ , HASH_REF ] );
Create a new File::MergeSort object. There are two required arguments and one optional argument:
A reference to an array of files to read from (required). These files can be either plaintext or compressed. Any file with a .gz or .z suffix will be assumed to be compressed and will be opened using IO::Zlib.
A code reference (required). When called, the coderef should return the merge key for a line, which is given as the only argument to that subroutine.
A hash reference (optional). Supply additional options to modify the behaviour of File::MergeSort. Currently the only option is skip_empty_lines, which if true will cause File::MergeSort to silently skip over empty lines (those matching m/^$/). By default empty/blank lines will be processed no differently than any other. See EXAMPLES.
- next_line();
Returns the next line from the merged input files. Returns false once all files have been exhausted.
- dump( [ FILENAME ] );
Reads and merges from the input files to FILENAME, or STDOUT if FILENAME is not given, until all files have been exhausted.
Returns the number of lines output.
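To make the skip_empty_lines option of new more concrete, the sketch below applies the quoted pattern m/^$/ to a few sample lines. Only the pattern comes from the text above; the sample data is invented, and how the module applies the pattern internally is not shown here. Note that a line holding only a newline matches, while a line holding spaces does not.

```perl
use strict;
use warnings;

# Sample lines: data, newline only, empty string, spaces only, data.
my @lines = ( "data\n", "\n", "", "   \n", "more\n" );

# Keep the lines that do NOT match the empty-line pattern.
my @kept = grep { $_ !~ m/^$/ } @lines;

printf "%d of %d lines kept\n", scalar @kept, scalar @lines;
```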
EXAMPLES
# This program reads the files found in the logfiles directory and
# returns their records sorted by a date in mm-dd-yyyy format.
# Empty lines in the input files will be skipped.
use File::MergeSort;
my $files = [ qw( logfiles/log_server_1.log
                  logfiles/log_server_2.log
                  logfiles/log_server_3.log
                ) ];
my $opts = { skip_empty_lines => 1 };
my $sort = File::MergeSort->new( $files, \&index_sub, $opts );
while ( my $line = $sort->next_line() ) {
    # some operations on $line
}
sub index_sub {
    # Use this to extract a date of the form mm-dd-yyyy.
    my $line = shift;
    # Be cautious that only the date will be extracted.
    # Fail loudly if the line contains no date at all.
    my ( $mm, $dd, $yyyy ) = $line =~ /(\d{2})-(\d{2})-(\d{4})/
        or die "No mm-dd-yyyy date found in line: $line";
    return "$yyyy$mm$dd";    # Key is a fixed-width string, yyyymmdd.
                             # Keys compare as strings; lower sorts first.
}
# This slightly more compact example performs a simple merge of
# several input files with fixed width merge keys into a single
# output file.
use File::MergeSort;
my $files = [ qw( input_1 input_2 input_3 ) ];
my $extract = sub { substr($_[0], 15, 10 ) }; # To substr merge key out of line
my $sort = File::MergeSort->new( $files, $extract );
$sort->dump( "output_file" );
EXPORTS
Nothing: OO interface. See SUBROUTINES/METHODS.
AUTHOR
Original Author
Christopher Brown <ctbrown@cpan.org>.
Maintainer
Barrie Bremner http://barriebremner.com/.
Contributors
Laura Cooney.
LICENSE AND COPYRIGHT
Copyright (c) 2001-2003 Christopher Brown
2003-2010 Barrie Bremner
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
SEE ALSO
perl, IO::File, IO::Zlib, Compress::Zlib.
File::Sort or Sort::Merge as possible alternatives.