NAME
File::MergeSort - Merge sort ordered data files.
SYNOPSIS
use File::MergeSort;
## Create the MergeSort Object.
my $sort = File::MergeSort->new(
$file_list, # Reference to an array of file paths
\&index_extract_function # Reference to a subroutine that extracts a numeric index from a line
);
## Retrieves the next line for processing
my $line = $sort->next_line;
print "$line\n";
## Dumps remaining records in sorted order to a file. Default: <STDOUT>
$sort->dump( [file] );
DESCRIPTION
File::MergeSort provides an easy way to merge, parse, process and analyze data that is distributed across presorted files, using the well-known merge sort algorithm. The user supplies a list of file pathnames and a function that extracts a numeric index value from each record line. By calling the "next_line" or "dump" function, the user can retrieve the records in index order.
File::MergeSort is intended as a straightforward solution for situations where one wishes to merge data files with PRE-ORDERED records. An example might be application server logs from a cluster, each of which records events chronologically. If we want to examine, process or merge several such files while retaining the chronological order, then MergeSort is for you.
Here's how it works ...
As arguments, MergeSort takes a reference to an anonymous array of file paths and a reference to a subroutine that extracts an index value. The array lists the files to be merged, and the subroutine determines the sort order. When passed a line (i.e. a scalar) from one of the files, the user-supplied subroutine must return a numeric index value associated with that line. The records are then returned in ascending order of their index values.
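The contract described above can be illustrated with a small sketch. It assumes, purely for illustration, that every line starts with an epoch timestamp; the names epoch_index and is_presorted are hypothetical helpers and are not part of File::MergeSort. The second helper checks the precondition the module relies on, namely that each input is already in ascending index order.

```perl
use strict;
use warnings;

# Hypothetical index function: assumes each line begins with an
# epoch timestamp, e.g. "1054900000 GET /index.html".
sub epoch_index {
    my ($line) = @_;
    my ($epoch) = $line =~ /^(\d+)/
        or die "No timestamp in line: $line";
    return $epoch;
}

# Hypothetical helper: returns 1 when every line's index is greater
# than or equal to the index of the line before it, 0 otherwise.
sub is_presorted {
    my ( $lines, $index_sub ) = @_;
    my $prev;
    for my $line (@$lines) {
        my $idx = $index_sub->($line);
        return 0 if defined $prev && $idx < $prev;
        $prev = $idx;
    }
    return 1;
}
```

A subroutine reference such as \&epoch_index is what would be passed as the second constructor argument.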
More detail ...
For each file MergeSort opens an IO::File or IO::Zlib object. (MergeSort handles a mix of compressed and uncompressed files seamlessly by checking for a .z or .gz extension.) Initially the first line of each file is indexed according to the subroutine. A stack is created, ordered by these index values.
When the function 'next_line' is called, MergeSort returns the line with the lowest index value. MergeSort then replenishes the stack: it reads a new line from the corresponding file and places it in the proper position for the next call to 'next_line'.
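The stack mechanics described above can be sketched as follows. This is a minimal illustration and not the module's actual source: in-memory arrays stand in for the open files, and the names @sources, index_of and next_line_sketch are hypothetical. On equal index values the sketch places the new entry ahead of its peers, so the same source keeps being read; that tie rule is an assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for three presorted files.
my @sources = (
    [ '1 alpha',   '4 delta', '7 golf' ],
    [ '2 bravo',   '5 echo' ],
    [ '3 charlie', '6 foxtrot' ],
);

# Index function: the leading integer of each line.
sub index_of {
    my ($line) = @_;
    my ($n) = $line =~ /^(\d+)/;
    return $n;
}

# Prime the stack with the first line of every source, lowest index first.
my @stack;
for my $id ( 0 .. $#sources ) {
    my $line = shift @{ $sources[$id] };
    next unless defined $line;
    push @stack, { id => $id, line => $line, index => index_of($line) };
}
@stack = sort { $a->{index} <=> $b->{index} } @stack;

# Return the lowest-index line, then refill the stack from the same source.
sub next_line_sketch {
    my $top = shift @stack;
    return unless $top;
    my $next = shift @{ $sources[ $top->{id} ] };
    if ( defined $next ) {
        my $entry = { id => $top->{id}, line => $next, index => index_of($next) };
        # Find the first position whose index is not smaller than the new one.
        my $pos = 0;
        $pos++ while $pos < @stack && $stack[$pos]{index} < $entry->{index};
        splice @stack, $pos, 0, $entry;
    }
    return $top->{line};
}

# Drain the merge and print the lines in index order.
my @merged;
while ( defined( my $line = next_line_sketch() ) ) {
    push @merged, $line;
}
print "$_\n" for @merged;
```

The linear scan keeps the sketch short; with many input files a heap or binary search would find the insertion point faster.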
Additional Notes:
- A stable sort is implemented, i.e. a single file is read until its index is no longer the lowest value.
- If the file ends in .z or .gz then the file is opened with IO::Zlib instead.
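The extension check can be sketched as below. The helper names handle_class_for and open_source are hypothetical, and the case-sensitive pattern is an assumption; the module's own detection logic may differ. Only the documented constructors IO::File->new and IO::Zlib->new are used.

```perl
use strict;
use warnings;
use IO::File;

# Hypothetical helper: choose the handle class from the file extension.
sub handle_class_for {
    my ($path) = @_;
    return $path =~ /\.(?:gz|z)\z/ ? 'IO::Zlib' : 'IO::File';
}

# Hypothetical helper: open a file for reading with the matching class.
sub open_source {
    my ($path) = @_;
    my $class = handle_class_for($path);
    require IO::Zlib if $class eq 'IO::Zlib';    # loaded only when needed
    my $mode = $class eq 'IO::Zlib' ? 'rb' : 'r';
    my $fh = $class->new( $path, $mode )
        or die "Cannot open $path: $!";
    return $fh;
}
```

Both classes provide a getline method, so code that reads from the returned handle does not need to know which kind of file it is reading.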
EXAMPLE
# This program looks at files found
# in logfiles/ and returns the records of the
# files sorted by the date in mm-dd-yyyy
# format
use File::MergeSort;
my $files = [ 'logfiles/log_server_1.log' ,
'logfiles/log_server_2.log' ,
'logfiles/log_server_3.log'
];
my $ms = File::MergeSort->new($files, \&index_sub);
while (my $line = $ms->next_line) {
# ... some operations on $line ...
}
sub index_sub{
# Use this to extract a date of
# the form mm-dd-yyyy.
my $line = shift;
# Be careful that the pattern matches only
# the date, and fail loudly if it is missing.
$line =~ /(\d{2})-(\d{2})-(\d{4})/ or die "No date found in line: $line";
return "$3$1$2"; # Index is an integer, yyyymmdd
# Lower number will be read first.
}
TODO
Implement a generic test/comparison function to replace text/numeric comparison.
Implement a configurable record separator.
Allow for optional deletion of duplicate entries.
EXPORT
None by default.
AUTHOR
Chris Brown, chris.brown@cal.berkeley.edu
Copyright (c) 2003 Christopher Brown. All rights reserved. This program is free software; you can redistribute it and/or modify it under the terms of the license distributed with Perl. Not intended for evil purposes. Yadda, yadda, yadda ...