NAME

File::MergeSort - Merge sort ordered data files.

SYNOPSIS

use File::MergeSort;


## Create the MergeSort object.
my $sort = File::MergeSort->new(
              $file_list,               # Reference to an array of file paths
              \&index_extract_function  # Reference to a subroutine that
                                        # extracts the index value from a line
);


## Retrieve the next line for processing.
my $line = $sort->next_line;
print "$line\n";


## Dump the remaining records in sorted order to a file. Default: STDOUT
$sort->dump( [file] );

DESCRIPTION

File::MergeSort provides an easy way to merge, parse, process and analyze data that is distributed across presorted files, using the well-known merge sort algorithm. The user supplies a list of file pathnames and a function that extracts a numeric index value from each record line. By calling the "next_line" or "dump" function, the user can retrieve the records in order.

File::MergeSort is a hopefully straightforward solution for situations where one wishes to merge data files with PRE-ORDERED records. An example might be application server logs from a cluster, where each server records its events chronologically. If you want to examine, process or merge several such files while retaining the chronological order, then MergeSort is for you.

Here's how it works ...

As arguments, MergeSort takes a reference to an array of file paths and a reference to a subroutine that extracts an index value. The array lists the files to be merged, and the subroutine determines the sort order. When passed a line (i.e. a scalar) from one of the files, the user-supplied subroutine must return a numeric index value associated with that line. The records are then returned in ascending order of their index values.
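As an illustration of the kind of subroutine described above, here is a minimal sketch of an index extractor. The record layout ("YYYY-MM-DD HH:MM:SS message") and the name timestamp_index are assumptions made for this example only; they are not part of File::MergeSort.

```perl
use strict;
use warnings;

# Hypothetical record layout: "YYYY-MM-DD HH:MM:SS message ..."
# Returns a numeric index of the form yyyymmddhhmmss.
sub timestamp_index {
    my $line = shift;

    # Capture the six numeric fields of the leading timestamp.
    # In list context a failed match yields an empty list, which
    # triggers the die.
    my @fields = $line =~ /^(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})/
        or die "No timestamp found in record: $line";

    # Concatenate the fields so that numeric order matches
    # chronological order.
    return join '', @fields;
}

print timestamp_index("2003-04-17 12:30:05 server started"), "\n";
```

A reference to such a subroutine (\&timestamp_index) would be passed as the second argument to the constructor. Dying on an unparsable record is one possible policy; returning a sentinel value is another.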

More detail ...

For each file, MergeSort opens an IO::File or IO::Zlib object. (MergeSort handles mixed compressed and uncompressed files seamlessly by checking for .z or .gz file extensions.) Initially, the first line of each file is indexed according to the subroutine. A stack is created based on these values.

When the function 'next_line' is called, MergeSort returns the line with the lowest index value. MergeSort then replenishes the stack: it reads a new line from the corresponding file and places it in the proper position for the next call to 'next_line'.
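The stack-and-replenish procedure described above can be sketched in plain Perl. This is not the module's actual implementation: in-memory arrays stand in for open files, and the names merge_sorted and $leading_int are hypothetical.

```perl
use strict;
use warnings;

# Merge several pre-sorted lists of records. Each source is an array
# reference standing in for an open file (an assumption of this sketch).
sub merge_sorted {
    my ($sources, $index_sub) = @_;

    # Prime the stack with the first record of every non-empty source.
    my @stack;
    for my $id (0 .. $#$sources) {
        my @queue = @{ $sources->[$id] };   # copy, so the input is untouched
        next unless @queue;
        my $line = shift @queue;
        push @stack, {
            id    => $id,
            line  => $line,
            index => $index_sub->($line),
            rest  => \@queue,
        };
    }

    # Order the stack by index; ties fall back to the source position.
    @stack = sort { $a->{index} <=> $b->{index} || $a->{id} <=> $b->{id} } @stack;

    my @merged;
    while (@stack) {
        # The head of the stack always holds the lowest index.
        my $top = shift @stack;
        push @merged, $top->{line};

        # Replenish from the same source, if it has records left.
        next unless @{ $top->{rest} };
        my $line  = shift @{ $top->{rest} };
        my $entry = { %$top, line => $line, index => $index_sub->($line) };

        # Insert before the first entry whose index is not lower, so a
        # source keeps being read while its index is still the lowest.
        my $pos = 0;
        $pos++ while $pos < @stack && $stack[$pos]{index} < $entry->{index};
        splice @stack, $pos, 0, $entry;
    }
    return @merged;
}

# Example index sub: the leading integer of each record.
my $leading_int = sub { my ($n) = $_[0] =~ /^(\d+)/; return $n };

my @out = merge_sorted(
    [ [ '1 a', '4 a', '7 a' ], [ '2 b', '4 b' ], [ '3 c' ] ],
    $leading_int,
);
print "$_\n" for @out;
```

The linear scan used for insertion keeps the sketch short; with many input files, a binary search or a heap would find the insertion point faster.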

Additional notes:

- A stable sort is implemented, i.e. a single file is read until its index is no longer the lowest value.
- If the file ends in .z or .gz, then the file is opened with IO::Zlib instead.
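The extension check in the second note might look like the following sketch. The helper names is_compressed and open_record_file are hypothetical, and the case-sensitive match is an assumption; the module's own detection logic may differ.

```perl
use strict;
use warnings;
use IO::File;

# True when the file name ends in .z or .gz (case-sensitive here,
# which is an assumption of this sketch).
sub is_compressed {
    my $path = shift;
    return $path =~ /\.(?:z|gz)$/ ? 1 : 0;
}

# Return a read handle for the file: IO::Zlib for compressed files,
# IO::File otherwise. Both provide a getline method.
sub open_record_file {
    my $path = shift;

    my $fh;
    if ( is_compressed($path) ) {
        require IO::Zlib;    # loaded only when actually needed
        $fh = IO::Zlib->new( $path, 'rb' );
    }
    else {
        $fh = IO::File->new( $path, 'r' );
    }

    die "Unable to open $path: $!" unless $fh;
    return $fh;
}
```

A caller would then read records with $fh->getline regardless of which class was chosen.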

EXAMPLE

   # This program looks at the files found
   # in logfiles/ and processes their records
   # sorted by the date in mm-dd-yyyy
   # format.

  use File::MergeSort;

	
  my $files = [ 'logfiles/log_server_1.log',
                'logfiles/log_server_2.log',
                'logfiles/log_server_3.log',
              ];

  my $ms = File::MergeSort->new($files, \&index_sub);
	
  while (my $line = $ms->next_line) {
    # ... some operations on $line ...
  }



  sub index_sub {

    # Use this to extract a date of
    # the form mm-dd-yyyy.

    my $line = shift;

    # Be cautious that only the date will be
    # extracted: the first mm-dd-yyyy pattern
    # found on the line is the one used.
    $line =~ /(\d{2})-(\d{2})-(\d{4})/
      or die "No mm-dd-yyyy date in record: $line";

    return "$3$1$2";  # Index is an integer, yyyymmdd.
                      # Lower numbers will be read first.
  }
	

TODO

Implement a generic test/comparison function to replace text/numeric comparison.
Implement a configurable record separator.
Allow for optional deletion of duplicate entries.

EXPORT

None by default.

AUTHOR

Chris Brown, chris.brown@cal.berkeley.edu

Copyright(c) 2003 Christopher Brown. All rights reserved. This program is free software; you can redistribute it and/or modify it under the terms of the License, distributed with PERL. Not intended for evil purposes. Yadda, yadda, yadda ...

SEE ALSO

perl. IO::File. IO::Zlib. Compress::Zlib.