NAME

File::MergeSort - Merge sort ordered data files.

SYNOPSIS

  use File::MergeSort;


  ## Create the MergeSort object.
  my $sort = File::MergeSort->new(
                $file_list,                     # Reference to an array of file paths
                \&index_extract_function        # Reference to a subroutine that
                                                # returns a numeric index for a line
  );


  ## Retrieves the next line for processing
  my $line = $sort->next_line;  
  print "$line\n";


  ## Dumps remaining records in sorted order to a file.    Default: <STDOUT>    
  $sort->dump( [file] );        

DESCRIPTION

File::MergeSort provides an easy way to merge, parse, process and analyze data that is distributed across presorted files, using the well-known merge sort algorithm. The user supplies a list of file pathnames and a function that extracts a numeric index value from each record line. By calling the "next_line" or "dump" function, the user can retrieve the records in sorted order.

File::MergeSort is a hopefully straightforward solution for situations where one wishes to merge data files with PRE-ORDERED records. An example might be application server logs from a cluster, where each server records its events chronologically. If we want to examine, process or merge several such files but retain the chronological order, then MergeSort is for you.

Here's how it works ...

As arguments, MergeSort takes a reference to an array of file paths and a reference to a subroutine that extracts an index value. The array lists the files to be merged, and the subroutine determines the sort order. When passed a line (i.e. a scalar) from one of the files, the user-supplied subroutine must return a numeric index value associated with that line. The records are then returned in ascending order of their index values.
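To illustrate the contract for the index subroutine, here is a minimal sketch. It assumes each line begins with a timestamp of the form "YYYY-MM-DD hh:mm:ss"; the name timestamp_index, the line format and the fallback value of 0 are all hypothetical and not part of File::MergeSort.

```perl
use strict;
use warnings;

# Hypothetical index subroutine: converts a leading
# "YYYY-MM-DD hh:mm:ss" timestamp into the number YYYYMMDDhhmmss.
sub timestamp_index {
    my $line = shift;

    if ( $line =~ /^(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})/ ) {
        return "$1$2$3$4$5$6" + 0;    # force a numeric value
    }

    # Assumption: lines without a timestamp get index 0 and sort first.
    return 0;
}

print timestamp_index("2003-06-14 12:30:05 server1 started"), "\n";
```

Because the fields run from most to least significant, comparing the resulting numbers gives the same order as comparing the timestamps themselves.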

More detail ...

For each file, MergeSort opens an IO::File or IO::Zlib object. (MergeSort handles a mix of compressed and uncompressed files seamlessly by checking for a .z or .gz extension.) Initially, the first line of each file is indexed according to the subroutine. A stack is created based on these values.
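The choice of reader by file extension could look roughly like the sketch below. This is not the module's own code: open_source is a hypothetical helper, and the exact pattern used to detect compressed files (here a case-sensitive match on a trailing .z or .gz) is an assumption.

```perl
use strict;
use warnings;
use IO::File;
use IO::Zlib;

# Hypothetical helper: returns a read handle for $path, using IO::Zlib
# for names ending in .z or .gz and IO::File for everything else.
sub open_source {
    my $path = shift;

    my $fh = $path =~ /\.(?:z|gz)$/
        ? IO::Zlib->new( $path, 'rb' )
        : IO::File->new( $path, 'r' );

    die "Cannot open $path: $!" unless $fh;
    return $fh;
}
```

Both classes support getline, so the code that reads records does not need to know which kind of handle it was given.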

When the function 'next_line' is called, MergeSort returns the line with the lowest index value. MergeSort then replenishes the stack, reads a new line from the corresponding file and places it in the proper position for the next call to 'next_line'.

Additional Notes:

        - A stable sort is implemented, i.e. a single file is read until its index is no longer the lowest value.
        - If the file ends in .z or .gz then the file is opened with IO::Zlib instead.
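The stack mechanism described above can be sketched in plain Perl as follows. This is an illustration under stated assumptions, not the module's implementation: make_merger and its internals are hypothetical names, the sources are in-memory filehandles, and the tie-breaking rule is one reading of the note that a file is read until its index is no longer the lowest value.

```perl
use strict;
use warnings;

# Hypothetical sketch of a k-way merge over presorted sources.
# $handles: array ref of open filehandles, each already in order.
# $index:   code ref that returns a numeric index for a line.
# Returns a closure that yields the next line, or undef when done.
sub make_merger {
    my ( $handles, $index ) = @_;
    my @stack;    # pending entries, ordered by ascending index

    # Read one line from $fh and insert it ahead of any entry with an
    # equal or higher index, so a source keeps being read while it
    # still ties for the lowest value.
    my $refill = sub {
        my $fh   = shift;
        my $line = <$fh>;
        return unless defined $line;
        chomp $line;
        my $entry = { fh => $fh, line => $line, idx => $index->($line) };
        my $pos = 0;
        $pos++ while $pos < @stack && $stack[$pos]{idx} < $entry->{idx};
        splice @stack, $pos, 0, $entry;
    };

    # Prime in reverse so that earlier sources win ties on the first pass.
    $refill->($_) for reverse @$handles;

    return sub {
        my $entry = shift @stack or return;
        $refill->( $entry->{fh} );    # replenish from the same source
        return $entry->{line};
    };
}

# Usage with two in-memory "files" whose first field is the index.
my @data    = ( "1 a\n3 a\n", "2 b\n4 b\n" );
my @handles = map { open my $fh, '<', \$_ or die $!; $fh } @data;
my $next    = make_merger( \@handles, sub { ( split ' ', shift )[0] } );
while ( defined( my $line = $next->() ) ) {
    print "$line\n";
}
```

Only one line per source is held in memory at a time, which is why the inputs must already be sorted: the merge never looks further ahead than the current head of each file.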

EXAMPLE

   # This program looks at the files found
   # in logfiles/ and returns their records
   # sorted by the date, which appears in
   # mm-dd-yyyy format.

  use File::MergeSort;

        
  my $files = [ 'logfiles/log_server_1.log',
                'logfiles/log_server_2.log',
                'logfiles/log_server_3.log',
              ];

  my $ms = File::MergeSort->new($files, \&index_sub);
        
  while (my $line = $ms->next_line) {
    # ... some operations on $line ...
  }



  sub index_sub {

    # Use this to extract a date of
    # the form mm-dd-yyyy.

    my $line = shift;

    # Be cautious that only the date will be
    # extracted. Lines without a date get index 0
    # here (an arbitrary choice) and sort first.
    $line =~ /(\d{2})-(\d{2})-(\d{4})/ or return 0;

    return "$3$1$2";  # Index is an integer, yyyymmdd.
                      # Lower numbers will be read first.
  }
        

TODO

        Implement a generic test/comparison function to replace text/numeric comparison.
        Implement a configurable record separator.
        Allow for optional deletion of duplicate entries.

EXPORT

None by default.

AUTHOR

Chris Brown, chris.brown@cal.berkeley.edu

Copyright(c) 2003 Christopher Brown. All rights reserved. This program is free software; you can redistribute it and/or modify it under the terms of the License, distributed with PERL. Not intended for evil purposes. Yadda, yadda, yadda ...

SEE ALSO

perl. IO::File. IO::Zlib. Compress::Zlib.