NAME
File::Collector - Base class for custom File::Collector classes for classifying files and calling File::Collector::Processor methods for processing files
VERSION
version 0.029
OVERVIEW
File::Collector
and its companion module File::Collector::Processor
are base classes designed to make it easier to create custom modules for classifying and processing a collection of files as well as generating and processing data related to files in the collection.
For example, let's say you need to import raw files from one directory into some kind of repository. Let's say that files in the directory need to be filtered and the content of the files needs to be parsed, validated, rendered and/or changed before getting imported. Complicating things further, let's say that the name and location of the file in the target repository is dependent upon the content of the files in some way and that you also have to check to make sure the file hasn't already been imported into the repository.
This kind of task can be acomplished with a series of one-off scripts that process and import your files in stages. Each script produces output suitable for the next one. But running separate scripts for each processing stage can be slow, tedious, error-prone and a headache to maintain and organize.
The File::Collector
and File::Collector::Processor
base modules make it trivial to chain file processing modules into one logical package to make complicated file processing more robust, testable, and simpler to code.
SYNOPSIS
There are three steps to using File::Collector
. First, create at least one Collector
class for classifying and filtering files. Next, create a Processor
class your Collector
class will use to process the classified files. Finally, write a script to construct a new File::Collector
object to collect and process your files.
Step 1: Create the Collector
classes
package File::Collector::YourCollector;
use strict; use warnings;
# Here we add in the package containing the processing methods associated
# with the Collector (see below).
use File::Collector::YourCollector::Processor;
# Objects can store information about the files in the collection which
# can be accessed by other Collector and Processor classes.
use SomeObject;
# Add categories for file collections with the _init_processors method. These
# categories are used as labels for Processor objects which contain
# information about the files and can run methods on them. In the example
# below, we add two file collection categories, "good" and "bad."
sub _init_processors {
return qw ( good bad );
}
# Next we add a _classify_file method that is called once for each file added
# to our Collector object. The primary job of this method is to add files and
# any associated objects to a Processor for further processing.
sub _classify_file {
my $s = shift;
# First, we create an object and associate it with our file using the
# _add_obj method. There is no requirement that you create objects but they
# will make data about your file easily available to other classes.
# Offloading as much logic as possible to objects will keep classes simple.
# Note how we pass the name of the current file being processed to the
# object by using the "selected" method which intelligently generates the
# full path to the file currently being processed by _classify_file. Also
# note that we don't have to bother passing the name of the file to _add_obj
# method since this method can figure out which file is being processed by
# calling the "selected" method as well.
my $data = SomeObject->new( $s->selected );
$s->_add_obj('data', $data);
# Now that we know something about our file, we can classify the files
# according to any criteria of our choosing.
# to a processor category
if ( $data->{has_good_property} ) {
$s->_classify('good');
} else {
$s->_classify('bad');
}
}
# Finally, the _run_processes method contains method calls to your
# Processor methods.
sub _run_processes {
my $s = shift;
# Below are the methods we can run on the files in our collection. The
# "good_files" method returns the collection of files classified as "good"
# and the "do" method is a method that automatically iterates over the
# files. The "modify" method is one of the methods in our Processor class
# (see below).
$s->good_files->do->modify;
# Run methods on files classified as "bad"
$s->bad_files->do->fix;
$s->bad_files->do->modify;
# You can call methods found in any of the earlier Processor classes in your
# file processing chain.
$s->good_files->do->import;
$s->bad_files->do->import;
}
Step 2: Create your Processor
classes.
# The Processor class must have the same package name as the Collector class
# but with "::Processor" tacked on to the end.
package File::Collector::YourCollector::Processor;
# This line is required to get access to the methods from the base class.
use parent 'File::Collector::Processor';
# This custom method is run once for each file in a collection when we use
# the "do" method.
sub modify {
my $s = shift;
# Skip the file if it has already been processed.
next if ($s->attr_defined ( 'data', 'processed' ));
# Properties of objects added by Collector classes can be easily accessed.
my @values = $s->get_obj_prop ( 'data', 'header_values' );
# You can call methods found inside objects, too. Here we run the
# add_header() method on the data object and pass on values to it.
$s->obj_meth ( 'data', 'add_header', \@values );
}
# We can add as many additional custom methods as we need.
sub fix {
...
}
Step 3: Construct the Collector
Once your classes have been created, you can run all of your collectors and processors simply by constructing a Collector
object.
The constructor takes three types of arguments: a list of the files and/or directories you want to collect; an array of the names of the Collector
classes you wish to use in the order you wish to employ them; and finally, an option hash, which is optional.
my $collector = File::Collector::YourClassifier->new(
# The first arguments are a list of resources to be added
'my/dir', 'a_file.txt'
# The second argument is an array of Collector class names listed in the
# same order you want them to run
[ 'File::Collector::First', 'File::Collector::YourCollector'],
# Finally, an optional hash argument for options can be supplied
{ recurse => 0 });
# The C<$collector> object has some useful methods:
$collector->get_count; # returns the total number of files in the collection
# Convenience methods with a little under-the-hood magic make it painless to
# iterate over files and run methods on them.
while ($collector->next_good_file) {
$collector->print_short_name;
}
# Iterators can be easily created from C<Processor> objects:
my $iterator = $s->get_good_files;
while ( $iterator->next ) {
# run C<Processor> methods and do other stuff to "good" files
$iterator->modify_file;
}
DESCRIPTION
Collector Methods
The methods can be run on Collector
objects after they've been constructed.
new( $dir, $file, ..., [ @custom_collector_classes ], \%opts )
new( $dir, $file, ..., [ @custom_collector_classes ] )
new( $dir, $file, ..., )
my $collector = File::Collector->new( 'my/directory',
[ 'Custom::Classifier' ]
{ recurse => 0 } );
Creates a Collector
object that collects files from the directories and files in the argument list. Once collected, the files are processed by each of the @custom_collector_classes
in the order supplied by the array argument. An option hash can be supplied to turn directory recursion off with by setting recurse
to false.
new
returns an object which contains all the files, their processing classes, and any data you have associated with the files. This object has serveral methods that can be used to inspect the object.
add_resources( $dir, $file, ... )
$collector->add_resources( 'myfile1.txt', '/my/home/dir/files/', ... );
Adds additional file resources to an existing collection and processes them. This method accepts no option hash and the same one supplied to the new constructor is used.
get_count()
$collector->get_count;
Returns the total number of files in the collection.
get_files()
my @all_files = $collector->get_files;
Returns a list of the full path of each file in the collection.
get_file( $file_path )
my $file = $collector->get_file( '/full/path/to/file.txt' );
Returns a reference of the data and objects associated with a file.
list_files_long()
Prints the full path names of each file in the collection, sorted alphabetically, to STDOUT.
list_files()
Same as list_files_long
but prints the files' paths relative to the top level directory shared by all the files in the collections.
next_FILE_CATEGORY_file()
while ($collector->next_good_file) {
my $file = $collector->selected;
...
}
Retrieves the first file from the the collection of files indicated by FILE_CATEGORY
. Each subsequent next
call iterates over the list of files. Returns a boolean false when the file is exhausted. Provides an easy way to iterate over files and perform operations on them.
FILE_CATEGORY
must be a valid processor name as supplied by one of the _init_processors
method.
FILE_CATEGORY_files()
my $processor = $collector->good_files;
Returns the File::Processor
object for the category indicated by FILE_CATEGROY
.
get_FILE_CATEGORY_files()
Similar to FILE_CATEGORY_files()
except a shallow clone of the File::Processor
object is returned. Useful if you require separate iterators for files in the same category.
FILE_CATEGORY
must be a valid processor name as supplied by one of the _init_processors
method.
isa_FILE_CATEGORY_file()
Returns a boolean value reflecting whether the file being iterated over belongs to a category.
Private Methods
The following private methods are used in a child classes of File::Collector
you provide. See the SYNOPSIS for examples of these methods in use.
_init_processors()
sub _init_processors {
return qw ( 'category_1', 'category_2' );
}
Creates new file categories. Internally, this method adds a new Processor object to the Collector
for each category added so that Processor
methods from custom Processor
classes can be run on individual categories of files.
_classify_file()
sub _classify_file {
my $s = shift;
# File classifying and analysis logic goes here
}
Use this method to classify files and to associate objects with your files using the methods provided by the Collector
class. This method is run once for each file in the collection.
_run_processes()
sub _run_processes {
my $s = shift;
# Processor method calls go here
}
In this method, you should place various calls to Processor
s methods.
_classify( $category_name )
This method is typically called from within the _classify_file
method. It adds the file currently getting pocessed to a collection of $category_name
files contained within a Processor
object which, in turn, belongs to the Collector
object. The $category_name
must match one of the processor names provided by the _init_processor
methods.
_add_obj( $object_name, $object )
Like the _classify
method, this method is typically called from within the _classify_file
method. It associates the object specified by $object
to an arbitrary name, specified by $object_name
, with the file currently getting processed.
Iteration Methods
These methods can be used while iterating over a collection of files. See the SYNOPSIS for some examples of these methods in use.
get_obj_prop( $obj_name, $property_name )
Returns the contents of an object's property.
get_obj( $obj_name )
Returns an object associated with a file.
set_obj_prop( $obj_name, $property_name, $value )
Sets an object's property.
obj_meth( $obj_name, $method_name, $method_args );
Runs the $method_name
method on the object specified in $obj_name
. Arguments are passed via $method_args
.
get_filename()
Retrieves the name of file being processed without the path.
selected()
Returns the full path and file name of the file being processed.
has_obj( $obj_name );
Returns a boolean value reflecting the existence of the obj in $obj_name
.
attr_defined( $obj_name, $attr_name )
Returns a boolean value reflecting if the atrribute specified by $attr_name
is defined in the $obj_name
object.
print_short_name()
Returns a shortened path, relative to all the files in the entire collection, and the file name of the file being processed or in an iterator.
REQUIRES
SUPPORT
Perldoc
You can find documentation for this module with the perldoc command.
perldoc File::Collector
Websites
The following websites have more information about this module, and may be of help to you. As always, in addition to those websites please use your favorite search engine to discover more resources.
MetaCPAN
A modern, open-source CPAN search engine, useful to view POD in HTML format.
Source Code
The code is open to the world, and available for you to hack on. Please feel free to browse it and play with it, or whatever. If you want to contribute patches, please send me a diff or prod me to pull from your repository :)
https://github.com/sdondley/File-Collector
git clone git://github.com/sdondley/File-Collector.git
BUGS AND LIMITATIONS
You can make new bug reports, and view existing ones, through the web interface at https://github.com/sdondley/File-Collector/issues.
INSTALLATION
See perlmodinstall for information and options on installing Perl modules.
SEE ALSO
AUTHOR
Steve Dondley <s@dondley.com>
Special Thanks
Thanks to all the generous monks at the PerlMonks community for patiently answering my (sometimes asinine) questions. A very special mention goes to jcb whose advice was invaluable to improving the quality of this module. Another shout out to Hippo for the suggesting Role::Tiny to help make the module more flexible.
COPYRIGHT AND LICENSE
This software is copyright (c) 2019 by Steve Dondley.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.