NAME

File::Rsync::Mirror::Recentfile - mirroring via rsync made efficient

SYNOPSIS

Writer (of a single file):

use File::Rsync::Mirror::Recentfile;
my $rf = File::Rsync::Mirror::Recentfile->new
  (
   interval => q(6h),
   filenameroot => "RECENT",
   comment => "These 'RECENT' files are part of a test of a new CPAN mirroring concept. Please ignore them for now.",
   localroot => "/home/ftp/pub/PAUSE/authors/",
   aggregator => [qw(1d 1W 1M 1Q 1Y Z)],
  );
$rf->update("/home/ftp/pub/PAUSE/authors/id/A/AN/ANDK/CPAN-1.92_63.tar.gz","new");

Reader/mirrorer:

my $rf = File::Rsync::Mirror::Recentfile->new
  (
   filenameroot => "RECENT",
   interval => q(6h),
   localroot => "/home/ftp/pub/PAUSE/authors",
   remote_dir => "",
   remote_host => "pause.perl.org",
   remote_module => "authors",
   rsync_options => {
                     compress => 1,
                     'rsync-path' => '/usr/bin/rsync',
                     links => 1,
                     times => 1,
                     'omit-dir-times' => 1,
                     checksum => 1,
                    },
   verbose => 1,
  );
$rf->mirror;

Aggregator (usually the writer):

my $rf = File::Rsync::Mirror::Recentfile->new_from_file ( $file );
$rf->aggregate;

DESCRIPTION

This class is lower level than F:R:M:Recent and handles a single recentfile. A tree is always composed of several recentfiles, which are controlled by the F:R:M:Recent object; the Recentfile object does the bookkeeping for a single timeslice.

EXPORT

No exports.

CONSTRUCTORS / DESTRUCTOR

my $obj = CLASS->new(%hash)

Constructor. In every argument pair the key is a method name and the value is the argument passed to that method.

If a recentfile for this resource already exists, metadata that are not defined by the constructor will be fetched from it as soon as it is read by recent_events().

my $obj = CLASS->new_from_file($file)

Constructor. $file is a recentfile.

DESTROY

A simple unlock.

ACCESSORS

aggregator

A list of interval specs that tell the aggregator which recentfiles are to be produced.

canonize

The name of a method to canonize the path before rsyncing. Only supported value is naive_path_normalize. Defaults to that.

comment

A comment about this tree and setup.

dirtymark

A timestamp. The dirtymark is updated whenever an out of band change on the origin server is performed that violates the protocol. Say, they add or remove files in the middle somewhere. Slaves must react with a devaluation of their done structure which then leads to a full re-sync of all files. Implementation note: dirtymark may increase or decrease.

filenameroot

The (prefix of the) filename we use for this recentfile. Defaults to RECENT. The string must not contain a directory separator.

have_mirrored

Timestamp remembering when we mirrored this recentfile the last time. Only relevant for slaves.

ignore_link_stat_errors

If set to true, rsync errors that complain about link stat errors are ignored. These seem to happen only when files are missing at the origin. Because race conditions can always cause this, it defaults to true.

is_slave

If set to true, this object will fetch a new recentfile from remote when the timespan between the last mirror (see have_mirrored) and now is too large (see ttl).

keep_delete_objects_forever

The default for delete events is that they are passed through the collection of recentfile objects until they reach the Z file. There they get dropped so that the associated file object ceases to exist at all. By setting keep_delete_objects_forever the delete objects are kept forever. This makes the Z file larger but has the advantage that slaves that have interrupted mirroring for a long time still can clean up their copy.

locktimeout

After how many seconds shall we die if we cannot lock a recentfile? Defaults to 600 seconds.

loopinterval

When mirror_loop is called, this accessor can specify how much time every loop shall at least take. If the work of a loop is done before that time has gone, sleeps for the rest of the time. Defaults to arbitrary 42 seconds.

max_files_per_connection

Maximum number of files that are transferred on a single rsync call. Setting it higher means higher performance at the price of holding connections longer and potentially disturbing other users in the pool. Defaults to the arbitrary value 42.

max_rsync_errors

When rsync operations encounter that many errors without any resetting success in between, then we die. Defaults to unlimited. A value of -1 means we run forever ignoring all rsync errors.

minmax

Hashref remembering when we read the recent_events from this file the last time and what the timespan was.

protocol

When the RECENT file format changes, we increment the protocol. We try to support older protocols in later releases.

remote_host

The host we are mirroring from. Leave empty for the local filesystem.

remote_module

Rsync servers have so called modules to separate directory trees from each other. Put here the name of the module under which we are mirroring. Leave empty for local filesystem.

rsync_options

Things like compress, links, times or checksums. Passed in to the File::Rsync object used to run the mirror.

serializer_suffix

Mostly untested accessor. The only well tested format for recentfiles at the moment is YAML. It is used with YAML::Syck via Data::Serializer. But in principle other formats are supported as well. See section SERIALIZERS below.

sleep_per_connection

Sleep that many seconds (floating point OK) after every chunk of rsyncing has finished. Defaults to arbitrary 0.42.

tempdir

Directory to write temporary files to. Must allow rename operations into the tree which usually means it must live on the same partition as the target directory. Defaults to $self->localroot.

ttl

Time to live. Number of seconds after which this recentfile must be fetched again from the origin server. Only relevant for slaves. Defaults to arbitrary 24.2 seconds.

verbose

Boolean to turn on a bit of verbosity.

verboselog

Path to the logfile to write verbose progress information to. This is a primitive stop gap solution to get simple verbose logging working. Switching to Log4perl or similar is probably the way to go.

METHODS

(void) $obj->aggregate( %options )

Takes all intervals that are collected in the accessor called aggregator. Sorts them by actual length of the interval. Removes those that are shorter than our own interval. Then merges this object into the next larger object. The merging continues upwards as long as the next recentfile is old enough to warrant a merge.

Whether a merge is warranted is decided by the interval of the next smaller recentfile, so that larger files are updated less often than smaller ones. If $options{force} is true, all files get updated.

Here is an example to illustrate the behaviour. Given aggregators

1h 1d 1W 1M 1Q 1Y Z

then

1h updates 1d on every call to aggregate()
1d updates 1W earliest after 1h
1W updates 1M earliest after 1d
1M updates 1Q earliest after 1W
1Q updates 1Y earliest after 1M
1Y updates  Z earliest after 1Q

Note that all but the smallest recentfile get updated at an arbitrary rate and as such are quite useless on their own.
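
The schedule above can be reproduced with a short sketch. It is not the module's code: secs() and merge_schedule() are hypothetical helpers, and numeric infinity stands in for the length of Z.

```perl
use strict;
use warnings;

# Seconds per unit letter, as listed in the section INTERVAL SPEC.
my %SECS = (s => 1, m => 60, h => 3600, d => 86400, W => 604800,
            M => 2592000, Q => 7776000, Y => 31557600);

# Hypothetical helper: length of an interval spec in seconds.
sub secs {
    my ($spec) = @_;
    return 9**9**9 if $spec eq 'Z';   # infinity, an assumed stand-in for MAX_INT
    my ($n, $unit) = $spec =~ /\A([0-9]+)([smhdWMQY])\z/
        or die "bad interval spec '$spec'";
    return $n * $SECS{$unit};
}

# Hypothetical helper: returns [from, to, wait] triples.
# wait is undef for the smallest file, which merges on every call.
sub merge_schedule {
    my ($own, @aggregator) = @_;
    # Drop intervals that are not longer than our own, sort the rest by length.
    my @chain = ($own,
                 sort { secs($a) <=> secs($b) }
                 grep { secs($_) > secs($own) } @aggregator);
    my @schedule;
    for my $i (0 .. $#chain - 1) {
        my $wait = $i == 0 ? undef : $chain[$i - 1];
        push @schedule, [ $chain[$i], $chain[$i + 1], $wait ];
    }
    return @schedule;
}

for my $step (merge_schedule('1h', qw(Z 1Y 1d 1Q 1W 1M))) {
    my ($from, $to, $wait) = @$step;
    printf "%-2s updates %-2s %s\n", $from, $to,
        defined $wait ? "earliest after $wait" : "on every call";
}
```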

$hashref = $obj->delayed_operations

A hash of hashes containing unlink and rmdir operations which had to wait until the recentfile got unhidden in order to not confuse downstream mirrors (in case we have some).

$done = $obj->done

$done is a reference to a File::Rsync::Mirror::Recentfile::Done object that keeps track of rsync activities. Only needed and used when we are a mirroring slave.

$tempfilename = $obj->get_remote_recentfile_as_tempfile ()

Stores the remote recentfile locally as a tempfile. The caller is responsible for removing the file after use.

Note: if you intend to act as an rsync server for other slaves, you must use this method rather than get_remotefile() to fetch that file. Otherwise downstream mirrors would expect you to already have mirrored all the files listed in the recentfile before you actually have them.

$localpath = $obj->get_remotefile ( $relative_path )

Rsyncs one single remote file to local filesystem.

Note: no locking is done on this file. Any number of processes may mirror this object.

Note II: do not use for recentfiles. If you are a cascading slave/server combination, it would confuse other slaves. They would expect the contents of these recentfiles to be available. Use get_remote_recentfile_as_tempfile() instead.

$obj->interval ( $interval_spec )

Get/set accessor. $interval_spec is a string and described below in the section INTERVAL SPEC.

$secs = $obj->interval_secs ( $interval_spec )

$interval_spec is described below in the section INTERVAL SPEC. If empty defaults to the inherent interval for this object.

$obj->localroot ( $localroot )

Get/set accessor. The local root of the tree. Guaranteed without trailing slash.

$ret = $obj->local_path($path_found_in_recentfile)

Combines the path to our local mirror and the path of an object found in this recentfile. In other words: the target of a mirror operation.

Implementation note: We split on slashes and then use File::Spec::catfile to adjust to the local operating system.
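
A minimal sketch of the splitting and joining described in the implementation note. local_path_sketch() is a hypothetical stand-in that takes the local root as an explicit argument; it is not the module's method.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: join a recentfile path (always slash separated)
# onto the local root using the conventions of the local operating system.
sub local_path_sketch {
    my ($localroot, $path) = @_;
    my @parts = grep { length } split m{/}, $path;   # drop empty segments
    return File::Spec->catfile($localroot, @parts);
}

print local_path_sketch("/home/ftp/pub/PAUSE/authors",
                        "id/A/AN/ANDK/CPAN-1.92_63.tar.gz"), "\n";
```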

(void) $obj->lock

Locking is implemented with an mkdir on a locking directory (.lock appended to $rfile).
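
The technique works because mkdir either creates the directory or fails, so only one process can win. The following sketch shows the idea with a timeout in the spirit of locktimeout; acquire_lock(), release_lock() and the retry pause of 0.01 seconds are assumptions, not the module's code.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);
use Time::HiRes ();

# Hypothetical helper: try to create "$rfile.lock" until $timeout seconds passed.
sub acquire_lock {
    my ($rfile, $timeout) = @_;
    my $lockdir = "$rfile.lock";
    my $start   = Time::HiRes::time();
    until (mkdir $lockdir) {           # fails while another holder exists
        die "could not lock $rfile within $timeout seconds\n"
            if Time::HiRes::time() - $start >= $timeout;
        Time::HiRes::sleep(0.01);      # assumed retry pause
    }
    return $lockdir;
}

# Hypothetical helper: remove the lock directory again.
sub release_lock {
    my ($rfile) = @_;
    return rmdir "$rfile.lock";
}

my $dir   = tempdir(CLEANUP => 1);
my $rfile = File::Spec->catfile($dir, "RECENT-1h.yaml");
acquire_lock($rfile, 1);
print "locked: ", (-d "$rfile.lock" ? "yes" : "no"), "\n";
release_lock($rfile);
print "locked: ", (-d "$rfile.lock" ? "yes" : "no"), "\n";
```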

(void) $obj->merge ($other)

Bulk update of this object with another one. It's used to merge a smaller and younger $other object into the current one. If this file is a Z file, then we normally do not merge in objects of type delete; this can be overridden by setting keep_delete_objects_forever. But if we encounter an object of type delete we delete the corresponding new object if we have it.

If there is nothing to be merged, nothing is done.

merged

Hashref recording the epoch at which this recentfile was merged into another one, and into which.

$hashref = $obj->meta_data

Returns the hashref of metadata that the server has to add to the recentfile.

$success = $obj->mirror ( %options )

Mirrors the files in this recentfile as reported by recent_events. Options named after, before, max are passed through to the recent_events call. The boolean option piecemeal, if true, causes mirror to only rsync max_files_per_connection and keep track of the rsynced files so that future calls will rsync different files until all files are brought to sync.

$success = $obj->mirror_path ( $arrref | $path )

If the argument is a scalar it is treated as a path. The remote path is mirrored into the local copy. $path is the path found in the recentfile, i.e. it is relative to the root directory of the mirror.

If the argument is an array reference then all elements are treated as a path below the current tree and all are rsynced with a single command (and a single connection).

$path = $obj->naive_path_normalize ($path)

Takes an absolute unix style path as argument and canonicalizes it to a shorter path if possible, removing things like double slashes or /./ and resolving references to ../ directories to get a shorter unambiguous path. This simplifies the code that determines whether a file passed to update() is indeed below our localroot.
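
The kind of normalization meant here can be sketched as follows. normalize_path_sketch() is a hypothetical illustration of the idea; the module's own implementation may treat edge cases differently.

```perl
use strict;
use warnings;

# Hypothetical helper: purely textual cleanup of an absolute unix style path.
sub normalize_path_sketch {
    my ($path) = @_;
    my @out;
    for my $seg (split m{/+}, $path) {
        next if $seg eq '' || $seg eq '.';   # skip empty and "." segments
        if ($seg eq '..') {
            pop @out;                        # step up; a no-op at the root
            next;
        }
        push @out, $seg;
    }
    return '/' . join '/', @out;
}

print normalize_path_sketch("/home/ftp//pub/./PAUSE/tmp/../authors/id"), "\n";
```

Note that this is string manipulation only; it does not consult the filesystem and therefore does not resolve symbolic links.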

$ret = $obj->read_recent_1 ( $data )

Delegate of recent_events() on protocol 1

$array_ref = $obj->recent_events ( %options )

Note: the code relies on the resource being written atomically. We cannot lock because we may have no write access. If the caller has write access (e.g. aggregate() or update()), it must take care of any necessary locking and it MUST write atomically.

If $options{after} is specified, only file events after this timestamp are returned.

If $options{before} is specified, only file events before this timestamp are returned.

If $options{max} is specified only a maximum of this many most recent events is returned.

If $options{'skip-deletes'} is specified, no files-to-be-deleted will be returned.

If $options{contains} is specified the value must be a hash reference containing a query. The query may contain the keys epoch, path, and type. Each represents a condition that must be met. If there is more than one such key, the conditions are ANDed.

If $options{info} is specified, it must be a hashref. This hashref will be filled with metadata about the unfiltered recent_events of this object, in key first there is the first item, in key last is the last.
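
How the after, before, max and skip-deletes options narrow down a list of events can be sketched on plain data. filter_events() is a hypothetical stand-in, the sample events are made up, epochs are simplified to integers, and the order in which the module applies the filters is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: $events is an arrayref of hashrefs, newest first.
sub filter_events {
    my ($events, %options) = @_;
    my @result = @$events;
    @result = grep { $_->{epoch} > $options{after} }  @result if defined $options{after};
    @result = grep { $_->{epoch} < $options{before} } @result if defined $options{before};
    @result = grep { $_->{type} ne 'delete' }         @result if $options{'skip-deletes'};
    # Keep only the $options{max} most recent of the remaining events.
    splice @result, $options{max} if defined $options{max} && @result > $options{max};
    return \@result;
}

# Made-up sample data for illustration only.
my $events = [
    { epoch => 1240000300, path => "id/A/AN/ANDK/c.tar.gz", type => "new"    },
    { epoch => 1240000200, path => "id/A/AN/ANDK/b.tar.gz", type => "delete" },
    { epoch => 1240000100, path => "id/A/AN/ANDK/a.tar.gz", type => "new"    },
];
my $hits = filter_events($events, after => 1240000100, 'skip-deletes' => 1);
print "$_->{epoch} $_->{type} $_->{path}\n" for @$hits;
```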

$ret = $obj->rfilename

Just the basename of our recentfile, composed from filenameroot, a dash, interval, and serializer_suffix. E.g. RECENT-6h.yaml

$str = $obj->remote_dir

The directory we are mirroring from.

$str = $obj->remoteroot

(void) $obj->remoteroot ( $set )

Get/Set the composed prefix needed when rsyncing from a remote module. If remote_host, remote_module, and remote_dir are set, it is composed from these.

(void) $obj->split_rfilename ( $recentfilename )

Inverse method to rfilename. $recentfilename is a plain filename of the pattern

$filenameroot-$interval$serializer_suffix

e.g.

RECENT-1M.yaml

This filename is split into its parts and the parts are fed to the object itself.
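
Composing and splitting such a filename can be sketched with a regular expression. Both helpers are hypothetical and return plain key/value pairs instead of feeding an object; the regex is an assumption derived from the pattern above and the section INTERVAL SPEC.

```perl
use strict;
use warnings;

# Hypothetical helper: build the basename from its three parts.
sub compose_rfilename {
    my (%p) = @_;
    return "$p{filenameroot}-$p{interval}$p{serializer_suffix}";
}

# Hypothetical helper: split a basename back into its three parts.
sub split_rfilename_sketch {
    my ($name) = @_;
    my ($root, $interval, $suffix) =
        $name =~ /\A(.+)-([0-9]+[smhdWMQY]|Z)(\.[^.]+)\z/
        or die "'$name' does not look like a recentfile name\n";
    return (filenameroot      => $root,
            interval          => $interval,
            serializer_suffix => $suffix);
}

my %parts = split_rfilename_sketch("RECENT-1M.yaml");
print join(", ", map { "$_=$parts{$_}" } sort keys %parts), "\n";
print compose_rfilename(%parts), "\n";
```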

my $rfile = $obj->rfile

Returns the full path of the recentfile

$rsync_obj = $obj->rsync

The File::Rsync object that this object uses for communicating with an upstream server.

(void) $obj->register_rsync_error(@err)

(void) $obj->un_register_rsync_error()

Register_rsync_error is called whenever the File::Rsync object fails on an exec (say, connection doesn't succeed). It issues a warning and sleeps for an increasing amount of time. Un_register_rsync_error resets the error count. See also accessor max_rsync_errors.

$clone = $obj->_sparse_clone

Clones just enough of itself to do no harm. Experimental method.

Note: what fits better: sparse or shallow? Other suggestions?

$boolean = $obj->ttl_reached ()

(void) $obj->unlock()

Unlocking is implemented with an rmdir on a locking directory (.lock appended to $rfile).

unseed

Sets this recentfile in the state of not 'seeded'.

$ret = $obj->update ($path, $type)

$ret = $obj->update ($path, "new", $dirty_epoch)

$ret = $obj->update ()

Enter one file into the local recentfile. $path is the (usually absolute) path. If the path is outside our tree, then it is ignored.

$type is one of new or delete.

Events of type new may set $dirty_epoch. $dirty_epoch is normally not used and the epoch is calculated by the update() routine itself based on current time. But if there is the demand to insert a not-so-current file into the dataset, then the caller sets $dirty_epoch. This causes the epoch of the registered event to become $dirty_epoch or, if the exact value given is already taken, a tiny bit more. As compensation the dirtymark of the whole dataset is set to now or the current epoch, whichever is higher. Note: setting the dirty_epoch to the future is prohibited because it is very unlikely to be intended and could wreak havoc with the index files.

The new file event is unshifted (or, if dirty_epoch is set, inserted at the place it belongs to, according to the rule to have a sequence of strictly decreasing timestamps) to the array of recent_events and the array is shortened to the length of the timespan allowed. This is usually the timespan specified by the interval of this recentfile but as long as this recentfile has not been merged to another one, the timespan may grow without bounds.

The third form runs an update without inserting a new file. This may be desired to truncate a recentfile.
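
The ordering rule can be illustrated on a plain array. This is a simplified sketch, not the module's code: insert_event() is hypothetical, epochs are plain numbers, the step of 0.001 for an already taken epoch is an assumption, and the list is always truncated relative to the given current time, which ignores the case of a recentfile that has not been merged yet.

```perl
use strict;
use warnings;

# Hypothetical helper: $events is an arrayref kept in strictly decreasing
# epoch order; $now and $interval_secs are passed in explicitly.
sub insert_event {
    my ($events, $path, $type, $now, $interval_secs, $dirty_epoch) = @_;
    my $epoch = defined $dirty_epoch ? $dirty_epoch : $now;
    die "dirty_epoch must not be in the future\n" if $epoch > $now;
    my $nudge = 0.001;   # assumed step when the exact epoch is already taken
    $epoch += $nudge while grep { $_->{epoch} == $epoch } @$events;
    # Skip all newer events, then insert in front of the first older one.
    my $pos = 0;
    $pos++ while $pos < @$events && $events->[$pos]{epoch} > $epoch;
    splice @$events, $pos, 0, { epoch => $epoch, path => $path, type => $type };
    # Shorten the list to the timespan of this recentfile.
    pop @$events while @$events && $events->[-1]{epoch} < $now - $interval_secs;
    return $events;
}

my $now    = 1240003600;
my @events = (
    { epoch => 1240003000, path => "b", type => "new" },
    { epoch => 1240001000, path => "a", type => "new" },
);
insert_event(\@events, "c", "new", $now, 3600);               # regular update
insert_event(\@events, "d", "new", $now, 3600, 1240001000);   # epoch already taken
print "$_->{epoch} $_->{path}\n" for @events;
```

With a regular update the new event ends up at the front, which corresponds to the unshift described above; with a dirty epoch it lands between its neighbours.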

$obj->batch_update($batch)

Like update but for many files. $batch is an arrayref containing hashrefs with the structure

{
  path => $path,
  type => $type,
  epoch => $epoch,
}

seed

Sets this recentfile in the state of 'seeded' which means it has to re-evaluate its uptodateness.

seeded

Tells if the recentfile is in the state 'seeded'.

uptodate

True if this object has mirrored the complete interval covered by the current recentfile.

$obj->write_recent ($recent_files_arrayref)

Writes a recentfile that reflects the current state of the tree, limited to the current interval.

$obj->write_0 ($recent_files_arrayref)

Delegate of write_recent() on protocol 0

$obj->write_1 ($recent_files_arrayref)

Delegate of write_recent() on protocol 1

SERIALIZERS

The following suffixes are supported and trigger the use of these serializers:

".yaml" => "YAML::Syck"
".json" => "JSON"
".sto" => "Storable"
".dd" => "Data::Dumper"

INTERVAL SPEC

An interval spec is a primitive way to express time spans. Normally it is composed from an integer and a letter.

As a special case, a string that consists only of the single letter Z stands for MAX_INT seconds.

The following letters express the specified number of seconds:

s => 1
m => 60
h => 60*60
d => 60*60*24
W => 60*60*24*7
M => 60*60*24*30
Q => 60*60*24*90
Y => 60*60*24*365.25
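
The table translates directly into a small conversion routine. The sketch below is not taken from the module: spec_to_seconds() is a hypothetical helper and the value used for MAX_INT is an assumption.

```perl
use strict;
use warnings;

# Seconds per unit letter, copied from the table above.
my %SECONDS_PER_UNIT = (
    s => 1,
    m => 60,
    h => 60 * 60,
    d => 60 * 60 * 24,
    W => 60 * 60 * 24 * 7,
    M => 60 * 60 * 24 * 30,
    Q => 60 * 60 * 24 * 90,
    Y => 60 * 60 * 24 * 365.25,
);

# Assumption: a stand-in for MAX_INT; the module's constant may differ.
my $MAX_INT = ~0 >> 1;

# Hypothetical helper: convert an interval spec to seconds.
sub spec_to_seconds {
    my ($spec) = @_;
    return $MAX_INT if $spec eq 'Z';
    my ($n, $unit) = $spec =~ /\A([0-9]+)([smhdWMQY])\z/
        or die "cannot parse interval spec '$spec'\n";
    return $n * $SECONDS_PER_UNIT{$unit};
}

printf "%-3s => %s\n", $_, spec_to_seconds($_) for qw(6h 1d 1W 1M 1Q 1Y);
```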

SEE ALSO

File::Rsync::Mirror::Recent, File::Rsync::Mirror::Recentfile::Done, File::Rsync::Mirror::Recentfile::FakeBigFloat

BUGS

Please report any bugs or feature requests through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=File-Rsync-Mirror-Recentfile. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

KNOWN BUGS

Memory hungry: it seems all memory is allocated during the initial rsync where a list of all files is maintained in memory.

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc File::Rsync::Mirror::Recentfile

ACKNOWLEDGEMENTS

Thanks to RJBS for module-starter.

AUTHOR

Andreas König

COPYRIGHT & LICENSE

Copyright 2008,2009 Andreas König.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.