NAME
App::dupfind::Common - Public methods for the App::dupfind deduplication engine
VERSION
version 0.172690
DESCRIPTION
Together with App::dupfind::Guts, the methods from this module are composed into the App::dupfind class, providing the high-level methods that are directly callable from the user's application.
INTERNALS
Some implementation concepts don't really matter to the end user, but they are briefly discussed here because they are used throughout the codebase and are referred to by a large number of the documented class methods.
THE DUPLICATES MASTER HASH
Potential duplicate files are kept in groupings of same-size files, organized by file size. They are tracked in a hashref datastructure.
Specifically, the keys of the hashref are the integers indicating file sizes in bytes. The corresponding value for each of these "size" keys is a listref containing the group of filenames that are of that file size.
Random example:
   $dupes =
   {
      0 =>     # a zero-size file
      [
         '~/run/some_file.lock',
      ],

      1024 =>  # some files that are 1024 bytes
      [
         '~/Pictures/kitty.jpg',
         '~/Pictures/cat.jpg',
         '~/.cache/foo',
      ],

      4096 =>  # some files that are 4096 bytes
      [
         '~/Documents/notes.txt',
         '~/Downloads/bar.gif',
      ],
   };
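As a quick illustration of how this structure is typically traversed (a sketch only, not code taken from the module):

   for my $size ( sort { $a <=> $b } keys %$dupes )
   {
      my $group = $dupes->{ $size };   # listref of same-size file names

      printf "%d file(s) of %d bytes\n", scalar @$group, $size;
   }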
METHODS
- cache_stats
Retrieve information about the cache hits/misses that happened during the calculation of file digests in the digest_dups method. Used as part of the run summary that gets printed out at the end of execution of $bin/dupfind.
Returns $cache_hits, $cache_misses (both integers).
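For example (assuming an App::dupfind object in $app):

   my ( $hits, $misses ) = $app->cache_stats;

   printf "Cache hits: %d / cache misses: %d\n", $hits, $misses;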
- count_dups
Examines its argument and sums up the number of files it contains. Expects a datastructure in the form of the master dupes hashref.
Returns $dup_count (an integer).
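The effect is roughly equivalent to the following sketch (not the module's literal code):

   my $dup_count = 0;

   $dup_count += scalar @$_ for values %$dupes;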
- delete_dups
Deletes duplicate files, optionally prompting the user for which files to delete and for confirmation before deletion (if the command-line parameters supplied by the user indicate that interactive prompting is desired).
Returns nothing.
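A greatly simplified, non-interactive sketch of the idea, operating on the master dupes hashref in $dupes (the real method decides which file in each group survives and can prompt before removing anything):

   for my $size ( keys %$dupes )
   {
      # keep one file per group (here, arbitrarily, the first one) and
      # unlink the rest
      my ( $keep, @doomed ) = @{ $dupes->{ $size } };

      unlink $_ or warn "could not delete $_: $!" for @doomed;
   }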
- digest_dups
Expects a datastructure in the form of the master dupes hashref.
Iterates over the datastructure and calculates digests for each of the files.
If ramcache is enabled (which is the default), a rudimentary caching mechanism is used to avoid calculating digests multiple times for files with the same content.
Returns a lexical copy of the duplicates hashref with non-dupes removed.
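A simplified sketch of the digest pass with a naive content-keyed cache (the digest algorithm and cache policy shown here are illustrative assumptions, not necessarily what the module uses):

   use Digest::MD5;

   my ( %cache, %by_digest );

   for my $size ( keys %$dupes )
   {
      for my $file ( @{ $dupes->{ $size } } )
      {
         open my $fh, '<:raw', $file or next;

         my $data = do { local $/; <$fh> };

         # keying the cache on content means identical files share a
         # single digest calculation
         my $digest = $cache{ $data } //= Digest::MD5->new->add( $data )->hexdigest;

         push @{ $by_digest{ $digest } }, $file;
      }
   }

   # only digests shared by more than one file represent real duplicates
   my @dup_groups = grep { @$_ > 1 } values %by_digest;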
- get_size_dups
Scans the directory specified by the user and assembles the master dupes hashref datastructure as described above. Files with no same-size counterparts are not included in the datastructure.
Returns $dupes_hashref, $scan_count, $size_dup_count, where $dupes_hashref is the master duplicates hashref, $scan_count is the number of files that were scanned, and $size_dup_count is the total number of files across all of the same-size groups.
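Conceptually, the scan does something like the following (a sketch using core File::Find; the real method honors the user's options and handles more edge cases):

   use File::Find;

   my ( %by_size, $scan_count );

   find
   (
      sub
      {
         return unless -f;   # regular files only

         $scan_count++;

         push @{ $by_size{ -s _ } }, $File::Find::name;
      },
      $self->opts->{dir}
   );

   # sizes with only one file cannot contain duplicates
   delete $by_size{ $_ }
      for grep { @{ $by_size{ $_ } } < 2 } keys %by_size;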
- opts
A read-only accessor method that returns a hashref of options as specified by the default settings, by user input at invocation time, or by both.
Examples:
   $self->opts->{threads}   # the number of threads the user wants
   $self->opts->{dir}       # the name of the directory to scan for duplicates
- say_stderr
The same as Perl's built-in say function, except that:
It is a class method
It outputs to STDERR instead of STDOUT
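For example (the message text here is arbitrary):

   $self->say_stderr( 'Scanning for duplicate files...' );

   # ...which is roughly equivalent to:

   print STDERR "Scanning for duplicate files...\n";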
- show_dups
Expects a datastructure in the form of the master dupes hashref.
Produces the formatted output for $bin/dupfind based on what duplicate files were found during execution. Currently two output formats are supported: "human" and "robot". The robot output is easily machine-parsable, while the human output is formatted to be more readable for human users.
Returns the number of duplicates shown.
- sort_dups
Expects a datastructure in the form of the master dupes hashref.
Iterates through the hashref and examines the listrefs of file names that comprise its values. It then sorts each listref in place with the following sort:
   sort { $a cmp $b }
Returns a lexical copy of the newly-sorted master duplicates hashref.
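The effect on each size group is roughly:

   @$_ = sort { $a cmp $b } @$_ for values %$dupes;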
- toss_out_hardlinks
Expects a datastructure in the form of the master dupes hashref.
Iterates through the hashref and examines the listrefs of file names that comprise its values.
For each file in each group, it examines the file's underlying storage on the storage medium using a stat call. Any files that share both the same device number and the same inode number are hard links to the same data.
After alphabetizing any hard links that are detected, it throws out all hard links but the first one. This simplifies the output, the rationale being that a hard link constitutes a file that has already been deduplicated, because it refers to the same underlying storage.
Returns a lexical copy of the master duplicates hashref.
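A sketch of the device/inode comparison for a single size group (the module's bookkeeping differs, but this is the core idea):

   my ( %seen, @kept );

   for my $file ( sort { $a cmp $b } @{ $dupes->{ $size } } )
   {
      my ( $dev, $ino ) = ( stat $file )[ 0, 1 ];

      # the first file seen for a given device/inode pair is kept; any
      # later file with the same pair is a hard link to it
      push @kept, $file unless $seen{ "$dev:$ino" }++;
   }

   $dupes->{ $size } = \@kept;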
- weed_dups
Expects a datastructure in the form of the master dupes hashref.
Runs the weed-out pass(es) on the datastructure in an attempt to eliminate as many non-duplicate files as possible from the same-size file groupings without having to resort to resource-intensive file hashing (i.e., the calculation of file digests).
If no potential duplicates remain after the weed-out pass(es), then the need for hashing is obviated and it doesn't get performed. For any remaining potential duplicates, however, hashing is ultimately used to provide the final decision on file uniqueness.
One or more passes may be performed, based on user input. Currently the default is to use only one pass, with the "first_middle_last" weed-out algorithm, which has so far proved to be the most efficient.
Returns a (hopefully reduced) lexical copy of the master duplicates hashref.
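As an illustration, a "first_middle_last" style pass might sample a few bytes from the start, middle, and end of each file, then keep only the files whose samples collide. The sample length used below is an arbitrary assumption, and this is a sketch of the idea rather than the module's implementation:

   sub fml_key
   {
      my ( $file, $size ) = @_;

      my $len = $size < 64 ? $size : 64;   # arbitrary sample length

      open my $fh, '<:raw', $file or return;

      my ( $first, $middle, $last ) = ( '', '', '' );

      read $fh, $first, $len;

      seek $fh, int( $size / 2 ), 0 and read $fh, $middle, $len;

      seek $fh, $size - $len, 0     and read $fh, $last,   $len;

      return join '|', $first, $middle, $last;
   }

   for my $size ( keys %$dupes )
   {
      my %by_key;

      push @{ $by_key{ fml_key( $_, $size ) // '' } }, $_
         for @{ $dupes->{ $size } };

      # files whose samples differ cannot be duplicates of each other
      $dupes->{ $size } = [ map { @$_ } grep { @$_ > 1 } values %by_key ];
   }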