NAME
File::Unpack - An aggressive archive file unpacker, based on mime-types
VERSION
Version 0.23
SYNOPSIS
File::Unpack is an aggressive unpacker for archive files. We call it aggressive because it can recursively descend into any freshly unpacked file that appears to be an archive itself. It also uncompresses files where needed. The ultimate goal of File::Unpack is to extract as much readable text (ASCII or any other encoding) as possible. Most of the currently known archive file formats are supported.
use File::Unpack;
my $log;
my $u = File::Unpack->new(logfile => \$log);
my $m = $u->mime('/etc/init.d/rc');
print "$m->[0]; charset=$m->[1]\n";
# text/x-shellscript; charset=us-ascii
map { print "$_->{name}\n" } @{$u->mime_handler()};
# application/%rpm
# application/%tar+gzip
# application/%tar+bzip2
# ...
$u->unpack("inputfile.tar.bz2");
while ($log =~ m{^\s*"(.*?)":}g) # it's JSON.
{
print "$1\n"; # report all files unpacked
}
...
Examines the contents of an archive file or directory by extensive mime-type analysis. The contents are unpacked recursively to the given destination directory; a listing of the unpacked files is reported through the built-in logging facility during unpacking. The mime-type handlers are customizable, as are the exclude patterns.
SUBROUTINES/METHODS
new
my $u = File::Unpack->new(destdir => '.', logfile => \*STDOUT, maxfilesize => '100M', verbose => 1);
Creates an unpacker instance. The parameter destdir must be a writable location; all output files and directories are placed inside destdir. Subdirectories will be created in an attempt to reflect the structure of the input. Destdir defaults to the current directory; relative paths are resolved immediately, so that calling chdir() after new is harmless.
The parameter logfile can be a reference to a scalar, a filename, or a file descriptor. The logfile starts with a JSON formatted prolog, where all lines start with printable characters. For each file unpacked, a one-line record is appended, starting with a single space ' ' and terminated by "\n". Each record is formatted as a JSON " key: value\n" pair, where key is the filename and value is a hash including mime, size, and other information. The logfile is terminated by an epilog, where each line starts with a printable character. By default, the logfile is sent to STDOUT.
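The record layout described above lends itself to line-wise parsing. The following is a minimal sketch, not the module's own code: the log excerpt is hypothetical, the field names inside each record and the trailing comma are assumptions, and log_records is a helper name invented for this illustration.

```perl
use strict;
use warnings;
use JSON::PP;   # core module; used here only to decode single records

# Hypothetical log excerpt. Only the line layout follows the description
# above (prolog and epilog start with a printable character, records start
# with one space); the keys inside each record are assumptions.
my $log = <<'EOT';
{"unpacked": {
 "inputfile/README": {"mime": "text/plain", "size": 120},
 "inputfile/src/main.c": {"mime": "text/x-csrc", "size": 2048}
}}
EOT

# Collect all per-file records into one hash, keyed by filename.
sub log_records {
    my ($text) = @_;
    my %files;
    for my $line (split /\n/, $text) {
        # A record line starts with exactly one space; an optional
        # trailing comma (assumed separator) is stripped.
        next unless $line =~ /^ (\S.*?),?\s*$/;
        my $pair = decode_json('{' . $1 . '}');
        %files = (%files, %$pair);
    }
    return \%files;
}

my $files = log_records($log);
print "$_\t$files->{$_}{mime}\n" for sort keys %$files;
```

Because the finished logfile is one valid JSON structure, decoding it as a whole is the simpler route once unpacking is complete; the line-wise form is mainly useful while unpacking is still in progress.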
The parameter maxfilesize is a safeguard against compressed sparse files. Such files could easily fill up any available disk space when unpacked. Files hitting this limit will be silently truncated. Check the logfile records or epilog to see if this has happened. BSD::Resource is used to manipulate RLIMIT_FSIZE. To be implemented.
The parameter one_shot can optionally be set to non-zero, to limit unpacking to one level of unpacking. Unpacking of very well known compressed archives like e.g. tar.bz2 is considered one level only. The exact semantics depend on the configured mime helpers.
exclude
exclude(add => ['.svn', '*.orig' ], del => '.svn', force => 1)
Defines the exclude-list for unpacking. This list is advisory for the mime-handlers. The exclude-list items are shell glob patterns, where '*' or '?' never match '/'.
You can use force to have any of these removed after unpacking. Use (vcs => 1) to exclude a long list of known version control system directories, use (vcs => 0) to remove them. The default is exclude(empty => 1), which is the same as (empty_file => 1, empty_dir => 1) -- having the obvious meaning.
(re => 1) returns the active exclude-list as a regexp pattern. Otherwise exclude always returns the list as an array ref.
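To illustrate the glob semantics described above, here is a minimal sketch of turning an exclude-list into one regexp in which '*' and '?' never match '/'. The helper names glob_to_src and exclude_re are hypothetical, and matching against the last path component is an assumption; the pattern returned by exclude(re => 1) may be built differently.

```perl
use strict;
use warnings;

# Hypothetical helper: translate one shell glob into regexp source text.
# '*' and '?' are restricted so that they never cross a '/'.
sub glob_to_src {
    my ($glob) = @_;
    my $src = '';
    for my $ch (split //, $glob) {
        if    ($ch eq '*') { $src .= '[^/]*' }
        elsif ($ch eq '?') { $src .= '[^/]' }
        else               { $src .= quotemeta $ch }
    }
    return $src;
}

# Hypothetical helper: combine several globs into one compiled pattern that
# matches when the last path component matches any of the globs (assumption).
sub exclude_re {
    my @globs = @_;
    my $alt = join '|', map { glob_to_src($_) } @globs;
    return qr{(?:^|/)(?:$alt)\z};
}

my $re = exclude_re('.svn', '*.orig');
for my $path ('trunk/.svn', 'src/main.c.orig', 'src/main.c') {
    printf "%-16s %s\n", $path, ($path =~ $re ? 'excluded' : 'kept');
}
```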
unpack
$u->unpack($archive, [$destdir])
Determines the contents of an archive and recursively extracts its individual files. An archive may be the pathname of a file or directory. The extracted contents will be stored in "destdir/$subdir/$dest_name", where dest_name is the filename component of $archive without any leading pathname components, and possibly with a stripped or added suffix. (Subdir defaults to ''.) If $archive is a directory, then dest_name will also be a directory. If $archive is a file, the type of dest_name depends on the type of packing: if the archive expands to multiple files, dest_name will be a directory, otherwise it will be a file. If a file of the same name already exists in the destination subdir, an additional subdir component is created to avoid any conflicts. For each extracted file, a record is written to the logfile. When unpacking is finished, the logfile contains one valid JSON structure. Unpack achieves this by writing suitable prolog and epilog lines to the logfile.
The actual unpacking is dispatched to mime-type specific mime-handlers, selected using mime. A mime-handler can either be built-in code, or an external program (or shell-script) found in a directory registered with mime_handler_dir. The standard place for external handlers is /usr/share/File-Unpack/helper; it can be changed by the environment variable FILE_UNPACK_HELPER_DIR or the new parameter helper_dir.
A mime-handler is called with 6 parameters: source_path, destfile, destination_path, mimetype, description, and config_dir. Note, that destination_path is a freshly created empty working directory, even if the unpacker is expected to unpack only a single file. The unpacker is called after chdir into destination_path, so you usually do not need to evaluate the third parameter.
The directory config_dir contains unpack configuration in .sh, .js and possibly other formats. A mime-handler may use this information, but need not. All data passed into new is reflected there, as well as the active exclude-list. Using the config information can help a mime-handler to skip unwanted work or otherwise optimize unpacking.
unpack monitors the available filesystem space in destdir. If there is less space than configured with minfree, a warning can be printed and unpacking is optionally paused. It also monitors the mime-handler's progress reading the archive at source_path and reports percentages to STDERR (if verbose is 1 or more).
After the mime-handler is finished, unpack examines the files it created. If it created no files in destdir, an error is reported, and the source_path may be passed to other unpackers, or finally be added to the log as is.
If the mime-handler wants to express that source_path is already unpacked as far as possible and should be added to the log without any error messages, it should create a symbolic link destdir pointing to source_path.
The system considers replacing the directory with a file, under the following conditions:
There is exactly one file in the directory.
The file name is identical with directory name, except for one changed or removed suffix-word. (*.tar.gz -> *.tar; or *.tgz -> *.tar)
The file must not already exist in the parent directory.
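The three conditions above can be expressed as a small predicate. This is a sketch under stated assumptions, not the module's code: the function names are hypothetical, a "suffix-word" is taken to be the last dot-separated word of the name, and the directory contents and the parent's contents are passed in as plain lists rather than read from disk.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $file_name equals $dir_name except for one
# changed or removed trailing suffix-word (foo.tar.gz -> foo.tar,
# foo.tgz -> foo.tar).
sub differs_by_one_suffix {
    my ($dir_name, $file_name) = @_;
    my @d = split /\./, $dir_name;
    my @f = split /\./, $file_name;
    return 0 if @d < 2;                        # directory name has no suffix-word
    my $stem = join '.', @d[0 .. $#d - 1];     # directory name minus its last word
    return 1 if $file_name eq $stem;           # suffix removed
    return 1 if @f == @d
             && join('.', @f[0 .. $#f - 1]) eq $stem
             && $f[-1] ne $d[-1];              # suffix changed
    return 0;
}

# Hypothetical helper: apply all three conditions.
#   $entries  - names found inside the directory
#   $siblings - names already present in the parent directory
sub may_replace_dir {
    my ($dir_name, $entries, $siblings) = @_;
    return 0 unless @$entries == 1;
    return 0 unless differs_by_one_suffix($dir_name, $entries->[0]);
    return 0 if grep { $_ eq $entries->[0] } @$siblings;
    return 1;
}

print may_replace_dir('foo.tar.gz', ['foo.tar'], []) ? "replace\n" : "keep\n";
```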
unpack prepares 20 empty subdirectory levels and chdirs the unpacker in there. This number can be adjusted using new(dot_dot_safeguard => 20). A directory 20 levels up from the current working dir has mode 0 while the mime-handler runs. unpack can optionally chmod(0) the parent of the subdirectory after it chdirs the unpacker inside. Use new(jail_chmod0 => 1) for this, default is off. If enabled, a mime-handler trying to place files outside of the specified destination_path may receive 'permission denied' conditions.
These are special hacks to keep badly constructed tar balls, cpio, or zip archives at bay.
Please note, that all this helps against relative paths, but not against absolute paths in archives. It is the responsibility of mime-handlers to not create absolute paths.
A missing mime-handler is skipped. A mime-handler is expected to return an exit status of 0 upon success. If it runs into a problem, it should print lines starting with the affected filenames to stderr. Such errors are recorded in the log with the unpacked archive, and as far as files were created, also with these files.
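As an illustration of the handler contract described in this section (six parameters, exit status 0 on success, error lines on stderr that start with the affected filename), here is a sketch of an external handler written in Perl that decodes a base64 text file. It is not one of the shipped helpers. The name handler_main is hypothetical, and the sketch assumes that destfile is a plain file name to be created in the current working directory, which the description above does not spell out.

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);

# Hypothetical handler body. Parameter order follows the list given above.
# Assumption: $destfile names a file to create in the current directory,
# which unpack has already set to the fresh working directory.
sub handler_main {
    my ($src, $destfile, $destdir, $mimetype, $descr, $configdir) = @_;

    open my $in, '<', $src or do {
        print STDERR "$src: cannot read: $!\n";      # line starts with the filename
        return 1;
    };
    my $encoded = do { local $/; <$in> };            # slurp the whole input
    close $in;

    open my $out, '>', $destfile or do {
        print STDERR "$destfile: cannot write: $!\n";
        return 1;
    };
    binmode $out;
    print {$out} decode_base64($encoded);
    close $out or return 1;

    return 0;                                        # exit status 0 signals success
}

# Act as a handler only when called with the six documented parameters.
exit handler_main(@ARGV) if @ARGV == 6;
```

A handler like this leaves the destination_path parameter unused, which matches the note that the third parameter usually need not be evaluated.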
run
$u->run([argv0, ...], @redir, ... { init => sub ..., in, out, err, watch, every, prog, ... })
A general purpose fork-exec wrapper, based on IPC::Run. STDIN is closed, unless you specify an in => as described in IPC::Run. STDERR and STDOUT are both printed to STDOUT, prefixed with 'E: ' and 'O: ' respectively, unless you specify out =>, err =>, or out_err => ... for both.
Using redirection operators in @redir takes precedence over the above in/out/err redirections. See also IPC::Run. If you use the options in/out/err, you should restrict your redirection operators to the forms '<', '0<', '1>', '2>', or '>&' due to limitations in the precedence logic. Piping via '|' is properly recognized, but background execution '&' may confuse the precedence logic.
This run method is completely independent of the rest of File::Unpack. It works both as a static function and as a method call. It is used internally by unpack, but is exported to be of use elsewhere.
Init is run after construction of redirects. Calling chdir() in init thus has no effect on redirects with relative paths.
Return value in scalar context is the first nonzero result code, if any. In list context all return values are returned.
fmt_run_shellcmd
File::Unpack::fmt_run_shellcmd( $m->{argvv} )
Static function to pretty print the return value $m of method find_mime_handler(). It formats a command array used with run() as a properly escaped shell command string.
mime_handler_dir mime_handler
$u->mime_handler_dir($dir, ...) $u->mime_handler($mime_name, $suffix_regexp, \@argv, @redir, ...)
Registers one or more directories where external mime-helper programs are found. The words helper and handler are used as synonyms here; helpers often refer to external programs, whereas handlers refer to built-in shell commands. Multiple directories can be registered; they are searched in reverse order, i.e. last added takes precedence. Any external mime-handler takes precedence over built-in code. An array ref to the new list of directories is returned.
The suffix_regexp is not used to find helpers. It is applied to derive the destination name from the source name.
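A minimal sketch of that derivation, assuming the suffix regexp is simply removed from the filename component of the source path. The function dest_name and the sample pattern are hypothetical; the module may also add a suffix, which this sketch does not cover.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical helper: drop leading directories, then strip whatever the
# handler's suffix regexp matches (assumption: plain removal).
sub dest_name {
    my ($archive, $suffix_re) = @_;
    my $name = basename($archive);
    $name =~ s/$suffix_re//;
    return $name;
}

# Sample pattern for bzip2-compressed tar archives (illustrative only).
my $tar_bz2 = qr/\.(?:tar\.bz2|tbz2?)\z/i;
print dest_name('/tmp/in/inputfile.tar.bz2', $tar_bz2), "\n";
```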
Helpers are mapped to mime-types by their mime_name. The name can be constructed from the mimetype by replacing the '/' with a '=' character, and by using the word 'ANY' as a wildcard component. The '=' character is interpreted as an implicit '=ANY+' if needed.
Examples:
Mimetype                   handler names tried in sequence
----------------------------------------------------------
image/png                  image=png
                           image=ANY
                           image
                           ANY=ANY
                           ANY
application/vnd.oasis+zip  application=vnd.oasis+zip
                           application=ANY+zip
                           application=ANYzip
                           application=zip
                           application=ANY
                           ...
A trailing '=ANY' is implicit, as shown by these examples. The rules for determining precedence are these:
Search in one directory is exhausted before the next is considered.
A matching name with wildcards has lower precedence than a matching name without.
A wildcard before the '=' sign lowers precedence more than one after it.
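The lookup order for the simple case can be sketched as follows. This only reproduces the sequence documented for image/png, i.e. a mimetype without a '+' component; the expansion for types such as application/vnd.oasis+zip is not attempted. Both function names are hypothetical, and the sketch models a lookup at call time, whereas the module performs its mapping when mime_handler_dir is called.

```perl
use strict;
use warnings;

# Hypothetical helper: candidate handler names for a "major/minor" mimetype
# without '+', in the order documented for image/png.
sub handler_names {
    my ($mimetype) = @_;
    my ($major, $minor) = split m{/}, $mimetype, 2;
    return ("$major=$minor", "$major=ANY", $major, 'ANY=ANY', 'ANY');
}

# Hypothetical helper: @dirs is in registration order. The last added
# directory is searched first, and each directory is exhausted before the
# next one is considered.
sub pick_handler {
    my ($mimetype, @dirs) = @_;
    for my $dir (reverse @dirs) {
        for my $name (handler_names($mimetype)) {
            return "$dir/$name" if -f "$dir/$name";
        }
    }
    return;    # nothing found
}

print join("\n", handler_names('image/png')), "\n";
```

With this ordering, a generic handler in a later-registered directory wins over a more specific one in an earlier directory, which is what exhausting one directory first implies.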
The mapping takes place when mime_handler_dir is called, later additions are not recognized. mime_handler does not do any implicit expansions. Call it multiple times with the same handler command and different names if needed. The default argument list is "%(src)s %(destfile)s %(destdir)s %(mime)s %(descr)s %(configdir)s" -- this is applied, if no args are given and no redirections are given. See also unpack for more semantics and how a handler should behave.
Both methods return an ARRAY-ref of all currently known mime handlers.
find_mime_handler
$u->find_mime_handler($mimetype)
Returns a mime-handler suitable for unpacking the given $mimetype. If called in list context, a second return value indicates which mime-handlers would be suitable, but could not be found in the system.
minfree
$u->minfree(factor => 10, bytes => '100M', percent => '3%', warning => sub { .. })
THE ACTUAL TESTS ARE NOT IMPLEMENTED.
Guard the filesystem (destdir) against becoming full during unpack. Before unpacking each source archive, the free space is measured and compared against three conditions:
The archive size multiplied with the given factor must fit into the filesystem.
The given number of bytes in optional K, M, G, or T units must be free.
The filesystem must have at least the given free percentage. The '%' character is optional.
The warning method is called with the following parameters: &warning->($pathname, $full_percentage, $free_bytes, $free_inodes); It is expected to print an appropriate warning message, and delay a few seconds. It should return 0 to cause a retry. It should return nonzero to continue unpacking. The default warning method prints a message to STDERR, waits 30 seconds, and returns 0.
The filesystem may still become full and unpacking may fail, if e.g. factor was chosen lower than the compression ratio of the unpacked archives.
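Since the actual tests are not implemented, the three conditions can only be illustrated. The sketch below takes the measured numbers as plain arguments instead of querying the filesystem. The names parse_size and space_ok are hypothetical, and treating K, M, G and T as powers of 1024 is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: convert '100M' style sizes to bytes.
# Assumption: units are powers of 1024.
sub parse_size {
    my ($s) = @_;
    my ($n, $unit) = $s =~ /^\s*(\d+(?:\.\d+)?)\s*([KMGT]?)\s*$/i
        or die "bad size '$s'\n";
    my %pow = ('' => 0, K => 1, M => 2, G => 3, T => 4);
    return $n * 1024 ** $pow{uc $unit};
}

# Hypothetical helper: true if all three conditions hold.
# Expects archive_size, free_bytes, total_bytes (numbers) and
# factor, bytes, percent (as they would be passed to minfree).
sub space_ok {
    my (%a) = @_;
    (my $pct = $a{percent}) =~ s/%$//;           # the '%' character is optional
    return 0 if $a{archive_size} * $a{factor} > $a{free_bytes};
    return 0 if $a{free_bytes} < parse_size($a{bytes});
    return 0 if 100 * $a{free_bytes} / $a{total_bytes} < $pct;
    return 1;
}

my $MiB = 1024 ** 2;
my $ok = space_ok(
    archive_size => 10 * $MiB, free_bytes => 200 * $MiB, total_bytes => 1000 * $MiB,
    factor => 10, bytes => '100M', percent => '3%',
);
print $ok ? "enough space\n" : "call the warning method\n";
```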
mime
$u->mime($filename)
$u->mime(file => $filename)
$u->mime(buf => "#!/bin ...", file => "what-was-read")
$u->mime(fd => \*STDIN, file => "what-was-opened")
Determines the mimetype (and optionally additional information) of a file. The file can be specified by filename, by a provided buffer, or by an opened file descriptor. For the latter two cases, specifying the filename is optional; it is used for diagnostics.
mime uses Christos Zoulas' excellent libmagic exposed via File::LibMagic and the shared-mime-info database from freedesktop.org exposed via File::MimeInfo::Magic, if available. Either one is sufficient, but having both is better. LibMagic sometimes says 'text/x-pascal', although we have a .desktop file, or returns 'text/plain', but has contradicting details in its description. File::MimeInfo::Magic::magic is consulted where the libmagic output is dubious.
This implementation also features multi-level mime-type recognition for efficient unpacking. If we'd recognize a large bzipped tar ball only as bzip, we'd unpack a huge temporary tar-file, consuming the same amount of disk space as its content, which unpack would extract in a second step. The multi-level recognition returns 'application/x-tar+bzip2' in this case, and allows for a mime-handler to e.g. pipe the bzip2 contents into tar (which is exactly what 'tar jxvf' does, making a very simple and efficient mime-handler).
mime returns a 3 or 4 element arrayref with mimetype, charset, description, diff; where diff is only present when both methods disagree.
In case of 'text/plain', an additional rule based on file name suffix is used to allow recognizing well known plain text pack formats. We return 'text/x-suffix-XX+plain', where XX is one of the recognized suffixes (in all lower case and without the dot). E.g. a plain mmencoded file has no header and looks like 'text/plain' to all the known magic libraries. We recognize the suffixes .mm, .b64, and .base64 for this (case-insensitive). A similar rule exists for 'application/octet-stream'. It may trigger if lzma recognition fails.
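The suffix rule for 'text/plain' can be sketched like this. The function name refine_plain is hypothetical, and the sketch covers only the three suffixes named above, not the similar rule for binary data.

```perl
use strict;
use warnings;

# Hypothetical helper: refine a 'text/plain' verdict using the file suffix.
# Other mimetypes are passed through unchanged.
sub refine_plain {
    my ($mimetype, $filename) = @_;
    return $mimetype unless $mimetype eq 'text/plain';
    if ($filename =~ /\.(mm|b64|base64)\z/i) {
        return 'text/x-suffix-' . lc($1) . '+plain';   # lower case, no dot
    }
    return $mimetype;
}

print refine_plain('text/plain', 'attachment.B64'), "\n";
```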
Examples:
[ 'text/x-perl', 'us-ascii', 'a /usr/bin/perl -w script text']
[ 'text/x-mpegurl', 'utf-8', 'M3U playlist text',
[ 'text/plain', 'application/x-mpegurl']]
[ 'application/x-tar+bzip2', 'binary',
"bzip2 compressed data, block size = 900k\nPOSIX tar archive (GNU)", ...]
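A caller might consume such a result as sketched below. The function mime_summary is hypothetical, and the shape of the optional fourth element (an arrayref holding the two conflicting types) is inferred from the second example above.

```perl
use strict;
use warnings;

# Hypothetical helper: one-line summary of a mime() style result.
# $m is [ mimetype, charset, description ] or
#       [ mimetype, charset, description, [ type_a, type_b ] ].
sub mime_summary {
    my ($m) = @_;
    my ($type, $charset, $descr, $diff) = @$m;
    my $s = "$type; charset=$charset";
    $s .= ' (engines disagree: ' . join(' vs. ', @$diff) . ')' if $diff;
    return $s;
}

print mime_summary([ 'text/x-perl', 'us-ascii', 'a /usr/bin/perl -w script text' ]), "\n";
print mime_summary([ 'text/x-mpegurl', 'utf-8', 'M3U playlist text',
                     [ 'text/plain', 'application/x-mpegurl' ] ]), "\n";
```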
AUTHOR
Juergen Weigert, <jnw at cpan.org>
BUGS
The implementation of mime is an ugly hack. We suffer from the existence of multiple file magic databases, and multiple conflicting implementations. With perl we have at least 5 modules for this; here we use two.
The builtin list of mime-handlers is incomplete. Please submit your handler code.
Please report any bugs or feature requests to bug-file-unpack at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=File-Unpack. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
RELATED MODULES
While designing File::Unpack, a range of other perl modules were examined. Many modules provide valuable service to File::Unpack and became dependencies or are recommended. Others exposed drawbacks during closer examination and may find some of their wheels re-invented here.
Used Modules
- File::LibMagic
This is the preferred mimetype engine. It disregards the suffix, recognizes more types than any of the alternatives, and uses exactly the same engine as /usr/bin/file in your openSUSE system. It also returns charset and description information. We cross-reference the description with the mimetype to detect weaknesses, and consult File::MimeInfo::Magic and some logic of our own, e.g. for detecting LZMA compression, which fails to provide any recognizable magic. Required if you use mime; otherwise not a hard requirement.
- File::MimeInfo::Magic
Uses both magic information and file suffixes to determine the mimetype. Its magic() function is used in a few cases where File::LibMagic fails. E.g. as of June 2010, libmagic does not recognize 'image/x-targa'. File::MimeInfo::Magic may be slower, but it features the shared-mime-info database from freedesktop.org. Recommended if you use mime.
- String::ShellQuote
Used to call external mime-handlers. Required.
- BSD::Resource
Used to reliably restrict the maximum file size. Recommended.
- File::Path
mkpath(). Required.
- Cwd
fast_abs_path(). Required.
- JSON
Used for formatting the logfile. Required.
Modules Not Used
- Archive::Extract
Archive::Extract tries first to determine what type of archive you are passing it, by inspecting its suffix. It does not do this by using Mime magic. Maybe this module should use something like "File::Type" to determine the type, rather than blindly trust the suffix. [quoted from perldoc]
Set $Archive::Extract::PREFER_BIN to 1, which will prefer the use of command line programs and won't consume so much memory. Default: use "Archive::Tar".
- Archive::Zip
If you are just going to be extracting zips (and/or other archives) you are recommended to look at using Archive::Extract. [quoted from perldoc] It is pure perl, so it's a lot slower than your '/usr/bin/zip'.
- Archive::Tar
It is pure perl, so it's a lot slower than your "/bin/tar". It is heavy on memory, all will be read into memory. [quoted from perldoc]
- File::MMagic, File::MMagic::XS, File::Type
Compared to File::LibMagic and File::MimeInfo::Magic, these three are inferior. They often say 'text/plain' or 'application/octet-stream' where the latter two report useful mimetypes.
SUPPORT
You can find documentation for this module with the perldoc command.
perldoc File::Unpack
You can also look for information at:
RT: CPAN's request tracker
AnnoCPAN: Annotated CPAN documentation
CPAN Ratings
Search CPAN
SOURCE REPOSITORY
https://developer.berlios.de/projects/perl-file-unpck
svn co https://svn.berlios.de/svnroot/repos/perl-file-unpck/trunk/File-Unpack
ACKNOWLEDGEMENTS
Mime-type recognition relies heavily on libmagic by Christos Zoulas. I had long hesitated implementing File::Unpack, but set to work when I discovered that File::LibMagic brings your library to perl. Thanks Christos. And thanks for tcsh too.
LICENSE AND COPYRIGHT
Copyright 2010 Juergen Weigert.
This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.
See http://dev.perl.org/licenses/ for more information.