NAME
HTTP::GetImages - Spider to recover and store images from web pages.
SYNOPSIS
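    use HTTP::GetImages;

    # Where to store images, which URLs to process and which to ignore
    my $spider = HTTP::GetImages->new(
        dir  => '.',
        todo => ['http://www.google.com/'],
        dont => ['http://www.somewhere/ignorethis.html', 'http://and.this.html'],
        chat => 1,
    );

    # Report on what was saved, visited, failed and ignored
    $spider->print_imgs;
    $spider->print_done;
    $spider->print_failed;
    $spider->print_ignored;

    # Walk the hash of saved images
    my $hash = $spider->imgs_as_hash;
    foreach my $key (keys %{$hash}) {
        warn "$key = ", $hash->{$key}, "\n";
    }
    exit;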
DESCRIPTION
This module allows you to automate the searching, recovery and local storage of images from the web, including those linked by anchor (A), image (IMG) and image map (AREA) elements.
Supply a URI or list of URIs to process, and HTTP::GetImages
will recurse over every link it finds, searching for images.
By supplying a list of URIs, you can restrict the search to certain webservers and directories, or exclude it from certain webservers and directories.
You can also decide to reject images that are too small or too large.
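The kind of link gathering described above can be sketched with HTML::TokeParser, which is listed under DEPENDENCIES. This is only an illustration and not the module's own code: the helper name extract_links and the sample page are hypothetical, and it assumes that IMG elements carry their target in src while A and AREA elements use href.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return the link targets of A, IMG and AREA elements.
sub extract_links {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot parse HTML";
    my @links;
    # get_tag skips ahead to the next start tag with one of these names.
    while (my $tag = $parser->get_tag('a', 'img', 'area')) {
        my ($name, $attr) = @{$tag};
        # Assumption: IMG uses src; A and AREA use href.
        my $target = $name eq 'img' ? $attr->{src} : $attr->{href};
        push @links, $target if defined $target;
    }
    return @links;
}

# Hypothetical sample page with one element of each kind.
my $page = <<'END_HTML';
<a href="gallery.html">Gallery</a>
<img src="logo.png" alt="logo">
<map name="m"><area href="big.jpg" shape="rect" coords="0,0,9,9"></map>
END_HTML

print "$_\n" for extract_links($page);
```

A spider would then resolve each target against the page's own URL and decide whether to fetch it as an image or follow it as a further page.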
DEPENDENCIES
LWP::UserAgent;
HTTP::Request;
HTML::TokeParser;
PACKAGE GLOBAL VARIABLE
$CHAT
Set to a value above zero if you'd like a real-time report to STDERR. Defaults to off.
CONSTRUCTOR METHOD new
Besides the class reference, accepts name=>value pairs:
- max_attempts
-
The maximum attempts the agent should make to access the site. Default is three.
- dir
-
The path to the directory in which to store images (no trailing oblique is necessary).
- rename
-
Default value is 0, which allows images to be saved with their original names. If set with a value of 1, images will be given new names based on the time they were saved at. If set to 2, images will be given filenames according to their source location.
- todo
-
One or more URLs to process: can be an anonymous array, an array reference, or a scalar.
- dont
-
As todo, above, but these URLs will be ignored. If one of these is ALL, then all HTML documents that do not exactly match those in the todo array of URLs to process will be ignored. If one of these is NONE, no documents will be ignored.
- ext_ok
-
A regular expression 'or' list of image extensions to match. It is applied at the end of a filename, after a point, and is insensitive to case. Defaults to (jpg|jpeg|bmp|gif|png|xbm|xmp).
- ext_bad
-
As ext_ok (above), but the default value is: (wmv|avi|rm|mpg|asf|ram|asx|mpeg|mp3)
- match_url
-
The minimum path a URL must contain. This can be a scalar or an array reference.
- min_size
-
The minimum size an image can be if it is to be saved.
- max_size
-
The maximum size an image can be if it is to be saved.
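To make the list arguments, extension lists and size limits above more concrete, here is a minimal sketch in plain Perl. It is not taken from the module's source: the helper names as_list, has_ext and size_ok are hypothetical, and the sketch assumes that sizes are measured in bytes and that an undefined limit means "no limit". The two extension lists are the defaults quoted above.

```perl
use strict;
use warnings;

# Defaults quoted from the parameter list.
my $ext_ok  = '(jpg|jpeg|bmp|gif|png|xbm|xmp)';
my $ext_bad = '(wmv|avi|rm|mpg|asf|ram|asx|mpeg|mp3)';

# Hypothetical helper: accept a scalar or an array reference, return a list.
sub as_list {
    my ($value) = @_;
    return () unless defined $value;
    return ref $value eq 'ARRAY' ? @{$value} : ($value);
}

# Hypothetical helper: true if the name ends in a point followed by one of
# the listed extensions, ignoring case.
sub has_ext {
    my ($name, $ext_list) = @_;
    return $name =~ /\.(?:$ext_list)\z/i ? 1 : 0;
}

# Hypothetical helper: sizes are assumed to be bytes; undef means no limit.
sub size_ok {
    my ($size, $min, $max) = @_;
    return 0 if defined $min && $size < $min;
    return 0 if defined $max && $size > $max;
    return 1;
}

my @todo = as_list('http://www.google.com/');
print scalar(@todo), " URL(s) to process\n";
print 'photo.JPG: ', (has_ext('photo.JPG', $ext_ok)  ? 'ok'  : 'not ok'),  "\n";
print 'clip.mpeg: ', (has_ext('clip.mpeg', $ext_bad) ? 'bad' : 'not bad'), "\n";
print '500 bytes with min 1024: ', (size_ok(500, 1024, undef) ? 'keep' : 'skip'), "\n";
```

Anchoring the pattern after a literal point keeps a name such as notjpg from matching, and the /i flag gives the case insensitivity the ext_ok description mentions.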
The object has several private variables, which you can access for the results when the job is done. However, do check out the public methods for accessing these.
- DONE
-
A hash, keys of which are the original URLs of the images, values being the local filenames.
- FAILED
-
A hash, keys of which are the failed URLs, values being short reasons.
METHOD print_imgs
Print a list of the images saved.
METHOD imgs_as_hash
Returns a reference to a hash of images saved, where keys are new image locations, values are original locations.
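A short sketch of working with a hash reference of that shape follows. The data is a hand-built, hypothetical stand-in for the value returned by imgs_as_hash (the file names and example.com URLs are invented for illustration), so the snippet runs without fetching anything.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the reference returned by imgs_as_hash:
# keys are new (local) image locations, values are original locations.
my $imgs = {
    './logo.png'  => 'http://www.example.com/img/logo.png',
    './photo.jpg' => 'http://www.example.com/pics/photo.jpg',
};

# Invert the hash to look up a local file by its original URL.
# This assumes every original URL appears only once.
my %local_for = reverse %{$imgs};

# List the saved images in a stable order.
foreach my $local (sort keys %{$imgs}) {
    print $local, ' <- ', $imgs->{$local}, "\n";
}

my $url = 'http://www.example.com/img/logo.png';
print 'logo saved as ', $local_for{$url}, "\n";
```

Inverting the hash is only safe while the values are unique; if two local files came from the same URL, one of them would be lost in %local_for.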
METHOD print_done
Print a list of the URLs accessed and return a reference to a hash of the same.
METHOD print_failed
Print a list of the URLs that failed, with reasons, and return a reference to a hash of the same.
METHOD print_ignored
Print a list of the URLs ignored and return a reference to a hash of the same.
SEE ALSO
Every thing and every one listed above under DEPENDENCIES.
REVISIONS
Version 0.34*, updates by Lee Goddard:
Re-implemented the dont => ['ALL'] feature that got lost during the redesign of the API; agent now makes multiple attempts to get the image.
Version 0.32, updates by Lee Goddard: fixed bugs.
Version 0.31, updates by Lee Goddard: added 'max_size'.
Version 0.3, updates by Lee Goddard:
Made it a nicer API and tidied up some coding and added a couple of methods. Started to add tests.
Version 0.25, updates by Duncan Lamb and Lee Goddard:
The character ~ in the URL would confuse the abs_url subroutine, resolving http://www.o.com/~home/page.html to http://www.o.com. It doesn't any more.
Double obliques in a link would cause an endless loop - no longer.
A link referencing its own directory with ./ would also cause an endless loop - but no more.
EXTENSIONS_BAD list added.
NEWNAMES updated.
Frame parsing.
Multiple minimum-paths for URLs added.
USES
GetImages.pm is proud to be part of Duncan Lamb's HTTP::StegTest:
An example report can be found at http://64.192.146.9/ in which the library was run against several anti-American and "pro-Taliban" sites. The reports display images that changed between collections, images that tested positive for being altered by an outside program, and images which were "false positives." Over 25,000 images were tested across 10 sites.
AUTHOR
Lee Goddard (LGoddard@CPAN.org) 05/05/2001 16:08 ff.
With updates and fixes from Duncan Lamb (duncan_lamb@hotmail.com), 12/2001.
COPYRIGHT
Copyright 2000-2001 Lee Goddard.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.