NAME
WWW::Crawl - A simple web crawler for extracting links and more from web pages
VERSION
This documentation refers to WWW::Crawl version 0.1.
SYNOPSIS
    use WWW::Crawl;

    my $crawler = WWW::Crawl->new();

    my $url = 'https://example.com';
    my @visited = $crawler->crawl($url, \&process_page);

    sub process_page {
        my $url = shift;
        print "Visited: $url\n";
        # Your processing logic here
    }
DESCRIPTION
The WWW::Crawl module provides a simple web crawling utility for extracting links and other resources from web pages within a single domain. It can be used to recursively explore a website and retrieve URLs, including those found in HTML href attributes, form actions, external JavaScript files, and JavaScript window.open links. WWW::Crawl will not stray outside the supplied domain.
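As a brief illustration of the behaviour described above, the sketch below collects each URL the crawler reports. It assumes an anonymous subroutine is acceptable as the callback (the SYNOPSIS uses a named subroutine reference); the %seen hash is this example's own addition:

    use strict;
    use warnings;
    use WWW::Crawl;

    my $crawler = WWW::Crawl->new();

    # The callback receives only the visited page's URL.
    my %seen;
    my @visited = $crawler->crawl('https://example.com', sub {
        my $url = shift;
        $seen{$url}++;
        print "Visited: $url\n";
    });

    printf "Crawl finished: %d URLs parsed\n", scalar @visited;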
CONSTRUCTOR
new(%options)
Creates a new WWW::Crawl object. You can optionally provide the following options as key-value pairs:

agent
: The user agent string to use for HTTP requests. Defaults to "Perl-WWW-Crawl-VERSION", where VERSION is the module version.

timestamp
: If a timestamp is added to external JavaScript files to ensure the browser loads the latest version, this option prevents multiple copies of the same file being indexed by ignoring the timestamp query parameter.

nolinks
: Don't follow links found in the starting page. This option is provided for testing and prevents WWW::Crawl following the links it finds. It also affects the return value of the crawl method.

A constructor call combining these options is sketched below.
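For illustration only, the options above might be combined as follows. The values shown are assumptions: in particular, timestamp and nolinks are treated here as simple true/false flags, which this documentation does not explicitly confirm:

    use WWW::Crawl;

    # Illustrative only: the option values below are assumptions.
    my $crawler = WWW::Crawl->new(
        agent     => 'MyCrawler/1.0',  # custom User-Agent string
        timestamp => 1,                # assumed flag: ignore timestamp query parameters
        nolinks   => 1,                # assumed flag: don't follow links (testing)
    );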
METHODS
crawl($url, [$callback])
Starts crawling the web from the given URL. The $url parameter specifies the starting URL.

The optional $callback parameter is a reference to a subroutine that will be called for each visited page. It receives the URL of the visited page as an argument.

The crawl method will explore the provided URL and its linked resources. It will also follow links found in form actions, external JavaScript files, and JavaScript window.open links. The crawling process continues until no more unvisited links are found.
In exploring the website, crawl will ignore links to the following file types: .pdf, .css, .png, .jpg, .svg and .webmanifest.
Returns an array of the URLs that were parsed during the crawl. If the nolinks option is passed to new, it instead returns an array of the links found on the initial page.
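A short sketch of the two return modes (again assuming nolinks is a simple flag):

    my $crawler = WWW::Crawl->new();

    # Normal crawl: returns every URL parsed; the callback is optional.
    my @parsed = $crawler->crawl('https://example.com');

    # With nolinks set (assumed here to be a simple flag), crawl()
    # instead returns only the links found on the initial page.
    my @links = WWW::Crawl->new(nolinks => 1)->crawl('https://example.com');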
AUTHOR
Ian Boddison, <bod at cpan.org>
BUGS
Please report any bugs or feature requests to bug-www-crawl at rt.cpan.org, or through the web interface at https://rt.cpan.org/NoAuth/ReportBug.html?Queue=WWW-Crawl. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
SUPPORT
You can find documentation for this module with the perldoc command.
perldoc WWW::Crawl
You can also look for information at:
GitHub
RT: CPAN's request tracker (report bugs here)
Search CPAN
ACKNOWLEDGEMENTS
LICENSE AND COPYRIGHT
This software is Copyright (c) 2023 by Ian Boddison.
This program is released under the following license:
Perl