NAME
CrawlerCommons::RobotRules - the result of parsing a robots.txt file
SYNOPSIS
    use strict;
    use warnings;
    use feature 'say';

    use CrawlerCommons::RobotRules;
    use CrawlerCommons::RobotRulesParser;

    my $rules_parser = CrawlerCommons::RobotRulesParser->new;

    my $content      = "User-agent: *\r\nDisallow: *images";
    my $content_type = "text/plain";
    my $robot_names  = "any-old-robot";
    my $url          = "http://domain.com/";

    my $robot_rules =
      $rules_parser->parse_content($url, $content, $content_type, $robot_names);

    # obtain the 'mode' of the robot rules object
    say "Anything Goes!!!!"           if $robot_rules->is_allow_all;
    say "Nothing to see here!"        if $robot_rules->is_allow_none;
    say "Default robot rules mode..." if $robot_rules->is_allow_some;

    # are we allowed to crawl a URL? (returns 1 if so, 0 if not)
    say "We're allowed to crawl the index :)"
      if $robot_rules->is_allowed("https://www.domain.com/index.html");

    for ("http://www.domain.com/images/some_file.png",
         "http://www.domain.com/images/another_file.png") {
        say "Not allowed to crawl: $_" unless $robot_rules->is_allowed($_);
    }
DESCRIPTION
This object represents the result of parsing a single robots.txt file. It is returned by the parse_content method of CrawlerCommons::RobotRulesParser (see SYNOPSIS) and reports whether URLs may be crawled under the parsed rules.
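For example, the mode predicates shown in the SYNOPSIS reflect how the ruleset was produced. A minimal sketch, assuming this module follows the Java crawler-commons convention of returning an allow-all ruleset for empty robots.txt content (the URL and robot name are illustrative):

    use strict;
    use warnings;
    use feature 'say';

    use CrawlerCommons::RobotRulesParser;

    my $parser = CrawlerCommons::RobotRulesParser->new;

    # Empty content is assumed to yield an allow-all ruleset, as in the
    # Java crawler-commons library on which this module is modelled.
    my $rules = $parser->parse_content(
        "http://domain.com/robots.txt",   # source URL (illustrative)
        "",                               # empty robots.txt body
        "text/plain",
        "any-old-robot",
    );

    say "everything may be crawled" if $rules->is_allow_all;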
VERSION
Version 0.03
METHODS
my $true_or_false = $robot_rules->is_allowed( $url )
Returns 1 if we are allowed to crawl the URL represented by $url, and 0 otherwise. If is_allow_all() is true the method always returns 1, and if is_allow_none() is true it always returns 0. Otherwise it returns 1 when the URL's path matches an allow rule, or when no disallow rule matches it.
$url
The URL whose path is matched against this object's rules to decide whether crawling is permitted.
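As a concrete illustration of the behaviour described above, a minimal sketch (the robots.txt body and URLs are invented for the example, and the more specific allow rule is assumed to take precedence over the disallow rule, as in crawler-commons' longest-match handling):

    use strict;
    use warnings;
    use feature 'say';

    use CrawlerCommons::RobotRulesParser;

    my $parser = CrawlerCommons::RobotRulesParser->new;
    my $rules  = $parser->parse_content(
        "http://domain.com/robots.txt",
        "User-agent: *\r\nDisallow: /private/\r\nAllow: /private/public.html",
        "text/plain",
        "any-old-robot",
    );

    say $rules->is_allowed("http://domain.com/index.html");
    # 1: no disallow rule matches this path
    say $rules->is_allowed("http://domain.com/private/secret.html");
    # 0: matches "Disallow: /private/"
    say $rules->is_allowed("http://domain.com/private/public.html");
    # 1: the more specific allow rule applies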
AUTHOR
Adam Robinson <akrobinson74@gmail.com>