NAME

CrawlerCommons::RobotRules - the result of parsing a single robots.txt file

SYNOPSIS

use strict;
use warnings;
use feature 'say';    # the examples below use 'say'

use CrawlerCommons::RobotRules;
use CrawlerCommons::RobotRulesParser;

my $rules_parser = CrawlerCommons::RobotRulesParser->new;

my $content = "User-agent: *\r\nDisallow: *images";
my $content_type = "text/plain";
my $robot_names = "any-old-robot";
my $url = "http://domain.com/";

my $robot_rules =
  $rules_parser->parse_content($url, $content, $content_type, $robot_names);

# obtain the 'mode' of the robot rules object
say "Anything Goes!!!!" if $robot_rules->is_allow_all;
say "Nothing to see here!" if $robot_rules->is_allow_none;
say "Default robot rules mode..." if $robot_rules->is_allow_some;

# are we allowed to crawl a URL (returns 1 if so, 0 if not)
say "We're allowed to crawl the index :)"
 if $robot_rules->is_allowed( "https://www.domain.com/index.html");

say "Not allowed to crawl: $_" unless $robot_rules->is_allowed( $_ )
  for ("http://www.domain.com/images/some_file.png",
       "http://www.domain.com/images/another_file.png");

DESCRIPTION

This object is the result of parsing a single robots.txt file. It records whether crawling is allowed for everything, nothing, or only some paths, and individual URLs can be checked against it with is_allowed().

VERSION

Version 0.03

METHODS

my $true_or_false = $robot_rules->is_allowed( $url )

Returns 1 if the URL given by $url may be crawled, and 0 otherwise. The method returns 1 immediately when is_allow_all() is true and 0 when is_allow_none() is true; otherwise it returns 1 when the URL's path matches an explicit allow rule, or matches no disallow rule at all.

  • $url

    The URL whose path is matched against the rules held by this object.
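
A minimal sketch of the behaviour described above, using only the calls shown elsewhere in this document (parse_content and is_allowed); whether a given URL comes back as allowed naturally depends on the rules in the robots.txt that was parsed. Here the content disallows every path, so both checks are expected to print 0.

  use strict;
  use warnings;
  use feature 'say';

  use CrawlerCommons::RobotRulesParser;

  my $parser = CrawlerCommons::RobotRulesParser->new;

  # a robots.txt that disallows every path for every agent
  my $rules = $parser->parse_content(
      "http://domain.com/",
      "User-agent: *\r\nDisallow: /",
      "text/plain",
      "any-old-robot",
  );

  # is_allowed() returns 1 or 0, so these can be printed directly
  say "index allowed?   ", $rules->is_allowed("http://domain.com/");
  say "private allowed? ", $rules->is_allowed("http://domain.com/private.html");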

AUTHOR

Adam Robinson <akrobinson74@gmail.com>