NAME

CrawlerCommons::RobotRulesParser - parser for robots.txt files

SYNOPSIS

use feature 'say';
use CrawlerCommons::RobotRulesParser;

my $rules_parser = CrawlerCommons::RobotRulesParser->new;

my $content = "User-agent: *\r\nDisallow: *images";
my $content_type = "text/plain";
my $robot_names = "any-old-robot";
my $url = "http://domain.com/";

my $robot_rules =
  $rules_parser->parse_content($url, $content, $content_type, $robot_names);

say "We're allowed to crawl the index :)"
 if $robot_rules->is_allowed( "https://www.domain.com/index.html");

for my $blocked_url ("http://www.domain.com/images/some_file.png",
                     "http://www.domain.com/images/another_file.png") {
  say "Not allowed to crawl: $blocked_url"
    unless $robot_rules->is_allowed($blocked_url);
}

DESCRIPTION

This module is a fairly close reproduction of the SimpleRobotRulesParser class from the Crawler-Commons Java library.

From BaseRobotsParser javadoc:

Parse the robots.txt file in content, and return rules appropriate
for processing paths by userAgent. Note that multiple agent names
may be provided as comma-separated values; the order of these shouldn't
matter, as the file is parsed in order, and each agent name found in the
file will be compared to every agent name found in robotNames.
Also note that names are lower-cased before comparison, and that any
robot name you pass shouldn't contain commas or spaces; if the name has
spaces, it will be split into multiple names, each of which will be
compared against agent names in the robots.txt file. An agent name is
considered a match if it's a prefix match on the provided robot name. For
example, if you pass in "Mozilla Crawlerbot-super 1.0", this would match
"crawlerbot" as the agent name, because of splitting on spaces,
lower-casing, and the prefix match rule.
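
The matching rule quoted above can be sketched in plain Perl. This is only an illustration of the description, not the module's actual code: agent_matches is a hypothetical helper, and splitting on both commas and whitespace is an assumption drawn from the javadoc text. The wildcard agent "*" is deliberately left out.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of CrawlerCommons::RobotRulesParser.
# Returns 1 if $agent_name (a User-agent value read from robots.txt) is a
# prefix of any name in $robot_names, and 0 otherwise.
sub agent_matches {
    my ($agent_name, $robot_names) = @_;

    # Lower-case the agent name from the file before comparing.
    my $agent = lc $agent_name;
    return 0 unless length $agent;

    # Assumption: the caller-supplied names are split on commas and
    # whitespace, and lower-cased as well.
    my @targets = grep { length } split /[\s,]+/, lc $robot_names;

    for my $target (@targets) {
        # Prefix match: the agent name must start at offset 0 of the target.
        return 1 if index($target, $agent) == 0;
    }
    return 0;
}

for my $agent ('crawlerbot', 'Mozilla', 'googlebot') {
    printf "%-10s %s\n", $agent,
        agent_matches($agent, 'Mozilla Crawlerbot-super 1.0')
            ? 'match'
            : 'no match';
}
```

With the robot name from the example, "crawlerbot" and "Mozilla" match, while "googlebot" does not.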

The method failedFetch is not implemented.

VERSION

Version 0.02

METHODS

my $robot_rules = $rules_parser->parse_content($url, $content, $content_type, $robot_name)

Parses robots.txt data in $content for the User-agent(s) specified in $robot_name, returning a CrawlerCommons::RobotRules object corresponding to the rules defined for $robot_name.

  • $url

    URL string that's parsed into a URI object to provide scheme, authority, and path for sitemap directive values. If the directive's value begins with a '/', it overrides the path value provided by this URL context string.

  • $content

    The text content of the robots.txt file to be parsed.

  • $content_type

    The content-type of the robots.txt content to be parsed. Assumes text/plain by default. If type is text/html, the parser will attempt to strip out HTML tags from the content.

  • $robot_name

    A string naming the user-agent(s) for which the rules should be extracted; multiple names may be given as comma-separated values.
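
As a rough illustration of the $content_type handling described above, the following sketch removes anything that looks like an HTML tag when the type is text/html. The helper name strip_html_tags and the regular expression are assumptions made for this example; the module's own tag stripping may differ.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the module's documented API.
# Returns $content with tag-like spans removed when $content_type is
# text/html; any other type is returned unchanged.
sub strip_html_tags {
    my ($content, $content_type) = @_;

    # text/plain is the documented default when no type is supplied.
    $content_type = 'text/plain' unless defined $content_type;

    # Also accepts values with parameters, e.g. "text/html; charset=utf-8".
    return $content unless $content_type =~ m{^text/html\b}i;

    # Assumption: a tag is "<", one or more non-">" characters, then ">".
    (my $stripped = $content) =~ s/<[^>]+>//g;
    return $stripped;
}

my $html = "<html><body>User-agent: *<br>\nDisallow: /private</body></html>";
print strip_html_tags($html, 'text/html'), "\n";
```

A plain regular expression like this is only adequate for lightly wrapped robots.txt text; it makes no attempt to decode entities or handle malformed markup.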

AUTHOR

Adam K Robinson <akrobinson74@gmail.com>

COPYRIGHT AND LICENSE

This software is copyright (c) 2017 by Adam K Robinson.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.