
NAME

WWW::RobotRules::Parser::MultiValue - Parse robots.txt

SYNOPSIS

use WWW::RobotRules::Parser::MultiValue;
use LWP::Simple qw(get);

my $url = 'http://example.com/robots.txt';
my $robots_txt = get $url;

my $rules = WWW::RobotRules::Parser::MultiValue->new(
    agent => 'TestBot/1.0',
);
$rules->parse($url, $robots_txt);

if ($rules->allows('http://example.com/some/path')) {
    my $delay = $rules->delay_for('http://example.com/');
    sleep $delay;
    ...
}

my $hash = $rules->rules_for('http://example.com/');
my @list_of_allowed_paths = $hash->get_all('allow');
my @list_of_custom_rule_values = $hash->get_all('some-rule');

DESCRIPTION

WWW::RobotRules::Parser::MultiValue is a parser for robots.txt.

Parsed rules for the specified user agent are stored as a Hash::MultiValue, where each key is a lower-cased rule name.

The Request-rate rule is handled specially: it is normalized to a Crawl-delay rule.
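
The following is a minimal sketch of how multi-valued rules can be read back through Hash::MultiValue. The robots.txt content, the Some-rule name, and the URLs are invented for illustration; only the methods documented below are used.

use WWW::RobotRules::Parser::MultiValue;

# Hypothetical robots.txt with repeated Allow lines and a custom rule.
my $robots_txt = <<'TXT';
User-agent: *
Allow: /public
Allow: /images
Some-rule: some value
Disallow: /
TXT

my $rules = WWW::RobotRules::Parser::MultiValue->new(
    agent => 'TestBot/1.0',
);
$rules->parse('http://example.com/robots.txt', $robots_txt);

# Every value recorded for a rule is available via get_all();
# keys are lower-cased rule names.
my $hash    = $rules->rules_for('http://example.com/');
my @allowed = $hash->get_all('allow');        # ('/public', '/images')
my @custom  = $hash->get_all('some-rule');    # ('some value')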

METHODS

new
$rules = WWW::RobotRules::Parser::MultiValue->new(
    agent => $user_agent
);
$rules = WWW::RobotRules::Parser::MultiValue->new(
    agent => $user_agent,
    ignore_default => 1,
);

Creates a new object to handle rules in robots.txt. The object parses only the rules that match $user_agent. Rules under User-agent: * always match, but have lower precedence than rules that explicitly match $user_agent. If the ignore_default option is specified, rules under User-agent: * are simply ignored. The sketch below illustrates the difference.
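
A hedged illustration of the ignore_default behaviour described above; the robots.txt content and URLs are invented for the example.

use WWW::RobotRules::Parser::MultiValue;

my $robots_txt = <<'TXT';
User-agent: *
Disallow: /private
TXT

# By default, the User-agent: * section applies to TestBot as a fallback.
my $default = WWW::RobotRules::Parser::MultiValue->new(
    agent => 'TestBot/1.0',
);
$default->parse('http://example.com/robots.txt', $robots_txt);
$default->allows('http://example.com/private');   # not allowed

# With ignore_default, the User-agent: * section is skipped entirely,
# so nothing disallows the path for TestBot.
my $strict = WWW::RobotRules::Parser::MultiValue->new(
    agent          => 'TestBot/1.0',
    ignore_default => 1,
);
$strict->parse('http://example.com/robots.txt', $robots_txt);
$strict->allows('http://example.com/private');     # allowed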

parse
$rules->parse($uri, $text);

Parses the robots.txt content $text retrieved from $uri.

match_ua
$rules->match_ua($pattern);

Tests whether the user agent matches $pattern.
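
A small sketch of a possible call; the exact pattern semantics are not spelled out here, so the pattern below (a User-agent-line style name) is only an assumption.

my $rules = WWW::RobotRules::Parser::MultiValue->new(
    agent => 'TestBot/1.0',
);

# Assuming the pattern is matched against the configured agent name.
if ($rules->match_ua('TestBot')) {
    print "rules for TestBot apply to this agent\n";
}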

rules_for
$hash = $rules->rules_for($uri);

Returns a Hash::MultiValue describing the rules for the domain of $uri.

allows
$test = $rules->allows($uri);

Tests whether the user agent is allowed to visit $uri. If an Allow rule matches the path of $uri, the visit is allowed. If a Disallow rule matches the path, the visit is not allowed. Otherwise, the visit is allowed.
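
For example, with a hypothetical robots.txt that mixes Allow and Disallow rules (the paths and expected results in the comments follow the precedence described above and are illustrative only):

use WWW::RobotRules::Parser::MultiValue;

my $robots_txt = <<'TXT';
User-agent: *
Allow: /articles/public
Disallow: /articles
TXT

my $rules = WWW::RobotRules::Parser::MultiValue->new(
    agent => 'TestBot/1.0',
);
$rules->parse('http://example.com/robots.txt', $robots_txt);

$rules->allows('http://example.com/articles/public/1');  # allowed by the Allow rule
$rules->allows('http://example.com/articles/secret');    # blocked by the Disallow rule
$rules->allows('http://example.com/about');              # no matching rule, so allowed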

delay_for
$delay = $rules->delay_for($uri);
$delay_in_milliseconds = $rules->delay_for($uri, 1000);

Calculates the crawl delay for the specified $uri. The value is determined by the Crawl-delay rule or the Request-rate rule. The second argument scales the return value: it specifies how many units make up one second, so passing 1000 returns the delay in milliseconds.
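
A short sketch of the unit handling; the robots.txt content and URL are invented for the example.

use WWW::RobotRules::Parser::MultiValue;

my $robots_txt = <<'TXT';
User-agent: *
Crawl-delay: 2
TXT

my $rules = WWW::RobotRules::Parser::MultiValue->new(
    agent => 'TestBot/1.0',
);
$rules->parse('http://example.com/robots.txt', $robots_txt);

my $seconds      = $rules->delay_for('http://example.com/');        # 2
my $milliseconds = $rules->delay_for('http://example.com/', 1000);  # 2000

sleep $rules->delay_for('http://example.com/');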

SEE ALSO

Hash::MultiValue

LICENSE

Copyright (C) INA Lintaro

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

AUTHOR

INA Lintaro <tarao.gnn@gmail.com>