NAME

WWW::Agent::Plugins::Director - plugin for controlling an agent

SYNOPSIS

use WWW::Agent;
use WWW::Agent::Plugins::Director;
my $agent = WWW::Agent->new (plugins => [
                                         ...,
                                         WWW::Agent::Plugins::Director->new,
                                         ...
                                         ]);

# do it manually (consider using WWW::Agent::Zombie)
# $weezl holds the text of the WeeZL script
use POE;
POE::Kernel->post ( 'agent', 'director_execute', 'zombie', $weezl );
$agent->run;

DESCRIPTION

This plugin for WWW::Agent lets you send the agent a script written in WeeZL. The script can direct the agent to visit particular pages, assert that the URL is what you expect, wait for some time, check for text in the page, fill out forms and automatically follow links.

The language also lets you define functional blocks which are executed whenever a specified URL is visited.

Requisites

If you use this plugin, you must make sure that WWW::Agent::Plugins::Focus is loaded first.

Web Zombie Language (WeeZL)

The Web Zombie Language, pronounced weezle, specifies the behaviour of a virtual web user. It also lets you define assertions and conditions to be checked at certain times. Assertions can be used for testing web sites, conditions for triggering customized actions.

WeeZL is for the most part a procedural language: commands are executed sequentially, in the order given in the text.

Comments

WeeZL scripts can contain comments. As in Perl, a comment starts with a hash sign (#) and extends to the end of the line.

Actions

Actions are the primitives which can be executed. As such, they can fail; when one does, an internal exception is raised. This is not fatal to the process, as actions can be combined such that the failure of one action is compensated by another.

The agent also uses the concept of a focus: on any page, the agent can be asked to focus on a particular subelement (treating the HTML as a decent XML document). The focus can be narrowed down further. After every statement, though, the focus is reset to the whole page.

The language offers the following primitives:

URL request

The command url URL makes the agent move to the given URL. If the URL cannot be fetched successfully, an internal exception is raised.

URL assertion

The command url regexp tests whether the agent is currently at a URL which matches the given regular expression. If not, an internal exception will be raised.

forced exception

The command die message will raise an internal exception. It always fails (and in doing so, succeeds :-). The message will be forwarded to the application unless the exception is handled internally.

messages

The command warn message will write the message to STDERR. It never fails.

waiting

The command wait n secs makes the agent wait the given amount of time. The command never fails.

The variant wait ~ n secs will randomly dither the time to wait. The dithering can be controlled with the time_dither parameter for the constructor.
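
How the dithering is computed is not spelled out here. As a rough sketch only, assuming the waiting time is spread uniformly around n by at most the time_dither percentage (the helper name dithered_wait is hypothetical and not part of the plugin):

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the plugin: returns a waiting time
# spread uniformly around $secs by at most the given percentage.
sub dithered_wait {
    my ($secs, $dither) = @_;
    my ($percent) = $dither =~ /\A(\d+)%\z/
        or die "time_dither must look like '10%', got '$dither'\n";
    my $spread = $secs * $percent / 100;
    # rand() lies in [0, 1), so the result lies in
    # [secs - spread, secs + spread).
    return $secs - $spread + rand() * 2 * $spread;
}

my $delay = dithered_wait(10, '10%');
printf "waiting %.2f secs\n", $delay;
```

With the default of 10%, wait ~ 10 secs would under this assumption pause for somewhere between 9 and 11 seconds.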

text testing

The command text regexp tests whether the current focus contains text which matches the given regular expression. The match is performed after all HTML markup has been removed. If there is no match, then this command fails with an exception.

HTML testing

The command html regexp tests whether the current focus matches the regular expression given. If not, then this command will fail with an exception.

focussing

The command < html-element > changes the current focus by looking for this particular HTML element in the current focus (or the whole page if no focus exists yet). If that subelement cannot be found, this command will fail.

Optionally a regular expression can be added, so that this command only succeeds if the text inside the new focus matches the regular expression.

Optionally an index can be provided with [ n ] to select the nth occurrence of that element in the current focus. Counting starts with zero.

Filling out FORMs

The command fill identifier value assumes that the current focus is on a FORM element. Otherwise the command will fail.

The identified field of the FORM will be filled with the given value.

NOTE: This is not yet functionally complete (popup menus, checkboxes, ...).

The command click assumes that the current focus is either on a FORM or on an anchor (<a>) element.

For a FORM it will use the FORM's current values and submit the FORM to the URL given in the ACTION attribute.

For an anchor, the command will make the agent follow the link given in the HREF attribute.

Blocks

You can also define separate blocks which can be invoked like subroutines or handlers. To define a block you can use either a label or a regular expression.

Blocks labelled with a simple name behave like subroutines, as the following example demonstrates. First we define a block which takes care of logging into a site:

login: {
        url http://www.example.org/login.php
        <form> and fill username 'jill'
               and fill password 'jack'
        text m|logged in|
        }

Later on in our script we invoke that block:

url http://www.example.org/
login()
#....

You can also pass parameters into a block:

login: {
        url http://www.example.org/login.php
        <form> and fill username $uid
               and fill password $pwd
        text m|logged in|
        }

url http://www.example.org/
login(uid => 'jill', pwd => 'jack')
#....

The parameters are then available inside the block as variables (prefixed with '$', of course).

You can also use regular expressions as block names. After each successful request these are checked against the current URL. If one of them matches, the block associated with that regular expression is executed automatically. The order in which the expressions are checked is not defined.

m|login.php|: {
        <form> and fill username 'jack'
               and fill password 'jill'
        text m|logged in|
        }

url http://www.example.org/
url http://www.example.org/login.php  # here we trigger the block
#....
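
The triggering of such blocks can be pictured as a small dispatch table. The following is only a sketch of the idea, not the plugin's implementation; the names @url_blocks and run_matching_blocks are hypothetical, and it assumes that every block whose pattern matches is run:

```perl
use strict;
use warnings;

# Hypothetical registry: each entry pairs a URL pattern with a handler.
my @url_blocks = (
    [ qr|login\.php|, sub { return "login block ran for $_[0]" } ],
    [ qr|logout|,     sub { return "logout block ran for $_[0]" } ],
);

# Called after a successful request; runs every block whose pattern
# matches the URL and returns the collected results.
sub run_matching_blocks {
    my ($url) = @_;
    return map  { $_->[1]->($url) }
           grep { $url =~ $_->[0] } @url_blocks;
}

print "$_\n" for run_matching_blocks('http://www.example.org/login.php');
```

A request for http://www.example.org/ matches neither pattern here, so no block would run for it.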

Application Hooks

In some cases you may want to invoke your own functions from inside a WeeZL script. This is useful when you have reached a certain page (or a part of it) and want to extract specific information from it.

For this purpose you have to list your functions in the constructor:

WWW::Agent::Plugins::Director->new (...
                                    functions   => {
                                                    extract1 => sub {...},
                                                    extract2 => sub {...},
                                                    ...
                                                    }
                                    )

Inside a WeeZL script you simply name the function you want to invoke:

url http://www.example.org/interesting.html
<table> [1] and extract1
<table> [3] and extract2
extract3

After loading the named page, the agent will try to focus on the 2nd (index 1) table element and will invoke the function associated with extract1. In this process the function will get one parameter, namely the HTMLified text of the current focus.

NOTE: THIS MAY CHANGE IN FUTURE VERSIONS.

The function is not expected to return anything, but it may die.

NOTE: THIS IS NOT WELL SUPPORTED YET.

If that invocation was successful, the 4th table (index 3) in the current page is selected and extract2 is invoked. After that, extract3 is called with the whole page as its focus.
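
What such a function does with its parameter is up to the application. As one possible sketch, the hypothetical function extract_cells below collects the cell texts of a table from the HTML it is given. The regular expressions are a deliberate simplification and will not cope with nested tables:

```perl
use strict;
use warnings;

my @rows;   # collected by the hook for later use by the application

# Hypothetical hook: receives the HTML text of the current focus and
# stores the text of each table cell, one array reference per row.
sub extract_cells {
    my ($html) = @_;
    for my $row ($html =~ m{<tr\b[^>]*>(.*?)</tr>}gis) {
        my @cells = $row =~ m{<t[dh]\b[^>]*>(.*?)</t[dh]>}gis;
        s/<[^>]+>//g     for @cells;   # drop nested markup
        s/^\s+|\s+$//g   for @cells;   # trim surrounding whitespace
        push @rows, \@cells;
    }
    return;
}

# It would be registered as: functions => { extract1 => \&extract_cells }
extract_cells('<table><tr><td> a </td><td><b>b</b></td></tr></table>');
print join(', ', @{ $rows[0] }), "\n";
```

Because the function is not expected to return anything, the sketch keeps its results in a variable of the application instead.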

Conjunctions

Primitive actions can be combined with and. The action to the right of an and is executed only if all actions to its left have succeeded:

<form> and fill name 'James Bond'

Here filling out the form is attempted only after the form has been found, whereas in

<form>
fill name 'James Bond'

the form is found first, but then forgotten again because the focus is reset to the whole page after the statement. Filling out will then fail.
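
The behaviour of and can be modelled in plain Perl by treating each action as a code reference which dies on failure. This is only an illustration of the semantics described here, not the plugin's own code; and_chain is a hypothetical name:

```perl
use strict;
use warnings;

# Model each action as a code reference that dies on failure.
# and_chain returns a new action that runs its parts left to right;
# the first failure propagates and the remaining parts are skipped.
sub and_chain {
    my @actions = @_;
    return sub { $_->() for @actions; return 1 };
}

my @log;
my $find_form = sub { die "no form in focus\n" };       # fails
my $fill_name = sub { push @log, 'filled name'; 1 };    # never reached

my $ok = eval { and_chain($find_form, $fill_name)->(); 1 };
print $ok ? "succeeded\n" : "failed: $@";
```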

Random Choice

Using the infix operator xor you can make the agent choose arbitrarily between two or more alternatives:

url http://www.example1.org/ xor
url http://www.example2.org/ xor
url http://www.example3.org/

The agent will follow one of these choices.
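
In Perl terms the choice could be sketched as follows. The helper xor_choice is hypothetical, and the uniform distribution is an assumption; this document only says that the choice is arbitrary:

```perl
use strict;
use warnings;

# Pick exactly one of the alternatives at random and run only that one.
sub xor_choice {
    my @actions = @_;
    return $actions[ int rand @actions ]->();
}

my $visited = xor_choice(
    sub { 'http://www.example1.org/' },
    sub { 'http://www.example2.org/' },
    sub { 'http://www.example3.org/' },
);
print "visiting $visited\n";
```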

Catching Exceptions

If an action fails then the exception can be caught internally by providing more actions connected with the infix operator or:

url http://www.example1.org/ or warn "that is not good, but we continue"

url http://www.example2.org/ or die "now this is really bad"

url http://www.example3.org/logged-in.php or
    login(uid => 'jack', pwd => 'jill')

The whole command fails only if the last action in an or sequence fails.
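
The same rule can be written down in plain Perl, with actions modelled as code references which die on failure. The helper or_chain is hypothetical and only illustrates the semantics; it is not taken from the plugin:

```perl
use strict;
use warnings;

# or_chain returns an action that tries its parts left to right and
# stops at the first one that succeeds. If every part fails, the
# exception of the last part is raised again.
sub or_chain {
    my @actions = @_;
    return sub {
        my $error;
        for my $action (@actions) {
            return 1 if eval { $action->(); 1 };
            $error = $@;
        }
        die $error;
    };
}

my @log;
my $visit = or_chain(
    sub { die "cannot fetch page\n" },
    sub { push @log, 'that is not good, but we continue'; 1 },
);
$visit->();
print "$_\n" for @log;
```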

Examples

@@@ TBW @@@
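
A short script combining the constructs above might look like the sketch below, written as the Perl string that would be handed to the plugin. It assumes that the focus on the form is kept across the whole and chain, and that a function named extract1 has been passed to the constructor; neither the URLs nor the field names refer to a real site.

```perl
use strict;
use warnings;

# Single-quoted heredoc, so that Perl leaves $uid and $pwd alone.
my $weezl = <<'WEEZL';
# log in, then collect data from the first table
login: {
    url http://www.example.org/login.php
    <form> and fill username $uid
           and fill password $pwd
           and click
    text m|logged in|
}

url http://www.example.org/ or die "cannot reach the site"
login(uid => 'jill', pwd => 'jack')
wait ~ 5 secs
url http://www.example.org/interesting.html
<table> [0] and extract1
WEEZL

print $weezl;
```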

Grammar

As notation we use | for alternatives, [] to group optional sequences, {} to group sequences which may occur any number of times. The notation

< something ',' >

is equivalent to, but more concise than

[ something { ',' something } ]

'xxx' is used for terminals, regular expressions are used to characterize other lexical constants, and all other identifiers are non-terminals:

    plan          : { subplan } { step }

    subplan       : indicator ':' '{' { step } '}'

    indicator     : regexp | identifier

    identifier    : /\w+/

    step          : or_clause

    or_clause     : < xor_clause /or/ >

    xor_clause    : < and_clause /xor/ >

    and_clause    : < clause /and/ >

    clause        : '{' { step } '}'
                    |
                    'url'  url
                    |
                    'url'  regexp
                    |
                    'die'  [ value ]
                    |
                    'warn' [ value ]
                    |
                    'wait' [ '~' ] /\d+/ ('sec' | 'secs' )
                    |
                    identifier '(' < param /,/ > ')'
                    |
                    identifier
                    |
                    'html' regexp
                    |
                    'text' regexp
                    |
                    '<' identifier '>' [ regexp ]  [ index ]
                    |
                    'fill' identifier value
                    |
                    'click' [ identifier ]

    index         : '[' integer ']'

    value         : string | variable

    variable      : /\$\w+/

    integer       : /\d+/

    param         : identifier '=>' value

    url           : /\w+:[^\s)]+/ # crude approximation

    string        :  '"'  /[^\"]*/ '"'

    string        :  /\'/ /[^\']*/ /\'/

    regexp        : 'm|' /[^\|]+/ '|' /[i]*/
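
To illustrate the last rule, the following sketch turns a WeeZL regexp literal into a compiled Perl pattern. The function parse_weezl_regexp is hypothetical and is not how the plugin parses scripts; it merely follows the rule as written above:

```perl
use strict;
use warnings;

# Accepts literals of the form m|...| with an optional trailing i and
# returns a compiled pattern, or undef if the text is not such a literal.
sub parse_weezl_regexp {
    my ($literal) = @_;
    my ($pattern, $flags) = $literal =~ /\A m\| ([^|]+) \| (i*) \z/x
        or return;
    return $flags ? qr/$pattern/i : qr/$pattern/;
}

my $re = parse_weezl_regexp('m|logged in|i');
print "matched\n" if 'You are now LOGGED IN' =~ $re;
```

Note that, by this rule, the pattern itself cannot contain a | character.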

INTERFACE

Constructor

The constructor accepts a hash and processes the following keys:

time_dither (percentage value, optional)

To control the randomized waiting, a percentage value of the form /\d+%/ can be provided. The default is 10%.

functions (hash reference)

If your script may invoke external functions, then you can provide them here. The keys are the names which can be used inside WeeZL, the values are subroutine references.

exception

If an exception is not handled internally, then it has to be escalated to the application. By providing a subroutine reference you define a handler which may record or otherwise process this event.

NOTE: A real Perl exception cannot be used here, because we do not want the POE process to actually die.
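
Put together, the constructor arguments might be assembled as in the sketch below. The three keys are the ones described above; the arguments passed to the exception handler are not specified here, so the sketch assumes that it receives the exception message. The variable names are hypothetical.

```perl
use strict;
use warnings;

my @failures;   # exceptions that reached the application

my %director_options = (
    time_dither => '20%',
    functions   => {
        # receives the HTML text of the current focus
        extract1 => sub { my ($html) = @_; print length($html), " bytes\n" },
    },
    # Assumption: the handler is passed the exception message.
    exception   => sub { push @failures, $_[0] },
);

# With the plugin installed, the hash would be passed on like this:
#   WWW::Agent::Plugins::Director->new (%director_options)

# Simulate an escalated exception to show what the handler records.
$director_options{exception}->('now this is really bad');
print "recorded: $_\n" for @failures;
```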

SEE ALSO

WWW::Agent

AUTHOR

Robert Barta, <rho@bigpond.net.au>

COPYRIGHT AND LICENSE

Copyright (C) 2005 by Robert Barta

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.