NAME
WWW::Agent::Plugins::Director - plugin for controlling an agent
SYNOPSIS
use WWW::Agent;
use WWW::Agent::Plugins::Director;
my $a = new WWW::Agent (plugins => [
...,
new WWW::Agent::Plugins::Director,
....
]);
# do it manually (consider using WWW::Agent::Zombie instead)
use POE;
POE::Kernel->post ( 'agent', 'director_execute', 'zombie', $weezl );
$a->run;
DESCRIPTION
This plugin for WWW::Agent lets you send the agent a script written in WeeZL. Such a script can direct the agent to visit particular pages, assert that the URL is what you expect, wait for some time, check for text in the page, fill out forms and automatically click on links.
The language also lets you define functional blocks which are executed whenever a specified URL is visited.
Requisites
If you use this plugin, you must make sure that WWW::Agent::Plugins::Focus is loaded before it.
Web Zombie Language (WeeZL)
The Web Zombie Language, pronounced weezle, specifies the behaviour of a virtual web user. It also lets you define assertions and conditions to be checked at certain times. The former can be used for testing web sites, the latter to trigger customized actions.
WeeZL is largely a procedural language: commands are executed sequentially, in the order given in the text.
Comments
WeeZL can contain comments. As in Perl, these start with a hash sign (#) and extend to the end of the line.
Actions
Actions are the primitives which can be executed. As such, they can fail; when one does, an internal exception is raised. This is not fatal to the process, because actions can be combined so that the failure of one action is compensated by another.
The agent also uses the concept of a focus: on any page, the browser can be asked to focus on a particular subelement (interpreting the HTML as a decent XML document). The focus can be narrowed down step by step. After every statement, though, the focus is reset to the whole page.
The language offers the following primitives:
- URL request
The command url URL makes the agent move to the given URL. If the URL cannot be fetched successfully, an internal exception is raised.
- URL assertion
The command url regexp tests whether the agent is currently at a URL which matches the given regular expression. If not, an internal exception is raised.
- forced exception
The command die message raises an internal exception. It always fails (and in doing so succeeds :-) The message is forwarded to the application unless the exception is handled internally.
- messages
The command warn message writes the message to STDERR. It never fails.
- waiting
The command wait nsecs makes the agent wait the given amount of time. The command never fails.
The variant wait ~ nsecs randomly dithers the time to wait. The dithering can be controlled with the time_dither parameter of the constructor.
- text testing
The command text regexp tests whether the current focus contains text which matches the given regular expression. For this test all HTML markup is removed first. If there is no match, the command fails with an exception.
- HTML testing
The command html regexp tests whether the current focus, including its HTML markup, matches the given regular expression. If not, the command fails with an exception.
- focussing
The command <html-element> changes the current focus by looking for this particular HTML element in the current focus (or in the whole page if no focus exists yet). If that subelement cannot be found, the command fails.
Optionally a regular expression can be added, so that the command only succeeds if the text inside the new focus matches the regular expression.
Optionally an index can be provided with [n] to select the nth occurrence of that element in the current focus. Counting starts with zero.
- Filling out FORMs
The command fill identifier value assumes that the current focus is on a FORM element; otherwise the command fails. The identified field of the FORM is filled with the given value.
NOTE: This is not yet functionally complete (popup menus, checkboxes, ...).
- Following Links
The command click assumes that the current focus is either on a FORM or on an anchor (<a>) element.
For a FORM, it submits the FORM with its current values to the target given in the ACTION attribute.
For an anchor, it makes the agent follow the link given in the HREF attribute.
Blocks
You can also define separate blocks which can be invoked much like subroutines or handlers. To define a block you can use either a label or a regular expression.
Blocks labelled with a simple name behave like subroutines, as the following example demonstrates. First we define a block which takes care of logging into a site:
login: {
url http://www.example.org/login.php
<form> and fill username 'jill'
and fill password 'jack'
text m|logged in|
}
Later on in our script we invoke that block:
url http://www.example.org/
login()
#....
You can also pass parameters into a block:
login: {
url http://www.example.org/login.php
<form> and fill username $uid
and fill password $pwd
text m|logged in|
}
url http://www.example.org/
login(uid => 'jill', pwd => 'jack')
#....
The parameters are then available inside the block as variables (prefixed with '$', of course).
You can also use regular expressions as block names. After each successful request these are checked against the current URL. If one of them matches, the block associated with that regular expression is executed automatically. If several match, the order of execution is not defined.
m|login.php|: {
<form> and fill username 'jack'
and fill password 'jill'
text m|logged in|
}
url http://www.example.org/
url http://www.example.org/login.php # here we trigger the block
#....
Application Hooks
In some cases you may want to invoke, from within a WeeZL script, functions which your application provides. This is useful when you have reached a certain page (or a part of it) and want to extract specific information from it.
For this purpose you have to list your functions in the constructor:
new WWW::Agent::Plugins::Director (...
functions => {
extract1 => sub {...},
extract2 => sub {...},
...
}
)
Inside a WeeZL script you simply name the function you want to invoke:
url http://www.example.org/interesting.html
<table> [1] and extract1
<table> [3] and extract2
extract3
After loading the named page, the agent will try to focus on the 2nd (index 1) table element and will invoke the function associated with extract1. The function receives one parameter, namely the HTMLified text of the current focus.
NOTE: THIS MAY CHANGE IN FUTURE VERSIONS.
The function is not supposed to return anything, but it may die.
NOTE: THIS IS NOT WELL SUPPORTED YET.
If that invocation was successful, the 4th table (index 3) in the current page is selected and extract2 is invoked. After that extract3 is called; since the focus has been reset by then, it receives the whole page.
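As an illustration of what such a hook might look like, here is a minimal Perl sketch. It only relies on what is stated above: the function receives the HTML text of the current focus as its single parameter and returns nothing. The cell-collecting logic, the @cells array and the sample markup are assumptions made for this example; a regular expression is also only a crude way to pick apart HTML.

```perl
use strict;
use warnings;

# Hypothetical hook: collects the plain text of every <td> cell
# found in the focus it is given (e.g. a <table> element).
my @cells;

sub extract1 {
    my ($html) = @_;                        # HTML text of the current focus
    while ($html =~ m{<td[^>]*>(.*?)</td>}gis) {
        my $cell = $1;
        $cell =~ s/<[^>]+>//g;              # drop nested markup
        $cell =~ s/^\s+|\s+$//g;            # trim surrounding whitespace
        push @cells, $cell;
    }
    return;                                 # nothing is expected back
}

# Stand-alone trial run with made-up markup instead of a real page:
extract1('<table><tr><td> a </td><td><b>b</b></td></tr></table>');
print join(',', @cells), "\n";
```

Such a function would be registered through the functions key of the constructor, for instance as extract1 => \&extract1.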
Conjunctions
Primitive actions can be combined with and. The action to the right of an and is executed only if all actions to its left have been executed successfully:
<form> and fill name 'James Bond'
Here filling out the form is only attempted after the form has been found, whereas in
<form>
fill name 'James Bond'
the form is found first, but then forgotten again because the focus is reset to the whole page after the first statement. Filling out will therefore fail.
Random Choice
Using the infix operator xor you can also make the agent choose arbitrarily between two or more alternatives:
url http://www.example1.org/ xor
url http://www.example2.org/ xor
url http://www.example3.org/
The agent will follow one of these choices.
Catching Exceptions
If an action fails, the exception can be caught internally by providing more actions connected with the infix operator or:
url http://www.example1.org/ or warn "that is not good, but we continue"
url http://www.example2.org/ or die "now this is really bad"
url http://www.example3.org/logged-in.php or
login (uid => 'jack', pwd => 'jill')
The whole command fails only if the last action in an or sequence fails.
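The interplay of and and or can be modelled in a few lines of Perl. This is only a sketch of the semantics described above, not the plugin's implementation: actions are represented as closures which die on failure, and the helper names all_of and first_of are invented for this illustration.

```perl
use strict;
use warnings;

# Model only: every action is a closure that dies on failure.

# "and": run the actions left to right; the first failure aborts the rest.
sub all_of {
    my @actions = @_;
    return sub { $_->() for @actions; return 1 };
}

# "or": try the actions left to right; the first success ends the sequence.
# The failure of the last action becomes the failure of the whole command.
sub first_of {
    my @actions = @_;
    return sub {
        my $error = "empty or sequence\n";
        for my $action (@actions) {
            return 1 if eval { $action->(); 1 };
            $error = $@;
        }
        die $error;
    };
}

my @log;
my $fetch_fails = sub { die "cannot fetch page\n" };
my $warn        = sub { push @log, 'warned'; return 1 };

# url ... or warn "..." : the failure is compensated, the script continues
first_of($fetch_fails, $warn)->();

# <form> and fill ...   : the fill is never attempted if the first action fails
my $filled = 0;
my $and_ok = eval { all_of($fetch_fails, sub { $filled = 1 })->(); 1 } ? 1 : 0;

# url ... or die "..."  : the last alternative fails, so the command fails
my $or_ok   = eval { first_of($fetch_fails, sub { die "now this is really bad\n" })->(); 1 };
my $message = $or_ok ? '' : $@;
print "log=@log filled=$filled message=$message";
```

In this model an exception only escapes when the last alternative of an or sequence has failed, which mirrors the rule that an unhandled failure is escalated to the application.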
Examples
@@@ TBW @@@
Grammar
As notation we use | for alternatives, [] to group optional sequences, and {} to group sequences which may occur any number of times. The notation
< something ',' >
is equivalent to, but more concise than
[ something { ',' something } ]
'xxx' is used for terminals, regular expressions are used to characterize other lexical constants, and all other identifiers are non-terminals:
plan : { subplan } { step }
subplan : indicator ':' '{' { step } '}'
indicator : regexp | identifier
identifier : /\w+/
step : or_clause
or_clause : < xor_clause /or/ >
xor_clause : < and_clause /xor/ >
and_clause : < clause /and/ >
clause : '{' { step } '}'
       | 'url' url
       | 'url' regexp
       | 'die' [ value ]
       | 'warn' [ value ]
       | 'wait' [ '~' ] /\d+/ ('sec' | 'secs' )
       | identifier '(' < param /,/ > ')'
       | identifier
       | 'html' regexp
       | 'text' regexp
       | '<' identifier '>' [ regexp ] [ index ]
       | 'fill' identifier value
       | 'click' [ identifier ]
index : '[' integer ']'
value : string | variable
variable : /\$\w+/
integer : /\d+/
param : identifier '=>' value
url : /\w+:[^\s)]+/ # crude approximation
string : '"' /[^\"]*/ '"'
string : /\'/ /[^\']*/ /\'/
regexp : 'm|' /[^\|]+/ '|' /[i]*/
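To get a feeling for the lexical constants, the patterns of the grammar can be tried out directly in Perl. The following sketch copies the regular expressions for url, variable, integer, string and regexp from the rules above and reports which of them match a given piece of text as a whole. The %token table and the token_kinds function are hypothetical helpers for this illustration, not part of the plugin.

```perl
use strict;
use warnings;

# Lexical patterns transcribed from the grammar above.
my %token = (
    url      => qr/\w+:[^\s)]+/,          # crude approximation, as noted
    variable => qr/\$\w+/,
    integer  => qr/\d+/,
    string   => qr/"[^"]*"|'[^']*'/,      # both string rules combined
    regexp   => qr/m\|[^|]+\|i*/,
);

# Returns the (sorted) names of all token kinds matching the whole text.
sub token_kinds {
    my ($text) = @_;
    my @kinds;
    for my $kind (sort keys %token) {
        my $re = $token{$kind};
        push @kinds, $kind if $text =~ /\A$re\z/;
    }
    return @kinds;
}

for my $sample ('http://www.example.org/login.php', '$uid',
                'm|logged in|', "'jill'", '42') {
    print "$sample => ", join(',', token_kinds($sample)), "\n";
}
```

Note that the url pattern only requires a scheme-like prefix followed by a colon, which is why the grammar itself calls it a crude approximation.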
INTERFACE
Constructor
The constructor accepts a hash and processes the following keys:
- time_dither (percentage value, optional)
To control the randomized waiting, a percentage value of the form /\d+%/ can be provided. The default is 10%.
- functions (hash reference)
If your script invokes external functions, you provide them here. The keys are the names which can be used inside WeeZL, the values are subroutine references.
- exception
If an exception is not handled internally, it has to be escalated to the application. By providing a subroutine reference you define a handler which may record or otherwise process this event.
NOTE: A real Perl exception cannot be used here, because we do not want the POE process to actually die.
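How a time_dither percentage could translate into an actual waiting time is sketched below in Perl. The documentation above only fixes the format of the value (/\d+%/) and its default (10%); that the dithering is uniform and symmetric around the nominal time is an assumption of this sketch, and dithered_wait is a hypothetical helper, not a function of the plugin.

```perl
use strict;
use warnings;

# Hypothetical helper: picks a waiting time within +/- the given
# percentage of the nominal time (assumed: uniform and symmetric).
sub dithered_wait {
    my ($secs, $dither) = @_;
    $dither = '10%' unless defined $dither;        # documented default
    my ($pct) = $dither =~ /\A(\d+)%\z/
        or die "time_dither must look like /\\d+%/, got '$dither'\n";
    my $spread = $secs * $pct / 100;
    return $secs if $spread == 0;                  # rand(0) would act like rand(1)
    return $secs - $spread + rand(2 * $spread);
}

# With 20% dithering a nominal wait of 10 seconds lands in [8, 12).
printf "waiting %.2f seconds\n", dithered_wait(10, '20%');
```

Under these assumptions, the default of 10% would turn a wait of 10 seconds into a value between 9 and 11 seconds.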
SEE ALSO
AUTHOR
Robert Barta, <rho@bigpond.net.au>
COPYRIGHT AND LICENSE
Copyright (C) 2005 by Robert Barta
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.