NAME
HTML::TreeBuilderX::ASP_NET - Scrape ASP.NET/VB.NET sites which utilize JavaScript POST-backs.
SYNOPSIS
use LWP::UserAgent;
use HTML::TreeBuilder;
use HTML::TreeBuilderX::ASP_NET;

my $ua = LWP::UserAgent->new;
my $resp = $ua->get('http://uniqueUrl.com/Server.aspx');
my $root = HTML::TreeBuilder->new_from_content( $resp->content );
## $link rather than $a, which is reserved for sort()
my $link = $root->look_down( _tag => 'a', id => 'nextPage' );
my $aspnet = HTML::TreeBuilderX::ASP_NET->new({
    element   => $link
    , baseURL => $resp->request->uri ## takes into account posting redirects
});
## ->httpRequest builds the HTTP::Request that $ua->request expects
$resp = $ua->request( $aspnet->httpRequest );

## or the easy cheating way; see the SEE ALSO section for links
my $aspnet = HTML::TreeBuilderX::ASP_NET->new_with_traits( traits => ['htmlElement'] );
$form->look_down(_tag=> 'a')->httpRequest;
DESCRIPTION
Scrape ASP.NET sites which utilize the framework's __VIEWSTATE, __EVENTTARGET, __EVENTARGUMENT, __LASTFOCUS, et al. This module returns an HTTP::Request built from the form with the use of the method ->httpRequest.
In this scheme, many of the links on a web page appear to be JavaScript functions. The default JavaScript function is __doPostBack(eventTarget, eventArgument). ASP.NET has two hidden fields which record state: __VIEWSTATE and __LASTFOCUS. It abstracts each link with a method that utilizes an HTTP post-back to the server. The JavaScript behind __doPostBack simply appends __EVENTTARGET=$eventTarget&__EVENTARGUMENT=$eventArgument onto the POST request from the parent form and submits it. When the server receives this request, it decodes and decompresses the __VIEWSTATE and uses it, along with the new __EVENTTARGET and __EVENTARGUMENT, to perform the action, which is often no more than serializing the data back into the __VIEWSTATE.
Sometimes developers cloak the __doPostBack(target,arg) with names akin to changepage(arg), which simply call __doPostBack("target", arg). This module handles this use case as well, using an explicit eventTriggerArgument in the constructor.
This flow is a bane on RESTful HTTP and makes no sense whatsoever. Thanks, Microsoft.
.-------------------------------------------------------------------.
|                            HTML FORM 1                            |
| <form action="Server.aspx" method="post">                         |
| <input type="hidden" name="__VIEWSTATE" value="encryptedXML-FOO"> |
| <a>1</a> |                                                        |
| <a href="javascript:__doPostBack('gotopage','2')">2</a>           |
| ...                                                               |
'-------------------------------------------------------------------'
                                  |
                                  v
                 _________________________________
                 \                                \
                  ) User clicks the link named "2" )
                 /________________________________/
                                  |
                                  v
.------------------------------------------------------------------------.
| POST http://aspxnonsensery/Server.aspx                                 |
| Content-Length: 2659                                                   |
| Content-Type: application/x-www-form-urlencoded                        |
|                                                                        |
| __VIEWSTATE=encryptedXML-FOO&__EVENTTARGET=gotopage&__EVENTARGUMENT=2  |
'------------------------------------------------------------------------'
                                  |
                                  v
.-------------------------------------------------------------------.
|                            HTML FORM 2                            |
|                      (different __VIEWSTATE)                      |
| <form action="Server.aspx" method="post">                         |
| <input type="hidden" name="__VIEWSTATE" value="encryptedXML-BAR"> |
| <a href="javascript:__doPostBack('gotopage','1')">1</a> |         |
| <a>2</a>                                                          |
| ...                                                               |
'-------------------------------------------------------------------'
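The post-back in the middle of the diagram can be approximated in a few lines of plain Perl. This is a minimal sketch and not the module's implementation: form_escape and build_postback_body are hypothetical helpers, the field values are the placeholders from the diagram, and the fields are sorted only so that the output is deterministic.

```perl
use strict;
use warnings;

# Percent-encode one value for application/x-www-form-urlencoded.
# Hand-rolled so the sketch needs no CPAN modules; assumes byte strings.
sub form_escape {
    my ($value) = @_;
    $value =~ s/([^A-Za-z0-9_.~-])/sprintf('%%%02X', ord $1)/ge;
    return $value;
}

# Hypothetical helper: merge the form's hidden fields with the two
# post-back fields and serialize them as a POST body.
sub build_postback_body {
    my ($hidden, $target, $argument) = @_;
    my %fields = (
        %$hidden,
        __EVENTTARGET   => $target,
        __EVENTARGUMENT => $argument,
    );
    return join '&',
        map { form_escape($_) . '=' . form_escape( $fields{$_} ) }
        sort keys %fields;
}

# Values taken from the link __doPostBack('gotopage','2') in HTML FORM 1.
my $body = build_postback_body(
    { __VIEWSTATE => 'encryptedXML-FOO' },
    'gotopage', '2',
);
print "$body\n";
```

A real __VIEWSTATE is much longer and must be sent back byte for byte, which is why the sketch escapes every value instead of assuming it is URL-safe.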
METHODS
IN ADDITION TO ALL OF THE METHODS FROM HTTP::Request::Form
- ->new({ hashref })
-
Takes a HashRef and returns a new instance. Some of the possible key/value pairs are:
- form => $htmlElement
-
optional: Explicitly supplies the HTML::Element representing the form. If you do not supply one, it will be implicitly deduced from $self->element, which makes element => $htmlElement a requirement.
- eventTriggerArgument => $hashRef
-
Not needed if you supply an element. This takes a HashRef and creates HTML::Elements that mimic hidden input fields, which are then tacked onto $self->form.
- element => $htmlElement
-
Not needed if you send an eventTriggerArgument. Attempts to deduce the __EVENTARGUMENT and __EVENTTARGET from the 'href' attribute of the element just as if the two were supplied explicitly. It will also be used to deduce a form by looking up in the HTML tree if one is not supplied.
- debug => *0|1
-
optional: Passes the debug flag on to HTTP::Request::Form (H:R:F); the default is off.
- baseURL => $uri
-
optional: Sets the base of the URL for the post action
- ->httpRequest
-
Returns an HTTP::Request object for the HTTP POST
- ->hrf
-
Explicitly returns the underlying HTTP::Request::Form object. All methods fall back here anyway, but this returns that object directly.
FUNCTIONS
None of these are exported...
- createInputElements( {eventTarget => eventArgument} )
-
Helper function that takes two values in a HashRef. It assumes the key is the __EVENTTARGET and the value is the __EVENTARGUMENT, and returns two HTML::Element pseudo-input fields carrying that information.
- parseDoPostBack( $str )
-
Accepts a string that is often the "href" attribute of an HTML::Element. It simply parses out the call to JavaScript, using regexes, and makes the two arguments usable from Perl in the form of a HashRef.
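The kind of parsing described for parseDoPostBack might look roughly like the following. This is a sketch under stated assumptions, not the module's actual regex: parse_do_post_back_sketch is a hypothetical name, and the returned shape ({ target => argument }) is assumed to match what createInputElements is documented to accept.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the parsing step: pull the two quoted
# arguments out of a javascript:__doPostBack(...) href.
sub parse_do_post_back_sketch {
    my ($href) = @_;
    return unless defined $href;

    # Accept single or double quotes; \1 and \3 require the closing
    # quote to match the opening one.
    if ( $href =~ m{
            __doPostBack \s* \( \s*
            (['"]) (.*?) \1 \s* , \s*
            (['"]) (.*?) \3 \s* \)
        }x )
    {
        # Key is the __EVENTTARGET, value is the __EVENTARGUMENT.
        return { $2 => $4 };
    }
    return;    # not a post-back link
}

my $parsed = parse_do_post_back_sketch(
    q{javascript:__doPostBack('gotopage','2')}
);
print join( ' => ', %$parsed ), "\n" if $parsed;
```

Cloaked wrappers such as changepage(arg) will not match this pattern, which is the case the eventTriggerArgument constructor key is documented to cover.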
SEE ALSO
- HTML::TreeBuilderX::ASP_NET::Roles::htmlElement
-
For an easy way to glue the two together
- HTTP::Request
-
For the object the method httpRequest returns
- HTTP::Request::Form
-
For the underlying class, all of whose methods are available on this object
- HTML::Element
-
For the base class of all HTML tokens
AUTHOR
Evan Carroll, <me at evancarroll.com>
BUGS
None, though *much* more support should be added to ->element. Not everything is a simple anchor tag.
SUPPORT
You can find documentation for this module with the perldoc command.
perldoc HTML::TreeBuilderX::ASP_NET
You can also look for information at:
RT: CPAN's request tracker
http://rt.cpan.org/NoAuth/Bugs.html?Dist=HTML-TreeBuilderX-ASP_NET
AnnoCPAN: Annotated CPAN documentation
CPAN Ratings
Search CPAN
COPYRIGHT & LICENSE
Copyright 2008 Evan Carroll, all rights reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.