CONCEPTS
- a "listpage" is returned by the initial get_fill_submit and is parsed into:
{ items => \@items, pageno => $pageno, num_pages => $num_pages, nextlink => $nextlink, }
- an "item" is:
{ id => $id, url => $url, }
- the item url points to a "page" which is parsed into
ADDITIONAL METHODS
list_parse
($text, $pageurl, $listre)
one_parse
Function:
($text, $scrapespec, $scrapepostpro)
parse_fill_submit
Function:
($cjar, $html, $real_url, $vars, $varnamechange)
parse_refresh
Parses out redirects done with a Refresh header.
redirect_cookie_loop
Gets web content, iterating through redirects while capturing cookies.
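As an illustration of the kind of work parse_refresh does, here is a minimal sketch that splits a Refresh header value into its delay and target URL. The function name sketch_parse_refresh, its return shape, and the accepted syntax are assumptions made for this example; the module's own parse_refresh may take different arguments and behave differently.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's own parse_refresh.
# Accepts a Refresh header value such as "5; url=http://example.com/next"
# and returns (delay, url). The URL is undef when only a delay is given.
# Returns an empty list when the value does not look like a Refresh header.
sub sketch_parse_refresh {
    my ($value) = @_;
    return unless defined $value;
    my ($delay, $url) =
        $value =~ /^\s*(\d+)\s*(?:;\s*url\s*=\s*(\S+))?\s*$/i
        or return;
    return ($delay, $url);
}
```

A caller would typically follow the returned URL only when it is defined, and treat an empty return as "no redirect requested".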
NAME
Mail::POP3::Folder::webscrape - class that makes a website look like a POP3 mailbox
SYNOPSIS
use Mail::POP3;
my $m = Mail::POP3::Folder::webscrape->new(
    $user_name,
    $password,
    $starturl, # where the first form is found
    $userfieldnames, # listref, same order as values supplied in USER
    $otherfields, # hash fieldname => value
    $listre, # field => RE; fields: pageno, num_pages, nextlink, itemurls
    $itemre, # hash extractfield => RE to get it from "page"
    $itempostpro, # extractfield => sub returning one or more field/value pairs
    $itemurl2id, # sub taking URL, returns unique, persistent item ID
    $itemformat, # sub taking item hash, returns email message
    $messagesize,
);
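The three callback arguments in the constructor call above might be written along the following lines. This is only a sketch: the field names (title, body), the "id=" URL shape, and the webscrape.invalid domain are assumptions invented for the example, not anything the module prescribes.

```perl
use strict;
use warnings;

# $itempostpro: fieldname => sub called with (fieldname, value),
# returning one or more field/value pairs.
my $itempostpro = {
    title => sub {
        my ($field, $value) = @_;
        $value =~ s/<[^>]*>//g;    # crude removal of HTML tags
        return ($field => $value);
    },
};

# $itemurl2id: called with an item URL, returns a stable identifier.
# Assumes (hypothetically) that item URLs carry a numeric "id" parameter.
my $itemurl2id = sub {
    my ($url) = @_;
    my ($id) = $url =~ /[?&]id=(\d+)/;
    return $id;
};

# $itemformat: called with (hashref of extracted fields, message ID),
# returns the text of an email message.
my $itemformat = sub {
    my ($item, $id) = @_;
    return join "\r\n",
        "Message-ID: <$id\@webscrape.invalid>",
        "Subject: $item->{title}",
        '',
        $item->{body},
        '';
};
```

Keeping each callback small makes it easier to exercise them one at a time before wiring them into the constructor.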
DESCRIPTION
This class makes a website look like a POP3 mailbox in accordance with the requirements of a Mail::POP3 server. It is entirely API-compatible with Mail::POP3::Folder::mbox.
The virtual e-mails will all be at least $messagesize octets long (the amount specified in the last parameter to new; 2000 is recommended), being padded to this length. The class should also truncate messages that exceed this size, but it currently does not.
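The padding behaviour can be pictured with a short sketch. The helper name sketch_pad_message and the choice of trailing spaces as filler are assumptions for illustration; the class may pad differently. The sketch also assumes the message is a byte string, so that length counts octets.

```perl
use strict;
use warnings;

# Hypothetical helper: pad a formatted message up to $messagesize octets.
# Assumes $message is a byte string, so length() counts octets.
sub sketch_pad_message {
    my ($message, $messagesize) = @_;
    my $shortfall = $messagesize - length $message;
    # Longer messages pass through untouched, mirroring the documented
    # absence of truncation.
    return $message if $shortfall <= 0;
    return $message . (' ' x $shortfall);
}
```

With a fixed size per message, a STAT total can be estimated as the number of items multiplied by $messagesize without fetching any item page.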
PARAMETERS
$user_name
-
The username is interpreted as a ":"-separated string whose values are "URL-encoded", with spaces encoded as "+" characters. The values supplied are assigned, in order, to the variables named in the $userfieldnames parameter.
$password
-
The password is ignored.
$starturl
-
The webpage that contains the initial search form.
$userfieldnames
-
A reference to a list of the names of CGI variables whose values are supplied by the POP3 user in the username.
$otherfields
-
Reference to hash of CGI field mapped to value.
$listre
-
Reference to hash of fieldname mapped to regular expression for finding the relevant value on each search result page. The value is expected to be in $1. These fields must be defined: pageno, num_pages, nextlink, itemurls. The last may (obviously) match more than once.
$itemre
-
Reference to hash of fieldname mapped to regular expression for finding the relevant value on each item's page (as linked to by an itemurl found via the above parameter), similar to the above. Any number of fields may be sought, and a hash of the fieldname to the found value will be passed to the item-formatting function below.
$itempostpro
-
Reference to hash of fieldname mapped to reference to function that is called with the field name and value, and will return a list of one or more pairs of fieldname / value. Typical use might be to remove HTML from a result.
$itemurl2id
-
Reference to function that is called with each itemurl, and will return a unique, persistent identifier for that item, compatible with an RFC 1939 message ID.
$itemformat
-
Reference to function that is called for each item, taking two parameters: a reference to a hash of fieldname / value (as extracted by the "item RE" above), and the unique message-ID (as generated above); and will return the text of an email message describing that item.
$messagesize
-
The size of each message, in the style of Procrustes. This is so the class can return an accurate(ish) result for the POP3 command STAT knowing only the number of hits there have been, and not having downloaded and formatted every single item to see how large each one is - such an extra step would probably trigger timeouts.
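To make the $user_name encoding concrete, the following sketch decodes a username into a hash of CGI variables. The helper name sketch_decode_username and the example field names (q, location) are assumptions; the class's internal decoding may differ in detail, for instance in how it handles missing or surplus values.

```perl
use strict;
use warnings;

# Hypothetical helper: split the POP3 username on ":", undo the
# URL-encoding of each value, and pair the values with the names
# listed in $userfieldnames (same order).
sub sketch_decode_username {
    my ($user_name, $userfieldnames) = @_;
    my @values = map {
        my $v = $_;
        $v =~ tr/+/ /;                                # "+" means space
        $v =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;    # %XX escapes
        $v;
    } split /:/, $user_name, -1;
    my %vars;
    @vars{@$userfieldnames} = @values;
    return \%vars;
}
```

For example, a POP3 client logging in as "perl+developer:New%20York" with field names ['q', 'location'] would, under these assumptions, submit q as "perl developer" and location as "New York".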
A script, webscrape, is supplied in the scripts subdirectory of the distribution; it can be used to test and develop a working configuration for this class.
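When developing a configuration, it can help to try a candidate $listre against saved page text on its own. The sketch below does that for an imaginary results page. The regular expressions, the HTML they expect, and the helper name sketch_list_parse are all assumptions; the real list_parse also receives the page URL and fills in item ids, which this sketch omits.

```perl
use strict;
use warnings;

# Hypothetical $listre for an imaginary results page.
# Each expression leaves the wanted value in $1.
my $listre = {
    pageno    => qr/Page (\d+) of/,
    num_pages => qr/Page \d+ of (\d+)/,
    nextlink  => qr/<a href="([^"]+)">Next/,
    itemurls  => qr/<a class="item" href="([^"]+)"/,
};

# Hypothetical helper: apply a $listre to page text and build a
# structure resembling the "listpage" hash (items carry only a url here).
sub sketch_list_parse {
    my ($text, $re) = @_;
    my %page;
    for my $field (qw(pageno num_pages nextlink)) {
        ($page{$field}) = $text =~ $re->{$field};
    }
    # itemurls may match many times, so collect every $1.
    my @urls = $text =~ /$re->{itemurls}/g;
    $page{items} = [ map { +{ url => $_ } } @urls ];
    return \%page;
}
```

Fields whose expression does not match come back undefined, which is a quick way to spot a pattern that needs adjusting.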
METHODS
None extra are defined.
SEE ALSO
RFC 1939, Mail::POP3::Folder::mbox.