CONCEPTS

a "listpage" is returned by the initial get_fill_submit and is parsed into:
{ items => \@items, pageno => $pageno, num_pages => $num_pages,
  nextlink => $nextlink, }
an "item" is
+{ id => $id, url => $url, }
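
As an illustration only, a parsed listpage holding two items might look like the sketch below. Every value is invented; only the key names come from the structures described here, and it is assumed that items holds item hashes of the form just shown.

```perl
use strict;
use warnings;

# Hypothetical values throughout; only the key names come from the text above.
my @items = (
  +{ id => 'item-101', url => 'http://example.com/item?id=101' },
  +{ id => 'item-102', url => 'http://example.com/item?id=102' },
);
my $listpage = {
  items     => \@items,
  pageno    => 1,
  num_pages => 3,
  nextlink  => 'http://example.com/search?page=2',
};

# Walk the items found on this page of results.
for my $item (@{ $listpage->{items} }) {
  print "$item->{id} => $item->{url}\n";
}
```
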
the item url points to a "page" which is parsed into a hash of extractfield => value, as found by $itemre

ADDITIONAL METHODS

list_parse

($text, $pageurl, $listre)

one_parse

Function:

($text, $scrapespec, $scrapepostpro)

parse_fill_submit

Function:

($cjar, $html, $real_url, $vars, $varnamechange)

parse_refresh

Parses out redirects done with Refresh header.

Gets web content, iterating through redirects while capturing cookies.

NAME

Mail::POP3::Folder::webscrape - class that makes a website look like a POP3 mailbox

SYNOPSIS

use Mail::POP3;
my $m = Mail::POP3::Folder::webscrape->new(
  $user_name,
  $password,
  $starturl, # where the first form is found
  $userfieldnames, # listref same order as values supplied in USER
  $otherfields, # hash fieldname => value
  $listre, # field => RE; fields: pageno, num_pages, nextlink, itemurls
  $itemre, # hash extractfield => RE to get it from "page"
  $itempostpro, # extractfield => sub returns pairs of field/value
  $itemurl2id, # sub taking URL, returns unique, persistent item ID
  $itemformat, # takes item hash, returns email message
  $messagesize,
);

DESCRIPTION

This class makes a website look like a POP3 mailbox in accordance with the requirements of a Mail::POP3 server. It is entirely API-compatible with Mail::POP3::Folder::mbox.

Every virtual e-mail is padded to at least the number of octets given in the last parameter to new ($messagesize; 2000 is recommended). Messages longer than this should be truncated, but the class does not currently do so.

PARAMETERS

$user_name

The username is interpreted as a ":"-separated list of values, "URL-encoded" such that spaces are written as "+" characters. The values are assigned, in order, to the variables named in the $userfieldnames parameter.
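
The decoding described above could be sketched roughly as follows. The helper name decode_username and the field names are hypothetical, and undoing %XX escapes is an assumption; the text above only states that spaces arrive as "+".

```perl
use strict;
use warnings;

# Hypothetical helper: turn a POP3 username into CGI variable values.
sub decode_username {
  my ($user_name, $userfieldnames) = @_;
  my @values = split /:/, $user_name, -1; # -1 keeps trailing empty values
  for (@values) {
    tr/+/ /;                              # "+" stands for a space
    s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;  # assumed: undo %XX escapes too
  }
  my %vars;
  @vars{@$userfieldnames} = @values;      # pair names with values, in order
  return \%vars;
}

my $vars = decode_username('perl+developer:London', [ qw(keywords location) ]);
print "$_=$vars->{$_}\n" for sort keys %$vars;
```

Splitting before decoding means a literal ":" inside a value can be carried as %3A without being mistaken for a separator.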

$password

The password is ignored.

$starturl

The webpage that contains the initial search form.

$userfieldnames

A reference to a list of the names of CGI variables whose values are supplied by the POP3 user in the username.

$otherfields

Reference to a hash mapping CGI field names to values.

$listre

Reference to hash of fieldname mapped to regular expression for finding the relevant value on each search result page. The value is expected to be in $1. These fields must be defined: pageno, num_pages, nextlink, itemurls. The last may (obviously) match more than once.
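
A minimal sketch of such a hash, applied to an imaginary results page, is given below. The patterns and the HTML are invented for illustration, and the extraction loop is only one plausible way to use them, not necessarily how the class does it.

```perl
use strict;
use warnings;

# Hypothetical patterns for an imaginary results page; each captures into $1.
my $listre = {
  pageno    => qr/Page\s+(\d+)\s+of/,
  num_pages => qr/Page\s+\d+\s+of\s+(\d+)/,
  nextlink  => qr/<a href="([^"]+)">Next<\/a>/,
  itemurls  => qr/<a class="result" href="([^"]+)"/,
};

my $html = <<'EOF';
<p>Page 2 of 5</p>
<a class="result" href="/item/17">First</a>
<a class="result" href="/item/42">Second</a>
<a href="/search?page=3">Next</a>
EOF

# Single-valued fields: take $1 from the first match.
my %found;
for my $field (qw(pageno num_pages nextlink)) {
  ($found{$field}) = $html =~ $listre->{$field};
}
# itemurls may match many times, so collect every $1.
my $itemurls_re = $listre->{itemurls};
my @itemurls = $html =~ /$itemurls_re/g;

print "page $found{pageno} of $found{num_pages}, next $found{nextlink}\n";
print "item: $_\n" for @itemurls;
```
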

$itemre

Reference to a hash mapping each fieldname to a regular expression for finding the relevant value on each item's page (the page linked to by an itemurl found via $listre). As with $listre, the value is expected to be in $1. Any number of fields may be sought; a hash of fieldname to found value is passed to the item-formatting function ($itemformat).

$itempostpro

Reference to hash of fieldname mapped to reference to function that is called with the field name and value, and will return a list of one or more pairs of fieldname / value. Typical use might be to remove HTML from a result.
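
As a rough sketch, two post-processors might look like this: one strips HTML and returns a single pair, the other splits a value and returns two pairs. The field names (description, salary) and the sample values are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical post-processors, keyed by the field they apply to.
my $itempostpro = {
  # Strip tags and tidy whitespace; returns one name/value pair.
  description => sub {
    my ($name, $value) = @_;
    $value =~ s/<[^>]*>//g;
    $value =~ s/\s+/ /g;
    $value =~ s/^ | $//g;
    return ($name => $value);
  },
  # Split "low-high" into two fields; returns two pairs.
  salary => sub {
    my ($name, $value) = @_;
    my ($low, $high) = split /-/, $value, 2;
    return ("${name}_low" => $low, "${name}_high" => $high);
  },
};

my %item = (
  $itempostpro->{description}->(description => "<p>Senior <b>Perl</b>\n role</p>"),
  $itempostpro->{salary}->(salary => '30000-40000'),
);
print "$_: $item{$_}\n" for sort keys %item;
```
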

$itemurl2id

Reference to function that is called with each itemurl, and will return a unique, persistent identifier for that item, compatible with an RFC 1939 message ID.
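
One possible approach, shown here as a sketch rather than as what the class requires, is to hash the URL. RFC 1939 limits a unique-id to between 1 and 70 printable ASCII characters, and an MD5 hex digest is 32 characters of 0-9 and a-f. The URL is invented.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical $itemurl2id: hash the URL into 32 hex characters.
my $itemurl2id = sub {
  my ($url) = @_;
  return md5_hex($url);
};

my $id = $itemurl2id->('http://example.com/item?id=101');
print "$id\n";
```

This is only persistent if the item's URL is itself stable. Where URLs carry session tokens or similar, extracting the site's own item number from the URL is likely a better choice.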

$itemformat

Reference to function that is called for each item, taking two parameters: a reference to a hash of fieldname / value (as extracted by $itemre above), and the unique message-ID (as generated by $itemurl2id above); and will return the text of an email message describing that item.
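
A minimal formatting function might be sketched as follows. The field names (title, description), the addresses and the header set are all invented, and the use of "\n" line endings is an assumption, since the expected line-ending convention is not stated here.

```perl
use strict;
use warnings;

# Hypothetical $itemformat: build a minimal message from the extracted fields.
my $itemformat = sub {
  my ($fields, $id) = @_;
  my @headers = (
    "From: webscrape\@example.com",
    "To: user\@example.com",
    "Subject: $fields->{title}",
    "Message-ID: <$id\@example.com>",
  );
  # Headers, a blank line, then the body.
  return join "\n", @headers, '', $fields->{description}, '';
};

print $itemformat->({ title => 'Perl role', description => 'Details here.' }, 'abc123');
```
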

$messagesize

The size of each message, in the style of Procrustes. This is so the class can return an accurate(ish) result for the POP3 command STAT knowing only the number of hits there have been, and not having downloaded and formatted every single item to see how large each one is - such an extra step would probably trigger timeouts.
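
The arithmetic behind this can be sketched as below. The helper pad_message, the choice of spaces as padding and the hit count are assumptions for illustration; the sketch also counts characters, which equals octets only for single-byte text.

```perl
use strict;
use warnings;

my $messagesize = 2000; # the value recommended above

# Hypothetical padding helper: extend with spaces and a final newline, never truncate.
sub pad_message {
  my ($text, $size) = @_;
  my $short = $size - length $text;
  return $text if $short <= 0;  # already long enough; left as-is
  return $text . (' ' x ($short - 1)) . "\n";
}

my $padded = pad_message("Subject: test\n\nbody\n", $messagesize);
my $hits = 37;                  # invented number of search results
printf "STAT would report %d messages, %d octets\n", $hits, $hits * $messagesize;
```

With every message the same size, the STAT total is just the hit count multiplied by $messagesize, so no item page has to be fetched to answer it.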

A script webscrape is supplied in the scripts subdirectory of the distribution that can be used to test and develop a working configuration for this class.

METHODS

None extra are defined.

SEE ALSO

RFC 1939, Mail::POP3::Folder::mbox.