SYNOPSIS
use WebService::LOC::CongRec::Crawler;
use Log::Log4perl;
Log::Log4perl->init_once('log4perl.conf');
my $crawler = WebService::LOC::CongRec::Crawler->new();
$crawler->congress(107);
$crawler->oldest(1);
$crawler->goForth();
ATTRIBUTES
- congress
    The numbered congress to be fetched. If this is not given, the current congress is fetched.
- issuesRoot
    The root page for Daily Digest issues.
    Breadcrumb path: Library of Congress > THOMAS Home > Congressional Record > Browse Daily Issues
- issues
    A hash of issues, keyed as %issues{year}{month}{day}{section}; see the sketch after this list.
- mech
    A WWW::Mechanize object with state that we can use to grab pages from THOMAS.
- oldest
    Boolean attribute specifying that pages are visited from earliest to most recent.
    The default is 0, i.e. visit the most recent first.
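As a sketch of how these attributes fit together: the example below sets attributes at construction time and then walks the issues hash. Both the constructor-argument form (assuming the usual Moose-style new()) and the assumption that issues() returns a hash reference go beyond what this documentation guarantees, so verify against the source.

    use strict;
    use warnings;
    use WebService::LOC::CongRec::Crawler;
    use Log::Log4perl;

    Log::Log4perl->init_once('log4perl.conf');

    # Assumes a Moose-style constructor, so attributes can be set here
    # as well as via the accessors described above.
    my $crawler = WebService::LOC::CongRec::Crawler->new(
        congress => 110,    # the 110th Congress; omit for the current one
        oldest   => 1,      # visit earliest issues first
    );
    $crawler->goForth();

    # Walk the %issues{year}{month}{day}{section} structure; this assumes
    # the issues accessor returns a hash reference.
    my $issues = $crawler->issues;
    for my $year (sort keys %$issues) {
        for my $month (sort { $a <=> $b } keys %{ $issues->{$year} }) {
            for my $day (sort { $a <=> $b } keys %{ $issues->{$year}{$month} }) {
                my @sections = sort keys %{ $issues->{$year}{$month}{$day} };
                print "$year-$month-$day: @sections\n";
            }
        }
    }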
METHODS
goForth()
$crawler->goForth();
$crawler->goForth(process => \&process_page);
$crawler->goForth(start => $x);
$crawler->goForth(end => $y);
Start crawling from the Daily Digest issues page, i.e. http://thomas.loc.gov/home/Browse.php?&n=Issues
Also, for a specific congress, where NUM is the congress number: http://thomas.loc.gov/home/Browse.php?&n=Issues&c=NUM
Returns the total number of pages grabbed.
Accepts an optional processing function to run on each page.
Accepts optional page counter start and end values. If these are not given, or are given as zero, crawling starts from the beginning and continues until all pages have been visited.
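A sketch of a bounded crawl with a per-page callback. The arguments that goForth() passes to the process function are an assumption here, so check the source for the exact signature:

    # Hypothetical callback: what goForth() actually passes to the
    # process function is an assumption - verify against the source.
    sub process_page {
        my ($page) = @_;
        # e.g. extract and store the page text here
    }

    my $total = $crawler->goForth(
        process => \&process_page,
        start   => 10,    # skip pages before the 10th
        end     => 20,    # stop after the 20th page
    );
    print "Grabbed $total pages\n";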
parseRoot(Str $content)
Parse the root issues page and fill our hash of available issues.
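For example, parseRoot() could be driven by hand with the crawler's own mech object (a sketch; goForth() normally does this internally):

    my $crawler = WebService::LOC::CongRec::Crawler->new();
    $crawler->mech->get($crawler->issuesRoot);
    $crawler->parseRoot($crawler->mech->content);
    # $crawler->issues should now describe the available issues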