NAME
FEAR::API - Web Scraping Zen
SYNOPSIS
FEAR
= ∑( WWW Crawler, Data Extractor, Data Munger, (X|HT)ML Parser, ...... , Yucky Overloading )
= ∞
= ☯
= 禪
DESCRIPTION
FEAR::API is a tool that helps reduce the time you spend creating site-scraping scripts and helps you do it in a much more elegant way. FEAR::API combines many strong and powerful features from various CPAN modules, such as LWP::UserAgent, WWW::Mechanize, Template::Extract, Encode, HTML::Parser, etc., and digests them into a deeper Zen.
However, this module probably violates every single rule of any Perl coding standard. Please stop here if you don't want to see the yucky code.
This module originated from a short-term project in which I was asked to extract data from several commercial websites. During development, I found a lot of redundant code, and I tried to reduce the code size and create something specialized for this job: site scraping (or web scraping, or screen scraping). Before creating this module, I surveyed some site scrapers and information extraction tools, and none of them could really satisfy my needs. I meditated on what my ideal tool should look like, and the ideas gradually solidified in my mind.
Then I created FEAR::API.
It is a highly specialized module with a domain-specific syntax. Maybe you are used to creating browser emulators with WWW::Mechanize, but you still need to write extra code to parse the content. Sometimes, after you have extracted data from documents, you also need to write extra code to store them in databases or plain text files. That may be easy for you, but it is not always done quickly. That's why FEAR::API is here. FEAR::API encapsulates the components necessary in any site-scraping flow, trying to help you speed up the whole process.
THE FIVE ELEMENTS
There are five essential elements in this module.
FEAR::API::Agent
FEAR::API::Document
FEAR::API::Extract
FEAR::API::Filter
FEAR::API
FEAR::API::Agent is the crawler component. It fetches web pages and passes their contents to FEAR::API::Document.
FEAR::API::Document stores fetched documents.
FEAR::API::Extract performs data extraction on documents.
FEAR::API::Filter does pre-processing on documents and post-processing on extracted results. This component lets you clean up fetched pages and refine extracted results.
FEAR::API is the public interface, and everything is handled and coordinated internally by it. Generally, you interact only with this package, and it is supposed to solve most of your problems.
The architecture is not complicated. I guess the most bewildering thing may be the over-simplified syntax. Some users who have tried the example code report that they still have no idea what is really going on in this module.
After implementing parallel prefetching based on Larbin, I decided to start writing the documentation. (And I started to regret, a little bit, that I had created this module.)
USAGE
The first line
use FEAR::API -base;
To -base, or not to -base. That is no question.
Using FEAR::API with -base makes your current package a subclass of FEAR::API, and $_ is automatically initialized as a FEAR::API object.
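For example, a minimal -base sketch, using only calls shown elsewhere in this document, could look like this:
use FEAR::API -base;
url("google.com");   # push a URL onto the queue
fetch();             # $_ is the implicit FEAR::API object doing the work
print $$_;           # print the fetched document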
Using it without -base is like using any other OO Perl module. You need to instantiate the object yourself and specify it with each method call.
use strict;
use FEAR::API;
my $f = fear();
$f->url("blah");
# blah, blah, blah.....
Fetch a page
url("google.com");
fetch();
FEAR::API maintains a URL queue internally. Every time you call url(), it pushes your arguments onto the queue, and when you call fetch(), the URL at the front is popped and requested. If the request is successful, the fetched document is stored in FEAR::API::Document.
fetch() not only pops the front element of the queue; it also takes arguments. If you pass a URL to fetch(), FEAR::API fetches the one you specify and temporarily ignores the URL queue.
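As a sketch of the queueing behaviour described above (the URLs are placeholders):
url("google.com");        # queue: google.com
url("search.cpan.org");   # queue: google.com, search.cpan.org
fetch();                  # pops and fetches google.com
fetch("perlmonks.org");   # fetches this URL directly; the queue is untouched
fetch();                  # pops and fetches search.cpan.org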
Fetch a page and store it in a scalar
fetch("google.com") > my $content;
my $content = fetch("google.com")->document->as_string;
Fetch a page and print to STDOUT
getprint("google.com");
print fetch("google.com")->document->as_string;
fetch("google.com");
print $$_;
fetch("google.com") | _print;
Fetch a page and save it to a file
getstore("google.com", 'google.html');
url("google.com")->() | _save_as("google.html");
fetch("google.com") | io('google.html');
Dispatch Links
Deal with links in a web page (I)
Once you have fetched a page, you will probably need to process the links in it. FEAR::API provides the method dispatch_links() (or report_links()) to do this job.
dispatch_links() takes a list of (regular expression => action) pairs. For each link in the page, if it matches a given regular expression (or rule), the corresponding action is taken.
You can also set fallthrough_report(1) to test all of the rules against each link instead of stopping at the first matching rule.
>> is overloaded and is equivalent to the method dispatch_links() or report_links(). fallthrough_report() is automatically set to 1 if >> is followed by an array ref [], and to 0 if it is followed by a hash ref {}.
In the following code examples, the constant _self is used as an action, which means that all links matching the rule will be pushed back onto the URL queue.
Verbose
fetch("http://google.com")
->report_links(
qr(^http:) => _self,
qr(google) => \my @l,
qr(google) => sub { print ">>>".$_[0]->[0],$/ }
);
fetch while has_more_urls;
print Dumper \@l;
Minimal
url("google.com")->()
>> [
qr(^http:) => _self,
qr(google) => \my @l,
qr(google) => sub { print ">>>".$_[0]->[0],$/ }
];
$_->() while $_;
print Dumper \@l;
Equivalent Code
url("tw.yahoo.com")->();
my @l;
foreach my $link (links){
$link->[0] =~ /^http:/ and url($link) and next;
$link->[0] =~ /tw.yahoo/ and push @l, $link and next;
$link->[0] =~ /tw.yahoo/ and print ">>>".$link->[0],$/ and next;
}
fetch while has_more_links;
print Dumper \@l;
Deal with links in a web page (II)
Verbose
fetch("http://google.com")
->fallthrough_report(1)
->report_links(
qr(^http:) => _self,
qr(google) => \my @l,
qr(google) => sub { print ">>>".$_[0]->[0],$/ }
);
fetch while has_more_urls;
print Dumper \@l;
Minimal
url("google.com")->()
>> {
qr(^http:) => _self,
qr(google) => \my @l,
qr(google) => sub { print ">>>".$_[0]->[0],$/ }
};
$_->() while $_;
print Dumper \@l;
Equivalent Code
url("tw.yahoo.com")->();
my @l;
foreach my $link (links){
$link->[0] =~ /^http:/ and url($link);
$link->[0] =~ /tw.yahoo/ and push @l, $link;
$link->[0] =~ /tw.yahoo/ and print ">>>".$link->[0],$/;
}
fetch while has_more_links;
print Dumper \@l;
Follow links in Google's homepage
url("google.com")->() >> _self;
&$_ while $_;
Save links in Google's homepage
url("google.com")->() >> _self | _save_as_tree("./root");
$_->() | _save_as_tree("./root") while $_;
Recursively get web pages from Google
url("google.com");
&$_ >> _self while $_;
In English, line 1 sets the initial URL. Line 2 says that while there are more URLs in the queue, FEAR::API will keep fetching and feeding the extracted links back to itself.
Recursively get web pages from Google and save them
url("google.com");
&$_ >> _self | _save_as_tree("./root") while $_;
In English, line 1 sets the initial URL. Line 2 says that while there are more URLs in the queue, FEAR::API will keep fetching, feeding the extracted links back to itself, and saving each fetched document in a tree structure rooted at "root" on the file system. And guess what? It is a minimal web spider written in Perl. (Well, at least, I am not aware of a smaller pure-Perl one.)
Mechanize fans?
FEAR::API borrows (or steals) some useful methods from WWW::Mechanize.
Follow the second link of Google
url("google.com")->()->follow_link(n => 2);
Return links from Google's homepage
print Dumper fetch("google.com")->links;
Submit a query to Google
url("google.com")->();
submit_form(
form_number => 1,
fields => { q => "Kill Bush" }
);
Get links of some pattern
If you have used curl before, you may have embedded multiple URLs in one command line. FEAR::API offers similar functionality based on Template Toolkit. In the following code, the initial URLs are http://some.site/a, http://some.site/b, ..., http://some.site/z.
url("[% FOREACH i = ['a'..'z'] %]
http://some.site/[% i %]
[% END %]");
&$_ while $_;
Extraction
Use template() to set up the template for extraction. Note that FEAR::API will wrap your template in [% FOREACH rec %] and [% END %] if your extraction method is set to Template::Extract.
preproc() (or doc_filter()) helps you clean up the document before your template is applied. postproc() (or result_filter()) is called after extraction is performed. The argument can be of two types: a string of Perl code to be evaluated, or a named filter. Named filters are documented in FEAR::API::Filters.
Extract data from CPAN
url("http://search.cpan.org/recent")->();
submit_form(
form_name => "f",
fields => {
query => "perl"
});
template("<!--item-->[% p %]<!--end item-->"); # [% FOREACH rec %]<!--item-->[% p %]<!--end item-->[% END %], actually.
extract;
print Dumper extresult;
Extract data from CPAN after some HTML cleanup
url("http://search.cpan.org/recent")->();
submit_form(
form_name => "f",
fields => {
query => "perl"
});
# Only the section between <!--results--> and <!--end results--> is wanted.
preproc(q(s/\A.+<!--results-->(.+)<!--end results-->.+\Z/$1/s));
print document->as_string; # print content to STDOUT
template("<!--item-->[% p %]<!--end item-->");
extract;
print Dumper extresult;
HTML cleanup, extract data, and refine results
url("http://search.cpan.org/recent")->();
submit_form(
form_name => "f",
fields => {
query => "perl"
});
preproc(q(s/\A.+<!--results-->(.+)<!--end results-->.+\Z/$1/s));
template("<!--item-->[% rec %]<!--end item-->");
extract;
postproc(q($_->{rec} =~ s/<.+?>//g)); # Strip HTML tags brutally
print Dumper extresult;
Use filtering syntax
fetch("http://search.cpan.org/recent");
submit_form(
form_name => "f",
fields => {
query => "perl"
})
| _doc_filter(q(s/\A.+<!--results-->(.+)<!--end results-->.+\Z/$1/s))
| _template("<!--item-->[% rec %]<!--end item-->")
| _result_filter(q($_->{rec} =~ s/<.+?>//g));
print Dumper \@$_;
This is like piping in a shell. Site scraping is really just a flow of data; it is a process that turns data into information. People usually pipe sort, wc, uniq, head, etc. in a shell to extract what they need. In FEAR::API, site scraping is equivalent to data munging: every piece of data goes through multiple filters before the wanted information comes out.
Invoke handler for extracted results
When you have extracted results, you can write handlers to process the data. invoke_handler() can take arguments like "Data::Dumper", "YAML", a subroutine reference, an object-relational mapper, etc., and the set of accepted argument types is expected to grow.
fetch("http://search.cpan.org/recent");
submit_form(
form_name => "f",
fields => {
query => "perl"
})
| _doc_filter(q(s/\A.+<!--results-->(.+)<!--end results-->.+\Z/$1/s))
| "<!--item-->[% rec %]<!--end item-->"
| _result_filter(q($_->{rec} =~ s/<.+?>//g));
invoke_handler('Data::Dumper');
Named Filters
Here are examples of using named filters provided by FEAR::API itself.
Preprocess document
url("google.com")->()
| _preproc(use => "html_to_null")
| _preproc(use => "decode_entities")
| _print;
Postprocess extraction results
fetch("http://search.cpan.org/recent");
submit_form(
form_name => "f",
fields => {
query => "perl"
})
| _doc_filter(q(s/\A.+<!--results-->(.+)<!--end results-->.+\Z/$1/s))
| _template("<!--item-->[% rec %]<!--end item-->")
| _result_filter(use => "html_to_null", qw(rec))
| _result_filter(use => "decode_entities", qw(rec));
print Dumper \@$_;
ORMs
FEAR::API makes it very easy to transfer your extracted data straight into a database. All you need to do is set up an ORM and invoke the mapper once new results have been extracted. (Though I still think this is not quick enough; it would be better not to have to create any ORM classes at all, and have FEAR::API secretly build them for you.)
template($template);
extract;
invoke_handler('Some::Module::based::on::Class::DBI');
# or
invoke_handler('Some::Module::based::on::DBIx::Class::CDBICompat');
Scraping a file
It is possible to use FEAR::API to extract data from local files. This means you can use other web crawlers to fetch web pages and use FEAR::API to do the scraping.
file('some_file');
url('file:///the/path/to/your/file');
Then you need to tell FEAR::API what the content type is, because the document is loaded from your local file system. By default, FEAR::API assumes local files are plain text.
force_content_type('text/html');
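Putting the pieces together, a sketch of scraping a local HTML file could look like this (the file name and template are hypothetical):
use FEAR::API -base;
file('some_file.html');                           # load a document from disk
force_content_type('text/html');                  # treat it as HTML, not plain text
template("<!--item-->[% rec %]<!--end item-->");  # extraction template
extract;
print Dumper extresult;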
THE XXX FILES
FEAR::API empowers you to select sub-documents using XPath. If your document is not XML, you have to upgrade it first.
Upgrade HTML to XHTML
print fetch("google.com")->document->html_to_xhtml->as_string;
fetch("google.com") | _to_xhtml;
print $$_;
Do XPathing
print fetch("google.com")->document->html_to_xhtml->xpath('/html/body/*/form')->as_string;
fetch("google.com") | _to_xhtml | _xpath('/html/body/*/form');
print $$_;
Make your site scraping script a subroutine
It is possible to decompose your scripts or modules into several different components using SST (Site Scraping Template).
load_sst('fetch("google.com") >> _self; $_->() while $_');
run_sst;
load_sst('fetch("[% initial_link %]") >> _self; $_->() while $_');
run_sst({ initial_link => 'google.com'});
# Load from a file
load_sst_file("MY_SST");
run_sst({ initial_link => 'google.com'});
Tabbed scraping
I don't really know what this is good for. I added it because I saw that some scrapers could do this fancy stuff.
fetch("google.com"); # Default tab is 0
tab 1; # Create a new tab, and switch to it.
fetch("search.cpan.org"); # Fetch page in tab 1
tab 0; # Switch back to tab 0
template($template); # Continue processing in tab 0
extract();
keep_tab 1; # Keep tab 1 only and close others
close_tab 1; # Close tab 1
RSS
You can create RSS feeds easily with FEAR::API.
use FEAR::API -base, -rss;
my $url = "http://google.com";
url($url)->();
rss_new( $url, "Google", "Google Search Engine" );
rss_language( 'en' );
rss_webmaster( 'xxxxx@yourdomain.com' );
rss_twice_daily();
rss_item(@$_) for map{ [ $_->url(), $_->text() ] } links;
die "No items have been added." unless rss_item_count;
rss_save('google.rss');
See also XML::RSS::SimpleGen.
Parallel Fetching
Primitive type
FEAR::API provides a method pfetch(). It can do parallel fetching, but I don't really know whether it is safe to use. It is going to be replaced by the prefetching mechanism described below.
url("google.com")->() >> _self;
pfetch(sub{
local $_ = shift;
print join q/ /, title, current_url, document->size, $/;
});
Prefetching and document caching
I have designed two options for prefetching and document caching. One is written purely in Perl, and the other is a C++ web crawling engine. The Perl solution is simple and easy to install, but not very efficient, I think. The C++ crawler is extremely fast; its author claims it can fetch 100 million pages on a home PC with a good network connection. However, the C++ crawler is much more complex than the simple pure-Perl prefetching.
Native perl prefetching based on fork()
use FEAR::API -base, -prefetching;
Simple, but not efficient.
C++ parallel crawling based on pthread
use FEAR::API -base, -larbin;
Larbin is required. It is amazingly fast. See also http://larbin.sourceforge.net/index-eng.html and larbin/README.
The default document repository is /tmp/fear-api/pf. (This is not configurable for now.)
ONE-LINERS
fearperl -e 'fetch("google.com")'
perl -M'FEAR::API -base' -e 'fetch("google.com")'
ARTICLE
There is also an article about this module. Please see http://www.perl.com/pub/a/2006/06/01/fear-api.html.
DEBATE
This module has been heavily criticized on PerlMonks. Please go to http://perlmonks.org/?node_id=537504 for details.
SEE ALSO
WWW::Mechanize, LWP::UserAgent, LWP::Simple, perlrequick, perlretut, perlre, perlreref, Regexp::Bind, Template::Extract, Template, IO::All, XML::Parser, XML::XPath, XML::RSS, XML::RSS::SimpleGen, Data::Dumper, YAML, Class::DBI, DBIx::Class
Larbin http://larbin.sourceforge.net/index-eng.html
AUTHOR & COPYRIGHT
Copyright (C) 2006 by Yung-chung Lin (a.k.a. xern) <xern@cpan.org>
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.