NAME
WWW::Extractor - Semi-automated extraction of records from WWW pages
SYNOPSIS
use strict;
use WWW::Extractor;
# $string holds the HTML text, with one record marked up as described
# under "Extraction markup".
my $extractor = WWW::Extractor->new();
$extractor->process($string);
DESCRIPTION
WWW::Extractor is a tool for semi-automated extraction of records from a string containing HTML. One record within the string is marked up with extraction markup, and the module uses a pattern-matching algorithm to find the remaining records.
Extraction markup
The user marks up one record within the HTML stream with the following symbols.
- (((BEGIN)))
  Begin a record.
- (((fieldname)))
  Begin a field named fieldname.
- [[[literal string]]]
  This identifies a block of text that the extractor attempts to match. This string is dumped out when the records are extracted.
- {{{literal string}}}
  This identifies a block of text that the extractor attempts to match. This string is not dumped out when the records are extracted.
- (((nodump)))
  This marks an area of text that is not to be dumped out.
- (((/nodump)))
  This ends a section of text that is not to be dumped out.
- (((END)))
  End a record.
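To make the symbols concrete, here is a minimal sketch of a hand-marked sample and a small scanner that lists the markers it contains. The HTML, the field names (name, email), the placement of the {{{...}}} literal and the helper list_markers are all assumptions made up for this illustration; none of them come from WWW::Extractor, and the scanner does not reproduce the module's own parsing.

```perl
use strict;
use warnings;

# Hypothetical page: two table rows, only the first marked up by hand.
my $html = join '',
    '<table>',
    '(((BEGIN)))<tr><td>(((name)))Alice</td>',
    '<td>{{{Email: }}}(((email)))alice@example.com</td></tr>(((END)))',
    '<tr><td>Bob</td><td>Email: bob@example.com</td></tr>',
    '</table>';

# list_markers is a helper written for this sketch only.
# It returns one "kind:content" string per marker, in document order.
sub list_markers {
    my ($text) = @_;
    my @found;
    while ($text =~ /\(\(\((.*?)\)\)\)|\[\[\[(.*?)\]\]\]|\{\{\{(.*?)\}\}\}/gs) {
        # Copy the captures before running any other match.
        my ($paren, $square, $curly) = ($1, $2, $3);
        if (defined $paren) {
            # BEGIN, END, nodump and /nodump are control markers;
            # anything else in triple parentheses names a field.
            my $kind = $paren =~ m{^(?:BEGIN|END|/?nodump)$} ? 'control' : 'field';
            push @found, "$kind:$paren";
        }
        elsif (defined $square) {
            push @found, "dumped-literal:$square";
        }
        else {
            push @found, "hidden-literal:$curly";
        }
    }
    return @found;
}

print "$_\n" for list_markers($html);
```

Only the first row carries markers; the second row is left untouched, since the point of the markup is that the remaining records are found by matching against the marked-up one.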
ALGORITHM
The algorithm used is based on the edit distance wrapper generation method described in
@inproceedings{chidlovskii00automatic,
  author    = "Boris Chidlovskii and Jon Ragetli and Maarten de Rijke",
  title     = "Automatic Wrapper Generation for Web Search Engines",
  booktitle = "Web-Age Information Management",
  pages     = "399-410",
  year      = "2000",
  url       = "citeseer.nj.nec.com/chidlovskii00automatic.html"
}
but with two major enhancements.
1. Before calculating edit distance, the system divides the tokens into different classification groups.
2. Instead of creating a general grammar from all of the records in a file, the data extractor creates one grammar from the sample entry and then matches the rest of the text to that one grammar.
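The two enhancements can be illustrated with a rough sketch: tokens are first mapped to classes, and a plain Levenshtein edit distance is then computed over the class sequences, comparing each candidate against the single sample record. The class names (OPEN, CLOSE, NUM, TEXT), the tokenizer and the helper names below are assumptions chosen for this example; the classification groups and matching rules that WWW::Extractor actually uses are not specified in this document.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical classification groups for this sketch.
sub classify {
    my ($token) = @_;
    return 'CLOSE' if $token =~ m{^</};
    return 'OPEN'  if $token =~ m{^<};
    return 'NUM'   if $token =~ /^\d+(?:\.\d+)?$/;
    return 'TEXT';
}

# Split HTML into tag tokens and whitespace-separated text tokens.
sub tokenize {
    my ($html) = @_;
    my @tokens;
    for my $piece (split /(<[^>]*>)/, $html) {
        if ($piece =~ /^</) { push @tokens, $piece }
        else                { push @tokens, split ' ', $piece }
    }
    return @tokens;
}

# Levenshtein distance between two array refs of class names.
sub class_distance {
    my ($x, $y) = @_;
    my $n    = scalar @$y;
    my @prev = (0 .. $n);
    for my $i (1 .. scalar @$x) {
        my @cur = ($i);
        for my $j (1 .. $n) {
            my $cost = $x->[$i - 1] eq $y->[$j - 1] ? 0 : 1;
            push @cur, min($prev[$j] + 1,           # deletion
                           $cur[$j - 1] + 1,        # insertion
                           $prev[$j - 1] + $cost);  # match or substitution
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# One sample record acts as the "grammar"; candidates are compared to it.
my @sample = map { classify($_) } tokenize('<tr><td>Alice</td><td>30</td></tr>');
my @row    = map { classify($_) } tokenize('<tr><td>Bob</td><td>41</td></tr>');
my @other  = map { classify($_) } tokenize('<p>Totals: 2 rows</p>');

printf "row vs sample:   %d\n", class_distance(\@sample, \@row);
printf "other vs sample: %d\n", class_distance(\@sample, \@other);
```

Because the comparison runs on classes rather than raw tokens, a row with different names and numbers still has distance zero from the sample, while unrelated markup scores higher. How the module turns such distances into accept or reject decisions is outside what this sketch covers.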
DISCUSSION AND DEVELOPMENT
A wiki on this module is located at
http://www.gnacademy.org/twiki/bin/view/Gna/AutomatedDataExtraction
Please send ideas for improvements to gna@gnacademy.org.
COPYRIGHT AND LICENSE
Copyright 2002, 2003 Globewide Network Academy
Redistributed under the terms of the GNU Lesser General Public License.