NAME

WWW::Extractor - Semi-automated extraction of records from WWW pages

SYNOPSIS

use strict;
use WWW::Extractor;

my($extractor) = WWW::Extractor->new();

# $string holds the HTML text, with one record marked up as described below
$extractor->process($string);

DESCRIPTION

WWW::Extractor is a tool for semi-automated extraction of records from a string containing HTML. One record within the string is marked up with extraction markup, and the module uses a pattern matching algorithm to find the remaining records.

Extraction markup

The user marks up one record within the HTML stream with the following symbols.

(((BEGIN)))

Begin a record

(((fieldname)))

Begin a field named fieldname

[[[literal string]]]

This identifies a block of text that the extractor attempts to match. This string is dumped out when the records are extracted.

{{{literal string}}}

This identifies a block of text that the extractor attempts to match. This string is not dumped out when the records are extracted.

(((nodump)))

This begins a section of text that is not to be dumped out.

(((/nodump)))

This ends a section of text that is not to be dumped out.

(((END)))

End a record.
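
As an illustration of the symbols above, the following sketch marks up the first record of a small, made-up HTML list and then prints the symbols in order of appearance. The page content and the field names "title" and "author" are invented for this example, and the placement of the markers is only one plausible reading of the descriptions above. The regular expression is a plain illustration and is not how WWW::Extractor parses its input.

```perl
use strict;
use warnings;

# Hypothetical page: three records, the first one marked up by hand.
# The field names "title" and "author" are assumptions for this sketch.
my $string = <<'HTML';
<ul>
(((BEGIN)))<li>[[[Title: ]]](((title)))Dune{{{ by }}}(((author)))Frank Herbert(((nodump)))<hr>(((/nodump)))</li>(((END)))
<li>Title: Emma by Jane Austen<hr></li>
<li>Title: Ulysses by James Joyce<hr></li>
</ul>
HTML

# Collect every markup symbol in order of appearance (illustration only).
my @symbols;
while ($string =~ /(\(\(\(.+?\)\)\)|\[\[\[.+?\]\]\]|\{\{\{.+?\}\}\})/g) {
    push @symbols, $1;
}
print "$_\n" for @symbols;
```

A string built this way is the kind of input that would be handed to process() as shown in the SYNOPSIS.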

ALGORITHM

The algorithm used is based on the edit distance wrapper generation method described in

@inproceedings{ chidlovskii00automatic,
    author = "Boris Chidlovskii and Jon Ragetli and Maarten de Rijke",
    title = "Automatic Wrapper Generation for Web Search Engines",
    booktitle = "Web-Age Information Management",
    pages = "399-410",
    year = "2000",
    url = "citeseer.nj.nec.com/chidlovskii00automatic.html" }

but with two major enhancements.

1 Before calculating edit distance, the system divides the tokens into different classification groups.
2 Instead of creating a general grammar from all of the records in a file, the data extractor creates one grammar from the sample entry and then matches the rest of the text to that one grammar.
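
The two enhancements can be sketched roughly as follows. This is a minimal illustration and not the module's actual code: the classification scheme (tag name for tags, TEXT for everything else) and the helper names classify and edit_distance are assumptions made for this example. Candidate records are compared against the single sample record by computing a standard Levenshtein distance over the token classes.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Split HTML into tags and text runs, then reduce each token to a coarse
# class. The class scheme here is an assumption for this sketch.
sub classify {
    my ($html) = @_;
    my @classes;
    for my $token ($html =~ /(<[^>]+>|[^<]+)/g) {
        next if $token =~ /^\s*$/;            # skip whitespace-only text
        if ($token =~ m{^<\s*(/?\w+)}) {
            push @classes, lc $1;             # e.g. "li", "/li", "b"
        } else {
            push @classes, 'TEXT';
        }
    }
    return @classes;
}

# Standard Levenshtein distance between two token sequences (array refs).
sub edit_distance {
    my ($s, $t) = @_;
    my ($m, $n) = (scalar @$s, scalar @$t);
    my @prev = (0 .. $n);
    for my $i (1 .. $m) {
        my @cur = ($i);
        for my $j (1 .. $n) {
            my $cost = $s->[$i - 1] eq $t->[$j - 1] ? 0 : 1;
            push @cur, min($prev[$j] + 1, $cur[$j - 1] + 1, $prev[$j - 1] + $cost);
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# One sample record serves as the pattern; candidates are compared to it.
my @sample = classify('<li><b>Dune</b> by Frank Herbert</li>');
my @same   = classify('<li><b>Emma</b> by Jane Austen</li>');
my @other  = classify('<li>Emma <i>by</i> Jane Austen</li>');

print edit_distance(\@sample, \@same),  "\n";
print edit_distance(\@sample, \@other), "\n";
```

Because the comparison works on classes and not on literal text, the second record is at distance 0 from the sample even though its title and author differ, while the third record, whose tag structure differs, is at a positive distance.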

DISCUSSION AND DEVELOPMENT

A wiki on this module is located at

http://www.gnacademy.org/twiki/bin/view/Gna/AutomatedDataExtraction

Please contact gna@gnacademy.org with ideas for improvements.

COPYRIGHT AND LICENSE

Copyright 2002, 2003 Globewide Network Academy

Redistributed under the terms of the GNU Lesser General Public License