NAME
HTML::Index::Create - Perl module for creating a searchable index of HTML files
SYNOPSIS
    use HTML::Index::Create;

    my $store = HTML::Index::Store->new;
    my $indexer = HTML::Index::Create->new(
        VERBOSE => 1,
        STORE   => $store,
        PARSER  => 'html',
    );

    # Index documents whose name, contents and modification time are
    # supplied explicitly
    for ( ... )
    {
        my $doc = HTML::Index::Document->new(
            name     => $name,
            contents => $contents,
            mod_time => $mod_time,
        );
        $indexer->index_document( $doc );
    }

    # Index files by path
    for ( ... )
    {
        my $doc = HTML::Index::Document->new( path => $path );
        # name, contents and mod_time default to the path, contents and
        # modification time of $path
        $indexer->index_document( $doc );
    }
DESCRIPTION
Documents are parsed using either the HTML::TreeBuilder module or a "quick and dirty" regex; the choice is yours (see the PARSER option). Words are stored in lower case. A word is any run of at least two alphanumeric characters ([a-z\d]{2,}).
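As a minimal sketch of the word rule above, the following shows how text might be split into indexable words. The extract_words helper is hypothetical and is not part of HTML::Index::Create; only the pattern [a-z\d]{2,} comes from this documentation.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not the module's own code): returns
# the indexable words found in a piece of already-extracted text.
# Call it in list context.
sub extract_words
{
    my ( $text ) = @_;
    # Lowercase first, then pick out every run of two or more
    # letters or digits
    return lc( $text ) =~ /[a-z\d]{2,}/g;
}

my @words = extract_words( "A Quick-and-dirty HTML 4 parser, v2" );
print join( ' ', @words ), "\n";    # prints "quick and dirty html parser v2"
```

Single characters such as "a" and "4" are dropped because the pattern requires at least two characters.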
Indexes are stored as Berkeley DB files, but all storage operations are contained in the HTML::Index::Store module, which could be subclassed to support other storage options (such as SQL databases).
The inverted index (which stores the list of documents for each word) can be compressed. This adds a small overhead to indexing, but probably makes searching faster: decompression is fast, and a compressed index is more likely to fit in memory.
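To illustrate the idea of a compressed inverted index, here is a rough sketch using Compress::Zlib. The encoding shown (BER-packed document ids, then zlib) and the helper names freeze_postings and thaw_postings are assumptions made for the example; they are not taken from HTML::Index::Store and may differ from what the module actually does.

```perl
use strict;
use warnings;
use Compress::Zlib qw( compress uncompress );

# Hypothetical helper: turn a list of numeric document ids into a
# compressed string suitable for storing against a word.
sub freeze_postings
{
    my ( @doc_ids ) = @_;
    # "w*" packs each unsigned integer in a variable-length (BER) form
    return compress( pack( 'w*', sort { $a <=> $b } @doc_ids ) );
}

# Hypothetical helper: recover the list of document ids.
sub thaw_postings
{
    my ( $blob ) = @_;
    return unpack( 'w*', uncompress( $blob ) );
}

# A toy in-memory inverted index: word => compressed list of document ids
my %index = ( perl => freeze_postings( 3, 1, 42, 1000 ) );
my @docs = thaw_postings( $index{perl} );
print "@docs\n";    # prints "1 3 42 1000"
```

Each search pays for one decompression per word looked up, which is the trade-off the paragraph above describes.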
CONSTRUCTOR OPTIONS
- VERBOSE
If true, prints diagnostic messages to STDERR.
- STORE
An object which ISA HTML::Index::Store.
- PARSER
Should be one of html or regex. If html, documents are parsed using HTML::TreeBuilder to extract visible text. If regex, the same job is done by a "quick and dirty" regex.
- REFRESH
If true, the index will be refreshed (all existing data will be lost).
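To give a feel for what the regex value of the PARSER option involves, here is one possible "quick and dirty" way to extract visible text from HTML. The visible_text helper is hypothetical, and the module's own regex may well differ.

```perl
use strict;
use warnings;

# Hypothetical extractor (an assumption, not the module's own regex)
sub visible_text
{
    my ( $html ) = @_;
    # Drop script and style blocks entirely, then comments, then any
    # remaining tags
    $html =~ s{<(script|style)\b.*?</\1\s*>}{ }gis;
    $html =~ s{<!--.*?-->}{ }gs;
    $html =~ s{<[^>]*>}{ }g;
    # Collapse the whitespace left behind by the removed markup
    $html =~ s/\s+/ /g;
    $html =~ s/^ | $//g;
    return $html;
}

print visible_text( '<p>Hello <b>world</b></p>' ), "\n";    # prints "Hello world"
```

A regex like this is fast but is easily confused by malformed markup or by ">" characters inside attribute values, which is the usual reason for preferring a real parser such as HTML::TreeBuilder.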
METHODS
- index_document
Takes an HTML::Index::Document as its argument and indexes it, using either the document's contents attribute or the contents of the file named by its path attribute. A search returns the document's name attribute, or its path attribute if no name was given. If an entry for that name already exists, the document is re-indexed only if its modification time has changed. The mod_time attribute can be set explicitly; otherwise it defaults to the modification time of the file named by path.
The HTML::Index::Document abstraction makes the simple case of indexing file paths easy, while also allowing any other data source (entries in a database, for example) to be indexed.
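The re-indexing rule described for index_document can be pictured with a small sketch. The %indexed_at hash and the needs_indexing function are hypothetical stand-ins for whatever bookkeeping the store really performs; they only illustrate the "re-index when the modification time has changed" decision.

```perl
use strict;
use warnings;

# Hypothetical bookkeeping: last indexed modification time, keyed by
# document name
my %indexed_at;

# Returns 1 if the document is new or its mod_time differs from the one
# recorded at the last indexing, and records the new time; returns 0 if
# the document is unchanged.
sub needs_indexing
{
    my ( $name, $mod_time ) = @_;
    my $seen = $indexed_at{$name};
    return 0 if defined $seen and $seen == $mod_time;
    $indexed_at{$name} = $mod_time;
    return 1;
}

print needs_indexing( 'a.html', 1000 ) ? "index\n" : "skip\n";    # new document
print needs_indexing( 'a.html', 1000 ) ? "index\n" : "skip\n";    # unchanged
print needs_indexing( 'a.html', 2000 ) ? "index\n" : "skip\n";    # modified
```

Because the decision depends only on a name and a modification time, the same check works for database rows or any other source that can supply both.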
SEE ALSO
AUTHOR
Ave Wrigley <Ave.Wrigley@itn.co.uk>
COPYRIGHT
Copyright (c) 2001 Ave Wrigley. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.