
NAME

HTML::Index::Create - Perl module for creating a searchable HTML files index

SYNOPSIS

  use HTML::Index::Create;
  use HTML::Index::Store;
  use HTML::Index::Document;

  my $store = HTML::Index::Store->new;
  my $indexer = HTML::Index::Create->new(
      VERBOSE             => 1,
      STORE               => $store,
      PARSER              => 'html',
  );

  for ( ... )
  {
      my $doc = HTML::Index::Document->new( 
          name        => $name,
          contents    => $contents,
          mod_time    => $mod_time,
      );
      $indexer->index_document( $doc );
  }

  for ( ... )
  {
      my $doc = HTML::Index::Document->new( path => $path );
      # name, contents, and mod_time are the path, contents and modification
      # time of $path
      $indexer->index_document( $doc );
  }

DESCRIPTION

All documents are parsed using either the HTML::TreeBuilder module or a "quick and dirty" regex; the choice is yours (see the PARSER option). Words are lowercased before they are stored, and only words that are at least 2 characters long and consist of alphanumerics ([a-z\d]{2,}) are indexed.
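
As an illustration of the word rule above, here is a minimal sketch in plain Perl. The helper name extract_words is hypothetical and is not part of HTML::Index::Create; the module's own tokenizing code may differ.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Index::Create): split text into
# index terms using the rule described above.
sub extract_words {
    my ($text) = @_;

    # Lowercase first, then keep every run of two or more characters
    # drawn from [a-z\d]. Shorter runs and punctuation are dropped.
    return lc($text) =~ /([a-z\d]{2,})/g;
}

my @terms = extract_words('A Quick-and-Dirty regex, v2 of I/O');
print join( ' ', @terms ), "\n";    # prints "quick and dirty regex v2 of"
```

Single-character runs such as "a", "i" and "o" are discarded, which is why they do not appear in the output.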

Indexes are stored as Berkeley DB files, but all storage operations are contained in the HTML::Index::Store module, which could be subclassed to support other storage options (such as SQL databases).

The inverted index (which stores the list of documents for each word) can be compressed. This adds a small overhead to indexing, but is probably faster for searching, since decompression is fast and a compressed index is more likely to fit in memory.
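
The idea of a compressed inverted index can be sketched as follows. This is an assumption-laden illustration only: the text above does not say which compression scheme or on-disk layout the module uses, so the use of Compress::Zlib, the 'N*' packing, and the names freeze_postings and thaw_postings are all hypothetical.

```perl
use strict;
use warnings;
use Compress::Zlib qw(compress uncompress);

# Hypothetical inverted index: word => list of numeric document ids.
my %postings = (
    perl  => [ 1, 2, 5, 8 ],
    index => [ 2, 3 ],
);

# Pack a posting list into a byte string (32-bit integers) and deflate it.
sub freeze_postings {
    my ($ids) = @_;
    return compress( pack( 'N*', @$ids ) );
}

# Reverse of freeze_postings: inflate, then unpack the document ids.
sub thaw_postings {
    my ($blob) = @_;
    return [ unpack( 'N*', uncompress($blob) ) ];
}

# Store one compressed blob per word, as a key/value store would.
my %stored = map { $_ => freeze_postings( $postings{$_} ) } keys %postings;

my $ids = thaw_postings( $stored{perl} );
print "perl: @$ids\n";    # prints "perl: 1 2 5 8"
```

For very short posting lists like these, the compressed form can be larger than the raw one; any saving only shows up on words that occur in many documents.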

CONSTRUCTOR OPTIONS

VERBOSE

If true, prints diagnostic messages to STDERR.

STORE

An object which ISA HTML::Index::Store.

PARSER

Should be one of html or regex. If html, documents are parsed using HTML::TreeBuilder to extract visible text. If regex, the same job is done by a "quick and dirty" regex.
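
To give a feel for what a "quick and dirty" regex approach to extracting visible text can look like, here is a minimal sketch. The function visible_text is hypothetical; the regexes the module actually uses are not shown in this document and may well differ.

```perl
use strict;
use warnings;

# Hypothetical helper (not the module's own code): strip markup from an
# HTML string and return roughly the text a reader would see.
sub visible_text {
    my ($html) = @_;

    # Drop script and style elements together with their contents.
    $html =~ s{<(script|style)\b[^>]*>.*?</\1\s*>}{ }gis;

    # Drop comments, then any remaining tags.
    $html =~ s{<!--.*?-->}{ }gs;
    $html =~ s{<[^>]*>}{ }g;

    # Collapse runs of whitespace and trim both ends.
    $html =~ s/\s+/ /g;
    $html =~ s/^ | $//g;

    return $html;
}

my $page = '<body><style>p { color: red }</style>'
         . '<p>Hello <b>world</b></p><!-- note --></body>';
print visible_text($page), "\n";    # prints "Hello world"
```

A regex pass like this does not decode entities and is easily confused by malformed markup or a ">" inside an attribute value, which is the trade-off against a real parser such as HTML::TreeBuilder.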

REFRESH

If true, the index will be refreshed (all existing data will be lost).

METHODS

index_document

Takes an HTML::Index::Document as an argument and indexes it, based either on its contents attribute or on the contents of the file named by its path attribute. A search will return either its name attribute or its path attribute. If an entry for that name already exists, the document is re-indexed only if its modification time has changed. The mod_time attribute can be set explicitly; otherwise it defaults to the modification time of the file named by the path attribute.
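
The re-indexing rule can be sketched in plain Perl as below. This is not the module's implementation: documents are represented as plain hashes, and %indexed, resolve_mod_time and index_if_changed are hypothetical names introduced for illustration.

```perl
use strict;
use warnings;

# Hypothetical record of what has already been indexed: name => mod_time.
my %indexed;

# Use an explicit mod_time if given, else the mtime of the file at path.
sub resolve_mod_time {
    my (%doc) = @_;
    return defined $doc{mod_time} ? $doc{mod_time} : ( stat $doc{path} )[9];
}

# Returns 1 if the document was (re-)indexed, 0 if it was skipped
# because an entry with the same name and modification time exists.
sub index_if_changed {
    my (%doc) = @_;
    my $name     = defined $doc{name} ? $doc{name} : $doc{path};
    my $mod_time = resolve_mod_time(%doc);

    return 0 if exists $indexed{$name} && $indexed{$name} == $mod_time;

    # Real indexing work would happen here; the sketch only records it.
    $indexed{$name} = $mod_time;
    return 1;
}

print index_if_changed( name => 'a.html', mod_time => 100 ), "\n";    # 1
print index_if_changed( name => 'a.html', mod_time => 100 ), "\n";    # 0
print index_if_changed( name => 'a.html', mod_time => 200 ), "\n";    # 1
```

The second call is skipped because neither the name nor the modification time has changed since the first.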

The idea of using the HTML::Index::Document abstraction in this way is to keep the simple case of indexing file paths easy, while also allowing any other data source (such as entries in a database) to be indexed.

SEE ALSO

HTML::Index

AUTHOR

Ave Wrigley <Ave.Wrigley@itn.co.uk>

COPYRIGHT

Copyright (c) 2001 Ave Wrigley. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.