NAME

HTML::Index::Create - Perl module for creating a searchable index of HTML files

SYNOPSIS

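  use HTML::Index::Create;
  use HTML::Index::Store;
  use HTML::Index::Document;

  # The store holds the index data; the indexer writes to it.
  my $store = HTML::Index::Store->new;
  my $indexer = HTML::Index::Create->new(
      VERBOSE             => 1,
      STORE               => $store,
      PARSER              => 'HTML',
  );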

  for ( ... )
  {
      my $doc = HTML::Index::Document->new( 
          name        => $name,
          contents    => $contents,
          mod_time    => $mod_time,
      );
      $indexer->index_document( $doc );
  }

  for ( ... )
  {
      my $doc = HTML::Index::Document->new( path => $path );
      # name, contents, and mod_time are the path, contents and modification
      # time of $path
      $indexer->index_document( $doc );
  }

DESCRIPTION

All files are parsed using either the HTML::TreeBuilder module or a "quick and dirty" regex; the choice is yours (see the PARSER option). Words are stored in lowercase; a word is any run of at least two alphanumeric characters ([a-z\d]{2,}).
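As a rough illustration of the word rule above, the following sketch lowercases a string and pulls out every run matching [a-z\d]{2,}. The extract_words helper is hypothetical and is not part of HTML::Index::Create; the module's own tokenizer may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not the module's API): return the
# lowercase alphanumeric runs of length >= 2 found in $text.
sub extract_words {
    my ($text) = @_;
    my @words = lc($text) =~ /([a-z\d]{2,})/g;
    return @words;
}

my @words = extract_words("Perl 5 is A-OK: see RFC2616!");
print join( ' ', @words ), "\n";
```

Single characters such as "5" and "A" are dropped because they are shorter than two characters, and punctuation acts as a word separator.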

Indexes are stored as Berkeley DB files, but all storage operations are contained in the HTML::Index::Store module, which could be subclassed to support other storage options (such as SQL databases).

The inverted index (which stores the list of documents for each word) can be compressed. This adds a small overhead to the indexing, but is probably faster for search (since decompression is fast, and it is more likely that the index can be processed in memory).
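The idea of a compressed inverted index can be sketched in a few lines. This is only an illustration under stated assumptions: the in-memory hashes, the pack 'w*' encoding of document ids, and the use of Compress::Zlib are choices made for this example, and nothing here describes the format HTML::Index actually writes to disk.

```perl
use strict;
use warnings;
use Compress::Zlib;    # exports compress() and uncompress()

# Hypothetical sample data: document id => text.
my %docs = (
    1 => 'the quick brown fox',
    2 => 'the lazy dog',
    3 => 'quick quick dog',
);

# Build the inverted index: word => list of ids of documents containing it.
my %index;
for my $id ( sort { $a <=> $b } keys %docs ) {
    my %seen;
    my @words = grep { !$seen{$_}++ } lc( $docs{$id} ) =~ /([a-z\d]{2,})/g;
    push @{ $index{$_} }, $id for @words;
}

# Serialize each list of ids and compress it before storing it.
my %stored;
$stored{$_} = compress( pack( 'w*', @{ $index{$_} } ) ) for keys %index;

# Decompress and unpack on lookup; returns an empty list for unknown words.
sub lookup {
    my ($word) = @_;
    return () unless exists $stored{$word};
    my @ids = unpack 'w*', uncompress( $stored{$word} );
    return @ids;
}

print join( ',', lookup('quick') ), "\n";
```

With lists this short, compression makes the stored values larger rather than smaller; the benefit only appears once a word occurs in many documents.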

CONSTRUCTOR OPTIONS

VERBOSE: Prints diagnostic messages to STDERR.
STORE: An object which ISA HTML::Index::Store.
PARSER: Should be one of html or regex. If html, documents are parsed using HTML::TreeBuilder to extract visible text. If regex, the same job is done by a "quick and dirty" regex.
REFRESH: If true, the index will be refreshed (all existing data will be lost).

METHODS

index_document

Takes an HTML::Index::Document as its argument and indexes it, using either the document's contents attribute or the contents of the file named by its path attribute. A search will return either the document's name attribute or its path attribute. If an entry for that name already exists, the document is re-indexed only if its modification time has changed. The mod_time attribute can be set explicitly; otherwise it defaults to the modification time of the file named by the path attribute.
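The re-indexing rule can be pictured with a small sketch. The %indexed_at hash and both subroutines are hypothetical stand-ins for illustration; they are not the module's internals, which keep this information in the HTML::Index::Store object.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the store: document name => mod_time recorded
# when the document was last indexed.
my %indexed_at;

# True if the document has never been indexed, or if its modification
# time differs from the recorded one.
sub needs_indexing {
    my ( $name, $mod_time ) = @_;
    return 1 unless exists $indexed_at{$name};
    return $indexed_at{$name} != $mod_time ? 1 : 0;
}

# Index the document only when needed; returns 1 if indexed, 0 if skipped.
sub index_if_changed {
    my ( $name, $mod_time ) = @_;
    return 0 unless needs_indexing( $name, $mod_time );
    # ... parsing and indexing of the contents would happen here ...
    $indexed_at{$name} = $mod_time;
    return 1;
}

for my $mod_time ( 1000, 1000, 2000 ) {
    my $status = index_if_changed( 'a.html', $mod_time ) ? 'indexed' : 'skipped';
    print "$mod_time: $status\n";
}
```

The first call indexes the document, the second is skipped because the modification time is unchanged, and the third re-indexes it because the time has moved on.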

The HTML::Index::Document abstraction keeps the simple case, indexing files by path, straightforward, while also allowing any other data source (entries in a database, for example) to be indexed.

AUTHOR

Ave Wrigley <Ave.Wrigley@itn.co.uk>

COPYRIGHT
