NAME
SWISH::Prog - information retrieval application framework
SYNOPSIS
use SWISH::Prog;
my $program = SWISH::Prog->new(
    invindex   => 'path/to/myindex',
    aggregator => 'fs',
    indexer    => 'native',
    config     => 'some/swish/config/file',
    filter     => sub { print $_[0]->url . "\n" },
);
$program->run('some/dir');
print $program->count . " documents indexed\n";
DESCRIPTION
SWISH::Prog is a full-text search framework based on Swish-e (http://swish-e.org/).
SWISH::Prog tries to fill a niche similar to Data::SearchEngine or DBI: providing a uniform and flexible interface to several different search engine tools and libraries.
SWISH::Prog does not try to replace the use of the underlying search engine tools, but instead tries to fill in some usability gaps and, like the DBI, make it relatively easy to switch between backend tools without needing to re-write an entire codebase.
SWISH::Prog implements all five basic components of a search application:
- Aggregator
Gather a document collection. A collection might be a group of HTML pages, or XML documents, or rows in a database. A collection might originate from the web, a filesystem, a database, an email inbox, or anywhere bytes are stored. An Aggregator gathers those documents in a uniform way.
SWISH::Prog provides a variety of Aggregators, for filesystems, email trees, spidering the web, pulling from databases, to name a few. See SWISH::Prog::Aggregator and its subclasses.
- Normalizer
Documents come in a variety of formats (MIME types). A Normalizer turns those disparate types into something text-based and parseable. SWISH::Prog uses SWISH::Filter to normalize documents.
- Parser/Analyzer
Documents are tokenized into "words" with attention to position, context, length, encoding, and linguistic quality (stemming, case, stopwords, etc.).
With the exception of the Native classes, SWISH::Prog uses SWISH::3 to parse HTML and XML documents (the most common normalized format for SWISH::Filter), and then delegates further analysis (tokenization, etc.) to backend tools or libraries.
- Indexer
Each SWISH::Prog::Indexer subclass fronts an information retrieval (IR) tool or library that implements its own proprietary, highly optimized inverted index storage system that preserves the intelligence of the Parser/Analyzer.
For example, the SWISH::Prog::Lucy::Indexer is a wrapper around Lucy::Index::Indexer. SWISH::Prog::Native::Indexer is a wrapper around the swish-e tool.
- Searcher
Like the Indexer, each SWISH::Prog::Searcher subclass delegates the searching of the inverted index to the backend IR tool or library.
For example, the SWISH::Prog::Lucy::Searcher is a wrapper around Lucy::Search::PolySearcher. SWISH::Prog::Native::Searcher is a wrapper around the SWISH::API::More module.
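The five components above can be pictured as a small pipeline. The following is a toy sketch in core Perl only, not SWISH::Prog's implementation: the in-memory collection, the function names normalize(), tokenize() and search(), and the %invindex layout are all assumptions made for illustration.

```perl
use strict;
use warnings;

# 1. Aggregator: gather documents (here, a hard-coded in-memory collection).
my %collection = (
    'doc1.html' => '<p>Swish-e indexes documents</p>',
    'doc2.txt'  => 'Lucy searches documents quickly',
);

# 2. Normalizer: reduce each format to plain text (here, crude tag stripping).
sub normalize {
    my ($raw) = @_;
    ( my $text = $raw ) =~ s/<[^>]+>/ /g;
    return $text;
}

# 3. Parser/Analyzer: tokenize into lowercase words.
sub tokenize {
    my ($text) = @_;
    return map { lc($_) } ( $text =~ /(\w+)/g );
}

# 4. Indexer: build a toy inverted index of term => { url => 1 }.
my %invindex;
for my $url ( sort keys %collection ) {
    for my $term ( tokenize( normalize( $collection{$url} ) ) ) {
        $invindex{$term}{$url} = 1;
    }
}

# 5. Searcher: look a single term up in the inverted index.
sub search {
    my ($term) = @_;
    my @urls = sort keys %{ $invindex{ lc $term } || {} };
    return @urls;
}

print join( ', ', search('documents') ), "\n";    # doc1.html, doc2.txt
```

In SWISH::Prog each of these stages is a separate class, and the Indexer and Searcher stages are delegated to the backend library rather than implemented in Perl hashes.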
BACKGROUND
The name "SWISH::Prog" comes from the Swish-e -S prog feature. "prog" is short for "program". SWISH::Prog makes it easy to write indexing and search programs.
SWISH::Prog started as a way of making the swish-e binary tool easier to integrate into Perl applications, and has since been expanded as a full implementation of Swish3, with alternate backend libraries (KinoSearch, Xapian, Apache Lucy, etc.) filling the Indexer and Searcher roles.
METHODS
All of the following methods may be overridden when subclassing this module.
init
Overrides base SWISH::Prog::Class init() method.
filter( CODE ref )
Set in new(). See SWISH::Prog::Doc.
Example:
my $prog = SWISH::Prog->new(
    filter => sub {
        my $doc = shift;
        # alter url (dots escaped so they match literally)
        my $url = $doc->url;
        $url =~ s/my\.foo\.com/my.bar.org/;
        $doc->url( $url );
        # alter content
        my $buf = $doc->content;
        $buf =~ s/foo/bar/gi;
        $doc->content( $buf );
    }
);
The filter value can also be the name of a file that evals to a CODE ref.
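To illustrate what "a file that evals to a CODE ref" means, here is a minimal sketch using only core Perl. It assumes the file is evaluated with something equivalent to Perl's do FILE; how SWISH::Prog actually loads the file is not stated here. My::MockDoc is a hypothetical stand-in for SWISH::Prog::Doc that provides only a url() accessor.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Hypothetical stand-in for SWISH::Prog::Doc, with just a url() accessor.
{
    package My::MockDoc;
    sub new { my ( $class, %args ) = @_; return bless {%args}, $class }
    sub url {
        my $self = shift;
        $self->{url} = shift if @_;
        return $self->{url};
    }
}

# Write a filter file whose last evaluated expression is an anonymous sub.
my ( $fh, $filename ) = tempfile( SUFFIX => '.pl', UNLINK => 1 );
print {$fh} <<'FILTER';
sub {
    my $doc = shift;
    ( my $url = $doc->url ) =~ s/my\.foo\.com/my.bar.org/;
    $doc->url($url);
};
FILTER
close $fh or die "close failed: $!";

# Evaluate the file; do FILE returns the value of its last expression.
my $filter = do $filename;
die "could not load filter: " . ( $@ || $! ) unless ref($filter) eq 'CODE';

# Apply the loaded filter to a document, as a filter callback would be.
my $doc = My::MockDoc->new( url => 'http://my.foo.com/page.html' );
$filter->($doc);
print $doc->url, "\n";    # http://my.bar.org/page.html
```

Keeping the filter in its own file lets the same callback be shared between several indexing scripts.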
aggregator( $swish_prog_aggregator )
Get the SWISH::Prog::Aggregator object. You should set this in new().
aggregator_opts
Get the hashref of options passed internally to the aggregator constructor.
indexer_opts
Get the hashref of options passed internally to the indexer constructor.
run( collection )
Execute the program. This is an alias for index().
index( collection )
Add items in collection to the invindex().
config
Returns the aggregator's config() object.
invindex
Returns the indexer's invindex.
indexer
Returns the indexer.
count
Returns the indexer's count. NOTE: This is the number of documents actually indexed; it does not include documents that the aggregator considered and discarded. If you want the number of documents the aggregator looked at, regardless of whether they were indexed, use the aggregator's count() method.
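The difference between the two counts can be reported with a small helper. This is a sketch: count_summary() is a hypothetical function, not part of SWISH::Prog, and it calls only the count() and aggregator() methods described in this document. The My::MockProgram and My::MockAggregator classes are stand-ins so the example runs without building an index. It also assumes the aggregator's count is never smaller than the indexer's.

```perl
use strict;
use warnings;

# Hypothetical helper: works on any object offering count() and aggregator().
sub count_summary {
    my ($program) = @_;
    my $indexed = $program->count;                # documents actually indexed
    my $seen    = $program->aggregator->count;    # documents the aggregator looked at
    return sprintf '%d of %d documents indexed, %d discarded',
        $indexed, $seen, $seen - $indexed;
}

# Stand-ins for the real objects, for illustration only.
{
    package My::MockAggregator;
    sub new   { my ( $class, $count ) = @_; return bless { count => $count }, $class }
    sub count { return $_[0]{count} }
}
{
    package My::MockProgram;
    sub new        { my ( $class, %args ) = @_; return bless {%args}, $class }
    sub count      { return $_[0]{count} }
    sub aggregator { return $_[0]{aggregator} }
}

my $program = My::MockProgram->new(
    count      => 8,
    aggregator => My::MockAggregator->new(10),
);
print count_summary($program), "\n";    # 8 of 10 documents indexed, 2 discarded
```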
test_mode
Dry-run mode: prints information to stderr but does not build an index. This flag is set in new() and passed to the indexer and aggregator.
AUTHOR
Peter Karman, <perl@peknet.com>
BUGS
Please report any bugs or feature requests to bug-swish-prog at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=SWISH-Prog. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
SUPPORT
You can find documentation for this module with the perldoc command.
perldoc SWISH::Prog
You can also look for information at:
Mailing list
RT: CPAN's request tracker
AnnoCPAN: Annotated CPAN documentation
CPAN Ratings
Search CPAN
COPYRIGHT AND LICENSE
Copyright 2008-2009, 2012 by Peter Karman
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
SEE ALSO
SWISH::Prog::Doc, SWISH::Prog::Headers, SWISH::Prog::Indexer, SWISH::Prog::InvIndex, SWISH::Prog::Utils, SWISH::Prog::Aggregator, SWISH::Prog::Config