NAME
Apache::Wyrd::Services::Index
SYNOPSIS
my $init = {
file => '/var/lib/Wyrd/pageindex.db',
strict => 1,
attributes => [qw(author text subjects)],
maps => [qw(subjects)]
};
my $index = Apache::Wyrd::Services::Index->new($init);
my @subject_is_foobar = $index->word_search('foobar', 'subjects');
my @pages =
$index->word_search('+musthaveword -mustnothaveword
other words to search for and add to results');
foreach my $page (@pages) {
print "title: $$page{title}, author: $$page{author}\n";
}
@pages = $index->parsed_search('(this AND that) OR "the other"');
foreach my $page (@pages) {
print "title: $$page{title}, author: $$page{author}\n";
}
DESCRIPTION
General-purpose Index object for retrieving a variety of information on a class of objects. The objects can be of any type, but must implement at a minimum the Apache::Wyrd::Interfaces::Indexable interface.
The information stored is broken down into attributes. The main built-in (and non-overridable) attributes are data, word, title, and description, as well as three internal attributes: reverse, timestamp, and digest. Additional attributes are specified via the hashref argument to the new method (see below). An index can hold at most 255 attributes in total.
Attributes are of two types, regular and map, and both relate to the main index, id. A regular attribute stores information on a one-id-to-one-attribute basis, such as title or description. A map attribute provides a reverse lookup from a token to the entries that carry it, such as the documents containing the word "foo" or the items classified under the subject "bar". One built-in map exists, word, which reverse-indexes every word of the attribute data.
The Index is meant to be used as storage for metadata about web pages, and in this capacity, data and word provide the exact-match and word-search capacity respectively.
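The difference between the two attribute types can be pictured with plain Perl hashes. This is only an illustration of the lookup directions; the hash names and sample data are made up, and the module's actual on-disk layout is not described here.

```perl
use strict;
use warnings;

# Illustrative only: plain hashes standing in for the index's storage.
# Regular attribute: one value per entry id.
my %title = (
    1 => 'Intro to Foo',
    2 => 'Bar Handbook',
);

# Map attribute: one token points back at every id that carries it.
my %subjects = (
    foo => [1],
    bar => [1, 2],
);

# Forward lookup: the title of entry 2.
my $title_of_2 = $title{2};

# Reverse lookup: titles of all entries classified as "bar".
my @bar_titles = map { $title{$_} } @{ $subjects{bar} };

print "$title_of_2\n";
print join(', ', @bar_titles), "\n";
```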
The internal attributes of digest and timestamp are also used to determine whether the information for the item is fresh. It is assumed that testing a timestamp is faster than producing a digest, and that a digest is faster to produce than re-indexing a document, so a check against these two criteria is made before updating an entry for a given item. See update_entry.
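The cheapest-test-first ordering described above might be sketched as follows. The helper name needs_update, the shape of the stored record, and the choice of SHA-1 (via the core Digest::SHA module) are assumptions made for the example; the module's own digest and storage format may differ.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);

# Hypothetical helper: decide whether an entry needs re-indexing.
# $stored holds the timestamp and digest saved at the last update,
# $mtime is the item's current timestamp, and $get_content is a coderef
# returning the current content (only called if the timestamp differs).
sub needs_update {
    my ($stored, $mtime, $get_content) = @_;

    # Cheapest test first: an unchanged timestamp means the entry is fresh.
    return 0 if defined $stored->{timestamp} && $stored->{timestamp} == $mtime;

    # Timestamp differs; compare digests before paying for a re-index.
    my $digest = sha1_hex($get_content->());
    return 0 if defined $stored->{digest} && $stored->{digest} eq $digest;

    return 1;
}

my $stored = { timestamp => 1000, digest => sha1_hex('hello world') };

print needs_update($stored, 1000, sub { 'anything' }), "\n";      # same timestamp
print needs_update($stored, 2000, sub { 'hello world' }), "\n";   # touched, same content
print needs_update($stored, 2000, sub { 'hello there' }), "\n";   # content changed
```

Passing the content as a coderef keeps the digest from being computed at all when the timestamp already matches.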
The information is stored in a Berkeley DB, using the BerkeleyDB::Btree Perl module. Because several Apache daemons in a pool of servers may use the index concurrently, it is important that this be a reasonably current version of BerkeleyDB which supports locking and read-during-update. This module was developed using Berkeley DB v. 3.3 on Darwin and Linux and has been tested a bit on Berkeley DB versions 4.0 and 4.1.
Use with vast amounts of large documents is not recommended, but a reasonably large (hundreds of 1000-word pages) web site can be indexed and searched reasonably quickly(TM) on most cheap servers as of this writing. All hail Moore's Law.
METHODS
(format: (returns) name (arguments after self))
- (Apache::Wyrd::Services::Index)
new
(hashref) -
Create a new Index object, creating the associated DB file if necessary. The index is configured via a hashref argument. Important keys for this hashref:
- file
-
Absolute path and filename for the DB file. Must be writeable by the Apache process.
- strict
-
Die on errors. Default 1 (yes).
- quiet
-
If not strict, be quiet in the error log about problems. Use at your own risk.
- attributes
-
Arrayref of attributes other than the default to use. For every attribute foo, an index_foo method should be implemented by the object being indexed. The value returned by this method will be stored under the attribute foo.
- maps
-
Arrayref of which attributes to treat as maps. Any attribute that is a map must also be included in the list of attributes.
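The index_foo naming convention can be sketched without the index itself. The class My::Page, its fields, and the comma-separated return value for the subjects map are all hypothetical, and a real indexable object must also satisfy the rest of the Apache::Wyrd::Interfaces::Indexable interface, which is not shown.

```perl
use strict;
use warnings;

# Hypothetical indexable class: one index_<attribute> method per
# attribute listed under the "attributes" key of the init hashref.
package My::Page;

sub new { my ($class, %args) = @_; return bless {%args}, $class }

sub index_author   { $_[0]{author} }
sub index_text     { $_[0]{body} }
sub index_subjects { join ',', @{ $_[0]{subjects} } }   # map attribute (format assumed)

package main;

my @attributes = qw(author text subjects);

my $page = My::Page->new(
    author   => 'A. Writer',
    body     => 'Some page text',
    subjects => [qw(foo bar)],
);

# Collect one value per attribute by calling the matching index_ method.
my %entry;
foreach my $attribute (@attributes) {
    my $method = "index_$attribute";
    die "object does not implement $method" unless $page->can($method);
    $entry{$attribute} = $page->$method;
}

print "$_: $entry{$_}\n" for sort keys %entry;
```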
- (void)
delete_index
(void) -
Zero all data in the index and open a new one.
- (scalar)
update_entry
(Apache::Wyrd::Interfaces::Indexable ref) -
Called by an indexable object, passing itself as the argument, in order to update its entry in the index. This method calls index_foo for every attribute foo in the index, storing that value under the attribute entry for that object. The function always returns a message about the process.
update_entry will always check index_timestamp and index_digest. If the stored value and the returned value agree on either attribute, the index will not be updated. This behavior can be overridden by returning a true value from the method force_update.
- (hashref)
entry_by_name
(scalar) -
Given the value of a name attribute, returns a hashref of all the regular attributes stored for that entry.
- (scalar)
clean_html
(scalar) -
Given a string of HTML, this method strips out all tags, comments, etc., and returns only clean text for breaking down into tokens. You may want to override this method -- the default method is pretty quick-and-dirty.
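A replacement cleaner could look something like the sketch below. It is written as a plain function for brevity (an overriding method in a subclass would receive the object as its first argument), it is regex-based and not a real HTML parser, and it does not decode character entities.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an overridden clean_html: a quick
# regex-based cleaner, not a real HTML parser.
sub clean_html {
    my ($html) = @_;
    $html =~ s/<!--.*?-->/ /gs;                       # comments
    $html =~ s/<(script|style)\b.*?<\/\1\s*>/ /gis;   # script and style blocks
    $html =~ s/<[^>]*>/ /g;                           # remaining tags
    $html =~ s/\s+/ /g;                               # collapse whitespace
    $html =~ s/^ | $//g;                              # trim both ends
    return $html;
}

print clean_html('<p>Hello <b>world</b><!-- note --></p>'), "\n";
```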
- (array)
word_search
(scalar, [scalar]) -
Returns entries matching tokens in a string within a given map attribute. As map attributes store one token, such as a word, against which all entries are indexed, the string is broken into tokens before processing, with commas and whitespace delimiting the tokens unless they are enclosed in double quotes.
If a token begins with a plus sign (+), results must have the word; if it begins with a minus sign (-), they must not. These signs can also be placed to the left of phrases enclosed in double quotes.
Results are returned as an array of hashrefs ranked by "score". The key "score" is added to each hash and holds the number of matches for that entry. All other regular attributes of the indexable object appear as keys of each returned hash.
The default map to use for this method is 'word'. If the optional second argument is given, that map will be used.
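The tokenizing rules above can be approximated as follows. The function tokenize and its regular expression are a sketch written for this example, not the module's own parser, so edge cases may be handled differently.

```perl
use strict;
use warnings;

# Split a search string into tokens, honouring double-quoted phrases and
# leading +/- signs. Commas and whitespace separate unquoted tokens.
sub tokenize {
    my ($string) = @_;
    my (@must, @must_not, @optional);
    while ($string =~ /([+-]?)(?:"([^"]*)"|([^\s,"]+))/g) {
        my $sign  = $1;
        my $token = defined $2 ? $2 : $3;
        next unless length $token;
        if    ($sign eq '+') { push @must,     $token }
        elsif ($sign eq '-') { push @must_not, $token }
        else                 { push @optional, $token }
    }
    return (\@must, \@must_not, \@optional);
}

my ($must, $must_not, $optional) = tokenize('+perl -"mod python", apache index');
print "must: @$must\n";
print "must not: @$must_not\n";
print "optional: @$optional\n";
```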
- (array)
search
(scalar, [scalar]) -
Alias for word_search. Required by Apache::Wyrd::Services::SearchParser.
- (array)
parsed_search
(scalar, [scalar]) -
Same as word_search, but with the logical qualifiers AND, OR, and NOT. More complex searches can be accomplished, at a cost of speed.
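One way to think about the logical qualifiers is as set operations on the ids matched by each term. The sketch below uses made-up hit lists and helper names (to_set, set_and, set_or, set_not); it is not how Apache::Wyrd::Services::SearchParser is implemented, only a model of what the operators mean.

```perl
use strict;
use warnings;

# Each term's result is a set of entry ids, kept as hash keys.
sub to_set { my %set = map { $_ => 1 } @_; return \%set }

# AND keeps ids present in both sets, OR keeps ids present in either,
# and NOT removes the ids of the second set from the first.
sub set_and { my ($x, $y) = @_; return to_set(grep { $y->{$_} } keys %$x) }
sub set_or  { my ($x, $y) = @_; return to_set(keys %$x, keys %$y) }
sub set_not { my ($x, $y) = @_; return to_set(grep { !$y->{$_} } keys %$x) }

# Hypothetical per-term matches.
my %hits = (
    this        => to_set(1, 2, 3),
    that        => to_set(2, 3, 4),
    'the other' => to_set(5),
);

# (this AND that) OR "the other"
my $result = set_or(set_and($hits{this}, $hits{that}), $hits{'the other'});
print join(', ', sort { $a <=> $b } keys %$result), "\n";
```

Every operator needs the full result set of each operand before it can combine them, which gives some intuition for why such searches cost more than a plain word_search.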
BUGS/CAVEATS
UNKNOWN
AUTHOR
Barry King <wyrd@nospam.wyrdwright.com>
SEE ALSO
- Apache::Wyrd
-
General-purpose HTML-embeddable perl object
- Apache::Wyrd::Interfaces::Indexable
-
Methods to be implemented by any item that wants to be indexed.
- Apache::Wyrd::Services::SearchParser
-
Parser for handling logical searches (AND/OR/NOT).
LICENSE
Copyright 2002-2004 Wyrdwright, Inc. and licensed under the GNU GPL.
See LICENSE under the documentation for Apache::Wyrd.