NAME
File::Tabular::Web::Attachments::Indexed - Fulltext indexing in documents attached to File::Tabular::Web
DESCRIPTION
This abstract class adds support for fulltext indexing in documents attached to a File::Tabular::Web application.
Queries into the fulltext index should be passed under the SFT ("search full text") parameter, in addition to the usual S parameter (search in metadata record). For example,
http://my/app.ftw?S=2007&SFT=perl
will search for records containing the word "2007" and having an attached document that contains the word "perl". Queries can of course be much more complex, with boolean operators, parentheses, excluded words, etc.; see Search::Indexer and Query::Parser.
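More complex queries contain spaces and parentheses, which must be percent-encoded before they are placed in a URL. The following is a minimal sketch of building such a URL with core Perl only; the helper names uri_escape_simple and search_url are hypothetical, the escaping handles ASCII input only, and the query string is purely illustrative (the accepted syntax is defined by the modules mentioned above).

```perl
use strict;
use warnings;

# Percent-encode every character except the RFC 3986 "unreserved" set.
# Assumes ASCII input; wide characters would need UTF-8 encoding first.
sub uri_escape_simple {
    my ($str) = @_;
    $str =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $str;
}

# Build "base?name=value&..." with parameters in sorted order.
sub search_url {
    my ($base, %param) = @_;
    my @pairs = map { "$_=" . uri_escape_simple($param{$_}) } sort keys %param;
    return "$base?" . join('&', @pairs);
}

# S searches the metadata record, SFT searches the attached documents.
print search_url(
    'http://my/app.ftw',
    S   => '2007',
    SFT => 'perl AND (index OR search)',
), "\n";
```

In a real application a module such as URI would normally do this work; the hand-written escaping is used here only to keep the sketch free of dependencies.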
Indexing requires some mechanism to convert attached documents into plain text. This cannot be guessed by the present class, so you should write a subclass that implements such conversions; see the "SUBCLASSING" section below.
RESERVED FIELD NAMES
Records retrieved from a fulltext search will have two additional fields: score (how well the document matched the query) and excerpts (strings of text fragments close to the searched words). Therefore those field names should not be present as regular fields in the data file.
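A quick way to guard against such clashes is to check the field names of a data file before deploying it. The sketch below assumes the field names are already available as a plain list and that matching is case-sensitive; the function name reserved_field_clashes is hypothetical and not part of this module.

```perl
use strict;
use warnings;

# Field names added to records by a fulltext search.
my %RESERVED = map { $_ => 1 } qw(score excerpts);

# Return the names in the given list that clash with reserved fields.
sub reserved_field_clashes {
    my @field_names = @_;
    return grep { $RESERVED{$_} } @field_names;
}

my @clashes = reserved_field_clashes(qw(id title score author));
print "Clashing field names: @clashes\n" if @clashes;
```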
CONFIGURATION
[fields]
upload fieldname1
upload fieldname2 = indexed
Currently only one upload field can be indexed within a given application.
SUBCLASSING
This class relies on the "indexed_doc_content" method to convert attached documents into plain text, which is a prerequisite for indexing. The default implementation of "indexed_doc_content" just returns the raw file content, so it is unlikely to suit your needs; you should therefore write a subclass that overrides this method, and then associate this subclass with your application in the configuration file:
[application]
class = My::Subclass::Of::File::Tabular::Web::Attachments::Indexed
Asynchronous indexing
If your uploaded documents are Microsoft Office or OpenOffice documents, it may be too costly to convert them on the fly while answering the HTTP request. One way to deal with this is to override the "after_add_attachment" and "before_delete_attachment" methods: instead of immediately adding to or deleting from the index, these methods can write indexing requests into an event queue. A separate process then reads the event queue and performs the indexing operations.
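The producer side of such a queue could be sketched as follows. Several things here are assumptions rather than documented behaviour: the subclass name, the queue file location, the one-line "action, tab, path" record format, and in particular the argument list ($record, $field, $path) of the two overridden methods, which should be checked against the parent class documentation. What to enqueue also depends on how the separate indexing process identifies documents.

```perl
package My::Indexed::Queued;    # hypothetical subclass name
use strict;
use warnings;
use Fcntl qw(:flock);
# -norequire keeps this sketch loadable on its own; a real application
# would load the parent class normally.
use parent -norequire, 'File::Tabular::Web::Attachments::Indexed';

our $QUEUE_FILE = '/tmp/ftw_index_queue.txt';    # hypothetical location

# Append one "action<TAB>path" line to the queue under an exclusive lock.
sub enqueue_index_request {
    my ($action, $path) = @_;
    open(my $fh, '>>', $QUEUE_FILE) or die "open $QUEUE_FILE: $!";
    flock($fh, LOCK_EX)             or die "flock $QUEUE_FILE: $!";
    print {$fh} "$action\t$path\n";
    close($fh)                      or die "close $QUEUE_FILE: $!";
    return;
}

# Queue an indexing request instead of indexing immediately.
sub after_add_attachment {
    my ($self, $record, $field, $path) = @_;    # assumed signature
    enqueue_index_request('add', $path)
        if $field eq $self->{app}{indexed_field};
    return;
}

# Queue a removal request instead of updating the index immediately.
sub before_delete_attachment {
    my ($self, $record, $field, $path) = @_;    # assumed signature
    enqueue_index_request('delete', $path)
        if $field eq $self->{app}{indexed_field};
    return;
}

1;
```

The exclusive lock keeps concurrent HTTP requests from interleaving their lines; the process that consumes the queue would need to take the same lock before reading and truncating the file.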
METHODS
app_initialize
Calls the parent method, then records the name of the indexed field in $self->{app}{indexed_field}.
words_queried
Returns a list of words queried in either the S or SFT parameters.
log_search
Logs both the S and SFT parameters.
before_search
Performs the fulltext search and merges its results into the usual search string coming from the S parameter.
search
Calls the parent method and adds a score field into each record.
sort_and_slice
Calls the parent method and adds excerpts of the searched words from attached documents into each record of the slice.
add_excerpts
Finds excerpts of the searched words within attached documents and adds them to the result set.
params_for_next_slice
Returns a string repeating the search parameters, for generating URLs to the next or previous slice.
after_add_attachment
Performs the indexing of the attached document.
before_delete_attachment
Removes the document from the index.
indexed_doc_content
my $plain_text = $self->indexed_doc_content($record);
Returns the plain text representation of the document attached to $record. To get to the actual file, your implementation can access
my $path = $self->upload_fullpath($record, $self->{app}{indexed_field});
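As an illustration, a subclass for HTML attachments might override this method as sketched below. The subclass name and the html_to_text helper are hypothetical, the indexed field name is read from $self->{app}{indexed_field} as recorded by app_initialize, and the regex-based tag stripping is deliberately crude (it does not decode entities such as &amp;); a real converter would more likely rely on a dedicated parsing module or an external tool.

```perl
package My::Indexed::Html;    # hypothetical subclass name
use strict;
use warnings;
# -norequire keeps this sketch loadable on its own; a real application
# would load the parent class normally.
use parent -norequire, 'File::Tabular::Web::Attachments::Indexed';

# Crude HTML-to-text conversion: drop script/style blocks, then all tags,
# then collapse whitespace.
sub html_to_text {
    my ($html) = @_;
    $html =~ s{<(script|style)\b.*?</\1\s*>}{ }gis;
    $html =~ s{<[^>]*>}{ }g;
    $html =~ s{\s+}{ }g;
    $html =~ s{^ | $}{}g;
    return $html;
}

# Override: return plain text for the document attached to $record.
sub indexed_doc_content {
    my ($self, $record) = @_;
    my $path = $self->upload_fullpath($record, $self->{app}{indexed_field});
    open(my $fh, '<', $path) or die "open $path: $!";
    my $html = do { local $/; <$fh> };    # slurp the whole file
    close $fh;
    return html_to_text($html);
}

1;
```

Keeping the conversion in a separate function makes it easy to test without a running application, and the same function could be reused by an asynchronous indexing process.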