NAME
Search::Estraier - pure perl module to use Hyper Estraier search engine
SYNOPSIS
Simple indexer
use Search::Estraier;
# create and configure node
my $node = new Search::Estraier::Node(
url => 'http://localhost:1978/node/test',
user => 'admin',
passwd => 'admin',
create => 1,
label => 'Label for node',
croak_on_error => 1,
);
# create document
my $doc = new Search::Estraier::Document;
# add attributes
$doc->add_attr('@uri', "http://estraier.gov/example.txt");
$doc->add_attr('@title', "Over the Rainbow");
# add body text to document
$doc->add_text("Somewhere over the rainbow. Way up high.");
$doc->add_text("There's a land that I heard of once in a lullaby.");
die "error: ", $node->status,"\n" unless (eval { $node->put_doc($doc) });
Simple searcher
use Search::Estraier;
# create and configure node
my $node = new Search::Estraier::Node(
url => 'http://localhost:1978/node/test',
user => 'admin',
passwd => 'admin',
croak_on_error => 1,
);
# create condition
my $cond = new Search::Estraier::Condition;
# set search phrase
$cond->set_phrase("rainbow AND lullaby");
my $nres = $node->search($cond, 0);
if (defined($nres)) {
print "Got ", $nres->hits, " results\n";
# for each document in results
for my $i ( 0 ... $nres->doc_num - 1 ) {
# get result document
my $rdoc = $nres->get_doc($i);
# display attributes
print "URI: ", $rdoc->attr('@uri'),"\n";
print "Title: ", $rdoc->attr('@title'),"\n";
print $rdoc->snippet,"\n";
}
} else {
die "error: ", $node->status,"\n";
}
DESCRIPTION
This module is an implementation of the node API of Hyper Estraier. Since it is a perl-only module which depends only on standard perl modules, it will run on all platforms on which perl runs. It doesn't require compilation or Hyper Estraier development files on the target machine.
It is implemented as multiple packages which closely resemble the Ruby implementation. It also includes methods to manage nodes.
There are a few examples in the scripts directory of this distribution.
Inheritable common methods
These methods should really move somewhere else.
_s
Collapse runs of whitespace in a string into single spaces, and remove whitespace at the beginning and end
my $text = $self->_s(" this is a text ");
# $text is now 'this is a text'
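The effect described above can be reproduced outside the module. The following is a minimal sketch only: squeeze_whitespace is a hypothetical stand-alone function written for illustration, not the module's actual _s code.

```perl
use strict;
use warnings;

# Hypothetical stand-alone helper; NOT the module's actual _s code.
sub squeeze_whitespace {
    my ($text) = @_;
    $text =~ s/\s+/ /g;    # collapse every whitespace run into one space
    $text =~ s/^ //;       # drop the single leading space, if any
    $text =~ s/ $//;       # drop the single trailing space, if any
    return $text;
}

print squeeze_whitespace("  this   is a\ttext \n"), "\n";
```

Tabs and newlines count as whitespace here, which is an assumption about the intended behaviour.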
Search::Estraier::Document
This class implements Document, which is a single item in Hyper Estraier.
It is a collection of:
- attributes
-
'key' => 'value' pairs which can later be used for filtering of results.
You can add common filters to attrindex in estmaster's _conf file for better performance. See attrindex in Hyper Estraier P2P Guide.
- vectors
-
also 'key' => 'value' pairs
- display text
-
Text which will be used to create searchable corpus of your index and included in snippet output.
- hidden text
-
Text which will be searchable, but will not be included in snippet.
new
Create new document, empty or from draft.
my $doc = new Search::Estraier::Document;
my $doc2 = new Search::Estraier::Document( $draft );
add_attr
Add an attribute.
$doc->add_attr( name => 'value' );
Delete attribute using
$doc->add_attr( name => undef );
add_text
Add a sentence of text.
$doc->add_text('this is example text to display');
add_hidden_text
Add a hidden sentence.
$doc->add_hidden_text('this is example text just for search');
add_vectors
Add vectors
$doc->add_vectors(
'vector_name' => 42,
'another' => 12345,
);
set_score
Set the substitute score
$doc->set_score(12345);
score
Get the substitute score
id
Get the ID number of document. If the object has never been registered, -1 is returned.
print $doc->id;
attr_names
Returns array with attribute names from document object.
my @attrs = $doc->attr_names;
attr
Returns value of an attribute.
my $value = $doc->attr( 'attribute' );
texts
Returns array with text sentences.
my @texts = $doc->texts;
cat_texts
Return whole text as single scalar.
my $text = $doc->cat_texts;
dump_draft
Dump draft data from document object.
print $doc->dump_draft;
delete
Empty document object
$doc->delete;
This function is an addition to the original Ruby API, and since it was included in the C wrappers it's here as a convenience. Document objects which go out of scope will be destroyed automatically.
Search::Estraier::Condition
new
my $cond = new Search::Estraier::Condition;
set_phrase
$cond->set_phrase('search phrase');
add_attr
$cond->add_attr('@URI STRINC /~dpavlin/');
set_order
$cond->set_order('@mdate NUMD');
set_max
$cond->set_max(42);
set_options
$cond->set_options( 'SURE' );
$cond->set_options( qw/AGITO NOIDF SIMPLE/ );
Possible options are:
- SURE
-
check every N-gram
- USUAL
-
check every second N-gram
- FAST
-
check every third N-gram
- AGITO
-
check every fourth N-gram
- NOIDF
-
don't perform TF-IDF tuning
- SIMPLE
-
use simplified query phrase
Skipping N-grams will speed up search, but reduce accuracy. Every call to set_options will reset previous options.
This option changed in version 0.04 of this module. It's backwards compatible.
phrase
Return search phrase.
print $cond->phrase;
order
Return search result order.
print $cond->order;
attrs
Return search result attrs.
my @cond_attrs = $cond->attrs;
max
Return maximum number of results.
print $cond->max;
-1 is returned for an uninitialized value, 0 is unlimited.
options
Return options for this condition.
print $cond->options;
Options are returned in numerical form.
set_skip
Set number of skipped documents from beginning of results
$cond->set_skip(42);
Similar to offset in RDBMS.
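Since skip behaves like an offset and max like a limit, paging through results comes down to simple arithmetic. The following is a minimal sketch: page_window is a hypothetical helper, not part of this module, and it assumes 1-based page numbers.

```perl
use strict;
use warnings;

# Hypothetical helper: translate a 1-based page number into (skip, max).
sub page_window {
    my ($page, $per_page) = @_;
    $page = 1 if $page < 1;    # treat anything below 1 as the first page
    return ( ($page - 1) * $per_page, $per_page );
}

my ($skip, $max) = page_window(3, 10);
print "skip=$skip max=$max\n";

# With a Search::Estraier::Condition object in $cond, the values
# would then be applied as:
#   $cond->set_skip($skip);
#   $cond->set_max($max);
```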
skip
Return skip for this condition.
print $cond->skip;
set_distinct
$cond->set_distinct('@author');
distinct
Return distinct attribute
print $cond->distinct;
set_mask
Filter out some links when searching.
The argument is an array of link numbers, starting with 0 (current node).
$cond->set_mask(qw/0 1 4/);
Search::Estraier::ResultDocument
new
my $rdoc = new Search::Estraier::ResultDocument(
uri => 'http://localhost/document/uri/42',
attrs => {
foo => 1,
bar => 2,
},
snippet => 'this is a text of snippet',
keywords => "this\tare\tkeywords"
);
uri
Return URI of result document
print $rdoc->uri;
attr_names
Returns array with attribute names from result document object.
my @attrs = $rdoc->attr_names;
attr
Returns value of an attribute.
my $value = $rdoc->attr( 'attribute' );
snippet
Return snippet from result document
print $rdoc->snippet;
keywords
Return keywords from result document
print $rdoc->keywords;
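The keywords value is a single string. The following sketch turns it into a list, assuming the plain tab-separated form shown in the constructor example above; real server output may carry additional fields. split_keywords is a hypothetical helper, not part of this module.

```perl
use strict;
use warnings;

# Hypothetical helper; assumes keywords arrive as one tab-separated string.
sub split_keywords {
    my ($keywords) = @_;
    return () unless defined $keywords && length $keywords;
    return split /\t/, $keywords;
}

my @words = split_keywords("this\tare\tkeywords");
print join(', ', @words), "\n";

# With a result document in $rdoc this would be called as:
#   my @words = split_keywords( $rdoc->keywords );
```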
Search::Estraier::NodeResult
new
my $res = new Search::Estraier::NodeResult(
docs => \@array_of_rdocs,
hits => \%hash_with_hints,
);
doc_num
Return number of documents
print $res->doc_num;
This will return the real number of documents (limited by max). If you want to get the total number of hits, see hits.
get_doc
Return single document
my $doc = $res->get_doc( 42 );
Returns undef if document doesn't exist.
hint
Return specific hint from results.
print $res->hint( 'VERSION' );
Possible hints are: VERSION, NODE, HIT, HINT#n, DOCNUM, WORDNUM, TIME, LINK#n, VIEW.
hints
More perlish version of hint. This one returns a hash.
my %hints = $res->hints;
hits
Syntactic sugar for total number of hits for this query
print $res->hits;
It's same as
print $res->hint('HIT');
but shorter.
Search::Estraier::Node
new
my $node = new Search::Estraier::Node;
or optionally with url as parameter
my $node = new Search::Estraier::Node( 'http://localhost:1978/node/test' );
or in more verbose form
my $node = new Search::Estraier::Node(
url => 'http://localhost:1978/node/test',
user => 'admin',
passwd => 'admin',
create => 1,
label => 'optional node label',
debug => 1,
croak_on_error => 1
);
with following arguments:
- url
-
URL to node
- user
-
specify username for node server authentication
- passwd
-
password for authentication
- create
-
create node if it doesn't exist
- label
-
optional label for new node if create is used
- debug
-
dumps a lot of debugging output
- croak_on_error
-
very helpful during development. It will croak on all errors instead of silently returning -1 (which is the convention of Hyper Estraier API in other languages).
set_url
Specify URL to node server
$node->set_url('http://localhost:1978');
set_proxy
Specify proxy server to connect to node server
$node->set_proxy('proxy.example.com', 8080);
set_timeout
Specify timeout of connection in seconds
$node->set_timeout( 15 );
set_auth
Specify name and password for authentication to node server.
$node->set_auth('clint','eastwood');
status
Return status code of last request.
print $node->status;
-1 means connection failure.
put_doc
Add a document
$node->put_doc( $document_draft ) or die "can't add document";
Return true on success or false on failure.
out_doc
Remove a document
$node->out_doc( document_id ) or die "can't remove document";
Return true on success or false on failure.
out_doc_by_uri
Remove a registered document using its URI
$node->out_doc_by_uri( 'file:///document/uri/42' ) or die "can't remove document";
Return true on success or false on failure.
edit_doc
Edit attributes of a document
$node->edit_doc( $document_draft ) or die "can't edit document";
Return true on success or false on failure.
get_doc
Retrieve document
my $doc = $node->get_doc( document_id ) or die "can't get document";
Returns a Search::Estraier::Document object on success or false on failure.
get_doc_by_uri
Retrieve document by URI
my $doc = $node->get_doc_by_uri( 'file:///document/uri/42' ) or die "can't get document";
Returns a Search::Estraier::Document object on success or false on failure.
get_doc_attr
Retrieve the value of an attribute from a document
my $val = $node->get_doc_attr( document_id, 'attribute_name' ) or
die "can't get document attribute";
get_doc_attr_by_uri
Retrieve the value of an attribute from a document specified by URI
my $val = $node->get_doc_attr_by_uri( 'file:///document/uri/42', 'attribute_name' ) or
die "can't get document attribute";
etch_doc
Extract document keywords
my $keywords = $node->etch_doc( document_id ) or die "can't etch document";
etch_doc_by_uri
Extract document keywords by URI
my $keywords = $node->etch_doc_by_uri( 'file:///document/uri/42' ) or die "can't etch document";
Returns the keywords on success or false on failure.
uri_to_id
Get ID of document specified by URI
my $id = $node->uri_to_id( 'file:///document/uri/42' );
This method won't croak, even if using croak_on_error.
_fetch_doc
Private function used to implement get_doc, get_doc_by_uri, etch_doc, etch_doc_by_uri.
# this will decode received draft into Search::Estraier::Document object
my $doc = $node->_fetch_doc( id => 42 );
my $doc = $node->_fetch_doc( uri => 'file:///document/uri/42' );
# to extract keywords, add etch
my $doc = $node->_fetch_doc( id => 42, etch => 1 );
my $doc = $node->_fetch_doc( uri => 'file:///document/uri/42', etch => 1 );
# to get document attribute add attr
my $doc = $node->_fetch_doc( id => 42, attr => '@mdate' );
my $doc = $node->_fetch_doc( uri => 'file:///document/uri/42', attr => '@mdate' );
# more general form which allows implementation of
# uri_to_id
my $id = $node->_fetch_doc(
uri => 'file:///document/uri/42',
path => '/uri_to_id',
chomp_resbody => 1
);
name
my $node_name = $node->name;
label
my $node_label = $node->label;
doc_num
my $documents_in_node = $node->doc_num;
word_num
my $words_in_node = $node->word_num;
size
my $node_size = $node->size;
search
Search documents which match condition
my $nres = $node->search( $cond, $depth );
$cond is a Search::Estraier::Condition object, while $depth specifies depth for meta search.
Function returns a Search::Estraier::NodeResult object.
cond_to_query
Return URI encoded string generated from Search::Estraier::Condition
my $args = $node->cond_to_query( $cond, $depth );
shuttle_url
This is a method which uses LWP::UserAgent to communicate with the Hyper Estraier node master.
my $rv = shuttle_url( $url, $content_type, $req_body, \$resbody );
The $resheads and $resbody booleans control whether response headers and/or response body will be saved within the object.
set_snippet_width
Set width of snippets in results
$node->set_snippet_width( $wwidth, $hwidth, $awidth );
$wwidth specifies the whole width of the snippet. It's 480 by default. If it's 0, the snippet is not sent with results. If it is negative, the whole document text is sent instead of the snippet.
$hwidth specifies the width of strings from the beginning of the text. Default value is 96. A negative or zero value keeps the previous value.
$awidth specifies the width of strings around each highlighted word. It's 96 by default. If a negative or zero value is provided, the previous value is kept unchanged.
set_user
Manage users of node
$node->set_user( 'name', $mode );
$mode can be one of:
- 0
-
delete account
- 1
-
set administrative right for user
- 2
-
set user account as guest
Return true on success, otherwise false.
set_link
Manage node links
$node->set_link('http://localhost:1978/node/another', 'another node label', $credit);
If $credit is negative, the link is removed.
admins
my @admins = @{ $node->admins };
Return array of users with admin rights on node
guests
my @guests = @{ $node->guests };
Return array of users with guest rights on node
links
my @links = @{ $node->links };
Return array of links for this node
cacheusage
Return cache usage for a node
my $cache = $node->cacheusage;
master
Set actions on Hyper Estraier node master (estmaster process)
$node->master(
action => 'sync'
);
All available actions are documented in http://hyperestraier.sourceforge.net/nguide-en.html#protocol
PRIVATE METHODS
You could call those directly, but you don't have to. I hope.
_set_info
Set information for node
$node->_set_info;
_clear_info
Clear information for node
$node->_clear_info;
On the next call to name, label, doc_num, word_num or size, node info will be fetched again from Hyper Estraier.
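The fetch-once, clear, fetch-again behaviour described above can be sketched with a stand-in class. Everything here is hypothetical and for illustration only: LazyInfo, its fetcher callback and its clear_info method are not part of this module, and no Hyper Estraier server is contacted.

```perl
use strict;
use warnings;

package LazyInfo;    # hypothetical class, for illustration only

sub new {
    my ($class, $fetcher) = @_;
    return bless { fetcher => $fetcher, info => undef }, $class;
}

# Accessor: fetch on first use, then serve the cached copy.
sub name {
    my ($self) = @_;
    $self->{info} = $self->{fetcher}->() unless defined $self->{info};
    return $self->{info}{name};
}

# Drop the cached copy so the next accessor call fetches again.
sub clear_info {
    my ($self) = @_;
    $self->{info} = undef;
    return;
}

package main;

my $fetches = 0;
my $obj = LazyInfo->new( sub { $fetches++; return { name => 'test' } } );

$obj->name for 1 .. 3;    # one fetch, then two cache hits
$obj->clear_info;
$obj->name;               # cache was cleared, so this fetches again

print "fetches: $fetches\n";
```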
EXPORT
Nothing.
SEE ALSO
http://hyperestraier.sourceforge.net/
Hyper Estraier Ruby interface on which this module is based.
Hyper Estraier now also has a pure-perl binding included in its distribution. It's a faster way to access databases directly if you are not running the estmaster P2P server.
AUTHOR
Dobrica Pavlinusic, <dpavlin@rot13.org>
Robert Klep <robert@klep.name> contributed refactored search code
COPYRIGHT AND LICENSE
Copyright (C) 2005-2006 by Dobrica Pavlinusic
This library is free software; you can redistribute it and/or modify it under the GPL v2 or later.