NAME
Search::Brick::RAM::Query - Perl interface for the RAMBrick search thingie https://github.com/jackdoe/brick/tree/master/bricks/RAMBrick
SYNOPSIS
use Search::Brick::RAM::Query qw(query true);
my @results = query(host => ['127.0.0.1:9000'],
settings => { log_slower_than => 100, explain => true }, # log queries slower than 100ms
request => { term => { "title" => "book","author" => "john" } },
brick => 'RAM',
action => 'search',
timeout => 0.5 # in seconds - this is total timeout
# connect time + write time + read time
);
#shortcuts are available using the search object:
my $s = Search::Brick::RAM::Query->new(host => '127.0.0.1:9000', index => '__test__');
my @results = $s->search({ term => { author => 'jack' } }, { size => 20, explain => true, log_slower_than => 5 });
$s->delete(); # deletes the __test__ index
DESCRIPTION
minimalistic interface to RAMBrick (a minimalistic Lucene wrapper)
FUNCTION: Search::Brick::RAM::Query::query( $args :Hash ) :Array
Search::Brick::RAM::Query::query has the following parameters:
host => ['127.0.0.1:9000'],
request => { term => { "title" => "book","author" => "john" } },
brick => 'RAM',
action => 'search', # (search|store|alias|load)
index => '...', # name of your index
timeout => 0.5, # in seconds
settings => {} # used by different actions to get context on the request
- timeout: the whole round trip timeout (connect time + write time + read time).
- brick: the brick's name (in this case 'RAM').
- action: the requested action (search|store|stat).
- request: must be a reference to an array of hashrefs.
- index: the action argument; it provides context to the request (usually the index name).
- settings: RAMBrick requires settings to be sent (by default they are empty). These are things like size or items_per_group, for example: { size => 5, explain => true }
- host: a string or an arrayref of strings. The same request is sent to all hosts in the list (the whole thing is async, so it will be as slow as the slowest host) and an un-ordered array of results is returned.
If the result is not a ref(), or there is any kind of error (including timeout), the request will die.
Query object
you can also use the module by creating a Query object, such as:
my $s = Search::Brick::RAM::Query->new(host => '127.0.0.1:9000', index => '__test__');
my @results = $s->search({ term => { author => 'jack' } }, { size => 20, explain => true, log_slower_than => 5 });
Simple wrapper making query() calls, but without having to pass index, brick, host every time.
- search($query,$settings,$timeout || DEFAULT_TIMEOUT)
- index($data,$settings,$timeout || DEFAULT_TIMEOUT)
- alias($settings,$timeout || DEFAULT_TIMEOUT)
- stat($timeout || DEFAULT_TIMEOUT)
- delete($timeout || DEFAULT_TIMEOUT)
all of the above just make query() calls with the selected parameters, but also pass brick => 'RAM', index => '__test__' (in this example) and host => [ '127.0.0.1:9000' ]
multiple hosts
each request can be sent to more than 1 host, the exact same copy of the request is sent asynchronously to all of them, after all copies have been sent, we asynchronously wait for the results. So the whole request is as slow as the slowest host.
result
the results are un-sorted, so if the request is for hosts 127.0.0.1:9000, 10.0.0.2:9000 the result array might be ([arrayref of the result of 10.0.0.2:9000],[arrayref of the result of 127.0.0.1:9000]).
Every result object is always an array, even if the request has only 1 host.
timeout
the timeout is the total timeout, including connect time + send time + receive time for all the hosts combined. So when you say 0.5 (half a second) and give it 20 hosts, the moment it reaches 0.5 seconds, it will die, regardless of how much data it received, or how much data it is about to send.
error
there is no error object, the module dies on every error (including timeouts, query syntax errors etc..).
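Because every failure surfaces as an exception, a caller that wants to keep running has to trap it. The following is a minimal sketch, not part of the module: try_query is a hypothetical helper, and the code block passed to it is a stand-in that simply dies the way a timed-out request would.

```perl
use strict;
use warnings;

# try_query is a hypothetical helper, not part of Search::Brick::RAM::Query.
# It runs the given code block and returns (results_arrayref, undef) on
# success, or (undef, error_message) if the block died.
sub try_query {
    my ($code) = @_;
    my @results;
    my $ok = eval { @results = $code->(); 1 };
    return (undef, $@ || 'unknown error') unless $ok;
    return (\@results, undef);
}

# Stand-in for a real call such as $s->search(...); it simply dies.
my ($results, $error) = try_query(sub { die "timeout\n" });
print defined $error
    ? "query failed: $error"
    : "got " . scalar(@$results) . " result(s)\n";
```

In real use the code block would wrap a query() or $s->search(...) call; the text of the error message depends on the module and the server and is not specified here.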
SEARCH
my $s = Search::Brick::RAM::Query->new(host => '127.0.0.1:9000', index => '__test__');
my @results = $s->search({ term => { author => 'jack' } }, { size => 20, explain => true, log_slower_than => 5 });
QUERY SYNTAX
boosting
every query except term and lucene supports a boost parameter, like:
bool => { must => [ { term => { author => 'jack' } } ], boost => 4.3 }
term
creates http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/TermQuery.html
syntax:
term => { field => value }
example:
my @results = $s->search({ term => { author => 'jack' } });
since RAMBrick does not do any query rewrites (like ElasticSearch's match query) and it also does not do any kind of analysis on the query string, the term and lucene queries are the only queries that can be used to match specific documents.
lucene
syntax:
lucene => "author:john"
example:
my @result = $s->search({ lucene => "ConstantScore(author:john)^8273" });
you can do pretty much everything with it; in this example it creates a constant score query over a term query
dis_max
creates https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/DisjunctionMaxQuery.html
syntax:
dis_max => {
queries => [
{ term => { author => 'jack' } },
{ term => { category => 'comedy' } }
],
boost => 1.0,
tie_breaker => 0.3
}
bool
creates: https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/BooleanQuery.html
syntax:
bool => {
must => [
{ term => { author => 'jack' } },
],
should => [
{ term => { category => 'drama' } },
{ term => { category => 'comedy' } }
],
must_not => [
{ term => { deleted => 'true' } }
],
minimum_should_match => 1,
boost => 1.0
}
constant_score
creates https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/ConstantScoreQuery.html
syntax:
constant_score => { query => ..., boost => 4.0 }
filtered
creates https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/QueryWrapperFilter.html or https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/CachingWrapperFilter.html
syntax:
filtered => { query => ..., filter => { query.. } }
filtered => { query => ..., cached_filter => { query.. } }
example:
$s->search(
{
filtered => {
query => { match_all => {} },
cached_filter => { term => { author => "jack" } }
}
})
match_all
creates http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/MatchAllDocsQuery.html
syntax:
match_all => {}
custom_score
syntax:
{
custom_score => {
query => {
term => { author => "jack" }
},
class => 'bz.brick.RAMBrick.BoosterQuery',
params => {}
}
}
will create an instance of BoosterQuery, with "Map<String,Map<String,String>>" params.
look at https://github.com/jackdoe/brick/blob/master/bricks/RAMBrick/queries/BoosterQuery.java for a simple example
span_first
creates http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/spans/SpanFirstQuery.html
syntax:
{
span_first => {
match => {
span_term => { author => "doe" }
},
end => 1
}
}
matches the term "doe" in the first "end" positions of the field
more detailed info on the span queries: http://searchhub.org/2009/07/18/the-spanquery/
span_near
creates http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/spans/SpanNearQuery.html
syntax: { span_near => { clauses => [ { span_term => { author => 'jack' } }, { span_term => { author => 'doe' } } ], slop => 0, in_order => true() } }
more detailed info on the span queries: http://searchhub.org/2009/07/18/the-spanquery/
example:
{
span_near => {
clauses => [
{
span_near => {
clauses => [
{ span_term => { author => 'jack' } },
{ span_term => { author => 'doe' } }
],
slop => 1
},
},
{
span_term => { author => 'go!' }
}
],
slop => 0
}
}
span_term
creates http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/spans/SpanTermQuery.html
syntax: span_term => { field => value }
span_term queries are the building block of all span queries
QUERY SETTINGS
log_slower_than
syntax:
log_slower_than => 5
example:
my @result = $s->search({ term => { author => 'jack' } },{ log_slower_than => 5 });
if the query takes more than 5 milliseconds, in the return object there will be a query key which will contain the actual query. in this case it will look like this:
{
hits => [],
took => 6,
query => "author:jack"
}
explain
syntax: explain => true()
example: my @result = $s->search({ term => { author => 'jack' } },{ explain => true() });
will fill '__explain' field in each document, like this:
{
hits => [
{
...
'__score' => '4.19903135299683',
'__explain' => '4.1990314 = (MATCH) weight(author:jack in 19) [BoostSimilarity], result of:
4.1990314 = <3.2> value of f_boost field + subscorer
0.9990314 = score(doc=19,freq=1.0 = termFreq=1.0), product of:
0.9995156 = queryWeight, product of:
0.9995156 = idf(docFreq=2064, maxDocs=2064)
1.0 = queryNorm
0.9995156 = fieldWeight in 19, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
0.9995156 = idf(docFreq=2064, maxDocs=2064)
1.0 = fieldNorm(doc=19)
',
'author' => 'jack',
...
}
],
took => 6,
}
dump_query
syntax:
dump_query => true()
example:
my @result = $s->search({ term => { author => 'jack' } },{ dump_query => true() });
will return the actual query.toString() in the result structure:
[
{
'took' => 109,
'query' => 'author:jack',
'hits' => [ {},{}... ]
}
]
INDEX
all the indexes are created from messagepack'ed streams, following the same protocol:
mapping + settings
data
example:
my $settings = {
mapping => {
author => {
type => "string", # "string|int|long|double|float",
index => true(),
store => true(),
omit_norms => false(),
store_term_vector_offsets => false(),
store_term_vector_positions => false(),
store_term_vector_payloads => false(),
store_term_vectors => false(),
tokenized => true()
},
group_by => {
type => "string",
index => true(),
store => false()
}
},
settings => {
expected => 2, # number of expected documents from the data-part
shards => 4, # how many shards will be created
# (in this example each shard will have 1/4th of the data)
similarity => "bz.brick.RAMBrick.IgnoreIDFSimilarity", # can use:
# org.apache.lucene.search.similarities.BM25Similarity
# org.apache.lucene.search.similarities.DefaultSimilarity
# etc..
expect_array => true(),
store => "/var/lib/brick/ram/primary_index_20022014" # it will actually create lucene index there
# and next time it tries to autoload the file
# it will just check if number of documents
# match.
}
};
check out the field type options from: https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/document/FieldType.html
similarity information: https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/similarities/Similarity.html
the data can be in array format, or just concatenated documents joined by '' (depending on the expect_array setting)
the store option tells RAMBrick to store the lucene indexes on disk (instead of using http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/store/RAMDirectory.html); just specify the directory name in the index's settings (it MUST be somewhere within the RAMBRICK_AUTO_LOAD_ROOT).
the structure in our $settings example will look like:
/var/lib/brick/ram/primary_index_20022014/SHARD_0
/var/lib/brick/ram/primary_index_20022014/SHARD_1
/var/lib/brick/ram/primary_index_20022014/SHARD_2
/var/lib/brick/ram/primary_index_20022014/SHARD_3
each of those directories will contain the lucene index (using http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/store/NIOFSDirectory.html)
In case the number of expected documents does not match the number of documents indexed, it will not create the index.
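The directory layout above follows directly from the store and shards settings. As an illustration only, the sketch below lists the expected shard directories and refuses a store path outside the autoload root. shard_dirs is a hypothetical helper, and its prefix check is a naive string comparison that may not match how RAMBrick itself validates the path.

```perl
use strict;
use warnings;

# shard_dirs is a hypothetical helper for illustration only. It lists the
# per-shard directories described above for a given 'store' path and
# 'shards' count, and dies if the store path is outside the autoload root.
sub shard_dirs {
    my ($root, $store, $shards) = @_;
    (my $prefix = $root) =~ s{/*$}{/};    # normalise to one trailing slash
    die "store '$store' is not within '$root'\n"
        unless index($store, $prefix) == 0;
    return map { "$store/SHARD_$_" } 0 .. $shards - 1;
}

# Values taken from the $settings example above.
print "$_\n"
    for shard_dirs('/var/lib/brick/ram/',
                   '/var/lib/brick/ram/primary_index_20022014', 4);
```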
- file backed: You can create .msgpack files by concatenating the settings with the data, and just putting them in the RAMBRICK_AUTO_LOAD_ROOT directory (by default /var/lib/brick/ram/; start brick with the RAMBRICK_AUTO_LOAD_ROOT env variable set to wherever you want). Those indexes can also be stored as a lucene index (if the store setting points to a directory within the RAMBRICK_AUTO_LOAD_ROOT).
- online:
my $s = Search::Brick::RAM::Query->new(host => '127.0.0.1:9000', index => '__test__');
$s->index([{ author => 'jack', group_by => "23" },{ author => 'jack', group_by => "24" }],$settings);
this will just send one blob of data to ram:store:__test__, which will be rerouted to RAMBrick.store('__test__',unpacker), and the next portion of data will be in the format <settings><data...>. This index will be lost after the brick server is restarted.
- delete: when you delete an index, it will delete the autoload .msgpack file + the stored directory (containing all the lucene indexes)
my $s = Search::Brick::RAM::Query->new(host => '127.0.0.1:9000', index => '__test__');
$s->delete();
ALIAS
The aliases state is kept in a small metadata file named $RAMBRICK_AUTO_LOAD_ROOT/alias.metadata; every time an alias is modified/created/deleted it will update this file. This file is loaded into the alias hash at init time.
my $s = Search::Brick::RAM::Query->new(host => '127.0.0.1:9000', index => '__test__');
$s->alias({
add => [ { "some_alias_name" => "some_index_name" } ],
delete => [ "some_alias","some_other_alias" ]
});
aliases are atomic, meaning that the whole request will be executed in one go. the order of the add and delete operations is undefined, but the order within each operation is honored.
STAT
my $s = Search::Brick::RAM::Query->new(host => '127.0.0.1:9000', index => '__test__');
print Dumper([$s->stat()]);
will produce:
[
{
'index.default.groups' => 20,
'index.default.searches' => 4,
'index.default.documents' => 9999,
'index.default.last_query_stamp' => 1391416081,
'java.non_heap.usage.used' => 12678192,
'java.heap.usage.init' => 31138368,
'java.heap.usage.committed' => 91226112,
'java.heap.usage.used' => 47775336,
'java.non_heap.usage.init' => 24576000,
'java.heap.usage.max' => 620756992,
'java.non_heap.usage.max' => 224395264,
'java.non_heap.usage.committed' => 24576000,
'main.connection_pool.size' => 12,
'main.connection_pool.active' => 1,
'brick.search_pool.size' => 15,
'brick.search_pool.active' => 0,
'brick.time_indexing' => 3,
'brick.uptime' => 212,
'brick.time_searching' => 0,
'brick.searches' => 8,
}
]
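The stat output is a flat hash, so deriving simple health numbers from it is straightforward. The sketch below computes JVM heap usage as a percentage. heap_usage_percent is a hypothetical helper, and the sample values are copied from the output above instead of coming from a live server.

```perl
use strict;
use warnings;

# heap_usage_percent is a hypothetical helper; the key names are taken from
# the sample stat() output above. Returns undef if the max is missing or 0.
sub heap_usage_percent {
    my ($stat) = @_;
    my $max = $stat->{'java.heap.usage.max'};
    return $max ? 100 * $stat->{'java.heap.usage.used'} / $max : undef;
}

# Sample data copied from the output above; a real call would be $s->stat().
my @stats = ({
    'java.heap.usage.used' => 47775336,
    'java.heap.usage.max'  => 620756992,
});

printf "heap used: %.1f%%\n", heap_usage_percent($_) for @stats;
```

Since stat() returns one entry per host, the same loop works unchanged when several hosts are configured.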
EXAMPLES:
at the moment it looks like this: sister queries are joined by a BooleanQuery with a MUST clause, for example:
{
term => { "author" => "john" },
dis_max => {
queries => [
{ term => { "category" => "horror" } },
{ term => { "category" => "comedy" } }
],
tie_breaker => 0.2
}
},
will generate: +(category:horror | category:comedy)~0.2 +author:john
different example:
query(
host => ['127.0.0.1:9000','127.0.0.1:9000'],
request => {
term => { "author" => "john" } ,
dis_max => {
queries => [
{ term => { "category" => "horror" } },
{ term => { "category" => "comedy" } }
],
tie_breaker => 0.2
},
bool => {
must => [
{ lucene => "isbn:74623783 AND year:[20021201 TO 200402302]" }
],
should => [
{ lucene => "cover:(red OR blue)" },
{ term => { old => "0" } }
],
must_not => [
{ term => { deleted => "1" } }
]
}
},
brick => 'RAM',
timeout => 10,
settings => { "dump_query" => "true" });
generates: +((+(+isbn:74623783 +year:[20021201 TO 200402302]) -deleted:1 (cover:red cover:blue) old:0)~1) +(category:horror | category:comedy)~0.2 +author:john
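Since sister queries are joined with MUST, the same intent can be written out as an explicit bool query, which avoids relying on the implicit rule. The sketch below does that rewrite on the client side. explicit_must is a hypothetical helper, and whether RAMBrick generates an identical Lucene query for both forms is an assumption that the documentation above does not confirm.

```perl
use strict;
use warnings;

# explicit_must is a hypothetical helper, not part of the module. It rewrites
# a request with several sibling queries into a single bool query whose
# 'must' list holds one clause per sibling (sorted by key for stable output).
sub explicit_must {
    my ($request) = @_;
    my @clauses = map { +{ $_ => $request->{$_} } } sort keys %$request;
    return { bool => { must => \@clauses } };
}

my $query = explicit_must({
    term    => { author => 'john' },
    dis_max => {
        queries => [
            { term => { category => 'horror' } },
            { term => { category => 'comedy' } },
        ],
        tie_breaker => 0.2,
    },
});

# Show which query types ended up as must clauses.
print join(', ', map { keys %$_ } @{ $query->{bool}{must} }), "\n";
# prints "dis_max, term"
```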
another example:
my $b = Search::Brick::RAM::Query->new(index => '__test__');
$b->delete();
my $settings = {
mapping => {
author => {
type => "string",
index => Data::MessagePack::true(),
store => Data::MessagePack::true(),
},
f_boost => {
type => "float",
index => Data::MessagePack::true(),
store => Data::MessagePack::true(),
},
group_by => {
type => "string",
index => Data::MessagePack::true(),
store => Data::MessagePack::true()
}
},
settings => {
expected => 2,
shards => 1
}
};
$b->index([
{ author => 'jack', group_by => "23", f_boost => 0.5 },
{ author => 'john', group_by => "24",f_boost => 0.5 }
],$settings);
my @result = $b->search({ lucene => "ConstantScore(author:john)^8273" },
{ log_slower_than => "1",explain => "true" } );
$VAR1 = [{
'took' => 4,
'hits' => [
{
'__score' => '8273.5',
'f_boost' => '0.5',
'__group_index' => '0',
'group_by' => '24',
'__explain' => '8273.5 = (MATCH) sum of:
8273.5 = (MATCH) weight(author:john^8273.0 in 1) [BoostSimilarity], result of:
8273.5 = <0.5> value of f_boost field + subscorer
8273.0 = score(doc=1,freq=1.0 = termFreq=1.0), product of:
8273.0 = queryWeight, product of:
8273.0 = boost
1.0 = idf(docFreq=1, maxDocs=2)
1.0 = queryNorm
1.0 = fieldWeight in 1, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
1.0 = idf(docFreq=1, maxDocs=2)
1.0 = fieldNorm(doc=1)
',
'author' => 'john'
}
],
'query' => '__no_default_field__:ConstantScore author:john^8273.0'
}];
as you can see the return structure is [{},{},{}], one result per host the request was sent to (for example if we do $b = Search::Brick::RAM::Query->new(host => [ '127.0.0.1:9000','127.0.0.1:9000' ]) there will be [{hits => []},{hits => []}] in the output)
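When several hosts answer, the caller receives one result hash per host and usually wants a single ranked list. The sketch below flattens the hits and sorts them by __score. merge_hits is a hypothetical helper, the sample data reuses scores from the outputs shown earlier, and it assumes each host holds different documents (identical hosts would produce duplicates).

```perl
use strict;
use warnings;

# merge_hits is a hypothetical helper, not part of the module. It flattens
# the 'hits' of every per-host result and sorts them by '__score',
# highest score first.
sub merge_hits {
    my @results = @_;
    my @hits    = map { @{ $_->{hits} || [] } } @results;
    my @sorted  = sort { $b->{__score} <=> $a->{__score} } @hits;
    return @sorted;
}

# Stand-in for the return value of $s->search(...) against two hosts.
my @results = (
    { took => 6, hits => [ { author => 'jack', __score => '4.19903135299683' } ] },
    { took => 4, hits => [ { author => 'john', __score => '8273.5' } ] },
);

print "$_->{author} $_->{__score}\n" for merge_hits(@results);
```

Scores are only comparable across hosts if every index uses the same similarity and boosts, so treat the merged order as a rough ranking.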
SEE ALSO
lucene: http://lucene.apache.org/core/4_6_0/
brick: https://github.com/jackdoe/brick
AUTHOR
Borislav Nikolov, <jack@sofialondonmoskva.com>
COPYRIGHT AND LICENSE
Copyright (C) 2014 by Borislav Nikolov
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.18.2 or, at your option, any later version of Perl 5 you may have available.