NAME

Search::Brick::RAM::Query - Perl interface for RAMBrick search (https://github.com/jackdoe/brick/tree/master/bricks/RAMBrick)

SYNOPSIS

use Search::Brick::RAM::Query qw(query true);
my @results = query(host    => ['127.0.0.1:9000'],
                    settings => { log_slower_than => 100, explain => true }, # log queries slower than 100ms
                    request  => { term => { "title" => "book","author" => "john" } },
                    brick    => 'RAM',
                    action   => 'search',
                    timeout  => 0.5 # in seconds - this is total timeout
                                    # connect time + write time + read time
              );

#shortcuts are available using the search object:
my $s = Search::Brick::RAM::Query->new(host => '127.0.0.1:9000', index => '__test__');
my @results = $s->search({ term => { author => 'jack' } }, { size => 20, explain => true, log_slower_than => 5 });
$s->delete(); # deletes the __test__ index

DESCRIPTION

minimalistic interface to RAMBrick (a minimalistic Lucene wrapper)

FUNCTION: Search::Brick::RAM::Query::query( $args :Hash ) :Array

Search::Brick::RAM::Query::query has the following parameters:

host    => ['127.0.0.1:9000'],
request => { term => { "title" => "book","author" => "john" } },
brick   => 'RAM',
action  => 'search', # (search|store|alias|load)
index   => '...',    # name of your index
timeout => 0.5,      # in seconds
settings => {}       # used by different actions to get context on the request

timeout

timeout is the whole round trip timeout: (connect time + write time + read time).

brick

brick is the brick's name (in this case 'RAM')

action

action is the requested action, for example search, store, alias or stat

request

request must be a reference to an array of hashrefs

index

index is the action's argument; it provides context for the request (usually the index name)

settings

RAMBrick requires settings to be sent with every request (by default they are empty). They carry things like size or items_per_group, for example: { size => 5, explain => true }

host

host is a string or an arrayref of strings. The same request is sent to all hosts in the list (the whole thing is async, so it will be as slow as the slowest host) and an un-ordered array of results is returned.

If a result is not a reference, or there is any kind of error (including a timeout), the request will die.

Query object

you can also use the module by creating a Query object, such as:

my $s = Search::Brick::RAM::Query->new(host => '127.0.0.1:9000', index => '__test__');
my @results = $s->search({ term => { author => 'jack' } }, { size => 20, explain => true, log_slower_than => 5 });

The object is a simple wrapper that makes query() calls without you having to pass index, brick and host every time.

search($query,$settings,$timeout || DEFAULT_TIMEOUT)
index($data,$settings,$timeout || DEFAULT_TIMEOUT)
alias($settings,$timeout || DEFAULT_TIMEOUT)
stat($timeout || DEFAULT_TIMEOUT)
delete($timeout || DEFAULT_TIMEOUT)

all of the above just make query() calls with the selected parameters, also passing brick => 'RAM', index => '__test__' (in this example) and host => [ '127.0.0.1:9000' ]

multiple hosts

each request can be sent to more than 1 host. The exact same copy of the request is sent asynchronously to all of them, and after all copies have been sent, we asynchronously wait for the results. So the whole request is as slow as the slowest host.

result

the results are un-sorted, so if the request is for hosts 127.0.0.1:9000 and 10.0.0.2:9000 the result array might be ([arrayref of the result of 10.0.0.2:9000],[arrayref of the result of 127.0.0.1:9000]).

The result is always an array, even if the request has only 1 host.
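Because the per-host results come back in no particular order, a caller that wants one ranked list has to merge them itself. The following is a minimal sketch, not part of the module: merge_hits is a hypothetical helper, and it assumes each per-host result is a hashref with a hits key and that every hit carries a '__score' key, as in the dumps later in this document. The data is made up.

```perl
use strict;
use warnings;

# Flatten the hits of all per-host results and sort them by '__score',
# highest first. The order in which the hosts answered does not matter.
# Hypothetical helper, not provided by Search::Brick::RAM::Query.
sub merge_hits {
    my (@results) = @_;
    my @hits = map { @{ $_->{hits} || [] } } @results;
    return sort { $b->{__score} <=> $a->{__score} } @hits;
}

# Made-up results: one hashref per host.
my @results = (
    { took => 3, hits => [ { author => 'jack', __score => '1.5' } ] },
    { took => 5, hits => [ { author => 'john', __score => '4.2' },
                           { author => 'jane', __score => '0.7' } ] },
);

my @merged = merge_hits(@results);
print join(', ', map { "$_->{author}=$_->{__score}" } @merged), "\n";
```

Scores from different hosts are only comparable if the hosts hold indexes built with the same similarity and comparable statistics, so treat the merged order as a rough ranking.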

timeout

the timeout is the total timeout, including connect time + send time + receive time for all the hosts combined. So when you say 0.5 (half a second) and give it 20 hosts, the moment it reaches 0.5 seconds, it will die, regardless of how much data it received, or how much data it is about to send.
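One way to picture a total timeout is a single deadline computed up front, with every phase only allowed to use what is left of it. The sketch below illustrates that idea only; it is not the module's implementation, and remaining is a hypothetical helper.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Return the seconds left until $deadline, never less than 0.
# Hypothetical helper used only to illustrate a shared time budget.
sub remaining {
    my ($deadline, $now) = @_;
    my $left = $deadline - $now;
    return $left > 0 ? $left : 0;
}

my $timeout  = 0.5;                # total budget, as passed to query()
my $start    = time();
my $deadline = $start + $timeout;  # one deadline for connect + write + read

# Each phase may only use what is left of the budget.
printf "left at start:   %.2f\n", remaining($deadline, $start);
printf "left after 0.3s: %.2f\n", remaining($deadline, $start + 0.3);
printf "left after 0.6s: %.2f\n", remaining($deadline, $start + 0.6);
```

With 20 hosts the budget is still the same 0.5 seconds, so a single slow host can use it up for all of them.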

error

there is no error object, the module dies on every error (including timeouts, query syntax errors, etc.).
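Since every failure is reported with die, callers that want to survive a failed request need to trap it with eval. A minimal sketch follows; try_query is a hypothetical helper, and the code refs stand in for real calls such as $s->search(...) so that the example runs without a server.

```perl
use strict;
use warnings;

# Run a code ref that may die and return (\@results, $error).
# $error is undef on success. Hypothetical helper, not part of the module.
sub try_query {
    my ($code) = @_;
    my @results;
    my $ok = eval { @results = $code->(); 1 };
    return (\@results, undef) if $ok;
    my $error = $@ || 'unknown error';
    return ([], $error);
}

# Stand-in for a search call that fails the way a timeout would.
my ($results, $error) = try_query(sub { die "timeout\n" });
print "search failed: $error" if defined $error;

# A call that succeeds passes its results through unchanged.
my ($ok_results, $ok_error) = try_query(sub { return ({ hits => [], took => 1 }) });
print "got ", scalar(@$ok_results), " result(s)\n" unless defined $ok_error;
```

In real use the code ref would be something like sub { $s->search($query, $settings) }.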

SEARCH

my $s = Search::Brick::RAM::Query->new(host => '127.0.0.1:9000', index => '__test__');
my @results = $s->search({ term => { author => 'jack' } }, { size => 20, explain => true, log_slower_than => 5 });

QUERY SYNTAX

boosting

every query except term and lucene supports a boost parameter, like:

bool => { must => [ { term => { author => 'jack' } } ], boost => 4.3 }

term

creates http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/TermQuery.html

syntax:

term => { field => value }

example:

my @results = $s->search({ term => { author => 'jack' } });

since RAMBrick does not do any query rewrites (like ElasticSearch's match query) and it also does not do any kind of analysis on the query string, the term and lucene queries are the only queries that can be used to match specific documents.

lucene

creates http://lucene.apache.org/core/4_6_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description

syntax:

lucene => "author:john"

example:

my @result = $s->search({ lucene => "ConstantScore(author:john)^8273" });

you can do pretty much everything with it; in this example it creates a constant score query over a term query

dis_max

creates https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/DisjunctionMaxQuery.html

syntax:

dis_max => {
    queries => [ 
        { term => { author => 'jack' } },
        { term => { category => 'comedy' } } 
    ],
    boost => 1.0,
    tie_breaker => 0.3
}

bool

creates: https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/BooleanQuery.html

syntax:

bool => {
    must => [
        { term => { author => 'jack' } },
    ],
    should => [
        { term => { category => 'drama' } },
        { term => { category => 'comedy' } }
    ],
    must_not => [
        { term => { deleted => 'true' } }
    ],
    minimum_should_match => 1,
    boost => 1.0
}
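When bool queries are assembled from user input, a small builder keeps the structure consistent. This is a sketch only: bool_query is a hypothetical helper, and it leaves out empty sections, which matches the shorter bool example in the boosting section.

```perl
use strict;
use warnings;

# Build a bool query from named clause lists, leaving out empty sections.
# Hypothetical helper, not provided by Search::Brick::RAM::Query.
sub bool_query {
    my (%args) = @_;
    my %bool;
    for my $section (qw(must should must_not)) {
        my $clauses = $args{$section} || [];
        $bool{$section} = [@$clauses] if @$clauses;
    }
    for my $option (qw(minimum_should_match boost)) {
        $bool{$option} = $args{$option} if defined $args{$option};
    }
    return { bool => \%bool };
}

my $query = bool_query(
    must   => [ { term => { author => 'jack' } } ],
    should => [ map { +{ term => { category => $_ } } } qw(drama comedy) ],
    minimum_should_match => 1,
);
```

The resulting hashref can be passed as the first argument of search().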

constant_score

creates https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/ConstantScoreQuery.html

syntax:

constant_score => { query => ..., boost => 4.0 }

filtered

creates https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/QueryWrapperFilter.html or https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/CachingWrapperFilter.html

syntax:

filtered => { query => ..., filter => { query.. } }
filtered => { query => ..., cached_filter => { query.. } }

example:

$s->search(
{
       filtered => { 
                query => { match_all => {} },
                cached_filter => { term => { author => "jack" } }
       } 
})

match_all

creates http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/MatchAllDocsQuery.html

syntax:

match_all => {}

custom_score

syntax:

{ 
   custom_score => { 
       query => { 
           term => { author => "jack" } 
       }, 
       class => 'bz.brick.RAMBrick.BoosterQuery', 
       params => {} 
   }
}

will create an instance of BoosterQuery, with "Map<String,Map<String,String>>" params

look at https://github.com/jackdoe/brick/blob/master/bricks/RAMBrick/queries/BoosterQuery.java for simple example

span_first

creates http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/spans/SpanFirstQuery.html

syntax:

{ 
   span_first => { 
       match => { 
           span_term => { author => "doe" }
       }, 
       end => 1 
   }
}

matches the term "doe" in the first "end" positions of the field.

more detailed info on the span queries: http://searchhub.org/2009/07/18/the-spanquery/

span_near

creates http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/spans/SpanNearQuery.html

syntax:

{
   span_near => { 
       clauses => [ 
           { span_term => { author => 'jack' } }, 
           { span_term => { author => 'doe' } } 
       ], 
       slop => 0,
       in_order => true()
   }
}

more detailed info on the span queries: http://searchhub.org/2009/07/18/the-spanquery/

example:

{
   span_near => {
       clauses => [
           {
               span_near => {
                   clauses => [
                       { span_term => { author => 'jack' } },
                       { span_term => { author => 'doe' } }
                    ],
                    slop => 1
               },
           },
           {
               span_term => { author => 'go!' }
           }
       ],
       slop => 0
   }
}
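Writing span_near structures by hand gets verbose, so a phrase-like query can be generated from a list of words. The following is a sketch under assumptions: span_phrase is a hypothetical helper, and it leaves in_order unset, as the nested example above does.

```perl
use strict;
use warnings;

# Turn a list of words into a span_near over span_term clauses.
# A single word is returned as a plain span_term.
# Hypothetical helper, not provided by Search::Brick::RAM::Query.
sub span_phrase {
    my ($field, $slop, @words) = @_;
    die "span_phrase needs at least one word\n" unless @words;
    my @clauses = map { +{ span_term => { $field => $_ } } } @words;
    return $clauses[0] if @clauses == 1;
    return { span_near => { clauses => \@clauses, slop => $slop } };
}

my $query = span_phrase('author', 0, 'jack', 'doe');
```

Because RAMBrick does no analysis on the query, the words have to match the indexed terms exactly.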

span_term

creates http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/spans/SpanTermQuery.html

syntax:

span_term => { field => value }

span_term queries are the building block of all span queries

QUERY SETTINGS

log_slower_than

syntax:

log_slower_than => 5

example:

my @result = $s->search({ term => { author => 'jack' } },{ log_slower_than => 5 });

if the query takes more than 5 milliseconds, the return object will have a query key containing the actual query. In this case it will look like this:

{
  hits => [],
  took => 6,
  query => "author:jack"
}
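On the client side, slow queries can then be picked out by looking for the query key. This is a minimal sketch with made-up results; logged_queries is a hypothetical helper. It assumes dump_query is not set, since that setting also fills the query key.

```perl
use strict;
use warnings;

# Return the query strings of all results that carry a 'query' key.
# Hypothetical helper, not provided by Search::Brick::RAM::Query.
sub logged_queries {
    my (@results) = @_;
    return map { $_->{query} } grep { exists $_->{query} } @results;
}

# Made-up results: one host was slower than the threshold, one was not.
my @results = (
    { hits => [], took => 6, query => 'author:jack' },
    { hits => [], took => 2 },
);

my @slow = logged_queries(@results);
print "slow query: $_\n" for @slow;
```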

explain

syntax:

explain => true()

example:

my @result = $s->search({ term => { author => 'jack' } },{ explain => true() });

will fill '__explain' field in each document, like this:

{
  hits => [
 {
...
'__score' => '4.19903135299683',
'__explain' => '4.1990314 = (MATCH) weight(author:jack in 19) [BoostSimilarity], result of:
 4.1990314 = <3.2> value of f_boost field + subscorer
   0.9990314 = score(doc=19,freq=1.0 = termFreq=1.0), product of:
     0.9995156 = queryWeight, product of:
       0.9995156 = idf(docFreq=2064, maxDocs=2064)
       1.0 = queryNorm
     0.9995156 = fieldWeight in 19, product of:
       1.0 = tf(freq=1.0), with freq of:
         1.0 = termFreq=1.0
       0.9995156 = idf(docFreq=2064, maxDocs=2064)
       1.0 = fieldNorm(doc=19)
',
'author' => 'jack',
...
}
  ],
  took => 6,
}

dump_query

syntax:

dump_query => true()

example:

my @result = $s->search({ term => { author => 'jack' } },{ dump_query => true() });

will return the actual query.toString() in the result structure:

[
 {
   'took' => 109,
   'query' => 'author:jack',
   'hits' => [ {},{}... ]
 }
]

INDEX

all the indexes are created from messagepack'ed streams, following the same protocol:

mapping + settings
data

example:

my $settings = {
   mapping => {
       author => {
           type  => "string", # "string|int|long|double|float",
           index =>  true(),
           store =>  true(),
           omit_norms => false(),
           store_term_vector_offsets => false(),
           store_term_vector_positions => false(),
           store_term_vector_payloads => false(),
           store_term_vectors => false(),
           tokenized => true()
       },
       group_by => {
           type  => "string",
           index =>  true(),
           store =>  false()
       }
   },
   settings => {
       expected => 2, # number of expected documents from the data-part
       shards => 4,   # how many shards will be created 
                      # (in this example each shard will have 1/4th of the data)
       similarity => "bz.brick.RAMBrick.IgnoreIDFSimilarity", # can use:
                                                              # org.apache.lucene.search.similarities.BM25Similarity
                                                              # org.apache.lucene.search.similarities.DefaultSimilarity
                                                              # etc..
       expect_array => true(),
       store => "/var/lib/brick/ram/primary_index_20022014" # it will actually create lucene index there
                                                            # and next time it tries to autoload the file
                                                            # it will just check if number of documents
                                                            # match.
   }
};

check out the field type options from: https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/document/FieldType.html similarity information: https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/similarities/Similarity.html

the data can be in array format, or just concatenated documents joined by '' (depending on the expect_array setting)

the store option tells RAMBrick to store the lucene indexes on disk (instead of using http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/store/RAMDirectory.html), just specify the directory name in the index's settings (it MUST be somewhere within the RAMBRICK_AUTO_LOAD_ROOT).

the structure in our $settings example will look like:

/var/lib/brick/ram/primary_index_20022014/SHARD_0
/var/lib/brick/ram/primary_index_20022014/SHARD_1
/var/lib/brick/ram/primary_index_20022014/SHARD_2
/var/lib/brick/ram/primary_index_20022014/SHARD_3

each of those directories will contain the lucene index (using http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/store/NIOFSDirectory.html)
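The directory layout follows directly from the store and shards settings. A small sketch that derives the expected paths, for example to check disk usage per shard; shard_dirs is a hypothetical helper and the SHARD_n naming is taken from the listing above.

```perl
use strict;
use warnings;

# List the per-shard directories for a given store path and shard count.
# Hypothetical helper, not provided by Search::Brick::RAM::Query.
sub shard_dirs {
    my ($store, $shards) = @_;
    return map { "$store/SHARD_$_" } 0 .. $shards - 1;
}

print "$_\n" for shard_dirs('/var/lib/brick/ram/primary_index_20022014', 4);
```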

In case the number of expected documents does not match the number of documents indexed, it will not create the index.
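Since a mismatch means no index at all, it is safer to derive expected from the data than to type it by hand. A minimal sketch follows; with_expected is a hypothetical helper, and the mapping in the usage part is shortened, made-up data.

```perl
use strict;
use warnings;

# Return a copy of $settings whose settings.expected equals the number
# of documents. The original structure is not modified.
# Hypothetical helper, not provided by Search::Brick::RAM::Query.
sub with_expected {
    my ($settings, $docs) = @_;
    return {
        %$settings,
        settings => {
            %{ $settings->{settings} || {} },
            expected => scalar @$docs,
        },
    };
}

my @docs = (
    { author => 'jack', group_by => '23' },
    { author => 'jack', group_by => '24' },
    { author => 'john', group_by => '25' },
);

my $settings = with_expected(
    { mapping => { author => { type => 'string' } }, settings => { shards => 1 } },
    \@docs,
);
print "expected: $settings->{settings}{expected}\n";
```

The returned structure can then be passed to index() together with the same list of documents.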

file backed

You can create .msgpack files by concatenating the settings with the data, and just putting the file in the RAMBRICK_AUTO_LOAD_ROOT directory (by default /var/lib/brick/ram/; start brick with the RAMBRICK_AUTO_LOAD_ROOT env variable set to wherever you want). Those indexes can also be stored as a lucene index (if the store setting points to a directory within the RAMBRICK_AUTO_LOAD_ROOT).

online

my $s = Search::Brick::RAM::Query->new(host => '127.0.0.1:9000', index => '__test__');
$s->index([{ author => 'jack', group_by => "23" },{ author => 'jack', group_by => "24" }],$settings);

this will just send one blob of data to ram:store:__test__, which will be rerouted to RAMBrick.store('__test__',unpacker). The next portion of data will be in the format <settings><data...>. This index will be lost after the brick server is restarted.

delete

when you delete an index, it will delete the autoload .msgpack file + the stored directory (containing all the lucene indexes)

my $s = Search::Brick::RAM::Query->new(host => '127.0.0.1:9000', index => '__test__');
$s->delete();

ALIAS

The alias state is kept in a small metadata file named $RAMBRICK_AUTO_LOAD_ROOT/alias.metadata. Every time an alias is modified, created or deleted, this file is updated. The file is loaded into the alias hash at init time.

my $s = Search::Brick::RAM::Query->new(host => '127.0.0.1:9000', index => '__test__');
$s->alias({
  add => [ { "some_alias_name" => "some_index_name" } ],
  delete => [ "some_alias","some_other_alias" ]
});

aliases are atomic, meaning that the whole request will be executed in one go. The order of the add and delete operations is undefined, but the order within each operation is honored.
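Because the relative order of add and delete is undefined, a request that adds and deletes the same alias name has no predictable outcome. The sketch below builds the request structure and refuses such a clash; alias_request is a hypothetical helper and the alias and index names are made up.

```perl
use strict;
use warnings;

# Build the structure passed to alias(), dying if the same alias name
# appears in both the add and the delete list.
# Hypothetical helper, not provided by Search::Brick::RAM::Query.
sub alias_request {
    my (%args) = @_;
    my %add    = %{ $args{add}    || {} };
    my @delete = @{ $args{delete} || [] };

    my @clash = grep { exists $add{$_} } @delete;
    die "alias both added and deleted: @clash\n" if @clash;

    my %request;
    $request{add}    = [ map { +{ $_ => $add{$_} } } sort keys %add ] if %add;
    $request{delete} = \@delete if @delete;
    return \%request;
}

my $request = alias_request(
    add    => { books => 'books_v2' },
    delete => ['books_old'],
);
```

The result has the same shape as the hashref in the example above and could be passed to $s->alias($request).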

STAT

my $s = Search::Brick::RAM::Query->new(host => '127.0.0.1:9000', index => '__test__');
print Dumper([$s->stat()]);

will produce:

[
 {
  'index.default.groups' => 20,
  'index.default.searches' => 4,
  'index.default.documents' => 9999,
  'index.default.last_query_stamp' => 1391416081,
  'java.non_heap.usage.used' => 12678192,
  'java.heap.usage.init' => 31138368,
  'java.heap.usage.committed' => 91226112,
  'java.heap.usage.used' => 47775336,
  'java.non_heap.usage.init' => 24576000,
  'java.heap.usage.max' => 620756992,
  'java.non_heap.usage.max' => 224395264,
  'java.non_heap.usage.committed' => 24576000,
  'main.connection_pool.size' => 12,
  'main.connection_pool.active' => 1,
  'brick.search_pool.size' => 15,
  'brick.search_pool.active' => 0,
  'brick.time_indexing' => 3,
  'brick.uptime' => 212,
  'brick.time_searching' => 0,
  'brick.searches' => 8,
 }
]
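The flat key names make the stat output easy to feed into monitoring. A minimal sketch that computes the heap usage ratio for each host; heap_usage is a hypothetical helper and the numbers are copied from the dump above.

```perl
use strict;
use warnings;

# Fraction of the maximum Java heap that is in use, or 0 if the
# maximum is missing. Hypothetical helper, not part of the module.
sub heap_usage {
    my ($stat) = @_;
    my $max = $stat->{'java.heap.usage.max'} or return 0;
    return $stat->{'java.heap.usage.used'} / $max;
}

# One hashref per host, in the shape returned by stat().
my @stats = (
    {
        'java.heap.usage.used' => 47775336,
        'java.heap.usage.max'  => 620756992,
    },
);

printf "heap in use: %.1f%%\n", 100 * heap_usage($_) for @stats;
```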

EXAMPLES

at the moment it looks like this: sister queries are joined by a BooleanQuery with a MUST clause, for example:

{ 
    term => { "author" => "john" },
    dis_max => { 
       queries => [ 
           { term => { "category" => "horror" } },
           { term => { "category" => "comedy" } } 
       ], 
       tie_breaker => 0.2 
    }
},

will generate: +(category:horror | category:comedy)~0.2 +author:john

different example:

query(
 host => ['127.0.0.1:9000','127.0.0.1:9000'],
 request => { 
    term => { "author" => "john" } ,
    dis_max => { 
       queries => [ 
        { term => { "category" => "horror" } },
        { term => { "category" => "comedy" } } 
       ], 
       tie_breaker => 0.2 
    },
    bool => {
        must => [
           { lucene => "isbn:74623783 AND year:[20021201 TO 200402302]" }
        ],
        should => [
           { lucene => "cover:(red OR blue)" },
           { term => { old => "0" } }
        ],
        must_not => [
           { term => { deleted => "1" } }
        ]
    }
 },
 brick => 'RAM',
 timeout => 10,
 settings => { "dump_query" => "true" });
 

generates: +((+(+isbn:74623783 +year:[20021201 TO 200402302]) -deleted:1 (cover:red cover:blue) old:0)~1) +(category:horror | category:comedy)~0.2 +author:john
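One caveat with sister queries is on the Perl side: a hash holds a single value per key, so two term queries cannot be siblings in the same hashref, the later one silently replaces the earlier one. Wrapping them in an explicit bool with must clauses avoids that. The sketch below assumes the explicit form matches the same documents as the implicit join described above; all_of is a hypothetical helper.

```perl
use strict;
use warnings;

# Two 'term' keys in one hash: only the last one survives.
my %sisters = (
    term => { author   => 'john' },
    term => { category => 'horror' },
);
print "keys kept: ", scalar(keys %sisters), "\n";

# Wrap any number of queries in one bool/must so none of them is lost.
# Hypothetical helper, not provided by Search::Brick::RAM::Query.
sub all_of {
    my (@queries) = @_;
    return { bool => { must => [@queries] } };
}

my $query = all_of(
    { term => { author   => 'john' } },
    { term => { category => 'horror' } },
);
print "must clauses: ", scalar(@{ $query->{bool}{must} }), "\n";
```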

another example:

my $b = Search::Brick::RAM::Query->new(index => '__test__');
$b->delete();
my $settings = {
   mapping => {
       author => {
           type  => "string",
           index =>  Data::MessagePack::true(),
           store =>  Data::MessagePack::true(),
       },
       f_boost => {
           type  =>  "float",
           index =>  Data::MessagePack::true(),
           store =>  Data::MessagePack::true(),
       },
       group_by => {
           type  => "string",
           index =>  Data::MessagePack::true(),
           store =>  Data::MessagePack::true()
       }
   },
   settings => {
       expected => 2,
       shards => 1
   }
};

$b->index([
    { author => 'jack', group_by => "23", f_boost => 0.5 },
    { author => 'john', group_by => "24",f_boost => 0.5 }
  ],$settings);
my @result = $b->search({ lucene => "ConstantScore(author:john)^8273" },
                        { log_slower_than => "1",explain => "true" } );

will produce:

$VAR1 = [{
 'took' => 4,
 'hits' => [
   {
     '__score' => '8273.5',
     'f_boost' => '0.5',
     '__group_index' => '0',
     'group_by' => '24',
     '__explain' => '8273.5 = (MATCH) sum of:
 8273.5 = (MATCH) weight(author:john^8273.0 in 1) [BoostSimilarity], result of:
   8273.5 = <0.5> value of f_boost field + subscorer
     8273.0 = score(doc=1,freq=1.0 = termFreq=1.0), product of:
       8273.0 = queryWeight, product of:
         8273.0 = boost
         1.0 = idf(docFreq=1, maxDocs=2)
         1.0 = queryNorm
       1.0 = fieldWeight in 1, product of:
         1.0 = tf(freq=1.0), with freq of:
           1.0 = termFreq=1.0
         1.0 = idf(docFreq=1, maxDocs=2)
         1.0 = fieldNorm(doc=1)
',
     'author' => 'john'
   }
 ],
 'query' => '__no_default_field__:ConstantScore author:john^8273.0'
}];

as you can see, the return structure is [{},{},{}], one result per host (for example, if we do $b = Search::Brick::RAM::Query->new(host => [ '127.0.0.1:9000','127.0.0.1:9000' ]) there will be [{hits => []},{hits => []}] in the output).

SEE ALSO

lucene: http://lucene.apache.org/core/4_6_0/

brick: https://github.com/jackdoe/brick

AUTHOR

Borislav Nikolov, <jack@sofialondonmoskva.com>

COPYRIGHT AND LICENSE

Copyright (C) 2014 by Borislav Nikolov

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.18.2 or, at your option, any later version of Perl 5 you may have available.
