NAME

Search::Brick::RAM::Query - Perl interface for the RAMBrick search thingie https://github.com/jackdoe/brick/tree/master/bricks/RAMBrick

SYNOPSIS

use Search::Brick::RAM::Query qw(query true);
my @results = query(host    => ['127.0.0.1:9000'],
                    settings => { log_slower_than => 100, explain => true }, # log queries slower than 100ms
                    request  => { term => { "title" => "book","author" => "john" } },
                    brick    => 'RAM',
                    action   => 'search',
                    timeout  => 0.5 # in seconds - this is total timeout
                                    # connect time + write time + read time
              );

#shortcuts are available using the search object:
my $s = Search::Brick::RAM::Query->new(host => '127.0.0.1:9000', index => '__test__');
my @results = $s->search({ term => { author => 'jack' } }, { size => 20, explain => true, log_slower_than => 5 });
$s->delete(); # deletes the __test__ index

DESCRIPTION

minimalistic interface to RAMBrick

FUNCTION: Search::Brick::RAM::Query::query( $args :Hash ) :Ref

Search::Brick::RAM::Query::query has the following parameters:

host    => ['127.0.0.1:9000'],
request => { term => { "title" => "book","author" => "john" } },
brick   => 'RAM',
action  => 'search', # (search|store|alias|load)
index   => '...',    # name of your index
timeout => 0.5,      # in seconds
settings => {}       # used by different actions to get context on the request

timeout is the whole round trip timeout: (connect time + write time + read time).

brick is the brick's name (in this case 'RAM')

action is the requested action (search|store|stat)

request must be a reference to an array of hashrefs

index is the action argument, provides context to the request (usually index name)

settings: RAMBrick requires settings to be sent with every request (by default they are empty). They carry things like "size" or "items_per_group", for example: { size => 5, explain => true }

host is a string or an arrayref of strings. The same request is sent to all hosts in the list (the whole thing is async, so it is as slow as the slowest host) and an unordered array of results is returned.

If the result is not a ref() or there is any kind of error (including a timeout), the request will die.
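Because the request dies on any error, callers usually want to wrap it in eval. A minimal sketch of that pattern: try_query is a hypothetical helper (not part of this module), and a stand-in code reference simulates the timeout so the snippet runs without a brick server.

```perl
use strict;
use warnings;

# try_query is a hypothetical helper: it runs a code reference that performs
# the request and turns a die (timeout, connection error, bad reply) into an
# error string instead of terminating the program.
sub try_query {
    my ($request_cb) = @_;
    my @results = eval { $request_cb->() };
    if (my $err = $@) {
        chomp $err;
        return { ok => 0, error => $err, results => [] };
    }
    return { ok => 1, error => undef, results => \@results };
}

# With the real module the callback would wrap a call such as
#   Search::Brick::RAM::Query::query(host => [...], request => {...},
#                                    brick => 'RAM', action => 'search',
#                                    timeout => 0.5)
# Here stand-ins simulate a timeout and a successful reply.
my $failed = try_query(sub { die "timeout\n" });
my $passed = try_query(sub { return ({ hits => [], took => 1 }) });

print "failed: $failed->{error}\n";
print "passed: ", scalar(@{ $passed->{results} }), " result(s)\n";
```

Returning a small status hash keeps the decision about retrying or skipping a host with the caller.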

my $s = Search::Brick::RAM::Query->new(host => '127.0.0.1:9000', index => '__test__');
my @results = $s->search({ term => { author => 'jack' } }, { size => 20, explain => true, log_slower_than => 5 });

QUERY SYNTAX

boosting

every query except term and lucene supports a boost parameter, like:

bool => { must => [ { term => { author => 'jack' } } ], boost => 4.3 }

term

creates http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/TermQuery.html

syntax:

term => { field => value }

example:

my @results = $s->search({ term => { author => 'jack' } });

since RAMBrick does not do any query rewrites (like ElasticSearch's match query) and it also does not do any kind of analysis on the query string, the term and lucene queries are the only queries that can be used to match specific documents.
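Since nothing is analyzed on the query side, the client has to produce the exact terms that are in the index. A minimal sketch, assuming the field was indexed as lowercased, whitespace-separated tokens (that is an assumption about your index, not something RAMBrick guarantees); terms_query is a hypothetical helper name:

```perl
use strict;
use warnings;
use Data::Dumper;

# Splits user input into lowercased tokens and requires every token to match
# as an exact term on the given field.
sub terms_query {
    my ($field, $text) = @_;
    my @tokens = grep { length } split /\s+/, lc $text;
    return { bool => { must => [ map { +{ term => { $field => $_ } } } @tokens ] } };
}

my $q = terms_query(author => '  Jack DOE ');

local $Data::Dumper::Sortkeys = 1;
print Dumper($q);
# the structure can then be passed to $s->search($q)
```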

lucene

creates http://lucene.apache.org/core/4_6_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description query:

syntax:

lucene => "author:john"

example:

my @result = $s->search({ lucene => "ConstantScore(author:john)^8273" });

you can do pretty much everything with it; in this example it creates a constant score query over a term query

dis_max

syntax:

dis_max => {
    queries => [ 
        { term => { author => 'jack' } },
        { term => { category => 'comedy' } } 
    ],
    boost => 1.0,
    tie_breaker => 0.3
}

bool

syntax:

bool => {
    must => [
        { term => { author => 'jack' } },
    ],
    should => [
        { term => { category => 'drama' } },
        { term => { category => 'comedy' } }
    ],
    must_not => [
        { term => { deleted => 'true' } }
    ],
    minimum_should_match => 1,
    boost => 1.0
}

constant_score

creates https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/ConstantScoreQuery.html

syntax:

constant_score => { query => ..., boost => 4.0 }

filtered

creates https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/QueryWrapperFilter.html or https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/CachingWrapperFilter.html

syntax:

filtered => { query => ..., filter => { query.. } }
filtered => { query => ..., cached_filter => { query.. } }

example:

$s->search(
{
       filtered => { 
                query => { match_all => {} },
                cached_filter => { term => { author => "jack" } }
       } 
})

match_all

creates http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/MatchAllDocsQuery.html

syntax:

match_all => {}

custom_score

syntax:

{ 
   custom_score => { 
       query => { 
           term => { author => "jack" } 
       }, 
       class => 'bz.brick.RAMBrick.BoosterQuery', 
       params => {} 
   }
}

will create an instance of BoosterQuery, with "Map<String,Map<String,String>>" params

look at https://github.com/jackdoe/brick/blob/master/bricks/RAMBrick/queries/BoosterQuery.java for simple example
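Since params arrives on the Java side as "Map<String,Map<String,String>>", it helps to check the shape before sending. A small sketch of such a check: params_shape_ok is a hypothetical helper, and the "weights" and "jack" keys are made up for illustration (what a class actually reads from params is defined by that class).

```perl
use strict;
use warnings;

# Returns 1 when $params is a hash of hashes whose leaves are defined,
# non-reference scalars, and 0 otherwise.
sub params_shape_ok {
    my ($params) = @_;
    return 0 unless ref $params eq 'HASH';
    for my $inner (values %$params) {
        return 0 unless ref $inner eq 'HASH';
        for my $leaf (values %$inner) {
            return 0 if !defined $leaf || ref $leaf;
        }
    }
    return 1;
}

my $good = { weights => { jack => "2.5" } };   # hash of hash of strings
my $bad  = { weights => [ 2.5 ] };             # inner value is an array

my $query = {
    custom_score => {
        query  => { term => { author => "jack" } },
        class  => 'bz.brick.RAMBrick.BoosterQuery',
        params => $good,
    }
};

print "good: ", params_shape_ok($good), "\n";
print "bad: ",  params_shape_ok($bad),  "\n";
```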

span_first

creates http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/spans/SpanFirstQuery.html

syntax:

{ 
   span_first => { 
       match => { 
           span_term => { author => "doe" }
       }, 
       end => 1 
   }
}

matches the term "doe" in the first "end" positions of the field.

more detailed info on the span queries: http://searchhub.org/2009/07/18/the-spanquery/

span_near

creates http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/spans/SpanNearQuery.html

syntax: { span_near => { clauses => [ { span_term => { author => 'jack' } }, { span_term => { author => 'doe' } } ], slop => 0, in_order => true() } }

more detailed info on the span queries: http://searchhub.org/2009/07/18/the-spanquery/

example:

{
   span_near => {
       clauses => [
           {
               span_near => {
                   clauses => [
                       { span_term => { author => 'jack' } },
                       { span_term => { author => 'doe' } }
                    ],
                    slop => 1
               },
           },
           {
               span_term => { author => 'go!' }
           }
       ],
       slop => 0
   }
}

span_term

creates http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/spans/SpanTermQuery.html

syntax: span_term => { field => value }

span_term queries are the building block of all span queries
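Nested span queries get hard to read when written by hand, so a few tiny constructors can help. This is only a sketch: the three subs below are hypothetical helpers that build the hash structures shown in the syntax sections, and they leave out in_order.

```perl
use strict;
use warnings;
use Data::Dumper;

# Each helper returns the plain hashref for one span query type.
sub span_term  { my ($field, $value)  = @_; return { span_term  => { $field => $value } } }
sub span_first { my ($match, $end)    = @_; return { span_first => { match => $match, end => $end } } }
sub span_near  { my ($slop, @clauses) = @_; return { span_near  => { clauses => [@clauses], slop => $slop } } }

# Same structure as the nested span_near example: "jack" within one position
# of "doe", directly followed by "go!".
my $q = span_near(0,
    span_near(1, span_term(author => 'jack'), span_term(author => 'doe')),
    span_term(author => 'go!'),
);

my $first = span_first(span_term(author => 'doe'), 1);

local $Data::Dumper::Sortkeys = 1;
print Dumper($q, $first);
```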

QUERY SETTINGS

log_slower_than

syntax:

log_slower_than => 5

example:

my @result = $s->search({ term => { author => 'jack' } },{ log_slower_than => 5 });

if the query takes more than 5 milliseconds, in the return object there will be a query key which will contain the actual query. in this case it will look like this:

{
  hits => [],
  took => 6,
  query => "author:jack"
}

explain

syntax: explain => true()

example: my @result = $s->search({ term => { author => 'jack' } },{ explain => true() });

will fill '__explain' field in each document, like this:

{
  hits => [
 {
...
'__score' => '4.19903135299683',
'__explain' => '4.1990314 = (MATCH) weight(author:jack in 19) [BoostSimilarity], result of:
 4.1990314 = <3.2> value of f_boost field + subscorer
   0.9990314 = score(doc=19,freq=1.0 = termFreq=1.0), product of:
     0.9995156 = queryWeight, product of:
       0.9995156 = idf(docFreq=2064, maxDocs=2064)
       1.0 = queryNorm
     0.9995156 = fieldWeight in 19, product of:
       1.0 = tf(freq=1.0), with freq of:
         1.0 = termFreq=1.0
       0.9995156 = idf(docFreq=2064, maxDocs=2064)
       1.0 = fieldNorm(doc=19)
',
'author' => 'jack',
...
}
  ],
  took => 6,
}

dump_query

syntax: dump_query => true()

example: my @result = $s->search({ term => { author => 'jack' } },{ dump_query => true() });

will return the actual query.toString() in the result structure:

[
 {
   'took' => 109,
   'query' => 'author:jack',
   'hits' => [ {},{}... ]
 }
]
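A short sketch of reading that structure: it walks the list of results and prints one line per entry. The sample data is made up to mirror the shape above, and summarize is a hypothetical helper.

```perl
use strict;
use warnings;

# Sample data shaped like the result structure above (one hash per result).
my @results = (
    { took => 109, query => 'author:jack', hits => [ {}, {} ] },
    { took => 12,  hits => [ {} ] },   # no "query" key in this one
);

# Builds a one-line summary; "query" is only present when the server
# included it in the reply.
sub summarize {
    my ($r) = @_;
    my $query = exists $r->{query} ? $r->{query} : '(not returned)';
    return sprintf "took=%d hits=%d query=%s",
        $r->{took}, scalar @{ $r->{hits} || [] }, $query;
}

print summarize($_), "\n" for @results;
```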

INDEX

all the indexes are created from messagepack'ed streams, following the same protocol:

mapping + settings
data

example:

my $settings = {
   mapping => {
       author => {
           type  => "string", # "string|int|long|double|float",
           index =>  true(),
           store =>  true(),
           omit_norms => false(),
           store_term_vector_offsets => false(),
           store_term_vector_positions => false(),
           store_term_vector_payloads => false(),
           store_term_vectors => false(),
           tokenized => true()
       },
       group_by => {
           type  => "string",
           index =>  true(),
           store =>  false()
       }
   },
   settings => {
       expected => 2, # number of expected documents from the data-part
       shards => 4,   # how many shards will be created 
                      # (in this example each shard will have 1/4th of the data)
       similarity => "bz.brick.RAMBrick.IgnoreIDFSimilarity", # can use:
                                                              # org.apache.lucene.search.similarities.BM25Similarity
                                                              # org.apache.lucene.search.similarities.DefaultSimilarity
                                                              # etc..
       expect_array => true(),
       store => "/var/lib/brick/ram/primary_index_20022014" # it will actually create a lucene index there
                                                            # and the next time it tries to autoload the file
                                                            # it will just check that the number of documents
                                                            # matches.
   }
};

check out the field type options from: https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/document/FieldType.html

similarity information: https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/similarities/Similarity.html

the "data" can be in array format, or just concatenated documents joined by '' (depending on the "expect_array" setting)

  • file backed

    You can create .messagepack files by concatenating the settings with the data and putting the result in the "RAMBRICK_AUTO_LOAD_ROOT" directory (by default "/var/lib/brick/ram/"; start brick with the RAMBRICK_AUTO_LOAD_ROOT env variable set to wherever you want). Those indexes can also be stored as a lucene index (if the "store" setting points to a directory within the RAMBRICK_AUTO_LOAD_ROOT).

  • online

    my $s = Search::Brick::RAM::Query->new(host => '127.0.0.1:9000', index => '__test__');
    $s->index([{ author => 'jack', group_by => "23" },{ author => 'jack', group_by => "24" }],$settings);

    this will just send one blob of data to "ram:store:__test__", which will be rerouted to "RAMBrick.store('__test__',unpacker)" and the next portion of data will be in the format <settings><data...>

In case the number of expected documents does not match the number of documents indexed, it will not create the index.
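For the file backed variant, the blob is the packed settings followed by the packed data. A minimal sketch of building it, assuming Data::MessagePack from CPAN is installed (the snippet skips itself otherwise); build_blob is a hypothetical helper and the exact bytes the server accepts are not verified here.

```perl
use strict;
use warnings;

# Data::MessagePack comes from CPAN (it is not a core module).
my $have_mp = eval { require Data::MessagePack; 1 };

# Packs <settings> followed by <data...>. When $expect_array is true the
# documents are packed as one array, otherwise each document is packed on
# its own and the packed documents are appended one after another.
sub build_blob {
    my ($settings, $docs, $expect_array) = @_;
    my $mp   = Data::MessagePack->new;
    my $blob = $mp->pack($settings);
    if ($expect_array) {
        $blob .= $mp->pack($docs);
    }
    else {
        $blob .= $mp->pack($_) for @$docs;
    }
    return $blob;
}

if ($have_mp) {
    my $settings = {
        mapping  => { author => { type  => "string",
                                  index => Data::MessagePack::true(),
                                  store => Data::MessagePack::true() } },
        settings => { expected => 2, shards => 1,
                      expect_array => Data::MessagePack::true() },
    };
    my $docs = [ { author => 'jack' }, { author => 'john' } ];
    my $blob = build_blob($settings, $docs, 1);
    printf "blob is %d bytes\n", length $blob;
    # To make it auto-loadable, write $blob in binary mode to a .messagepack
    # file under RAMBRICK_AUTO_LOAD_ROOT (the file name is up to you).
}
else {
    print "Data::MessagePack is not installed, skipping\n";
}
```

Keep "expected" equal to the number of documents you pack, otherwise the index is not created.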

the store option

there is an option to store the indexes on disk, just specify the directory name in the index's settings (it MUST be somewhere within the RAMBRICK_AUTO_LOAD_ROOT).

the structure in our $settings example will look like:

/var/lib/brick/ram/primary_index_20022014/SHARD_0
/var/lib/brick/ram/primary_index_20022014/SHARD_1
/var/lib/brick/ram/primary_index_20022014/SHARD_2
/var/lib/brick/ram/primary_index_20022014/SHARD_3

each of those directories will contain the lucene index (using http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/store/NIOFSDirectory.html)
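To see which directories a given "store" and "shards" combination should produce, the layout above can be sketched like this. shard_dirs is a hypothetical helper, the SHARD_N naming is taken from the listing above, and the containment check is a plain string prefix test (it does not resolve ".." or symlinks).

```perl
use strict;
use warnings;

# Returns the per-shard directories for a store path, and dies when the
# store path is not located under the auto load root.
sub shard_dirs {
    my ($root, $store, $shards) = @_;
    $root =~ s{/+$}{};                      # drop trailing slashes
    die "store must be inside $root\n"
        unless index($store, "$root/") == 0;
    return map { "$store/SHARD_$_" } 0 .. $shards - 1;
}

my @dirs = shard_dirs('/var/lib/brick/ram/',
                      '/var/lib/brick/ram/primary_index_20022014',
                      4);
print "$_\n" for @dirs;
```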

delete an index

when you delete an index, it will delete the autoload .messagepack file + the stored directory

my $s = Search::Brick::RAM::Query->new(host => '127.0.0.1:9000', index => '__test__');
$s->delete();

ALIAS

STAT

my $s = Search::Brick::RAM::Query->new(host => '127.0.0.1:9000', index => '__test__');
print Dumper([$s->stat()]);

will produce:

[
 {
  'index.default.groups' => 20,
  'index.default.searches' => 4,
  'index.default.documents' => 9999,
  'index.default.last_query_stamp' => 1391416081,
  'java.non_heap.usage.used' => 12678192,
  'java.heap.usage.init' => 31138368,
  'java.heap.usage.committed' => 91226112,
  'java.heap.usage.used' => 47775336,
  'java.non_heap.usage.init' => 24576000,
  'java.heap.usage.max' => 620756992,
  'java.non_heap.usage.max' => 224395264,
  'java.non_heap.usage.committed' => 24576000,
  'main.connection_pool.size' => 12,
  'main.connection_pool.active' => 1,
  'brick.search_pool.size' => 15,
  'brick.search_pool.active' => 0,
  'brick.time_indexing' => 3,
  'brick.uptime' => 212,
  'brick.time_searching' => 0,
  'brick.searches' => 8,
 }
]
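The counters are easier to watch as ratios. A small sketch that derives heap and search pool utilisation from a hash shaped like the one above; the values are copied from that output, and ratio is a hypothetical helper.

```perl
use strict;
use warnings;

# A subset of the counters shown above.
my %stat = (
    'java.heap.usage.used'     => 47775336,
    'java.heap.usage.max'      => 620756992,
    'brick.search_pool.size'   => 15,
    'brick.search_pool.active' => 0,
);

# Returns used/total, or undef when the total is missing or zero.
sub ratio {
    my ($stat, $used_key, $total_key) = @_;
    my $total = $stat->{$total_key};
    return undef unless $total;
    return $stat->{$used_key} / $total;
}

printf "heap used: %.1f%%\n",
    100 * ratio(\%stat, 'java.heap.usage.used', 'java.heap.usage.max');
printf "search pool busy: %.1f%%\n",
    100 * ratio(\%stat, 'brick.search_pool.active', 'brick.search_pool.size');
```

With a live brick, each element returned by $s->stat() would take the place of %stat.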

EXAMPLES:

at the moment it looks like this: sister queries are joined by a BooleanQuery with MUST clauses, for example:

{ 
    term => { "author" => "john" },
    dis_max => { 
       queries => [ 
           { term => { "category" => "horror" } },
           { term => { "category" => "comedy" } } 
       ], 
       tie_breaker => 0.2 
    }
},

will generate: +(category:horror | category:comedy)~0.2 +author:john

different example:

query(
 host => ['127.0.0.1:9000','127.0.0.1:9000'],
 request => { 
    term => { "author" => "john" } ,
    dis_max => { 
       queries => [ 
        { term => { "category" => "horror" } },
        { term => { "category" => "comedy" } } 
       ], 
       tie_breaker => 0.2 
    },
    bool => {
        must => [
           { lucene => "isbn:74623783 AND year:[20021201 TO 200402302]" }
        ],
        should => [
           { lucene => "cover:(red OR blue)" },
           { term => { old => "0" } }
        ],
        must_not => [
           { term => { deleted => "1" } }
        ]
    }
 },
 brick => 'RAM',
 timeout => 10,
 settings => { "dump_query" => "true" });
 

generates: +((+(+isbn:74623783 +year:[20021201 TO 200402302]) -deleted:1 (cover:red cover:blue) old:0)~1) +(category:horror | category:comedy)~0.2 +author:john
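One thing to keep in mind with sister queries: a Perl hash cannot hold the same key twice, so two sister term queries collapse into one. A sketch of the workaround, assuming an explicit bool query with must clauses behaves as described in the bool section; all_of is a hypothetical helper.

```perl
use strict;
use warnings;

# Wraps any number of queries in a bool query where all of them must match.
sub all_of {
    my @queries = @_;
    return { bool => { must => [@queries] } };
}

my $request = all_of(
    { term => { author   => 'john' } },
    { term => { category => 'horror' } },
);

# For comparison: as sister keys the second "term" replaces the first,
# because Perl hash keys are unique.
my %sisters = (term => { author => 'john' }, term => { category => 'horror' });

print scalar(keys %sisters), " sister key(s), ",
      scalar(@{ $request->{bool}{must} }), " must clause(s)\n";
```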

another example:

my $b = Search::Brick::RAM::Query->new(index => '__test__');
$b->delete();
my $settings = {
   mapping => {
       author => {
           type  => "string",
           index =>  Data::MessagePack::true(),
           store =>  Data::MessagePack::true(),
       },
       f_boost => {
           type  =>  "float",
           index =>  Data::MessagePack::true(),
           store =>  Data::MessagePack::true(),
       },
       group_by => {
           type  => "string",
           index =>  Data::MessagePack::true(),
           store =>  Data::MessagePack::true()
       }
   },
   settings => {
       expected => 2,
       shards => 1
   }
};

$b->index([
    { author => 'jack', group_by => "23", f_boost => 0.5 },
    { author => 'john', group_by => "24",f_boost => 0.5 }
  ],$settings);
my @result = $b->search({ lucene => "ConstantScore(author:john)^8273" },
                        { log_slower_than => "1",explain => "true" } );
$VAR1 = [{
 'took' => 4,
 'hits' => [
   {
     '__score' => '8273.5',
     'f_boost' => '0.5',
     '__group_index' => '0',
     'group_by' => '24',
     '__explain' => '8273.5 = (MATCH) sum of:
 8273.5 = (MATCH) weight(author:john^8273.0 in 1) [BoostSimilarity], result of:
   8273.5 = <0.5> value of f_boost field + subscorer
     8273.0 = score(doc=1,freq=1.0 = termFreq=1.0), product of:
       8273.0 = queryWeight, product of:
         8273.0 = boost
         1.0 = idf(docFreq=1, maxDocs=2)
         1.0 = queryNorm
       1.0 = fieldWeight in 1, product of:
         1.0 = tf(freq=1.0), with freq of:
           1.0 = termFreq=1.0
         1.0 = idf(docFreq=1, maxDocs=2)
         1.0 = fieldNorm(doc=1)
',
     'author' => 'john'
   }
 ],
 'query' => '__no_default_field__:ConstantScore author:john^8273.0'
}];

as you can see, the return structure is [{},{},{}], one result per host the request was sent to (for example, if we do $b = Search::Brick::RAM::Query->new(host => [ '127.0.0.1:9000','127.0.0.1:9000' ]) there will be [{hits => []},{hits => []}] in the output)
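When several hosts answer, the hits usually need to be merged on the client. A minimal sketch, assuming the hosts serve different parts of the data and use the same similarity so that "__score" values are comparable (both are assumptions); merge_hits is a hypothetical helper and the sample data is made up.

```perl
use strict;
use warnings;

# Flattens the hits of all results and orders them by descending __score.
# The scores arrive as strings, so they are compared numerically.
sub merge_hits {
    my @results = @_;
    my @hits    = map { @{ $_->{hits} || [] } } @results;
    my @sorted  = sort { $b->{__score} <=> $a->{__score} } @hits;
    return @sorted;
}

my @results = (
    { took => 4, hits => [ { author => 'john', __score => '8273.5' } ] },
    { took => 6, hits => [ { author => 'jack', __score => '4.19903135299683' },
                           { author => 'jane', __score => '9000' } ] },
);

my @merged = merge_hits(@results);
print join(", ", map { "$_->{author}=$_->{__score}" } @merged), "\n";
```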

SEE ALSO

lucene: http://lucene.apache.org/core/4_6_0/

brick: https://github.com/jackdoe/brick

AUTHOR

Borislav Nikolov, <jack@sofialondonmoskva.com>

COPYRIGHT AND LICENSE

Copyright (C) 2014 by Borislav Nikolov

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.18.2 or, at your option, any later version of Perl 5 you may have available.