NAME

ElasticSearchX::Model::Tutorial - Tutorial for ElasticSearchX::Model

VERSION

version 0.0.5

INTRODUCTION

In this tutorial we are going to walk through the ElasticSearch example on http://www.elasticsearch.org/. Go ahead and read it first, this gives you a good insight into how ElasticSearch works and how ElasticSearchX::Model can help to make it even more elastic.

DOCUMENTS

ElasticSearch is a document-based storage system. Even though it states that it is schema free, it is not recommended to use ElasticSearch without defining a proper schema or mapping, as ElasticSearch calls it.

ElasticSearchX::Model::Document takes care of that. The ElasticSearch example consists of two types: tweet and user. The tweet type contains the properties user, post_date and message. The user type contains only the name property. Using ElasticSearchX::Model::Document this looks like:

package MyModel::Tweet;
use Moose;
use ElasticSearchX::Model::Document;

has id        => ( is => 'ro', id => [qw(user post_date)] );
has user      => ( is => 'ro', isa => 'Str' );
has post_date => ( is => 'ro', isa => 'DateTime', required => 1, default => sub { DateTime->now } );
has message   => ( is => 'rw', isa => 'Str', index => 'analyzed' );

package MyModel::User;
use Moose;
use ElasticSearchX::Model::Document;

has nickname => ( is => 'ro', isa => 'Str', id => 1 );
has name     => ( is => 'ro', isa => 'Str' );

You might be wondering why there is an additional id attribute and a nickname. The id attribute in the Tweet class is build dynamically by concatenating the values of user and post_date. this value is digested using SHA1 and used as id for the document. If you want to change the message of the tweet, you don't have to delete the old record and add a new one but simply change the message and reindex the document. Since the id will stay the same, the new record will overwrite the old one. Also, you don't have to keep track of incrementing numerical document ids.

In the User class, the nickname attribute acts as id. Since it does not depend on the value of any other attribute, the id matches the nickname.

ElasticSearch will assign a random id to the document if there is no id attribute.

MAPPING

Each document belongs to a type. Think of it as a table in a relational database. And each type belongs to an index, which corresponds to a database.

Modeling indices and types with ElasticSearchX::Model is pretty easy and the types have actually already been built: the meta objects of the document classes describe the types. They include all the necessary information to build a type mapping. You can even use MooseX::Types::Structured to build deepy nested structures that will be translated to object properties in ElasticSearch. DateTime attributes become a Date type and so on.

Indices are defined in a model class:

package MyModel;
use Moose;
use ElasticSearchX::Model;

index twitter => ( namespace => 'MyModel' );

This is all you need to define the index and its types. The namespace option of the index twitter will load all classes in the MyModel namespace and add them to the twitter index. Actually, you don't even have to define the namespace in this case, since the namespace defaults to the name of the model class. You can also load types explicitly by defining a types option:

index twitter => ( types => [qw(MyModel::Tweet MyModel::User)] );

Make sure that the classes are loaded. See ElasticSearchX::Model::Index for all the available options.

To deploy the indices and mappings to ElasticSearch, simply call

my $model = MyModel->new;
$model->deploy;

This will try to connect to an ElasticSearch instance on 127.0.0.1:9200. See "es" in ElasticSearchX::Model for more information.

INDEXING

Indexing describes the process of adding documents to types.

use DateTime;

my $twitter = $model->index('twitter');
my $timestamp = DateTime->now;
my $tweet = $twitter->type('tweet')->put({
    user => 'mo',
    post_date => $timestamp,
    message => 'Elastic baby!',
}, { refresh => 1 });

$twitter->type('tweet')->count; # 1

The first parameter contains the property/value pairs. The post_date property is special because it is a DateTime object. Objects are being deflated prior to insertion. This is handled by MooseX::Attribute::Deflator and is configured in ElasticSearchX::Model::Document::Types. You can easily add deflators for other objects.

Since the post_date property is required and has a default, you don't even have to it to put. ElasticSearchX::Model will automatically build values from required attributes. If there is no builder or default, it will throw an exception.

The second parameter to "put" in ElasticSearchX::Model::Document::Set tells ElasticSearch to refresh the index immediately. Otherwise it can take up to one second for the server to refresh and the subsequent call to "count" in ElasticSearchX::Model::Document::Set will return 0.

If you index large numbers of documents, it is advised to call "refresh" in ElasticSearchX::Model::Index once you are finished and not on every put.

RETRIEVING

Documents can be retrieved either with their id or by providing the properties that define the id:

my $tweet_copy = $twitter->type('tweet')->get($tweet->id);
# or
my $tweet_copy = $twitter->type('tweet')->get({
    user      => 'mo',
    post_date => $timestamp,
});

Objects that have been deflated (i.e. post_date) will be inflated again. Thus, $tweet_copy->post_date is a DateTime object again.

If you don't really care about objects or need extra speed, you can set "inflate" in ElasticSearchX::Model::Documents::Set to 0. This will return the raw response from ElasticSearch.

$twitter->type('tweet')->raw->get($tweet->id);

SEARCHING AND SCROLLING

ElasticSearch is You know, for Search. ElasticSearchX::Model::Set tries to help you with its very verbose query syntax.

my @tweets = $twitter->type('tweet')->filter({
     term => { user => 'mo' }
 })->query({
     field => { 'message.analyzed' => 'baby' }
 })->size(100)->all;

If you need to retrieve large amounts of data, you probably want to scroll through the results, which is much faster and safer than scrolling manually using "from" in ElasticSearchX::Model::Set.

my $iterator = $twitter->type('tweet')->scroll;
while(my $tweet = $iterator->next) {
    # do something with $tweet
}

For extra speed use $twitter->type('tweet')->raw->scroll which will skip the object inflation and give you the raw HashRef.

REINDEXING

ElasticSearch allows you to create aliases for each index. This makes it easy to reindex to a new index, and change the alias once the reindexing is done, to the new index. This is how you do it with ElasticSearchX::Model.

package MyModel;
use Moose;
use ElasticSearchX::Model;

index twitter => ( namespace => 'MyModel', alias_for => 'twitter_v1' );

This will create an index called twitter_v1 in ElasticSearch and an alias twitter. To reindex data, you simply add a second index with a different name but the same document classes:

index twitter_v2 => ( namespace => 'MyModel' );

Now deploy the new index and start reindexing your data to the new index:

$model->deploy;

my $old = $model->index('twitter');
my $new = $model->index('twitter_v2');
my $iterator = $old->type('tweet')->size(1000)->scroll;
while(my $tweet = $iterator->next) {
    $tweet->message('something else');
    $tweet->index($new);
    $tweet->put;
}

Afterwards, you simply remove the twitter_v2 index and set the alias_for attribute on index twitter to twitter_v2. You have to call $model->deploy again, which will automatically update the aliases.

AUTHOR

Moritz Onken

COPYRIGHT AND LICENSE

This software is Copyright (c) 2012 by Moritz Onken.

This is free software, licensed under:

The (three-clause) BSD License