NAME
ElasticSearchX::Model::Tutorial - Tutorial for ElasticSearchX::Model
VERSION
version 0.0.3
INTRODUCTION
In this tutorial we walk through the ElasticSearch example on http://www.elasticsearch.org/. Go ahead and read it first; it gives you good insight into how ElasticSearch works and how ElasticSearchX::Model can help to make it even more elastic.
DOCUMENTS
ElasticSearch is a document-based storage system. Even though it describes itself as schema-free, using ElasticSearch without defining a proper schema (or mapping, as ElasticSearch calls it) is not recommended.
ElasticSearchX::Model::Document takes care of that. The ElasticSearch example consists of two types: tweet and user. The tweet type contains the properties user, post_date and message. The user type contains only the name property. Using ElasticSearchX::Model::Document this looks like:
package MyModel::Tweet;
use Moose;
use ElasticSearchX::Model::Document;
use DateTime; # needed for the post_date default below
has id => ( is => 'ro', id => [qw(user post_date)] );
has user => ( is => 'ro', isa => 'Str' );
has post_date => ( is => 'ro', isa => 'DateTime', required => 1, default => sub { DateTime->now } );
has message => ( is => 'rw', isa => 'Str', index => 'analyzed' );
package MyModel::User;
use Moose;
use ElasticSearchX::Model::Document;
has nickname => ( is => 'ro', isa => 'Str', id => 1 );
has name => ( is => 'ro', isa => 'Str' );
You might be wondering why there is an additional id attribute and a nickname. The id attribute in the Tweet class is built dynamically by concatenating the values of user and post_date. This value is digested using SHA1 and used as the id for the document. If you want to change the message of a tweet, you don't have to delete the old record and add a new one; simply change the message and reindex the document. Since the id stays the same, the new record overwrites the old one. Also, you don't have to keep track of incrementing numerical document ids.
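The overwrite-by-id behaviour described above can be sketched with core Perl only. This is a minimal illustration, not the module's implementation: derive_id and put_doc are hypothetical helpers, the "\0" separator and hex encoding are assumptions, and a plain hash stands in for the tweet type.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);

# Hypothetical helper: derive a stable id from the values that define it.
# The separator and the hex encoding are assumptions made for this sketch.
sub derive_id {
    my (@values) = @_;
    return sha1_hex( join "\0", @values );
}

my %store;    # in-memory stand-in for a type: id => document

# Hypothetical helper: store a document under its derived id.
sub put_doc {
    my ($doc) = @_;
    my $id = derive_id( $doc->{user}, $doc->{post_date} );
    $store{$id} = $doc;    # same id, so an earlier version is overwritten
    return $id;
}

my $first = put_doc({
    user      => 'mo',
    post_date => '2011-01-01T12:00:00',
    message   => 'Elastic baby!',
});

# Same user and post_date, different message: the id does not change.
my $second = put_doc({
    user      => 'mo',
    post_date => '2011-01-01T12:00:00',
    message   => 'Edited message',
});
```

Because both calls produce the same id, the store still holds a single document, now carrying the edited message.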
In the User class, the nickname attribute acts as id. Since it does not depend on the value of any other attribute, the id matches the nickname.
ElasticSearch will assign a random id to the document if there is no id attribute.
MAPPING
Each document belongs to a type. Think of it as a table in a relational database. And each type belongs to an index, which corresponds to a database.
Modeling indices and types with ElasticSearchX::Model is pretty easy, and the types have actually already been built: the meta objects of the document classes describe the types. They include all the information necessary to build a type mapping. You can even use MooseX::Types::Structured to build deeply nested structures that will be translated to object properties in ElasticSearch. DateTime attributes become a Date type and so on.
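To make the idea of deriving a mapping from attribute metadata concrete, here is a rough sketch in plain Perl. It does not reproduce what ElasticSearchX::Model does internally: build_mapping is a hypothetical helper, and the type table (including the ElasticSearch type names other than the date type mentioned above) is an assumption for illustration.

```perl
use strict;
use warnings;

# Assumed translation table from Moose type names to ElasticSearch types.
my %ES_TYPE_FOR = (
    Str      => 'string',
    Int      => 'integer',
    Bool     => 'boolean',
    DateTime => 'date',
);

# Hypothetical helper: turn attribute descriptions into a mapping structure.
sub build_mapping {
    my (%attributes) = @_;
    my %properties;
    for my $name ( sort keys %attributes ) {
        my $isa = $attributes{$name}{isa} // 'Str';
        # Unknown types fall back to 'object' in this sketch.
        $properties{$name} = { type => $ES_TYPE_FOR{$isa} // 'object' };
    }
    return { properties => \%properties };
}

my $tweet_mapping = build_mapping(
    user      => { isa => 'Str' },
    post_date => { isa => 'DateTime' },
    message   => { isa => 'Str' },
);
```

The resulting structure has one entry per attribute, with post_date mapped to a date type.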
Indices are defined in a model class:
package MyModel;
use Moose;
use ElasticSearchX::Model;
index twitter => ( namespace => 'MyModel' );
This is all you need to define the index and its types. The namespace option of the index twitter will load all classes in the MyModel namespace and add them to the twitter index. Actually, you don't even have to define the namespace in this case, since the namespace defaults to the name of the model class. You can also load types explicitly by defining a types option:
index twitter => ( types => [qw(MyModel::Tweet MyModel::User)] );
Make sure that the classes are loaded. See ElasticSearchX::Model::Index for all the available options.
To deploy the indices and mappings to ElasticSearch, simply call
my $model = MyModel->new;
$model->deploy;
This will try to connect to an ElasticSearch instance on 127.0.0.1:9200. See "es" in ElasticSearchX::Model for more information.
INDEXING
Indexing describes the process of adding documents to types.
use DateTime;
my $twitter = $model->index('twitter');
my $timestamp = DateTime->now;
my $tweet = $twitter->type('tweet')->put({
    user      => 'mo',
    post_date => $timestamp,
    message   => 'Elastic baby!',
}, { refresh => 1 });
$twitter->type('tweet')->count; # 1
The first parameter contains the property/value pairs. The post_date property is special because it is a DateTime object. Objects are deflated prior to insertion. This is handled by MooseX::Attribute::Deflator and is configured in ElasticSearchX::Model::Document::Types. You can easily add deflators for other objects.
Since the post_date property is required and has a default, you don't even have to pass it to put. ElasticSearchX::Model will automatically build values for required attributes. If there is no builder or default, it will throw an exception.
The second parameter to "put" in ElasticSearchX::Model::Document::Set tells ElasticSearch to refresh the index immediately. Otherwise it can take up to one second for the server to refresh, and the subsequent call to "count" in ElasticSearchX::Model::Document::Set would return 0.
If you index large numbers of documents, it is advised to call "refresh" in ElasticSearchX::Model::Index once you are finished and not on every put.
RETRIEVING
Documents can be retrieved either with their id or by providing the properties that define the id:
my $tweet_copy = $twitter->type('tweet')->get($tweet->id);
# or
$tweet_copy = $twitter->type('tweet')->get({
    user      => 'mo',
    post_date => $timestamp,
});
Objects that have been deflated (i.e. post_date) will be inflated again. Thus, $tweet_copy->post_date is a DateTime object again.
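The deflate/inflate round trip can be illustrated without the module. The sketch below uses the core Time::Piece class as a stand-in for DateTime, and the hypothetical helpers deflate_date and inflate_date stand in for what MooseX::Attribute::Deflator does; the date format is an assumption.

```perl
use strict;
use warnings;
use Time::Piece;

# Assumed wire format for dates in this sketch.
my $FORMAT = '%Y-%m-%dT%H:%M:%S';

# Hypothetical helper: object -> plain string, as sent to the server.
sub deflate_date {
    my ($time) = @_;
    return $time->strftime($FORMAT);
}

# Hypothetical helper: plain string -> object, as returned to the caller.
sub inflate_date {
    my ($string) = @_;
    return Time::Piece->strptime( $string, $FORMAT );
}

my $post_date = Time::Piece->strptime( '2011-01-01T12:00:00', $FORMAT );
my $stored    = deflate_date($post_date);    # a plain string
my $restored  = inflate_date($stored);       # an object again
```

After the round trip, $restored is an object representing the same point in time as $post_date, while $stored is just a string.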
If you don't really care about objects or need extra speed, you can set "inflate" in ElasticSearchX::Model::Document::Set to 0. This will return the raw response from ElasticSearch.
$twitter->type('tweet')->raw->get($tweet->id);
SEARCHING AND SCROLLING
ElasticSearch is, you know, for search. ElasticSearchX::Model::Document::Set tries to help you with ElasticSearch's very verbose query syntax.
my @tweets = $twitter->type('tweet')->filter({
    term => { user => 'mo' }
})->query({
    field => { 'message.analyzed' => 'baby' }
})->size(100)->all;
If you need to retrieve large amounts of data, you probably want to scroll through the results, which is much faster and safer than paging manually using "from" in ElasticSearchX::Model::Document::Set.
my $iterator = $twitter->type('tweet')->scroll;
while (my $tweet = $iterator->next) {
    # do something with $tweet
}
For extra speed use $twitter->type('tweet')->raw->scroll, which will skip the object inflation and give you the raw HashRef.
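The consumer side of scrolling (fetch a batch, hand out one document at a time, fetch the next batch when the buffer runs dry) can be sketched in plain Perl. This only mimics the pattern: make_scroller is a hypothetical helper, and an array stands in for the server-side result set that a real scroll keeps open.

```perl
use strict;
use warnings;

# Hypothetical helper: returns an iterator that pulls $size documents
# at a time from $docs and hands them out one by one.
sub make_scroller {
    my ( $docs, $size ) = @_;
    my $offset = 0;
    my @buffer;
    return sub {
        # Refill the buffer only when it is empty and documents remain.
        if ( !@buffer && $offset < @$docs ) {
            my $end = $offset + $size - 1;
            $end = $#$docs if $end > $#$docs;
            @buffer = @{$docs}[ $offset .. $end ];
            $offset = $end + 1;
        }
        return shift @buffer;    # undef once everything has been consumed
    };
}

my @all_docs = map { +{ user => 'mo', message => "tweet $_" } } 1 .. 5;

my $next = make_scroller( \@all_docs, 2 );
my @seen;
while ( my $doc = $next->() ) {
    push @seen, $doc->{message};
}
```

The loop sees every document exactly once while never holding more than one batch in the buffer.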
REINDEXING
ElasticSearch allows you to create aliases for each index. This makes it easy to reindex into a new index and, once the reindexing is done, point the alias at the new index. This is how you do it with ElasticSearchX::Model:
package MyModel;
use Moose;
use ElasticSearchX::Model;
index twitter => ( namespace => 'MyModel', alias_for => 'twitter_v1' );
This will create an index called twitter_v1 in ElasticSearch and an alias twitter. To reindex data, you simply add a second index with a different name but the same document classes:
index twitter_v2 => ( namespace => 'MyModel' );
Now deploy the new index and start reindexing your data to the new index:
$model->deploy;
my $old = $model->index('twitter');
my $new = $model->index('twitter_v2');
my $iterator = $old->type('tweet')->size(1000)->scroll;
while (my $tweet = $iterator->next) {
    $tweet->message('something else');
    $tweet->index($new);
    $tweet->put;
}
Afterwards, you simply remove the twitter_v2 index definition from the model class and set the alias_for attribute on index twitter to twitter_v2. You have to call $model->deploy again, which will automatically update the aliases.
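Why the alias makes this switch painless can be shown with a toy model in plain Perl. Nothing here talks to ElasticSearch: the two hashes and the resolve helper are hypothetical stand-ins for the server's indices and alias table.

```perl
use strict;
use warnings;

# In-memory stand-ins for the indices and the alias table on the server.
my %indices = (
    twitter_v1 => [ { user => 'mo', message => 'Elastic baby!' } ],
    twitter_v2 => [],
);
my %alias = ( twitter => 'twitter_v1' );

# Hypothetical helper: look up an index by alias or by its real name.
sub resolve {
    my ($name) = @_;
    return $indices{ $alias{$name} // $name };
}

# Reindex: copy every document from the aliased index into the new one,
# changing the message along the way.
for my $doc ( @{ resolve('twitter') } ) {
    push @{ $indices{twitter_v2} }, { %$doc, message => 'something else' };
}

# Switch the alias once reindexing is done.
$alias{twitter} = 'twitter_v2';

my $current = resolve('twitter');
```

Code that only ever refers to the alias twitter now reads the reindexed documents, while the old index remains untouched until it is removed.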
AUTHOR
Moritz Onken
COPYRIGHT AND LICENSE
This software is Copyright (c) 2011 by Moritz Onken.
This is free software, licensed under:
The (three-clause) BSD License