NAME
Lucy::Analysis::Token - Unit of text.
SYNOPSIS
my $token = Lucy::Analysis::Token->new(
    text         => 'blind',
    start_offset => 8,
    end_offset   => 13,
);
$token->set_text('mice');
DESCRIPTION
Token is the fundamental unit used by Apache Lucy’s Analyzer subclasses. Each Token has five attributes: text, start_offset, end_offset, boost, and pos_inc.
The text attribute is a Unicode string encoded as UTF-8.
start_offset is the start point of the token text, measured in Unicode code points from the top of the stored field; end_offset delimits the corresponding closing boundary. start_offset and end_offset locate the Token within a larger context, even if the Token’s text attribute gets modified – by stemming, for instance. The Token for “beating” in the text “beating a dead horse” begins life with a start_offset of 0 and an end_offset of 7; after stemming, the text is “beat”, but the start_offset is still 0 and the end_offset is still 7. This allows “beating” to be highlighted correctly after a search matches “beat”.
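The “beating” example can be sketched in a few lines. This is a minimal illustration, assuming Lucy is installed; the set_text call merely stands in for a stemmer, and the `<b>` markup and the variable names are illustrative choices rather than anything Lucy's own highlighter produces.

```perl
use strict;
use warnings;
use Lucy::Analysis::Token;

my $source = 'beating a dead horse';

# A Token as a tokenizer might emit it; offsets are in code points.
my $token = Lucy::Analysis::Token->new(
    text         => 'beating',
    start_offset => 0,
    end_offset   => 7,
);

# Stand-in for a stemmer: only the text changes, not the offsets.
$token->set_text('beat');

my $start = $token->get_start_offset;
my $end   = $token->get_end_offset;

# On a decoded string, substr counts characters (code points),
# so the offsets can be applied to the source text directly.
my $original    = substr( $source, $start, $end - $start );
my $highlighted = $source;
substr( $highlighted, $start, $end - $start, "<b>$original</b>" );

print "indexed term: ", $token->get_text, "\n";
print "highlighted:  $highlighted\n";
```

The indexed term is the stemmed text, while the highlighted span is recovered from the untouched offsets.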
boost is a per-token weight. Use this when you want to assign more or less importance to a particular token, as you might for emboldened text within an HTML document, for example. (Note: The field this token belongs to must be spec’d to use a posting of type RichPosting.)
pos_inc is the POSition INCrement, measured in Tokens. This attribute, which defaults to 1, is an advanced tool for manipulating phrase matching. Ordinarily, Tokens are assigned consecutive position numbers: 0, 1, and 2 for "three blind mice". However, if you set the position increment for “blind” to, say, 1000, then the three tokens will end up assigned to positions 0, 1, and 1001 – and will no longer produce a phrase match for the query "three blind mice".
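The position arithmetic described above can be reproduced in plain Perl. The sketch below does not call Lucy at all; assign_positions is a hypothetical helper, and it assumes, consistent with the 0, 1, 1001 example, that each token's position is the running total of the increments of the tokens before it.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lucy: each token takes the current
# running position, then its own pos_inc is added for the next token.
sub assign_positions {
    my @pos_incs = @_;
    my @positions;
    my $next = 0;
    for my $inc (@pos_incs) {
        push @positions, $next;
        $next += $inc;
    }
    return @positions;
}

# "three blind mice" with default increments.
my @default = assign_positions( 1, 1, 1 );

# Same tokens, but "blind" carries a pos_inc of 1000.
my @gapped = assign_positions( 1, 1000, 1 );

print "@default\n";    # 0 1 2
print "@gapped\n";     # 0 1 1001
```

With the gap in place, “blind” and “mice” are no longer adjacent, which is why the phrase query stops matching.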
CONSTRUCTORS
new
my $token = Lucy::Analysis::Token->new(
    text         => $text,          # required
    start_offset => $start_offset,  # required
    end_offset   => $end_offset,    # required
    boost        => 1.0,            # optional
    pos_inc      => 1,              # optional
);
text - A string.
start_offset - Start offset into the original document in Unicode code points.
end_offset - End offset into the original document in Unicode code points.
boost - Per-token weight.
pos_inc - Position increment for phrase matching.
METHODS
get_text
my $text = $token->get_text;
Get the token's text.
set_text
$token->set_text($text);
Set the token's text.
get_start_offset
my $int = $token->get_start_offset();
get_end_offset
my $int = $token->get_end_offset();
get_boost
my $float = $token->get_boost();
get_pos_inc
my $int = $token->get_pos_inc();
get_len
my $int = $token->get_len();
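As a rough illustration of the accessors, the sketch below builds a Token with all five attributes and reads each one back. It assumes Lucy is installed; the attribute values are arbitrary, and the value of get_len is printed without interpretation because its unit is not stated here.

```perl
use strict;
use warnings;
use Lucy::Analysis::Token;

# "blind" as it would sit inside the text "three blind mice".
my $token = Lucy::Analysis::Token->new(
    text         => 'blind',
    start_offset => 6,
    end_offset   => 11,
    boost        => 1.5,     # optional, defaults to 1.0
    pos_inc      => 1000,    # optional, defaults to 1
);

printf "text:         %s\n", $token->get_text;
printf "start_offset: %d\n", $token->get_start_offset;
printf "end_offset:   %d\n", $token->get_end_offset;
printf "boost:        %s\n", $token->get_boost;
printf "pos_inc:      %d\n", $token->get_pos_inc;
printf "len:          %d\n", $token->get_len;
```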
INHERITANCE
Lucy::Analysis::Token isa Clownfish::Obj.