NAME
Lingua::EN::Tokenizer::Offsets - Finds word (token) boundaries, and returns their offsets.
VERSION
version 0.01_02
SYNOPSIS
use Lingua::EN::Tokenizer::Offsets qw/token_offsets get_tokens/;
my $str = <<END;
Hey! Mr. Tambourine Man, play a song for me.
I'm not sleepy and there is no place I’m going to.
END
my $offsets = token_offsets($str); ## Get the offsets.
foreach my $o (@$offsets) {
    my $start  = $o->[0];
    my $length = $o->[1] - $o->[0];
    my $token  = substr($str, $start, $length); ## Get a token.
    # ...
}
### or
my $tokens = get_tokens($str);
foreach my $token (@$tokens) {
    ## do something with $token
}
METHODS
tokenize($text)
Takes text as input and returns a tokenized version (space-separated tokens).
get_offsets($text)
Takes text as input and returns a reference to an array containing pairs of character offsets, corresponding to each token's start and end positions.
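As an illustration of how such offset pairs relate to the text, here is a minimal sketch that does not call the module at all. The offsets are written by hand, and it assumes (as the SYNOPSIS suggests by computing the length as end minus start) that the end offset is exclusive.

```perl
use strict;
use warnings;

my $text = "Hello, world!";

# Hand-written offset pairs for illustration only (not produced by the module).
# Each pair is [start, end], with end assumed exclusive: length = end - start.
my @offsets = ( [0, 5], [5, 6], [7, 12], [12, 13] );

# Recover each token from its pair of offsets.
my @tokens = map { substr($text, $_->[0], $_->[1] - $_->[0]) } @offsets;

print join('|', @tokens), "\n";   # prints "Hello|,|world|!"
```

Note that the space between the comma and "world" belongs to no pair, so it appears in no token.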
get_tokens($text)
Takes text as input, splits it into tokens, and returns a reference to an array of tokens.
adjust_offsets($text,$offsets)
Makes minor adjustments to the offsets (leading/trailing whitespace, etc.).
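One kind of adjustment mentioned here, trimming whitespace at the edges of a span, could look roughly like the sketch below. The helper name trim_offsets is hypothetical and is not part of this module; the module's own adjustments may differ and may cover more cases.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the module's API: shrink each
# [start, end) span so that it excludes leading and trailing whitespace.
sub trim_offsets {
    my ($text, $offsets) = @_;
    my @trimmed;
    for my $o (@$offsets) {
        my ($start, $end) = @$o;
        $start++ while $start < $end && substr($text, $start, 1)   =~ /\s/;
        $end--   while $end > $start && substr($text, $end - 1, 1) =~ /\s/;
        push @trimmed, [$start, $end] if $start < $end;   # drop empty spans
    }
    return \@trimmed;
}

my $text    = "  play a song ";
my $trimmed = trim_offsets($text, [ [0, 6], [6, 8], [8, 14] ]);
print join(' ', map { sprintf('[%d,%d]', @$_) } @$trimmed), "\n";
# prints "[2,6] [7,8] [9,13]"
```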
initial_offsets($text)
Performs a first, naive delimitation of tokens.
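To give an idea of what a naive first pass might be, the sketch below treats every run of non-whitespace characters as one token. This is an assumption made for illustration: the helper naive_offsets is hypothetical, and the module's actual first pass may use different rules.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's code: treat every run of
# non-whitespace characters as one token and record its [start, end) span.
sub naive_offsets {
    my ($text) = @_;
    my @offsets;
    while ($text =~ /\S+/g) {
        push @offsets, [ $-[0], $+[0] ];   # start and end of the last match
    }
    return \@offsets;
}

my $offsets = naive_offsets("Hey! Mr. Man");
print join(' ', map { sprintf('[%d,%d]', @$_) } @$offsets), "\n";
# prints "[0,4] [5,8] [9,12]"
```

A pass like this leaves punctuation attached to words ("Hey!" is one span), which is why later refinement steps are needed.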
offsets2tokens($text,$offsets)
Given a list of token boundary offsets and a text, returns an array with the text split into tokens.
ACKNOWLEDGEMENTS
Based on the original tokenizer written by Josh Schroeder and provided by Europarl (http://www.statmt.org/europarl/).
SEE ALSO
Lingua::EN::Sentence::Offsets, Lingua::FreeLing3::Tokenizer
AUTHOR
Andre Santos <andrefs@cpan.org>
COPYRIGHT AND LICENSE
This software is copyright (c) 2012 by Andre Santos.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.