NAME
Uplug::PreProcess::Tokenizer
SYNOPSIS
my $tokenizer = Uplug::PreProcess::Tokenizer->new( lang => 'en' );
my @tokens = $tokenizer->tokenize( 'Mr. Smith says: "What is a text anyway?"' );
my $text = $tokenizer->detokenize( '" Big improvement ! " says Mr. Smith .' );
IMPLEMENTS
tokenize
Tokenize a given text. Returns a list of tokens.
detokenize
De-tokenize a space-separated text or a list of tokens. Returns plain text.
load_prefixes
Load language-specific abbreviations and other non-breaking prefixes.
DESCRIPTION
This module relies heavily on the tokenizer and detokenizer implementations from the Moses toolkit for statistical machine translation (SMT). All credit goes to the original authors, Josh Schroeder and Philipp Koehn.
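To illustrate the role of non-breaking prefixes in this style of tokenization, here is a minimal, self-contained sketch. It is not the module's actual code and does not reproduce the Moses rules: the prefix list `%NONBREAKING` and the helper `simple_tokenize` are hypothetical names introduced only for this example, and the punctuation handling is deliberately simplistic.

```perl
use strict;
use warnings;

# Hypothetical prefix list for illustration only; the real module loads
# language-specific prefixes via load_prefixes.
my %NONBREAKING = map { $_ => 1 } qw(Mr Mrs Dr Prof);

# Hypothetical helper, not part of Uplug: split punctuation from words,
# but keep a trailing period attached to known abbreviations.
sub simple_tokenize {
    my ($text) = @_;

    # Surround every punctuation mark except the period with spaces.
    $text =~ s/([^\w\s.])/ $1 /g;

    my @tokens;
    for my $word ( split ' ', $text ) {
        if ( $word =~ /^(.+)\.$/ && !$NONBREAKING{$1} ) {
            # Sentence-final period: emit it as a separate token.
            push @tokens, $1, '.';
        }
        else {
            # Plain word, lone punctuation mark, or protected abbreviation.
            push @tokens, $word;
        }
    }
    return @tokens;
}

my @tokens = simple_tokenize('Mr. Smith says: "What is a text anyway?"');
print join( ' ', @tokens ), "\n";
# prints: Mr. Smith says : " What is a text anyway ? "
```

Without the prefix lookup, "Mr." would be split into "Mr" and ".", which is the kind of error a list of abbreviations is meant to prevent.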