NAME
Plucene::Analysis::CJKTokenizer - Tokenizer for CJK texts
SYNOPSIS
# isa Plucene::Analysis::Tokenizer
my $next = $chartokenizer->next;
DESCRIPTION
This module tokenizes CJK text by creating unigram (single-character) tokens from it (see also "PROBLEMS"). Because I do not know much Japanese or Korean, I naively apply the same method to them as well. Patches are always welcome.
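As a rough illustration of what unigram tokenization means, the sketch below uses core Perl only and does not call this module. The helper name unigrams and the reduced pattern $cjk are hypothetical, and the sketch simply skips non-CJK characters; how the module itself treats such characters is not stated here.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the module's pattern; only a subset of blocks.
my $cjk = qr/\p{InCJKUnifiedIdeographs}|\p{InHiragana}|\p{InKatakana}|\p{InHangulSyllables}/;

# Hypothetical helper: one token per CJK character, everything else is skipped.
sub unigrams {
    my ($text) = @_;
    my @tokens = $text =~ /($cjk)/g;
    return @tokens;
}

# Three ideographs (U+65E5 U+672C U+8A9E) followed by Latin text.
my @tokens = unigrams("\x{65E5}\x{672C}\x{8A9E} text");
```

Each ideograph becomes its own token, which keeps the tokenizer simple at the cost of producing many very short tokens.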
METHODS
next
my $next = $chartokenizer->next;
This will return the next token in the string, or undef at the end of the string.
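The usual way to consume such an interface is to call next until it returns undef. The sketch below shows that loop against a hypothetical toy class, My::ToyTokenizer, which is not part of Plucene and only mimics the contract described above. It returns plain strings; the real tokenizer's return values may well be token objects instead.

```perl
use strict;
use warnings;

# Hypothetical toy class: it only mimics the
# "next returns a token, or undef when exhausted" contract.
package My::ToyTokenizer;

sub new {
    my ($class, @tokens) = @_;
    return bless { queue => [@tokens] }, $class;
}

sub next {
    my ($self) = @_;
    return shift @{ $self->{queue} };    # shift on an empty array yields undef
}

package main;

my $chartokenizer = My::ToyTokenizer->new("\x{4E2D}", "\x{6587}");

my @seen;
# defined() keeps the loop going even if a token happens to be "0".
while (defined(my $token = $chartokenizer->next)) {
    push @seen, $token;
}
```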
GLOBAL VARIABLE
There is one pattern variable that you can modify to customize the tokenizer for a specific collection.
$InCJK
Default pattern for CJK characters. The default value is
qr(
\p{InCJKUnifiedIdeographs} |
\p{InCJKUnifiedIdeographsExtensionA} |
\p{InCJKUnifiedIdeographsExtensionB} |
\p{InCJKCompatibilityForms} |
\p{InCJKCompatibilityIdeographs} |
\p{InCJKCompatibilityIdeographsSupplement} |
\p{InCJKRadicalsSupplement} |
\p{InCJKSymbolsAndPunctuation} |
\p{InHiragana} |
\p{InKatakana} |
\p{InKatakanaPhoneticExtensions} |
\p{InHangulCompatibilityJamo} |
\p{InHangulJamo} |
\p{InHangulSyllables}
)x;
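A narrower pattern can be built in the same style and checked before it is handed to the tokenizer. The sketch below is an assumption-laden example: the name $han_kana is hypothetical, and the fully qualified variable name in the commented-out assignment is a guess at where $InCJK lives, so verify it against the module source before relying on it.

```perl
use strict;
use warnings;

# Hypothetical narrower pattern: Han ideographs and kana only, no Hangul.
my $han_kana = qr(
    \p{InCJKUnifiedIdeographs} |
    \p{InHiragana} |
    \p{InKatakana}
)x;

# Assumption: $InCJK is a package variable of the tokenizer module.
# Uncomment only after confirming the name in the installed module.
# $Plucene::Analysis::CJKTokenizer::InCJK = $han_kana;

# Quick sanity checks on single characters.
my $matches_kana   = "\x{3042}" =~ /\A$han_kana\z/ ? 1 : 0;    # HIRAGANA LETTER A
my $matches_hangul = "\x{D55C}" =~ /\A$han_kana\z/ ? 1 : 0;    # a Hangul syllable
```

Checking a few representative characters this way makes it easier to see which scripts a customized pattern covers.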
PROBLEMS
I have tried producing bigram tokens, but it keeps failing, so that code has been removed from the current release.
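For readers unfamiliar with the term, bigram tokens are usually overlapping pairs of adjacent characters. The sketch below only illustrates that idea in core Perl; it is not the code that was removed from this module, and the helper name bigrams is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: overlapping two-character tokens from a run of characters.
sub bigrams {
    my ($run) = @_;
    my @chars = split //, $run;
    return @chars if @chars < 2;    # a lone character stays a unigram
    return map { $chars[$_] . $chars[ $_ + 1 ] } 0 .. $#chars - 1;
}

# Three ideographs (U+4E2D U+6587 U+5B57) yield two overlapping pairs.
my @pairs = bigrams("\x{4E2D}\x{6587}\x{5B57}");
```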
Speed is another issue.
SEE ALSO
Plucene::Analysis::CJKAnalyzer
COPYRIGHT
Copyright (C) 2006 by Yung-chung Lin (a.k.a. xern) <xern@cpan.org>
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.