NAME

Plucene::Analysis::CJKTokenizer - Tokenizer for CJK texts

SYNOPSIS

# isa Plucene::Analysis::Tokenizer

my $next = $chartokenizer->next;

DESCRIPTION

This module tokenizes CJK text by creating unigram tokens, that is, one token per CJK character (see also "PROBLEMS"). Because I know little Japanese or Korean, I simply apply the same method to those languages without any language-specific handling. Patches are always welcome.

METHODS

next

my $next = $chartokenizer->next;

This will return the next token in the string, or undef at the end of the string.
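The usual way to consume such a method is to call it in a loop until it returns undef. The sketch below illustrates that loop without depending on Plucene: make_unigram_iterator is a hypothetical stand-in, not part of this module, and it yields plain one-character strings, whereas the real next may return token objects. Skipping non-CJK characters is a simplification made for this sketch only.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the tokenizer: returns a closure that yields
# one CJK ideograph per call, then undef once the input is exhausted.
sub make_unigram_iterator {
    my ($text) = @_;
    # Assumption: only the CJK Unified Ideographs block is recognised here.
    my @chars = $text =~ /(\p{InCJKUnifiedIdeographs})/g;
    return sub { return shift @chars };
}

# U+4E2D U+6587 followed by Latin text and U+5B57.
my $next = make_unigram_iterator("\x{4E2D}\x{6587} abc \x{5B57}");

my @tokens;
while (defined(my $token = $next->())) {
    push @tokens, $token;
}

printf "%d tokens\n", scalar @tokens;    # prints "3 tokens"
```

Testing with defined rather than plain truth keeps the loop from stopping early on a token that happens to be a false value such as "0".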

GLOBAL VARIABLE

There is one pattern variable that you can modify to customize the tokenizer for a specific collection.

$InCJK

Default pattern for CJK characters. The default value is:

 qr(
  \p{InCJKUnifiedIdeographs} |
  \p{InCJKUnifiedIdeographsExtensionA} |
  \p{InCJKUnifiedIdeographsExtensionB} |

  \p{InCJKCompatibilityForms} |
  \p{InCJKCompatibilityIdeographs} |
  \p{InCJKCompatibilityIdeographsSupplement} |

  \p{InCJKRadicalsSupplement} |
  \p{InCJKSymbolsAndPunctuation} |

  \p{InHiragana} |
  \p{InKatakana} |
  \p{InKatakanaPhoneticExtensions} |

  \p{InHangulCompatibilityJamo} |
  \p{InHangulJamo} |
  \p{InHangulSyllables}
 )x;
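To give a feel for what a customized pattern would accept, here is a minimal sketch that builds a narrower pattern (Han ideographs and kana only) and checks a few sample characters against it. $my_cjk and %sample are hypothetical names used only for illustration. Assigning such a pattern to $InCJK in the module's package is assumed, not verified here, to be how the customization takes effect.

```perl
use strict;
use warnings;

# Hypothetical narrower pattern: Han ideographs and kana, no Hangul.
my $my_cjk = qr(
    \p{InCJKUnifiedIdeographs} |
    \p{InHiragana} |
    \p{InKatakana}
)x;

# One sample character per script.
my %sample = (
    han      => "\x{6F22}",    # U+6F22, CJK Unified Ideographs
    hiragana => "\x{3042}",    # U+3042, Hiragana
    hangul   => "\x{D55C}",    # U+D55C, Hangul Syllables
    latin    => "a",
);

# Record whether each sample is matched, as a whole, by the pattern.
my %matched =
    map { $_ => ($sample{$_} =~ /\A$my_cjk\z/ ? 1 : 0) } keys %sample;

for my $name (sort keys %matched) {
    printf "%-8s %s\n", $name,
        $matched{$name} ? "matches" : "does not match";
}
```

With this pattern the Han and hiragana samples match, while the Hangul and Latin samples do not.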

PROBLEMS

I have tested bigram tokenization, but it keeps failing, so it has been left out of the current release.

Speed is another issue.

SEE ALSO

Plucene

Plucene::Analysis::CJKAnalyzer

MIME::Base64

COPYRIGHT

Copyright (C) 2006 by Yung-chung Lin (a.k.a. xern) <xern@cpan.org>

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.