NAME

Plucene::Analysis::CJKTokenizer - Tokenizer for CJK texts

SYNOPSIS

# isa Plucene::Analysis::Tokenizer

my $next = $chartokenizer->next;

DESCRIPTION

This module tokenizes CJK (Chinese, Japanese and Korean) text by emitting uni-gram tokens, that is, one token per CJK character (see also "PROBLEMS"). Because I understand little Japanese or Korean, I apply the same method to them as well, crude as that may be. Patches are always welcome.

METHODS

my $next = $chartokenizer->next;

This will return the next token in the string, or undef at the end of the string.
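The contract described above (one token per call, undef at the end) can be sketched without Plucene installed. The following is a minimal illustration, not the module's own code: the helper name make_unigram_iterator, the shortened character pattern, and the treatment of non-CJK words as single tokens are all assumptions made for the example.

```perl
use strict;
use warnings;

# Shortened, hypothetical stand-in for the module's pattern.
my $cjk = qr/[\p{InCJKUnifiedIdeographs}\p{InHiragana}\p{InKatakana}\p{InHangulSyllables}]/;

# Returns a closure; each call yields the next token, or undef at the end.
sub make_unigram_iterator {
    my ($text) = @_;
    pos($text) = 0;
    return sub {
        # Skip non-word characters, then take either one CJK character
        # or one run of non-CJK word characters (assumed behaviour).
        return $1 if $text =~ /\G\W*?($cjk|(?:(?!$cjk)\w)+)/gc;
        return undef;
    };
}

binmode STDOUT, ':encoding(UTF-8)';

# "Perl" followed by three CJK characters and a number.
my $next = make_unigram_iterator("Perl \x{3067}\x{691C}\x{7D22} 2024");
while (defined(my $token = $next->())) {
    print "$token\n";    # one token per line
}
```

The loop tests with defined() rather than truthiness so that a token such as "0" does not end the iteration early.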

GLOBAL VARIABLE

There is one pattern variable that you can modify to customize the tokenizer for a specific collection.

$InCJK

Default pattern for CJK characters. Default value is

 qr(
  \p{InCJKUnifiedIdeographs} |
  \p{InCJKUnifiedIdeographsExtensionA} |
  \p{InCJKUnifiedIdeographsExtensionB} |

  \p{InCJKCompatibilityForms} |
  \p{InCJKCompatibilityIdeographs} |
  \p{InCJKCompatibilityIdeographsSupplement} |

  \p{InCJKRadicalsSupplement} |
  \p{InCJKSymbolsAndPunctuation} |

  \p{InHiragana} |
  \p{InKatakana} |
  \p{InKatakanaPhoneticExtensions} |

  \p{InHangulCompatibilityJamo} |
  \p{InHangulJamo} |
  \p{InHangulSyllables}
 )x;
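As an illustration of what a customized pattern might look like, the sketch below builds a narrower pattern (Han ideographs and kana only, no Hangul) and checks single characters against it. The names $han_and_kana and is_cjk_char are hypothetical, and the package-qualified variable name in the final comment is an assumption that this documentation does not confirm.

```perl
use strict;
use warnings;

# A narrower, hypothetical pattern: Han ideographs and kana, no Hangul.
my $han_and_kana = qr(
    \p{InCJKUnifiedIdeographs} |
    \p{InHiragana} |
    \p{InKatakana}
)x;

# Hypothetical helper: true if $char is exactly one character matched by $pattern.
sub is_cjk_char {
    my ($char, $pattern) = @_;
    return $char =~ /\A(?:$pattern)\z/ ? 1 : 0;
}

printf "U+4E2D: %d\n", is_cjk_char("\x{4E2D}", $han_and_kana);  # Han ideograph
printf "U+D55C: %d\n", is_cjk_char("\x{D55C}", $han_and_kana);  # Hangul syllable

# Assumption (not verified against the module source): if $InCJK is a
# package variable of the tokenizer, an override might look like
#   $Plucene::Analysis::CJKTokenizer::InCJK = $han_and_kana;
```

Checking a candidate pattern against a few sample characters before installing it is a cheap way to catch a mistyped block name.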

PROBLEMS

I experimented with bigram tokens, but they kept failing, so bigram support has been left out of the current release.
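For readers unfamiliar with the term, a bigram tokenizer emits overlapping pairs of adjacent characters instead of single characters. The sketch below shows the general idea only; cjk_bigrams is a hypothetical helper and is not the code that was removed from this module.

```perl
use strict;
use warnings;

# Hypothetical helper: overlapping two-character tokens from a run of CJK characters.
sub cjk_bigrams {
    my ($run) = @_;
    my @chars = split //, $run;
    return @chars if @chars < 2;    # a lone character stays a unigram
    return map { $chars[$_] . $chars[$_ + 1] } 0 .. $#chars - 1;
}

# A run of four characters yields three overlapping pairs.
my @tokens = cjk_bigrams("\x{4E2D}\x{6587}\x{691C}\x{7D22}");
print scalar(@tokens), " bigrams\n";
```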

Speed is another issue.

COPYRIGHT

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
