NAME

Lingua::CJK::Tokenizer - CJK Tokenizer

SYNOPSIS

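use Lingua::CJK::Tokenizer;

my $tknzr = Lingua::CJK::Tokenizer->new();

# configure n-gram size and the cap on returned tokens
$tknzr->ngram_size(5);
$tknzr->max_token_count(100);

# each of these returns a reference to the resulting tokens
my $tokens_ref = $tknzr->tokenize("CJK Text");
$tokens_ref = $tknzr->segment("CJK Text");
$tokens_ref = $tknzr->split("CJK Text");

# boolean checks on the input text
my $flag = $tknzr->has_cjk("CJK Text");
$flag = $tknzr->has_cjk_only("CJK Text");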

DESCRIPTION

This module tokenizes CJK (Chinese, Japanese, and Korean) text into n-grams.

METHODS

ngram_size

Sets the size of the returned n-grams.

max_token_count

Sets an upper limit on the number of n-grams returned, for cases where the input text is very long or of indefinite size.

tokenize

Tokenizes text into n-grams.
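
To illustrate what n-gram tokenization with a size and a token limit means, here is a minimal pure-Perl sketch. It does not use this module and is not its implementation: `char_ngrams` is a hypothetical helper, and it assumes overlapping, fixed-size character n-grams, which this documentation does not confirm as the module's exact behavior.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper, NOT part of Lingua::CJK::Tokenizer.
# Returns a reference to a list of overlapping character n-grams of
# size $n taken from $text, stopping once $max n-grams are collected.
sub char_ngrams {
    my ($text, $n, $max) = @_;
    my @chars = split //, $text;    # one element per character
    my @grams;
    for my $i (0 .. @chars - $n) {  # empty range if the text is shorter than $n
        last if @grams >= $max;
        push @grams, join '', @chars[$i .. $i + $n - 1];
    }
    return \@grams;
}

binmode STDOUT, ':encoding(UTF-8)';
my $grams = char_ngrams("東京都に住む", 2, 3);
print join(' ', @$grams), "\n";    # prints: 東京 京都 都に
```

In this sketch the second argument plays the role that ngram_size plays for the module, and the third the role of max_token_count.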

segment

Cuts CJK text into chunks.

split

Tokenizes text into unigrams.

has_cjk

Returns true if the text contains CJK characters.

has_cjk_only

Returns true if the text contains only CJK characters.
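
The difference between the two checks can be sketched in pure Perl with Unicode script properties. This is an illustration only, not the module's implementation: `contains_cjk` and `contains_only_cjk` are hypothetical names, and treating "CJK" as the Han, Hiragana, Katakana, and Hangul scripts is an assumption that may not match the character ranges the module uses.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical stand-ins, NOT the module's has_cjk / has_cjk_only.
# Assumption: "CJK" means Han, Hiragana, Katakana, or Hangul characters.
my $CJK = qr/[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]/;

# True if at least one character of $text is CJK.
sub contains_cjk {
    my ($text) = @_;
    return $text =~ /$CJK/ ? 1 : 0;
}

# True if $text is non-empty and every character is CJK.
sub contains_only_cjk {
    my ($text) = @_;
    return $text =~ /\A$CJK+\z/ ? 1 : 0;
}

print contains_cjk("abc 漢字"),      "\n";   # mixed text contains CJK
print contains_only_cjk("abc 漢字"), "\n";   # but is not CJK-only
```

Under these assumptions a mixed string such as "abc 漢字" passes the first check and fails the second. How the module itself treats empty strings, whitespace, or punctuation is not stated here.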

PREREQUISITE

This module requires libunicode by Tom Tromey.

COPYRIGHT

This program is free software; you can redistribute it and/or modify it under the MIT License.
