NAME

Lingua::JA::NormalizeText - text normalizer

SYNOPSIS

use Lingua::JA::NormalizeText;
use utf8;

my @options = ( qw/nfkc decode_entities/, \&dearinsu_to_desu );
my $normalizer = Lingua::JA::NormalizeText->new(@options);

print $normalizer->normalize('鳥が㌧㌦でありんす&hearts;');
# -> 鳥がトンドルです♥

sub dearinsu_to_desu
{
    my $text = shift;
    $text =~ s/でありんす/です/g;

    return $text;
}

# or

use Lingua::JA::NormalizeText qw/nfkc decode_entities/;
use utf8;

my $text = '鳥が㌧㌦でありんす&hearts;';
print dearinsu_to_desu( decode_entities( nfkc($text) ) );
# -> 鳥がトンドルです♥

sub dearinsu_to_desu
{
    my $text = shift;
    $text =~ s/でありんす/です/g;

    return $text;
}

DESCRIPTION

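Lingua::JA::NormalizeText normalizes Japanese text by applying a chosen sequence of normalization options, such as Unicode normalization, width conversion, and kana conversion.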

METHODS

new(@options)

Creates a new Lingua::JA::NormalizeText instance.

The following options are available.

OPTION                 SAMPLE INPUT        OUTPUT FOR SAMPLE INPUT
---------------------  ------------------  -----------------------
lc                     DdD                 ddd
uc                     DdD                 DDD
nfkc                   ㌦                  ドル (length: 2)
nfkd                   ㌦                  ドル (length: 3)
nfc
nfd
decode_entities        &hearts;            ♥
strip_html             <em>あ</em>         あ
alnum_z2h              ＡＢＣ１２３        ABC123
alnum_h2z              ABC123              ＡＢＣ１２３
space_z2h
space_h2z
katakana_z2h           ハァハァ            ﾊｧﾊｧ
katakana_h2z           ｽｰﾊｰｽｰﾊｰ            スーハースーハー
katakana2hiragana      パンツ              ぱんつ
hiragana2katakana      ぱんつ              パンツ
unify_3dots            はぁ。。。          はぁ…
wave2tilde             〜                  ～
tilde2wave             ～                  〜
wavetilde2long         〜, ～              ー
wave2long              〜                  ー
tilde2long             ～                  ー
fullminus2long         −                   ー
dashes2long            —                   ー
drawing_lines2long     ─                   ー
unify_long_repeats     ヴァーーー          ヴァー
nl2space               (new line)          (space)
unify_long_spaces      (space)(space)      (space)
remove_head_space      (space)あ(space)あ  あ(space)あ
remove_tail_space      ああ(space)(space)  ああ
modernize_kana_usage   ゐヰゑヱ            いイえエ

Options are applied in the order in which they appear in @options: the first element is applied first and the last element is applied last.
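Because each step receives the output of the previous one, reordering the same steps can change the result. The following is a minimal sketch of that effect using only core Perl. The names apply_in_order, nfkc_step, and kata2hira_step are hypothetical stand-ins written for this example; they are not the module's own implementations of nfkc and katakana2hiragana.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw/NFKC/;

# Stand-in for an NFKC step (assumption: not the module's code).
sub nfkc_step
{
    my $text = shift;

    return NFKC($text);
}

# Stand-in for a katakana-to-hiragana step (assumption: not the module's code).
sub kata2hira_step
{
    my $text = shift;
    $text =~ tr/ァ-ヶ/ぁ-ゖ/;

    return $text;
}

# Apply each code reference to the text, in list order.
sub apply_in_order
{
    my ( $text, @steps ) = @_;
    $text = $_->($text) for @steps;

    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';

print apply_in_order( '㌦', \&nfkc_step, \&kata2hira_step ), "\n";
# -> どる

print apply_in_order( '㌦', \&kata2hira_step, \&nfkc_step ), "\n";
# -> ドル
```

In the second call the kana step sees only the single compatibility character ㌦, which lies outside the katakana range, so it has nothing to convert; the katakana appears only after the NFKC step has run.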

You can also add your own functions by passing code references in @options. (See the dearinsu_to_desu function in the SYNOPSIS section.)
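Judging from the SYNOPSIS, a user-supplied function takes the text as its only argument and returns the modified text. The sketch below follows that pattern; squeeze_exclamations is a hypothetical function invented for this example, and the module is loaded only if it happens to be installed.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical user-defined step: collapse a run of exclamation marks
# (ASCII or fullwidth) down to the first mark in the run.
sub squeeze_exclamations
{
    my $text = shift;
    $text =~ s/([!！])[!！]+/$1/g;

    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';

if ( eval { require Lingua::JA::NormalizeText; 1 } )
{
    # The custom step is listed first, so it runs before nfkc.
    my $normalizer = Lingua::JA::NormalizeText->new( \&squeeze_exclamations, 'nfkc' );
    print $normalizer->normalize('やった！！！'), "\n";
}
else
{
    # Without the module, the function can still be called directly.
    print squeeze_exclamations('やった！！！'), "\n";
    # -> やった！
}
```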

normalize($text)

Normalizes $text by applying the options given to new(), in order, and returns the result.

AUTHOR

pawa <pawapawa@cpan.org>

SEE ALSO

LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.