
NAME

Lingua::JA::TFWebIDF - TF*WebIDF calculator

SYNOPSIS

use Lingua::JA::TFWebIDF;
use utf8;
use feature qw/say/;
use Data::Printer;

my $tfidf = Lingua::JA::TFWebIDF->new(
    api               => 'YahooPremium',
    appid             => $appid,
    fetch_df          => 1,
    Furl_HTTP         => { timeout => 3 },
    driver            => 'TokyoCabinet',
    df_file           => './yahoo.tch',
    pos1_filter       => [qw/非自立 代名詞 数 ナイ形容詞語幹 副詞可能 サ変接続/],
    term_length_min   => 2,
    tf_min            => 2,
    df_min            => 1_0000,
    df_max            => 500_0000,
    ng_word           => [qw/編集 本人 自身 自分 たち さん/],
    fetch_unk_word_df => 0,
    concatenation_max => 100,
);

my %tf = (
    '自然言語処理' => 9,
    '自然言語'     => 6,
    '自然言語理解' => 4,
    '処理'         => 5,
    '解析'         => 4,
);

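# Pass a hash reference of precomputed term frequencies ...
p $tfidf->tfidf(\%tf)->dump;

# ... or pass raw text and let MeCab split it into terms.
p $tfidf->tfidf($text)->dump;
p $tfidf->tf($text)->dump;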

for my $result (@{ $tfidf->tfidf($text)->list(20) })
{
    my ($word, $score) = each %{$result};

    say "$word: $score";
}
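
The loop above implies that list() returns an array reference whose elements are single-pair hash references. As a minimal sketch under that assumption, the same shape can be flattened into ordered [word, score] pairs; the data in $list is hypothetical and does not come from the module.

```perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical data, shaped like the structure the loop above iterates over:
# an array reference of hash references holding one word => score pair each.
my $list = [
    { '自然言語処理' => 97.4 },
    { '解析'         => 23.0 },
];

# Flatten each single-pair hash into a [word, score] array reference,
# keeping the original order of the list.
my @pairs = map {
    my ($word) = keys %{$_};
    [ $word, $_->{$word} ];
} @{$list};

printf "%s: %.1f\n", @{$_} for @pairs;
```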

DESCRIPTION

Lingua::JA::TFWebIDF calculates TF*WebIDF scores.
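
To illustrate the idea behind such a score, here is a minimal sketch of the textbook form tf * log(N / df), where df is a web hit count. It is an assumption for illustration only: the formulas the module actually applies are selected by idf_type and documented in Lingua::JA::WebIDF. The %tf and %df values and the tf_idf helper are hypothetical.

```perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical term frequencies for one document and hypothetical
# document frequencies (web hit counts) for the same terms.
my %tf = ('自然言語処理' => 9,       '解析' => 4);
my %df = ('自然言語処理' => 50_0000, '解析' => 8000_0000);

# Assumed corpus size; the same value as the default of "documents".
my $documents = 250_0000_0000;

# Textbook TF*IDF. An assumption, not necessarily what idf_type 1 computes.
sub tf_idf {
    my ($tf, $df, $n) = @_;
    return $tf * log($n / $df);
}

my %score = map { $_ => tf_idf($tf{$_}, $df{$_}, $documents) } keys %tf;

# Print terms from highest to lowest score.
for my $word (sort { $score{$b} <=> $score{$a} } keys %score) {
    printf "%s: %.2f\n", $word, $score{$word};
}
```

A frequent term with a small hit count scores high, while a term that appears on most pages is pushed toward zero.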

Compared with Lingua::JA::TFIDF, this module has the following advantages.

  • It supports Tokyo Cabinet, the Bing API, and many configuration options.

  • The tfidf method accepts a \%tf hash reference, which makes it easier to use other morphological analyzers.

METHODS

new( %config || \%config )

Creates a new Lingua::JA::TFWebIDF instance.

The following default values are used for any keys you do not set in %config.

KEY                 DEFAULT VALUE
-----------         ---------------
pos1_filter         [qw/非自立 代名詞 数 ナイ形容詞語幹 副詞可能 接尾/]
pos2_filter         []
pos3_filter         []
ng_word             []
term_length_min     2
term_length_max     30
concatenation_max   30
tf_min              1
df_min              0
df_max              250_0000_0000
fetch_unk_word_df   0

idf_type            1
api                 'Yahoo'
appid               undef
driver              'Storable'
df_file             undef
fetch_df            1
expires_in          365
documents           250_0000_0000
Furl_HTTP           undef

pos(1|2|3)_filter => \@pos

Filters on '品詞細分類' (the part-of-speech subcategory), given as an array reference of subcategory names.

concatenation_max => $num

The maximum number of term concatenations.

If 2 is specified, 2 consecutive nouns are concatenated. I recommend that you specify a large value or 0.

If half-width spaces or tabs in your text are being ignored, replace them with full-width spaces.
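
The replacement described above can be sketched in plain Perl as follows. The helper name to_fullwidth_spaces is hypothetical and is not part of the module.

```perl
use strict;
use warnings;
use utf8;

# Replace every half-width space and tab with a full-width space (U+3000).
# Works on a copy so the caller's string is left untouched.
sub to_fullwidth_spaces {
    my ($text) = @_;
    (my $copy = $text) =~ tr/ \t/\x{3000}\x{3000}/;
    return $copy;
}

my $converted = to_fullwidth_spaces("自然言語 処理\t解析");
```

The converted string can then be passed to tfidf or tf in place of the original text.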

fetch_df => 0 || 1

1: Fetches the DF score of a word that exists in MeCab's dictionary if its DF score has not been fetched yet.

0: The average DF score is used.

fetch_unk_word_df => 0 || 1

An 'unk word' (unknown word) is a word that does not exist in MeCab's dictionary.

0: The average DF score is used.

idf_type, api, appid, driver, df_file, expires_in, documents, Furl_HTTP

See Lingua::JA::WebIDF.

tfidf( $text || \%tf )

Calculates TF*WebIDF scores. If a scalar is passed, MeCab splits it into morphemes. To use another morphological analyzer, pass a hash reference that maps terms to their TF scores.
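
As a minimal sketch of the hash reference form, term frequencies can be counted from the tokens produced by any analyzer. The token list and the count_terms helper below are hypothetical stand-ins for the analyzer's output.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical token list, standing in for the output of another
# morphological analyzer.
my @tokens = qw/自然言語 処理 自然言語 解析 処理 自然言語/;

# Count how often each term occurs and return a hash reference
# that maps each term to its TF score.
sub count_terms {
    my (@terms) = @_;
    my %tf;
    $tf{$_}++ for @terms;
    return \%tf;
}

my $tf = count_terms(@tokens);

# $tf now has the shape tfidf() accepts: $tfidf->tfidf($tf)
```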

tf($text)

Calculates TF scores using MeCab.

idf, df, purge

See Lingua::JA::WebIDF.

AUTHOR

pawa <pawapawa@cpan.org>

SEE ALSO

Lingua::JA::WebIDF

LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.