NAME
Lingua::JA::TFWebIDF - TF*WebIDF calculator
SYNOPSIS
use Lingua::JA::TFWebIDF;
use utf8;
use feature qw/say/;
use Data::Printer;
my $tfidf = Lingua::JA::TFWebIDF->new(
api => 'YahooPremium',
appid => $appid,
fetch_df => 1,
Furl_HTTP => { timeout => 3 },
driver => 'TokyoCabinet',
df_file => './yahoo.tch',
pos1_filter => [qw/非自立 代名詞 数 ナイ形容詞語幹 副詞可能 サ変接続/],
term_length_min => 2,
tf_min => 2,
df_min => 1_0000,
df_max => 500_0000,
ng_word => [qw/編集 本人 自身 自分 たち さん/],
fetch_unk_word_df => 0,
concatenation_max => 100,
);
my %tf = (
'自然言語処理' => 9,
'自然言語' => 6,
'自然言語理解' => 4,
'処理' => 5,
'解析' => 4,
);
p $tfidf->tfidf(\%tf)->dump;   # scores from precomputed TF values
p $tfidf->tfidf($text)->dump;  # scores from raw text, split by MeCab
p $tfidf->tf($text)->dump;     # TF values only
for my $result (@{ $tfidf->tfidf($text)->list(20) })
{
my ($word, $score) = each %{$result};
say "$word: $score";
}
DESCRIPTION
Lingua::JA::TFWebIDF calculates TF*WebIDF scores.
Compared with Lingua::JA::TFIDF, this module has the following advantages:
It supports Tokyo Cabinet, the Bing API, and many configuration options.
The tfidf function accepts \%tf, which makes it easier to use other morphological analyzers.
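To make the scoring concrete, here is a minimal sketch of a TF*IDF calculation in plain Perl. It assumes the common IDF form log(N / df); the formula the module actually applies depends on idf_type and is documented in Lingua::JA::WebIDF. The tf_idf helper and the DF values are hypothetical and only illustrate why a rare term outranks a common one.

```perl
use strict;
use warnings;
use utf8;

binmode STDOUT, ':encoding(UTF-8)';

# Assumption: IDF = log(N / df). The module's real formula depends on
# idf_type and is defined in Lingua::JA::WebIDF.
sub tf_idf {
    my ($tf, $df, $documents) = @_;
    return $tf * log($documents / $df);
}

my $documents = 250_0000_0000;    # default of the 'documents' option

my %tf = ('自然言語処理' => 9, '処理' => 5);

# Hypothetical document frequencies, for illustration only.
my %df = ('自然言語処理' => 100_0000, '処理' => 10_0000_0000);

my %score = map { $_ => tf_idf($tf{$_}, $df{$_}, $documents) } keys %tf;

# Print terms from highest to lowest score.
for my $word (sort { $score{$b} <=> $score{$a} } keys %score) {
    printf "%s: %.2f\n", $word, $score{$word};
}
```

A term with a high TF and a low DF ends up at the top of the list, which is the behaviour the df_min and df_max options then constrain.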
METHODS
new( %config || \%config )
Creates a new Lingua::JA::TFWebIDF instance.
The following configuration is used if you don't set %config.
KEY                 DEFAULT VALUE
-----------------   ---------------
pos1_filter         [qw/非自立 代名詞 数 ナイ形容詞語幹 副詞可能 接尾/]
pos2_filter         []
pos3_filter         []
ng_word             []
term_length_min     2
term_length_max     30
concatenation_max   30
tf_min              1
df_min              0
df_max              250_0000_0000
fetch_unk_word_df   0
idf_type            1
api                 'Yahoo'
appid               undef
driver              'Storable'
df_file             undef
fetch_df            1
expires_in          365
documents           250_0000_0000
Furl_HTTP           undef
- pos(1|2|3)_filter => \@pos
Filters on the MeCab part-of-speech subcategories ('品詞細分類') at levels 1 to 3.
- concatenation_max => $num
The maximum number of terms to concatenate.
If 2 is specified, 2 consecutive nouns are concatenated. A large value or 0 is recommended.
If half-width spaces or tabs in the text are being ignored, replace them with full-width spaces.
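The replacement of half-width spaces and tabs can be done before the text is passed to the module. The following is a minimal sketch; the helper name to_fullwidth_spaces is hypothetical and not part of the module.

```perl
use strict;
use warnings;
use utf8;

# Replace every half-width space and tab with a full-width space (U+3000).
# The input is expected to be a decoded (character) string.
sub to_fullwidth_spaces {
    my ($text) = @_;
    (my $out = $text) =~ s/[ \t]/\x{3000}/g;
    return $out;
}

my $converted = to_fullwidth_spaces("自然言語 処理\t解析");
```

The converted string can then be used in place of the original text.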
- fetch_df => 0 || 1
1: fetches the DF score of a word that exists in the MeCab dictionary if its DF score has not been fetched yet.
0: the average DF score is used.
- fetch_unk_word_df => 0 || 1
An 'unk word' is a word that does not exist in the MeCab dictionary.
0: the average DF score is used.
- idf_type, api, appid, driver, df_file, expires_in, documents, Furl_HTTP
See Lingua::JA::WebIDF.
tfidf( $text || \%tf )
Calculates TF*WebIDF scores. If a scalar value is given, MeCab splits it into morphemes. To use another morphological analyzer, pass a hash reference that maps terms to their TF scores.
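A hash reference of TF scores can be built from the output of any tokenizer by counting terms. The sketch below assumes the terms are already available as a list; the count_terms helper and the sample terms are hypothetical.

```perl
use strict;
use warnings;
use utf8;

# Count how often each term occurs and return the counts as a hash reference.
sub count_terms {
    my @terms = @_;
    my %tf;
    $tf{$_}++ for @terms;
    return \%tf;
}

# Hypothetical output of some other morphological analyzer.
my @terms = qw/自然言語 処理 自然言語 解析 処理 自然言語/;

my $tf = count_terms(@terms);

# With the module installed, $tf can be passed on directly:
#   $tfidf->tfidf($tf)
```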
tf($text)
Calculates TF score via MeCab.
idf, df, purge
See Lingua::JA::WebIDF.
AUTHOR
pawa <pawapawa@cpan.org>
SEE ALSO
LICENSE
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.