NAME

Text::TFIDF - Perl extension for computing the TF-IDF measure

SYNOPSIS

use Text::TFIDF;
my $Obj = Text::TFIDF->new(file => ['file1', 'file2', ...]);
print $Obj->TFIDF($file,$word);

DESCRIPTION

The TF-IDF (Term Frequency-Inverse Document Frequency) weight is used in information retrieval and text mining. It is a statistical measure of how important a word is to a document within a collection of documents. This module currently works only on text documents.

Currently, the module reads everything into memory, which may be a problem for large corpora. This should be altered in the future.

EXPORT

None by default.

new(file=>\@files)

Creates a new Text::TFIDF object. If the file argument is passed in, populates the object using those files.

TFIDF(file,word)

Computes the TF-IDF weight for the given document and word. If the file is not in the corpus used to populate the object, returns undef.
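As an illustration of the calculation described here, the following is a minimal sketch and not the module's own code. The %corpus hash, the tfidf helper and the file names are hypothetical, and the placement of the added one in the denominator of the IDF term is an assumption.

```perl
use strict;
use warnings;

# Hypothetical in-memory corpus: document name => { word => count }.
my %corpus = (
    'a.txt' => { cat => 2, dog => 1 },
    'b.txt' => { dog => 3 },
    'c.txt' => { bird => 1 },
);

# Hypothetical helper (not part of Text::TFIDF): term frequency times
# inverse document frequency, or undef if the document is unknown.
sub tfidf {
    my ($corpus, $doc, $word) = @_;
    return undef unless exists $corpus->{$doc};

    my $tf     = $corpus->{$doc}{$word} // 0;                 # raw count in this document
    my $n_docs = keys %$corpus;                               # documents in the corpus
    my $n_with = grep { exists $_->{$word} } values %$corpus; # documents containing the word

    # Assumption: one is added to the denominator to avoid division by zero.
    return $tf * log( $n_docs / (1 + $n_with) );
}

printf "%.4f\n", tfidf(\%corpus, 'a.txt', 'cat');   # 2 * log(3/2) = 0.8109
print defined(tfidf(\%corpus, 'missing.txt', 'cat')) ? "defined\n" : "undef\n";   # undef
```

A caller would normally check the return value with defined before using it, since a weight of zero and an unknown file are different outcomes.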

TF(file,word)

Returns the frequency of the given word in the document.
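A rough sketch of counting a word in a document's text follows. The term_count helper is hypothetical, and the tokenization (lowercasing and splitting on whitespace) and the use of a raw count rather than a normalized frequency are assumptions; the module may do this differently.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::TFIDF): number of times $word
# occurs in $text. Assumes case-insensitive, whitespace-delimited tokens.
sub term_count {
    my ($text, $word) = @_;
    my %count;
    $count{ lc $_ }++ for grep { length } split /\s+/, $text;
    return $count{ lc $word } // 0;
}

print term_count("the cat saw the dog", "the"), "\n";    # 2
print term_count("the cat saw the dog", "bird"), "\n";   # 0
```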

IDF(word)

Returns the inverse document frequency of a word: the logarithm of the number of documents in the corpus divided by the number of documents containing the term. Since the number of documents containing the term can be zero, one is added so that the result is always defined.
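The arithmetic can be sketched as below. This is a hedged example: the idf helper is hypothetical, and it assumes the one is added to the count of documents containing the term (the denominator) and that the natural logarithm is used, which the text above does not state explicitly.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::TFIDF).
# $n_docs: documents in the corpus; $n_with: documents containing the term.
sub idf {
    my ($n_docs, $n_with) = @_;
    # Assumption: add one to the denominator so a term found in no
    # document does not cause a division by zero.
    return log( $n_docs / (1 + $n_with) );   # Perl's log is the natural log
}

printf "%.4f\n", idf(10, 0);   # log(10) = 2.3026
printf "%.4f\n", idf(10, 4);   # log(2)  = 0.6931
printf "%.4f\n", idf(10, 9);   # log(1)  = 0.0000
```

Under this assumption a term that appears in every document gets a slightly negative value, since the denominator then exceeds the corpus size.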

process_files(@files)

Populates the object with the given list of files. This does not replace data already loaded; it adds the new files to the existing corpus.

SEE ALSO

See http://en.wikipedia.org/wiki/Tf-idf for more information.

AUTHOR

Leigh Metcalf, <leigh@fprime.net>

COPYRIGHT AND LICENSE

Copyright (C) 2011 by Leigh Metcalf

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.12.3 or, at your option, any later version of Perl 5 you may have available.