NAME
Freq - A purpose-built inverted text index for making term frequency calculations.
SYNOPSIS
Index documents:
# cat textcorpus.txt | tokenize | indexstream corpus_dir
Create ngram list:
# cat textcorpus.txt | tokenize | ngrams [N-size] [threshold]
Get statistics on word frequencies:
# cat termlist.txt | stats --everything corpus_dir
Get help:
# tokenize --help
# stats --help
# indexstream --help
# ngrams --help
PROGRAMMING API
use Freq;
my $index = Freq->open_write( "indexname" );
$index->index_document( "docname", $string );
$index->close();
$index = Freq->open_read( "indexname" );
my ( $words_in_corpus, $docs_in_corpus ) = $index->index_info();
# Find all docs containing a phrase
my $hashref = $index->doc_hash( "this phrase and no other phrase" );
# Total number of matches for this phrase/word.
my $matches = $hashref->{MATCHES};
# The sequential ID of each matching document.
my @docids = @{ $hashref->{DOCIDS} };
# The number of matches found in each document.
my @docmatches = @{ $hashref->{DOCMATCHES} };
# The number of words between each pair of consecutive matches.
my @intervals = @{ $hashref->{INTERVALS} };
# Get the match count, the document count, the standard deviation of
# matches per document, and the standard deviation of the intervals
# between matches.
my ( $total_matches, $doc_count, $docsigma, $intsigma ) =
    $index->stats( "some phrase or other" );
$index->close();
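The two standard deviations returned by stats() can be illustrated with a small standalone sketch. It does not use Freq itself: the hash below is a made-up stand-in for a doc_hash() result, std_dev() is a hypothetical helper, and the use of the population (rather than sample) standard deviation is an assumption, since the module's actual formula is not documented here.

```perl
use strict;
use warnings;

# Hypothetical helper: population standard deviation of a list of numbers.
# Returns 0 for an empty list.
sub std_dev {
    my @values = @_;
    return 0 unless @values;
    my $mean = 0;
    $mean += $_ / @values for @values;
    my $variance = 0;
    $variance += ( $_ - $mean )**2 / @values for @values;
    return sqrt($variance);
}

# Made-up data shaped like a doc_hash() result; not output from a real index.
my $hashref = {
    MATCHES    => 8,
    DOCIDS     => [ 1, 4, 7 ],
    DOCMATCHES => [ 2, 4, 2 ],
    INTERVALS  => [ 10, 30, 20, 40, 10 ],
};

# Assumed relationship between doc_hash() fields and stats() values.
my $doc_count = scalar @{ $hashref->{DOCIDS} };
my $docsigma  = std_dev( @{ $hashref->{DOCMATCHES} } );
my $intsigma  = std_dev( @{ $hashref->{INTERVALS} } );

printf "matches=%d docs=%d docsigma=%.3f intsigma=%.3f\n",
    $hashref->{MATCHES}, $doc_count, $docsigma, $intsigma;
```

A low $docsigma would suggest the phrase is spread evenly across the documents that contain it, while a high value would suggest it is concentrated in a few of them.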
DESCRIPTION
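Freq is a purpose-built inverted text index for term frequency calculations. A corpus can be indexed from the command line, by piping tokenize into indexstream, or through the programming API, with open_write() and index_document(). Once a corpus is indexed, a word or phrase can be looked up with doc_hash() to obtain its total match count, the IDs of the documents that contain it, the number of matches in each document, and the intervals between consecutive matches. stats() summarizes the same lookup as a match count, a document count, and two standard deviations.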
EXPORT
None. Use the programming API as shown above.
AUTHOR
Ira Joseph Woodhead, ira@ejemoni.com
SEE ALSO
DBIx::FullTextSearch, Search::InvertedIndex