NAME

Freq - A purpose-built inverted text index for term-frequency calculations.

SYNOPSIS

Index documents:

# cat textcorpus.txt | tokenize | indexstream corpus_dir

Create ngram list:

# cat textcorpus.txt | tokenize | ngrams [N-size] [threshold] 

Get statistics on word frequencies:

# cat termlist.txt | stats --everything corpus_dir

Get help:

# tokenize --help
# stats --help
# indexstream --help
# ngrams --help
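
The meaning of the N-size and threshold arguments to ngrams is not spelled out here. As a rough illustration only, the following self-contained Perl sketch assumes that N-size is the number of words per n-gram and that threshold is the minimum number of occurrences an n-gram needs to be listed. The helper count_ngrams is hypothetical and is not part of Freq; the token list is made-up sample data.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Freq: count word n-grams in a token
# list and keep only those seen at least $threshold times (assumed meaning).
sub count_ngrams {
    my ( $tokens, $n, $threshold ) = @_;
    my %count;

    # Slide a window of $n tokens across the list.
    for my $i ( 0 .. @$tokens - $n ) {
        my $gram = join ' ', @{$tokens}[ $i .. $i + $n - 1 ];
        $count{$gram}++;
    }

    # Keep only the n-grams that meet the threshold.
    my %kept;
    for my $gram ( keys %count ) {
        $kept{$gram} = $count{$gram} if $count{$gram} >= $threshold;
    }
    return \%kept;
}

# Made-up sample tokens, standing in for the output of a tokenizer.
my @tokens = qw(the cat sat on the mat the cat ran);
my $ngrams = count_ngrams( \@tokens, 2, 2 );
print "$_\t$ngrams->{$_}\n" for sort keys %$ngrams;
```

Run ngrams --help for the actual behavior of the shipped tool.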

PROGRAMMING API

  use Freq;

  my $index = Freq->open_write( "indexname" );
  $index->index_document( "docname", $string );
  $index->close();

  $index = Freq->open_read( "indexname" );
  my ( $words_in_corpus, $docs_in_corpus ) = $index->index_info();

  # Find all docs containing a phrase
  my $hashref = $index->doc_hash( "this phrase and no other phrase" );

  # Total number of matches for this phrase/word.
  my $matches = $hashref->{MATCHES};

  # The sequential ID of each matching document.
  my @docids = @{ $hashref->{DOCIDS} };

  # The number of matches found in each document.
  my @docmatches = @{ $hashref->{DOCMATCHES} };

  # The number of words between each consecutive match.
  my @intervals = @{ $hashref->{INTERVALS} };

  # Get the match count, the document count, the standard deviation of
  # terms per document, and the standard deviation of intervals per match.
  my ( $match_count, $doc_count, $docsigma, $intsigma ) =
      $index->stats( "some phrase or other" );

  $index->close();
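
How stats() derives its two standard deviations is not stated here. The sketch below shows one plausible reading, assuming a population standard deviation taken over the DOCMATCHES and INTERVALS lists. The helper std_dev is hypothetical and not part of Freq, and the hash reference is hand-built sample data that stands in for the return value of doc_hash(), so the code runs without an index.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Freq: population standard deviation.
sub std_dev {
    my @values = @_;
    my $n      = @values;
    return 0 unless $n;

    my $mean = 0;
    $mean += $_ / $n for @values;

    my $sum_sq = 0;
    $sum_sq += ( $_ - $mean )**2 for @values;

    return sqrt( $sum_sq / $n );
}

# Hand-built stand-in for the hash reference returned by doc_hash().
# The numbers are illustrative only.
my $hashref = {
    MATCHES    => 6,
    DOCIDS     => [ 1, 4, 7 ],
    DOCMATCHES => [ 1, 2, 3 ],
    INTERVALS  => [ 10, 20, 30, 40, 50 ],
};

# Assumed correspondence to the last two values returned by stats().
my $docsigma = std_dev( @{ $hashref->{DOCMATCHES} } );
my $intsigma = std_dev( @{ $hashref->{INTERVALS} } );
printf "docsigma=%.3f intsigma=%.3f\n", $docsigma, $intsigma;
```

If the module uses the sample standard deviation instead, the divisor in std_dev would be $n - 1 rather than $n.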

DESCRIPTION

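Freq is an inverted text index built for term-frequency calculations.
The command-line tools tokenize, indexstream, ngrams and stats index a
token stream, list n-grams and report word-frequency statistics. The
Perl API indexes documents by name and, for a given word or phrase,
reports the total number of matches, the documents that contain it, the
number of matches in each document and the number of words between
consecutive matches.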

EXPORT

None. Use the programming API as shown above.

AUTHOR

Ira Joseph Woodhead, ira@ejemoni.com

SEE ALSO

DBIx::FullTextSearch, Search::InvertedIndex