NAME

Text::Document - a text document subject to statistical analysis

SYNOPSIS

my $t = Text::Document->new();
$t->AddContent( 'foo bar baz' );
$t->AddContent( 'foo barbaz; ' );

my @freqList = $t->KeywordFrequency();
my $u = Text::Document->new();
...
my $sj = $t->JaccardSimilarity( $u );
my $sc = $t->CosineSimilarity( $u );
my $wsc = $t->WeightedCosineSimilarity( $u, \&MyWeight, $rock );

DESCRIPTION

Text::Document lets you perform simple Information-Retrieval-oriented statistics on plain-text documents.

Text can be added in chunks, so that the document may be incrementally built, for instance by a class like HTML::Parser.

A simple algorithm splits the text into terms; the algorithm may be redefined by subclassing and redefining ScanV.

The KeywordFrequency function computes term frequency over the whole document.

FORESEEN REUSE

The package may be reused either by simple instantiation or by subclassing (defining a descendant package). In the latter case, the methods foreseen for redefinition are those ending with a V suffix; redefining other methods will require greater attention.

CLASS METHODS

new

The creator method. The optional arguments are in (key, value) form and let you specify whether all keywords are transformed to lowercase (the default) and whether the string representation (WriteToString) will be compressed (the default).

my $d = Text::Document->new();
my $dNotCompressed = Text::Document->new( compressed => 0 );
my $dPreserveCase = Text::Document->new( lowercase => 0 );

NewFromString

Take a string written by WriteToString (see below) and create a new Text::Document with the same contents; die is called whenever the restore is impossible or ill-advised, for instance when the current version of the package differs from the one that wrote the string, or the compression library is unavailable.

my $b = Text::Document::NewFromString( $str );

The return value is a blessed reference; put another way, this is an alternative constructor.

The string should have been written by WriteToString; you may of course tweak the string contents, but at that point you're entirely on your own.

INSTANCE METHODS

AddContent

Used as

$d->AddContent( 'foo bar baz foo9' );
$d->AddContent( 'mary had a little lamb' );

Successive calls accumulate content; there is currently no way of resetting the content to zero.

Terms

Returns a list of all distinct terms in the document, in no particular order.

Occurrences

Returns the number of occurrences of a given term.

$d->AddContent( 'foo baz bar foo foo');
my $n = $d->Occurrences( 'foo' ); # now $n is 3

ScanV

Scan a string and return a list of terms.

Called internally as:

my @terms = $self->ScanV( $text );
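A minimal subclassing sketch follows; the package name MyDocument and its keep-alphabetic-runs-only rule are illustrative assumptions, not part of Text::Document itself.

```perl
# Hypothetical subclass that redefines ScanV to keep only
# alphabetic runs, discarding digits and punctuation entirely.
package MyDocument;
our @ISA = ('Text::Document');

sub ScanV {
    my ($self, $text) = @_;
    # Split on any run of non-letter characters; drop empty fields.
    return grep { length } split /[^A-Za-z]+/, $text;
}

package main;
my @terms = MyDocument->ScanV('foo bar-baz 42 qux');
# @terms is now ('foo', 'bar', 'baz', 'qux')
```

All other methods (AddContent, KeywordFrequency, the similarity measures) then work unchanged on top of the redefined tokenization.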

KeywordFrequency

Returns a reference to a list of pairs [term, frequency], sorted by ascending frequency.

  my $listRef = $d->KeywordFrequency();
  foreach my $pair (@{$listRef}){
    my ($term,$frequency) = @{$pair};
    ...
  }

Terms in the document are counted, their frequencies of occurrence are sorted in ascending order, and finally the list is returned to the caller.

WriteToString

Convert the document (actually, some parameters and the term counters) into a string which can be saved and later restored with NewFromString.

my $str = $d->WriteToString();

The string begins with a header which encodes the originating package, its version, and the parameters of the current instance.

Whenever possible, Compress::Zlib is used in order to compress the saved string in the most efficient way. On systems without Compress::Zlib, the string is saved uncompressed.

JaccardSimilarity

Compute the Jaccard measure of document similarity, which is defined as follows: given two documents D and E, let Ds and Es be the sets of terms occurring in D and E, respectively. Define S as the intersection of Ds and Es, and T as their union. Then the Jaccard similarity is the number of elements of S divided by the number of elements of T.

It is called as follows:

my $sim = $d->JaccardSimilarity( $e );

If neither document has any terms the result is undef (a rare occurrence). Otherwise the similarity is a real number between 0.0 (no terms in common) and 1.0 (all terms in common).
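The computation behind JaccardSimilarity can be sketched in self-contained Perl, with plain term lists standing in for document objects; the function name `jaccard` is illustrative, not part of the module's API.

```perl
# Jaccard similarity of two term lists: |intersection| / |union|.
sub jaccard {
    my ($aRef, $bRef) = @_;
    my (%a, %b);
    $a{$_} = 1 for @$aRef;
    $b{$_} = 1 for @$bRef;
    my %union = (%a, %b);
    my $intersection = grep { $b{$_} } keys %a;   # scalar context: a count
    my $total        = keys %union;               # number of distinct terms
    return undef unless $total;                   # neither document has terms
    return $intersection / $total;
}

my $sim = jaccard( [qw(foo bar baz)], [qw(foo bar qux)] );
# 2 shared terms out of 4 distinct terms, so $sim is 0.5
```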

CosineSimilarity

Compute the cosine similarity between two documents D and E.

Let Ds and Es be the set of terms occurring in D and E, respectively. Define T as the union of Ds and Es, and let ti be the i-th element of T.

Then the term vectors of D and E are

Dv = (nD(t1), nD(t2), ..., nD(tN))
Ev = (nE(t1), nE(t2), ..., nE(tN))

where nD(ti) is the number of occurrences of term ti in D, and nE(ti) the same for E.

Now we are at last ready to define the cosine similarity CS:

CS = (Dv,Ev) / (Norm(Dv)*Norm(Ev))

Here (... , ...) is the scalar product and Norm is the Euclidean norm (square root of the sum of squares).

CosineSimilarity is called as

$sim = $d->CosineSimilarity( $e );

It is undef if either D or E has no occurrences of any term. Otherwise, it is a number between 0.0 and 1.0. Since term occurrences are always non-negative, the cosine is always non-negative as well.
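The formula above can be sketched in self-contained Perl, with term occurrence counts held in plain hashes (term => count) rather than Text::Document objects; `cosine` is an illustrative name, not part of the module's API.

```perl
# Cosine similarity of two term-count hashes:
# dot product of the term vectors over the product of their norms.
sub cosine {
    my ($dRef, $eRef) = @_;
    my %union = map { $_ => 1 } (keys %$dRef, keys %$eRef);
    my ($dot, $normD, $normE) = (0, 0, 0);
    for my $t (keys %union) {
        my $nD = $dRef->{$t} || 0;
        my $nE = $eRef->{$t} || 0;
        $dot   += $nD * $nE;
        $normD += $nD * $nD;
        $normE += $nE * $nE;
    }
    return undef unless $normD && $normE;   # a document with no terms
    return $dot / ( sqrt($normD) * sqrt($normE) );
}

my $sim = cosine( { foo => 2, bar => 1 }, { foo => 1, baz => 1 } );
# dot = 2, norms are sqrt(5) and sqrt(2), so $sim is 2/sqrt(10), about 0.632
```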

WeightedCosineSimilarity

Compute the weighted cosine similarity between two documents D and E.

In the setting of CosineSimilarity, the term vectors of D and E are

Dv = (nD(t1)*w1, nD(t2)*w2, ..., nD(tN)*wN)
Ev = (nE(t1)*w1, nE(t2)*w2, ..., nE(tN)*wN)

The weights are non-negative real values; each term has an associated weight. To achieve generality, weights are defined through a function, like:

  my $wcs = $d->WeightedCosineSimilarity(
    $e,
    \&function,
    $rock
  );

The function will be called as follows:

my $weight = function( $rock, 'foo' );

$rock is a 'constant' object used for passing a context to the function.

For instance, a common way of defining weights is the IDF (inverse document frequency), which is defined in Text::DocumentCollection. In this context, you can weigh terms with their IDF as follows:

  $sim = $c->WeightedCosineSimilarity(
    $d,
    \&Text::DocumentCollection::IDF,
    $collection
  );

WeightedCosineSimilarity will call

$collection->IDF( 'foo' );

which is what we expect.

Strictly speaking, the weight function should return the square root of the IDF, but that detail does not matter here.
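The weighted variant can likewise be sketched in self-contained Perl: it is the same computation as plain cosine similarity, except that each count is multiplied by a per-term weight obtained by calling the supplied function with the rock. The names below are illustrative, not the module's internals.

```perl
# Weighted cosine similarity of two term-count hashes.
# $weightFunc is called as $weightFunc->($rock, $term) for each term.
sub weighted_cosine {
    my ($dRef, $eRef, $weightFunc, $rock) = @_;
    my %union = map { $_ => 1 } (keys %$dRef, keys %$eRef);
    my ($dot, $normD, $normE) = (0, 0, 0);
    for my $t (keys %union) {
        my $w  = $weightFunc->($rock, $t);
        my $nD = ($dRef->{$t} || 0) * $w;
        my $nE = ($eRef->{$t} || 0) * $w;
        $dot   += $nD * $nE;
        $normD += $nD * $nD;
        $normE += $nE * $nE;
    }
    return undef unless $normD && $normE;
    return $dot / ( sqrt($normD) * sqrt($normE) );
}

# With every weight equal to 1 this reduces to plain cosine similarity.
my $wsim = weighted_cosine(
    { foo => 2, bar => 1 },
    { foo => 1, baz => 1 },
    sub { 1 },     # constant weight; a real rock might be a collection
    undef
);
```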

AUTHORS

spinellia@acm.org (Andrea Spinelli)
walter@humans.net (Walter Vannini)

HISTORY

2001-11-02 - initial revision

2001-11-20 - added WeightedCosineSimilarity, suggested by JP Mc Gowan <jp.mcgowan@ucd.ie>

We did not use Storable, because we wanted to fine-tune compression and version compatibility. However, this choice may be easily reversed redefining WriteToString and NewFromString.
