NAME

Text::AI::CRM114 - Perl interface for CRM114

SYNOPSIS

use feature 'say';
use Text::AI::CRM114;
my $db = Text::AI::CRM114->new(
  Text::AI::CRM114::OSB_BAYES,
  8*1024*1024, ["Alice", "Macbeth"]);

$db->learn("Alice", "Alice was beginning to ...");
$db->learn("Macbeth", "When shall we three meet again ...");

my @ret = $db->classify("The Mole had been working very hard all the morning ...");

# @ret is ($err, $errmsg, $class, $prob, $pR)
say "Best classification is $ret[2]" if $ret[0] == Text::AI::CRM114::OK;

DESCRIPTION

This module provides a simple Perl interface to libcrm114, a library that implements several text classification algorithms.

CONSTANTS

libcrm114 uses several constants as status return values and to select the classification algorithm of a new datablock. These constants are accessible in this module's namespace, for example Text::AI::CRM114::OK and Text::AI::CRM114::OSB_WINNOW.

METHODS

Text::AI::CRM114->new($flags, $datasize, $classref)

Creates a new instance.

$flags

sets the classification algorithm; recommended values are

Text::AI::CRM114::OSB_BAYES (default), Text::AI::CRM114::OSB_WINNOW, or Text::AI::CRM114::HYPERSPACE. libcrm114 includes some more algorithms (SVM, PCA, FSCM) which may or may not be production ready.

$datasize

the memory size for learned data in bytes (default: 4 MB). Note that some algorithms have to grow the data size when learning.

$classref

a list of classes passed by reference (default: ['A', 'B']).

Text::AI::CRM114->readfile($filename)

Creates a new instance by reading a previously saved CRM114 DB from $filename.

$db->getclasses()

Returns a hash reference to the DB's classes. This hash associates the class names (keys) with the internal integer index (values).
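To illustrate the mapping described above, the following sketch inverts such a hash so that an index can be translated back to its class name. The sample hash stands in for the value returned by $db->getclasses(); its contents are made up for illustration, and index_to_name is a hypothetical helper, not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: invert a name => index hash into index => name.
sub index_to_name {
    my ($classes) = @_;
    return { reverse %$classes };
}

# Stand-in for: my $classes = $db->getclasses();
# (made-up sample data; the real indices come from the DB)
my $classes = { Alice => 0, Macbeth => 1 };

my $names = index_to_name($classes);
for my $idx (sort { $a <=> $b } keys %$names) {
    print "$idx => $names->{$idx}\n";
}
```

The inversion is only safe because every class has a distinct index; duplicate values would silently overwrite each other.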

$db->writefile($filename)

Writes the DB into a (binary) file.

$db->learn($class, $text)

Learns $text as an example of the class $class.

$db->classify($text, $verbatim)

Classifies $text.

In the normal working mode (without the optional $verbatim flag) the return values are adjusted to be more useful with two classes (e.g. spam/ham). If the flag is given, the values are passed on unchanged as they come from libcrm114. In practice this is only relevant if you use more than two classes. (Then you have to consider the success/non-success classes and probably want to add a method to retrieve the individual per-class results.)

Returns a list of five scalar values:

$err

A numeric error code; Text::AI::CRM114::OK on success.

$errmsg

A short error message (for error display or logging).

$class

The name of the best matching class.

$prob

The success probability. Normally this is the probability of the matching class (with 0.5 <= $prob <= 1).

With $verbatim this is the success probability as reported by libcrm114, i.e. with two classes the probability of the first class, and with multiple classes the sum of the probabilities of all success classes (with 0 <= $prob <= 1).
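As a small worked example of the multi-class case, the sketch below sums the probabilities of the classes treated as successes. The per-class probabilities and the choice of success classes are made up for illustration; the module does not currently expose per-class results, and success_prob is a hypothetical helper.

```perl
use strict;
use warnings;

# Made-up per-class probabilities; they sum to 1.
my %prob = (spam => 0.15, phishing => 0.60, ham => 0.25);

# Assumption: these are the classes flagged as "success".
my @success = qw(spam phishing);

# Success probability as described above: the sum over all success classes.
sub success_prob {
    my ($prob, $success) = @_;
    my $total = 0;
    $total += $prob->{$_} for @$success;
    return $total;
}

printf "success probability: %.2f\n", success_prob(\%prob, \@success);
```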

$pR

The logarithmic probability ratio, i.e. log10($prob) - log10(1-$prob). The theoretical range is 0 <= $pR <= 340, limited by floating point precision; in practice p = 0.99 yields a pR of about 2, so high values are rather unusual.

With $verbatim this is the ratio between all success and all non-success probabilities, so for a non-successful result the value can also be negative (range -340 <= $pR <= 340).
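The relationship between the five return values can be sketched in plain Perl. The block below recomputes the normal-mode pR from a probability and formats a result list for logging. The result list is a hand-built stand-in for the output of $db->classify($text), the empty error message on success is an assumption, and pr_from_prob and summarize are hypothetical helpers, not part of the module.

```perl
use strict;
use warnings;

# pR as defined above; only meaningful for 0 < $prob < 1.
sub pr_from_prob {
    my ($prob) = @_;
    return (log($prob) - log(1 - $prob)) / log(10);
}

# Hypothetical helper: turn the five-value result list into a one-line summary.
# $ok is the success code, e.g. Text::AI::CRM114::OK.
sub summarize {
    my ($ok, $err, $errmsg, $class, $prob, $pR) = @_;
    return "error $err: $errmsg" if $err != $ok;
    return sprintf "%s (prob %.3f, pR %.2f)", $class, $prob, $pR;
}

# Stand-in for: my @ret = $db->classify($text);
my $prob = 0.99;
my @ret  = (0, '', 'Alice', $prob, pr_from_prob($prob));

print summarize(0, @ret), "\n";
```

Passing the success code in explicitly keeps the helper independent of the concrete value of the OK constant.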

ISSUES

This is my first attempt to write a Perl module, so all hints and improvements are appreciated.

I would like to hide the constants from Text::AI::libcrm114. I guess it is impossible to eliminate the error codes (unless one wants to completely hide them from the user and simply croak on every error). But at least for the algorithm selection I am considering string arguments, i.e. the user would pass the string OSBF and we would map it to Text::AI::libcrm114::OSBF.

I wonder if we should ensure Text::AI::libcrm114::OK maps to 0, as this makes the caller's return value checking easier. Currently this is trivial because it already is 0 in libcrm114. If that should change we would have to insert a rewrite into every XS call to a C function (ugly, but maybe worth it).

I am still not sure if the C memory management works correctly.

Another issue is Unicode support, which is missing in libcrm114, so it might be a good idea to convert Unicode strings into some 8-bit encoding. As long as no string contains \0 bytes nothing bad[tm] will happen, but I assume that Unicode strings will internally cause wrong tokenization (this should be checked in libtre).
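One possible workaround on the caller's side is to encode text into a byte string before handing it to the module. The sketch below uses the core Encode module for this. The helper to_bytes is hypothetical, and the choice of UTF-8 is an assumption: it is byte-oriented but not single-byte, so a single-byte encoding name such as iso-8859-1 can be substituted if that suits the data better. Whether libcrm114 then tokenizes the resulting bytes sensibly is not verified here.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: turn a Perl character string into a NUL-free byte string.
sub to_bytes {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);  # characters -> UTF-8 octets
    $bytes =~ tr/\0//d;                  # drop \0 bytes, flagged above as problematic
    return $bytes;
}

my $bytes = to_bytes("na\x{ef}ve \x{263a}");
printf "%d bytes\n", length $bytes;
```

The result could then be passed on, for example as $db->learn($class, to_bytes($text)).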

SEE ALSO

CRM114 homepage: http://crm114.sourceforge.net/

AI::CRM114, a module using the crm language interpreter: https://metacpan.org/module/AI::CRM114

HISTORY

v0.03 initial CPAN release

v0.02 initial push to GitHub

AUTHOR

Martin Schuette, <info@mschuette.name>

COPYRIGHT AND LICENSE

Perl module: Copyright (C) 2012 by Martin Schuette

libcrm114: Copyright (C) 2009-2010 by William S. Yerazunis

This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License version 3.