NAME

Mail::SpamTest::Bayesian - Perl extension for Bayesian spam-testing

SYNOPSIS

use Mail::SpamTest::Bayesian;

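my $j = Mail::SpamTest::Bayesian->new(dir => '.');
$j->init_db;                                  # first-time setup only
$j->merge_mbox_spam($scalar_spam_box);        # train on known spam
$j->merge_mbox_nonspam($scalar_nonspam_box);  # train on known non-spam
$message = $j->markup_message($message);      # add X-Bayesian-* headers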

DESCRIPTION

This module implements the Bayesian spam-testing algorithm described by Paul Graham at:

http://www.paulgraham.com/spam.html

In short: the system is trained by exposure to mailboxes of known spam and non-spam messages. Each message is (1) MIME-decoded, with non-text parts discarded, and (2) tokenised. The database files spam.db and nonspam.db hold each token together with the number of messages in which it has occurred; general.db holds a message count.
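The training step described above can be sketched in plain Perl. This is a minimal illustration, not the module's own code: the token character class follows Graham's essay, lower-casing is a choice made for this sketch, and the names tokenise and learn, along with the in-memory hash standing in for spam.db, are hypothetical.

```perl
use strict;
use warnings;

# Split decoded message text into tokens. The character class follows
# Graham's essay (alphanumerics, dashes, apostrophes, dollar signs);
# the module's own tokeniser may differ. Hypothetical helper.
sub tokenise {
    my ($text) = @_;
    my %seen;
    for my $token ($text =~ /([A-Za-z0-9\$'-]+)/g) {
        next if $token =~ /^[0-9]+$/;    # skip purely numeric tokens
        $seen{ lc $token } = 1;          # count each token once per message
    }
    return sort keys %seen;
}

# Record one message in a table of per-token message counts.
# %$counts stands in for spam.db or nonspam.db, $$total for general.db.
sub learn {
    my ($counts, $total, $text) = @_;
    $counts->{$_}++ for tokenise($text);
    $$total++;
    return;
}

my %spam;
my $nspam = 0;
learn(\%spam, \$nspam, 'Buy cheap pills now, cheap cheap!');
learn(\%spam, \$nspam, 'Cheap loans: $1000 today');

printf "%s => %d\n", $_, $spam{$_} for sort keys %spam;
print "messages: $nspam\n";
```

Because each token is counted at most once per message, "cheap" ends up with a count of 2 here even though it appears four times in total.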

This module is in early development; it is functional but basic. It is expected that more mailbox parsing routines will be added, probably using Mail::Box; and that ancillary programs will be supplied for use of the module as a personal mail filter.

METHODS

new()

Standard constructor. Pass a hash or hashref with parameters.

Useful parameters:

dir         -> database directory (.)
significant -> number of significant tokens to consider (15)
threshold   -> spam threshold (0.9)
fudgefactor -> non-spam priority (2)
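To show where significant, threshold and fudgefactor would plausibly enter the calculation, here is a sketch of the scoring formulae from Graham's essay. It assumes the module follows the essay closely, which the text above does not confirm; the function names and the sample counts are made up for illustration.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Spam probability for one token, after Graham's essay.
# $bad / $good: messages containing the token in each corpus;
# $nbad / $ngood: corpus sizes (assumed non-zero);
# $fudge: weight given to non-spam sightings (the essay doubles them).
sub token_probability {
    my ($bad, $good, $nbad, $ngood, $fudge) = @_;
    my $g = $fudge * $good;
    return 0.4 if $g + $bad < 5;    # too rare to judge: the essay's neutral value
    my $pb = min(1, $bad / $nbad);
    my $pg = min(1, $g / $ngood);
    return max(0.01, min(0.99, $pb / ($pg + $pb)));
}

# Combine the $significant probabilities that lie furthest from 0.5.
sub combine {
    my ($significant, @p) = @_;
    my @top = sort { abs($b - 0.5) <=> abs($a - 0.5) } @p;
    splice @top, $significant if @top > $significant;
    my ($prod, $inv) = (1, 1);
    for my $p (@top) {
        $prod *= $p;
        $inv  *= 1 - $p;
    }
    return $prod / ($prod + $inv);
}

# Made-up counts: a token seen in 20 of 100 spam and 1 of 100 non-spam messages.
my $p     = token_probability(20, 1, 100, 100, 2);
my $score = combine(15, $p, 0.99, 0.2);
printf "token %.3f, message %.3f, %s\n",
    $p, $score, ($score > 0.9 ? 'spam' : 'not spam');
```

Raising fudgefactor biases the filter against false positives, since every non-spam sighting of a token then counts for more.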

init_db()

Resets the databases. Note that this does not recover disk space; to remove an existing database entirely, delete the three files general.db, spam.db and nonspam.db. Call this only once, when you first set up the database.

merge_mbox_spam()

Train the system by giving it a mailbox full of spam.

Pass a scalar, an array, or an arrayref containing raw messages.
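If the mailbox is held as a single mbox-format string, one way to obtain a list of raw messages is to split on the "From " separator lines. The sketch below is an illustration only; split_mbox is a hypothetical helper, and whether the module expects the separator line to be kept in each message is an assumption.

```perl
use strict;
use warnings;

# Split mbox-format text into individual raw messages. A new message
# starts at every line beginning with "From " (the mbox separator line),
# which is kept as the first line of its message. Hypothetical helper.
sub split_mbox {
    my ($mbox) = @_;
    my @messages = split /^(?=From )/m, $mbox;
    return grep { /\S/ } @messages;    # drop any whitespace-only fragments
}

my $sample = <<'MBOX';
From alice@example.org Mon Jan  1 00:00:00 2001
Subject: first

Body one.

From bob@example.org Tue Jan  2 00:00:00 2001
Subject: second

Body two.
MBOX

my @messages = split_mbox($sample);
printf "%d messages\n", scalar @messages;
```

The resulting list, or a reference to it, matches the argument forms documented above, as does the unsplit scalar.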

merge_mbox_nonspam()

Train the system by giving it a mailbox full of legitimate email.

Pass a scalar, an array, or an arrayref containing raw messages.

merge_message_spam()

As merge_mbox_spam, but for a single message; pass in a scalar.

merge_message_nonspam()

As merge_mbox_nonspam, but for a single message; pass in a scalar.

markup_message()

Test a message for possible spammishness. Pass a scalar containing a single message. Returns the original message with the following headers inserted:

X-Bayesian-Spam: (YES|NO) (probability%)
X-Bayesian-Test: the significant tests and their weights
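A downstream filter can read the verdict back out of the marked-up message. The sketch below assumes the header looks like "X-Bayesian-Spam: YES (97.3%)"; the exact punctuation is a guess based on the template above, so the pattern is deliberately tolerant. spam_verdict is a hypothetical helper, not part of the module.

```perl
use strict;
use warnings;

# Extract the verdict and percentage from the X-Bayesian-Spam header.
# Returns (1 or 0, percentage), or an empty list if the header is absent.
# Hypothetical helper; the header layout is an assumption.
sub spam_verdict {
    my ($message) = @_;
    # Look only at the header block, i.e. everything before the first blank line.
    my ($headers) = split /\r?\n\r?\n/, $message, 2;
    return unless defined $headers
        && $headers =~ /^X-Bayesian-Spam:[ \t]*(YES|NO)\b[^0-9\n]*([0-9.]+)%/m;
    return ($1 eq 'YES' ? 1 : 0, $2);
}

my $sample = join "\n",
    'Subject: hello',
    'X-Bayesian-Spam: YES (97.3%)',    # assumed layout
    '',
    'Body text.';

my ($is_spam, $percent) = spam_verdict($sample);
print "spam=$is_spam percent=$percent\n";
```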

test_message()

Pass a scalar containing a single message. Returns a list:

0: spam status (1 for spam, 0 for non-spam)
1: probability of spam
2: listref of significant tests
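The three-element list can be unpacked directly into named variables. The sketch below turns it into a one-line log entry; summarise is a hypothetical helper, the values passed to it are stand-ins for a real return value, and the entries of the listref are treated as opaque strings because their layout is not specified here.

```perl
use strict;
use warnings;

# Format the list documented for test_message as a single log line.
# Hypothetical helper; entries of the listref are printed as they are.
sub summarise {
    my ($is_spam, $probability, $tests) = @_;
    return sprintf '%s p=%.3f tests=[%s]',
        ($is_spam ? 'SPAM' : 'ham'),
        $probability,
        join(', ', @{ $tests || [] });
}

# Stand-in values; in real use: summarise($j->test_message($message))
print summarise(1, 0.987, ['free', 'click']), "\n";
```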

AUTHOR

Roger Burton West, <roger@firedrake.org>

ACKNOWLEDGEMENTS

Erwin Harte provided useful feedback and the de-MIMEing code.

SEE ALSO

perl, BerkeleyDB.