NAME
Mail::SpamTest::Bayesian - Perl extension for Bayesian spam-testing
SYNOPSIS
    use Mail::SpamTest::Bayesian;

    my $j = Mail::SpamTest::Bayesian->new(dir => '.');
    $j->init_db;
    $j->merge_mbox_spam($scalar_spam_box);
    $j->merge_mbox_nonspam($scalar_nonspam_box);
    $message = $j->markup_message($message);
DESCRIPTION
This module implements the Bayesian spam-testing algorithm described by Paul Graham at:
http://www.paulgraham.com/spam.html
In short: the system is trained by exposure to mailboxes of known spam and non-spam messages. Each message is (1) MIME-decoded, with non-text parts discarded, and (2) tokenised. The database files spam.db and nonspam.db record each token together with the number of messages in which it has occurred; general.db holds a message count.
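The training step above can be sketched in a few lines of Perl. This is an illustration only, not the module's code: the names tokens_in_message and count_message are hypothetical, the token pattern loosely follows Graham's essay, and lower-casing and skipping all-digit tokens are assumptions made for the sketch. A plain hash stands in for the .db files.

```perl
use strict;
use warnings;

# Return the distinct tokens of one already-decoded message.
# Assumption: a token is a run of letters, digits, apostrophes,
# dollar signs and dashes; everything else separates tokens.
sub tokens_in_message {
    my ($text) = @_;
    my %seen;
    while ($text =~ /([A-Za-z0-9'\$-]+)/g) {
        my $tok = lc $1;                  # assumption: case is ignored
        next if $tok =~ /^[0-9]+$/;       # assumption: skip pure numbers
        $seen{$tok} = 1;
    }
    return sort keys %seen;
}

# Add one message to a corpus: each token is counted once per message,
# and the corpus-wide message total is incremented.
sub count_message {
    my ($corpus, $text) = @_;
    $corpus->{count}{$_}++ for tokens_in_message($text);
    $corpus->{messages}++;
    return $corpus;
}
```

Because tokens are de-duplicated per message, the stored figure is the number of messages containing a token rather than the number of times it appears.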
This module is in early development; it is functional but basic. It is expected that more mailbox parsing routines will be added, probably using Mail::Box; and that ancillary programs will be supplied for use of the module as a personal mail filter.
METHODS
new()
Standard constructor. Pass a hash or hashref with parameters.
Useful parameters:
    dir -> database directory (.)
    significant -> number of significant tokens to consider (15)
    threshold -> spam threshold (0.9)
    fudgefactor -> non-spam priority (2)
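To show where significant, threshold and fudgefactor would enter the calculation, here is a minimal sketch of the scoring rule as given in Graham's essay. It is not taken from this module: token_probability and spam_score are hypothetical names, treating fudgefactor as the multiplier on non-spam counts is an assumption, and the constants 0.4, 0.01, 0.99 and 5 are the essay's. Both message totals are assumed to be non-zero.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Spam probability of one token.
# $spam_n / $ham_n: spam / non-spam messages containing the token.
# $nbad / $ngood:   spam / non-spam messages trained on (non-zero).
sub token_probability {
    my ($spam_n, $ham_n, $nbad, $ngood, $fudgefactor) = @_;
    $ham_n *= $fudgefactor;                 # weight non-spam evidence more heavily
    return 0.4 if $spam_n + $ham_n < 5;     # too rare to judge
    my $bad  = min(1, $spam_n / $nbad);
    my $good = min(1, $ham_n / $ngood);
    return max(0.01, min(0.99, $bad / ($good + $bad)));
}

# Combine the $significant probabilities farthest from the neutral 0.5.
sub spam_score {
    my ($probs, $significant) = @_;
    my @top = sort { abs($b - 0.5) <=> abs($a - 0.5) } @$probs;
    splice @top, $significant if @top > $significant;
    my ($spam, $ham) = (1, 1);
    for my $p (@top) {
        $spam *= $p;
        $ham  *= 1 - $p;
    }
    return $spam / ($spam + $ham);
}
```

Under this sketch a message would be called spam when spam_score() exceeds threshold.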
init_db()
Resets the databases. Note that this does not reclaim disk space; to delete an existing database, just delete the three files general.db, spam.db and nonspam.db. Call this only once, when you first set up the database.
merge_mbox_spam()
Train the system by giving it a mailbox full of spam.
Pass a scalar, an array, or an arrayref containing raw messages.
merge_mbox_nonspam()
Train the system by giving it a mailbox full of legitimate email.
Pass a scalar, an array, or an arrayref containing raw messages.
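One way to obtain the scalar form is to read a mailbox file in a single gulp. The helper below is a sketch; slurp_file and the file names in the comment are hypothetical, and it is assumed that the mailbox is stored as one file on disk.

```perl
use strict;
use warnings;

# Read a whole file into one scalar without altering its bytes.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                   # no record separator: read to end of file
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Intended use with a filter object $j (file names are examples only):
#   $j->merge_mbox_spam(slurp_file('spam.mbox'));
#   $j->merge_mbox_nonspam(slurp_file('nonspam.mbox'));
```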
merge_message_spam()
As merge_mbox_spam, but for a single message; pass in a scalar.
merge_message_nonspam()
As merge_mbox_nonspam, but for a single message; pass in a scalar.
markup_message()
Test a message for possible spammishness. Pass a scalar containing a single message. Returns the original message with two headers inserted:
X-Bayesian-Spam: (YES|NO) (probability%)
X-Bayesian-Test: the significant tests and their weights
test_message()
Pass a scalar containing a single message. Returns a list:
0: spam status (1 for spam, 0 for non spam)
1: probability of spam
2: listref of significant tests
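The returned list can be unpacked in one assignment. The wrapper below is a sketch: describe_verdict is a hypothetical name, and $filter is assumed to be any object whose test_message() behaves as documented above. The probability is printed as returned, since its scale is not specified here.

```perl
use strict;
use warnings;

# Summarise the verdict for one raw message as a single line of text.
sub describe_verdict {
    my ($filter, $raw_message) = @_;
    my ($is_spam, $probability, $tests) = $filter->test_message($raw_message);
    my $label = $is_spam ? 'spam' : 'not spam';
    return sprintf '%s (probability %s, %d significant tests)',
        $label, $probability, scalar @$tests;
}
```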
AUTHOR
Roger Burton West, <roger@firedrake.org>
ACKNOWLEDGEMENTS
Erwin Harte provided useful feedback and the de-MIMEing code.