NAME
AI::NaiveBayes1 - Naive Bayes Classification
SYNOPSIS
use AI::NaiveBayes1;
my $nb = AI::NaiveBayes1->new;
$nb->add_table(
"Html Caps Free Spam count
-------------------------------
Y Y Y Y 42
Y Y Y N 32
Y Y N Y 17
Y Y N N 7
Y N Y Y 32
Y N Y N 12
Y N N Y 20
Y N N N 16
N Y Y Y 38
N Y Y N 18
N Y N Y 16
N Y N N 16
N N Y Y 2
N N Y N 9
N N N Y 11
N N N N 91
-------------------------------
");
$nb->train;
print "Model:\n" . $nb->print_model;
print "Model (with counts):\n" . $nb->print_model('with counts');
$nb = AI::NaiveBayes1->new;
$nb->add_instances(attributes=>{model=>'H',place=>'B'},
label=>'repairs=Y',cases=>30);
$nb->add_instances(attributes=>{model=>'H',place=>'B'},
label=>'repairs=N',cases=>10);
$nb->add_instances(attributes=>{model=>'H',place=>'N'},
label=>'repairs=Y',cases=>18);
$nb->add_instances(attributes=>{model=>'H',place=>'N'},
label=>'repairs=N',cases=>16);
$nb->add_instances(attributes=>{model=>'T',place=>'B'},
label=>'repairs=Y',cases=>22);
$nb->add_instances(attributes=>{model=>'T',place=>'B'},
label=>'repairs=N',cases=>14);
$nb->add_instances(attributes=>{model=>'T',place=>'N'},
label=>'repairs=Y',cases=> 6);
$nb->add_instances(attributes=>{model=>'T',place=>'N'},
label=>'repairs=N',cases=>84);
$nb->train;
print "Model:\n" . $nb->print_model;
# Find results for unseen instances
my $result = $nb->predict
(attributes => {model=>'T', place=>'N'});
foreach my $k (keys(%{ $result })) {
print "for label $k P = " . $result->{$k} . "\n";
}
# export the model into a string
my $string = $nb->export_to_YAML();
# create the same model from the string
my $nb1 = AI::NaiveBayes1->import_from_YAML($string);
# write the model to a file (shorter than model->string->file)
$nb->export_to_YAML_file('t/tmp1');
# read the model from a file (shorter than file->string->model)
my $nb2 = AI::NaiveBayes1->import_from_YAML_file('t/tmp1');
See Examples for more examples.
DESCRIPTION
This module implements the classic "Naive Bayes" machine learning algorithm.
Data Structure
An object contains the following fields:
{attributes}
-
List of attribute names.
{attribute_type}{$a}
-
Attribute types - 'real', or not (e.g., 'nominal')
{labels}
-
List of labels.
{attvals}{$a}
-
List of attribute values
{real_stat}{$a}{$v}{$l}{sum}
-
Statistics for real valued attributes; besides 'sum' also: count, mean, stddev
{numof_instances}
-
Number of training instances.
{stat_labels}{$l}
-
Label count in training data.
{stat_attributes}{$a}
-
Statistics for an attribute: ...{$value}{$label} = count of instances.
{smoothing}{$attribute}
-
Attribute smoothing. No smoothing is applied if this field does not exist. Implemented smoothing:
- /^unseen count=/ followed by a number, e.g., 0.5
Attribute Smoothing
For an attribute A one can specify:
$nb->{smoothing}{A} = 'unseen count=0.5';
to provide a count for unseen data. The count is taken into consideration in training and prediction, when any unseen attribute values are observed. Zero probabilities can be prevented in this way. A count other than 0.5 can be provided, but if it is <=0 it will be set to 0.5. The method is similar to add-one smoothing. A special attribute value '*' is used for all unseen data.
METHODS
Constructor Methods
- new()
-
Constructor. Creates a new AI::NaiveBayes1 object and returns it.
- import_from_YAML($string)
-
Constructor. Creates a new AI::NaiveBayes1 object from a string where it is represented in YAML. Requires YAML module.
- import_from_YAML_file($file_name)
-
Constructor. Creates a new AI::NaiveBayes1 object from a file where it is represented in YAML. Requires YAML module.
Non-Constructor Methods
- add_table()
-
Add instances from a table. The first row contains the attribute names, and the following rows contain values. If the name of the last attribute is 'count', it is interpreted as a repetition count and used appropriately. The last attribute (after optionally removing 'count') is the class attribute. The attributes and values are separated by white space.
- add_csv_file($filename)
-
Add instances from a CSV file. Primitive format implementation (e.g., no commas allowed in attribute names or values).
- drop_attributes(@attributes)
-
Delete attributes after adding instances.
- set_real(list_of_attributes)
-
Declares a list of attributes to be real-valued. During training, their conditional probabilities will be modeled with Gaussian (normal) distributions.
- add_instance(attributes=>HASH,label=>STRING|ARRAY)
-
Adds a training instance to the categorizer.
- add_instances(attributes=>HASH,label=>STRING|ARRAY,cases=>NUMBER)
-
Adds a number of identical instances to the categorizer.
- export_to_YAML()
-
Returns a YAML string representation of an AI::NaiveBayes1 object. Requires YAML module.
- export_to_YAML_file( $file_name )
-
Writes a YAML string representation of an AI::NaiveBayes1 object to a file. Requires YAML module.
- print_model( OPTIONAL 'with counts' )
-
Returns a string, human-friendly representation of the model. The model is supposed to be trained before calling this method. One argument 'with counts' can be supplied, in which case explanatory expressions with counts are printed as well.
- train()
-
Calculates the probabilities that will be necessary for categorization using the predict() method.
- predict( attributes => HASH )
-
Use this method to predict the label of an unknown instance. The attributes should be of the same format as you passed to add_instance(). predict() returns a hash reference whose keys are the names of labels, and whose values are corresponding probabilities.
- labels
-
Returns a list of all the labels the object knows about (in no particular order), or the number of labels if called in a scalar context.
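The table format accepted by add_table() can be illustrated with a small stand-alone sketch. The helper parse_table() below is hypothetical and is not part of AI::NaiveBayes1; it only follows the description above (header row, white-space separated values, optional trailing 'count' column, last remaining column is the class). Skipping dashed separator lines and building labels as "Class=value" are assumptions of this sketch, modeled on the SYNOPSIS.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of AI::NaiveBayes1): turn a white-space
# separated table into a list of { attributes, label, cases } records.
sub parse_table {
    my ($text) = @_;

    # Keep non-empty lines that are not dashed separator lines.
    my @rows = grep { /\S/ && !/^\s*-+\s*$/ } split /\n/, $text;

    my @names = split ' ', shift(@rows);        # header row
    my $has_count = @names && $names[-1] eq 'count';
    pop @names if $has_count;                   # drop the 'count' column name
    my $class = pop @names;                     # last remaining column is the class

    my @instances;
    for my $row (@rows) {
        my @v     = split ' ', $row;
        my $cases = $has_count ? pop @v : 1;    # repetition count, default 1
        my $label = pop @v;
        my %attrs;
        @attrs{@names} = @v;                    # attribute name => value
        push @instances,
            { attributes => \%attrs, label => "$class=$label", cases => $cases };
    }
    return @instances;
}

my @inst = parse_table(<<'END');
Html Caps Free Spam count
-------------------------------
Y    Y    Y    Y    42
N    N    N    N    91
-------------------------------
END

printf "%s: %d cases\n", $_->{label}, $_->{cases} for @inst;
```

Each record has the shape of the arguments that add_instances() takes in the SYNOPSIS, so this is one way to think about what a table row stands for.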
THEORY
Bayes' Theorem is a way of inverting a conditional probability. It states:
              P(y|x) P(x)
    P(x|y) = -------------
                 P(y)
and so on...
This is a pretty standard algorithm explained in many machine learning textbooks (e.g., "Data Mining" by Witten and Frank).
The algorithm relies on estimating P(A|C), where A is an arbitrary attribute, and C is the class attribute. If A is not real-valued, then this conditional probability is estimated using a table of all possible values for A and C.
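For nominal attributes, the table-based estimate can be sketched in plain Perl without the module. The sketch below uses the car-repair counts from the SYNOPSIS; the variable names are chosen for illustration only, no smoothing is applied, and it is not the module's own code.

```perl
use strict;
use warnings;

# Training counts from the SYNOPSIS: [model, place, repairs, cases].
my @data = (
    [qw(H B Y 30)], [qw(H B N 10)], [qw(H N Y 18)], [qw(H N N 16)],
    [qw(T B Y 22)], [qw(T B N 14)], [qw(T N Y  6)], [qw(T N N 84)],
);

# Build the count tables: per label, and per (attribute value, label).
my (%label_count, %model_count, %place_count);
my $total = 0;
for my $row (@data) {
    my ($model, $place, $label, $n) = @$row;
    $label_count{$label}         += $n;
    $model_count{$model}{$label} += $n;
    $place_count{$place}{$label} += $n;
    $total                       += $n;
}

# Score each label c for the instance model=T, place=N:
# P(C=c) * P(model=T|C=c) * P(place=N|C=c), each estimated from counts.
my %score;
for my $c (keys %label_count) {
    $score{$c} = $label_count{$c} / $total
               * $model_count{T}{$c} / $label_count{$c}
               * $place_count{N}{$c} / $label_count{$c};
}

# Normalize so that the scores sum to 1.
my $sum = 0;
$sum += $_ for values %score;
my %p = map { $_ => $score{$_} / $sum } keys %score;

printf "P(repairs=%s | model=T, place=N) = %.3f\n", $_, $p{$_} for sort keys %p;
```

With these counts the sketch prints 0.899 for repairs=N and 0.101 for repairs=Y.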
If A is real-valued, then the distribution P(A|C) is modeled as a Gaussian (normal) distribution for each possible value of C=c. Hence, for each C=c we collect the mean value (m) and standard deviation (s) for A during training. During classification, P(A=a|C=c) is estimated using the Gaussian distribution, i.e., in the following way:
                      1                  (a-m)^2
    P(A=a|C=c) = ------------ * exp( - --------- )
                 sqrt(2*Pi)*s            2*s^2
This boils down to the following lines of code, where 0.398942280401433 is 1/sqrt(2*Pi):
$scores{$label} *=
0.398942280401433 / $m->{real_stat}{$att}{$label}{stddev}*
exp( -0.5 *
( ( $newattrs->{$att} -
$m->{real_stat}{$att}{$label}{mean})
/ $m->{real_stat}{$att}{$label}{stddev}
) ** 2
);
i.e.,
P(A=a|C=c) = 0.398942280401433 / s *
exp( -0.5 * ( ( a-m ) / s ) ** 2 );
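As a quick check of the formula, the density can be written as a small stand-alone function. The function name gaussian() and the sample values (mean 73, standard deviation 6.2, evaluated at 66) are illustrative assumptions, not part of the module.

```perl
use strict;
use warnings;

# Gaussian density with mean $m and standard deviation $s, evaluated at $x.
sub gaussian {
    my ($x, $m, $s) = @_;
    my $pi = 4 * atan2(1, 1);
    return exp(-0.5 * (($x - $m) / $s) ** 2) / (sqrt(2 * $pi) * $s);
}

# The constant used in the code above is 1/sqrt(2*Pi).
printf "%.15f\n", 1 / sqrt(2 * 4 * atan2(1, 1));   # 0.398942280401433

# Density at 66 for an illustrative mean of 73 and standard deviation of 6.2.
printf "%.4f\n", gaussian(66, 73, 6.2);            # 0.0340
```

Note that this is a density, not a probability, so values above 1 are possible when s is small.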
EXAMPLES
Example with a real-valued attribute modeled by a Gaussian distribution (from Witten I. and Frank E. book "Data Mining" (the WEKA book), page 86):
# @relation weather
#
# @attribute outlook {sunny, overcast, rainy}
# @attribute temperature real
# @attribute humidity real
# @attribute windy {TRUE, FALSE}
# @attribute play {yes, no}
#
# @data
# sunny,85,85,FALSE,no
# sunny,80,90,TRUE,no
# overcast,83,86,FALSE,yes
# rainy,70,96,FALSE,yes
# rainy,68,80,FALSE,yes
# rainy,65,70,TRUE,no
# overcast,64,65,TRUE,yes
# sunny,72,95,FALSE,no
# sunny,69,70,FALSE,yes
# rainy,75,80,FALSE,yes
# sunny,75,70,TRUE,yes
# overcast,72,90,TRUE,yes
# overcast,81,75,FALSE,yes
# rainy,71,91,TRUE,no
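use AI::NaiveBayes1;
use YAML;    # provides YAML::DumpFile, used below
my $nb = AI::NaiveBayes1->new;
$nb->set_real('temperature', 'humidity');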
$nb->add_instance(attributes=>{outlook=>'sunny',temperature=>85,humidity=>85,windy=>'FALSE'},label=>'play=no');
$nb->add_instance(attributes=>{outlook=>'sunny',temperature=>80,humidity=>90,windy=>'TRUE'},label=>'play=no');
$nb->add_instance(attributes=>{outlook=>'overcast',temperature=>83,humidity=>86,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'rainy',temperature=>70,humidity=>96,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'rainy',temperature=>68,humidity=>80,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'rainy',temperature=>65,humidity=>70,windy=>'TRUE'},label=>'play=no');
$nb->add_instance(attributes=>{outlook=>'overcast',temperature=>64,humidity=>65,windy=>'TRUE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'sunny',temperature=>72,humidity=>95,windy=>'FALSE'},label=>'play=no');
$nb->add_instance(attributes=>{outlook=>'sunny',temperature=>69,humidity=>70,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'rainy',temperature=>75,humidity=>80,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'sunny',temperature=>75,humidity=>70,windy=>'TRUE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'overcast',temperature=>72,humidity=>90,windy=>'TRUE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'overcast',temperature=>81,humidity=>75,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'rainy',temperature=>71,humidity=>91,windy=>'TRUE'},label=>'play=no');
$nb->train;
my $printedmodel = "Model:\n" . $nb->print_model;
my $p = $nb->predict(attributes=>{outlook=>'sunny',temperature=>66,humidity=>90,windy=>'TRUE'});
YAML::DumpFile('file', $p);
die unless (abs($p->{'play=no'} - 0.792) < 0.001);
die unless(abs($p->{'play=yes'} - 0.208) < 0.001);
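The two probabilities checked above can be reproduced by hand with a short stand-alone sketch that does not use the module. It combines count ratios for the nominal attributes with Gaussian densities for the real-valued ones. The helper names, the use of the sample standard deviation (n-1 denominator), and the absence of smoothing are assumptions of this sketch; they happen to reproduce the values tested in the example.

```perl
use strict;
use warnings;

# Weather data: outlook, temperature, humidity, windy, play.
my @data = map { [ split /,/, $_ ] } grep { /\S/ } split /\n/, <<'END';
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
END

# Collect label counts, nominal count tables, and real-valued samples.
my (%n, %nominal, %real);
for my $row (@data) {
    my ($outlook, $temp, $hum, $windy, $play) = @$row;
    $n{$play}++;
    $nominal{outlook}{$outlook}{$play}++;
    $nominal{windy}{$windy}{$play}++;
    push @{ $real{temperature}{$play} }, $temp;
    push @{ $real{humidity}{$play} },    $hum;
}

# Mean and sample standard deviation (n-1 denominator; an assumption here).
sub mean_sd {
    my @x = @_;
    my $count = @x;
    my $mean  = 0;
    $mean += $_ for @x;
    $mean /= $count;
    my $ss = 0;
    $ss += ($_ - $mean) ** 2 for @x;
    return ($mean, sqrt($ss / ($count - 1)));
}

# Gaussian density, using the constant 1/sqrt(2*Pi) from the THEORY section.
sub gaussian {
    my ($x, $m, $s) = @_;
    return 0.398942280401433 / $s * exp(-0.5 * (($x - $m) / $s) ** 2);
}

my %new = (outlook => 'sunny', temperature => 66, humidity => 90, windy => 'TRUE');
my %score;
for my $c (keys %n) {
    my $s = $n{$c} / scalar(@data);                  # prior P(C=c)
    for my $att (keys %nominal) {                    # nominal: count ratios
        $s *= ($nominal{$att}{ $new{$att} }{$c} // 0) / $n{$c};
    }
    for my $att (keys %real) {                       # real: Gaussian density
        $s *= gaussian($new{$att}, mean_sd(@{ $real{$att}{$c} }));
    }
    $score{$c} = $s;
}

# Normalize so that the scores sum to 1.
my $sum = 0;
$sum += $_ for values %score;
my %p = map { $_ => $score{$_} / $sum } keys %score;

printf "play=%s: %.3f\n", $_, $p{$_} for sort keys %p;
```

The sketch prints 0.792 for play=no and 0.208 for play=yes, in line with the two checks in the example.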
HISTORY
Algorithm::NaiveBayes by Ken Williams was not what I needed, so I wrote this one. Algorithm::NaiveBayes is oriented towards text categorization and includes smoothing and log probabilities. This module is a generic, basic Naive Bayes algorithm.
THANKS
I would like to thank Yung-chung Lin (xern@cpan.org) for his implementation of the Gaussian model for continuous variables, and the following people for bug reports, support, and comments (in random order):
Michael Stevens
Tom Dyson
Dan Von Kohorn
CPAN-testers: Andreas Koenig, Alexandr Ciornii, jlatour, Jost.Krieger, tvmaly, Matthew Musgrove, Michael Stevens
Craig Talbert
and Andrew Brian Clegg.
AUTHOR
Copyright 2003-11 Vlado Keselj http://www.cs.dal.ca/~vlado. In 2004 Yung-chung Lin provided an implementation of the Gaussian model for continuous variables.
This script is provided "as is" without expressed or implied warranty. This is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The module is available on CPAN (http://search.cpan.org/~vlado), and http://web.cs.dal.ca/~vlado/srcperl/. The latter site is updated more frequently.
SEE ALSO
Algorithm::NaiveBayes, perl.