NAME
Algorithm::SVMLight - Perl interface to SVMLight Machine-Learning Package
SYNOPSIS
use Algorithm::SVMLight;
my $s = Algorithm::SVMLight->new;
$s->add_instance
(attributes => {foo => 1, bar => 1, baz => 3},
label => 1);
$s->add_instance
(attributes => {foo => 2, blurp => 1},
label => -1);
# ... repeat for several more instances, then:
$s->train;
# Find results for unseen instances
my $result = $s->predict
(attributes => {bar => 3, blurp => 2});
DESCRIPTION
This module implements a perl interface to Thorsten Joachims' SVMLight package:
SVMLight is an implementation of Vapnik's Support Vector Machine [Vapnik, 1995] for the problem of pattern recognition, for the problem of regression, and for the problem of learning a ranking function. The optimization algorithms used in SVMlight are described in [Joachims, 2002a] and [Joachims, 1999a]. The algorithm has scalable memory requirements and can handle problems with many thousands of support vectors efficiently.
-- http://svmlight.joachims.org/
Support Vector Machines in general, and SVMLight specifically, represent some of the best-performing Machine Learning approaches in domains such as text categorization, image recognition, bioinformatics string processing, and others.
For efficiency reasons, the underlying SVMLight engine indexes features by integers, not strings. Since features are commonly thought of by name (e.g. the words in a document, or mnemonic representations of engineered features), Algorithm::SVMLight provides a simple mechanism for mapping back and forth between feature names (strings) and feature indices (integers). If you want to use this mechanism, use the add_instance() and predict() methods. If not, use the add_instance_i() (or read_instances()) and predict_i() methods.
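The name-to-index mapping described above can be pictured with a small pure-Perl sketch. It is not the module's actual implementation: the helpers feature_index() and to_indexed() are hypothetical names invented for illustration, and the sketch only shows how named attributes could be turned into the parallel index/value arrays that the integer-based methods expect.

```perl
use strict;
use warnings;

# Hypothetical bookkeeping, not part of Algorithm::SVMLight.
my %index_of;             # feature name  -> integer index
my @name_of = (undef);    # integer index -> feature name (slot 0 unused)

# Return the index for a feature name, assigning the next free
# positive integer the first time the name is seen.
sub feature_index {
    my ($name) = @_;
    if (!exists $index_of{$name}) {
        push @name_of, $name;
        $index_of{$name} = $#name_of;
    }
    return $index_of{$name};
}

# Convert a { name => value } hash into parallel arrays of indices
# and values, with the indices in strictly increasing order.
sub to_indexed {
    my ($attributes) = @_;
    my %by_index;
    for my $name (sort keys %$attributes) {
        $by_index{ feature_index($name) } = $attributes->{$name};
    }
    my @indices = sort { $a <=> $b } keys %by_index;
    my @values  = @by_index{@indices};
    return (\@indices, \@values);
}

my ($idx1, $val1) = to_indexed({ foo => 1, bar => 1, baz => 3 });
my ($idx2, $val2) = to_indexed({ foo => 2, blurp => 1 });

print "first instance:  indices (@$idx1), values (@$val1)\n";
print "second instance: indices (@$idx2), values (@$val2)\n";
print "index 4 maps back to '$name_of[4]'\n";
```

Names that recur across instances (foo here) keep the index they were first given, which is what lets all instances share one attribute space.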
INSTALLATION
For installation instructions, please see the README file included with this distribution.
METHODS
- new(...)
-
Creates a new Algorithm::SVMLight object and returns it. Any named arguments that correspond to SVM parameters will cause their corresponding set_***() method to be invoked:
    $s = Algorithm::SVMLight->new(
        type              => 2,  # Regression model
        biased_hyperplane => 0,  # Nonbiased
        kernel_type       => 3,  # Sigmoid
    );
See the set_***(...) method for a list of such parameters.
- set_***(...)
-
The following parameters can be set by using methods with their corresponding names. For instance, the maxiter parameter can be set by using set_maxiter($x), where $x is the new desired value.
Learning parameters:
    type
    svm_c
    eps
    svm_costratio
    transduction_posratio
    biased_hyperplane
    sharedslack
    svm_maxqpsize
    svm_newvarsinqp
    kernel_cache_size
    epsilon_crit
    epsilon_shrink
    svm_iter_to_shrink
    maxiter
    remove_inconsistent
    skip_final_opt_check
    compute_loo
    rho
    xa_depth
    predfile
    alphafile
Kernel parameters:
    kernel_type
    poly_degree
    rbf_gamma
    coef_lin
    coef_const
    custom
For an explanation of these parameters, see the svm_common.h file in the SVMLight distribution.
It is best to set these parameters only via arguments to new() (see above) or right after calling new(), since I don't think the underlying C code expects them to change in the middle of a process.
- add_instance(label => $x, attributes => \%y)
-
Adds a training instance to the set of instances which will be used to train the model. An attributes parameter specifies a hash of attribute-value pairs for the instance, and a label parameter specifies the label. The label must be a number, and typically it should be 1 for positive training instances and -1 for negative training instances. The keys of the attributes hash should be strings, and the values should be numbers (the values of each attribute).
All training instances share the same attribute-space; if an attribute is unspecified for a certain instance, it is equivalent to specifying a value of zero. Typically you can save a lot of memory (and potentially training time) by omitting zero-valued attributes.
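Since an omitted attribute is equivalent to a zero, a dense attribute hash can be thinned out before it is handed over. The following is a minimal sketch in plain Perl; sparsify() is a hypothetical helper written for this illustration, not a method of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of the attribute hash with all
# zero-valued entries removed. The input hash is left untouched.
sub sparsify {
    my ($attributes) = @_;
    my %sparse;
    for my $name (keys %$attributes) {
        my $value = $attributes->{$name};
        $sparse{$name} = $value if $value != 0;
    }
    return \%sparse;
}

my $dense  = { foo => 2, bar => 0, baz => 0, blurp => 1 };
my $sparse = sparsify($dense);

# The result could then be passed as  attributes => $sparse  to add_instance().
print join(", ", map { sprintf("%s=%s", $_, $sparse->{$_}) } sort keys %$sparse), "\n";
```

With a large vocabulary where each instance uses only a handful of features, this keeps each hash proportional to the number of non-zero attributes rather than to the size of the whole attribute space.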
Each training instance may have a "cost factor" assigned to it, indicating the relative cost of misclassification of the instance. The default is a cost of 1.0; to assign a different cost, pass a cost_factor parameter with the desired value.
When using a ranking SVM, you may also pass a query_id parameter, whose integer value will identify the group of instances in which this instance belongs for ranking purposes.
Finally, a slack_id parameter may also be passed and it will become the slackid member of the underlying DOC C struct, used in an "OPTIMIZATION" SVM (type==4).
- add_instance_i($label, $name, \@indices, \@values, $query_id=0, $slack_id=0, $cost_factor=1.0)
-
This is just like add_instance(), but bypasses all the string-to-integer mapping of feature names. Use this method when you already have your features represented as integers. The $label parameter must be a number (typically 1 or -1), and the @indices and @values arrays must be parallel arrays of indices and their corresponding values. Furthermore, the indices must be positive integers and given in strictly increasing order.
If you like add_instance_i(), I've got a predict_i() I bet you'll just love.
- read_instances($file)
-
An alternative to calling add_instance_i() for each instance is to organize a collection of training data into SVMLight's standard "example_file" format, then call this read_instances() method to import the data. Under the hood, this calls SVMLight's read_documents() C function. When it's convenient for you to organize the data in this manner, you may see speed improvements.
- ranking_callback(\&function)
-
When using a ranking SVM, it is possible to customize the cost of ranking each pair of instances incorrectly by supplying a custom Perl callback function.
For two instances i and j, the custom function will receive four arguments: the rankvalue of instance i, the rankvalue of instance j, the costfactor of instance i, and the costfactor of instance j. It should return a real number indicating the cost.
By default, SVMLight will use an internal C function assigning a cost of the average of the costfactors for the two instances.
- train()
-
After a sufficient number of instances have been added to your model, call train() in order to actually learn the underlying discriminative Machine Learning model.
Depending on the number of instances (and to a lesser extent the total number of attributes), this method might take a while. If you want to train the model only once and save it for later re-use in a different context, see the write_model() and read_model() methods.
- is_trained()
-
Returns a boolean value indicating whether or not train() has been called on this model.
- predict(attributes => \%y)
-
After train() has been called, the model may be applied to previously-unseen combinations of attributes. The predict() method accepts an attributes parameter just like add_instance(), and returns its best prediction of the label that would apply to the given attributes. The sign of the returned label (positive or negative) indicates whether the new instance is considered a positive or negative instance, and the magnitude of the label gives a rough indication of the confidence with which the model is making that assertion.
- predict_i(\@indices, \@values)
-
This is just like predict(), but bypasses all the string-to-integer mapping of feature names. See also add_instance_i().
- write_model($file)
-
Saves the given trained model to the file $file. The model may later be re-loaded using the read_model() method. The model is written using SVMLight's write_model() C function, so it will be fully compatible with SVMLight command-line tools like svm_classify.
- read_model($file)
-
Reads a model that has previously been written with write_model():
    my $m = Algorithm::SVMLight->new();
    $m->read_model($file);
The model file is read using SVMLight's read_model() C function, so if you want to, you could initially create the model with one of SVMLight's command-line tools like svm_learn.
- get_linear_weights()
-
After training a linear model (or reading in a model file), this method will return a reference to an array containing the linear weights of the model. This can be useful for model inspection, to see which features are having the greatest impact on decision-making.
my $arrayref = $m->get_linear_weights();
The first element (position 0) of the array will be the threshold b, and the rest of the elements will be the weights themselves. Thus from 1 upward, the indices align with SVMLight's internal indices.
If the model has not yet been trained, or if the kernel type is not linear, an exception will be thrown.
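For model inspection, one simple approach is to rank feature indices by the absolute value of their weight. The sketch below works on a made-up array laid out as described above (threshold at position 0, weights from 1 upward); top_weights() is a hypothetical helper, and the numbers are invented purely for illustration rather than taken from a trained model.

```perl
use strict;
use warnings;

# Hypothetical helper: return the indices of the $k weights with the
# largest magnitude, skipping position 0 (the threshold b).
sub top_weights {
    my ($weights, $k) = @_;
    my @indices = sort {
        abs($weights->[$b]) <=> abs($weights->[$a])   # largest magnitude first
            || $a <=> $b                              # ties: lower index first
    } 1 .. $#$weights;
    $k = @indices if $k > @indices;
    return @indices[0 .. $k - 1];
}

# Invented example data: [ b, w1, w2, w3, w4 ]
my $weights = [0.25, 0.8, -1.5, 0.0, 0.3];

my @top = top_weights($weights, 2);
printf "feature index %d: weight %+.2f\n", $_, $weights->[$_] for @top;
```

In real use, the array reference returned by get_linear_weights() would take the place of the invented $weights. A large negative weight matters as much as a large positive one, which is why the comparison uses the magnitude rather than the signed value.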
- feature_names()
-
Returns a list of feature names that have been fed to add_instance() as keys of the attributes parameter, or in a scalar context the number of such names.
- num_features()
-
Returns the number of features known to this model. Note that if you use add_instance_i() or read_instances(), some of the features may never actually have been seen before: you could add instances with only indices 2, 5, and 37, never having added any instances with the indices in between, and num_features() will still return 37. This is because after training, an instance could be passed to the predict() method with real values for these previously unseen features. If you just use add_instance() instead, you'll probably never run into this issue, and in a scalar context num_features() will look just like feature_names().
- num_instances()
-
Returns the number of training instances known to the model. It should be fine to call this method either before or after training actually occurs.
SEE ALSO
Algorithm::NaiveBayes, AI::DecisionTree
AUTHOR
Ken Williams, <kwilliams@cpan.org>
COPYRIGHT AND LICENSE
The Algorithm::SVMLight perl interface is copyright (C) 2005-2008 Thomson Legal & Regulatory, and written by Ken Williams. It is free software; you can redistribute it and/or modify it under the same terms as perl itself.
Thorsten Joachims and/or Cornell University of Ithaca, NY control the copyright of SVMLight itself - you will find full copyright and license information in its distribution. You are responsible for obtaining an appropriate license for SVMLight if you intend to use Algorithm::SVMLight. In particular, please note that SVMLight "is granted free of charge for research and education purposes. However you must obtain a license from the author to use it for commercial purposes."
To avoid any copyright clashes, the SVMLight.patch file distributed here is granted under the same license terms as SVMLight itself.