NAME

IDS::Test - An IDS test framework

SYNOPSIS

A usage synopsis would go here. Since it is not here, read on.

DESCRIPTION

Create an IDS algorithm instance and do training and/or testing with it.

new(datasource, algorithm)
new(datasource, algorithm, Getopt::Long options)

Set up a new test object with the data source and algorithm specified.

NOTES: If the calling program needs to do command-line argument processing, it must do it through this call. Any Getopt::Long options supplied here override those from the data source and algorithm; take care to ensure that the options are unique, or that this overriding behavior is what you want.

The data source and algorithm cannot be specified by command-line arguments; the Getopt::Long processing happens after object creation, and it must happen then because the remaining command-line arguments are data source- or algorithm-specific.
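A construction sketch follows. The IDS::DataSource::HTTP and IDS::Algorithm::Order subclasses, their argument-free constructors, and the extra "verbose" option are illustrative assumptions; the extra options are shown as Getopt::Long specification/variable pairs.

    use IDS::Test;
    use IDS::DataSource::HTTP;    # hypothetical data source subclass
    use IDS::Algorithm::Order;    # hypothetical algorithm subclass

    my $source    = IDS::DataSource::HTTP->new;
    my $algorithm = IDS::Algorithm::Order->new;

    # Extra Getopt::Long option specifications supplied here override any
    # identically-named options from the data source or the algorithm.
    my $verbose;
    my $test = IDS::Test->new($source, $algorithm, "verbose=i" => \$verbose);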

parameters()
parameters(param)
parameters(param, value, ...)

Set and/or retrieve the current parameters, individually or as a group. These parameters may apply to the data source or to the algorithm (the assumption is that they are either unique or have the same meaning). The return value is always a hash (actually, an array of key-value pairs).

Print the values of all of the parameters (useful for recording the configuration used for a training or test run). If a filehandle is provided, it is the destination for the dump. Otherwise, the parameters are printed to stdout.
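A sketch of typical parameters() calls; the parameter names are those documented below, and the single-parameter return form follows the key-value description above.

    # retrieve all current parameters as key-value pairs
    my %params = $test->parameters();

    # retrieve a single parameter (also returned as a key-value pair)
    my %one = $test->parameters("clean_interval");
    my $interval = $one{clean_interval};

    # set one or more parameters
    $test->parameters("clean_interval" => 1000, "test_verbose" => 1);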

param_options()

Returns the Getopt::Long option specifications used for command-line parameter processing.

default_parameters()

Set the default values for parameters that may be modified by command-line options.

For an IDS::Test object, the command-line options are the following (a sketch of setting them programmatically appears after this list):

test_verbose

Whether and how much debugging output to produce.

clean_interval

How often, measured in the number of data source items processed, IDS::Algorithm::clean should be called. A value of 0 means not to call it at all. Cleaning may be a form of generalization, or it may be something else (it is up to the algorithm to determine the meaning).

Cleaning occurs in foreach(), so it can occur for training, testing, or both.

generalize

Whether or not to call IDS::Algorithm::generalize() at the end of the training. The meaning of generalization is up to the algorithm. Generalization occurs after all training, and before saving.

add_threshold

How similar a test result must be before a tested data source item is added. A value of 0 means no adding will be attempted. If the test produces a similarity value >= the threshold, the item is added.

This option is one way to allow the IDS algorithm to handle a non-stationary environment.

One implication of this dynamic adding is that if add_threshold > 0, the algorithm *must* support adding into a loaded state. It also implies that you probably want to save the resulting state at the end of the test run.

normal_threshold

Results below this value are considered abnormal. This value only has meaning if group_abnormal or try_alternates is enabled.

group_abnormal

If an item is considered abnormal, then try to group it with others to reduce the overall false positive rate. See normal_threshold for the definition of abnormal. The idea is that a human admin could look at one example from a group and decide whether it is normal or not. I added this ability after dealing with a web robot that did not appear in the training data and came online in the test data. By grouping, I was able to cut the false positive count from nearly 10,000 for a week (for the web robot alone) to 22 for a week (considering all sources).

abnormal_threshold

This parameter only has meaning if group_abnormal is enabled.

If an item is considered abnormal, this parameter controls how closely it must match other items to be considered part of the class of false positives. The range is 0-1.

try_alternates

If an item is considered abnormal, see if the source can provide an alternate version of the item that may be more normal. I developed this heuristic to try deleting various high-variability lines from an HTTP request in an attempt to make it more normal (and hence acceptable).

IDS::DataSource::alternate() must return either a two-element list (label, ref) consisting of a label describing the new object and a reference to a new IDS::DataSource, or undef.

add_alternates

Learn the alternate version if its similarity value is >= add_threshold.

twopass

The training process requires two passes across the training data, with the second pass function being next_pass.

Biased comment: one-pass algorithms make for easier implementation in real systems, especially ones that have nonstationary data.

learning_curve

Produce output during training showing how similar the added instance is to the existing model. This data may be useful for producing a learning curve for the algorithm.
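These options can also be set programmatically through parameters(); a sketch, assuming the parameter names match the option names above:

    $test->parameters(
        "clean_interval"     => 1000, # call IDS::Algorithm::clean every 1000 items
        "generalize"         => 1,    # generalize after training, before saving
        "add_threshold"      => 0.9,  # add tested items whose similarity is >= 0.9
        "normal_threshold"   => 0.5,  # results below 0.5 are considered abnormal
        "group_abnormal"     => 1,    # group abnormal items to cut false positives
        "abnormal_threshold" => 0.8,  # similarity needed to join an abnormal group
    );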

train()
train(save)

Run the training process by calling IDS::Algorithm::add for each entry returned by the IDS::DataSource.

If save is defined and "true" (in the Perl sense of truth), it is the file name or IO::Handle where the algorithm should save its state.

If curve is defined and "true" (in the Perl sense of truth), then before each instance is added it is first tested (see the learning_curve parameter). The result is printed, and can be used to produce a learning curve for the algorithm.
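A training-run sketch (the state file name is arbitrary):

    # train on every item the data source provides, then save the model
    $test->train("model.state");

    # or save through an already-open IO::Handle
    use IO::File;
    my $state = IO::File->new("model.state", "w") or die "model.state: $!";
    $test->train($state);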

learning_curve(tokenref, string, obs)

Produce a learning curve by testing each instance before it is added to the model. The similarity value of the test instance is the return value, which can be printed by foreach. This function is primarily for internal use.

This function can also save the model every n observations.

test()
test(save)

Run the testing process. If save is defined, it is the file name or IO::Handle where the algorithm should save its state.
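A testing-run sketch; saving the state afterwards is mainly useful when add_threshold > 0:

    # plain test run
    $test->test();

    # test, then save the (possibly updated) state; useful when
    # add_threshold > 0 has caused items to be added during testing
    $test->test("model.state");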

testfunc(tokenref, data, instance)

A testing function, called once per instance being tested. This function is broken out of the actual test routine because we may need to do various things when testing that we do not need to do in learning; examples include handling the parameters add_threshold and try_alternates.

tokenref is a reference to a list of tokens representing the current item.

data is a string representation of the current item.

instance is an instance number.

This function is for use within the IDS::Test object and not for general use.

foreach(funcref, funcobj, print?, type)

For each item in the data source, call the function referenced by funcref, which is a method of funcobj. The call is: funcref(funcobj, tokens, string, n)

Print indicates whether or not to print a result line for this call. Normally, you would want to print when testing and not when learning.

Type indicates whether this is a training or a testing foreach. The variable is passed without interpretation to the IDS::Algorithm::clean function.
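A sketch of how a training pass might be driven through foreach(); the method name, the print flag value, and the type string are assumptions based on the descriptions above.

    # call the algorithm's add() method on every data source item,
    # without printing result lines, marking the pass as training
    $test->foreach($algorithm->can("add"), $algorithm, 0, "train");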

AUTHOR INFORMATION

Copyright 2005-2007, Kenneth Ingham. All rights reserved.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

Address bug reports and comments to: ids_test at i-pi.com. When sending bug reports, please provide the versions of IDS::Test.pm, IDS::Algorithm.pm, IDS::DataSource.pm, the version of Perl, and the name and version of the operating system you are using. Since Kenneth is a PhD student, the speed of the response depends on how the research is proceeding.

BUGS

Please report them.

SEE ALSO

IDS::Algorithm, IDS::DataSource