NAME
MaxEntropy - Perl module for Maximum Entropy Modeling
SYNOPSIS
use Statistics::MaxEntropy;
# debugging messages; default 0
$Statistics::MaxEntropy::debug = 0;
# normalise frequencies in events file; default 1
$Statistics::MaxEntropy::normalise = 1;
# maximum number of iterations for IIS; default 100
$Statistics::MaxEntropy::NEWTON_max_it = 100;
# minimal distance between new and old x for Newton's method;
# default 0.001
$Statistics::MaxEntropy::NEWTON_min = 0.001;
# maximum number of iterations for Newton's method; default 100
$Statistics::MaxEntropy::KL_max_it = 100;
# minimal distance between new and old x; default 0.001
$Statistics::MaxEntropy::KL_min = 0.001;
# configuration of Statistics::MaxEntropy::random_init
# maximum number of candidates randomly generated
$Statistics::MaxEntropy::rand_max_candidates = 20;
# maximum number of features randomly generated
$Statistics::MaxEntropy::rand_max_features = 20;
# initialisation for GIS, IIS, or FI
Statistics::MaxEntropy::init($events_file,
$parameters_file,
$candidates_file);
# random initialisation for GIS, IIS, or FI
Statistics::MaxEntropy::random_init($random_events_file,
$random_parameters_file,
$random_candidates_file);
# Generalised Iterative Scaling
Statistics::MaxEntropy::GIS();
# Improved Iterative Scaling
Statistics::MaxEntropy::IIS();
# Feature Induction algorithm
Statistics::MaxEntropy::FI($nr_candidates_to_add, $iterative_scaler);
# writes new events, candidates, and parameters files
Statistics::MaxEntropy::done($new_events_file,
$new_parameters_file,
$new_candidates_file,
$information_file);
DESCRIPTION
This module is an implementation of the Generalised and Improved Iterative Scaling (GIS, IIS) algorithms and the Feature Induction (FI) algorithm as defined in (Darroch and Ratcliff 1972, (Della Pietra et al. 1997)). The purpose of the scaling algorithms is to find the maximum entropy distribution given a set of events and (optionally) an initial distribution. Also a set of candidate features may be specified; then the FI algorithm may be applied to find and add the candidate feature(s) that give the largest `gain' in terms of Kullback Leibler divergence when it is added to the current set of features.
Events are specified in terms of a set of feature functions (properties) f_1...f_k that map each event to {0,1}: an event is a string of bits. In addition of each event its frequency is given. We assume the event space to have a probability distribution that can be described by
VARIABLES
$Statistics::MaxEntropy::debug
If set to "1", lots of debug information, and intermediate results will be output. Default: "0".
$Statistics::MaxEntropy::normalise
If set to "1", frequencies in the events file are normalised; this is required if they not sum to "1"; default: "1".
$Statistics::MaxEntropy::NEWTON_max_it
Sets the maximum number of iterations in Newton's method. Newton's method is applied to find the new parameters \alpha_i of the features f_i. Default: "100".
$Statistics::MaxEntropy::NEWTON_min
Sets the minimum difference between x' and x in Newton's method; if either the maximum number of iterations is reached or the difference between x' and x is small enough, the iteration is stopped. Default: "0.001".
$Statistics::MaxEntropy::KL_max_it
Sets the maximum number of iterations applied in the IIS algorithm. Default: "100".
$Statistics::MaxEntropy::KL_min
Sets the minimum difference between KL divergences of two distributions in the IIS algorithm; if either the maximum number of iterations is reached or the difference between the divergences is enough, the iteration is stopped. Default: "0.001".
$Statistics::MaxEntropy::rand_max_candidates
Configures
Statistics::MaxEntropy::random_init
; it sets the maximum number candidate features that are randomly generated for each event. Values "2" and higher are allowed.$Statistics::MaxEntropy::rand_max_features
Configures
Statistics::MaxEntropy::random_init
; it sets the maximum number of features that are randomly generated for each event.The minimum number of events that are randomly generates is "2". The maximum number of events is bounded from above by 2^{number of features randomly chosen}.
FUNCTIONS
Statistics::MaxEntropy::init($events_file,
$parameters_file,
$candidates_file);
The events file is required. The candidate and initial parameter files are optional. If the initial parameter file is specified as the empty string then random parameters from [0,1] are chosen.
Statistics::MaxEntropy::random_init($random_events_file,
$random_parameters_file,
$random_candidates_file);
All three files are required. Your program dies, if you leave one or more unspecified. A caveat:
Statistics::MaxEntropy::GIS();
Calls the Generalised Iterative Scaling algorithm. See (Darroch and Ratcliff 1972).
Statistics::MaxEntropy::IIS();
Calls the Improved Iterative Scaling algorithm. If no events were loaded or generated, your program exits. See (Della Pietra et al. 1997).
Statistics::MaxEntropy::FI($nr_candidates_to_add, $iterative_scaler);
Calls the Feature Induction algorithm. The parameter
$nr_candidates_to_add
is for the number of candidates it should add. If this number is greater than the number of candidates, all candidates are added. If it is not specified, it "1" is used. The parameter$iterative_scaler
can be used to provide one of scaling algorithmGIS
orIIS
. The references to the sub's should be used as follows:$iterative_scaler = \&GIS; $nr_candidates_to_add = 2; FI($nr_candidates_to_add, $iterative_scaler);
If it is not specified, the improved algorithm is used.
Statistics::MaxEntropy::done($new_events_file,
$new_parameters_file,
$new_candidates_file,
$information_file);
All parameters are optional. The candidate numbers that were added are printed to the information file. The new events file contains the events extended with the new features, the new candidates file contains the set of candidates minus the candidates added. To the distribution file the new parameters are printed. The information file is used to print the indices of the candidates that were added to. Before a new call of
init
orrandom_init
, callingStatistics::MaxEntropy::done
is a good thing to do.
SYNTAX OF INPUT/OUTPUT FILES
Lines that start with a `#' and empty lines are ignored.
Below we give the syntax of in and output files.
EVENTS FILE (input/output)
Syntax of the event file (n features, and m events); the following holds for features:
each line is an event;
each column represents a feature function; the co-domain of a feature function is {0,1};
no space between feature columns;
constant features (i.e. columns that are completely 0 or 1) are forbidden;
2 or more events should be specified (this is in fact a consequence of the previous requirement;
The frequency of each event precedes the feature columns. Features are indexed from right to left. This is a consequence of how Bit::Vector
reads bit strings. Each fij is a bit and freqi a float in the following schema:
freq1 <white> f1n ... f13 f12 f11 <newline>
. .
. .
. .
freqi <white> fin ... fi3 fi2 fi1 <newline>
. .
. .
. .
freqm <white> fmn ... fm3 fm2 fm1
(m events, n features)
PARAMETERS FILE (input/output)
Syntax of the initial parameters file; one parameter per line:
par1 <newline>
.
.
.
pari <newline>
.
.
.
parn
(n features)
The syntax of the output distribution is the same.
CANDIDATES FILE (input)
The syntax of the candidate feature file is more or less the same as that for the events file:
each line is an event (events specified in the same order as the events file);
each column is a feature;
constant feature functions are forbidden;
values are 0 or 1;
no space between features;
fij are bits:
f1c ... f13 f12 f11 <newline>
.
.
.
fic ... fi3 fi2 fi1 <newline>
.
.
.
fmc ... fm3 fm2 fm1
(m events, c candidate features)
INFO FILE (output)
The indices of the candidates added to the field, one per line. Bit strings are always indexed right to left.
BUGS
It's slow.
Statistics::MaxEntropy
communicates through files mainly. The reason for is a practical one: for my own purposes I needed communication through files; I'm using the module for large event bases (corpora), and I'm not interested in (large) arrays that tell me what candidates have been added, what parameters were found, or how the events file looks like in an array. My (other) software components will read the files again.
ENVIRONMENT
No environment variables are used.
AUTHOR
SEE ALSO
perl(1), Bit::Vector, and POSIX.
DIAGNOSTICS
The module dies with an appropriate message if
it cannot open a specified events file;
if you specified a constant feature function (in the events file or the candidates file); this may happen sometimes if you call
Statistics::MaxEntropy::random_init
;if
Statistics::MaxEntropy::FI
is called and no candidates are given;if
Statistics::MaxEntropy:IIS
is called and no events are given;if the events file, candidates file, or the parameters file is not consistent; causes (a.o.): insufficient features, or too many features for some event; inconsistent candidate lines; insufficient, or to many event lines in the candidates file.
REFERENCES
(Darroch and Ratcliff 1972) J. Darroch and D. Ratcliff, Generalised Iterative Scaling for log-linear models, Ann. Math. Statist., 43, 1470-1480, 1972.
(Jaynes 1997) E.T. Jaynes, Probability theory: the logic of science, 1997, unpublished manuscript.
(Della Pietra, Della Pietra and Lafferty 1997) Stephen Della Pietra, Vincent Della Pietra, and John Lafferty, Inducing features of random fields, In: Transactions Pattern Analysis and Machine Intelligence, 19(4), April 1997.
COPYRIGHT
Statistics::MaxEntropy
comes with ABSOLUTELY NO WARRANTY and may be copied only under the terms of the GNU General Public License (version 2, or later), which may be found in the distribution.