NAME
Treex::Tool::Parser::MSTperl::ModelLabelling
VERSION
version 0.08055
DESCRIPTION
This is an in-memory representation of a labelling model, extended from Treex::Tool::Parser::MSTperl::ModelBase.
FIELDS
Inherited from base package
Fields inherited from Treex::Tool::Parser::MSTperl::ModelBase.
- config
-
Instance of Treex::Tool::Parser::MSTperl::Config containing settings to be used for the model.
Currently the settings most relevant to the model are the following:
- EM_EPSILON
- labeller_algorithm
-
See "labeller_algorithm" in Treex::Tool::Parser::MSTperl::Config.
- labelledFeaturesControl
-
See "labelledFeaturesControl" in Treex::Tool::Parser::MSTperl::Config.
- SEQUENCE_BOUNDARY_LABEL
-
See "SEQUENCE_BOUNDARY_LABEL" in Treex::Tool::Parser::MSTperl::Config.
- featuresControl
-
Provides access to labeller features, especially enabling their computation. Instance of Treex::Tool::Parser::MSTperl::FeaturesControl.
Label scoring
- emissions
-
Emission scores for Viterbi. They follow the edge-based factorization and provide scores for various labels for an edge based on its features.
The structure is:
emissions->{feature}->{label} = score
Depending on the algorithm used, the scores may or may not be probabilities, and they may be either MIRA-computed or obtained by standard MLE.
- transitions
-
Transition scores for Viterbi. They follow the first-order Markov chain edge-based factorization and provide scores for various labels for an edge, always based on the previous edge label and possibly also on the edge's features.
Depending on the algorithm used, the scores may or may not be probabilities, and they may be either obtained by standard MLE or MIRA-computed.
The structure is:
transitions->{label_prev}->{label_this} = prob
or
transitions->{feature}->{label_prev}->{label_this} = score
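The nested-hash layouts of emissions and transitions described above can be pictured with a small standalone sketch. It does not use the module itself: the feature strings, label names, scores and the lookup helper are all hypothetical, and returning 0 for unseen combinations is an assumption of the sketch.

```perl
use strict;
use warnings;

# Hypothetical toy data; feature strings, labels and scores are made up.
# emissions->{feature}->{label} = score
my %emissions = (
    'POS=noun' => { Subj => 0.6, Obj => 0.4 },
    'POS=adj'  => { Atr  => 1.0 },
);

# Feature-less variant: transitions->{label_prev}->{label_this} = prob
my %transitions = (
    Subj => { Obj => 0.7, Atr => 0.3 },
);

# Feature-based variant:
# transitions->{feature}->{label_prev}->{label_this} = score
my %transitions_with_features = (
    'POS=noun' => { Subj => { Obj => 1.5, Atr => -0.5 } },
);

# Walk down a nested hash along the given keys; return 0 when any
# level is missing (the fallback value is an assumption of this sketch).
sub lookup {
    my ( $hash, @keys ) = @_;
    for my $key (@keys) {
        return 0 if ref $hash ne 'HASH' || !exists $hash->{$key};
        $hash = $hash->{$key};
    }
    return $hash;
}

print lookup( \%emissions,   'POS=noun', 'Subj' ), "\n";
print lookup( \%transitions, 'Subj',     'Obj' ),  "\n";
print lookup( \%transitions_with_features, 'POS=noun', 'Subj', 'Atr' ), "\n";
```

The same helper serves both transition layouts because only the number of keys differs.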
Transitions smoothing
In some algorithms linear combination smoothing is used for transition probabilities. The resulting transition probability is then obtained as:
PROB(label|prev_label) =
smooth_bigrams * transitions->{prev_label}->{label} +
smooth_unigrams * unigrams->{label} +
smooth_uniform * uniform_prob
- smooth_bigrams
- smooth_unigrams
- smooth_uniform
-
The actual smoothing parameters computed by the EM algorithm. Each of them is between 0 and 1, and together they sum to 1.
- uniform_prob
-
Uniform probability of a label, computed as 1 / ( keys %{ $self->unigrams } ). Set in compute_smoothing_params.
- unigrams
-
Basic MLE from data, the structure is
unigrams->{label} = prob
To be used for transitions smoothing and/or backoff (can be used both for emissions and transitions). It also contains the SEQUENCE_BOUNDARY_LABEL probability (the SEQUENCE_BOUNDARY_LABEL is counted once for each sequence), which might be inappropriate in some cases (e.g. for emission probabilities).
- EM_heldout_data
-
Just an array ref with the sentences that represent the heldout data, needed to run the EM algorithm in prepare_for_mira(). Used only in training.
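The linear-combination smoothing of transition probabilities described in this section can be sketched as follows. This is a standalone illustration, not the module's code: the %model hash, its numbers and the smoothed_transition_prob helper are hypothetical. It assumes that the uniform term is weighted as smooth_uniform * uniform_prob and that unseen bigrams and unigrams contribute zero.

```perl
use strict;
use warnings;

# Hypothetical toy model; all numbers are made up for illustration.
my %model = (
    transitions     => { Subj => { Obj => 0.5, Atr => 0.5 } },
    unigrams        => { Subj => 0.25, Obj => 0.25, Atr => 0.25, Adv => 0.25 },
    smooth_bigrams  => 0.5,
    smooth_unigrams => 0.25,
    smooth_uniform  => 0.25,
);

# uniform_prob = 1 / number of known labels
$model{uniform_prob} = 1 / scalar keys %{ $model{unigrams} };

# Linear combination of bigram, unigram and uniform probability.
sub smoothed_transition_prob {
    my ( $m, $label, $prev_label ) = @_;

    # Avoid autovivifying a row for an unseen previous label.
    my $row     = $m->{transitions}{$prev_label} || {};
    my $bigram  = $row->{$label}          // 0;    # assumption: unseen => 0
    my $unigram = $m->{unigrams}{$label}  // 0;    # assumption: unseen => 0

    return $m->{smooth_bigrams} * $bigram
        + $m->{smooth_unigrams} * $unigram
        + $m->{smooth_uniform} * $m->{uniform_prob};
}

print smoothed_transition_prob( \%model, 'Obj', 'Subj' ), "\n";
print smoothed_transition_prob( \%model, 'Adv', 'Subj' ), "\n";
```

Because the three weights sum to 1, the smoothed values for a seen previous label still sum to 1 over all labels, while a transition that never occurred in training (here Subj to Adv) receives a small non-zero probability.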
METHODS
Inherited
Subroutines inherited from Treex::Tool::Parser::MSTperl::ModelBase.
Load and store
- store
- store_tsv
- load
- load_tsv
Overridden
Subroutines overriding stubs in Treex::Tool::Parser::MSTperl::ModelBase.
Load and store
- $data = get_data_to_store(), $data = get_data_to_store_tsv()
-
Returns the model data, containing the following fields: unigrams, transitions, emissions, smooth_uniform, smooth_unigrams, smooth_bigrams, uniform_prob.
- load_data($data), load_data_tsv($data)
-
Tries to get all necessary data from $data (see get_data_to_store to see what data are stored). Also does basic checks on the data, e.g. for non-emptiness, but nothing sophisticated. Is algorithm-sensitive, i.e. if some data are not needed for the algorithm used, they do not have to be contained in the hash.
Training support
- prepare_for_mira
-
Called after preprocessing training data, before entering the MIRA phase.
Function varies depending on the algorithm used. Usually converts the counts stored in emissions, transitions and unigrams (accumulated by add_emission, add_transition and add_unigram) to probabilities. Also calls compute_smoothing_params to estimate smoothing parameters for smoothing of transition probabilities.
- get_feature_count
-
Only to provide information about the model. Returns number of features in the model (where a "feature" can stand for various things depending on the algorithm used).
Technical methods
- BUILD
-
my $model = Treex::Tool::Parser::MSTperl::ModelLabelling->new( config => $config, );
Creates an empty model. If you are training the model, this is probably what you want; otherwise you can use load or load_tsv to load an existing labelling model from a file.
However, most often you would probably use a model for a labeller (Treex::Tool::Parser::MSTperl::Labeller) or a labelling trainer (Treex::Tool::Parser::MSTperl::TrainerLabelling) which both automatically create the model on build. The labeller also provides wrapping methods "load_model" in Treex::Tool::Parser::MSTperl::Labeller and "load_model_tsv" in Treex::Tool::Parser::MSTperl::Labeller which you can call to load the model from a file. (Btw. as you might expect, the trainer provides methods "store_model" in Treex::Tool::Parser::MSTperl::TrainerLabelling and "store_model_tsv" in Treex::Tool::Parser::MSTperl::TrainerLabelling.)
MLE on training data
emissions and transitions can be either MIRA-trained or estimated directly from training data using MLE (Maximum Likelihood Estimate). unigrams are always estimated by MLE.
- add_unigram ($label)
-
Increment count for the label in unigrams.
- add_transition ($label_this, $label_prev)
- add_transition ($label_this, $label_prev, $feature)
-
Increment count for the transition in transitions, possibly including a feature on "this" edge if the algorithm uses features with transitions.
- add_emission ($feature, $label)
-
Increment count for this label on an edge with this feature in emissions.
- compute_probs_from_counts ($self->emissions)
-
Takes a hash reference with label counts and changes the counts to probabilities (this is the actual MLE). May be called in prepare_for_mira on emissions, transitions and unigrams.
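The step from counts to probabilities described above can be sketched as follows. This is a standalone illustration under assumptions: the counts_to_probs helper is hypothetical (it is not the module's compute_probs_from_counts), and it assumes that each innermost label-to-count hash is normalized separately so that its values sum to 1.

```perl
use strict;
use warnings;

# Hypothetical helper: normalize every innermost { label => count }
# hash in place so that its values sum to 1.
sub counts_to_probs {
    my ($hashref) = @_;
    my @values = values %$hashref;
    return if !@values;

    if ( ref $values[0] eq 'HASH' ) {
        counts_to_probs($_) for @values;    # recurse into nested levels
    }
    else {
        my $sum = 0;
        $sum += $_ for @values;
        return if $sum == 0;

        # values() returns aliases, so this modifies the hash itself
        $_ /= $sum for values %$hashref;
    }
    return;
}

# What repeated add_unigram calls would accumulate (toy labels).
my %unigrams;
$unigrams{$_}++ for qw(Subj Obj Atr Atr);
counts_to_probs( \%unigrams );

# Toy emission counts: emissions->{feature}->{label} = count
my %emissions = ( 'POS=noun' => { Subj => 3, Obj => 1 } );
counts_to_probs( \%emissions );

printf "%s => %s\n", $_, $unigrams{$_} for sort keys %unigrams;
```

Normalizing per innermost hash means that emission values become relative frequencies of labels given a feature; whether the module normalizes in exactly this way is not stated in the text above.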
EM algorithm
- compute_smoothing_params()
-
The main method containing an implementation of the Expectation Maximization algorithm to compute smoothing parameters (smooth_bigrams, smooth_unigrams, smooth_uniform) for transition probabilities smoothing by linear combination of bigram, unigram and uniform probability. Iteratively tries to find parameters such that the probabilities from training data (transitions, unigrams and uniform_prob), combined by the smoothing parameters, model the heldout data (EM_heldout_data) well enough, i.e. tries to maximize the probability of the heldout data given the training data probabilities by adjusting the smoothing parameter values.
Uses EM_EPSILON as a stopping criterion, i.e. stops when the sum of absolute values of changes to all smoothing parameters is lower than the value of EM_EPSILON.
- count_expected_counts_all()
- count_expected_counts_tree($root_node)
- count_expected_counts_sequence($labels_sequence)
-
Support methods for compute_smoothing_params, listed in the order in which they call each other.
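The EM estimation of the three smoothing parameters can be sketched as the textbook procedure for linear interpolation weights. This is a standalone sketch and not the module's implementation: the em_smoothing_params helper, the toy probabilities, the heldout pairs and the iteration cap are all hypothetical, and the heldout data are simplified to a flat list of [previous label, label] pairs.

```perl
use strict;
use warnings;

# Hypothetical toy probabilities estimated on training data.
my %transitions = ( Subj => { Obj => 0.75, Atr => 0.25 }, Obj => { Atr => 1 } );
my %unigrams    = ( Subj => 0.25, Obj => 0.25, Atr => 0.5 );
my $uniform_prob = 1 / scalar keys %unigrams;

# Hypothetical heldout data as [ label_prev, label_this ] pairs.
my @heldout = ( [ 'Subj', 'Obj' ], [ 'Obj', 'Atr' ], [ 'Atr', 'Subj' ] );

sub em_smoothing_params {
    my ( $heldout, $epsilon, $max_iterations ) = @_;

    # Weights in the order uniform, unigram, bigram; start equal.
    my @lambda = ( 1 / 3, 1 / 3, 1 / 3 );

    for my $iteration ( 1 .. $max_iterations ) {

        # E-step: expected share of each component on the heldout data.
        my @expected = ( 0, 0, 0 );
        for my $pair (@$heldout) {
            my ( $prev, $this ) = @$pair;
            my $row   = $transitions{$prev} || {};
            my @probs = ( $uniform_prob, $unigrams{$this} // 0, $row->{$this} // 0 );
            my @parts = map { $lambda[$_] * $probs[$_] } 0 .. 2;
            my $total = $parts[0] + $parts[1] + $parts[2];
            next if $total == 0;
            $expected[$_] += $parts[$_] / $total for 0 .. 2;
        }

        # M-step: renormalize expected counts into new weights.
        my $sum = $expected[0] + $expected[1] + $expected[2];
        last if $sum == 0;
        my @new = map { $_ / $sum } @expected;

        # Stop when the summed absolute change drops below epsilon.
        my $change = 0;
        $change += abs( $new[$_] - $lambda[$_] ) for 0 .. 2;
        @lambda = @new;
        last if $change < $epsilon;
    }
    return @lambda;
}

my @lambda = em_smoothing_params( \@heldout, 0.00001, 1000 );
printf "uniform=%.3f unigram=%.3f bigram=%.3f\n", @lambda;
```

The iteration cap is only a safety net added for the sketch; the stopping rule itself mirrors the EM_EPSILON criterion described above.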
Scoring
A bunch of methods to score the likelihood of a label being assigned to an edge based on the edge's features and the label assigned to the previous edge.
- get_all_labels()
-
Returns (a reference to) an array of all labels found in the training data (e.g. ['Subj', 'Obj', 'Atr']).
- get_label_score($label, $label_prev, $features)
-
Computes a score of assigning the given label to an edge, given the features of the edge and the label assigned to the previous edge.
A higher score always means a more likely label for the edge. Some algorithms may give a negative score.
Is semantically equivalent to calling get_emission_score and get_transition_score and then combining the results in some way.
- get_emission_score($label, $feature)
-
Computes the "emission score" of assigning the given label to an edge, given one of the features of the edge and disregarding the label assigned to the previous edge.
- get_transition_score($label_this, $label_prev, $feature)
-
Computes the "transition score" of assigning the given label to an edge, given the label assigned to the previous edge and possibly also one of the features of the edge, but NOT including the emission score returned by get_emission_score.
- $result = get_transition_probs_array ($label_this, $label_prev)
-
Returns (a reference to) an array of the probabilities of the transition from label_prev to label_this (to be smoothed together), having the following structure:
$result->[0] = uniform prob
$result->[1] = unigram prob
$result->[2] = bigram prob
- $result = get_emission_scores($features)
-
Get scores of assigning each of the possible labels to an edge based on all the features of the edge. Is semantically equivalent to doing:
foreach label
    foreach feature
        get_emission_score(label, feature)
The structure is:
$result->{label} = score
Actually only serves as a switch for several implementations of the method (get_emission_scores_basic_MIRA and get_emission_scores_no_MIRA); the method to be used is selected based on the algorithm being used.
- get_emission_scores_basic_MIRA($features)
-
A get_emission_scores implementation used with algorithms where the emission scores are computed by MIRA (this is currently the most successful implementation).
- get_emission_scores_no_MIRA($features)
-
A get_emission_scores implementation using only MLE. Probably obsolete now.
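The per-label aggregation performed by get_emission_scores can be sketched as follows. This is a standalone illustration with a hypothetical emission_scores helper and made-up scores. It assumes that the per-feature scores are combined by summation, as in a linear model; the text above does not say how the individual implementations actually combine them.

```perl
use strict;
use warnings;

# Hypothetical toy scores: emissions->{feature}->{label} = score
my %emissions = (
    'POS=noun'    => { Subj => 2,   Obj => 1 },
    'parent=verb' => { Subj => 0.5, Obj => 1, Atr => -1 },
);

# For every label, sum the scores of all features present on the edge.
# Summation is an assumption of this sketch.
sub emission_scores {
    my ($features) = @_;
    my %result;
    for my $feature (@$features) {
        my $label_scores = $emissions{$feature} or next;    # unknown feature
        $result{$_} += $label_scores->{$_} for keys %$label_scores;
    }
    return \%result;
}

my $scores = emission_scores( [ 'POS=noun', 'parent=verb', 'unseen=feature' ] );

# A higher score means a more likely label.
my ($best) = sort { $scores->{$b} <=> $scores->{$a} } keys %$scores;

printf "%s => %s\n", $_, $scores->{$_} for sort keys %$scores;
print "best: $best\n";
```

The result has the same shape as described above, $result->{label} = score, and features unknown to the model are simply skipped.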
Changing the scores
Methods used by the trainer (Treex::Tool::Parser::MSTperl::TrainerLabelling) to adjust the scores to whatever seems to be the best idea at the moment. Used only in MIRA training (MLE uses add_unigram, add_emission, add_transition and compute_probs_from_counts instead).
- set_feature_score($feature, $score, $label, $label_prev)
-
Sets the specified emission score (if label_prev is not set) or transition score (if it is) to the given value ($score).
- update_feature_score($feature, $update, $label, $label_prev)
-
Updates the specified emission score (if label_prev is not set) or transition score (if it is) by the given value ($update), i.e. adds that value to the current value.
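The difference between setting and updating a score, and the role of label_prev in choosing between emissions and transitions, can be sketched as follows. This is a standalone illustration: the %model hash and the two plain subroutines are hypothetical stand-ins that only mimic the documented argument order, and they assume the feature-based transitions layout.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the model's two score tables.
my %model = ( emissions => {}, transitions => {} );

# Overwrite a score: transition if $label_prev is given, emission otherwise.
sub set_feature_score {
    my ( $feature, $score, $label, $label_prev ) = @_;
    if ( defined $label_prev ) {
        $model{transitions}{$feature}{$label_prev}{$label} = $score;
    }
    else {
        $model{emissions}{$feature}{$label} = $score;
    }
    return;
}

# Add $update to the current score (a missing score counts as 0).
sub update_feature_score {
    my ( $feature, $update, $label, $label_prev ) = @_;
    if ( defined $label_prev ) {
        $model{transitions}{$feature}{$label_prev}{$label} += $update;
    }
    else {
        $model{emissions}{$feature}{$label} += $update;
    }
    return;
}

set_feature_score( 'POS=noun', 1, 'Subj' );                  # emission
update_feature_score( 'POS=noun', 0.5, 'Subj' );             # emission
update_feature_score( 'POS=noun', -0.25, 'Obj', 'Subj' );    # transition

print $model{emissions}{'POS=noun'}{Subj}, "\n";
print $model{transitions}{'POS=noun'}{Subj}{Obj}, "\n";
```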
AUTHORS
Rudolf Rosa <rosa@ufal.mff.cuni.cz>
COPYRIGHT AND LICENSE
Copyright © 2011 by Institute of Formal and Applied Linguistics, Charles University in Prague
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.