NAME

Lingua::Interset::Tagset - The root class for all physical tagsets covered by DZ Interset 2.0.

VERSION

version 3.016

SYNOPSIS

package Lingua::Interset::MY::Tagset;
use Moose;
extends 'Lingua::Interset::Tagset';
use Lingua::Interset::FeatureStructure;

sub decode
{
    my $self = shift;
    my $tag = shift;
    my $fs = Lingua::Interset::FeatureStructure->new();
    ...
    return $fs;
}

sub encode
{
    my $self = shift;
    my $fs = shift; # Lingua::Interset::FeatureStructure
    my $tag;
    ...
    return $tag;
}

sub list
{
    my $self = shift;
    return ['NOUN', 'VERB', 'OTHER'];
}

1;

DESCRIPTION

DZ Interset is a universal framework for reading, writing, converting and interpreting part-of-speech and morphosyntactic tags from multiple tagsets of many different natural languages.

The Tagset class is the inheritance root for all classes describing physical tagsets (sets of strings of characters). It defines the interface for decoding tags, encoding tags and listing the known tags.

ATTRIBUTES

permitted_structures

A Lingua::Interset::Trie object that represents all feature structures permitted by this tagset. These are structures that result from decoding one of the known tags returned by the list() method.

This data structure is used to implement strict encoding (see the encode_strict() method).

permitted_values

Reference to a hash that contains all feature values set by the decode() method for at least one of the known tags. If the tagset permits $value of $feature, then

$driver->permitted_values->{$feature}{$value} != 0

Note that a value that is permitted in one context may not be permitted in another. (For example, plural number could be allowed for a noun but not for an adverb.) Unlike in permitted_structures, this hash just ignores context.
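The shape of such a hash can be sketched in plain Perl. This is only an illustration, not the library's code: the feature values below and the helper is_permitted() are hypothetical, and in the real class the hash is filled from the known tags.

```perl
use strict;
use warnings;

# Hypothetical data standing in for $driver->permitted_values.
my %permitted_values = (
    pos    => { noun => 1, adv  => 1 },
    number => { sing => 1, plur => 1 },
);

# Hypothetical helper: returns 1 if the value was seen with the feature
# for at least one known tag, regardless of context; otherwise 0.
sub is_permitted {
    my ($permitted, $feature, $value) = @_;
    # The exists() guard avoids autovivifying unknown features.
    return (exists($permitted->{$feature}) && $permitted->{$feature}{$value}) ? 1 : 0;
}

print is_permitted(\%permitted_values, 'number', 'plur'), "\n";
print is_permitted(\%permitted_values, 'number', 'dual'), "\n";
```

In this sketch the lookup accepts plural number even when the part of speech is an adverb, because the hash records only feature-value pairs and not the combinations in which they occurred.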

METHODS

get_tagset_id()

Returns the tagset id that should be set as the value of the 'tagset' feature during decoding. Every derived class must implement this method, even though the derived class is also responsible for setting the value in its decode() method.

The ID should correspond to the last two parts of the package name, lowercased. Specifically, it should be the ISO 639 language code, followed by :: and a language-specific tagset ID. Example: cs::multext.
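The naming convention can be expressed as a small string operation. The function tagset_id_from_package() below is a hypothetical helper written for this illustration; it is not part of the module, and the package name is used only as an example input.

```perl
use strict;
use warnings;

# Hypothetical helper: take the last two parts of a package name,
# join them with '::' and lowercase the result.
sub tagset_id_from_package {
    my $package = shift;
    my @parts = split(/::/, $package);
    return if @parts < 2;    # not enough parts to form an ID
    return lc(join('::', @parts[-2, -1]));
}

print tagset_id_from_package('Lingua::Interset::Tagset::CS::Multext'), "\n";
```

A real driver would more likely return the ID as a string constant from get_tagset_id(); the helper only shows how the ID relates to the package name.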

decode()

my $fs  = $driver->decode ($tag);

Takes a tag (string) and returns a Lingua::Interset::FeatureStructure object with corresponding feature values set.

Every derived class must implement this method. The Tagset class contains only a placeholder implementation, which throws an exception if it is inherited and called.

encode()

my $tag = $driver->encode ($fs);

Takes a Lingua::Interset::FeatureStructure object and returns the tag (string) in the given tagset that corresponds to the feature values. Note that some features may be ignored because they cannot be represented in the given tagset.

Every derived class must implement this method. The Tagset class contains only a placeholder implementation, which throws an exception if it is inherited and called.
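The contract for decode() and encode() can be illustrated without the library. In the sketch below, My::BaseTagset and My::ToyTagset are hypothetical stand-ins (the real root class is built with Moose), a plain hash reference stands in for the feature structure, and the toy driver deliberately overrides only decode() to show what happens when encode() is inherited and called.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the root class: the inherited methods only die.
package My::BaseTagset;
sub new    { my $class = shift; return bless {}, $class; }
sub decode { my $self = shift; die ref($self) . " must implement decode()\n"; }
sub encode { my $self = shift; die ref($self) . " must implement encode()\n"; }

# Hypothetical driver that overrides decode() but forgets encode().
package My::ToyTagset;
our @ISA = ('My::BaseTagset');
sub decode {
    my ($self, $tag) = @_;
    my %map = (NOUN => 'noun', VERB => 'verb');
    my %fs  = (pos => $map{$tag} // '');    # plain hash instead of a FeatureStructure
    return \%fs;
}

package main;

my $driver = My::ToyTagset->new();
my $fs = $driver->decode('NOUN');
print "pos = $fs->{pos}\n";

# Calling the inherited encode() raises an exception, which eval catches.
my $ok = eval { $driver->encode($fs); 1 };
my $error = $ok ? '' : $@;
print "encode() failed: $error" unless $ok;
```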

encode_strict()

my $tag = $driver->encode_strict ($fs);

Takes a feature structure (Lingua::Interset::FeatureStructure) and returns a tag that matches the contents of the feature structure.

Unlike encode(), encode_strict() always returns a known tag, i.e. one that is returned by the list() method of the Tagset object. Many tagsets consist of structured tags, i.e. they can be defined as a compact representation of a feature structure (a set of attribute-value pairs). With such tagsets, encode() can in principle produce combinations of features and values that did not appear in the original tagset. For example, a tagset for Czech is unlikely to contain a tag saying that a word is a preposition and at the same time setting a non-empty value for gender. Yet it is possible to create such a tag because the tagset encodes part of speech and gender independently.

If this behavior is undesirable, the application should call encode_strict() instead of encode(). The resulting tag is then guaranteed to be one of those returned by list(). Nevertheless, think twice about whether you really need the guarantee, as it does not come for free. The need to replace forbidden feature values with permitted ones may sometimes lead to surprising or confusing results.

This method is implemented directly within the Tagset class, relying on custom implementations of list(), decode() and encode().
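The difference between the two encoders can be sketched on a toy tagset. Everything below is an assumption made for illustration: the tags, the functions encode_loose() and encode_strict_sketch(), and in particular the fallback heuristic, which simply picks the known tag sharing the most feature values with the input. The real method relies on permitted_structures and may behave differently.

```perl
use strict;
use warnings;

# Toy tagset: three known tags, each decoded into a hash of feature values.
my %decoded = (
    'N-sg' => { pos => 'noun', number => 'sing' },
    'N-pl' => { pos => 'noun', number => 'plur' },
    'ADV'  => { pos => 'adv' },
);
sub list_tags { return [sort keys %decoded]; }

# Toy encoder: composes pos and number independently, so it can emit unknown tags.
sub encode_loose {
    my $fs = shift;
    my %pos    = (noun => 'N', adv => 'ADV');
    my %number = (sing => 'sg', plur => 'pl');
    my $tag = $pos{$fs->{pos} // ''} // 'X';
    my $num = $number{$fs->{number} // ''};
    $tag .= "-$num" if defined $num;
    return $tag;
}

# Assumed fallback (not the library's algorithm): if the loose tag is unknown,
# return the known tag that shares the most feature values with the input.
# Ties go to the tag that comes first in the sorted list.
sub encode_strict_sketch {
    my $fs = shift;
    my $tag = encode_loose($fs);
    return $tag if exists $decoded{$tag};
    my ($best_tag, $best_score) = (undef, -1);
    foreach my $known (@{ list_tags() }) {
        my $candidate = $decoded{$known};
        my $score = grep { defined $candidate->{$_} && $candidate->{$_} eq $fs->{$_} } keys %{$fs};
        ($best_tag, $best_score) = ($known, $score) if $score > $best_score;
    }
    return $best_tag;
}

my %adverb_plural = (pos => 'adv', number => 'plur');
print encode_loose(\%adverb_plural), "\n";
print encode_strict_sketch(\%adverb_plural), "\n";
```

Here the loose encoder yields ADV-pl, which is not a known tag, while the strict sketch falls back to ADV. The known tag N-pl matches the input equally well, so the choice between them is arbitrary, which is one way a strict result can be surprising.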

list()

my $list_of_tags = $driver->list();

Returns a reference to the list of all known tags in this particular tagset. The list is not directly needed to decode, encode or convert tags, but it is very useful for testing and for advanced operations over the tagset. Note, however, that many tagset drivers contain only an approximate list, created by collecting tag occurrences in some corpus.

Every derived class must implement this method. The Tagset class contains only a placeholder implementation, which throws an exception if it is inherited and called.
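One plausible way to use the list for testing is a round-trip check: decode every known tag, encode the result and compare it with the original tag. The sketch below assumes a toy driver made of plain functions (toy_list(), toy_decode(), toy_encode()) and a hypothetical helper round_trip_failures(); none of these names come from the library.

```perl
use strict;
use warnings;

# Hypothetical toy driver: three tags, with a plain hash as the feature structure.
my %tag_to_pos = (NOUN => 'noun', VERB => 'verb', OTHER => '');
my %pos_to_tag = reverse %tag_to_pos;

sub toy_list   { return ['NOUN', 'VERB', 'OTHER']; }
sub toy_decode { my $tag = shift; my %fs = (pos => $tag_to_pos{$tag} // ''); return \%fs; }
sub toy_encode { my $fs = shift; return $pos_to_tag{$fs->{pos} // ''} // 'OTHER'; }

# Hypothetical helper: returns the known tags that do not survive
# decoding followed by encoding.
sub round_trip_failures {
    my ($list, $decode, $encode) = @_;
    return grep { $encode->($decode->($_)) ne $_ } @{$list};
}

my @failures = round_trip_failures(toy_list(), \&toy_decode, \&toy_encode);
print scalar(@failures), " tag(s) failed the round trip\n";
```

With an approximate tag list, a check like this only covers the tags that were actually collected, so passing it does not prove that the driver handles every valid tag.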

SEE ALSO

Lingua::Interset, Lingua::Interset::FeatureStructure, Lingua::Interset::Trie

AUTHOR

Dan Zeman <zeman@ufal.mff.cuni.cz>

COPYRIGHT AND LICENSE

This software is copyright (c) 2019 by Univerzita Karlova (Charles University).

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.