NAME
Treex::PML::Instance - Perl extension for loading/saving PML data
SYNOPSIS
use Treex::PML::Instance;
Treex::PML::AddResourcePath( "$ENV{HOME}/my_pml_schemas" );
my $pml = Treex::PML::Instance->load({ filename => 'foo.xml' });
my $schema = $pml->get_schema;
my $data = $pml->get_root;
$pml->save();
DESCRIPTION
This class provides a simple implementation of a PML instance.
EXPORT
None by default.
The following export tags are available:
- :constants
-
Imports the following constants:
- LM
-
name of the "<LM>" (list-member) tag
- AM
-
name of the "<AM>" (alt-member) tag
- PML_NS
-
XML namespace URI for PML instances
- PML_SCHEMA_NS
-
XML namespace URI for PML schemas
- SUPPORTED_PML_VERSIONS
-
space-separated list of supported PML-schema version numbers
- :diagnostics
-
Imports internal _die, _warn, and _debug diagnostics commands.
CONFIGURATION
The option 'config' of the methods load() and save() can provide a parsed configuration file. The configuration file is a PML instance whose PML schema is defined in the file pmlbackend_conf_schema.xml, distributed with Treex::PML in Treex/PML/Backend/pmlbackend_conf_schema.xml.
This file can set defaults for some options of load() and save() and it can also define rules for pre-processing the input documents before parsing them as PML and for post-processing the output documents after serializing them as PML. Currently only XSLT 1.0, Perl and external-command pre-processing and XSLT 1.0 post-processing are implemented.
The PMLTransform backend, when initialized (e.g. by calling AddBackend('PMLTransform')), automatically loads the first configuration file named pmlbackend_conf.xml that it finds in the Treex::PML resource paths. Additionally, it searches the resource paths for all configuration files named pmlbackend_conf.inc and merges their transformation rules into the in-memory image of the main configuration file. PMLTransform then uses the resulting configuration for all load/save operations.
IMPORTANT NOTE: add the PMLTransform backend as the last I/O backend, because its test() method automatically accepts any XML file (with the prospect of attempting to transform it during the read() phase). It must therefore come after all other backends working with XML-based formats in the I/O backends list.
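Why the order matters can be sketched in plain Perl. The example below does not use Treex::PML at all: the backend table, the test subroutines and pick_backend() are hypothetical stand-ins, and the sketch assumes (as the note implies) that backends are tried in list order and the first one whose test() accepts the input wins.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for I/O backends; each has a test() that says
# whether the backend accepts the given input.
my %backend = (
    SpecificXML  => { test => sub { $_[0] =~ /<special\b/ } },
    PMLTransform => { test => sub { $_[0] =~ /^\s*</ } },   # accepts any XML
);

# Return the name of the first backend in @$order whose test() accepts $input.
sub pick_backend {
    my ($order, $input) = @_;
    for my $name (@$order) {
        return $name if $backend{$name}{test}->($input);
    }
    return;
}

my $doc = '<special version="1"/>';
print pick_backend([qw(SpecificXML PMLTransform)], $doc), "\n";  # prints "SpecificXML"
print pick_backend([qw(PMLTransform SpecificXML)], $doc), "\n";  # prints "PMLTransform"
```

With the catch-all backend first, the more specific backend never gets a chance to claim its own format.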
Here is an example of a configuration file (see the schema for more details).
<?xml version="1.0" encoding="utf-8"?>
<pmlbackend xmlns="http://ufal.mff.cuni.cz/pdt/pml/">
  <head>
    <schema href="pmlbackend_conf_schema.xml"/>
  </head>
  <options>
    <load>
      <validate_cdata>1</validate_cdata>
      <use_resources>1</use_resources>
    </load>
    <save>
      <indent>4</indent>
      <validate_cdata>1</validate_cdata>
      <write_single_LM>1</write_single_LM>
    </save>
  </options>
  <transform_map>
    <transform id="alpino" test="alpino_ds[@version='1.1' or @version='1.2']">
      <in type="xslt" href="alpino2pml.xsl"/>
      <out type="xslt" href="pml2alpino.xsl"/>
    </transform>
    <transform id="sdata" root="sdata" ns="http://ufal.mff.cuni.cz/pdt/pml/">
      <in type="perl" command="require SDataMerge; return SDataMerge::transform(@_);"/>
    </transform>
    <transform id="tei" test="*[namespace-uri()='http://www.tei-c.org/ns/1.0']">
      <in type="pipe" command="tei2pml.sh">
        <param name="--stdin" />
        <param name="--stdout" />
      </in>
    </transform>
  </transform_map>
</pmlbackend>
METHODS
- Treex::PML::Instance->new ()
-
NOTE: Don't call this constructor directly, use Treex::PML::Factory->createPMLInstance() instead!
Create a new empty PML instance object.
- Treex::PML::Instance->load (\%opts)
- $pml->load (\%opts)
-
NOTE: Don't call this method as a constructor directly, use Treex::PML::Factory->createPMLInstance() instead!
Read a PML instance from a file, filehandle, string, or DOM. This method may be used either on an existing object (in which case it operates on and returns this object) or as a constructor (in which case it creates a new Treex::PML::Instance object and returns it). Possible options are:
{
  filename => $filename,        # and/or
  fh => \*FH,                   # or
  string => $xml_string,        # or
  dom => $document,             # (XML::LibXML::Document)
  config => $cfg_pml,           # (Treex::PML::Instance)
  parser_options => \%opt,      # (XML::LibXML parser options)
  no_trees => $bool,
  no_references => $bool,
  no_knit => $bool,
  selected_references => { name => $bool, ... },
  selected_knits => { name => $bool, ... }
}
where filename may be used either by itself or in combination with any of fh, string, or dom, which are otherwise mutually exclusive. The config option may be used to pass a Treex::PML::Instance with the parsed PML backend configuration file (see "CONFIGURATION"). The parser_options option may be used to pass a HASH reference containing options for the XML::LibXML parser (depending on implementation, these will be used to configure either an XML::LibXML::Reader or an XML::LibXML::Parser). If no_trees is true, then the roles #TREES, #NODE and #CHILDNODES are ignored.
The option selected_references determines which reffiles (with a non-empty readas attribute) to read: if the value for a reffile name is true, the reffile is read; if it is false, the reffile is never read; if no value is given for a reffile, it is read unless the no_references flag is on. The options selected_knits and no_knit determine from which reffiles data can be copied into this document following the rules for the role #KNIT. Their meaning is just like that of selected_references and no_references. Moreover, no_references implies no_knit, unless no_knit is explicitly specified.
- $pml->get_status ()
-
Returns 1 if the last load() was successful.
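The precedence between selected_references, no_references and no_knit described for load() can be sketched as two small decision functions. This is only an illustration of the documented rule in plain Perl: should_read_reffile() and effective_no_knit() are hypothetical helper names, not part of Treex::PML, and the sketch assumes that "a value is not given" means the key is absent from the hash.

```perl
use strict;
use warnings;

# Hypothetical helper: should the reffile called $name be read?
# An explicit entry in selected_references wins; otherwise the
# reffile is read unless no_references is on.
sub should_read_reffile {
    my ($name, $opts) = @_;
    my $selected = $opts->{selected_references} || {};
    return $selected->{$name} ? 1 : 0 if exists $selected->{$name};
    return $opts->{no_references} ? 0 : 1;
}

# Hypothetical helper: no_references implies no_knit unless no_knit
# was specified explicitly.
sub effective_no_knit {
    my ($opts) = @_;
    return $opts->{no_knit} ? 1 : 0 if exists $opts->{no_knit};
    return $opts->{no_references} ? 1 : 0;
}

my $opts = { no_references => 1, selected_references => { adata => 1 } };
print should_read_reffile('adata', $opts), "\n";  # prints "1" (explicitly selected)
print should_read_reffile('mdata', $opts), "\n";  # prints "0" (no_references applies)
print effective_no_knit($opts), "\n";             # prints "1" (implied by no_references)
```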
- $pml->save (\%opts)
-
Save the PML instance to a file or filehandle. Possible options are: filename, fh, config, refs_save, write_single_LM. If both filename and fh are specified, fh is used, but the filename associated with the Treex::PML::Instance object is changed to filename. If neither is given, the filename currently associated with the Treex::PML::Instance object is used. The config option may be used to pass a Treex::PML::Instance representing the parsed PML backend configuration file (see "CONFIGURATION"). The refs_save option may be used to specify which reference files should be saved along with the Treex::PML::Instance and where to. The value of refs_save, if given, should be a HASH reference mapping reference IDs to the target URLs (filenames). If refs_save is given, only those references listed in the HASH are saved along with the Treex::PML::Instance. If refs_save is undefined or not given, all references are saved (to their original locations). In both cases, only files declared as readas='dom' or readas='pml' can be saved.
- $pml->convert_to_fsfile (fsfile)
-
Translates the current Treex::PML::Instance object to a Treex::PML::Document object (using Treex::PML::Document MetaData and AppData fields for storage of non-tree data). If the fsfile argument is not provided, creates a new Treex::PML::Document object, otherwise operates on the given fsfile. Returns the resulting Treex::PML::Document object.
- $pml->convert_from_fsfile (fsfile)
- Treex::PML::Instance->convert_from_fsfile (fsfile)
-
Translates a Treex::PML::Document object to a Treex::PML::Instance object. Non-tree data are fetched from Treex::PML::Document MetaData and AppData fields. If called on an instance, modifies and returns the instance, otherwise creates and returns a new instance.
- Treex::PML::Instance::get_data ($obj,$path)
-
Retrieve a possibly nested value from the attribute data structure of $obj. The path argument uses an XPath-like expression of the form
step1/step2/...
where each step (depending on the value retrieved by the preceding part of the expression) can be one of:
- name of a member of a structure
-
to retrieve that member
- name of an attribute of a container
-
to retrieve that attribute
- name of an element of a sequence
-
to retrieve the first element of that name
- index of the form [n]
-
to retrieve the n-th element (counting from 1) from a list, sequence, or alternative
- combination of name and index of the form name[n]
-
to retrieve the n-th element named 'name' from a sequence
- combination of index and name of the form [n]name
-
to retrieve the n-th element of a sequence provided the n-th element's name is 'name'
In the preceding cases, [n] can be negative, in which case the retrieved value is the n-th element from the end of the list or sequence.
If a step of the form [n] is not given for a list or alternative value then [1] is assumed and the next step is processed.
If the value retrieved by some step is undefined or the step does not match the data type of the value retrieved by the preceding steps, the evaluation is stopped and undef is returned.
For example,
my $value = Treex::PML::Instance::get_data($obj,'foo/bar[2]/[-4]/baz/[5]bam');
is roughly equivalent to
my $el = $obj->{foo}->values('bar')->[1]->[-4]->{baz}->[4];
my $value = $el->name eq 'bam' ? $el->value : undef;
but without the side effect of creating array or hash structures where there are none. To be more specific, if, say, $obj->{x} is not defined, then the Perl expression
if ($obj->{x}[3]{y}) {...}
automatically causes a side-effect of creating an ARRAY reference in $obj->{x} and a HASH reference in the fourth element of this ARRAY. An analogous construct
Treex::PML::Instance::get_data($obj,'foo/[4]/baz');
simply returns undef without either of these side-effects.
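The side effect in question is ordinary Perl autovivification and can be observed without Treex::PML. The snippet below is a plain-Perl demonstration; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $obj = {};

# Merely testing a nested value creates the intermediate levels.
if ($obj->{x}[3]{y}) { }

my $x_is_array   = ref($obj->{x}) eq 'ARRAY';
my $slot_is_hash = ref($obj->{x}[3]) eq 'HASH';
print "x is now an ARRAY reference\n"     if $x_is_array;    # printed
print "x->[3] is now a HASH reference\n"  if $slot_is_hash;  # printed

# Checking each level before dereferencing avoids the side effect.
my $clean = {};
my $found = ref($clean->{x}) eq 'ARRAY'
         && ref($clean->{x}[3]) eq 'HASH'
         && $clean->{x}[3]{y};
print "clean hash is untouched\n" unless exists $clean->{x};  # printed
```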
The following behave the same (provided that the path /foo/bar[2] retrieves a list, sequence or an alternative and /foo/bar[2]/[1]/baz retrieves a sequence):
my $value = Treex::PML::Instance::get_data($obj,'foo/bar[2]/[1]/baz/[1]bam');
my $value = Treex::PML::Instance::get_data($obj,'foo/bar[2]/baz/bam');
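The index rules (counting from 1, negative indices counting from the end, and the implicit [1] step on lists) can be illustrated on plain Perl data. The following is a minimal sketch, not the implementation of get_data: get_path() is a hypothetical helper, plain hashes stand in for structures and plain arrays for lists, and only name steps and [n] steps are handled (no sequences, so no name[n] or [n]name).

```perl
use strict;
use warnings;

# Hypothetical helper: resolve a slash-separated path over plain hashes
# and arrays, returning undef on any mismatch without creating anything.
sub get_path {
    my ($value, $path) = @_;
    for my $step (split m{/}, $path) {
        return unless defined $value;
        if ($step =~ /^\[(-?\d+)\]$/) {
            # Index step: 1-based from the start, negative from the end.
            my $n = $1;
            return unless ref($value) eq 'ARRAY' && $n != 0;
            my $i   = $n > 0 ? $n - 1 : $n;
            my $len = @$value;
            return if $i >= $len || $i < -$len;
            $value = $value->[$i];
        }
        else {
            # Name step on a list: assume [1] first, then look up the name.
            if (ref($value) eq 'ARRAY') {
                return unless @$value;
                $value = $value->[0];
            }
            return unless ref($value) eq 'HASH' && exists $value->{$step};
            $value = $value->{$step};
        }
    }
    return $value;
}

my $obj = { foo => [ { baz => 'first' }, { baz => 'second' } ] };
print get_path($obj, 'foo/[2]/baz'),  "\n";   # prints "second"
print get_path($obj, 'foo/[-1]/baz'), "\n";   # prints "second"
print get_path($obj, 'foo/baz'),      "\n";   # prints "first" (implicit [1])
my $missing = get_path($obj, 'x/[4]/y');
print "missing path gives undef, x not created\n"
    if !defined $missing && !exists $obj->{x};  # printed
```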
- Treex::PML::Instance::get_all($obj, $path)
-
This function returns all matches of a given attribute path on the object. It works just like Treex::PML::Instance::get_data, except that on attribute-path steps that do not give an exact index it recurses into all values of a list, alt or sequence instead of just the first one. Furthermore, unlike Treex::PML::Instance::get_data, this function expands trailing Lists and Alts: if the path leads to a List or Alt value, the member values are returned instead, and this replacement is applied recursively. The expansion of trailing Lists and Alts can be prevented by appending a slash followed by a dot to the attribute path ("$path/.").
- Treex::PML::Instance::set_data ($obj,$path,$value,$strict?)
-
Store a given value to a possibly nested attribute of $obj specified by path. The path argument uses the XPath-like syntax described above for the method Treex::PML::Instance::get_data. If $strict==0 and a non-index step is to be processed on an alternative or list, then step [1] is assumed and the 1st element of the list or alternative is used for further processing of the path expression (except when this occurs in the last step, in which case the entire list or alternative is overwritten by the given value). If $strict==1 and a non-index step is to be processed on an alternative or list, a warning is issued and undef is returned. If $strict==2, the same approach as with $strict==1 is taken, but croak is used instead of warn.
- $pml->for_each_match( { path1 => callback1, path2 => callback2,...})
- Treex::PML::Instance::for_each_match( $obj, { path1 => callback1, path2 => callback2,...}, \%opts )
-
This function traverses a given PML data structure and dispatches callbacks at all occurrences of given attribute paths.
If called on an object other than a Treex::PML::Instance (i.e. Treex::PML::Struct, Treex::PML::List, etc.), the corresponding data type (Treex::PML::Schema::* object) can be provided in the \%opts argument as
{ type => $type_decl }
The callback gets one argument: a hash reference of the form
{ value => $matched_obj, path => $matched_obj_path, type => $obj_type_decl }
where $matched_obj_path is the full canonical path to the matching object. The type key is present in the hash only if for_each_match was called on a Treex::PML::Instance or if the Treex::PML::Schema type of the initial object was given in \%opts.
The path syntax is as described in Treex::PML::Instance::get_data, with the following differences:
1. Path steps of the form [n] or name[n], where n is a number, are not supported (but steps of the form [n]name work).
2. Additionally, steps can be separated with //. As in XPath, this indicates a descendant axis, which allows arbitrary structures between the steps. I.e. a//z matches any data matched by a/z, a/b/z, /a/b/c/z, etc. One can also use // at the very beginning of an expression (//a/b) to match an arbitrarily nested occurrence of a/b (e.g. one matching x/y/z/a/b).
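The meaning of the // separator can be sketched by translating a path pattern into a regular expression over slash-separated paths. This is an illustration only, not how for_each_match is implemented: path_pattern_to_regex() is a hypothetical helper, and the sketch assumes simple relative paths made of plain names, with no index steps and no leading slash.

```perl
use strict;
use warnings;

# Hypothetical helper: compile a pattern using "/" and "//" into a regex.
# "//" is allowed to skip any number of intermediate steps.
sub path_pattern_to_regex {
    my ($pattern) = @_;
    my $any_steps = '(?:[^/]+/)*';
    my $leading   = $pattern =~ s{^//}{};   # true if the pattern began with //
    my @parts;
    for my $chunk (split m{//}, $pattern) {
        push @parts, join '/', map { quotemeta($_) } split m{/}, $chunk;
    }
    my $body   = join "/$any_steps", @parts;
    my $prefix = $leading ? $any_steps : '';
    return qr/^${prefix}${body}\z/;
}

my $re = path_pattern_to_regex('a//z');
for my $path (qw(a/z a/b/z a/b/c/z x/a/z)) {
    printf "%-8s %s\n", $path, $path =~ $re ? 'matches' : 'does not match';
}
# a/z, a/b/z and a/b/c/z match; x/a/z does not, because the
# pattern does not start with //

my $nested = path_pattern_to_regex('//a/b');
print "x/y/z/a/b matches //a/b\n" if 'x/y/z/a/b' =~ $nested;  # printed
```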
- Treex::PML::Instance::get_all_matches($obj,$path,\%opts)
- Treex::PML::Instance::get_all_matches($obj,\@path_list,\%opts)
-
This function returns all data matching a given path or, if the second argument is an array reference, any of the given paths. The path(s), as well as the $obj and \%opts arguments, are as in Treex::PML::Instance::for_each_match. The function returns an array in array context and an array reference in scalar context.
- Treex::PML::Instance::count_matches($obj,$path,\%opts)
- Treex::PML::Instance::count_matches($obj,\@path_list,\%opts)
-
Like Treex::PML::Instance::get_all_matches, but returns only the number of matching objects (without creating any intermediate list).
- Treex::PML::Instance::traverse_data($object, $type_decl, $callback, \%options)
-
Traverses the nested PML content of the given Treex::PML data object (Treex::PML::Instance, Treex::PML::Node, Treex::PML::Struct, etc.). The second argument must be the type of $object, i.e. a Treex::PML::Schema::Decl (or derived). The $callback is a CODE reference (anonymous function) which is called for each nested value with the following arguments: the value, the type declaration for the value (a Treex::PML::Schema::Decl), and the value of $options{data} passed in by the caller to this method.
Options:
no_childnodes: do not descend into child nodes (role #CHILDNODES)
no_trees: do not descend into lists or sequences with the role #TREES
data: user data passed to the callback
- $class_or_instance->validate_object($object, $decl, \%options)
-
Convenience function which currently just calls
$decl->validate_object($object,\%options)
to determine whether the object conforms to the data type declaration.
- $pml->hash_id (id,object,warn)
-
Hash a given object under a given ID. If warn is true, then a warning is issued if the ID was already hashed with a different object.
- $pml->lookup_id (id)
-
Look up an object by ID.
- $pml->get_filename ()
-
Return the filename (string) or URL (URI object) of the PML instance.
- $pml->get_url ()
-
Return URL of the PML instance as URI object.
- $pml->set_filename (filename)
-
Change filename of the PML instance.
- $pml->get_transform_id ()
-
Return ID of the XSL-based transformation specification which was used to convert between an original non-PML format and PML (and back).
- $pml->set_transform_id (transform)
-
Set ID of an XSL-transformation specification which is to be used for conversion from PML to an external non-PML format (and back).
- $pml->get_schema ()
-
Return the Treex::PML::Schema object associated with the PML instance.
- $pml->set_schema (schema)
-
Associate a Treex::PML::Schema with the PML instance (this method should not be used for an instance containing data).
- $pml->get_schema_url ()
-
Return URL of the PML schema file associated with the PML instance.
- $pml->set_schema_url (url)
-
Change URL of the PML schema file associated with the PML instance.
- $pml->get_root ()
-
Return the root data structure.
- $pml->set_root (object)
-
Set the root data structure.
- $pml->get_trees ()
-
Return a Treex::PML::List object containing data structures with role '#NODE' belonging to the first block (list or sequence) with role '#TREES' occurring in the PML instance.
- $pml->get_trees_prolog ()
-
If the PML instance consists of a sequence with role '#TREES', return a Treex::PML::Seq object containing the maximal (but possibly empty) initial segment of this sequence consisting of elements with a role other than '#NODE'.
- $pml->get_trees_epilog ()
-
If the PML instance consists of a sequence with role '#TREES', return a Treex::PML::Seq object containing all elements of the sequence following the first maximal contiguous subsequence of elements with role '#NODE'.
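How a '#TREES' sequence divides into prolog and epilog can be sketched on plain Perl data. The helper below is hypothetical and does not use Treex::PML: each element is modelled as a [name, role] pair, plain arrays stand in for Treex::PML::Seq, and the middle part is simply the first contiguous run of '#NODE' elements as described above.

```perl
use strict;
use warnings;

# Hypothetical helper: split [name, role] pairs into
#   prolog:    maximal initial segment of non-#NODE elements,
#   first_run: first maximal contiguous run of #NODE elements,
#   epilog:    everything after that run.
sub split_trees_sequence {
    my (@elements) = @_;
    my (@prolog, @first_run, @epilog);
    my $i = 0;
    push @prolog,    $elements[$i++] while $i < @elements && $elements[$i][1] ne '#NODE';
    push @first_run, $elements[$i++] while $i < @elements && $elements[$i][1] eq '#NODE';
    push @epilog, @elements[$i .. $#elements];
    return (\@prolog, \@first_run, \@epilog);
}

my ($prolog, $first_run, $epilog) = split_trees_sequence(
    [ meta => '' ], [ tree => '#NODE' ], [ tree => '#NODE' ],
    [ note => '' ], [ tree => '#NODE' ],
);
printf "prolog=%d first_run=%d epilog=%d\n",
    scalar @$prolog, scalar @$first_run, scalar @$epilog;
# prints "prolog=1 first_run=2 epilog=2"
```

Note that the epilog in this sketch keeps the later '#NODE' element, since it follows the first contiguous run.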
- $pml->get_trees_type ()
-
Return the type declaration associated with the list of trees.
- $pml->get_references_hash ()
-
Returns a HASHref mapping file reference IDs to URLs.
- $pml->set_references_hash (\%map)
-
Set a given HASHref as a map between reference IDs and URLs.
- $pml->get_ref_ids_by_name ($name)
-
Returns a list of reference IDs associated with a given name.
- $pml->get_refs_by_name ($name)
-
Returns a list of references associated with a given name.
- $pml->get_reffiles ()
-
Returns a list of hash references. Each element represents a document referenced from the current instance. The list contains only references that were associated with a name (pre-declared in the PML schema). However, a 'name' can be associated with several document references. The elements in the list returned by this method have the following keys:
- readas
-
the value of the 'readas' attribute of the corresponding PML schema declaration
- name
-
the symbolic name of the (type of the) reference as declared in the PML schema
- href
-
a URI of the target document
- id
-
an ID used in the current PML instance to refer to the target document
- $pml->get_refname_hash ()
-
Returns a HASHref mapping file reference names to reference IDs. Each value of the hash is either an ID string (if there is just one reference with a given name) or a Treex::PML::Alt containing all IDs associated with a given name.
- $pml->set_refname_hash (\%map)
-
Set a given HASHref as a map between reference names and reference IDs.
- $pml->get_ref (id)
-
Return a DOM or Treex::PML::Instance object representing the referenced resource with a given ID (applies only to resources declared as readas='dom' or readas='pml').
- $pml->set_ref (id,object)
-
Use a given DOM or Treex::PML::Instance object as a resource of the current Treex::PML::Instance with a given ID (note that this may break knitting).
SEE ALSO
Prague Markup Language (PML) format: http://ufal.mff.cuni.cz/jazz/PML/
Tree editor TrEd: http://ufal.mff.cuni.cz/tred
Related packages: Treex::PML, Treex::PML::Schema, Treex::PML::Document
COPYRIGHT AND LICENSE
Copyright (C) 2006-2010 by Petr Pajas, 2010-2024 Jan Stepanek
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.2 or, at your option, any later version of Perl 5 you may have available.