NAME
Treex::Core::Document - representation of a text and its linguistic analyses in the Treex framework
VERSION
version 2.20210102
DESCRIPTION
A document consists of a sequence of bundles, mirroring a sequence of natural language sentences (typically, but not necessarily, originating from the same text). Attributes (attribute-value pairs) can be attached to a document as a whole.
Note that the references from the bundles to the containing document are weak, so make sure you always keep a reference to the document in scope to prevent the contents of the document from being garbage-collected.
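The effect of weak back-references can be seen in a minimal core-Perl sketch. The hashes below are hypothetical stand-ins for a document and a bundle, not Treex classes; only Scalar::Util::weaken is real API here.

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken);

# Hypothetical stand-ins for a document and one of its bundles (not Treex code).
sub first_bundle_of_new_document {
    my $document = { bundles => [] };
    my $bundle   = { document => $document };
    weaken( $bundle->{document} );    # the bundle -> document link is weak
    push @{ $document->{bundles} }, $bundle;
    return $bundle;                   # the only strong reference to $document ends here
}

my $orphan = first_bundle_of_new_document();

# The weak link was reset to undef when the document was freed.
print defined $orphan->{document} ? "document alive\n" : "document freed\n";
```

Returning the document itself, or storing it in a variable that outlives the bundles, avoids the problem.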
ATTRIBUTES
Instances of Treex::Core::Document have the following attributes:
- description
-
Textual description of the file's content that is stored in the file.
- loaded_from
- path
- file_stem
- file_number
The attributes can be accessed using semi-affordance accessors: getters have the same names as the attributes, while setters start with set_. For example, the attribute path has a getter path() and a setter set_path($path).
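A short sketch of the accessor naming, assuming Treex is installed and that path and file_stem accept plain strings. The values 'data/' and 'sample' are arbitrary examples.

```perl
use strict;
use warnings;
use Treex::Core::Document;

my $document = Treex::Core::Document->new;

# Setters: "set_" followed by the attribute name.
$document->set_path('data/');          # arbitrary example value
$document->set_file_stem('sample');    # arbitrary example value

# Getters: same name as the attribute.
print $document->path, $document->file_stem, "\n";
```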
METHODS
Constructor
- my $new_document = Treex::Core::Document->new;
-
creates a new empty document object.
- my $new_document = Treex::Core::Document->new( { pmldoc => $pmldoc } );
-
creates a Treex::Core::Document instance from an already existing Treex::PML::Document instance.
- my $new_document = Treex::Core::Document->new( { filename => $filename } );
-
loads a Treex::Core::Document instance from a .treex file.
Access to zones
Document zones are instances of Treex::Core::DocZone, parametrized by language code and possibly also by another free label called selector, whose purpose is to distinguish zones for the same language but from a different source.
- my $zone = $doc->create_zone( $langcode, ?$selector );
- my $zone = $doc->get_zone( $langcode, ?$selector );
- my $zone = $doc->get_or_create_zone( $langcode, ?$selector );
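The three zone methods might be combined as follows. This is a sketch assuming Treex is installed; the language codes 'en' and 'cs' and the selector 'ref' are example values, and the behaviour of get_zone for a missing zone (returning nothing) is an assumption rather than something stated above.

```perl
use strict;
use warnings;
use Treex::Core::Document;

my $doc = Treex::Core::Document->new;

# Zone for English without a selector.
my $plain_zone = $doc->create_zone('en');

# A second English zone from a different source, told apart by its selector.
my $ref_zone = $doc->get_or_create_zone( 'en', 'ref' );

# Looking a zone up again uses the same language/selector pair.
my $found   = $doc->get_zone( 'en', 'ref' );
my $missing = $doc->get_zone('cs');    # assumed to be undef: no Czech zone was created

print defined $found   ? "en/ref zone found\n" : "en/ref zone missing\n";
print defined $missing ? "cs zone found\n"     : "cs zone missing\n";
```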
Access to bundles
- my @bundles = $document->get_bundles();
-
Returns the array of bundles contained in the document.
- my $new_bundle = $document->create_bundle();
-
Creates a new empty bundle and appends it at the end of the document.
- my $new_bundle = $document->new_bundle_before( $existing_bundle );
-
Creates a new empty bundle and inserts it in front of the existing bundle.
- my $new_bundle = $document->new_bundle_after( $existing_bundle );
-
Creates a new empty bundle and inserts it after the existing bundle.
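The bundle methods above can be exercised together in a small sketch, assuming Treex is installed. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Treex::Core::Document;

my $document = Treex::Core::Document->new;

# Two bundles appended at the end of the document.
my $first = $document->create_bundle();
my $last  = $document->create_bundle();

# One bundle inserted after $first and one in front of it.
my $middle = $document->new_bundle_after($first);
my $lead   = $document->new_bundle_before($first);

# According to the descriptions above, the order should now be:
# $lead, $first, $middle, $last.
my @bundles = $document->get_bundles();
printf "%d bundles\n", scalar @bundles;
```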
Node indexing
- $document->index_node_by_id( $id, $node );
-
The node is added to the document's indexing table id2node (this is done automatically in Treex::Core::Node::set_attr() if the attribute name is 'id'). When undef is used in place of the second argument, the entry for the given id is deleted from the hash.
- my $node = $document->get_node_by_id( $id );
-
Return the node which has the value $id in its 'id' attribute, no matter which tree and which bundle in the given document the node belongs to.
It is prohibited in Treex for IDs to point outside of the current document. In rare cases where your data has such links, we recommend that you split the documents differently or work around the problem by dropping the problematic links.
- $document->id_is_indexed( $id );
-
Return true if the given id is already present in the indexing table.
- $document->get_all_node_ids();
-
Return the array of all node identifiers indexed in the document.
- $document->get_references_to_id( $id );
-
Return all references leading to the given node id in a hash (keys are reference types, e.g. 'alignment', 'a/lex.rf' etc., values are arrays of nodes referencing this node).
- $document->remove_refences_to_id( $id );
-
Remove all references to the given node id (calls remove_reference() on each referencing node).
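When an id may or may not be indexed, a lookup can be guarded with id_is_indexed() first. The helper below is hypothetical (it is not part of Treex) and relies only on the two methods documented above, so it works with any object that provides them.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Treex: return the node for $id,
# or undef when the id is missing from the document's indexing table.
sub find_node {
    my ( $document, $id ) = @_;
    return undef unless defined $id;
    return undef unless $document->id_is_indexed($id);
    return $document->get_node_by_id($id);
}

# Usage sketch (assumes $document is a Treex::Core::Document):
#   my $node = find_node( $document, $some_id );
#   print "no such node\n" unless defined $node;
```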
Serializing
- my $document = load($filename, \%opts)
-
Loads document from $filename given %opts using Treex::PML::Document::load().
- $document->save($filename)
-
Saves document to $filename using Treex::PML::Document::save(), or by the Storable module if the file's extension is .streex.gz.
- Treex::Core::Document->retrieve_storable($filename)
-
Loads a document from the .streex (Storable) format.
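A save-and-reload round trip might look like the sketch below. It assumes Treex is installed with its PML schema available and that an empty document can be saved as is. The file name roundtrip.treex is hypothetical, and the file is reloaded through the filename constructor described earlier.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use Treex::Core::Document;

# Temporary directory that is removed when the program exits.
my $dir      = tempdir( CLEANUP => 1 );
my $filename = "$dir/roundtrip.treex";    # hypothetical file name

# Build a small document and write it out in the .treex format.
my $document = Treex::Core::Document->new;
$document->create_bundle() for 1 .. 2;
$document->save($filename);

# Load it back through the constructor.
my $reloaded         = Treex::Core::Document->new( { filename => $filename } );
my @reloaded_bundles = $reloaded->get_bundles();
printf "%d bundles after reload\n", scalar @reloaded_bundles;
```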
Other
AUTHOR
Zdeněk Žabokrtský <zabokrtsky@ufal.mff.cuni.cz>
Martin Popel <popel@ufal.mff.cuni.cz>
Ondřej Dušek <odusek@ufal.mff.cuni.cz>
COPYRIGHT AND LICENSE
Copyright © 2011-2012 by Institute of Formal and Applied Linguistics, Charles University in Prague
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.