NAME

ODF::lpOD::Document - General ODF package handling and metadata

DESCRIPTION

This manual page describes the odf_document, the common features of any odf_part of an odf_document, and the particular features of the odf_meta and odf_manifest parts (which handle the global document metadata and the manifest of the associated container).

Every odf_document is associated with an odf_container that encapsulates all the physical access logic. In addition, every odf_document is made of several components called parts. The lpOD API is mainly focused on parts that describe the global metadata, the text content, the layout and the structure of the document, and that are physically stored according to an XML schema. The common lpOD class for these parts is odf_xmlpart (whose Perl implementation is the ODF::lpOD::XMLPart package).

lpOD provides specialized classes for the conventional ODF XML parts, namely odf_meta, odf_content, odf_styles, odf_settings, odf_manifest; some of them provide methods dedicated to get or set the document metadata.

In order to process particular pieces of content in the most complex parts, i.e. odf_content and odf_styles, the odf_element class and its various specialized derivatives are available. They are described in other chapters of the lpOD documentation.

Document initialization

Any access to a document requires a valid odf_document instance, that may be created from an existing document or from scratch, using one of the constructors introduced below. Once created, this instance gives access to individual parts through the get_part() method.

odf_get_document(uri)

This function creates a read-write document instance. The returned object is associated with an existing physical ODF resource, which may be updated. The required argument is the URI of the resource.

Note: in the present implementation, the URI argument must be either a file path or an IO::Handle corresponding to an open file or socket. The physical resource must be a well-formed compressed ODF file, such as those natively produced by OpenOffice.org or compatible office software suites.

Example:

my $doc = odf_get_document('C:\MyDocuments\test.odt');

If the save method of odf_document is later used without an explicit target, the document is written back to the same resource.

odf_new_document_from_template(uri)

Same as odf_get_document, but the ODF resource is used in read-only mode, i.e. it's used as a template in order to generate other physical ODF documents.

Some metadata of the new document are initialized to the following values:

  • the creation and modification dates are set to the current date;

  • the creator and initial creator are set to the owner of the current process as reported by the operating system (if this information is available);

  • the number of editing cycles is set to 1;

  • the identification string of the current lpOD distribution is used as the generator identifier string.

Each piece of metadata may be changed later by the application.

odf_new_document(doc_type)

Unlike other constructors, this one generates an odf_document instance from scratch. Technically, it's a variant of odf_new_document_from_template, but the default template (provided with the lpOD library) is used. The required argument specifies the document type, which must be 'text', 'spreadsheet', 'presentation', or 'drawing'. The new document instance is not persistent; no file is created before an explicit use of the save method.

The following example creates a spreadsheet document instance:

my $doc = odf_new_document('spreadsheet');

The real content of the instance depends on the default template.

A set of valid template ODF files (created using OpenOffice.org) is transparently installed with the standard lpOD distribution. Advanced users may use their own template files. To do so, they have to replace the ODF files present in the templates subdirectory of the lpOD installation; the path to the lpOD installation may be retrieved through the lpod->installation_path common function. The user-provided template files must have the same names.

Some metadata are initialized in the same way as with odf_new_document_from_template.

Document MIME type check and control

get_mimetype

Returns the MIME type of the document (i.e. the full string that identifies the document type). An example of regular ODF MIME type is:

application/vnd.oasis.opendocument.text

set_mimetype(new_mimetype)

Allows the user to force a new arbitrary MIME type (not to be used in ordinary lpOD applications!).

Access to individual document parts

get_part(name)

Generic odf_document method allowing access to any part of a previously created document instance, including parts that are not handled by lpOD. The lpOD library provides symbolic constants that represent the usual ODF XML parts: CONTENT, STYLES, META, MANIFEST, SETTINGS.

This instruction returns the CONTENT part of a document as an odf_content object:

$content = $document->get_part(CONTENT);

With MIMETYPE as argument, get_part() returns the MIME type of the document as a text string, i.e. the same result as get_mimetype().

This method may be used in order to get any other document part, such as an image or any other non-XML part. To do so, the real path of the needed part must be specified instead of one of the XML part symbolic names. As an example, the instruction below returns the binary content of an image:

$img = $document->get_part('Pictures/logo.jpg');

In such a case, the method returns the data as an uninterpreted sequence of bytes.

(Remember that image files included in an ODF package are stored in a Pictures folder.)

Returns undef in case of failure.

get_parts

Returns the list of the document parts.

Accessing data inside a part

Everything in the part is stored as a set of odf_element instances. So, for complex parts (such as CONTENT) or parts that are not explicitly covered in the present documentation, applications need access to an "entry point", which is a particular element. The most used entry points are the root and the body. Every part handler provides the get_root() and get_body() methods, each one returning an odf_element instance that provides all the element-based features (including the creation, insertion or retrieval of other elements that may in turn become working contexts).

For those who know the ODF XML schema, two part-based methods allow the selection of elements according to XPath expressions, namely get_element() and get_element_list(). The first one requires an XPath expression and a positional number; it returns the element corresponding to the given position in the result set of the XPath expression (if any). The second one returns the full result set (i.e. a list of odf_element instances). For example, the instructions below return respectively the first paragraph and all the paragraphs of a part (assuming $part is a previously selected document part):

my $paragraph = $part->get_element('text:p', 0);
my @paragraphs = $part->get_element_list('text:p');

Note that the position argument of get_element is zero-based, and that it may be a negative value (if so, it specifies a position counted backward from the last matching element, -1 being the position of the last one).
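The position rule can be illustrated without lpOD at all. The sketch below uses a plain Perl array as a stand-in for the result set of an XPath expression; pick() is a hypothetical helper, not an lpOD function, and it only mimics the zero-based and negative position semantics described above. Returning undef for an out-of-range position is an assumption of this sketch.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an XPath result set (not real odf_element objects).
my @matches = ('first paragraph', 'second paragraph', 'last paragraph');

# Hypothetical helper mimicking the position argument of get_element():
# zero-based, negative values count backward from the last match.
# An out-of-range position yields undef (an assumption of this sketch).
sub pick {
    my ($list, $position) = @_;
    my $count = scalar @$list;
    return undef if $position >= $count || $position < -$count;
    return $list->[$position];
}

print pick(\@matches, 0),  "\n";   # position 0: the first match
print pick(\@matches, -1), "\n";   # position -1: the last match
```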

So a large part of the lpOD functionality is described with the odf_element class, i.e. ODF::lpOD::Element.

Global document metadata

From the handler provided by the get_meta document method, several metadata of the document may be read or set directly.

Simple metadata accessors

Most metadata are just text strings. The user may read or write each one using a get_xxx or set_xxx accessor, where "xxx" is the lpOD name of a particular property. The presently supported simple properties are:

  • creation_date: the date of the initial version of the document, expressed in ISO-8601 date format

  • creator: the name of the user who created the current version of the document

  • description: the long description of the document

  • editing_cycles: the number of edit sessions (may be regarded as a version number)

  • editing_duration: the total editing time through interactive software, expressed as a time delta in ISO-8601 format

  • generator: the signature of the application that created the document

  • initial_creator: the name of the user who created the first version of the document

  • language: the ISO code of the main language used in the document

  • modification_date: the date of the last modification (i.e. of the current version)

  • subject: the subject (or short description) of the document

  • title: the title of the document.

When used without argument, some set accessors may automatically set default values, according to the capabilities of the runtime environment. For set_creation_date() and set_modification_date(), the default is the current system date. For set_creator() and set_initial_creator(), the default is the identifier of the current system user. For set_generator() the default is the system name of the current program (as it would appear in a command line) or, if not available, the current process identifier. If the execution environment can't provide such information, no default value is provided. set_editing_cycles(), without argument, increments the editing_cycles indicator by 1.

Both set_creation_date and set_modification_date allow the user to provide the date in the ODF-compliant (ISO-8601) format, or in numeric format (like the Perl time format). In the second case, the provided time is automatically converted into the required format. The corresponding get_ accessors always return the dates in their storage format. However, the lpOD library provides a numeric_date global function that translates a regular ISO date into a Perl numeric time value (a symmetric iso_date global function translates a Perl time into an ISO date).
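To make the two date representations concrete, here is a minimal sketch of the same kind of conversion written with core Perl modules only. The functions to_iso_date() and to_numeric_date() are hypothetical illustrations, not the lpOD iso_date and numeric_date functions, and the use of UTC is an assumption made to keep the result independent of the local time zone (this manual doesn't say which one lpOD uses).

```perl
use strict;
use warnings;
use POSIX qw(strftime);
use Time::Local qw(timegm);

# Illustrative equivalents only, NOT the lpOD functions themselves.

# Perl numeric time -> ISO-8601 date string (UTC assumed)
sub to_iso_date {
    my ($time) = @_;
    return strftime('%Y-%m-%dT%H:%M:%S', gmtime($time));
}

# ISO-8601 date string -> Perl numeric time (undef if the string doesn't parse)
sub to_numeric_date {
    my ($iso) = @_;
    my ($y, $mo, $d, $h, $mi, $s) =
        $iso =~ /^(\d{4})-(\d{2})-(\d{2})(?:T(\d{2}):(\d{2}):(\d{2}))?/
        or return undef;
    # The time of day is optional; missing fields default to 0
    return timegm($s // 0, $mi // 0, $h // 0, $d, $mo - 1, $y);
}

my $iso = to_iso_date(time);
print "$iso\n";
print to_numeric_date($iso), "\n";
```

In a real application the lpOD functions should be preferred, since they are guaranteed to agree with the storage format used by the metadata accessors.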

Examples of use:

$meta->set_title("The lpOD Cookbook");
$meta->set_creator("The lpOD Project team");
$meta->set_modification_date(time);
my $old_version = $meta->get_editing_cycles;
$meta->set_editing_cycles($old_version + 1);

Document statistics

The global document statistics (as defined in §3.1.18 of the ODF 1.1 specification) may be read or set using the get_statistics and set_statistics accessors. The first one returns the statistic properties as a hash reference. The second one takes a hash reference with the same structure, containing the attribute names and values. The following example displays the page count of the document (assuming it's a text document):

my $meta = $document->get_meta;
my $stat = $meta->get_statistics;
say $stat->{'meta:page-count'};

Note that nothing prevents applications from using set_statistics to set any arbitrary figures.

Keywords

The document metadata include a list of keywords (possibly empty). This list may be used or changed.

get_keywords

Knowing that a document may be "tagged" by one or more keywords, odf_meta provides a get_keywords method that returns the list of the current keywords as a comma-separated string.

set_keywords(string_of_keywords)

set_keywords allows the user to set a full list of keywords, provided as a single comma-separated string; the provided list replaces any previously existing keyword; this method, used without argument or with an empty string, just removes all the keywords. Example:

$meta->set_keywords("ODF, OpenDocument, Python, Perl, Ruby, XML");

The spaces after the commas are ignored, and it's not possible to set a keyword that contains comma(s) through set_keywords.

set_keyword(keyword)

set_keyword appends a new, given keyword to the list; it's neutral if the given keyword is already present; it allows commas in the given keyword (but we don't recommend such a practice).
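The difference between the two setters can be sketched with plain Perl lists. The helpers below are hypothetical and only mimic the rules stated above (comma-separated input with ignored spaces for set_keywords, no duplicate and no comma splitting for set_keyword); they are not the lpOD implementation.

```perl
use strict;
use warnings;

# Hypothetical helper: split a comma-separated keyword string the way
# set_keywords() is described to, ignoring the spaces around the commas.
sub split_keywords {
    my ($string) = @_;
    return () unless defined $string && length $string;
    my @keywords = split /,/, $string;
    s/^\s+|\s+$//g for @keywords;        # trim each keyword
    return grep { length } @keywords;    # drop empty items
}

# Hypothetical helper: append a keyword unless it's already present,
# mirroring the "neutral if already present" rule of set_keyword().
sub append_keyword {
    my ($list, $keyword) = @_;
    push @$list, $keyword unless grep { $_ eq $keyword } @$list;
    return @$list;
}

my @keywords = split_keywords("ODF, OpenDocument, Python, Perl, Ruby, XML");
append_keyword(\@keywords, "Perl");            # already present: no change
append_keyword(\@keywords, "lpOD, the API");   # a comma is kept as is here
print join(" | ", @keywords), "\n";
```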

check_keyword(keyword)

check_keyword returns TRUE if its argument (which may be a regular expression) matches an existing keyword, or FALSE if the keyword is not present.

remove_keyword(expression)

remove_keyword deletes any keyword that matches the argument (which may be a regular expression).

User-defined metadata

Each user-defined metadata element has a unique name (or key), a value and a datatype.

get_user_field(name)

Retrieves a user-defined field according to its name (that should be unique for the document). In scalar context, returns the value of the field. In array context, returns the value and the data type.

The regular ODF datatypes are float, date, time, boolean, and string.

get_user_fields

The odf_meta API provides a get_user_fields method that returns a list in which each element is a hash ref whose (self-documented) keys are name, value, and type.

As an example, the following loop displays the name, the value and the type of each user field in the metadata part of a document:

my $doc = odf_get_document($source);
my $meta = $doc->get_meta;
foreach my $uf ($meta->get_user_fields) {
        # one line per property, so that the values don't run together
        say "Name   " . $uf->{name}  . "\n" .
            "Value  " . $uf->{value} . "\n" .
            "Type   " . $uf->{type};
        }

set_user_field(name, value, type)

Creates or changes a user field. The first argument is the name (identifier) and the second one is the value. The last argument is the data type, which must be ODF-compliant (see get_user_field). If the type is not specified, its default value is 'string'. If the type is date, the value is automatically converted into ISO-8601 format if provided as a numeric time value.

Examples:

$meta->set_user_field("Development status", "Working draft");
$meta->set_user_field("Security status", "Classified");
$meta->set_user_field("Ready for release", FALSE, "boolean");

How to persistently update a document

Every part may be updated using specific methods that create, change or remove elements, but these methods don't produce any persistent effect.

The updates done in a given part may be either exported as an XML string, or returned to the odf_document instance on which the part depends. With the first option, the user is responsible for the management of the exported XML (which can't be used as is through a typical office application), and the original document is not persistently changed. The second option instructs the odf_document that the part has been changed and that this change should be reflected as soon as the physical resource is written back. However, a part-based method can't directly update the resource. The changes may be made persistent through the save() method of the odf_document object.

serialize

This part-based method returns a full XML export of the part. The returned XML string may be stored somewhere and used later in order to create or replace a part in another document, or to feed another application.

A pretty named option may be provided. If set to TRUE, this option specifies that the XML export should be as human-readable as possible.

The example below returns a conveniently indented XML representation of the content part of a document:

$doc = odf_get_document('C:\MyDocuments\test.odt');
$part = $doc->get_part(CONTENT);
$xml = $part->serialize(pretty => TRUE);

store

This part-based method stores the present state (possibly changed) of the part in a temporary, non-persistent space, waiting for the execution of the next call of the document-based save method.

The following example selects the CONTENT part of a document, removes the last paragraph of this content, then sends back the changed content to the document, that in turn is made persistent:

$content = $document->get_part(CONTENT);
$p = $content->get_body->get_paragraph(-1);
$p->delete;
$content->store;
$document->save;

Like serialize(), store() allows the pretty option.

add_file

This document-based method stores an external file "as is" in the document container, without interpretation. The mandatory argument is the path of the source file.

Optional named parameters path and type are allowed; path specifies the destination path in the ODF package, while type is the MIME type of the added resource.

As an example, the instruction below inserts a binary image file available in the current directory in the "Thumbnails" folder of the document package:

$document->add_file("logo.png", path => "Thumbnails/thumbnail.png");

If the path parameter is omitted, the destination folder in the package is either Pictures if the source is identified as an image file (caution: such a recognition may not work with any image type in any environment) or the root folder.

The following example creates an entry whose every property is specified:

$document->add_file(
        "portrait.jpg",
        path    => "Pictures/portrait.jpg",
        type    => "image/jpeg"
        );

The return value is the destination path.

This method may be used in order to import an external XML file as a replacement of a conventional ODF XML part without interpretation. As an example, the following instruction replaces the STYLES part of a document by an arbitrary file:

$document->add_file("custom_styles.xml", path => STYLES);

Note that the physical effect of add_file() is not immediate; the file is really added (and the source is really required) only when the save() method, introduced below, is called. As a consequence, any update that could be done in a document part loaded using add_file() is lost. According to the same logic, a document part loaded using add_file() is never available in the current document instance; it becomes available if the current instance is made persistent through a save() call then a new instance is created using the saved package with odf_get_document.

set_part

Allows the user to create or replace a document part using data in memory. The first argument is the target ODF part, while the second one is the source string.

del_part

Deletes a part in the document package. The deletion is physically done through the subsequent call of save(). The argument may be either the symbolic constant standing for a conventional ODF XML part or the real path of the part in the package.

The following sequence replaces (without interpretation) the current document content part by an external content:

$document->del_part(CONTENT);
$document->add_file("/somewhere/stuff.xml", path => CONTENT);

Note that the order of these instructions is not significant; when save() is called, it executes all the deletions then all the part insertions and/or updates.

save

This method is provided by the odf_document. If the document instance is associated with a regular ODF resource available for update (meaning that it has been created using odf_get_document and that the user has write access to the resource), the resource is written back and reflects all the changes previously committed by one or more document parts using their respective store methods.

As an example, the sequence below updates an ODF file according to changes made in the META and CONTENT parts:

my $doc = odf_get_document("/home/users/jmg/report.odt");
my $meta = $doc->get_part(META);
my $content = $doc->get_part(CONTENT);
# meta updates are made here
$meta->store;
# content updates are made here
$content->store;
$doc->save;

An optional target parameter may be provided to save(). If set, this parameter specifies an alternative destination for the file (it produces the same effect as the "File/Save As" feature of a typical office software). The target option is always allowed, but it's mandatory with odf_document instances created using an odf_new_document_from... constructor.

Manifest

The manifest part of a document holds the list of the files included in the container associated with the odf_document. It's represented by an odf_manifest object, which is a particular odf_xmlpart.

Each included file is represented by an odf_file_entry object, whose properties are

  • path: full path of the file in the container;

  • type: the media type (or MIME type) of the file.

Initialization

An odf_manifest instance is created through the get_part() method of odf_document, with MANIFEST as part selector:

$manifest = $document->get_part(MANIFEST);

Entry access

The full list of manifest entries may be obtained using get_entries().

It's possible to restrict the list with an optional type parameter whose value is a string used as a regular expression. If type is set, then the method returns the entries whose media type string matches the given expression.

As an example, the first instruction below returns the entries that correspond to XML parts only, while the next one returns all the XML entries, including those whose type is not "text/xml" (such as "application/rdf+xml"), and the last returns all the "image/xxx" entries (whatever the image format):

@xmlp_entries = $manifest->get_entries(type => 'text/xml');
@xml_entries = $manifest->get_entries(type => 'xml');
@image_entries = $manifest->get_entries(type => 'image');
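The effect of the three expressions can be checked on a small data set. In the sketch below, plain hashes stand in for odf_file_entry objects and entries_by_type() is a hypothetical helper that mimics the type filter of get_entries(); the entry list itself is invented for the illustration.

```perl
use strict;
use warnings;

# Hypothetical manifest content: plain hashes stand in for odf_file_entry objects.
my @entries = (
    { path => 'content.xml',       type => 'text/xml' },
    { path => 'manifest.rdf',      type => 'application/rdf+xml' },
    { path => 'Pictures/logo.png', type => 'image/png' },
    { path => 'Pictures/',         type => '' },
);

# Hypothetical helper mimicking get_entries(type => ...): keep the entries
# whose media type matches the given expression.
sub entries_by_type {
    my ($pattern, @list) = @_;
    return grep { $_->{type} =~ /$pattern/ } @list;
}

# 'xml' is less restrictive than 'text/xml', so it also selects the RDF entry
for my $pattern ('text/xml', 'xml', 'image') {
    my @paths = map { $_->{path} } entries_by_type($pattern, @entries);
    print "$pattern: @paths\n";
}
```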

An individual entry may be selected according to its path, knowing that the path is the entry identifier. The get_entry() method, whose mandatory argument is the path, does the job. The following instruction returns the entry that stands for a given image resource included in the package (if any):

$img_entry = $manifest->get_entry('Pictures/13BE2000BDD8EFA.jpg');

Entry creation and removal

Once selected, an entry may be deleted using the generic delete method. The del_entry() method, whose mandatory argument is an entry path, deletes the corresponding entry, if any. If the given entry doesn't exist, nothing is done. The return value is the removed entry, or undef.

A new entry may be added using the set_entry() method. This method requires a unique path as its mandatory argument. A type optional named parameter may be provided, but is not required; without type specification, the media type remains empty. This method returns the new entry object, or a null value in case of failure. The example below adds an entry corresponding to an image file:

$manifest->set_entry('Pictures/xyz.jpg', type => 'image/jpeg');

If set_entry() is called with the same path as an existing entry, the old entry is removed and replaced by the new one.

If the entry path is a folder, i.e. if its last character is "/", then the media type is automatically set to an empty value. However, this rule doesn't apply to the root folder, i.e. "/", whose type should be the MIME type of the document.

Beware: adding or removing a manifest entry doesn't automatically add or remove the corresponding file in the container, and there is no automatic consistency check between the real content of the part and the manifest.

Entry property handling

An individual manifest entry is an odf_file_entry object, which is a particular odf_element object.

It provides the get_path(), set_path(), get_type(), set_type() accessors, to get or set the path and type properties. There is no check with set_type(), so the user is responsible for the consistency between the given type and the real content of the corresponding file. On the other hand, set_path() fails if the given path is already used by another entry; but there is no other check regarding this property, so the user must check the consistency between the given path and the real path of the corresponding resource.

If set_path() sets a path whose last character is "/", the media type of the entry is automatically set to an empty string. However, for users who know exactly what they are doing, set_type() can force a non-empty type after set_path().

COPYRIGHT & LICENSE

Copyright (c) 2010 Ars Aperta, Itaapy, Pierlis, Talend.

This work was sponsored by the Agence Nationale de la Recherche (http://www.agence-nationale-recherche.fr).

lpOD is free software; you can redistribute it and/or modify it under the terms of either:

a) the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. lpOD is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with lpOD. If not, see http://www.gnu.org/licenses/.

b) the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
