NAME

XML::Essex::Model - Essex objects representing SAX events and DOM trees

SYNOPSIS

Used internally by Essex; see below for external API examples.

DESCRIPTION

This page describes all of the events explicitly supported so far. Unsupported events are still handled as anonymous events; see XML::Essex::Event for details.

A short word on abbreviations

A goal of Essex is to allow code to be as terse or as verbose as is appropriate for the job at hand, so almost every object name may be abbreviated. For example, start_element may be abbreviated as start_elt, both for the isa() method/function and for constructing objects.

All objects are actually blessed into classes using the long name, like XML::Essex::start_element, even if you use an abbreviation like XML::Essex::start_elt->new to create them.

Stringification

All events are stringifiable for debugging purposes and so that attribute values, character data, comments, and processing instructions may be matched with Perl string operations like regular expressions and index(). It is usually more effective to use EventPath, but you can use stringified values for things like complex regexp matching.

It is unwise to match other events with string operators because no XML escaping of data is done: a "<" in an attribute value or character data is stringified as a bare "<". So print()ing the three events associated with the sequence "<foo>&lt;bar/></foo>" produces "<foo><bar/></foo>", which is obviously not what the document intended. Given the rarity of such constructs in real-life XML, though, this is sufficient for debugging purposes, and it does make it easy to match against strings.
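
The effect can be reproduced without Essex. The following is a minimal sketch, assuming a hypothetical My::Chars class that stands in for a characters event and a hypothetical escape_chars() helper; neither is part of the Essex API. It shows how an overloaded stringification yields unescaped text that is still convenient to match against.

```perl
use strict;
use warnings;

## Hypothetical stand-in for an Essex event: an object that
## stringifies to its raw, unescaped data via overload.
package My::Chars;
use overload '""' => sub { $_[0]->{Data} }, fallback => 1;
sub new { my ( $class, $data ) = @_; return bless { Data => $data }, $class }

package main;

## Minimal escaper for character data; an assumption, not Essex code.
sub escape_chars {
    my $s = shift;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

## The character data a parser reports for "<foo>&lt;bar/></foo>".
my $chars = My::Chars->new( "<bar/>" );

## Stringified as-is: looks like markup that was never in the document.
my $debug_view = "<foo>" . $chars . "</foo>";

## Escaped on output, which is the job a put()-style emitter must do.
my $safe_view = "<foo>" . escape_chars( "$chars" ) . "</foo>";

## String operators work directly on the object.
my $matched = $chars =~ /bar/ ? 1 : 0;

print "$debug_view\n";
print "$safe_view\n";
```

The unescaped form is fine for warn() and for regexp matching, but only the escaped form is safe to write out as XML.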

Ordinarily, you tell get() what kind of object you want using an EventPath expression:

my $start_element = get "start_element::*";

You can also just get() whatever's next in the document or use a union expression. In this case, you may need to see what you've gotten. The isa() method (see below) and the isa() functions (see XML::Essex) should be used to figure out what type of object is being used before relying on the stringification:

get until isa "chars" and /Bond, James Bond/;
get until type eq "characters" and /Bond, James Bond/;
get until isa( "chars" ) && /Bond, James Bond/;
get "text()" until /Bond, James Bond/;

This makes it easier to match character data, but other methods should be used to select things like start tags and elements:

get "start_element::*" until $_->name eq "address" && $_->{id} eq $id;
get "start_element::address" until $_->{id} eq $id;

The lack of escaping only affects stringification of objects, for instance:

warn $_;  ## See what event is being dealt with right now
/Bond, James Bond/  ## Match current event

Data is escaped properly on output: put() emits properly escaped XML.

Some observations:

  • Stringifying an event does not produce a well formed chunk of XML. Events must be emitted through a downstream filter.

  • Events with no natural XML representation--like start_document--stringify as their name: "start_document()". If an event is not listed on this page, it stringifies this way.

  • Whitespace is inserted only where mandatory inside XML constructs, and is a single space. It is left unmolested in character data, comments, and processing instructions (other than <?xml ...?>, which is parsed by all XML parsers).

  • Attributes in start_element events are stringified in alphabetical order according to Perl's sort() function.

  • Processing instructions, including the <?xml...?> declaration, often have things that look like attributes but are not, so the items above about whitespace and attribute sort order do not apply. Actually, the <?xml ... ?> declaration is well defined and there will be only a single whitespace character, though the pseudo-attributes version, encoding and standalone will not be sorted.

  • No escapes are used. See above.

  • Character data is concatenated, including mixed data and CDATA, into single strings. CDATA sections are tracked and may be analyzed.

  • Namespaces are stringified according to any prefixes that have been registered; otherwise they stringify in James Clark notation ("{}foo"), except for the empty namespace URI, which always stringifies as "" (i.e. no prefix). See XML::Essex's Namespaces section for details.
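
Several of these rules can be seen together in a small sketch. The stringify_start_tag() helper below is hypothetical and only imitates the behavior described above (single spaces, attributes ordered by Perl's sort(), no escaping, James Clark notation for an unmapped namespace); the http://example.com/ns URI is made up for illustration.

```perl
use strict;
use warnings;

## Hypothetical helper, not part of Essex: render a start tag from a
## name and a hash of attributes following the rules listed above.
sub stringify_start_tag {
    my ( $name, $attrs ) = @_;
    return join "", "<", $name,
        ( map { qq{ $_="$attrs->{$_}"} } sort keys %$attrs ),
        ">";
}

my $tag = stringify_start_tag(
    "foo",
    {
        id                            => "10",
        class                         => "a<b",    ## "<" is left unescaped
        "{http://example.com/ns}lang" => "en",     ## no prefix registered
    },
);

print "$tag\n";
```

Because "{" sorts after the lowercase ASCII letters, the attribute in James Clark notation ends up last.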

Common Methods

All of the objects in the model provide the following methods. These methods are exported as functions from the XML::Essex module for convenience (those functions are wrappers around these methods).

isa

Returns TRUE if the object is of the type, abbreviated type, or class passed. So, for an object encapsulating a start_document event, returns TRUE for any of:

XML::Essex::Event             ## The base class for all events
XML::Essex::start_document    ## The actual class name
start_document                ## The event type
start_doc                     ## The event type, abbreviated

class

Returns the class name, such as XML::Essex::start_document.

type

Returns the type name, such as start_document.

types

Returns the class name, the type name and any abbreviations. The abbreviations are sorted from longest to shortest.
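
How these four methods relate can be sketched with a pair of hypothetical classes. My::Event and My::Event::start_document below are illustrative stand-ins, and the abbreviations() method is an assumption made for the sketch; Essex's real classes may be organized differently.

```perl
use strict;
use warnings;

package My::Event;    ## hypothetical base class, not Essex's own

sub new   { my $class = shift; return bless {}, $class }
sub class { return ref $_[0] }
sub type  { my $c = ref $_[0]; $c =~ s/.*:://; return $c }
sub abbreviations { return () }

## Class name, type name, then abbreviations, longest first.
sub types {
    my $self = shift;
    return ( $self->class, $self->type,
        sort { length( $b ) <=> length( $a ) } $self->abbreviations );
}

## Accept a class name, a type name, or an abbreviation.
sub isa {
    my ( $self, $what ) = @_;
    return 1 if $self->UNIVERSAL::isa( $what );
    return 1 if ref $self && grep { $_ eq $what } $self->types;
    return 0;
}

package My::Event::start_document;
our @ISA = ( "My::Event" );
sub abbreviations { return ( "start_doc" ) }

package main;

my $e = My::Event::start_document->new;
my @answers = map { $e->isa( $_ ) ? 1 : 0 }
    qw( My::Event My::Event::start_document start_document start_doc chars );

print "@answers\n";
```

The first four names are accepted and the last is rejected, mirroring the list given for isa above.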

start_document

aka: start_doc

my $e = start_doc \%values;    ## %values is not defined in SAX1/SAX2

Stringifies as: start_document($reserved)

where $reserved is a character string that may someday include info passed in the start_document event, probably formatted as attributes.

xml_decl

aka: (no abbreviations)

my $e = xml_decl;

my $e = xml_decl
    Version    => "1",
    Encoding   => "UTF-8",
    Standalone => "yes";

my $e = xml_decl {
    Version    => "1",
    Encoding   => "UTF-8",
    Standalone => "yes"
};

Stringifies as: <?xml version="$version" encoding="$enc" standalone="$yes_or_no"?>

Note that this does not follow the sorted attribute order of start_element: as with processing instructions that have pseudo-attributes, the apparent attributes here are not real attributes.

end_document

aka: end_doc

my $e = end_doc \%values;    ## %values is not defined in SAX1/SAX2

Stringifies as: end_document($reserved)

where $reserved is a character string that may someday include info passed in the end_document event, probably formatted as attributes.

start_element

aka: start_elt

my $e = start_elt foo => { attr => "val"  };
my $e = start_elt $start_elt;  ## Copy constructor
my $e = start_elt $end_elt;    ## end_elt deconstructor
my $e = start_elt $elt;        ## elt deconstructor

Stringifies as: <foo attr1="$val1" attr2="$val2">

The element name and any attribute names are prefixed according to the namespace mappings registered in the Essex processor; the prefixes they had in the source document are ignored. If no prefix has been mapped, jclark notation ({http:...}foo) is used. The attributes are then sorted according to Perl's sort() function, so jclarked attribute names come last, as it happens.

TODO: Support attribute ordering via consecutive {...} sets.

Attributes may be accessed using hash dereferences:

get "start_element::*" until $_->{id} eq "10";  ## No namespace prefix
get "start_element::*" until $_->{"{}id"} eq "10";
get "start_element::*" until $_->{"{http://foo/}id"} eq "10";
get "start_element::*" until $_->{"foo:id"} eq "10";

and the attribute names may be obtained by:

keys %$_;

Keys are returned in no predictable order. See Namespaces for details on the three formats in which keys may be returned.

Methods

name

Returns the name of the node according to the namespace stringification rules.

jclark_name

Returns the name of the node in James Clark notation.

jclark_keys

my @keys = $e->jclark_keys;

Returns a list of attribute names in jclark notation ("{...}name").

attribute

aka: attr

my $name_attr = $start_elt->{name};
my $attr      = attr $name;
my $attr      = attr $name => $value;
my $attr      = attr {
    LocalName    => $local_name,
    NamespaceURI => $ns_uri,
    Value        => $value,
};

Stringifies as its value: harvey

This is not a SAX event, but an object returned from within element or start_element objects that gives you access to the NamespaceURI, LocalName, and Value fields of the attribute. It does not give access to the Name or Prefix fields present in SAX events.

If you create an attribute with an explicitly undefined value, it will stringify as the undefined value. Attributes that are created without an explicit Value field are given the default value of "", including attributes that are autovivified. This allows

get "*" until $_->{id} eq "10";

to work. This has the side effect of adding an id="" attribute to all elements without an id attribute. To avoid the side effect, use the exists function to detect nonexistent attributes:

get "*" until exists $_->{id} and $_->{id} eq "10";
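
The side effect can be imitated with a tied hash. This is only a sketch: My::Attrs is a hypothetical class whose FETCH stores an empty string for a missing key, which is one way such autovivification could behave, not Essex's actual implementation.

```perl
use strict;
use warnings;

## Hypothetical tied hash, not Essex code: fetching a missing key
## stores and returns "", imitating autovivified attributes.
package My::Attrs;
require Tie::Hash;
our @ISA = ( "Tie::StdHash" );

sub FETCH {
    my ( $self, $key ) = @_;
    $self->{$key} = "" unless exists $self->{$key};
    return $self->{$key};
}

package main;

tie my %careless, "My::Attrs";
tie my %careful,  "My::Attrs";

## A plain comparison creates id="" as a side effect.
my $hit1 = $careless{id} eq "10";

## Guarding with exists never reaches FETCH, so nothing is created.
my $hit2 = exists $careful{id} && $careful{id} eq "10";

print join( ",", keys %careless ), "|", join( ",", keys %careful ), "\n";
```

Neither test matches, but only the unguarded one leaves an id key behind.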

end_element

aka: end_elt

my $e = end_element "foo";
my $e = end_element $start_elt;
my $e = end_element $end_elt;
my $e = end_element $elt;

Stringifies as: </foo>

See start_element for details on namespace handling.

element

aka: elt

my $e = elt foo => "content", $other_elt, "more content", $pi, ...;
my $e = elt foo => { attr1 => "val1" }, "content", ...;

Stringifies as: <foo attr1="val1">content</foo>

Never stringifies as an empty element tag (<foo/>), although downstream filters and handlers may choose to do that.

Constructs an element. An element is a sequence of events between a matching start_element and end_element, inclusive.

Attributes may be accessed using Perl hash dereferencing, as with start_element events; see "start_element" for details.

Content may be accessed using Perl array dereferencing:

my @content = @$_;
unshift @$_, "prefixed content";
push    @$_, "appended content";

Note that

my $elt2 = elt $elt1;   ## doesn't copy content, just name+attrs

only copies the name and attributes; it does not copy the content. To copy content, do either of:

my $elt2 = elt $elt1, @$elt1;
my $elt2 = $elt1->clone;

This is because the first parameter is converted to a start/end_element pair and any content is ignored. This is so that:

my $elt2 = elt $elt1, "new content";

creates an element with the indicated content.

Methods

jclark_keys

Returns the names of attributes as a list of JamesClarkified keys, just like start_element's jclark_keys().

name

Returns the name of the node according to the namespace stringification rules.

jclark_name

Returns the name of the node in James Clark notation.

characters

aka: chars

my $e = chars "A stitch", " in time", " saves nine";
my $e = chars {
    Data => "A stitch in time saves nine",
};

Stringifies like a string: A stitch in time saves nine.

Character events are aggregated.

TODO: make that aggregation happen.

comment

aka: (no abbreviation)

my $e = comment "A stitch in time saves nine";
my $e = comment {
    Data => "A stitch in time saves nine",
};

Stringifies like a string: A stitch in time saves nine.

Implementation Details

References and blessed, tied or overloaded SAX events.

Instances of the Essex object model classes carry a reference to the original data (SAX events), rather than copying it. This means that there are fewer copies (a good thing, though there is an increased cost of getting at any data in the events) and that upstream filters may send blessed, tied, or overloaded objects to us and they will not be molested unless the Essex filter messes with them. There is also an implementation reason for this: it makes overloading hash accesses like $_->{} easier to implement.

Passing an Essex event to a constructor for a new Essex event does result in a deep copy of the referenced data (via XML::Essex::Event::clone()).
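
The difference between holding a reference and making a deep copy can be sketched in plain Perl. The event hash below is illustrative data, the wrapper is a bare hash rather than a real Essex object, and Storable's dclone() is used here only as a stand-in for whatever XML::Essex::Event::clone() does internally.

```perl
use strict;
use warnings;
use Storable qw( dclone );

## A SAX-style start_element event as a plain hash (illustrative data).
my $sax_event = {
    Name       => "foo",
    Attributes => { "{}id" => { Value => "10" } },
};

## Wrapping keeps a reference: a change made through the wrapper
## is visible in the original event.
my $wrapper = { event => $sax_event };
$wrapper->{event}{Name} = "bar";

## Copying makes an independent deep copy: changing the copy
## leaves the original event alone.
my $copy = dclone( $sax_event );
$copy->{Attributes}{"{}id"}{Value} = "11";

printf "%s %s\n", $sax_event->{Name},
    $sax_event->{Attributes}{"{}id"}{Value};
```

The original event sees the rename made through the wrapper but not the attribute change made on the copy.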

Class files, or the lack thereof

The objects in the Essex object model are not available independently as class files. You must use XML::Essex::Model to get at them. This is because there is a set of event types used in almost all SAX filters and it is cheaper to compile one file containing these than to open multiple files.

This does not mean that all classes are loaded when XML::Essex::Model is use()ed or require()ed; rare events are likely to be autoloaded.

Class names

In order to allow

my $e = XML::Essex::start_elt( ... );

to work as expected--in case the calling package prefers not to import start_elt(), for instance--the objects in the model are all in the XML::Essex::Event::... namespace, like XML::Essex::Event::start_element.

TODO

Allow escaping to be configured.
Allow " vs. ' for attr quotes to be configured.
Allow CDATA to be tested for, either by stringifying it or by allowing it to be returned as an array or something.

COPYRIGHT

Copyright 2002, R. Barrie Slaymaker, Jr., All Rights Reserved

LICENSE

You may use this module under the terms of the BSD, Artistic, or GPL licenses, any version.

AUTHOR

Barrie Slaymaker <barries@slaysys.com>