NAME

YAX::Parser - fast pure Perl tree and stream parser

SYNOPSIS

use YAX::Parser;

my $xml_str = <<XML;
  <?xml version="1.0" ?>
  <doc>
    <content id="42"><![CDATA[
       This is a cdata section, so >>anything goes!<<
    ]]>
    </content>
    <!-- comments are nodes too -->
  </doc>
XML

# tree parse - the common case
my $xml_doc = YAX::Parser->parse( $xml_str );
$xml_doc = YAX::Parser->parse_file( $path );

# shallow parse
my @tokens = YAX::Parser->tokenize( $xml_str );

# stream parse
YAX::Parser->stream( $xml_str, $state, %handlers );
YAX::Parser->stream_file( '/some/file.xml', $state, %handlers );

DESCRIPTION

This module implements a fast DOM and stream parser based on Robert D. Cameron's regular expression shallow parsing grammar and technique. By design, it does not implement the full W3C DOM API and takes a more pragmatic approach instead: DOM trees are built with every node as an object, except for attributes, which are stored as a hash reference.

We also borrow some ideas from browser implementations. In particular, nodes that have an id attribute are keyed on it in a table in the document, so you can say:

my $found = $xml_doc->get( $node_id );
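
The id table idea can be illustrated with a minimal pure Perl sketch. This is not YAX's actual implementation: the %doc structure and the add_node and get_by_id helpers are hypothetical names used only to show how registering nodes by id at construction time makes later lookups a single hash access.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a document: a node list plus an id lookup table.
my %doc = ( nodes => [], ids => {} );

# Create a node and register it under its id attribute, if it has one.
sub add_node {
    my ( $doc, $name, %attributes ) = @_;
    my $node = { name => $name, attributes => { %attributes } };
    push @{ $doc->{nodes} }, $node;
    $doc->{ids}{ $attributes{id} } = $node if defined $attributes{id};
    return $node;
}

# Look a node up by id; returns undef when no node carries that id.
sub get_by_id {
    my ( $doc, $id ) = @_;
    return $doc->{ids}{$id};
}

add_node( \%doc, 'doc' );
add_node( \%doc, 'content', id => '42' );

my $found = get_by_id( \%doc, '42' );
print "found <$found->{name}>\n" if $found;
```

The lookup cost does not depend on the size of the tree, which is presumably why a document-level table is attractive for id access.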

Parsing is usually done by calling class methods on YAX::Parser. When invoked as a tree parser, it returns an instance of YAX::Document:

my $xml_doc = YAX::Parser->parse( $xml_str );

METHODS

See the "SYNOPSIS" for usage examples; here is just the list for now:

parse( $xml_str )

Parse $xml_str and return a YAX::Document object.

parse_file( $path )

Same as above, but reads the file at $path for the input.

stream( $xml_str, $state, %handlers )

Although not its main focus, YAX::Parser also provides for stream parsing. It tries to be a bit more sane than Expat, in that it allows you to specify a state holder which can be anything and is passed as the first argument to the handler functions. A typical case is to use a hash reference with a stack (for tracking nesting):

my $state = { stack => [ ] };

All handler functions are optional. The full list is:

my %handlers = (
    text => \&handle_text,          # called for text nodes
    elmt => \&handle_element_open,  # called for open tags
    elcl => \&handle_element_close, # called for tag close
    decl => \&handle_declaration,   # called for declarations
    proc => \&handle_proc_inst,     # called for processing instructions
    pass => \&handle_passthrough,   # called when no handlers match
);

An element handler is passed the state, the tag name and the attributes hash:

sub handle_element_open {
    my ( $state, $name, %attributes ) = @_;
    if ( $name eq 'a' and $attributes{href} ) {
        ... 
    }
}

Element close handlers take two arguments, the state and the tag name:

sub handle_element_close {
    my ( $state, $name ) = @_;
    # an empty stack means a close tag without a matching open tag
    die "not well formed" unless ( pop @{ $state->{stack} } // '' ) eq $name;
}
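
As a sketch of how the open and close handlers can cooperate through the state holder, the following collects href attributes while checking nesting with the stack. The links key in the state is an assumption made for this example. So that the sketch runs without the module installed, the handlers are called by hand in the argument shapes described above; the call that would normally drive them is shown in a comment.

```perl
use strict;
use warnings;

# State holder: a nesting stack plus a place to collect results.
# The 'links' key is made up for this example.
my $state = { stack => [], links => [] };

my %handlers = (
    # open tag: remember the nesting and record any href on <a>
    elmt => sub {
        my ( $state, $name, %attributes ) = @_;
        push @{ $state->{stack} }, $name;
        push @{ $state->{links} }, $attributes{href}
            if $name eq 'a' and defined $attributes{href};
    },
    # close tag: it must match the most recently opened element
    elcl => sub {
        my ( $state, $name ) = @_;
        my $open = pop @{ $state->{stack} };
        die "not well formed: unexpected </$name>\n"
            unless defined $open and $open eq $name;
    },
);

# With YAX::Parser available, the handlers would be driven by:
#   YAX::Parser->stream( $xml_str, $state, %handlers );
# Here the calls for <p><a href="/x"></a></p> are simulated by hand.
$handlers{elmt}->( $state, 'p' );
$handlers{elmt}->( $state, 'a', href => '/x' );
$handlers{elcl}->( $state, 'a' );
$handlers{elcl}->( $state, 'p' );

print "links: @{ $state->{links} }\n";
print "open elements left: ", scalar @{ $state->{stack} }, "\n";
```

Keeping all mutable data in the state holder, rather than in globals, lets the same handlers be reused across several parses by passing a fresh state each time.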

All other handlers take the state and the entire matched token:

sub handle_proc_inst {
    my ( $state, $token ) = @_;
    # /s lets the instruction span lines; skip tokens that do not match
    my ( $instr ) = $token =~ /^<\?(.*?)\?>$/s or return;
    ...
}

stream_file( $path, $state, %handlers )

Same as above, but reads the file at $path for the input.

tokenize( $xml_str )

Useful for quick and dirty tokenizing of $xml_str. Returns a list of tokens.
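
To give a feel for what shallow parsing means, here is a deliberately simplified pure Perl tokenizer. It is not Cameron's full grammar and not the pattern YAX::Parser uses, and the exact shape of the tokens returned by tokenize() may differ; $token_re and shallow_tokenize are names made up for this sketch. The idea is that a single regular expression, applied repeatedly, splits the input into markup and text tokens without building a tree.

```perl
use strict;
use warnings;

# Simplified shallow-parsing pattern: each alternative matches one token.
# Order matters: the more specific "<!" forms come before the generic one.
my $token_re = qr{
      <!--.*?-->               # comment
    | <!\[CDATA\[.*?\]\]>      # CDATA section
    | <\?.*?\?>                # processing instruction or XML declaration
    | <![^>]*>                 # other declaration, simple cases only
    | </?[^<>]+>               # open, close, or empty-element tag
    | [^<]+                    # text
}xs;

# Return the list of tokens found in the input string.
sub shallow_tokenize {
    my ( $xml_str ) = @_;
    return $xml_str =~ /($token_re)/g;
}

my @tokens = shallow_tokenize( '<doc><a id="1">hi</a><!-- note --></doc>' );
print scalar(@tokens), " tokens\n";
print "[$_]\n" for @tokens;
```

This sketch ignores cases a real grammar must handle, such as a ">" inside an attribute value or a DOCTYPE with an internal subset.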

SEE ALSO

YAX::Document, YAX::Node

LICENSE

This program is free software and may be modified and distributed under the same terms as Perl itself.

AUTHOR

Richard Hundt