NAME

XML::TiePYX - Read or write XML data in PYX format via tied filehandle

SYNOPSIS

use XML::TiePYX;

tie *XML,'XML::TiePYX','file.xml';

open IN,'file.xml' or die $!;
tie *XML,'XML::TiePYX',\*IN,Condense=>0;

my $text='<tag xmlns="http://www.omsdev.com">text</tag>';
tie *XML,'XML::TiePYX',\$text,Namespaces=>1;

tie *XML,'XML::TiePYX',\*STDOUT;
print XML "(start\n","-Hello, world!\n",")start\n";

DESCRIPTION

XML::TiePYX lets you use a tied filehandle to read from or write to an XML file or string. PYX is a line-oriented, parsed representation of XML developed by Sean McGrath (http://www.pyxie.org). Each line corresponds to one "event" in the XML, with the first character indicating the type of event:

(

The start of an element; the rest of the line is its name.

A

An attribute; the rest of the line is the attribute's name, a space, and its value.

)

The end of an element; the rest of the line is its name.

-

Literal text (characters). The rest of the line is the text.

?

A processing instruction. The rest of the line is the instruction's target, a space, and the instruction's value.

Newlines in attribute values, text, and processing instruction values are represented as the literal sequence '\n' (that is, a backslash followed by an 'n'). By default, consecutive runs of characters are gathered into a single text event when reading, but this behavior can be disabled. Comments are *not* available through PYX.
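The event types above can be decoded with a few lines of plain Perl. The following is a minimal sketch, not part of this module: parse_pyx_line is a hypothetical helper name, and it assumes the only escape that needs undoing is the '\n' sequence described above.

```perl
use strict;
use warnings;

# Hypothetical helper (not provided by XML::TiePYX): split one PYX line
# into a hash with its event type and payload.
sub parse_pyx_line {
    my ($line) = @_;
    chomp $line;
    my ($type, $rest) = $line =~ /^([(A)\-?])(.*)$/s
        or die "not a PYX line: $line";
    my %event = (type => $type);
    if ($type eq '(' or $type eq ')') {
        # Start or end of an element: the rest of the line is its name.
        $event{name} = $rest;
    }
    elsif ($type eq 'A' or $type eq '?') {
        # Attribute or processing instruction: name (or target), a space, value.
        @event{qw(name value)} = split / /, $rest, 2;
        $event{value} = '' unless defined $event{value};
        $event{value} =~ s/\\n/\n/g;    # assumption: '\n' is the only escape
    }
    else {
        # Character data: the rest of the line is the text.
        ($event{value} = $rest) =~ s/\\n/\n/g;
    }
    return \%event;
}
```

Each line read from the tied filehandle could be passed to such a helper to get a structured event instead of matching on the raw line.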

Just as SAX is an API well suited to "push"-mode XML parsing, PYX is well suited to "pull"-mode parsing where you want to capture the state of the parse through your program's flow of code rather than through a bunch of state variables. This module uses incremental parsing to avoid the need to buffer up large numbers of events.

This module implements an (unofficial) extension to the PYX format to allow namespace processing. If namespaces are enabled, an element or attribute name will be prefixed by its namespace URI (*NOT* any namespace prefix used in the document) enclosed in curly braces. A name with no namespace will be prefixed with {}. At the present time, this module does not implement namespace processing in output mode; attempting to write '(', ')', or 'A' lines that contain a namespace URI in curly braces will merely result in generating ill-formed element or attribute names.
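When namespaces are enabled, a name read from the stream has to be taken apart again before it can be compared with a plain element name. The following is a minimal sketch of that step; split_ns_name is a hypothetical helper, not something this module provides, and it assumes the URI itself contains no closing curly brace.

```perl
use strict;
use warnings;

# Hypothetical helper: split a "{uri}local" name, as produced when
# namespaces are enabled, into its namespace URI and local name.
# A "{}" prefix yields an empty-string URI; a name with no curly-brace
# prefix at all yields undef for the URI.
sub split_ns_name {
    my ($name) = @_;
    return ($1, $2) if $name =~ /^\{([^}]*)\}(.*)$/s;
    return (undef, $name);
}
```

For a line such as "({http://www.omsdev.com}tag", stripping the leading "(" and passing the rest to split_ns_name would give the URI and the local name "tag" separately.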

INTERFACE

tie *tied_handle, 'XML::TiePYX', source, [Option=>value,...]

tied_handle is the filehandle which the PYX events will be read from or written to.

source is either a reference to a string containing the XML, the name of a file containing the XML, or an open IO::Handle or filehandle glob reference from which the XML can be read or to which it can be written.

The Options can be any options allowed by XML::Parser and XML::Parser::Expat, as well as four module-specific options:

Validating

This will provide a validating parse by using XML::Checker::Parser in place of XML::Parser if set to a true value.

Condense

Causes all consecutive runs of character data to be gathered up into a single PYX event if set to a true value (the default). If set false, multiple consecutive character data events may occur in the stream (which may be desirable when dealing with large chunks of text). This option has no effect when writing.

Latin

If set to a true value, causes Unicode characters in the range 128-255 to be returned as ISO-Latin-1 characters rather than UTF-8 characters when reading, and an XML declaration specifying an encoding of "ISO-8859-1" to be output when writing.

Catalog

Specifies the URL of a catalog to use for resolving public identifiers and remapping system identifiers used in document type declarations or external entity references. This option requires XML::Catalog to be installed.
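To illustrate what the Condense option described above means for the event stream, here is a rough sketch that merges adjacent character data lines by hand. It is only a model of the documented behavior, not the module's own code, and condense_text is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical illustration of condensing: merge each run of consecutive
# "-" (character data) lines into a single "-" line.
sub condense_text {
    my @out;
    for my $line (@_) {
        if ($line =~ /^-/ and @out and $out[-1] =~ /^-/) {
            chomp $out[-1];                  # drop the previous line ending
            $out[-1] .= substr($line, 1);    # append the text after the "-"
        }
        else {
            push @out, $line;
        }
    }
    return @out;
}
```

With Condense left at its default, code reading from the tied filehandle should not need a step like this; it is only relevant to a reader that sets Condense to a false value and still wants whole runs of text.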

The tied filehandle may be read from with either the diamond operator (<HANDLE>), getc(), or read(). The diamond operator always returns a line at a time regardless of the setting of $/. It may be written to with print() or printf(); it is necessary to print one or more complete PYX lines at a time. This module does not support read/write mode.
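Because output must be printed as complete PYX lines, it can help to build the lines for an element in one place. The following is a minimal sketch under stated assumptions: pyx_element is a hypothetical helper, it assumes attribute lines directly follow the start line of their element, and it escapes only newlines, as described in the DESCRIPTION.

```perl
use strict;
use warnings;

# Hypothetical helper: build the complete PYX lines for one element with
# optional attributes and optional text content.
sub pyx_element {
    my ($name, $attrs, $text) = @_;
    my @lines = ("($name\n");
    for my $attr (sort keys %{ $attrs || {} }) {
        (my $value = $attrs->{$attr}) =~ s/\n/\\n/g;    # newline -> '\n'
        push @lines, "A$attr $value\n";
    }
    if (defined $text) {
        (my $escaped = $text) =~ s/\n/\\n/g;
        push @lines, "-$escaped\n";
    }
    push @lines, ")$name\n";
    return @lines;
}

# With a handle tied for output as in the SYNOPSIS
# (tie *XML,'XML::TiePYX',\*STDOUT;), the lines could be written with:
#   print XML pyx_element('para', { id => 'p1' }, "Hello,\nworld!");
```

Every string returned is a whole line ending in a newline, so any number of them can be passed to a single print().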

EXAMPLE

This program (psectp.plx in the distribution) prints a numbered outline from an XML file in which an <outline> can contain zero or more <sect>s, each with a title attribute, and each <sect> can contain zero or more nested <sect>s or <para>s containing text, as in the sects.otl file included with the distribution. The -c option makes it print just a table of contents.

This is actually a traditional recursive-descent parser using PYX events as tokens.

#!/usr/bin/perl -w

use strict;
use XML::TiePYX;
use Text::Wrap;
use Getopt::Std;

my (@sectnums,%opts);

getopts('c',\%opts);

die "usage: psect [-c] file\n" unless @ARGV==1;

tie *XML,'XML::TiePYX',$ARGV[0];
die "illegal structure" unless get_event() =~ /^\(outline/;
push @sectnums,0;
print_sect() while get_event() =~ /^\(sect/;
die unless /^\)outline/;
close XML;

sub print_sect {
  <XML>=~/^Atitle (.*)/ or die "missing title";
  ++$sectnums[-1];
  print ' ' x (4*$#sectnums),join('.',@sectnums)," $1\n";
  print "\n" unless $opts{c};
  push @sectnums,0;
  while (get_event() !~ /^\)sect/) {
    /^\(sect/ and print_sect(),next;
    /^\(para/ and print_para(),next;
    die "illegal structure";
  }
  pop @sectnums;
}

sub print_para {
  die "illegal structure" unless <XML> =~ /^-(.*)/;
  $_=$1;
  s/\\n/ /g;
  s/^\s+//;
  s/\s+$//;
  print wrap((' ' x (4*($#sectnums-1))) x 2,$_),"\n\n" unless $opts{c};
  die "illegal structure" unless <XML> =~ /^\)para/;
}

sub get_event {
  $_=<XML>;
  $_=<XML> if /^-(\s|\\n)*$/;
  $_;
}

RATIONALE

There's already an XML::PYX module (written by Matt Sergeant) available, so why another PYX implementation? Mainly because XML::PYX is intended to be used in a standalone PYX-outputting program which you open as a pipe. That works very well under Unix, aside from the overhead of forking a separate process, but is problematic on Win32 systems for a variety of niggling reasons: the standalone script is supplied as a batch file, whose output can't be properly redirected into a pipe unless you invoke it as 'perl /perl/bin/pyx|' instead of just 'pyx|'. Both Win95 and Win98, as well as possibly other Win32 systems, implement pipes using temporary files and the reading process can't start reading until the writing process is done writing, which means that if you're parsing a huge file you may have to wait a long time before getting *any* output. The ability to guarantee a single character data event for any run of characters can often simplify processing. And finally, when I wrote this the only supported namespace-aware way to parse XML was the raw handlers interface of XML::Parser, which is needlessly complicated for simple applications (there are, of course, those who would argue that "simple applications" and "namespace-aware" are mutually exclusive categories).

BUGS

The Validating option does not work correctly, as XML::Checker::Parser does not implement the parse_start() method.

Error handling leaves much to be desired.

AUTHOR

Eric Bohlman (ebohlman@netcom.com, ebohlman@omsdev.com)

COPYRIGHT

Copyright 2000 Eric Bohlman. All rights reserved.

This program is free software; you can use/modify/redistribute it under the same terms as Perl itself.

SEE ALSO

XML::PYX
XML::Parser
XML::Parser::Expat
XML::Checker
XML::Catalog
perl(1).