XML::DT - a Perl XML down translate module
With XML::DT, I think that:
. it is simple to do simple XML processing tasks :)
. it is simple to have the XML processor stored in a single variable
(see example 4)
. it is simple to translate XML into a user-controlled complex Perl structure
  with a compact "-type" definition (see last section)
Feedback welcome -> jj@di.uminho.pt
This document is also available in HTML (pod2html'ized): http://www.di.uminho.pt/~jj/perl/XML/XML-DT.readme.html
. designed for simple and compact translation/processing of XML documents
. it includes some features of OmniMark and sgmls.pm; functional approach
. it includes functions to automatically build user-controlled complex Perl
  structures (see the "working with structures" section)
. it was built to show my NLP Perl students that it is easy to work with XML
. home page and download: http://www.di.uminho.pt/~jj/perl/XML/DT.html
HOW IT WORKS:
. the user must define a handler and call one of the basic functions:
    dt($filename,%handler) or dtstring($string,%handler)
. the handler is a HASH mapping element names to functions. Handlers can
  also have a "-default" function and a "-end" function
. to keep the functions small, each one receives its 3 arguments as global
  variables:
    $c - contents
    $q - element name
    %v - attribute values
. the default "-default" function is the identity. The function "toxml"
  rebuilds the original XML text from the $c, $q and %v values.
. see some advanced features in the last examples
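As a minimal sketch of the points above (assuming XML::DT is installed; the inline document and the variable name $out are made up for illustration), a handler can be applied to a string with dtstring. Only element "e" gets its own function; every other element falls back to the default "-default" function:

```perl
use strict;
use warnings;
use XML::DT;

# Illustrative input (not part of the distribution).
my $xml = '<doc><e a="ABC" b="x">some text</e></doc>';

my %handler = (
    # Called for each <e> element; $c, $q and %v are set by XML::DT.
    e => sub {
        $v{a} = lc $v{a};   # change one attribute value
        toxml();            # rebuild the element from $c, $q and %v
    },
);

my $out = dtstring($xml, %handler);
print $out, "\n";
```

Because the function changes %v and then calls toxml, the other attributes of "e" (here "b") should be kept without being named in the handler.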
SOME simple (naive) examples:
INDEX:
1. change to lowercase the contents of the attribute named "a" in element "e"
2. a better solution of the previous example
3. make some statistics and output results in HTML (using side effects)
4. in an HTML-like XML document, replace the body of the <contents> element
   with the real table of contents (a dirty solution...)
5. a more realistic example: from XML gcapaper DTD to LaTeX
WORKING WITH STRUCTURES INSTEAD OF STRINGS...
6. build the natural Perl structure of a document (ARRAY, HASH)
7. Christmas card... (multi map on)
1. change to lowercase the contents of the attribute named "a" in element "e"
    use XML::DT;
    my $filename = shift;
    print dt($filename,
        ( e => sub{ "<e a='" . lc($v{a}) . "'>$c</e>" } ));
2. A better solution of the previous example
Ex. 1 drops every attribute of element e other than "a". A better solution
is to change $v{a} and let toxml rebuild the element:
    print dt($filename,
        ( e => sub{ $v{a} = lc($v{a});
                    toxml(); } ));
3. make some statistics and output results in HTML (using side effects)
    use XML::DT;
    my $filename = shift;
    %handler = ( -default => sub{ $elem_counter++;
                                  $elem_table{$q}++; ""; }   # $q -> element name
               );
    dt($filename,%handler);
    print "<H3>We have found $elem_counter elements in document</H3>";
    print "<TABLE><TH>ELEMENT<TH>OCCURS\n";
    foreach $elem (sort keys %elem_table)
       { print "<TR><TD>$elem<TD>$elem_table{$elem}\n"; }
    print "</TABLE>";
4. In an HTML-like XML document, replace the body of the <contents> element with the real table of contents (a dirty solution...)
    %handler = ( h1       => sub{ $index .= "\n$c";     toxml(); },
                 h2       => sub{ $index .= "\n\t$c";   toxml(); },
                 h3       => sub{ $index .= "\n\t\t$c"; toxml(); },
                 contents => sub{ $c = "__CLEAN__"; toxml(); },   # placeholder
                 -end     => sub{ $c =~ s/__CLEAN__/$index/; $c } );
    print dt($filename,%handler);
5. a more realistic example: from XML gcapaper DTD to latex
notes:
. "TITLE" is processed in a context-dependent way!
. output in ISO Latin-1 (this is dirty but my LaTeX doesn't support Unicode)
. a stack of authors was necessary because the LaTeX structure is different
  from the input structure...
. this example was partially created by the function mkdtskel
    perl -MXML::DT -e 'mkdtskel "f.xml"' > f.pl
  and took me about one hour to tune for a real LaTeX/XML example.
NAME gcapaper2tex.pl - a Perl script to translate XML gcapaper DTD to LaTeX
SYNOPSIS gcapaper2tex.pl mypaper.xml > mypaper.tex
    use XML::DT;
    my $filename = shift;
    my $beginLatex = '\documentclass{article} \begin{document} ';
    my $endLatex = '\end{document}';
    my @aut;    # stack of authors: filled by AUTHOR, emitted by ABSTRACT
    %handler=(
        '-outputenc' => 'ISO-8859-1',
        '-default'   => sub{"$c"},
        'RANDLIST'   => sub{"\\begin{itemize}$c\\end{itemize}"},
        'AFFIL'      => sub{""},    # delete affiliation
        'TITLE'      => sub{
              if(inctxt('SECTION')){"\\section{$c}"}
              elsif(inctxt('SUBSEC1')){"\\subsection{$c}"}
              else {"\\title{$c}"}
           },
        'GCAPAPER'   => sub{"$beginLatex $c $endLatex"},
        'PARA'       => sub{"$c\n\n"},
        'ADDRESS'    => sub{"\\thanks{$c}"},
        'PUB'        => sub{"} $c"},    # closes the brace opened by BIBITEM
        'EMAIL'      => sub{"(\\texttt{$c}) "},
        'FRONT'      => sub{"$c\n"},
        'AUTHOR'     => sub{ push @aut, $c ; ""},
        'ABSTRACT'   => sub{
              sprintf('\author{%s}\maketitle\begin{abstract}%s\end{abstract}',
                  join (' \and ', @aut) ,    # spaces keep \and a separate command
                  $c) },
        'CODE.BLOCK' => sub{"\\begin{verbatim}\n$c\\end{verbatim}\n"},
        'XREF'       => sub{"\\cite{$v{REFLOC}}"},
        'LI'         => sub{"\\item $c"},
        'BIBLIOG'    => sub{"\\begin{thebibliography}{1}$c\\end{thebibliography}\n"},
        'HIGHLIGHT'  => sub{" \\emph{$c} "},
        'BIO'        => sub{""},    # delete biography
        'SURNAME'    => sub{" $c "},
        'CODE'       => sub{"\\verb!$c!"},
        'BIBITEM'    => sub{"\n\\bibitem{$c"},
    );
    print dt($filename,%handler);
WORKING WITH STRUCTURES INSTEAD OF STRINGS...
The "-type" definition defines the way to build structures in each case:
. "HASH" or "MAP" -> makes a hash with the sub-elements;
  keys are the sub-element names; warns on repetitions;
  returns the hash reference.
. "ARRAY" or "SEQ" -> makes an array with the sub-elements;
  returns an array reference.
. "MULTIMAP" -> makes a hash of arrays; keys are the sub-element names
. MMAPON(name1, ...) -> similar to HASH but accepts repetitions of
  the sub-elements "name1"... (and makes an array with them)
. STR -> (DEFAULT) concatenates the values returned by all the sub-elements;
  every sub-element should return a string to be concatenated
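The following is a small sketch of a "-type" definition applied to an in-memory string (assuming XML::DT is installed; the document, the element names "person" and "tel", and the variable names are invented for illustration). It uses MMAPON so that the repeated sub-element ends up in an array while the single one stays a plain value:

```perl
use strict;
use warnings;
use XML::DT;

# Illustrative document: "tel" repeats, "name" occurs once.
my $xml = '<person><name>Ana</name><tel>1111</tel><tel>1112</tel></person>';

my %handler = (
    '-default' => sub { $c },                    # leaf elements return their text
    '-type'    => { person => MMAPON('tel') },   # "tel" may repeat
);

# $person should be a hash reference built according to "-type".
my $person = dtstring($xml, %handler);

print "name: $person->{name}\n";
print "tels: @{ $person->{tel} }\n";
```

With 'HASH' in place of MMAPON('tel'), the second "tel" would be reported as a repetition, as described in the list above.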
6. Build the natural Perl structure of the following document
    <institution>
      <id>U.M.</id>
      <name>University of Minho</name>
      <tels>
        <item>1111</item>
        <item>1112</item>
        <item>1113</item>
      </tels>
      <where>Portugal</where>
      <contacts>J.Joao; J.Rocha; J.Ramalho</contacts>
    </institution>

    use XML::DT;
    %handler = ( -default => sub{ $c },
                 -type    => { institution => 'HASH',
                               tels        => 'ARRAY' },
                 contacts => sub{ [ split(";",$c) ] },
               );
    $a = dt("ex10.2.xml", %handler);
$a is a reference to a HASH:
    { 'tels'     => [ 1111, 1112, 1113 ],
      'name'     => 'University of Minho',
      'where'    => 'Portugal',
      'id'       => 'U.M.',
      'contacts' => [ 'J.Joao', ' J.Rocha', ' J.Ramalho' ] };
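Note the leading spaces in the last two contacts: split(";",$c) cuts only at the semicolon. If that is not wanted, one possible variant (plain Perl, independent of XML::DT; the variable names are illustrative) lets the separator pattern absorb the surrounding whitespace:

```perl
use strict;
use warnings;

# Content of the <contacts> element in the document above.
my $c = 'J.Joao; J.Rocha; J.Ramalho';

# Split on ";" together with any whitespace around it.
my @contacts = split /\s*;\s*/, $c;

print join('|', @contacts), "\n";   # prints J.Joao|J.Rocha|J.Ramalho
```

In the handler this would read contacts => sub{ [ split /\s*;\s*/, $c ] }.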
7. Christmas card...
We have the following address book:
    <people>
      <person>
        <name> name0 </name>
        <address> address00 </address>
        <address> address01 </address>
      </person>
      <person>
        <name> name1 </name>
        <address> address10 </address>
        <address> address11 </address>
      </person>
    </people>
Now we are going to build a structure to store the address book and write a Christmas card to the first address of everyone:
    #!/usr/bin/perl
    use XML::DT;
    %handler = ( -default => sub{ $c },
                 person   => sub{ mkchristmascard($c); $c },
                 -type    => { people => 'ARRAY',
                               person => MMAPON('address') } );
    $people = dt("ex11.1.xml", %handler);
    print $people->[0]{address}[1];    # prints address01

    # Send one card to the printer; $x is the hash built for one <person>.
    sub mkchristmascard {
        my $x = shift;
        open(my $lpr, '|-', 'lpr') or die "cannot run lpr: $!";
        print $lpr "$x->{name}\n",
                   "$x->{address}[0]\n",
                   "Dear $x->{name}\n",
                   "Merry Christmas from Braga Perl mongers\n\n";
        close $lpr;
    }