XML::DT - a Perl XML down translate module

With XML::DT, I think that:

. it is simple to do simple XML processing tasks :)
. it is simple to keep the whole XML processor in a single variable
    (see example 4)
. it is simple to translate XML into a user-controlled complex Perl structure
    with a compact "-type" definition (see the last section)

Feedback welcome -> jj@di.uminho.pt

This document is also available in HTML (pod2html'ized): http://www.di.uminho.pt/~jj/perl/XML/XML-DT.readme.html

. based on XML::Parser (tree mode)
. designed for simple and compact translation/processing of XML documents
. it includes some features of OmniMark and sgmls.pm; functional approach
. it includes functions to automatically build user-controlled complex Perl
      structures (see the "working with structures" section)
. it was built to show my NLP Perl students that it is easy to work with XML
. home page and download:  http://www.di.uminho.pt/~jj/perl/XML/DT.html

HOW IT WORKS:

. the user must define a handler and call one of the basic functions:
     dt($filename,%handler) or dtstring($string,%handler)
. the handler is a HASH mapping element names to functions. Handlers can
     have a "-default" function (used for elements without an entry of
     their own) and an "-end" function (applied to the final result)
. to keep handlers compact, each function receives 3 arguments as global
     variables:
     $c - contents
     $q - element name
     %v - attribute values
. the default "-default" function is the identity. The function "toxml"
     rebuilds the original XML text from the $c, $q and %v values.
. see some advanced features in the last examples
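
The points above can be sketched with dtstring. This is only a minimal
illustration, assuming XML::DT is installed; the sample document and the
bracketed output format are made up for the example.

```perl
use strict;
use warnings;
use XML::DT;

# Hypothetical input document (not taken from the XML::DT distribution)
my $xml = '<doc><e a="ABC">text</e></doc>';

my %handler = (
    # elements without an entry of their own are rebuilt unchanged
    -default => sub { toxml() },
    # for <e>: $q is the element name, %v the attributes, $c the contents
    e        => sub { "[$q a=$v{a}: $c]" },
);

my $out = dtstring($xml, %handler);
print "$out\n";
```

The handler for "e" only reads the three globals; it is the returned string
that replaces the element in the result.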

SOME simple (naive) examples:

INDEX:
1. change to lowercase attribute named "a" in element "e"
2. better solution 
3. make some statistics and output results in HTML (using side effects)
4. In an HTML-like XML document, substitute <contents>...</contents> by the
    real table of contents (a dirty solution...)
5. a more realistic example: from XML gcapaper DTD to LaTeX

WORKING WITH STRUCTURES INSTEAD OF STRINGS...

6. Build the natural Perl structure of the following document (ARRAY,HASH)
7. Christmas card... (multi map on)

1. change to lowercase the contents of the attribute named "a" in element "e"

use XML::DT ;
my $filename = shift;

print dt($filename,
         ( e => sub{ "<e a='". lc($v{a}). "'>$c</e>" }));

2. A better solution of the previous example

Example 1 would lose any other attributes of element e. A better solution is:

print dt($filename, 
         ( e => sub{ $v{a} = lc($v{a}); 
                     toxml();}));

3. make some statistics and output results in HTML (using side effects)

use XML::DT ;
my $filename = shift;

%handler=( -default => sub{$elem_counter++;
                           $elem_table{$q}++;"";} # $q -> element name
);

dt($filename,%handler);

print "<H3>We have found $elem_counter elements in document</H3>";
print "<TABLE><TH>ELEMENT<TH>OCCURS\n";
foreach $elem (sort keys %elem_table)
   {print "<TR><TD>$elem<TD>$elem_table{$elem}\n";}
print "</TABLE>";

4. In an HTML-like XML document, substitute <contents>...</contents> by the real table of contents (a dirty solution...)

%handler=( h1 => sub{ $index .= "\n$c";     toxml();},
           h2 => sub{ $index .= "\n\t$c";   toxml();},
           h3 => sub{ $index .= "\n\t\t$c"; toxml();},
           contents => sub{ $c="__CLEAN__"; toxml();},
           -end => sub{ $c =~ s/__CLEAN__/$index/; $c});

print dt($filename,%handler);

5. a more realistic example: from XML gcapaper DTD to LaTeX

notes:

. "TITLE" is processed in a context-dependent way!
. output in ISO Latin-1 (this is dirty but my LaTeX doesn't support Unicode)
. a stack of authors was necessary because the LaTeX structure is different
    from the input structure...
. this example was partially created by the function mkdtskel
      perl -MXML::DT -e 'mkdtskel "f.xml"' > f.pl
    and took me about one hour to tune to a real LaTeX/XML example.

NAME gcapaper2tex.pl - a Perl script to translate XML gcapaper DTD to LaTeX

SYNOPSIS gcapaper2tex.pl mypaper.xml > mypaper.tex

use XML::DT ;
my $filename = shift;
my $beginLatex = '\documentclass{article} \begin{document} ';
my $endLatex = '\end{document}';

%handler=(
    '-outputenc' => 'ISO-8859-1',
    '-default'   => sub{"$c"},
     'RANDLIST' => sub{"\\begin{itemize}$c\\end{itemize}"},
     'AFFIL' => sub{""},                              # delete affiliation
     'TITLE' => sub{
                  if(inctxt('SECTION')){"\\section{$c}"}
               elsif(inctxt('SUBSEC1')){"\\subsection{$c}"}
               else                    {"\\title{$c}"}
            },
     'GCAPAPER' => sub{"$beginLatex $c $endLatex"},
     'PARA' => sub{"$c\n\n"},
     'ADDRESS' => sub{"\\thanks{$c}"},
     'PUB' => sub{"} $c"},
     'EMAIL' => sub{"(\\texttt{$c}) "},
     'FRONT' => sub{"$c\n"},
     'AUTHOR' => sub{ push @aut, $c ; ""},
     'ABSTRACT' => sub{
        sprintf('\author{%s}\maketitle\begin{abstract}%s\end{abstract}',
                join ('\and', @aut) ,
                $c) },
     'CODE.BLOCK' => sub{"\\begin{verbatim}\n$c\\end{verbatim}\n"},
     'XREF' => sub{"\\cite{$v{REFLOC}}"},
     'LI' => sub{"\\item $c"},
     'BIBLIOG' =>sub{"\\begin{thebibliography}{1}$c\\end{thebibliography}\n"},
     'HIGHLIGHT' => sub{" \\emph{$c} "},
     'BIO' => sub{""},                                  #delete biography
     'SURNAME' => sub{" $c "},
     'CODE' => sub{"\\verb!$c!"},
     'BIBITEM' => sub{"\n\\bibitem{$c"},
);
print dt($filename,%handler); 

WORKING WITH STRUCTURES INSTEAD OF STRINGS...

The "-type" definition defines the way to build structures in each case:

 . "HASH" or "MAP" -> makes a HASH with the sub-elements;
      keys are the sub-element names; warns on repetitions;
      returns the hash reference.
 . "ARRAY" or "SEQ" -> makes an ARRAY with the sub-elements;
      returns an array reference.
 . "MULTIMAP" -> makes a HASH of ARRAYs; keys are the sub-element names
 . MMAPON(name1, ...) -> similar to HASH but accepts repetitions of
      the sub-elements "name1"... (and makes an array with them)
 . STR -> (DEFAULT) concatenates the values returned by all the sub-elements;
      all the sub-elements should return strings
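
As a small illustration of the "MULTIMAP" type, which the numbered examples
do not cover, here is a hedged sketch. It assumes XML::DT is installed; the
one-line document is hypothetical and written without whitespace between
elements to keep the result easy to read.

```perl
use strict;
use warnings;
use XML::DT;

# Hypothetical input: one person with a repeated <address> sub-element
my $xml = '<person><name>n0</name>'
        . '<address>a0</address><address>a1</address></person>';

my %handler = (
    -default => sub { $c },                    # pass the built value upwards
    -type    => { person => 'MULTIMAP' },      # HASH of ARRAYs for <person>
);

my $person = dtstring($xml, %handler);

# every key holds an array reference, even when the element occurs once
print join(", ", @{ $person->{address} }), "\n";
print $person->{name}[0], "\n";
```

Compared with MMAPON('address'), where only the named sub-elements become
arrays, here "name" is also wrapped in an array.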

6. Build the natural Perl structure of the following document

<institution>
  <id>U.M.</id>
  <name>University of Minho</name>
  <tels>
    <item>1111</item> 
    <item>1112</item>
    <item>1113</item>
  </tels>
  <where>Portugal</where>
  <contacts>J.Joao; J.Rocha; J.Ramalho</contacts>
</institution>

use XML::DT;
%handler = ( -default => sub{$c},
             -type    => { institution => 'HASH',
                           tels        => 'ARRAY' },
             contacts => sub{ [ split(";",$c)] },
           );

$a = dt("ex10.2.xml", %handler);

$a is a reference to a HASH:

{ 'tels' => [ 1111, 1112, 1113 ],
  'name' => 'University of Minho',
  'where' => 'Portugal',
  'id' => 'U.M.',
  'contacts' => [ 'J.Joao', ' J.Rocha', ' J.Ramalho' ] };

7. Christmas card...

We have the following address book:

<people>
  <person>
      <name> name0 </name>
      <address> address00 </address>    
      <address> address01 </address>
  </person>
  <person>
      <name> name1 </name>
      <address> address10 </address>    
      <address> address11 </address>
  </person>
</people>

Now we are going to build a structure to store the address book and write a Christmas card to the first address of each person:

#!/usr/bin/perl
use XML::DT;
%handler = ( -default => sub{$c},
             person   => sub{ mkchristmascard($c); $c},
             -type    => { people => 'ARRAY',
                           person => MMAPON('address')});

$people = dt("ex11.1.xml", %handler);

print $people->[0]{address}[1];     # prints  address01

sub mkchristmascard{ my $x=shift;
  open(A,"|lpr") or die;
  print A <<".";
  $x->{name} 
  $x->{address}[0]
  
  Dear $x->{name}
    Merry Christmas from Braga Perl mongers\n
.

close A;
}