NAME

File::ANVL - A Name Value Language routines

SYNOPSIS

 use File::ANVL;       # to import routines into a Perl script

 getlines(             # read from $filehandle (defaults to *ARGV) up to
         $filehandle   # blank line; returns record read or undef on EOF;
         );            # record may be all whitespace (almost EOF)

 trimlines(            # strip initial whitespace from record, often just
         $record,      # returned by getlines(), and return remainder;
         $r_wslines,   # optional ref to line count in trimmed whitespace
         $r_rrlines ); # optional ref to line count of real record lines

 anvl_recarray(        # split $record into array of lineno-name-value
         $record,      # triples, first triple being <anvl, beta, "">
         $r_elems,     # reference to returned array
         $lineno,      # starting line number (default 1)
         $opts );      # options/default, eg, comments/0, autoindent/1

 erc_anvl_expand_array(# change short ERC ANVL array to long form ERC
         $r_elems );   # reference to array to modify in place

 anvl_arrayhash(       # hash indices from recarray or expand_array
         $r_elems,     # reference to original array (not modified)
         $r_hash );    # reference to hash (you undef to initialize)

 anvl_valsplit(        # split ANVL value into an array of subvalues
         $value,       # input value; arg 2 is reference to returned
         $r_svals );   # array of arrays of returned values

 anvl_decode( $str );  # decode ANVL-style %xy chars in string

 anvl_name_naturalize( # convert name from sort-friendly to natural
         $name );      # word order using ANVL inversion points

 anvl_om(              # read and process records from *ARGV
         $om,          # a File::OM formatting object
   {                   # a hash reference to various options
   autoindent => 0,    # don't (default do) correct sloppy indention
   elem_order => 0,    # ordered element name list (default all) to output
   comments => 1,      # do (default don't) preserve input comments
   verbose => 1,       # output record and line numbers (default don't)
   ... } );            # other options listed later

 anvl_opt_defaults();  # return hash reference with factory defaults

 *DEPRECATED*
 anvl_rechash(         # split ANVL record into hash of elements
         $record,      # input record; arg 2 is reference to returned
         $r_hash,      # hash; a value is scalar, or array of scalars
         $strict );    # if more than one element shares its name

 anvl_recsplit(        # split record into array of name-value pairs;
         $record,      # input record; arg 2 is reference to returned
         $r_elems,     # array; optional arg 3 (default 0) requires
         $strict );    # properly indented continuation lines
 anvl_encode( $str );  # ANVL-encode string

 *REPLACED*
 # instead of anvl_fmt use File::OM::ANVL object's 'elems' method
 $elem = anvl_fmt(     # format ANVL element, wrapping to 72 columns
         $name,        # $name is what goes to left of colon (:)
         $value,       # $value is what goes to right of colon
         ... );        # more name/value pairs may follow

DESCRIPTION

This is documentation for the ANVL Perl module, which provides a general framework for data represented in the ANVL format. ANVL (A Name Value Language) represents elements in a label-colon-value format similar to email headers. Specific conversions, based on an "output multiplexer" File::OM, are possible to XML, Turtle, JSON, CSV, PSV (Pipe Separated Value), and Plain unlabeled text.

The OM package can also be used to build records from scratch in ANVL or the other formats. Below is an example of how to create a particular kind of ANVL record known as an ERC (which uses Dublin Kernel metadata). For the formats ANVL, Plain, and XML, the returned text string by default is wrapped to 72 columns.

use File::OM;
my $om = File::OM->new("ANVL");
$anvl_record = $om->elems(
    "erc", "",
    "who", $creator,
    "what", $title,
    "when", $date,
    "where", $identifier)
    . "\n";    # 2nd newline in a row terminates ANVL record

The getlines() function reads from $filehandle up to a blank line and returns the lines read. This is a general function for reading "paragraphs", which is useful for reading ANVL records. If unspecified, $filehandle defaults to *ARGV, which makes it easy to take input from successive file arguments specified on the command line (or from STDIN if none) of the calling program.

For convenience, trimlines() is often used to process the record just returned by getlines(). It strips leading whitespace, optionally counts lines, and returns undef if the passed record is undefined or contains only whitespace, both being equivalent to end-of-file (EOF).

These functions treat whitespace specially. Input is read up until at least one non-whitespace character and a blank line (two newlines in a row) or EOF is reached. If EOF is reached and the record would contain only whitespace, undef is returned. Input line counts for preliminary trimmed whitespace ($wslines) and real record lines ($rrlines) can be returned through optional scalar references given to trimlines(). These functions work together to permit the caller access to all inputs, to accurate line counts, and a familiar "loop until EOF" paradigm, as in

while (defined trimlines(getlines(), \$wslcount, \$rrlcount)) ...
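The "read up to a blank line" behavior can be pictured with Perl's built-in paragraph mode. The following is only a minimal sketch under stated assumptions, not the module's implementation: it reads from an in-memory filehandle, the sample data is invented for illustration, and it does not count lines the way trimlines() can.

```perl
use strict;
use warnings;

# Illustrative input only: leading blank lines, then two records.
my $input = "\n\nerc:\nwho: Smith, Jo\n\n\nwho: Doe, Pat\nwhat: Notes\n";
open(my $fh, '<', \$input) or die "cannot open in-memory file: $!";

my @records;
{
    local $/ = "";              # paragraph mode: blank lines end a record
    while (defined(my $para = <$fh>)) {
        $para =~ s/^\s+//;      # drop leading whitespace, as trimlines() is described to do
        next if $para eq "";    # whitespace-only input is treated as no record
        push @records, $para;
    }
}
close($fh);

printf "%d records read\n", scalar @records;
```

Paragraph mode only treats completely empty lines as separators, so this stand-in is cruder than the whitespace handling described above.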

The anvl_recarray() function splits an ANVL record into ANVL elements, returning them via the array reference given as the second argument. The n-th returned ANVL element corresponds to three Perl array elements as follows:

INDEX   CONTENT
3n + 0  input file line number
3n + 1  n-th ANVL element name
3n + 2  n-th ANVL element value

This means, for example, that the first two ANVL element names would be found at Perl array indices 4 and 7. The first triple is special; array elements 0 and 2 are undefined unless the record begins with an unlabeled value (not strictly ANVL), such as,

Smith, Jo
home: 555-1234
work: 555-9876

in which case they contain the line number and value, respectively. Array element 1 always contains a string naming the format of the input, such as, "ANVL", "JSON", "XML", etc.

The remaining triples are free form except that the values will have been drawn from the original format and possibly decoded. The first item ("lineno") in each remaining triple is a number followed by a character, for example, "34:" or "6#". The number indicates the line number (or octet offset, depending on the origin format) of the start of the element. The character is either ':' to indicate a real element or '#' to indicate a comment; if the latter, the element name has no defined meaning and the comment is contained in the value. Here's example code that reads a 3-element record and reformats it.

($msg = File::ANVL::anvl_recarray('
a: b c
d:  e
  f
g:
  h i
', \@elems))     and die "anvl_recarray: $msg";  # report what went wrong
for ($i = 4; $i < $#elems; $i += 3)
    { print "[$elems[$i] <- $elems[$i+1]]  "; }

which prints

[a <- b c]  [d <- e f]  [g <- h i]

An optional third argument to anvl_recarray gives the starting line number (default 1). An optional fourth argument is a reference to a hash containing options; the argument { comments => 1, autoindent => 0 } will cause comments to be kept (stripped by default) and recoverable indention errors to be flagged as errors (corrected to continuation lines by default). This function returns the empty string on success, or a message beginning "warning: ..." or "error: ...".
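To make the triple layout concrete, here is a small sketch that walks an array shaped like the one described above. The array is built by hand for illustration (it is not actual anvl_recarray() output), and the empty string used as the comment's name is an assumption, since the name of a comment triple has no defined meaning.

```perl
use strict;
use warnings;

# Hand-built array in the documented layout (illustrative data only).
my @elems = (
    undef, "ANVL", undef,           # first triple: format name at index 1
    "1:",  "who",  "Smith, Jo",     # real element starting on line 1
    "2#",  "",     "a comment",     # comment triple; its name is a placeholder
    "3:",  "what", "Field notes",   # real element starting on line 3
);

my @pairs;
for (my $i = 3; $i + 2 <= $#elems; $i += 3) {
    my ($lineno, $name, $value) = @elems[$i .. $i + 2];
    next if $lineno =~ /#$/;            # skip comment triples
    (my $line = $lineno) =~ s/\D+$//;   # strip the trailing ':' marker
    push @pairs, "$name (line $line): $value";
}
print "$_\n" for @pairs;
```

Starting the loop at index 3 skips the special first triple; the element names then fall at indices 4, 7, and so on.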

erc_anvl_expand_array() inspects and possibly modifies in place the kind of element array resulting from a call to anvl_recarray(). It returns the empty string on success, otherwise an error message. This routine is useful for transforming a short form ERC ANVL record into long form, for example, expanding erc: a | b | c | d into the equivalent,

erc:
who: a
what: b
when: c
where: d
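The short-to-long transformation shown above can be sketched on plain strings. This is a simplified illustration, not what erc_anvl_expand_array() does internally: the helper name expand_short_erc is hypothetical, it works on a value string rather than an element array, and it assumes the four subvalues map to who, what, when, and where in that order.

```perl
use strict;
use warnings;

my @erc_labels = qw(who what when where);

# Hypothetical helper: turn the value "a | b | c | d" into long-form lines.
sub expand_short_erc {
    my ($value) = @_;
    $value =~ s/^\s+|\s+$//g;                   # trim the whole value
    my @parts = split /\s*\|\s*/, $value;       # subvalues separated by '|'
    my @lines = ("erc:");
    for my $i (0 .. $#erc_labels) {
        last if $i > $#parts;                   # fewer subvalues than labels
        push @lines, "$erc_labels[$i]: $parts[$i]";
    }
    return join("\n", @lines) . "\n";
}

print expand_short_erc("a | b | c | d");
```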

The anvl_arrayhash() function takes the kind of element array resulting from a call to anvl_recarray() or erc_anvl_expand_array() and modifies the hash reference given as the second argument by storing, for each element name, a list of integers corresponding to the triples that bear that name. You should always undefine the hash first or you may see unexpected results. So to print the value (the 2nd array element past the start of the triple) of the first instance (index 0) of "who",

anvl_arrayhash(\@elems, \%hash);
print "First who: ", $elems[ $hash{who}->[0] + 2 ], "\n";
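For readers who want to see the shape of the resulting hash, the following sketch builds a comparable index by hand. It assumes that each stored integer is the array index where a triple starts, which is inferred from the "+ 2" in the example above; the element array is invented for illustration and the loop is not the module's code.

```perl
use strict;
use warnings;

# Hand-built element array in the triple layout (illustrative data only).
my @elems = (
    undef, "ANVL", undef,
    "1:",  "who",  "Smith, Jo",
    "2:",  "what", "Field notes",
    "3:",  "who",  "Doe, Pat",
);

my %hash;                                       # start from an empty hash
for (my $i = 3; $i + 2 <= $#elems; $i += 3) {
    push @{ $hash{ $elems[$i + 1] } }, $i;      # record where each triple starts
}

print "First who: ",  $elems[ $hash{who}->[0] + 2 ], "\n";
print "Second who: ", $elems[ $hash{who}->[1] + 2 ], "\n";
```

Because repeated names accumulate in a list, reusing a hash that already holds entries would mix old and new indices, which is why starting from an empty hash matters.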

The anvl_valsplit() function splits an ANVL value into sub-values (svals) and repeated values (rvals), returning them as an array of arrays via the array reference given as the second argument. The top-level of the array represents svals and the next level represents rvals. This function returns the empty string on success, or a message beginning "warning: ..." or "error: ...".
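The array-of-arrays result can be illustrated with a short sketch. It assumes that '|' separates sub-values and ';' separates repeated values, which this documentation does not state explicitly; the helper name split_value is hypothetical and the code is not the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a reference to an array of arrays,
# top level for sub-values, next level for repeated values.
sub split_value {
    my ($value) = @_;
    my @svals;
    for my $sval (split /\|/, $value) {         # assumed sub-value separator
        my @rvals;
        for my $rval (split /;/, $sval) {       # assumed repeated-value separator
            $rval =~ s/^\s+|\s+$//g;            # trim surrounding whitespace
            push @rvals, $rval;
        }
        push @svals, \@rvals;
    }
    return \@svals;
}

my $svals = split_value("a; b | c | d; e; f");
print join(" / ", map { join(", ", @$_) } @$svals), "\n";
```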

The anvl_decode() function takes an ANVL-encoded string and returns it after converting encoded characters to the standard representation (e.g., %vb becomes `|'). Some decoding, such as for the expansion block below,

print anvl_decode('http://example.org/node%{
            ? db = foo
            & start = 1
            & end = 5
            & buf = 2
            & query = foo + bar + zaf
       %}');

will affect an entire region. This code prints

http://example.org/node?db=foo&start=1&end=5&buf=2&query=foo+bar+zaf
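The region-wide effect of an expansion block can be approximated with a substitution that removes all whitespace between %{ and %}. This is a sketch covering only the two cases mentioned here (expansion blocks and %vb); the sub names are hypothetical, and the real anvl_decode() may handle more encodings.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every whitespace character from a string.
sub strip_ws {
    my ($s) = @_;
    $s =~ s/\s+//g;
    return $s;
}

# Hypothetical partial decoder: expansion blocks and %vb only.
sub decode_sketch {
    my ($str) = @_;
    $str =~ s/%\{(.*?)%\}/strip_ws($1)/ges;     # collapse each %{ ... %} region
    $str =~ s/%vb/|/g;                          # %vb stands for a vertical bar
    return $str;
}

my $url = decode_sketch('http://example.org/node%{
            ? db = foo
            & start = 1
            & end = 5
            & buf = 2
            & query = foo + bar + zaf
       %}');
print "$url\n";
```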

The anvl_name_naturalize() function takes an ANVL string (aval) and returns it after inversion at any designated inversion points. The input string will be returned if it does not end in a comma (`,'). The more terminal commas, the more inversion points tried. For example, the calls

anvl_name_naturalize("Smith, Pat,");
anvl_name_naturalize("McCartney, Paul, Sir,,");
anvl_name_naturalize("Hu Jintao,");

take sort-friendly strings (commonly used to make ANVL records easy to sort) and return the natural word order strings,

Pat Smith
Sir Paul McCartney
Hu Jintao
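One way to picture the inversion is to count the terminal commas, split the remaining string at up to that many commas, and reverse the pieces. The sketch below reproduces the three examples above, but it is only a guess at the rule: the sub name naturalize_sketch is hypothetical, and the module's handling of other inputs may differ.

```perl
use strict;
use warnings;

# Hypothetical approximation of the inversion rule.
sub naturalize_sketch {
    my ($name) = @_;
    return $name unless $name =~ s/(,+)\s*$//;      # no terminal comma: unchanged
    my $points = length $1;                         # number of terminal commas
    my @parts = split /,\s*/, $name, $points + 1;   # split at up to $points commas
    return join " ", reverse @parts;                # natural word order
}

print naturalize_sketch($_), "\n"
    for "Smith, Pat,", "McCartney, Paul, Sir,,", "Hu Jintao,";
```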

The anvl_om() routine takes a formatting object created by a call to File::OM->new($format), reads a stream of ANVL records, processes each element, and calls format-specific methods to build the output. The behavior of those methods is typically controlled by options, often taken from the command line, that are passed in at object creation time.

use File::ANVL;
use File::OM;
my $fmt = $opt{format};                    # from command line
$om = File::OM->new($fmt,
    { comments => $opt{comments} }) or     # from command line
        die "unknown format $fmt";

Options control various aspects of reading ANVL input records. The 'autoindent' option (default on) causes the parser to recover if it can when continuation lines are not properly indented. As a special case, if the first line of the record has no label, leaving 'autoindent' on will cause anvl_recarray() to preserve its value and line number in the first triple, which anvl_om() will detect and pass through with the synthesized name '_'.

The 'elem_order' option (default undefined) can be used to control which elements are output and their ordering. If set to a reference to an array of element names, which may contain repeated names, the specified elements (and no others) are output in the specified order. Normally, all elements present in the array are output. Under the CSV and PSV formats, element order is by default inferred by the ordering of elements found in the first record.

The 'comments' option (default off) causes input comments to be preserved in the output, format permitting. The 'verbose' option inserts record and line numbers in comments. Pseudo-comments will be created for formats that don't natively define comments (JSON, Plain).

Like the individual OM methods, anvl_om() returns the built string by default, or the return status of print using the file handle supplied as the 'outhandle' option (normally set to '') at object creation time, for example,

{ outhandle => *STDOUT }

The way anvl_om() works is roughly as follows.

$om->ostream();                                    # open stream
... { # loop over all records, eg, $recnum++
$anvlrec = trimlines(getlines());
last         unless $anvlrec;
$err = anvl_recarray($anvlrec, $$o{elemsref}, $startline, $opts);
$err         and return "anvl_recarray: $err";
...
$om->orec($anvlrec, $recnum, $startline);          # open record
...... { # loop over all elements, eg, $elemnum++
$om->elem($name, $value, $elemnum, $lineno);       # do element
...... }
$om->crec($recnum);                                # close record
... }
$om->cstream();                                    # close stream

DEPRECATED: The anvl_rechash() function splits an ANVL record into elements, returning them via the hash reference given as the second argument. A hash key is defined for each element name found. Under that key is stored the corresponding element value, or an array of values if more than one occurrence of the element name was encountered. This function returns the empty string on success, or a message beginning "warning: ..." or "error: ...".

DEPRECATED: The anvl_recsplit() function splits an ANVL record into elements, returning them via the array reference given as the second argument. Each returned element is a pair of array items: a name and a value. An optional third argument, if true (default 0), rejects unindented continuation lines, a common formatting mistake. This function returns the empty string on success, or a message beginning "warning: ..." or "error: ...". Here's an example that extracts and uses the first returned element.

($msg = anvl_recsplit($record, $elemsref))
    and die "anvl_recsplit: $msg";  # report what went wrong
print scalar(@$elemsref) / 2, " elements found\n",   # two array items per element
    "First element label is $$elemsref[0]\n",
    "First element value is $$elemsref[1]\n";

SEE ALSO

A Name Value Language (ANVL) http://www.cdlib.org/inside/diglib/ark/anvlspec.pdf

A Metadata Kernel for Electronic Permanence (PDF) http://journals.tdl.org/jodi/article/view/43

HISTORY

This is a beta version of ANVL tools. It is written in Perl.

AUTHOR

John A. Kunze jak at ucop dot edu

COPYRIGHT AND LICENSE

Copyright 2009-2011 UC Regents. Open source BSD license.

PREREQUISITES

Perl Modules: File::OM

Script Categories:

UNIX : System_administration