NAME

Tree::Parser - Module to parse formatted files into tree structures

SYNOPSIS

use Tree::Parser;

# create a new parser object with some input
my $tp = Tree::Parser->new($input);

# use the built in tab indent filters
$tp->useTabIndentedFilters();

# use the built in space indent filters
$tp->useSpaceIndentedFilters(4); 

# use the built in dot-seperated numbers filters
$tp->useDotSeperatedLevelFilters();

# use the nested parens filter
$tp->useNestedParensFilters();

# create your own filter
$tp->setParseFilter(sub {
    my ($line_iterator) = @_;
    my $line = $line_iterator->next();
    my ($id, $tabs, $desc) = $line =~ /(\d+)(\t*)(.*)/;
    my $depth = length $tabs;
    return ($depth, { id => $id, desc => $desc } );
});

# parse our input and get back a tree
my $tree = $tp->parse();

# create your own deparse filter
# (which is in the inverse of our
# custom filter above)
$tp->setDeparseFilter(sub { 
    my ($tree) = @_;
    my $info = $tree->getNodeValue();
    return ($info->{id} . ("\t" x $tree->getDepth()) . $info->{desc});
});

# deparse our tree and get back a string
my $tree_string = $tp->deparse();

DESCRIPTION

This module can parse various types of input (formatted and containing hierarchal information) into a tree structures. It can also deparse the same tree structures back into a string. It accepts various types of input, such as; strings, filenames, array references. The tree structure is a hierarchy of Tree::Simple objects.

The parsing is controlled through a parse filter, which is used to process each "line" in the input (see setParseFilter below for more information about parse filters).

The deparseing as well is controlled by a deparse filter, which is used to covert each tree node into a string representation.

This module can be viewed (somewhat simplistically) as a serialization tool for Tree::Simple objects. Properly written parse and deparse filters can be used to do "round-trip" tree handling.

METHODS

Constructor

new ($tree | $input)

The constructor is used primarily for creating an object instance. Initializing the object is done by the _init method (see below).

Input Processing

setInput ($input)

This method will take varios types of input, and pre-process them through the prepareInput method below.

prepareInput ($input)

The prepareInput method is used to pre-process certain types of $input. It accepts any of the follow types of arguments:

  • an Array::Iterator object

    This just gets passed on through.

  • an array reference containing the lines to be parsed

    This type of argument is used to construct an Array::Iterator instance.

  • a filename

    The file is opened, its contents slurped into an array, which is then used to construct an Array::Iterator instance.

    NOTE: we used to only handle files with the .tree extension, however that was annoying, so now we accept any file name.

  • a string

    The string is expected to have at least one embedded newline or be in the nested parens format.

It then returns an Array::Iterator object ready for the parser.

setFileEncoding($encoding)

This allows you to specify the $encoding that the file should be read using. This is only only applicable when your input is a file.

Filter Methods

useTabIndentedFilters

This will set the parse and deparse filters to handle tab indented content. This is for true tabs \t only. The parse and deparse filters this uses are compatible with one another so round-triping is possible.

Example:

1.0
    1.1
    1.2
        1.2.1
2.0
    2.1
3.0
    3.1
        3.1.1
useSpaceIndentedFilters ($num_spaces)

This will set the parse and deparse filters to handle space indented content. The optional $num_spaces argument allows you to specify how many spaces are to be treated as a single indent, if this argument is not specified it will default to a 4 space indent. The parse and deparse filters this uses are compatible with one another so round-triping is possible.

Example:

1.0
  1.1
  1.2
    1.2.1
2.0
  2.1
3.0
  3.1
    3.1.1
useDotSeparatedLevelFilters (@level_identifiers)

This will set the parse and deparse filters to handle trees which are described in the following format:

1 First Child
1.1 First Grandchild
1.2 Second Grandchild
1.2.1 First Child of the Second Grandchild
1.3 Third Grandchild
2 Second Child 

There must be at least one space seperating the level identifier from the level name, all other spaces will be considered part of the name itself.

The parse and deparse filters this uses are compatible with one another so round-triping is possible.

The labels used are those specified in the @level_identifiers argument. The above code uses the default level identifiers (1 .. 100). But by passing the following as a set of level identifiers: 'a' .. 'z', you can successfully parse a format like this:

a First Child
a.a First Grandchild
a.b Second Grandchild
a.b.a First Child of the Second Grandchild
a.c Third Grandchild
b Second Child

Currently, you are restricted to only one set of level identifiers. Future plans include allowing each depth to have its own set of identifiers, therefore allowing formats like this: 1.a or other such variations (see "TO DO" section for more info).

useDotSeperatedLevelFilters

This old mispelled method name is kept for backwards compat.

useNestedParensFilters

This will set the parse and deparse filters to handle trees which are described in the following format:

(1 (1.1 1.2 (1.2.1) 1.3) 2 (2.1))

The parser will count the parentheses to determine the depth of the current node. This filter can also handle double quoted strings as values as well. So this would be valid input:

(root ("tree 1" ("tree 1 1" "tree 1 2") "tree 2"))

This format is currently somewhat limited in that the input must all be on one line and not contain a trailing newline. It also does not handle embedded escaped double quotes. Further refinement and improvement of this filter format is to come (and patches are always welcome).

It should be noted that this filter also cannot perform a roundtrip operation where the deparsed output is the exact same as the parsed input because it does not treat whitespace as signifigant (unless it is within a double quoted string).

setParseFilter ($filter)

A parse filter is a subroutine reference which is used to process each element in the input. As the main parse loop runs, it calls this filter routine and passes it the Array::Iterator instance which represents the input. To get the next element/line/token in the iterator, the filter must call next, the element should then be processed by the filter. A filter can if it wants advance the iterator further by calling next more than once if nessecary, there are no restrictions as to what it can do. However, the filter must return these two values in order to correctly construct the tree:

the depth of the node within the tree
Followed by either of the following items:
the value of the node

This value will be used as the node value when constructing the new tree. This can basically be any scalar value.

an instance of either a Tree::Simple object, or some derivative of Tree::Simple

If you need to perform special operations on the tree instance before it get's added to the larger hierarchy, then you can construct it within the parse filter and return it. An example of why you might want to do this would be if you wanted to set the UID of the tree instance from something in the parse filter.

The following is an example of a very basic filter which simply counts the number of tab characters to determine the node depth and then captures any remaining character on the line.

$tree_parser->setParseFilter(sub {
    my ($iterator) = @_;
    my $line = $iterator->next();
    # match the tables and all that follows it
    my ($tabs, $node) = ($line =~ /(\t*)(.*)/);
    # calculate the depth by seeing how long
    # the tab string is.
    my $depth = length $tabs;
    # return the depth and the node value
    return ($depth, $node);
}); 
setDeparseFilter ($filter)

The deparse filter is the opposite of the parse filter, it takes each element of the tree and returns a string representation of it. The filter routine gets passed a Tree::Simple instance and is expected to return a single string. However, this is not enforced we actually will gobble up all the filter returns, but keep in mind that each element returned is considered to be a single line in the output, so multiple elements will be treated as mutiple lines.

Here is an example of a deparse filter. This can be viewed as the inverse of the parse filter example above.

$tp->setDeparseFilter(sub { 
    my ($tree) = @_;
    return ("\t" x $tree->getDepth()) . $tree->getNodeValue();
});

Accessors

getTree

This method returns the tree held by the parser or set through the constructor.

Parse/Deparse

parse

Parsing is pretty automatic once everthing is set up. This routine will check to be sure you have all you need to proceed, and throw an execption if not. Once the parsing is complete, the tree will be stored interally as well as returned from this method.

deparse

This method too is pretty automatic, it verifies that it has all its needs, throwing an exception if it does not. It will return an array of lines in list context, or in scalar context it will join the array into a single string seperated by newlines.

Private Methods

_init ($tree | $input)

This will initialize the slots of the object. If given a $tree object, it will store it. This is currently the prefered way in which to use subclasses of Tree::Simple to build your tree with, as this object will be used to build any other trees (see "TO DO" for more information). If given some other kind of input, it will process this through the prepareInput method.

_parse

This is where all the parsing work is done. If you are truely interested in the inner workings of this method, I suggest you refer to the source. It is a very simple algorithm and should be easy to understand.

_deparse

This is where all the deparsing work is done. As with the _parse method, if you are interested in the inner workings, I suggest you refer to the source.

TO DO

Enhance the Nested Parens filter

This filter is somewhat limited in its handling of embedded newlines as well as embedded double quotes (even if they are escaped). I would like to improve this filter more when time allows.

Enhance the Dot Seperated Level filter

I would like to enhance this built in filter to handle multi-level level-identifiers, basically allowing formats like this:

1 First Child
1.a First Grandchild
1.b Second Grandchild
1.b.I First Child of the Second Grandchild
1.b.II Second Child of the Second Grandchild
1.c Third Grandchild
2 Second Child
Make Tree::Simple subclasses more easy to handle

Currently in order to have Tree::Parser use a subclass of Tree::Simple to build the heirarchy with, you must pass a tree into the constructor, and then set the input manually. This could be handled better I think, but right now I am not 100% how best to go about it.

BUGS

None that I am aware of. Of course, if you find a bug, let me know, and I will be sure to fix it. This module, in an earlier form, has been and is being used in production for approx. 1 year now without incident. This version has been improved and the test suite added.

CODE COVERAGE

I use Devel::Cover to test the code coverage of my tests, below is the Devel::Cover report on this module's test suite.

---------------------------- ------ ------ ------ ------ ------ ------ ------
File                           stmt branch   cond    sub    pod   time  total
---------------------------- ------ ------ ------ ------ ------ ------ ------
Tree/Parser.pm                100.0   87.9   81.2  100.0  100.0  100.0   94.6
---------------------------- ------ ------ ------ ------ ------ ------ ------
Total                         100.0   87.9   81.2  100.0  100.0  100.0   94.6
---------------------------- ------ ------ ------ ------ ------ ------ ------

SEE ALSO

This module is not an attempt at a general purpose parser by any stretch of the imagination. It is basically a very flexible special purpose parser, it only builds Tree::Simple heirarchies, but your parse filters can be as complex as nessecary. If this is not what you are looking for, then you might want to consider one of the following modules:

Parse::RecDescent

This is a general purpose Recursive Descent parser generator written by Damian Conway. If your parsing needs lean towards the more complex, this is good module for you. Recursive Descent parsing is known to be slower than other parsing styles, but it tends to be easier to write grammers for, so there is a trade off. If speed is a concern, then you may just want to skip perl and go straight to C and use yacc.

Parse::Yapp

As an alternative to Recursive Descent parsing, you can do LALR parsing. It is faster and does not have some of the well known (and avoidable) problems of Recursive Descent parsing. I have never actually used this module, but I have heard good things about it.

Parse::FixedLength

If all you really need to do is process a file with fixed length fields in it, you can use this module.

Parse::Tokens

This class will help you parse text with embedded tokens in it. I am not very familiar with this module, but it looks interesting.

There are also a number of specific parsers out here, such as HTML::Parser and XML::Parser, which do one thing and do it well. If you are looking to parse HTML or XML, don't use my module, use these ones, it just makes sense. Use the right tool for the job basically.

DEPENDENCIES

This module uses two other modules I have written:

Tree::Simple
Array::Iterator

ACKNOWLEDGEMENTS

Thanks to Chad Ullman for reporting RT Bug #12244 and providing code and test case for it.
Thanks to Gerd for reporting RT Bug #13041 and providing code to fix it.

AUTHOR

stevan little, <stevan@iinteractive.com>

COPYRIGHT AND LICENSE

Copyright 2004-2007 by Infinity Interactive, Inc.

http://www.iinteractive.com

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.