NAME

HTML::PullParser::Nested - Wrapper around HTML::PullParser with awareness of tag nesting.

SYNOPSIS

use HTML::PullParser::Nested;

my $p = HTML::PullParser::Nested->new(
    doc         => \ "<html>...<ul><li>abcd<li>efgh<li>wvyz</ul>...<ul><li>1<li>2<li>9</ul></html>",
    start       => "'S',tagname,attr,attrseq,text",
    end         => "'E',tagname,text",
    text        => "'T',text,is_cdata",
    );

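# Walk the top-level token stream looking for <ul> start tags.
while (my $token = $p->get_token()) {
    if ($token->[0] eq "S" && $token->[1] eq "ul") {
        # Restrict the parser to the contents of this <ul>.
        $p->push_nest($token);
        print "List:\n";
        while (my $item = $p->get_token()) {
            if ($item->[0] eq "S" && $item->[1] eq "li") {
                # The token after <li> is expected to be its text;
                # stop if the level ends instead.
                my $text = $p->get_token() or last;
                print $text->[1], "\n";
            }
        }
        print "\n";
        # Return to the enclosing level; the cursor is now at </ul>.
        $p->pop_nest();
    }
}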

DESCRIPTION

This class is a wrapper around HTML::PullParser with awareness of the nesting of tags.

There is a cursor, which points to the current position within the document. It should be thought of as pointing to the start of the next token, or to 'EOL' (end of level).

Tokens can be read sequentially, and the cursor will be advanced after each read. They can also be unread, reversing any effects of their having been read.

As noted, the class is aware of tag nesting, which gives rise to the concept of nesting levels. Level 1 encompasses the whole document. At any point a new nesting level can be pushed on, specifying a tag type. In effect, the parser then behaves as if it had instead been opened on a document containing only the content up to the corresponding closing tag. A nesting level can later be popped, which moves the cursor to the start of the closing tag and switches back to the parent nesting level.

METHODS

new(file => $file, %options), new(doc => \$doc, %options)

Constructor. %options is passed to the encapsulated HTML::PullParser object and largely has the same restrictions. Because HTML::PullParser::Nested needs to process the tokens returned by HTML::PullParser, there are some additional restrictions on the argspec for each token type. Firstly, so that the token type can be identified, either event or a distinct literal string must be present at the same array index in the argspec for each returned token type. Secondly, for start and end tags, tagname must also be present somewhere in the argspec.
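As an illustration of these restrictions, the following is a minimal sketch that uses event as the type identifier at index 0 of every argspec and includes tagname for start and end tags. The sample document and the @seen array are assumptions made for the example; event, tagname, attr and text are standard HTML::PullParser argspec names.

```perl
use strict;
use warnings;
use HTML::PullParser::Nested;

# Sample document, assumed for illustration only.
my $html = "<p>Hello <b>world</b></p>";

my $p = HTML::PullParser::Nested->new(
    doc   => \$html,
    # 'event' sits at index 0 in every argspec, so the token type can
    # always be read from $token->[0].
    start => "event,tagname,attr",
    end   => "event,tagname",
    text  => "event,text",
);

my @seen;
while (my $token = $p->get_token()) {
    # $token->[1] is the tag name for start/end tokens and the text otherwise.
    push @seen, "$token->[0]: $token->[1]";
}
print "$_\n" for @seen;
```

Literal strings such as 'S', 'E' and 'T', as used in the SYNOPSIS, would serve the same purpose as event, provided they are distinct and share the same index.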

get_token()

Read and return the next token and advance the cursor. If the cursor points to EOL, undef is returned on the first read attempt, and an error is raised on any further attempt.

unget_token(@tokens)

Reverse the effects of get_token().

eol()

End of level flag. Returns true after get_token() has returned undef to signify end of level.
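The three methods above can be combined to peek at a token and to read safely to the end of a level. The following is a minimal sketch under the behaviour described in this document; the sample document and the variable names $peeked and $count are assumptions made for the example.

```perl
use strict;
use warnings;
use HTML::PullParser::Nested;

# Sample document, assumed for illustration only.
my $html = "<ul><li>one<li>two</ul>";

my $p = HTML::PullParser::Nested->new(
    doc   => \$html,
    start => "'S',tagname",
    end   => "'E',tagname",
    text  => "'T',text",
);

# Peek at the next token: read it, inspect it, then put it back.
my $peeked = $p->get_token();
$p->unget_token($peeked);

# Read to the end of the level. get_token() returns undef once at EOL;
# stopping there avoids the error raised by a further read.
my $count = 0;
while (defined(my $token = $p->get_token())) {
    $count++;
}

print "first token type: $peeked->[0]\n";
print "tokens read: $count\n";
print "at end of level\n" if $p->eol();
```

Because the peeked token is put back before the loop starts, it is read again and included in the count.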

push_nest($token)

Push a new nesting level onto the stack. $token should be a start tag. The current level will now consist of all tokens up to the corresponding closing tag.

The corresponding closing tag is determined by counting the start and end tags of the current nesting level. This means that if

<a>
    <b>
        <a>
        <a>
        <a>
    </b>
</a>

is encountered whilst the current nesting level is tracking <a> tags, the parser will end up either three <a> tags deeper or at the same depth, depending on whether push_nest() and pop_nest() are called for the <b> tag.
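The counting rule can be sketched in plain Perl, independently of the module. This is only an illustration of the idea, not the module's actual implementation: find_matching_end is a hypothetical helper, and the tokens are hand-written in the ['S', tagname] / ['E', tagname] form used in the SYNOPSIS.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the module: given the tokens that
# follow an opening tag, return the index of the matching end tag by
# counting start and end tags of the same name, or undef if none is found.
sub find_matching_end {
    my ($tagname, @tokens) = @_;
    my $depth = 1;    # accounts for the opening tag itself
    for my $i (0 .. $#tokens) {
        my ($type, $name) = @{ $tokens[$i] };
        next unless defined $name && $name eq $tagname;
        $depth++ if $type eq 'S';
        $depth-- if $type eq 'E';
        return $i if $depth == 0;
    }
    return;
}

# Tokens following the outer <a> in the example above, with no separate
# nesting level pushed for <b>.
my @tokens = (
    ['S', 'b'], ['S', 'a'], ['S', 'a'], ['S', 'a'], ['E', 'b'], ['E', 'a'],
);

my $index  = find_matching_end('a', @tokens);
my $result = defined $index ? $index : 'none';
print "matching </a> at index: $result\n";
```

Here the three unclosed inner <a> tags are counted against the outer one, so the final </a> does not bring the depth back to zero. This is the "three tags deeper" case; pushing and popping a level for <b> would keep those inner tags out of the count.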

It is safe to call push_nest() twice for the same tag type.

pop_nest()

Pop a nesting level from the stack. Skips to the end of the current nesting level if necessary.
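Since pop_nest() skips any unread tokens, a level can be abandoned early. The following is a minimal sketch that collects only the first item of each list; the sample document and the @first_items array are assumptions made for the example.

```perl
use strict;
use warnings;
use HTML::PullParser::Nested;

# Sample document, assumed for illustration only.
my $html = "<ul><li>abcd<li>efgh</ul><ul><li>1<li>2</ul>";

my $p = HTML::PullParser::Nested->new(
    doc   => \$html,
    start => "'S',tagname",
    end   => "'E',tagname",
    text  => "'T',text",
);

my @first_items;
while (my $token = $p->get_token()) {
    next unless $token->[0] eq 'S' && $token->[1] eq 'ul';

    $p->push_nest($token);
    while (my $inner = $p->get_token()) {
        next unless $inner->[0] eq 'S' && $inner->[1] eq 'li';
        # The token after <li> is expected to be its text.
        my $text = $p->get_token() or last;
        push @first_items, $text->[1];
        last;    # stop after the first item
    }
    # Skips the remaining tokens of this list and returns to the outer level.
    $p->pop_nest();
}
print "$_\n" for @first_items;
```

After each pop_nest() the outer loop resumes at the closing </ul> tag, which it reads and ignores before looking for the next list.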

SEE ALSO

HTML::PullParser

AUTHOR

Christopher Key <cjk32@cam.ac.uk>

COPYRIGHT AND LICENCE

Copyright (C) 2010 Christopher Key <cjk32@cam.ac.uk>

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.