NAME

HTML::Content::HTMLTokenizer - Perl module to tokenize HTML documents.

SYNOPSIS

use HTML::Content::HTMLTokenizer;

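my $tokenizer = HTML::Content::HTMLTokenizer->new('TAG','WORD');

# Read the whole document into one string; fail loudly if it cannot be opened.
open(my $fh, '<', "index.html") or die "Cannot open index.html: $!";
my $doc = join("",<$fh>);
close($fh);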

my ($word_count_arr_ref,$tag_count_arr_ref,$token_type_arr_ref,$token_hash_ref) = $tokenizer->Tokenize($doc);

DESCRIPTION

HTML::Content::HTMLTokenizer has one main method, Tokenize, which tokenizes an HTML document into a sequence of 'TAG' and 'WORD' tokens.

Methods

my $tokenizer = new HTML::Content::HTMLTokenizer($tagMarker,$wordMarker)

Initializes HTML::Content::HTMLTokenizer.

$tagMarker - String that will represent tags in the token sequence returned from Tokenize.

$wordMarker - String that will represent words in the token sequence returned from Tokenize.

my (\@WordCount,\@TagCount,\@Sequence,\%Tokens) = $tokenizer->Tokenize(\$htmldocument);

$WordCount[$i] is the number of word tokens before or at the ith token in the input HTML document.

$TagCount[$i] is the number of tag tokens before or at the ith token in the input HTML document.

$Sequence[$i] is the type of token at the ith spot in the input HTML document. It is either $tagMarker or $wordMarker.

$Tokens{$i} is the word at the ith spot in the input HTML document. This is defined only if there is a word at the ith spot in the document.
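
The relationship between the four return values can be illustrated with a small standalone sketch. It does not call the module: the token sequence below is hand-built for a hypothetical document "<p>Hello world</p>", indices are assumed to start at 0 (the text above does not state the index base), and words_between is a hypothetical helper, not part of HTML::Content::HTMLTokenizer.

```perl
use strict;
use warnings;

# Hand-built token data for "<p>Hello world</p>" (illustration only;
# real data would come from $tokenizer->Tokenize).
my @Sequence = ('TAG', 'WORD', 'WORD', 'TAG');
my %Tokens   = (1 => 'Hello', 2 => 'world');

# Cumulative counts as described above: the number of word (or tag)
# tokens before or at position $i. 0-based indexing is an assumption.
my (@WordCount, @TagCount);
my ($words, $tags) = (0, 0);
for my $i (0 .. $#Sequence) {
    if ($Sequence[$i] eq 'WORD') { $words++ } else { $tags++ }
    $WordCount[$i] = $words;
    $TagCount[$i]  = $tags;
}

# Hypothetical helper: number of word tokens in positions $from..$to,
# inclusive, computed from the cumulative counts in constant time.
sub words_between {
    my ($from, $to) = @_;
    my $before = $from > 0 ? $WordCount[$from - 1] : 0;
    return $WordCount[$to] - $before;
}

print join(',', @WordCount), "\n";   # 0,1,2,2
print join(',', @TagCount), "\n";    # 1,1,1,2
print words_between(1, 2), "\n";     # 2
```

Because the counts are cumulative, the number of words or tags in any span of the document is a single subtraction, which is convenient when comparing the text density of different regions.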

AUTHOR

Jean Tavernier (jj.tavernier@gmail.com)

COPYRIGHT

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
