NAME

Alvis::Canonical - Perl extension for converting documents in various formats into the Alvis canonical format for documents

SYNOPSIS

use Alvis::Canonical;

# Create a new instance, specify the conversion of both numeric and 
# symbolic character entities to Unicode characters
my $C=Alvis::Canonical->new(convertCharEnts=>1,
                            convertNumEnts=>1);
if (!defined($C))
{
    die("Unable to instantiate Alvis::Canonical.");
}

# Convert an HTML document text in UTF-8 to the canonical format.
# Specify that you want the title and baseURL as well, if any can be
# determined.
my ($txt,$header)=$C->HTML($html,
                           {title=>1,
                            baseURL=>1});
if (!defined($txt))
{
   die $C->errmsg();
}

DESCRIPTION

The input is assumed to be in UTF-8 and to contain no NUL characters ('\0'); more precisely, any NULs present are assumed to carry no meaning and to be removable.
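The assumptions above can be met before calling the converter. What follows is a minimal sketch, not part of Alvis::Canonical: prepare_input() is a hypothetical helper that relies only on the core Encode module. It assumes the source encoding is known and that UTF-8 bytes (rather than a decoded character string) are what you want to hand on, which this document does not specify.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not part of Alvis::Canonical): turn raw bytes in a
# known encoding into UTF-8 bytes with all NUL characters removed.
sub prepare_input
{
    my ($bytes,$encoding)=@_;

    # Decode the raw bytes into Perl characters.
    my $chars=decode($encoding,$bytes);

    # Drop NULs, which are assumed to carry no meaning.
    $chars=~tr/\0//d;

    # Re-encode the cleaned text as UTF-8 bytes.
    return encode('UTF-8',$chars);
}

# Example: an ISO-8859-1 fragment containing a stray NUL.
my $html=prepare_input("<p>caf\xE9\0</p>",'iso-8859-1');
```

The resulting $html could then be passed on for conversion; if the source encoding is unknown, this step does not help and the guessing described under sourceEncoding would apply instead.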

METHODS

new()

Available options:

  warnings         Issue warnings about badly malformed original HTML
                   where we have to resort to a heuristic solution.
                   Prints a warning to STDERR documenting the error and
                   the solution. Default: no.
  convertCharEnts  Convert HTML symbolic character entities to UTF-8 
                   characters? Default: yes.
  convertNumEnts   Convert HTML numerical character entities to UTF-8 
                   characters? Default: yes.
  sourceEncoding   The encoding of the source documents. Default: undef,
                   which means it is guessed.
   
my $C=Alvis::Canonical->new(convertCharEnts=>1,
                            convertNumEnts=>1);
if (!defined($C))
{
    die("Unable to instantiate Alvis::Canonical.");
}

HTML($html,$options)

Converts dirty HTML to a valid Alvis canonicalDocument. $options is a hash reference that also serves as the mechanism for returning the title and base URL of the document: if their extraction is desired, set the fields 'title' and 'baseURL' to a defined value. If you know the encoding of the source document, set the option 'sourceEncoding', e.g.

my ($txt,$header)=$C->HTML($html,
                           {title=>1,
                            baseURL=>1,
                            sourceEncoding=>'iso-8859-2'});

errmsg()

Returns a stack of error messages, if any, or an empty string otherwise.

SEE ALSO

Alvis::Convert

AUTHOR

Kimmo Valtonen, <kimmo.valtonen@hiit.fi>

COPYRIGHT AND LICENSE

Copyright (C) 2006 by Kimmo Valtonen

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.