NAME

html2plain.pl - HTML to plain text converter

SYNOPSIS

  html2plain.pl [options] [source directory ...]

Options:

  --html-ext                filename extension identifying HTML files
  --out-ext                 output filename extension
  --out-dir                 output directory
  --N-per-out-dir           # of records per output directory
  --source-encoding         the encoding of the HTML files
  --[no]assert-html         assert that the document is HTML
  --[no]symbolic-char-entities-to-chars
                            convert symbolic character entities to UTF-8
                            characters
  --[no]numerical-char-entities-to-chars
                            convert numerical character entities to UTF-8
                            characters
  --[no]clean-whitespace    remove redundant whitespace
  --[no]assert-assumptions  assert that the document is in UTF-8 and contains
                            no '\0's before actually converting to plain text
  --help                    brief help message
  --man                     full documentation
  --[no]warnings            warnings output flag
  

OPTIONS

--html-ext
Sets the filename extension that identifies HTML files.
Default value: 'html'.
--out-ext
Sets the output filename extension. 
Default value: 'plain'.
--out-dir
Sets the output directory. Default value: '.'.
--N-per-out-dir
Sets the number of records per output directory. Default value: 1000.
--source-encoding
Specifies the encoding of the HTML files. Default value: undef,
which means that the encoding is guessed for each document.
--[no]assert-html
Specifies whether it is asserted that the document actually looks like
HTML before trying to convert. Default: yes.
--[no]symbolic-char-entities-to-chars
Specifies whether symbolic character entities are converted to 
UTF-8 characters. Default: yes.
--[no]numerical-char-entities-to-chars
Specifies whether numerical character entities are converted to 
UTF-8 characters. Default: yes.
--[no]clean-whitespace
Specifies whether redundant whitespace is removed from the output.
Default: yes.
--[no]assert-assumptions
Specifies whether assumptions about the source are validated before
trying to convert: that it is in UTF-8 (to which it is converted
internally) and that it contains no '\0's. Default: yes.
--help
Prints a brief help message and exits.
--man
Prints the manual page and exits.
--[no]warnings
Specifies whether warnings are output. Default: yes.
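
The effect of the two character entity options and of --clean-whitespace
can be pictured with the Perl sketch below. It is not taken from
html2plain.pl: the helper names, the small entity table, and the exact
whitespace rules are assumptions chosen for illustration, and the script
may behave differently in detail.

```perl
use strict;
use warnings;

# Illustrative subset of symbolic entities (an assumption; a real table is much larger).
my %SYMBOLIC = (
    amp  => '&',      lt     => '<',      gt => '>', quot => '"',
    nbsp => "\x{A0}", eacute => "\x{E9}",
);

# Hypothetical helper: decode entities in a single pass, so that text such as
# "&amp;lt;" is decoded once (to "&lt;") and not twice.
# Named arguments: numerical => 0|1, symbolic => 0|1.
sub decode_entities_sketch {
    my ($text, %opt) = @_;
    $text =~ s{&(?:#[xX]([0-9A-Fa-f]+)|#([0-9]+)|([A-Za-z][A-Za-z0-9]*));}{
        my $whole = $&;
        defined $1 ? ($opt{numerical} ? chr(hex($1)) : $whole)   # hexadecimal form
      : defined $2 ? ($opt{numerical} ? chr($2)      : $whole)   # decimal form
      : ($opt{symbolic} && exists $SYMBOLIC{$3} ? $SYMBOLIC{$3} : $whole);
    }ge;
    return $text;
}

# Hypothetical helper: one possible reading of "remove redundant whitespace".
sub clean_whitespace_sketch {
    my ($text) = @_;
    $text =~ s/[ \t]+/ /g;      # runs of spaces and tabs become one space
    $text =~ s/^ | $//mg;       # drop a leading or trailing space on each line
    $text =~ s/\n{3,}/\n\n/g;   # keep at most one blank line in a row
    return $text;
}

# Small demonstration on a made-up input string.
binmode(STDOUT, ':encoding(UTF-8)');
my $sample = "caf&eacute;   &#233;\t&#xE9;  \n\n\n\n&amp;lt;";
print clean_whitespace_sketch(
    decode_entities_sketch($sample, numerical => 1, symbolic => 1)
), "\n";
```

Passing only one of the two named arguments leaves the other kind of
entity untouched, which mirrors how the --no... forms of the two options
are described above.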

DESCRIPTION

Recursively traverses the source directories and converts the textual
content of each HTML file found to a plain text file.
The output is in UTF-8.
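
The recursive traversal described here can be sketched in Perl with the
core File::Find module. This is an illustration only, not the code of
html2plain.pl: the helper names are hypothetical, and the way the script
names its output files or distributes them over output directories
(--N-per-out-dir) is not shown, because it is not documented above.

```perl
use strict;
use warnings;
use File::Find;
use File::Basename qw(fileparse);
use File::Spec;

# Hypothetical helper: list the regular files under the given directories
# whose names end in ".$html_ext" (compare --html-ext).
sub find_html_files {
    my ($html_ext, @dirs) = @_;
    my @found;
    find(
        sub {
            # Inside the callback, $_ is the file's basename and
            # $File::Find::name is its path relative to the start directory.
            push @found, $File::Find::name if -f $_ && /\.\Q$html_ext\E\z/;
        },
        @dirs
    );
    my @sorted = sort @found;
    return @sorted;
}

# Hypothetical helper: swap the HTML extension for the output extension and
# place the result directly in $out_dir (an assumed, simplified layout).
sub output_path {
    my ($source, $html_ext, $out_ext, $out_dir) = @_;
    my ($base) = fileparse($source, qr/\.\Q$html_ext\E/);
    return File::Spec->catfile($out_dir, "$base.$out_ext");
}

# Demonstration: list the planned conversions for directories given as arguments.
if (@ARGV) {
    for my $file (find_html_files('html', @ARGV)) {
        print "$file -> ", output_path($file, 'html', 'plain', '.'), "\n";
    }
}
```

The values 'html', 'plain', and '.' in the demonstration are the
documented defaults of --html-ext, --out-ext, and --out-dir.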