NAME
html2plain.pl - HTML to plain text converter
SYNOPSIS
html2plain.pl [options] [source directory ...]
Options:
--html-ext                  filename extension that identifies HTML files
--out-ext                   output filename extension
--out-dir                   output directory
--N-per-out-dir             number of records per output directory
--source-encoding           the encoding of the HTML files
--[no]assert-html           assert that the document is HTML
--[no]symbolic-char-entities-to-chars
                            convert symbolic character entities to UTF-8
                            characters
--[no]numerical-char-entities-to-chars
                            convert numerical character entities to UTF-8
                            characters
--[no]clean-whitespace      remove redundant whitespace
--[no]assert-assumptions    assert that the document is in UTF-8 and contains
                            no '\0's before actually converting to plain text
--help                      brief help message
--man                       full documentation
--[no]warnings              output warnings
OPTIONS
- --html-ext
Sets the filename extension that identifies HTML files. Default value: 'html'.
- --out-ext
Sets the output filename extension. Default value: 'plain'.
- --out-dir
Sets the output directory. Default value: '.'.
- --N-per-out-dir
Sets the number of records per output directory. Default value: 1000.
- --source-encoding
Specifies the encoding of the HTML files. Default value: undef, which means that the encoding is guessed for each document.
- --[no]assert-html
Specifies whether to assert that the document actually looks like HTML before trying to convert it. Default: yes.
- --[no]symbolic-char-entities-to-chars
Specifies whether symbolic character entities are converted to UTF-8 characters. Default: yes.
- --[no]numerical-char-entities-to-chars
Specifies whether numerical character entities are converted to UTF-8 characters. Default: yes.
- --[no]clean-whitespace
Specifies whether redundant whitespace is removed from the output. Default: yes.
- --[no]assert-assumptions
Specifies whether assumptions about the source are validated before trying to convert it: that it is in UTF-8 (which it is converted to internally) and contains no '\0's. Default: yes.
- --help
Prints a brief help message and exits.
- --man
Prints the manual page and exits.
- --[no]warnings
Specifies whether warnings are output. Default: yes.
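The --[no]name notation means that a flag can be switched on with --name and off with --noname. The following is a minimal sketch of how such negatable flags are commonly declared with Getopt::Long. That html2plain.pl itself uses Getopt::Long is an assumption, the helper parse_args is hypothetical, and only a subset of the options is shown.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper: parse a list of arguments into an option hash.
# The defaults mirror the ones documented in this section.
sub parse_args {
    my @args = @_;
    my %opt = (
        'html-ext'         => 'html',
        'out-ext'          => 'plain',
        'out-dir'          => '.',
        'assert-html'      => 1,
        'clean-whitespace' => 1,
        'warnings'         => 1,
    );
    GetOptionsFromArray(
        \@args, \%opt,
        'html-ext=s',          # "=s": the option takes a string value
        'out-ext=s',
        'out-dir=s',
        'assert-html!',        # "!": the flag is negatable (--noassert-html)
        'clean-whitespace!',
        'warnings!',
    ) or die "bad options\n";
    # Whatever remains in @args is treated as the source directories.
    return (\%opt, \@args);
}

my ($opt, $dirs) = parse_args(qw(--out-ext txt --noclean-whitespace site));
print "out-ext=$opt->{'out-ext'} clean-whitespace=$opt->{'clean-whitespace'} dirs=@$dirs\n";
```

Pre-populating the hash with the defaults keeps every documented default in one place, and an option that is not given on the command line simply keeps its default.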
DESCRIPTION
Recursively processes the HTML files under each source directory
and writes their textual content to plain text files.
The output is in UTF-8.
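The recursive file discovery described above can be sketched with the core File::Find module. This is only an illustration under stated assumptions: the helper find_html_files is hypothetical, and the script's actual traversal code may differ. The extension argument corresponds to the value of --html-ext.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: return all regular files under @dirs whose
# name ends in ".$ext", sorted by path.
sub find_html_files {
    my ($ext, @dirs) = @_;
    my @found;
    find(
        {
            # With no_chdir set, $_ holds the full path of each entry.
            wanted => sub {
                push @found, $File::Find::name
                    if -f $_ && /\.\Q$ext\E\z/;
            },
            no_chdir => 1,
        },
        @dirs
    );
    @found = sort @found;
    return @found;
}

# Usage: list the HTML files under the given directories (default: '.').
print "$_\n" for find_html_files('html', @ARGV ? @ARGV : '.');
```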