NAME
untemplate - analyze several HTML documents based on the same template
VERSION
version 0.019
SYNOPSIS
untemplate [options] HTML1 HTML2 [HTML3] [...]
DESCRIPTION
Takes multiple HTML documents generated using the same template and attempts to extract only the data inserted into the original template.
Also accepts URLs if the AnyEvent::Net::Curl::Queued module is installed.
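The idea can be illustrated with a toy sketch in Python. This is not how untemplate itself is implemented (it is a Perl tool and its internals are not described here). The sketch assumes each document has already been flattened into a mapping from node path to text. The names doc1, doc2 and varying_nodes are hypothetical. Nodes whose text is identical in every document are treated as template boilerplate, and nodes whose text differs are treated as inserted data.

```python
def varying_nodes(docs):
    """Return {path: [text per document]} for paths that are present in
    every document and whose text is not identical across all of them."""
    # Paths shared by all documents (an assumption modelled on the
    # default, non --partial behavior described under OPTIONS).
    common = set(docs[0]).intersection(*docs[1:])
    result = {}
    for path in sorted(common):
        texts = [doc[path] for doc in docs]
        # Identical text everywhere is assumed to come from the template.
        if len(set(texts)) > 1:
            result[path] = texts
    return result


# Hypothetical flattened documents: node path -> text content.
doc1 = {
    "/html/head/title": "Quote #1839",
    "/html/body/h1": "QDB",
    "/html/body/p": "first quote",
}
doc2 = {
    "/html/head/title": "Quote #2486",
    "/html/body/h1": "QDB",
    "/html/body/p": "second quote",
}

for path, texts in varying_nodes([doc1, doc2]).items():
    print(path, texts)
```

Here the shared heading is dropped, while the title and the paragraph survive because their text differs between the two documents.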
OPTIONS
- --help
-
Show this help message.
- --encoding=name
-
Specify the HTML document encoding (latin1, utf8). UTF-8 is assumed by default.
- --[no]color
-
Enable syntax highlighting for XPath. By default, enabled automatically on interactive terminals.
- --16
-
Use the 16 system colors. By default, untemplate tries to use the 256-color ANSI palette.
- --[no]html
-
Disables the --color option and highlights using HTML/CSS.
- --[no]partial
-
Enable the display of "partial" templates, that is, nodes present in only some of the documents. By default, only the nodes present in all documents are displayed.
- --[no]shrink
-
Shrink the XPath to the minimal unique identifier. For example:
/html/body[@id='cpansearch']/form[@class='searchbox']/input[@name='query']
could be shortened to:
//input[@name='query']
The shrinking is enabled by default.
- --[no]strict
-
Strict mode disables grouping by id, class or name attributes. The grouping is enabled by default.
- --unmangle=regex
-
Specify regex(es) to unmangle id/class attributes. Some CMSes (e.g. WordPress) insert unique identifiers into HTML elements, like:
<body class="post-id-12345">
This tends to break HTML tree analysis. To fix the above case, use --unmangle 'post-id-\d+'. Multiple unmanglers are accepted (--unmangle a --unmangle b).
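To see why the shortened XPath from the --shrink example is still a unique identifier, the following Python sketch evaluates both forms against a small stand-in page. The markup is hypothetical (a well-formed approximation, not the real page). It uses the limited XPath subset of the standard xml.etree.ElementTree module, not untemplate's own code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page in the --shrink example.
PAGE = """
<html>
  <body id="cpansearch">
    <form class="searchbox">
      <input name="query"/>
      <input name="mode"/>
    </form>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())  # root is the <html> element

# ElementTree paths are relative to the root element, so the leading
# "/html" of the full XPath becomes "." here.
full = root.findall(
    "./body[@id='cpansearch']/form[@class='searchbox']/input[@name='query']"
)
short = root.findall(".//input[@name='query']")

# Both expressions select exactly one node, and it is the same node.
print(len(full), len(short), full[0] is short[0])
```

The short form stays unambiguous only while no other input element carries name='query', which is presumably why the shrinking stops at the minimal unique identifier.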
EXAMPLES
untemplate --color 'http://bash.org/?1839' 'http://bash.org/?2486' | less -R
CAVEATS
If you try to untemplate HTML documents that are not based on the same template, the results will be empty.
Unfortunately, employing any kind of document identifier as part of an element's class/id (common practice in WordPress themes) is enough to make documents count as "not the same template".
See the --unmangle option for a work-around.
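The effect of an unmangler can be sketched in Python. The sketch assumes that unmangling simply deletes the matched text from the attribute value. The exact behavior of untemplate is not spelled out here, and the unmangle helper and the sample class values are hypothetical.

```python
import re

# The pattern from the --unmangle example.
UNMANGLE = re.compile(r"post-id-\d+")


def unmangle(value, patterns=(UNMANGLE,)):
    """Delete every match of each pattern from an id/class value and
    normalize the remaining whitespace (assumed behavior)."""
    for pattern in patterns:
        value = pattern.sub("", value)
    return " ".join(value.split())


# Hypothetical class attributes taken from two posts of the same theme.
page1 = "single post-id-12345 logged-out"
page2 = "single post-id-67890 logged-out"

print(unmangle(page1))
print(unmangle(page1) == unmangle(page2))
```

Once the per-document identifier is removed, both class values compare equal, so the two elements can be recognized as the same template node.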
AUTHOR
Stanislaw Pusep <stas@sysd.org>
COPYRIGHT AND LICENSE
This software is copyright (c) 2014 by Stanislaw Pusep.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.