NAME
HTML::Laundry - Perl module to clean HTML by the piece
VERSION
Version 0.0102
SYNOPSIS
#!/usr/bin/perl -w
use strict;
use HTML::Laundry;
my $laundry = HTML::Laundry->new();
my $snippet = q{
<P STYLE="font-size: 300%"><BLINK>"You may get to touch her<BR>
If your gloves are sterilized<BR></BR>
Rinse your mouth with Listerine</BR>
Blow disinfectant in her eyes"</BLINK><BR>
-- X-Ray Spex, <I>Germ-Free Adolescents<I>
<SCRIPT>alert('!!');</SCRIPT>
};
my $germfree = $laundry->clean($snippet);
# $germfree is now:
# <p>"You may get to touch her<br />
# If your gloves are sterilized<br />
# Rinse your mouth with Listerine<br />
# Blow disinfectant in her eyes"<br />
# -- X-Ray Spex, <i>Germ-Free Adolescents</i></p>
DESCRIPTION
HTML::Laundry is an HTML::Parser-based HTML normalizer, meant for small pieces of HTML, such as user comments, Atom feed entries, and the like, rather than full pages. Laundry takes these and returns clean, sanitary, UTF-8-based XHTML. The parser's behavior may be changed with callbacks, and the whitelist of acceptable elements and attributes may be updated on the fly.
A snippet is cleaned several ways:
Normalized, using HTML::Parser: attributes and elements will be lowercased, empty elements such as <img /> and <br /> will be forced into the empty tag syntax if needed, and unknown attributes and elements will be stripped.
Sanitized, using an extensible whitelist of valid attributes and elements based on Mark Pilgrim and Aaron Swartz's work on sanitize.py: tags and attributes which are known to be possible attack vectors are removed.
Tidied, using HTML::Tidy or HTML::Tidy::libXML (as available): unclosed tags will be closed and the output generally neatened; future versions may also use tidying to deal with character encoding issues.
Optionally rebased, to turn relative URLs in attributes into absolute ones.
HTML::Laundry provides mechanisms to extend the list of known allowed (and disallowed) tags, along with callback methods to allow scripts using HTML::Laundry to extend the behavior in various ways. Future versions may provide additional options for altering the rules used to clean snippets.
Out of the box, HTML::Laundry does not currently know about the <head> tag and its children. For sanitizing full HTML pages, consider using HTML::Scrubber or HTML::Defang.
FUNCTIONS
new
Create an HTML::Laundry object.
my $l = HTML::Laundry->new();
Takes an optional anonymous hash of arguments:
base_uri
This turns relative URIs, as in <img src="surly_otter.png">, into absolute URIs, as for use in feed parsing.
my $l = HTML::Laundry->new({ base_uri => 'http://example.com/foo/' });
notidy
Disable use of HTML::Tidy or HTML::Tidy::libXML, even if they are available on your system.
my $l = HTML::Laundry->new({ notidy => 1 });
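The two options can be combined in one call. The following is a minimal sketch, assuming only the base_uri and notidy keys described above; the base URL and image name are placeholders, and no output is shown because the exact serialization may differ between versions. The require is wrapped in eval only so that the script exits quietly where HTML::Laundry is not installed.

```perl
use strict;
use warnings;

# Options documented above: base_uri enables rebasing, notidy skips Tidy.
my %options = (
    base_uri => 'http://example.com/foo/',    # placeholder base URI
    notidy   => 1,
);

if ( eval { require HTML::Laundry; 1 } ) {
    my $l = HTML::Laundry->new( \%options );

    # The relative src should come back resolved against base_uri.
    print $l->clean(q{<img src="surly_otter.png" />}), "\n";
}
```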
initialize
Instantiates the Laundry object properties based on an HTML::Laundry::Rules module.
add_callback
Adds a callback of type "start_tag", "end_tag", "text", "uri", or "output" to the appropriate internal array.
$l->add_callback('start_tag', sub {
my ($laundry, $tagref, $attrhashref) = @_;
# Now, perform actions and return
});
A start_tag, end_tag, text, or uri callback that returns a false value suppresses the output for the item it is processing; this allows additional checks to be done (for instance, images can be allowed only from whitelisted source domains).
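The image-whitelisting case mentioned above might be sketched as follows. This assumes that $tagref is a reference to the tag name and that $attrhashref is a reference to the attribute hash, as the argument names suggest; the host list and the variable names are hypothetical, and the host is extracted with a simple pattern rather than a full URI parser.

```perl
use strict;
use warnings;

# Hypothetical whitelist of hosts that may serve images.
my %trusted_host = map { $_ => 1 } qw( example.com images.example.com );

# start_tag callback: a false return suppresses the tag, a true return keeps it.
# Assumption: $tagref refers to the tag name, $attrhashref to the attributes.
my $only_trusted_images = sub {
    my ( $laundry, $tagref, $attrhashref ) = @_;
    return 1 unless lc ${$tagref} eq 'img';    # other tags pass through
    my $src = $attrhashref->{src};
    return 0 unless defined $src;

    # Skip any user:password@ prefix so it cannot masquerade as the host.
    my ($host) = $src =~ m{\Ahttps?://(?:[^/?#@]*@)?([^/:?#@]+)}i;
    return 0 unless defined $host;             # relative or non-HTTP src
    return $trusted_host{ lc $host } ? 1 : 0;
};

# The callback can be exercised directly, without a Laundry object.
for my $src ( 'http://example.com/a.png', 'http://evil.test/a.png' ) {
    my $verdict = $only_trusted_images->( undef, \'img', { src => $src } );
    print "$src: ", ( $verdict ? 'kept' : 'suppressed' ), "\n";
}
```

Registering it is then a single call, $l->add_callback('start_tag', $only_trusted_images), and $l->clear_callback('start_tag') removes it again. Relative image URLs are rejected by this sketch, which is the conservative choice when no base_uri is set.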
clear_callback
Removes all callbacks of given type.
$l->clear_callback('start_tag');
clean
Cleans a snippet of HTML, using the ruleset and object creation options given to the Laundry object. The snippet should be passed as a scalar.
$output1 = $l->clean( '<p>The X-rays were penetrating' );
$output2 = $l->clean( $snippet );
base_uri
Used to get or set the base_uri property, used in URI rebasing.
my $base_uri = $l->base_uri; # returns current base_uri
$l->base_uri(q{http://example.com}); # returns 'http://example.com'
$l->base_uri(''); # unsets base_uri
gen_output
Used to generate the final, XHTML output from the internal stack of text and tag tokens. Generally meant to be used internally, but potentially useful for callbacks that require a snapshot of what the output would look like before the cleaning process is complete.
my $xhtml = $l->gen_output;
empty_elements
Returns a list of the Laundry object's known empty elements: elements such as <img /> or <br /> which must not contain any children.
remove_empty_element
Removes an element (or, if given an array reference, multiple elements) from the "empty elements" list maintained by the Laundry object.
$l->remove_empty_element(['img', 'br']); # Let's break XHTML!
This will not affect the acceptable/unacceptable status of the elements.
acceptable_elements
Returns a list of the Laundry object's known acceptable elements, which will not be stripped during the sanitizing process.
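Since the accessor is documented as returning a plain list, a lookup hash is a convenient way to test membership. The sketch below relies only on that list return value; the helper name is_listed is hypothetical and not part of HTML::Laundry, and the require is guarded so the script still runs where the module is absent.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $name appears in @list (case-insensitive).
sub is_listed {
    my ( $name, @list ) = @_;
    my %seen = map { lc $_ => 1 } @list;
    return $seen{ lc $name } ? 1 : 0;
}

if ( eval { require HTML::Laundry; 1 } ) {
    my $l = HTML::Laundry->new();
    for my $tag (qw( p blink script )) {
        my $status = is_listed( $tag, $l->acceptable_elements )
            ? 'acceptable'
            : 'not acceptable';
        printf "%-6s %s\n", $tag, $status;
    }
}
```

The same helper should work with the lists returned by empty_elements, unacceptable_elements, and acceptable_attributes.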
add_acceptable_element
Adds an element (or, if given an array reference, multiple elements) to the "acceptable elements" list maintained by the Laundry object. Items added in this manner will automatically be removed from the "unacceptable elements" list if they are present.
$l->add_acceptable_element('style');
Elements which are empty may be flagged as such with an optional argument. If this flag is set, all elements provided by the call will be added to the "empty element" list.
$l->add_acceptable_element(['applet', 'script'], { empty => 1 });
remove_acceptable_element
Removes an element (or, if given an array reference, multiple elements) from the "acceptable elements" list maintained by the Laundry object. These items (although not their child elements) will now be stripped during parsing.
$l->remove_acceptable_element(['img', 'h1', 'h2']);
$l->clean(q{<h1>The Day the World Turned Day-Glo</h1>});
# returns 'The Day the World Turned Day-Glo'
unacceptable_elements
Returns a list of the Laundry object's unacceptable elements, which will be stripped -- including child objects -- during the cleaning process.
add_unacceptable_element
Adds an element (or, if given an array reference, multiple elements) to the "unacceptable elements" list maintained by the Laundry object.
$l->add_unacceptable_element(['h1', 'h2']);
$l->clean(q{<h1>The Day the World Turned Day-Glo</h1>});
# returns null string
remove_unacceptable_element
Removes an element (or, if given an array reference, multiple elements) from the "unacceptable elements" list maintained by the Laundry object. Note that this does not automatically add the element to the acceptable_element list.
$l->clean(q{<script>alert('!')</script>});
# returns null string
$l->remove_unacceptable_element( q{script} );
$l->clean(q{<script>alert('!')</script>});
# returns "alert('!')"
acceptable_attributes
Returns a list of the Laundry object's known acceptable attributes, which will not be stripped during the sanitizing process.
add_acceptable_attribute
Adds an attribute (or, if given an array reference, multiple attributes) to the "acceptable attributes" list maintained by the Laundry object.
my $snippet = q{ <p austen:id="3">"My dear Mr. Bennet," said his lady to
him one day, "have you heard that <span austen:footnote="netherfield">
Netherfield Park</span> is let at last?"</p>
};
$l->clean( $snippet );
# returns:
# <p>"My dear Mr. Bennet," said his lady to him one day,
# "have you heard that <span>Netherfield Park</span> is let at
# last?"</p>
$l->add_acceptable_attribute(['austen:id', 'austen:footnote']);
$l->clean( $snippet );
# returns:
# <p austen:id="3">"My dear Mr. Bennet," said his lady to him
# one day, "have you heard that <span austen:footnote="netherfield">
# Netherfield Park</span> is let at last?"</p>
remove_acceptable_attribute
Removes an attribute (or, if given an array reference, multiple attributes) from the "acceptable attributes" list maintained by the Laundry object.
$l->clean(q{<p id="plugh">plover</p>});
# returns '<p id="plugh">plover</p>'
$l->remove_acceptable_attribute( q{id} );
$l->clean(q{<p id="plugh">plover</p>});
# returns '<p>plover</p>'
SEE ALSO
There are a number of tools designed for sanitizing HTML, some of which may be better suited than HTML::Laundry to particular circumstances. In addition to HTML::Scrubber, you may want to consider HTML::StripScripts::Parser, an HTML::Parser-based module designed solely for the purposes of sanitizing HTML from potential XSS attack vectors; HTML::Defang, a whitelist-based, pure-Perl module; or HTML::Restrict, an HTML tag whitelist using HTML::Parser.
AUTHOR
Steve Cook, <scook at sixapart.com>
BUGS
Please report any bugs or feature requests on the GitHub page for this project, http://github.com/snark/html-laundry.
ACKNOWLEDGMENTS
Thanks to Dave Cross and Vera Tobin.
SUPPORT
You can find documentation for this module with the perldoc command.
perldoc HTML::Laundry
COPYRIGHT & LICENSE
Copyright 2009 Six Apart, Ltd., all rights reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.