NAME

HTML::Manipulator - Perl extension for manipulating HTML files

SYNOPSIS

use HTML::Manipulator;

my $html = <<HTML;
  <h1 id=title>Old news</h1>
  <a href='http://www.google.com' id=link>Google</a>....
HTML

# replace a tag's content
my $new = HTML::Manipulator::replace($html, title => 'New news');

# replace a tag's attribute and content
$new = HTML::Manipulator::replace($html, link => {
    _content => 'Slashdot',
    href     => 'http://www.slashdot.org/',
});

# extract a tag's content
my $content = HTML::Manipulator::extract_content($html, 'link');

# extract a tag's content and attributes
my $tag = HTML::Manipulator::extract($html, 'link');
  # returns a hash ref like
  # { href => 'http://www.google.com', id => 'link', _content => 'Google' }

DESCRIPTION

This module manipulates the contents of HTML files. It can query and replace the content or attributes of any HTML tag.

The intended use is updating static HTML files.

ANOTHER TEMPLATE ENGINE ? NO !

HTML::Manipulator is NOT yet another templating module. There are, for example, no template files. It works on normal HTML files without any special markup (you only have to give element IDs to tags you are interested in).

While you could probably use this module for producing your web application's output, DON'T. It does not offer a lot of features for this area (no loops, no conditionals, no includes) and is not optimized for performance. Have a look at HTML::Template instead.

ABOUT THE INPUT HTML FILES

HTML::Manipulator is meant to work on real-life HTML files (in all their non-standards-compliant ugliness). It uses the HTML::Parser module to find elements (tags) inside those files, which you can then replace or modify. All you have to do is give those elements a DOM ID, for example

<h3 id=headline77>Headline</h3>

No other markup is necessary.

Malformed HTML (is fine)

HTML::Manipulator tries to cope with malformed input data. All you have to ensure is that you properly close the element you are working on (any other tags can be unbalanced) and that the IDs are unique. It will also preserve the content outside the element you asked it to operate on. It does not rewrite your HTML any more than it has to.

Case insensitivity issues

HTML is case insensitive in its tag and attribute names. This means that

<h3 id=headline77>Headline</h3>

and

<H3 iD=headline77>Headline</h3>

are treated as identical.

However, HTML::Manipulator respects case when comparing the IDs of elements (this matches the HTML specification, in which id values are case-sensitive), so you can NOT address the h3 element above as HeadLine77.

When HTML::Manipulator has to rewrite tags (this happens when you ask it to change element attributes) it will output the tag and attribute names as lower-case. It will also rearrange their order. When changing only the content of an element, it preserves the original opening and closing tags.

FUNCTIONS TO CHANGE CONTENT

You can change the content or attributes of any HTML element with an attached ID.

Replace the content of one element

my $new = HTML::Manipulator::replace($html, title => 'New news');

The function takes as input the HTML data and returns the modified data (as a long scalar).

Replace the content of many elements

my $new = HTML::Manipulator::replace($html, 
   title => 'New news', headline77=>'All clear?');

You can pass several ID/content pairs to the same call. The caveat is that if those elements are nested, only the replacement for the outermost element is applied: its complete content is replaced with the new content, eliminating any nested tags. Even if the new content itself contains elements with IDs, these are not processed further. There is no recursion.

Replace attribute values

If you want to replace attribute values (such as a link href), you use the same function described above, but pass a hashref instead of the string with the new content:

my $new = HTML::Manipulator::replace($html, link => { 
     href=>'http://www.slashdot.org/' }
);

The hashref can contain as many key/value pairs as you want. Any attributes that you specify here will appear in the output HTML. Any attributes that you do not specify will retain their old value.

Replace attribute values and content

You can also change content and attributes at the same time, by adding the special "attribute" _content to the attribute hashref.

my $new = HTML::Manipulator::replace($html, link => {
    _content => 'Slashdot',
    href     => 'http://www.slashdot.org/',
});

Replace the document title

You can set the document title (the text between the <title> tags) like this:

my $new = HTML::Manipulator::replace_title($html, 'new title');

FUNCTIONS TO EXTRACT CONTENT

In addition to replacing parts of the HTML document, you can also query it for the current content.

Extract the content of an element

  my $content = HTML::Manipulator::extract_content($html, 'link');

gives you a scalar containing the content of the tag with the ID 'link'.

Extract the content of all elements

my $content = HTML::Manipulator::extract_all_content($html);

gives you a hashref with all element IDs as keys and their contents as values.

Extract the content and attributes

my $content = HTML::Manipulator::extract($html, 'link');

gives you a hashref with information about the tag with the ID 'link'. There is a key for every attribute in the tag, and the special key '_content' which contains the content. The structure of the hashref is identical to what you would use when calling the replace function.
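Because the extracted hashref has the same shape that replace accepts, one plausible workflow is to extract a tag, override a few keys, and hand the result back. The sketch below only shows the hashref manipulation. The helper name with_changes is hypothetical, the sample data stands in for a real extract call, and whether replace tolerates receiving the unchanged id key back is an assumption worth checking against your version of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: copy an extracted tag description and override some keys.
# The original hashref is left untouched.
sub with_changes {
    my ($tag, %changes) = @_;
    return { %$tag, %changes };
}

# Sample data in the shape the SYNOPSIS shows for extract($html, 'link').
my $tag = {
    href     => 'http://www.google.com',
    id       => 'link',
    _content => 'Google',
};

my $updated = with_changes($tag,
    href     => 'http://www.slashdot.org/',
    _content => 'Slashdot',
);

# With the module installed, the round trip would then look roughly like:
#   my $tag = HTML::Manipulator::extract($html, 'link');
#   my $new = HTML::Manipulator::replace($html, link => with_changes($tag, href => $url));
print "$_ => $updated->{$_}\n" for sort keys %$updated;
```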

There is also a function to get information about all elements (that have an ID):

my $content = HTML::Manipulator::extract_all($html);

This returns a hashref of hashrefs, so that you could get the href of the "link" element like $content->{link}{href}.
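As an illustration of walking that nested structure, the following sketch collects the href of every element that has one. The helper name hrefs_by_id is hypothetical, and the sample data is hand-written in the shape described above rather than produced by the module.

```perl
use strict;
use warnings;

# Hypothetical helper: map element ID => href for every element with an href.
# $all is a hashref of hashrefs, as extract_all is described to return.
sub hrefs_by_id {
    my ($all) = @_;
    my %hrefs;
    for my $id (keys %$all) {
        my $element = $all->{$id};
        $hrefs{$id} = $element->{href} if defined $element->{href};
    }
    return \%hrefs;
}

# Sample data standing in for HTML::Manipulator::extract_all($html).
my $all = {
    title => { id => 'title', _content => 'Old news' },
    link  => { id => 'link', href => 'http://www.google.com', _content => 'Google' },
};

my $links = hrefs_by_id($all);
print "$_ -> $links->{$_}\n" for sort keys %$links;
```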

Extract some elements

You can selectively use the extract_all* functions by passing in the IDs you are interested in. This is optional. The default returns data for all elements with IDs.

    my $content = HTML::Manipulator::extract_all_content
        ($html, 'one', 'two', 'three');

You can also mix in regular expressions. Any elements with IDs that match will be returned. This way you can also achieve case-insensitivity with IDs.

my $content = HTML::Manipulator::extract_all_content
   ($html, qr/^...$/i, 'two', qr/^some.*/);
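To make the filter list above easier to read, here is a small stand-alone sketch of the selection rule as the text describes it: plain strings name an ID exactly, while qr// patterns select every ID they match. The helper id_selected is hypothetical and only approximates the described behaviour; it is not the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical helper approximating the selection rule described above.
# Plain strings must equal the ID exactly (case-sensitive);
# qr// objects are applied as patterns.
sub id_selected {
    my ($id, @filters) = @_;
    for my $filter (@filters) {
        if (ref $filter eq 'Regexp') {
            return 1 if $id =~ $filter;
        }
        else {
            return 1 if $id eq $filter;
        }
    }
    return 0;
}

# The same filters as in the call above: any three-character ID,
# the ID 'two', and any ID starting with 'some'.
my @filters = (qr/^...$/i, 'two', qr/^some.*/);
for my $id (qw(one two three something)) {
    printf "%-10s %s\n", $id, id_selected($id, @filters) ? 'selected' : 'skipped';
}
```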

Find out all element IDs

You can query for a list of all element IDs and their tag type.

$data = HTML::Manipulator::extract_all_ids($html);

This returns a hashref where the element IDs of the document are the keys. The associated value is the type of the element (the tag type, such as div, span, a), which is returned as lowercase.

You can filter this in the same way as with the extract_all_content function above:

$data = HTML::Manipulator::extract_all_ids($html, 
    qr/^...$/i, 'two', qr/^some.*/);
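One thing you might do with the ID-to-tag hashref is invert it, to see which IDs belong to each tag type. The sketch below assumes only the hashref shape described above. The helper name ids_by_tag is hypothetical, and the sample data is hand-written rather than produced by the module.

```perl
use strict;
use warnings;

# Hypothetical helper: invert an id => tag hashref into tag => [sorted ids].
sub ids_by_tag {
    my ($ids) = @_;
    my %by_tag;
    push @{ $by_tag{ $ids->{$_} } }, $_ for sort keys %$ids;
    return \%by_tag;
}

# Sample data standing in for HTML::Manipulator::extract_all_ids($html).
my $ids = { title => 'h1', link => 'a', footer_link => 'a' };

my $by_tag = ids_by_tag($ids);
for my $tag (sort keys %$by_tag) {
    print "$tag: @{ $by_tag->{$tag} }\n";
}
```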

Find out the document title

my $title = HTML::Manipulator::extract_title($html);

USING FILEHANDLES

You can also call all of the above functions with a file handle instead of the string holding the HTML. HTML::Manipulator (or HTML::Parser deeper down the line) will read from the file.

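use FileHandle;
my $fh = FileHandle->new('myfile.html') or die "Cannot open myfile.html: $!";
my $new = HTML::Manipulator::replace($fh, title => 'New news');

# or, with a plain glob:
open IN, '<', 'myfile.html' or die "Cannot open myfile.html: $!";
my $new_from_glob = HTML::Manipulator::replace(*IN, title => 'New news');
close IN;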

HTML::Manipulator will only read from the file handles you give it. It does not change them, and it does not open them: you have to do that yourself. Alternatively, you can use HTML::Manipulator::Document, which does open files.

EXPORT

The module exports none of its functions. You have to call them with the full module name as a prefix.

If you want an object-oriented interface instead, consider HTML::Manipulator::Document.

BUGS

This is a young module. It works for me, but it has not been extensively tested in the wild. Handle with care. Report bugs to get them fixed.

AUTHOR

Thilo Planz, <planz@epost.de>

COPYRIGHT AND LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
