NAME
HTML::LinkExtractor - Extract links from an HTML document
DESCRIPTION
HTML::LinkExtractor is used for extracting links from HTML. It is very similar to HTML::LinkExtor, except that besides getting the URL, you also get the link-text.
Example (please run the examples yourself):
use HTML::LinkExtractor;
use Data::Dumper;
my $input = q{If <a href="http://perl.com/"> I am a LINK!!! </a>};
my $LX = new HTML::LinkExtractor();
$LX->parse(\$input);
print Dumper($LX->links);
__END__
# the above example will yield
$VAR1 = [
          {
            '_TEXT' => '<a href="http://perl.com/"> I am a LINK!!! </a>',
            'href' => bless(do{\(my $o = 'http://perl.com/')}, 'URI::http'),
            'tag' => 'a'
          }
        ];
HTML::LinkExtractor will also correctly extract nested link-type tags.
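As a minimal sketch of what nested extraction looks like in practice, the snippet below parses an image nested inside an anchor. The input string and the file name camel.png are made up for illustration; only new, parse and links from this documentation are used.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

# An <img> (a link-type tag via its src attribute) nested inside an <a>.
# The markup is a made-up sample, not taken from the module.
my $input = q{<a href="http://perl.com/"><img src="camel.png"> Perl home</a>};

my $LX = HTML::LinkExtractor->new();
$LX->parse( \$input );

# Both the outer <a> and the inner <img> are expected to be reported.
for my $link ( @{ $LX->links } ) {
    my $tag = $link->{tag};
    # Pick whichever URL attribute this tag carries.
    my $url = $tag eq 'img' ? $link->{src} : $link->{href};
    print "$tag => $url\n";
}
```

The order in which the two entries come back is not assumed here; the loop only relies on each hashref having a tag key.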
SYNOPSIS
## the demo
perl LinkExtractor.pm
perl LinkExtractor.pm file.html otherfile.html
## or if the module is installed, but you don't know where
## (the first form is quoted for Windows shells, the second for Unix shells)
perl -MHTML::LinkExtractor -e" system $^X, $INC{q{HTML/LinkExtractor.pm}} "
perl -MHTML::LinkExtractor -e' system $^X, $INC{q{HTML/LinkExtractor.pm}} '
## or
use HTML::LinkExtractor;
use LWP::Simple qw( get ); # get() is exported by LWP::Simple, not by LWP
my $base = 'http://search.cpan.org';
my $html = get($base.'/recent');
my $LX = new HTML::LinkExtractor();
$LX->parse(\$html);
print qq{<base href="$base">\n};
for my $Link ( @{ $LX->links } ) {
    ## new modules are linked by /author/NAME/Dist
    if( $$Link{href} =~ m{^/author/\w+} ) {
        print $$Link{_TEXT}."\n";
    }
}
undef $LX;
__END__
## or
use HTML::LinkExtractor;
use Data::Dumper;
my $input = q{If <a href="http://perl.com/"> I am a LINK!!! </a>};
my $LX = new HTML::LinkExtractor(
    sub {
        print Data::Dumper::Dumper(@_);
    },
    'http://perlFox.org/',
);
$LX->parse(\$input);
$LX->strip(1);
$LX->parse(\$input);
__END__
#### Calculate the total size of a web page:
#### adds up the sizes of all the images, stylesheets and other linked resources
use strict;
use LWP::Simple qw( get head ); # get() and head() come from LWP::Simple, not LWP
use HTML::LinkExtractor;
#
my $url = shift || 'http://www.google.com';
my $html = get($url);
die "Couldn't fetch $url\n" unless defined $html;
my $Total = length $html;
#
print "initial size $Total\n";
#
my $LX = new HTML::LinkExtractor(
    sub {
        my( $X, $tag ) = @_;
        #
        ## skip the tags listed in @TAGS_IN_NEED (such as a)
        unless( grep {$_ eq $tag->{tag} } @HTML::LinkExtractor::TAGS_IN_NEED ) {
            #
            print "$$tag{tag}\n";
            #
            for my $urlAttr ( @{$HTML::LinkExtractor::TAGS{$$tag{tag}}} ) {
                if( exists $$tag{$urlAttr} ) {
                    ## head() returns ($content_type, $document_length, ...)
                    my $size = (head( $$tag{$urlAttr} ))[1];
                    $Total += $size if $size;
                    print "adding $size\n" if $size;
                }
            }
        }
    },
    $url,
    0
);
#
$LX->parse(\$html);
#
print "The total size of \n$url\n is $Total bytes\n";
__END__
METHODS
$LX->new([\&callback, [$baseUrl, [1]]])
Accepts 3 arguments, all of which are optional. If, for example, you want to pass a $baseUrl but don't want a callback invoked, just put undef in place of the subref.
This is the only class method.
The first argument is a callback (a sub reference, as in sub{} or \&sub), which is called each time a new link is encountered (for the tags in @HTML::LinkExtractor::TAGS_IN_NEED, this means after the closing tag is encountered). The callback receives an object reference ($LX) and a link hashref.
The second argument is a base URL (it is passed to URI->new, so it's up to you to make sure it's valid), which is used to convert all relative URIs to absolute ones:
$ALink{href} = URI->new_abs( $ALink{href}, $base );
The third argument is a "boolean" (just stick with 1). See the example in "DESCRIPTION". Normally, you'd get back _TEXT that looks like
'_TEXT' => '<a href="http://perl.com/"> I am a LINK!!! </a>',
If you turn this option on, you'll get the following instead
'_TEXT' => ' I am a LINK!!! ',
The private utility function _stripHTML does this by using HTML::TokeParser's method get_trimmed_text. You can turn this feature on and off by using
$LX->strip(undef || 0 || 1)
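The argument order described above can be tried out with a short sketch like the following. It passes undef in the callback slot, a base URL, and a true strip flag. The base URL http://example.com/ and the sample markup are placeholders chosen for illustration, not anything the module prescribes.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

# A relative link, so the base URL has something to resolve against.
my $input = q{See <a href="/docs/index.html"> the docs </a> for details.};

# undef in the callback slot, then the base URL, then the strip flag.
# 'http://example.com/' is a placeholder base chosen for this sketch.
my $LX = HTML::LinkExtractor->new( undef, 'http://example.com/', 1 );
$LX->parse( \$input );

for my $link ( @{ $LX->links } ) {
    # href is a URI object; it stringifies when interpolated.
    print "tag:  $link->{tag}\n";
    print "href: $link->{href}\n";
    print "text: $link->{_TEXT}\n";
}
```

Because no callback was given, the results are read back through links afterwards rather than being delivered one at a time.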
$LX->parse( $filename || *FILEHANDLE || \$FileContent )
Each time you call parse, you should pass it a $filename, a *FILEHANDLE, or a \$FileContent.
Each time you call parse, a new HTML::TokeParser object is created and stored in $this->{_tp}.
You shouldn't need to mess with the TokeParser object.
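The three accepted input forms might be exercised as in the sketch below. The helper count_links and the sample markup are hypothetical names introduced for this example, and File::Temp is used only to have a real file to point at. A fresh extractor is created for each input so that nothing is assumed about state carried over between parse calls.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );
use HTML::LinkExtractor;

my $html = q{<p><img src="logo.png"> <a href="about.html">About</a></p>};

# Write the sample to a temporary file so all three input forms can be shown.
my ( $fh, $filename ) = tempfile( SUFFIX => '.html', UNLINK => 1 );
print {$fh} $html;
close $fh or die "close failed: $!";

# Hypothetical helper: parse one source with a fresh extractor and
# return how many links were found.
sub count_links {
    my ($source) = @_;
    my $LX = HTML::LinkExtractor->new();
    $LX->parse($source);
    return scalar @{ $LX->links };
}

my $from_string = count_links( \$html );       # reference to the content
my $from_file   = count_links($filename);      # file name

open my $in, '<', $filename or die "Cannot open $filename: $!";
my $from_handle = count_links($in);            # open file handle
close $in;

print "string: $from_string, file: $from_file, handle: $from_handle\n";
```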
$LX->links()
Only after you call parse will this method return anything. It returns a reference to an array of hashes, which basically looks like this (Data::Dumper output):
$VAR1 = [ { tag => 'img', src => 'image.png' }, ];
Please note that if you provide a callback, this array will be empty.
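The difference between collecting links and receiving them through a callback can be seen side by side in a sketch such as this one. The variable names and the sample markup are illustrative only.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

my $html = q{<img src="a.png"><a href="one.html">One</a><a href="two.html">Two</a>};

# Without a callback, the links accumulate and are read back via links().
my $collector = HTML::LinkExtractor->new();
$collector->parse( \$html );
printf "links() holds %d entries\n", scalar @{ $collector->links };

# With a callback, each link hashref is handed to the sub instead.
my @seen;
my $streamer = HTML::LinkExtractor->new(
    sub {
        my ( $lx, $link ) = @_;
        push @seen, $link->{tag};
    }
);
$streamer->parse( \$html );

# "|| []" guards against links() returning nothing at all in callback mode.
printf "callback saw %d links; links() holds %d entries\n",
    scalar @seen, scalar @{ $streamer->links || [] };
```

The callback form suits large documents or cases where each link is processed and discarded; the collecting form is simpler when the whole list is needed afterwards.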
$LX->strip( [ 0 || 1 ])
If you pass in undef (or nothing), returns the state of the option. Passing in a defined true or false value sets the option.
To find out what the option does, see $LX->new([\&callback, [$baseUrl, [1]]]).
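A small sketch of using strip as both setter and getter follows. It parses the same input with the option off and then on and records the _TEXT reported each time. The hash %text_for is a name made up for this example.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

my $input = q{If <a href="http://perl.com/"> I am a LINK!!! </a>};
my $LX    = HTML::LinkExtractor->new();
my %text_for;    # _TEXT as reported with the option off and on

for my $flag ( 0, 1 ) {
    $LX->strip($flag);                          # a defined value sets the option
    my $state = $LX->strip() ? 'on' : 'off';    # no argument reads it back
    $LX->parse( \$input );
    my $link = $LX->links->[-1];                # most recently extracted link
    $text_for{$state} = $link->{_TEXT};
    print "strip $state: [$text_for{$state}]\n";
}
```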
WHAT'S A LINK-type tag
Take a look at %HTML::LinkExtractor::TAGS to see what I consider to be a link-type tag.
Take a look at @HTML::LinkExtractor::VALID_URL_ATTRIBUTES to see all the possible tag attributes which can contain URIs (the links!!).
Take a look at @HTML::LinkExtractor::TAGS_IN_NEED to see the tags for which the '_TEXT' attribute is provided, like <a href="#"> TEST </a>.
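One way to take that look is to print the three package variables, as in this sketch. It assumes nothing beyond the variable names given above, and it treats a missing attribute list as empty.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

# Every tag treated as link-type, with the attributes that may hold a URL.
for my $tag ( sort keys %HTML::LinkExtractor::TAGS ) {
    # An entry may carry no attribute list; treat that as empty.
    my $attrs = $HTML::LinkExtractor::TAGS{$tag} || [];
    printf "%-12s %s\n", $tag, join( ' ', @$attrs );
}

print "URL attributes:  @HTML::LinkExtractor::VALID_URL_ATTRIBUTES\n";
print "Tags with _TEXT: @HTML::LinkExtractor::TAGS_IN_NEED\n";
```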
How can that be?!?!
I took a look at %HTML::Tagset::linkElements and the following URLs:
http://www.blooberry.com/indexdot/html/tagindex/all.htm
http://www.blooberry.com/indexdot/html/tagpages/a/a-hyperlink.htm
http://www.blooberry.com/indexdot/html/tagpages/a/applet.htm
http://www.blooberry.com/indexdot/html/tagpages/a/area.htm
http://www.blooberry.com/indexdot/html/tagpages/b/base.htm
http://www.blooberry.com/indexdot/html/tagpages/b/bgsound.htm
http://www.blooberry.com/indexdot/html/tagpages/d/del.htm
http://www.blooberry.com/indexdot/html/tagpages/d/div.htm
http://www.blooberry.com/indexdot/html/tagpages/e/embed.htm
http://www.blooberry.com/indexdot/html/tagpages/f/frame.htm
http://www.blooberry.com/indexdot/html/tagpages/i/ins.htm
http://www.blooberry.com/indexdot/html/tagpages/i/image.htm
http://www.blooberry.com/indexdot/html/tagpages/i/iframe.htm
http://www.blooberry.com/indexdot/html/tagpages/i/ilayer.htm
http://www.blooberry.com/indexdot/html/tagpages/i/inputimage.htm
http://www.blooberry.com/indexdot/html/tagpages/l/layer.htm
http://www.blooberry.com/indexdot/html/tagpages/l/link.htm
http://www.blooberry.com/indexdot/html/tagpages/o/object.htm
http://www.blooberry.com/indexdot/html/tagpages/q/q.htm
http://www.blooberry.com/indexdot/html/tagpages/s/script.htm
http://www.blooberry.com/indexdot/html/tagpages/s/sound.htm
And the special cases
<!DOCTYPE HTML SYSTEM "http://www.w3.org/DTD/HTML4-strict.dtd">
http://www.blooberry.com/indexdot/html/tagpages/d/doctype.htm
'!doctype' is really a declaration rather than a tag, but it is still listed in %TAGS with 'url' as the attribute
and
<meta HTTP-EQUIV="Refresh" CONTENT="5; URL=http://www.foo.com/foo.html">
http://www.blooberry.com/indexdot/html/tagpages/m/meta.htm
If the CONTENT value contains a valid URL, it is stored under 'url' in the link hashref.
The meta tag has no attributes listed in %TAGS.
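The meta special case might be checked with a sketch like the one below, which reuses the Refresh tag shown above. It assumes, based on the description here, that the refresh target appears under the url key of the link hashref. The exists test keeps the snippet harmless if it does not.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

my $html = q{<meta HTTP-EQUIV="Refresh" CONTENT="5; URL=http://www.foo.com/foo.html">};

my $LX = HTML::LinkExtractor->new();
$LX->parse( \$html );

for my $link ( @{ $LX->links } ) {
    next unless $link->{tag} eq 'meta';
    # Assumption: the refresh target is reported under 'url'.
    if ( exists $link->{url} ) {
        print "refresh target: $link->{url}\n";
    }
    else {
        print "meta tag found, but no 'url' key was set\n";
    }
}
```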
SEE ALSO
HTML::LinkExtor, HTML::TokeParser, HTML::Tagset.
AUTHOR
D.H (PodMaster)
Please use http://rt.cpan.org/ to report bugs.
Just go to http://rt.cpan.org/NoAuth/Bugs.html?Dist=HTML-Scrubber to see a bug list and/or report new ones.
LICENSE
Copyright (c) 2003, 2004 by D.H. (PodMaster). All rights reserved.
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. The LICENSE file contains the full text of the license.