NAME

HTML::Inspect - Inspect an HTML document

SYNOPSIS

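use HTML::Inspect;

# $html holds the document text; this literal is only a placeholder.
my $html      = '<html><head><title>Doc</title></head><body></body></html>';
my $source    = 'http://example.com/doc';
my $inspector = HTML::Inspect->new(
    location => $source,
    html_ref => \$html,
);
my $classic   = $inspector->collectMetaClassic;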

DESCRIPTION

This module extracts information from HTML using a clean parser (XML::LibXML). The returned structures may need further processing. Please suggest additional extractors.

This module is part of the "Crawl Pipeline". You can find a detailed description of the output of each of the methods below on its web page at https://pipeline.shared-search.eu/extract/

URL normalization is a crucial feature of the output of these methods. You can also use it separately via the functions in HTML::Inspect::Normalization.

METHODS

Constructors

HTML::Inspect->new(%options)

-Option  --Default
 html_ref  <required>
 location  <required>

html_ref => REF-String: Reference to a (possibly troublesome) HTML string. Passed as a reference to avoid copying large strings.
location => URL: An absolute URL, as a string or URI instance, which tells where the HTML was found. It is used as the base for relative URLs found in the HTML, unless the HTML contains a <base> element.

Accessors

$obj->base(): The base URI, which is used for relative links in the page. This is the location, unless the HTML contains a <base href> declaration. The base URI is a string representation, in absolute and normalized form.
$obj->location(): The URI object representing the location parameter passed to new(), which serves as the default base for relative links.

Collecting

The <link> element

$obj->collectLinks(): Collect all <link> relations from the document. The returned HASH maps each relation (the rel attribute, required) to an ARRAY of link elements with that value. The ARRAY elements are HASHes of all attributes of the link, all lower-cased. The added href_uri key is a normalized, absolute translation of the href attribute.
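
As a rough illustration of how the structure described above might be consumed, the sketch below walks a hand-written sample instead of calling the module. The sample data and the helper name uris_for_rel are hypothetical; only the overall shape (relation to ARRAY of attribute HASHes, each with an href_uri key) follows the description.

```perl
use strict;
use warnings;

# Hypothetical sample, shaped like the documented return value of
# collectLinks(): relation => ARRAY of HASHes with the link attributes.
my $links = {
    stylesheet => [
        { rel => 'stylesheet', href => '/css/main.css',
          href_uri => 'http://example.com/css/main.css' },
    ],
    alternate => [
        { rel => 'alternate', type => 'application/rss+xml', href => 'feed.xml',
          href_uri => 'http://example.com/feed.xml' },
        { rel => 'alternate', hreflang => 'nl', href => '/nl/',
          href_uri => 'http://example.com/nl/' },
    ],
};

# Return the normalized URIs for one relation, or an empty list when
# the relation does not occur in the document.
sub uris_for_rel {
    my ($links, $rel) = @_;
    return map  { $_->{href_uri} }
           grep { defined $_->{href_uri} }
           @{ $links->{$rel} || [] };
}

print "$_\n" for uris_for_rel($links, 'alternate');
```

In real use, $links would presumably come from $inspector->collectLinks rather than from a literal.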

The <meta> element

$obj->collectMeta(%options)

Returns an ARRAY of all kinds of <meta> records, which have a wide variety of fields and may be order dependent.

example:

[ { 'http-equiv' => 'Content-Type', content => 'text/html; charset=UTF-8' },
  { name => 'viewport', content => 'width=device-width, initial-scale=1.0' },
]
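
Because the records may be order dependent, a consumer will often want the first record matching some attribute. The following is a minimal sketch of that idea, using the sample from the example above as literal data; the helper first_meta is hypothetical and not part of the module.

```perl
use strict;
use warnings;

# Sample copied from the example above; in real use this would be the
# return value of collectMeta().
my $meta = [
    { 'http-equiv' => 'Content-Type', content => 'text/html; charset=UTF-8' },
    { name => 'viewport', content => 'width=device-width, initial-scale=1.0' },
];

# Return the first record, in document order, whose attribute $attr
# equals $value (compared case-insensitively). Returns nothing when
# no record matches.
sub first_meta {
    my ($meta, $attr, $value) = @_;
    for my $record (@$meta) {
        next unless defined $record->{$attr};
        return $record if lc($record->{$attr}) eq lc($value);
    }
    return;
}

my $ctype = first_meta($meta, 'http-equiv', 'content-type');
print $ctype->{content}, "\n" if $ctype;
```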

$obj->collectMetaClassic(%options)

Returns a HASH reference with all <meta> information of traditional content: the single charset and all http-equiv records, plus the subset of names which are listed on https://www.w3schools.com/tags/tag_meta.asp. People defined far too many names to be useful for everyone.

example:

{  'http-equiv' => { 'content-type' => 'text/plain' },
    charset => 'UTF-8',
    name => { author => 'John Smith' , description => 'The John Smith\'s page.'},
}
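
One possible use of this structure is working out which character set the page declares. The sketch below prefers the charset key and falls back to a charset parameter inside the http-equiv content-type value. The helper declared_charset and the fallback rule are assumptions made for illustration, not behaviour of the module.

```perl
use strict;
use warnings;

# Sample copied from the example above; in real use this would be the
# return value of collectMetaClassic().
my $classic = {
    'http-equiv' => { 'content-type' => 'text/plain' },
    charset      => 'UTF-8',
    name         => { author => 'John Smith', description => 'The John Smith\'s page.' },
};

# Return the declared character set: the explicit charset when present,
# otherwise a charset parameter found in the http-equiv content-type.
# Returns nothing when neither is available.
sub declared_charset {
    my ($classic) = @_;
    return $classic->{charset} if defined $classic->{charset};

    my $ctype = ($classic->{'http-equiv'} || {})->{'content-type'};
    return unless defined $ctype;
    return $1 if $ctype =~ /charset\s*=\s*["']?([^\s;"']+)/i;
    return;
}

my $charset = declared_charset($classic);
print "charset: $charset\n" if defined $charset;
```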

$obj->collectMetaNames(%options)

Returns a HASH with all <meta> records which have both a name and a content attribute. These are used as key-value pairs for many, many different purposes.

example:

{ author => 'John Smith' , description => 'The John Smith\'s page.'}

References

The number of references can be large (easily a few hundred per HTML page), so you may want to specify a filter. The %filter rules produce a subset of the links found. You can use: http_only (return only http and https links), mailto_only, maximum_set (return only the first n links), and matching (return only links matching a given regex).

$obj->collectReferences(%filter): Collects all references from the document. Method collectReferencesFor() is called for a list of known tag/attribute pairs, and the results are returned as a HASH of ARRAYs. The keys of the HASH have the format "$tag_$attribute".
$obj->collectReferencesFor($tag, $attr, %filter): Returns an ARRAY of unique normalized URIs, which were found in attribute $attr of element $tag. For instance, tag img with attribute src. The URIs are in their textual order in the document, where only the first encounter is recorded.
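
To give an idea of how the HASH of ARRAYs might be post-processed, the sketch below counts references per host over a hand-written sample. The key names a_href and img_src, the sample URIs, and the helper host_counts are assumptions for illustration. The commented call shows how a filter could be passed, with the option values also assumed.

```perl
use strict;
use warnings;

# In real use (values are assumptions):
#   my $refs = $inspector->collectReferences(http_only => 1, maximum_set => 100);
# Hypothetical sample, shaped like the documented return value:
# "${tag}_${attribute}" => ARRAY of normalized URIs.
my $refs = {
    a_href  => [ 'http://example.com/about', 'https://other.example/' ],
    img_src => [ 'http://example.com/logo.png' ],
};

# Count http/https references per host across all tag/attribute pairs.
# URIs with other schemes are skipped.
sub host_counts {
    my ($refs) = @_;
    my %count;
    for my $key (sort keys %$refs) {
        for my $uri (@{ $refs->{$key} }) {
            my ($host) = $uri =~ m{^https?://([^/:?#]+)}i or next;
            $count{ lc $host }++;
        }
    }
    return \%count;
}

my $counts = host_counts($refs);
printf "%-20s %d\n", $_, $counts->{$_} for sort keys %$counts;
```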

Other

$obj->collectOpenGraph()

Returns structured OpenGraph information, when available in the HTML.

The logic really understands OpenGraph, and simplifies access to it: facts which may appear multiple times will always be returned as ARRAY.

AUTHORS and COPYRIGHT

Mark Overmeer
CPAN ID: MARKOV
markov at cpan dot org

Красимир Беров
CPAN ID: BEROV
berov at cpan dot org
https://studio-berov.eu

This is free software, licensed under The Artistic License 2.0 (GPL Compatible). The full text of the license can be found in the LICENSE file included with this module.

LICENSE

Copyrights 2021 by [Mark Overmeer <markov@cpan.org>]. For other contributors see ChangeLog.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See http://dev.perl.org/licenses/
