NAME

HTML::Inspect - Inspect an HTML document

SYNOPSIS

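use HTML::Inspect;

# Sample HTML; normally this is the content fetched from $source.
my $html      = '<html><head><title>Example</title></head></html>';
my $source    = 'http://example.com/doc';
my $inspector = HTML::Inspect->new(
    location => $source,
    html_ref => \$html,
);
my $classic   = $inspector->collectMetaClassic;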

DESCRIPTION

This module extracts information from HTML, using a clean parser (XML::LibXML). The returned structures may need further processing. Please suggest additional extractors.

This module is part of the "Crawl Pipeline". A detailed description of the output of each of the methods below can be found on its web page at https://pipeline.shared-search.eu/extract/

URL normalization is a crucial feature of the output of these methods. You can also use it separately via the functions in HTML::Inspect::Normalization.

METHODS

Constructors

HTML::Inspect->new(%options)
-Option    --Default
 html_ref    <required>
 location    <required>

html_ref => REF-String

Reference to a (possibly troublesome) HTML string. It is passed as a reference to avoid copying large strings.

location => URL

An absolute URL as a string or URI instance, which tells where the HTML was found. It is used as the base for relative URLs found in the HTML, unless the HTML contains a <base> element.

Accessors

$obj->base()

The base URI, which is used for relative links in the page. This is the location, unless the HTML contains a <base href> declaration. The base URI is a string representation, in absolute and normalized form.

$obj->location()

The URI object which represents the location parameter passed to new(). It is the default base for relative links.
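
As an illustration of the two accessors, here is a minimal sketch. The sample markup and the example.com URLs are made up; only new(), location() and base() come from this manual.

```perl
use strict;
use warnings;
use HTML::Inspect;

# Hypothetical page content; normally this comes from a crawler or HTTP client.
my $html = <<'__HTML';
<html>
  <head>
    <base href="http://example.com/docs/">
    <title>Example</title>
  </head>
  <body><a href="intro.html">Intro</a></body>
</html>
__HTML

my $inspector = HTML::Inspect->new(
    location => 'http://example.com/doc',   # where the HTML was found
    html_ref => \$html,                     # a reference, so the string is not copied
);

# location() is documented to return a URI object, base() a string.
print "location: ", $inspector->location, "\n";
print "base:     ", $inspector->base, "\n";
```

Because the sample contains a <base href> declaration, base() is expected to reflect that declaration rather than the location.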

Collecting

The <link> element

$obj->collectLinks()

Collect all <link> relations from the document. The returned HASH maps each relation (the required rel attribute) to an ARRAY of the link elements with that value. The ARRAY elements are HASHes which contain all attributes of the link, with lower-cased names. The added href_uri key is a normalized, absolute translation of the href attribute.
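
To make the shape of that structure concrete, the following sketch walks a hand-written sample. The data is hypothetical, built only from the description above; it is not captured output of collectLinks().

```perl
use strict;
use warnings;

# Hypothetical return value of collectLinks(), shaped as described above:
# relation => ARRAY of HASHes with lower-cased attributes plus href_uri.
my $links = {
    stylesheet => [
        { rel => 'stylesheet', href => '/css/main.css',
          href_uri => 'http://example.com/css/main.css' },
        { rel => 'stylesheet', href => 'print.css', media => 'print',
          href_uri => 'http://example.com/print.css' },
    ],
    canonical => [
        { rel => 'canonical', href => '/doc',
          href_uri => 'http://example.com/doc' },
    ],
};

# List the absolute URI of every link, grouped by relation.
my @lines;
for my $rel (sort keys %$links) {
    for my $link (@{ $links->{$rel} }) {
        push @lines, "$rel: $link->{href_uri}";
    }
}
print "$_\n" for @lines;
```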

The <meta> element

$obj->collectMeta(%options)

Returns an ARRAY of all kinds of <meta> records, which have a wide variety of fields and may be order dependent.

example:

[ { 'http-equiv' => 'Content-Type', content => 'text/html; charset=UTF-8' },
  { name => 'viewport', content => 'width=device-width, initial-scale=1.0' },
]
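
Since the records may be order dependent, a consumer will usually walk them in document order. The sketch below does that for the example records above and keeps the first content value seen per name. The first-wins rule is an assumption made for illustration, not behavior of this module.

```perl
use strict;
use warnings;

# The example records from above, as collectMeta() might return them.
my $meta = [
    { 'http-equiv' => 'Content-Type', content => 'text/html; charset=UTF-8' },
    { name => 'viewport', content => 'width=device-width, initial-scale=1.0' },
];

# Walk the records in document order and keep the first value seen per name.
my %first_by_name;
for my $record (@$meta) {
    next unless defined $record->{name};    # skip http-equiv, charset, etc.
    $first_by_name{ lc $record->{name} } //= $record->{content};
}

print "viewport: $first_by_name{viewport}\n";
```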

$obj->collectMetaClassic(%options)

Returns a HASH reference with the traditional <meta> information: the single charset, all http-equiv records, and the subset of names which are listed on https://www.w3schools.com/tags/tag_meta.asp. Far too many other names have been defined for all of them to be useful to everyone.

example:

{ 'http-equiv' => { 'content-type' => 'text/plain' },
  charset      => 'UTF-8',
  name         => { author => 'John Smith', description => 'The John Smith\'s page.' },
}
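
One possible use of this structure is to determine the declared character set. The sketch below works on a hand-written sample in the shape of the example above. The helper declared_charset() is hypothetical and not part of this module, and it assumes that the http-equiv keys are lower-cased as in the example.

```perl
use strict;
use warnings;

# Hypothetical collectMetaClassic() result, following the example above.
my $classic = {
    'http-equiv' => { 'content-type' => 'text/html; charset=ISO-8859-1' },
    name         => { author => 'John Smith' },
};

# Prefer <meta charset>, else fall back to the charset parameter of Content-Type.
sub declared_charset {
    my ($info) = @_;
    return $info->{charset} if defined $info->{charset};
    my $ctype = $info->{'http-equiv'}{'content-type'} // '';
    return $ctype =~ /charset\s*=\s*"?([\w.:-]+)/i ? $1 : undef;
}

my $charset = declared_charset($classic) // 'unknown';
print "charset: $charset\n";
```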

$obj->collectMetaNames(%options)

Returns a HASH with all <meta> records which have both a name and a content attribute. These are used as key-value pairs for many, many different purposes.

example:

{ author => 'John Smith', description => 'The John Smith\'s page.' }

References

The number of references is large (easily a few hundred per HTML page), so you may want to specify a filter. The %filter rules produce a subset of the links found. You can use: http_only (returning only http and https links), mailto_only, maximum_set (returning only the first n links) and matching (returning only links which match a certain regex).

$obj->collectReferences(%filter)

Collects all references from the document. Method collectReferencesFor() is called for each of a list of known tag/attribute pairs, and the results are returned as a HASH of ARRAYs. The keys of the HASH have the format "$tag_$attribute".

$obj->collectReferencesFor($tag, $attr, %filter)

Returns an ARRAY of unique normalized URIs, which were found in attribute $attr of element $tag. For instance, tag img with attribute src. The URIs are in their textual order in the document, where only the first encounter is recorded.
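
A minimal sketch of both calls follows. The sample HTML is made up, and the filter values are assumptions: this manual names the filter keys, but not their value types, so a true value for http_only and a compiled regex for matching are guesses.

```perl
use strict;
use warnings;
use HTML::Inspect;

# Hypothetical page content.
my $html = <<'__HTML';
<html><body>
  <a href="/about">About</a>
  <a href="mailto:info@example.com">Mail</a>
  <img src="logo.png">
</body></html>
__HTML

my $inspector = HTML::Inspect->new(
    location => 'http://example.com/doc',
    html_ref => \$html,
);

# All known tag/attribute pairs, limited to http(s) links.
# Assumption: http_only takes a true value to enable the rule.
my $refs = $inspector->collectReferences(http_only => 1);
for my $key (sort keys %$refs) {
    print "$key: $_\n" for @{ $refs->{$key} };
}

# One specific pair, restricted by a pattern.
# Assumption: matching accepts a compiled regex (qr//).
my $images = $inspector->collectReferencesFor(img => 'src', matching => qr/\.png$/);
print "image: $_\n" for @$images;
```

Printing the keys instead of hard-coding them avoids relying on the exact list of tag/attribute pairs the module knows about.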

Other

$obj->collectOpenGraph()

Returns structured OpenGraph information, when available in the HTML.

The logic understands the structure of OpenGraph and simplifies access to it: facts which may appear multiple times are always returned as an ARRAY.

SEE ALSO

XML::LibXML, Log::Report

This software is a component of the Crawl Pipeline, https://pipeline.shared-search.eu. Development was made possible with a generous gift by the NLnet Foundation.

AUTHORS and COPYRIGHT

Mark Overmeer
CPAN ID: MARKOV
markov at cpan dot org

Красимир Беров
CPAN ID: BEROV
berov at cpan dot org
https://studio-berov.eu

This is free software, licensed under: The Artistic License 2.0 (GPL Compatible). The full text of the license can be found in the LICENSE file included with this module.

SEE ALSO

This module is part of HTML-Inspect distribution version 1.00, built on December 08, 2021. Website: http://perl.overmeer.net/CPAN/

LICENSE

Copyrights 2021 by [Mark Overmeer <markov@cpan.org>]. For other contributors see ChangeLog.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See http://dev.perl.org/licenses/