NAME
HTML::Inspect - Inspect an HTML document
SYNOPSIS
  use HTML::Inspect;

  my $html      = '<html><head><title>Doc</title></head></html>';  # obtained elsewhere
  my $source    = 'http://example.com/doc';
  my $inspector = HTML::Inspect->new(
      location => $source,
      html_ref => \$html,
  );
  my $classic   = $inspector->collectMetaClassic;
DESCRIPTION
This module extracts information from HTML, using a clean parser (XML::LibXML). Returned structures may need further processing. Please suggest additional extractors.
This module is part of the "Crawl Pipeline". A detailed description of the output of each of the methods below can be found on its web-page at https://pipeline.shared-search.eu/extract/
URL normalization is a crucial feature of the output of these methods. You can also use it separately, via the functions in HTML::Inspect::Normalization.
METHODS
Constructors
- HTML::Inspect->new(%options)
 -Option  --Default
  html_ref  <required>
  location  <required>
- html_ref => REF-String
Reference to a (possibly troublesome) HTML string. It is passed as reference to avoid copying large strings.
- location => URL
An absolute URL, as a string or URI instance, which tells where the HTML was found. It is used as the base for relative URLs found in the HTML, unless the HTML contains a <base> element.
Accessors
- $obj->base()
The base URI, which is used for relative links in the page. This is the location, unless the HTML contains a <base href> declaration. The base URI is a string representation, in absolute and normalized form.
- $obj->location()
The URI object which represents the location parameter which was passed as default base for relative links to new().
Collecting
The <link> element
- $obj->collectLinks()
Collect all <link> relations from the document. The returned HASH maps each relation (the rel attribute, required) to an ARRAY of link elements with that value. The ARRAY elements are HASHes of all attributes of the link, all lower-cased. The added href_uri key will be a normalized, absolute translation of the href attribute.
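As a rough illustration of how such a structure might be consumed, the sketch below walks a hand-written sample shaped like the description above. The sample data and the helper uris_for_relation() are hypothetical and not part of HTML::Inspect, and href_uri is assumed to be usable as a plain string.

```perl
use strict;
use warnings;

# Hypothetical sample, shaped like the documented result of collectLinks():
# relation name => ARRAY of HASHes with the link attributes plus href_uri.
my $links = {
    stylesheet => [
        { rel => 'stylesheet', href => '/css/main.css',
          href_uri => 'http://example.com/css/main.css' },
    ],
    alternate => [
        { rel => 'alternate', type => 'application/rss+xml', href => 'feed.xml',
          href_uri => 'http://example.com/feed.xml' },
        { rel => 'alternate', hreflang => 'nl', href => '/nl/',
          href_uri => 'http://example.com/nl/' },
    ],
};

# Helper (not part of the module): absolute URIs for one relation,
# or an empty list when the relation does not occur in the page.
sub uris_for_relation {
    my ($found, $rel) = @_;
    return map { $_->{href_uri} } @{ $found->{$rel} || [] };
}

print "$_\n" for uris_for_relation($links, 'alternate');
```

Falling back to an empty ARRAY keeps the caller free of existence checks for relations which a page does not use.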
The <meta> element
- $obj->collectMeta(%options)
Returns an ARRAY of all kinds of <meta> records, which have a wide variety of fields and may be order dependent.
example:
  [ { 'http-equiv' => 'Content-Type', content => 'text/html; charset=UTF-8' },
    { name => 'viewport', content => 'width=device-width, initial-scale=1.0' },
  ]
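Because the records differ in which fields they carry, a caller will often select them by attribute first. A minimal sketch, assuming the structure of the example above; meta_records_with() is a hypothetical helper, not a method of this module.

```perl
use strict;
use warnings;

# Sample data following the example above (the http-equiv key is quoted,
# because a bareword with a hyphen is not auto-quoted by Perl).
my $meta = [
    { 'http-equiv' => 'Content-Type', content => 'text/html; charset=UTF-8' },
    { name => 'viewport', content => 'width=device-width, initial-scale=1.0' },
];

# Helper (not part of the module): the records which carry a certain
# attribute, kept in their original document order.
sub meta_records_with {
    my ($records, $attr) = @_;
    return grep { exists $_->{$attr} } @$records;
}

for my $record (meta_records_with($meta, 'name')) {
    print "$record->{name}: $record->{content}\n";
}
```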
- $obj->collectMetaClassic(%options)
Returns a HASH reference with all <meta> information of traditional content: the single charset and all http-equiv records, plus the subset of names which are listed on https://www.w3schools.com/tags/tag_meta.asp. People defined far too many names to be useful for everyone.
example:
  { 'http-equiv' => { 'content-type' => 'text/plain' },
    charset      => 'UTF-8',
    name         => { author      => 'John Smith',
                      description => 'The John Smith\'s page.' },
  }
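One possible use of this structure is to find the character set which a page declares. The sketch below is only an illustration: declared_charset() is a hypothetical helper, and it assumes that the http-equiv keys are lower-cased as in the example above.

```perl
use strict;
use warnings;

# Helper (not part of the module): prefer the explicit charset, and fall
# back to the charset parameter of an http-equiv content-type record.
sub declared_charset {
    my ($classic) = @_;
    return $classic->{charset} if defined $classic->{charset};

    my $equiv        = $classic->{'http-equiv'} || {};
    my $content_type = $equiv->{'content-type'} // '';
    return $content_type =~ /charset\s*=\s*["']?([\w.:-]+)/i ? $1 : undef;
}

my $classic = {
    'http-equiv' => { 'content-type' => 'text/html; charset=ISO-8859-1' },
    name         => { author => 'John Smith' },
};
my $charset = declared_charset($classic);
print defined $charset ? "$charset\n" : "no charset declared\n";
```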
- $obj->collectMetaNames(%options)
Returns a HASH with all <meta> records which have both a name and a content attribute. These are used as key-value pairs for many, many different purposes.
example:
  { author      => 'John Smith',
    description => 'The John Smith\'s page.',
  }
References
The number of references is large (easily a few hundred per HTML page), so you may want to specify a filter. The %filter rules will produce a subset of the links found. You can use: http_only (returning only http and https links), mailto_only, maximum_set (returning only the first n links) and matching, returning links matching a certain regex.
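The effect of these rules can be pictured with a small stand-alone sketch. This is not the module's implementation: filter_links() is a hypothetical helper, the order in which the rules are applied is an assumption, and mailto_only is assumed to keep only mailto: links.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's implementation: apply the documented
# filter rules to a list of link strings. The order of the rules is assumed.
sub filter_links {
    my ($links, %filter) = @_;
    my @found = @$links;

    @found = grep { m!^https?:!i } @found if $filter{http_only};
    @found = grep { m!^mailto:!i } @found if $filter{mailto_only};

    if (defined(my $regex = $filter{matching})) {
        @found = grep { $_ =~ $regex } @found;
    }
    if (defined(my $max = $filter{maximum_set})) {
        splice @found, $max if @found > $max;   # keep only the first $max
    }
    return \@found;
}

my @links = (
    'http://example.com/a',
    'mailto:info@example.com',
    'https://example.com/b.pdf',
    'ftp://example.com/c',
);
my $subset = filter_links(\@links, http_only => 1, maximum_set => 1);
print "@$subset\n";
```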
- $obj->collectReferences(%filter)
Collects all references from the document. Method collectReferencesFor() is called for a list of known tag/attribute pairs, and the results are returned as a HASH of ARRAYs. The keys of the HASH have format "$tag_$attribute".
- $obj->collectReferencesFor($tag, $attr, %filter)
Returns an ARRAY of unique normalized URIs, which were found with the $tag attribute $attr. For instance, tag img attribute src. The URIs are in their textual order in the document, where only the first encounter is recorded.
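The "first encounter only, in textual order" rule and the key format can be sketched as follows. The helper unique_in_order() and the sample URIs are hypothetical; the snippet only mimics the shape of the result, not how the module collects the data.

```perl
use strict;
use warnings;

# Helper (not part of the module): keep the first encounter of each value,
# preserving the original order.
sub unique_in_order {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

# Hypothetical, already normalized URIs per tag/attribute pair.
my @pairs = (
    [ a   => href => 'http://example.com/', 'http://example.com/about',
                     'http://example.com/' ],
    [ img => src  => 'http://example.com/logo.png' ],
);

my %references;
for my $pair (@pairs) {
    my ($tag, $attr, @uris) = @$pair;
    # The braces are needed: "$tag_$attr" would refer to a variable $tag_.
    $references{"${tag}_$attr"} = [ unique_in_order(@uris) ];
}

print "$_: @{ $references{$_} }\n" for sort keys %references;
```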
Other
- $obj->collectOpenGraph()
Returns structured OpenGraph information, when available in the HTML.
The logic understands the structure of OpenGraph, and simplifies access to it: facts which may appear multiple times will always be returned as ARRAY.
SEE ALSO
This software is a component of the Crawl Pipeline, https://pipeline.shared-search.eu. Development was made possible with a generous gift by the NLnet Foundation.
AUTHORS and COPYRIGHT
Mark Overmeer
CPAN ID: MARKOV
markov at cpan dot org
Красимир Беров
CPAN ID: BEROV
berov at cpan dot org
https://studio-berov.eu
This is free software, licensed under: The Artistic License 2.0 (GPL Compatible) The full text of the license can be found in the LICENSE file included with this module.
SEE ALSO
This module is part of HTML-Inspect distribution version 1.00, built on December 08, 2021. Website: http://perl.overmeer.net/CPAN/
LICENSE
Copyrights 2021 by [Mark Overmeer <markov@cpan.org>]. For other contributors see ChangeLog.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See http://dev.perl.org/licenses/