NAME
HTML::Inspect::Normalize - normalize urls
INHERITANCE
HTML::Inspect::Normalize
is an Exporter
SYNOPSIS
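use HTML::Inspect::Normalize;
set_page_base($base_url); # used as base for relative urls
my $norm = normalize_url($relative_url); # scalar context: throws on error
my ($norm, $rc, $err) = normalize_url($relative_url); # list context: also returns rc and errmsg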
DESCRIPTION
Although this module is part of the HTML::Inspect distribution, it is useful on its own: its functions convert sloppy http and https urls, as found on webpages, into cleanly normalized urls, and they do so very fast.
FUNCTIONS
- normalize_url($url)
Normalize a URL relative to the base, which must be set first with set_page_base(). Returns the same values as set_page_base().
- set_page_base($base_url)
Set the base used to resolve relative urls. The base is normalized before use. In LIST context, returns the normalized_url (string), rc, and errmsg. In SCALAR context, returns only the normalized_url and throws an exception when a problem was found.
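The difference between the two calling contexts can be sketched as follows. The function demo_normalize() below is a hypothetical stand-in written only for this illustration; it is not part of the module, and the values it uses for rc and errmsg (and the undefined url on failure) are assumptions. Only the calling pattern is meant to carry over to normalize_url() and set_page_base().

```perl
use strict;
use warnings;

# Hypothetical stand-in that only mimics the documented return convention;
# the real functions are provided by HTML::Inspect::Normalize.
sub demo_normalize {
    my ($url) = @_;
    my ($norm, $rc, $err) = $url =~ m{^https?://}
        ? ($url, 'OK', undef)                               # assumed success values
        : (undef, 'DEMO_ERROR', "unsupported url '$url'");  # assumed error values

    return ($norm, $rc, $err) if wantarray;   # LIST context: caller inspects rc and errmsg
    die "$err\n" unless defined $norm;        # SCALAR context: problems are thrown
    return $norm;
}

# LIST context: no exception is thrown, so check the result yourself.
my ($norm, $rc, $err) = demo_normalize('ftp://example.com/');
print "rejected: $err ($rc)\n" unless defined $norm;

# SCALAR context: wrap the call in eval to catch the exception.
my $ok = eval { demo_normalize('ftp://example.com/') };
print "caught: $@" unless defined $ok;
```

LIST context suits bulk processing, where a bad url should be logged and skipped; SCALAR context suits code that prefers to fail early.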
DETAILS
See also https://pipeline.shared-search.eu/extract/normalize.html
The following actions are taken:
leading and trailing blanks are stripped
white-space characters (CR, LF, TAB, VTAB) are removed, together with the blanks which follow them
relative urls are converted to absolute
'+' and embedded blanks are converted to %20
hex encodings of normal characters (which include comma and more) are converted back into the characters themselves
characters which need to be encoded are converted to hex
hex digits are upper-cased
utf8 characters get hex encoded
hex encoding must be valid utf8, possibly multi-byte
fragment is removed
empty path will become '/'
./ and ../ are removed
repeated slashes are removed
hostnames with utf8 get IDN encoded
hostname syntax verified
remove trailing dot from hostname
default port numbers removed
leading zeros in port numbers are removed, and port numbers are restricted to a maximum of 65535
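A few of these actions can be sketched in plain Perl. The helper sketch_normalize() is hypothetical and is not how the module is implemented; it assumes an absolute http or https url without userinfo or IPv6 literals, and covers only blank stripping, fragment removal, hex upper-casing, the trailing hostname dot, port handling, and the empty path.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT the module's implementation: applies a handful of
# the listed actions to an absolute http/https url using plain regexes.
sub sketch_normalize {
    my ($url) = @_;

    $url =~ s/^\s+|\s+$//g;                       # strip leading and trailing blanks
    $url =~ s/#.*//s;                             # remove the fragment
    $url =~ s/%([0-9a-fA-F]{2})/'%' . uc $1/ge;   # upper-case hex digits

    my ($scheme, $host, $port, $rest) =
        $url =~ m{^(https?)://([^/:?]+)(?::(\d*))?(.*)$}is
        or return;                                # not an absolute http(s) url

    $host =~ s/\.$//;                             # remove trailing dot from hostname

    if (defined $port && length $port) {
        $port =~ s/^0+(?=\d)//;                   # remove leading zeros
        return if $port > 65535;                  # restrict to max 65535
        my $default = lc($scheme) eq 'https' ? 443 : 80;
        undef $port if $port == $default;         # remove default port numbers
    }
    else {
        undef $port;                              # a bare ':' carries no port
    }

    $rest = "/$rest" if $rest !~ m{^/};           # empty path becomes '/'

    return "$scheme://$host" . (defined $port ? ":$port" : '') . $rest;
}

print sketch_normalize('  http://example.com.:0080/a%2fb#top  '), "\n";
# prints http://example.com/a%2Fb
```

The sketch returns nothing for input it cannot handle, whereas the module reports an rc and errmsg; relative urls, IDN hostnames, and utf8 validation are left out on purpose.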
SEE ALSO
This module is part of HTML-Inspect distribution version 1.00, built on December 08, 2021. Website: http://perl.overmeer.net/CPAN/
LICENSE
Copyrights 2021 by [Mark Overmeer <markov@cpan.org>]. For other contributors see ChangeLog.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See http://dev.perl.org/licenses/