NAME

Email::Extractor::Utils - Set of functions that can be useful when building web crawlers

VERSION

version 0.03

SYNOPSIS

use Email::Extractor::Utils qw( looks_like_url looks_like_file get_file_uri load_addr_to_str );
# or use Email::Extractor::Utils qw[:ALL];
$Email::Extractor::Utils::Verbose = 1;

my $text = load_addr_to_str($url);

$Email::Extractor::Utils::Assets

List of asset extensions used by "drop_asset_links" in Email::Extractor::Utils

To see the default list of assets:

perl -Ilib -E "use Email::Extractor::Utils qw(:ALL); use Data::Dumper; print Dumper $Email::Extractor::Utils::Assets;"

load_addr_to_str

Accept a URI or a file path and return its content as a string

my $text = load_addr_to_str($url);
my $text = load_addr_to_str($path_to_file);

The function accepts both http(s) URIs and file paths.

Dies if the file does not exist.

Returns $resp->content even if the URL does not exist.

If verbose mode is enabled, prints the time of the request.

It can also be used in tests when you need to mock HTTP requests.
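
The dispatch described above (URL or file path, die on a missing file, return the body whatever the HTTP status) can be sketched roughly as follows. This is not the module's code: sketch_load_addr is a hypothetical name, and HTTP::Tiny is swapped in here as an assumption because it ships with Perl; the module itself may use a different HTTP client.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Hypothetical helper, not part of Email::Extractor::Utils.
sub sketch_load_addr {
    my ($addr) = @_;

    # http(s) address: return the body whatever the status was
    if ( $addr =~ m{\Ahttps?://}i ) {
        my $resp = HTTP::Tiny->new->get($addr);
        return $resp->{content};
    }

    # otherwise treat it as a file path (a file:// prefix is tolerated)
    ( my $path = $addr ) =~ s{\Afile://}{};
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;    # slurp mode: read the whole file at once
    my $content = <$fh>;
    close $fh;
    return $content;
}
```

Because a file path is accepted wherever a URL is, a test can point such a function at a saved HTML fixture instead of a live site.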

get_abs_path

Return the absolute path of a file, resolved relative to the current working directory

get_file_uri

Make an absolute path from a path relative to the cwd and return it as a file URI that passes Regexp::Common::URI::file validation

get_file_uri('/test')   # 'file:///root/test' if cwd is /root
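
A minimal sketch of how such a file URI can be built with core modules only. sketch_file_uri is a hypothetical name, and the exact way the module joins paths is an assumption:

```perl
use strict;
use warnings;
use Cwd qw(getcwd);
use File::Spec;

# Hypothetical stand-in, not the module's code: join the path onto the
# cwd, clean up duplicate slashes, and prefix the file:// scheme.
sub sketch_file_uri {
    my ($path) = @_;
    my $abs = File::Spec->canonpath( File::Spec->catfile( getcwd(), $path ) );
    return 'file://' . $abs;
}
```

On a Unix-like system the absolute path starts with a slash, so the result always begins with file:/// followed by the cwd.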

looks_like_url

looks_like_url('http://example.com')      # 1
looks_like_url('https://example.com')      # 1
looks_like_url('/root/somefolder')        # 0

Detect whether a link is an http or https URL

Uses Regexp::Common::URI::http

Return:

0 if the provided string is not a URL

the URL without its query string (capture group $7 of Regexp::Common::URI::http, see https://metacpan.org/pod/Regexp::Common::URI::http#$7) if the provided string is a URL
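
The return convention listed above (0 for a non-URL, otherwise the URL with its query string cut off) can be illustrated with a simplified sketch. The hand-written pattern below is a stand-in and is much looser than Regexp::Common::URI::http; sketch_looks_like_url is a hypothetical name:

```perl
use strict;
use warnings;

# Simplified stand-in for the Regexp::Common::URI::http pattern:
# scheme, host (with optional port), optional path, optional query.
sub sketch_looks_like_url {
    my ($str) = @_;
    return 0 unless defined $str;
    if ( $str =~ m{\A(https?://[^/?#\s]+(?:/[^?#\s]*)?)(?:\?[^#\s]*)?\z}i ) {
        return $1;    # the URL with the query string cut off
    }
    return 0;
}
```

Since any matched URL is a non-empty string, the result can be used directly in a boolean test.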

looks_like_rel_link

Return true if the link looks like a relative URL, otherwise return false

looks_like_file

looks_like_file('http://example.com')             # 0
looks_like_file('file:///root/somefolder')        # 1

Detect whether a string is a file URI or not

Uses Regexp::Common::URI::file

absolutize_links_array

Make all links in the array absolute

my $res = absolutize_links_array( $links, 'http://example.com' );

$links must be an ARRAYREF; an ARRAYREF is also returned

remove_external_links

my $res = remove_external_links( $links, 'http://example.com' );  # leave only links on http://example.com

Relative links stay untouched

$links must be an ARRAYREF; an ARRAYREF is also returned
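
The filtering rule above can be sketched as follows: a link without a scheme counts as relative and is kept, while an absolute link is kept only if it starts with the base URL. sketch_remove_external_links is a hypothetical name and the matching rule is an assumption, not the module's implementation:

```perl
use strict;
use warnings;

# Hypothetical stand-in, not the module's code.
sub sketch_remove_external_links {
    my ( $links, $base ) = @_;
    $base =~ s{/+\z}{};    # tolerate a trailing slash on the base
    my @kept = grep {
        !m{\A[a-z][a-z0-9+.-]*://}i            # no scheme: relative, keep
          || m{\A\Q$base\E(?:[/?#]|\z)}i       # same site as $base
    } @$links;
    return \@kept;
}
```

Requiring a slash, question mark, hash, or end of string right after the base keeps a look-alike host such as example.com.evil.org from passing as internal.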

drop_asset_links

my $res = drop_asset_links($links);

Leave only links that are not related to assets. Query params are removed as well.

$links must be an ARRAYREF; an ARRAYREF is also returned
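
A rough sketch of this kind of filter, assuming an asset is recognised by its file extension once the query string is cut off. The extension list here is hypothetical (the module keeps its own in $Email::Extractor::Utils::Assets), and so is the name sketch_drop_asset_links:

```perl
use strict;
use warnings;

# Hypothetical extension list, for illustration only.
my @asset_ext = qw(css js png jpg jpeg gif ico svg);

sub sketch_drop_asset_links {
    my ($links) = @_;
    my $ext_re = join '|', map { quotemeta } @asset_ext;
    my @kept;
    for my $link (@$links) {
        ( my $clean = $link ) =~ s/\?.*\z//s;    # cut the query string
        next if $clean =~ /\.(?:$ext_re)\z/i;    # asset: skip it
        push @kept, $clean;
    }
    return \@kept;
}
```

Cutting the query string before the extension check means a link like /style.css?v=2 is still recognised as an asset.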

drop_anchor_links

my $res = drop_anchor_links($links);

Leave only links that are not anchors to the same page (an anchor link looks like #rec31047364)

$links must be an ARRAYREF; an ARRAYREF is also returned
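
For illustration, a same-page anchor can be recognised by a leading # character. The sketch below assumes that rule; sketch_drop_anchor_links is a hypothetical name, not the module's code:

```perl
use strict;
use warnings;

# Hypothetical stand-in: drop every link that starts with '#'.
sub sketch_drop_anchor_links {
    my ($links) = @_;
    return [ grep { !/\A#/ } @$links ];
}
```

A link such as /page#section is kept, because only links that consist of a bare fragment point back to the same page.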

remove_query_params

Remove GET query params from the provided array of links

my $res = remove_query_params($links);

$links must be an ARRAYREF; an ARRAYREF is also returned

find_all_links

Find all links and return the href attributes of the a tags

Return ARRAYREF
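
As a rough illustration of href extraction, the sketch below uses a regular expression. A regex is only a stand-in for a real HTML parser and handles only quoted href values; how the module parses HTML is not stated here, and sketch_find_all_links is a hypothetical name:

```perl
use strict;
use warnings;

# Hypothetical stand-in, not the module's code: collect the quoted
# href value of every <a ...> tag found in the HTML string.
sub sketch_find_all_links {
    my ($html) = @_;
    my @hrefs;
    while ( $html =~ m{<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1}gis ) {
        push @hrefs, $2;    # $1 is the quote character, $2 the value
    }
    return \@hrefs;
}
```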

find_links_by_text

find_links_by_text($html, $a_text, <$upper_lower_case_flag> )

Find all a tags containing the given text and return their href values

If no search text is specified, return all links

Currently not used in the Email::Extractor project because of its unexpected behaviour (see tests)

Return ARRAYREF

TO-DO: try to implement this method with HTML::LinkExtor

isin($str, $arrayref)

isin( $str, $arrayref )

Check whether $str is contained in $arrayref

Return true/false.
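
A membership check of this kind can be sketched with grep. This assumes string equality is the comparison; sketch_isin is a hypothetical name, not the module's code:

```perl
use strict;
use warnings;

# Hypothetical stand-in: 1 if $str equals any element of the array.
sub sketch_isin {
    my ( $str, $arrayref ) = @_;
    return ( grep { $_ eq $str } @$arrayref ) ? 1 : 0;
}
```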

DESCRIPTION

Set of useful utilities that work with HTML and URLs

AUTHOR

Pavel Serikov <pavelsr@cpan.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2018 by Pavel Serikov.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.
