NAME
WWW::LinkRot - check web page link rot
SYNOPSIS
use WWW::LinkRot;
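As a rough sketch of typical usage, assuming the three functions are imported as below and using made-up file names (see the individual function descriptions for details):

use WWW::LinkRot qw(get_links check_links html_report);
# Scan the HTML files, check the links found in them, then write an HTML report.
my @files = ('index.html', 'about.html');
my $links = get_links (\@files);
check_links ($links, out => 'link-statuses.json');
html_report (in => 'link-statuses.json', out => 'report.html');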
VERSION
This documents version 0.02 of WWW-LinkRot corresponding to git commit e07a0ffb766775fc053e9820edf1f874ee40b78c released on Fri Apr 23 08:30:11 2021 +0900.
DESCRIPTION
Scan HTML files for links, try to access the links, and make a report.
The HTML files need to be in UTF-8 encoding.
This module is intended for people who run web sites and want, for example, to run periodic checks over a large number of HTML files: first find all of the external links in those files, then test each link in that list to make sure that it is actually valid.
The reading function is "get_links", which works on a list of file names, such as might be created by a module like Trav::Dir or File::Find. It looks for any https?:// links in the files and makes a list of them.
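For example, the list of file names might be built with File::Find along the following lines (the directory name here is only an illustration):

use File::Find;
use WWW::LinkRot 'get_links';
my @files;
# Collect all of the .html files under the web site directory.
find (sub { push @files, $File::Find::name if /\.html$/ }, '/home/users/jason/website');
my $links = get_links (\@files);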
The list of links may then be checked for validity using "check_links", which runs the get method of "LWP::UserAgent" on each link and stores the status. This outputs a JSON file containing each link, its status, its location (for redirects), and the files which contain it.
The function "html_report" generates an HTML representation of the JSON file.
The function "replace" is a batch editing function which inputs a list of links and a list of files, then substitutes the redirected links (the ones with status 301 or 302) with their replacement.
FUNCTIONS
check_links
check_links ($links);
Check the links returned by "get_links" and write the results to the JSON file specified by the out option.
check_links ($links, out => "link-statuses.json");
Usually one would filter the links returned by "get_links" to remove things like internal links.
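For instance, links back to one's own site might be dropped before checking, along the lines of the following sketch (the domain is made up, and this is only one way to do the filtering):

# Keep only the external links, dropping links to our own site.
my %external;
for my $link (keys %$links) {
    next if $link =~ m!^https?://(www\.)?example\.com!;
    $external{$link} = $links->{$link};
}
check_links (\%external, out => 'link-statuses.json');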
Options
- nook
If this is set to a true value, then before running the link checks, check_links reads in a previous copy of the file specified by the out option, and if the status for a link is 200 it doesn't try to access that link again but assumes it is still OK. This option is useful when one has recently run the job, then done some work on fixing the dead or moved links, and wants to check whether those errors were fixed without re-checking all of the links which were already OK. (See the example after this list.)
- out
Specify the file to write the JSON report to. If this is not specified, check_links will fail.
- verbose
Print messages about what is to be done. Since checking the links might take a long time, this is sometimes reassuring.
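Combining the options above, a re-run over an existing report might look like the following (the output file name is just an example):

# Re-check only the links which were not previously OK, printing progress messages.
check_links ($links, out => 'link-statuses.json', nook => 1, verbose => 1);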
The user agent
The user agent used by WWW::LinkRot is "LWP::UserAgent" with the timeout
option set to 5 seconds and the number of redirects set to zero. If a timeout is not used, check_links
may take a very long time to run. However, some links, such as archive.org links, may take more than five seconds to respond.
The user agent string is set to "WWW::LinkRot".
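As an illustration only, and not the module's actual code, a user agent configured in the way described above would look something like this:

use LWP::UserAgent;
# Five-second timeout, no automatic redirects, agent string "WWW::LinkRot".
my $ua = LWP::UserAgent->new (
    timeout => 5,
    max_redirect => 0,
    agent => 'WWW::LinkRot',
);
my $response = $ua->get ('https://www.example.com/');
print $response->code, "\n";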
get_links
my $links = get_links (\@files);
Given a list of HTML files in @files, extract all the links from them. The return value $links is a hash reference whose keys are the links and whose values are array references listing all of the files of @files which contain the link.
This looks for anything of the form href="*"
in the files and adds what is between the quotes to the list of links.
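For illustration, with made-up file and link names, the return value has a shape like this:

$links = {
    'https://www.example.com/' => ['index.html', 'about.html'],
    'https://metacpan.org/' => ['about.html'],
};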
html_report
html_report (in => 'link-statuses.json', out => 'report.html');
Write an HTML report using the JSON output by "check_links". The report consists of header HTML generated by "HTML::Make::Page" followed by a table with one row for each link, giving the link, its status, and the pages where it is used.
Options
- in
The input JSON file.
- nofiles
If set to a true value, don't add the final "files" column. For example this may be used if only checking a single file for dead links.
- out
The output HTML file.
- strip
Part of the file name which needs to be stripped from the file names to make a URL, like "/home/users/jason/website".
- url
Part of the URL which needs to be added to the file names to make a URL, like "https://www.example.com/site". (See the example after this list.)
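Putting the options together, a call using strip and url might look like this (the paths and URL are the examples given above):

html_report (
    in => 'link-statuses.json',
    out => 'report.html',
    strip => '/home/users/jason/website',
    url => 'https://www.example.com/site',
);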
The output HTML file
Moved links are coloured pink, and dead links are coloured yellow.
Links are cut down to a maximum length of 100 characters.
replace
replace (\%links, \@files, %options);
Make a regex of links which have 30* (redirect) statuses such as 301 and 302 and which also have a valid location, then go through @files and replace those links with their new locations.
Options are
- verbose
Print messages about the links and the files being edited.
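As a very rough sketch only, assuming that the hash reference expected by replace is the structure read back from the JSON report written by check_links:

use JSON::Parse 'json_file_to_perl';
use WWW::LinkRot 'replace';
# Read back the link statuses written by check_links (structure assumed here).
my $checked = json_file_to_perl ('link-statuses.json');
# Rewrite the 301/302 links in the HTML files to point at their new locations.
replace ($checked, \@files, verbose => 1);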
DEPENDENCIES
- Convert::Moji
This is used to make the regex used by "replace".
- File::Slurper
This is used for reading and writing files.
- HTML::Make
This is used to make the HTML report about the links.
- HTML::Make::Page
This is used to make the HTML report about the links.
- JSON::Create
This is used to make the report file about the links.
- JSON::Parse
This is used to read back the JSON report.
- LWP::UserAgent
This is used to check the links.
SEE ALSO
CPAN
Other
- Xenu's link sleuth
We used this more than ten years ago and it seemed to work very well. It hasn't been updated in ten years, though.
- W3C Link Checker
A web site which checks the links on your website.
AUTHOR
Ben Bullock, <bkb@cpan.org>
COPYRIGHT & LICENCE
This package and associated files are copyright (C) 2021 Ben Bullock.
You can use, copy, modify and redistribute this package and associated files under the Perl Artistic Licence or the GNU General Public Licence.