NAME

anarch - A script for creating offline copies of websites

VERSION

0.03 (alpha)

SYNOPSIS

anarch -start=http://www.example.com/some_page.html           \
     [ -root=http://www.example.com ]                         \
     [ -exclude='^http://www\.example\.com/dont-want-this/' ] \
     [ -save-as=folder\ name ]                                \
     [ -depth=5 ]                                             \
     [ -run-scripts ]                                         \
     [ -remove-scripts ]                                      \
     [ -dom ]                                                 \
     [ -sync ]                                                \
     [ -add-extensions ]

DESCRIPTION

anarch is a script for creating offline copies of websites. It downloads a website, rewriting links in pages and style sheets so that links within the root directory are relative and links outside it are absolute, and removing '<base href>' tags. It can also run scripts in pages (to find out which files the scripts use, or to save pages with generated content) and remove scripts.

OPTIONS

-start=...

The page to start on

-root=...

Only fetch URLs that begin with this prefix. If this is omitted, the value of -start is used, trimmed to its last slash.

-exclude=...

A regular expression matching URLs to be excluded.
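As a rough way to preview what an exclusion pattern will match before starting a download, the pattern can be tried against a few sample URLs in the shell. This is only a sketch: is_excluded is a hypothetical helper, not part of anarch, and it uses grep -E, whose extended regular expressions agree with Perl's for a simple anchored pattern like the one in the synopsis but not for every pattern.

```shell
# The pattern from the synopsis: anchored at the start, dots escaped.
exclude='^http://www\.example\.com/dont-want-this/'

# Hypothetical helper (not part of anarch): succeeds if the URL matches.
is_excluded() {
  printf '%s\n' "$1" | grep -Eq "$exclude"
}

for url in \
  'http://www.example.com/index.html' \
  'http://www.example.com/dont-want-this/page.html' \
  'http://www.example.com/keep/dont-want-this/page.html'
do
  if is_excluded "$url"; then
    echo "excluded: $url"
  else
    echo "kept:     $url"
  fi
done
```

Because the pattern is anchored with ^, only the second URL is reported as excluded; the third contains the same path segment further along and is kept.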

-save-as=...

The folder in which to save the copy. If this is omitted, the last path segment of -root is used.
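The defaults for -root and -save-as can be approximated with shell parameter expansion. This is a sketch of the rule as described above, not anarch's actual code; in particular, whether the trailing slash is kept when -start is trimmed is an assumption (here it is dropped), and the example URL is made up.

```shell
# Example start page (illustrative only).
start='http://www.example.com/docs/some_page.html'

# Default -root: -start trimmed to its last slash
# (assumption: the slash itself is removed).
root=${start%/*}

# Default -save-as: the last path segment of -root.
save_as=${root##*/}

printf 'root:    %s\n' "$root"      # root:    http://www.example.com/docs
printf 'save-as: %s\n' "$save_as"   # save-as: docs
```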

-depth=...

The maximum number of links to follow in succession before backing up.

-run-scripts

Run scripts in HTML pages. This is used to find out which files the scripts need, so that those can be fetched as well. It is not guaranteed to work, as some scripts have absolute URLs hard-coded.

-remove-scripts

Remove scripts from HTML pages. This can be used in conjunction with -run-scripts to save generated content while removing the scripts that generated it. -remove-scripts implies -dom.

-dom

Save the HTML DOM, possibly modified by scripts or -remove-scripts. Without this, the only changes to the DOM that are saved are those made to links to make them relative.

-sync

Synchronise mode: download only those files that have changed since the last download.

-add-extensions

This feature is highly experimental. Use it at your own risk!

This adds extensions to files based on the MIME type, if the file does not already have an appropriate extension. (This exists for the sake of picky web browsers that refuse to open certain files.)

It currently has four problems: the start page never gets an extension added; index.html is sometimes saved as index.html.html; images and style sheets get extensions added only if they are linked to directly (via <a href...>); and far more HEAD requests are made than necessary, which slows things down to a crawl.
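The intended behaviour (add an extension only when the file lacks one appropriate to its MIME type) can be sketched as a small shell function. This is an illustration, not anarch's implementation: add_extension is a hypothetical helper, and the MIME-to-extension table is a made-up minimal example.

```shell
# Hypothetical helper (not part of anarch).
# Usage: add_extension FILENAME MIME_TYPE
add_extension() {
  file=$1 mime=$2
  # Illustrative table: acceptable extensions per type, preferred one first.
  case $mime in
    text/html)  exts='html htm' ;;
    text/css)   exts='css' ;;
    image/png)  exts='png' ;;
    image/jpeg) exts='jpg jpeg' ;;
    *)          printf '%s\n' "$file"; return ;;   # unknown type: leave as is
  esac
  # Leave the name alone if it already ends in an acceptable extension.
  for ext in $exts; do
    case $file in
      *."$ext") printf '%s\n' "$file"; return ;;
    esac
  done
  # Otherwise append the preferred (first) extension.
  printf '%s.%s\n' "$file" "${exts%% *}"
}

add_extension index.html text/html    # index.html
add_extension about text/html         # about.html
add_extension photo image/jpeg        # photo.jpg
```

Checking the existing extension first is what would avoid results such as index.html.html.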

BUGS

If you find any bugs, please e-mail the author.

This program doesn't take redirection into account.

It doesn't work with pages that have an explicit encoding in the source code.

It doesn't work with pages not in UTF-8. (It would need to apply a charset attribute to various elements or save the page in the original encoding.)

Style attributes containing URLs get mangled.

When URLs in CSS style sheets are made relative, they are not properly escaped, so quotation marks may produce invalid CSS.

The -run-scripts option causes the script to eat up all your memory.

If a script browses to another page and -dom or -remove-scripts is specified, then the wrong DOM tree is serialised.

There is no way to set the local IP address to bind to. But you can use this nutty workaround, which requires Hook::WrapSub:

perl -sS -MHook::WrapSub -MIO::Socket::INET \
  -M'less Hook::WrapSub::wrap_subs
    sub{push @_, LocalAddr => "10.10.10.205" },
    "IO::Socket::INET::new"' \
  anarch ...

There are not enough tests yet.

If you run anarch as an argument to perl, and you are using perl 5.8.8 or lower, you will need to call perl with the -s option.

SINE QUIBUS NON

This program requires perl 5.8.3 or higher (5.8.4 or higher recommended) and the following CPAN modules:

WWW::Scripter

CSS::DOM 0.03 or higher

URI

File::Slurp

LWP 5.815 or higher

HTML::DOM 0.025 or higher

WWW::Scripter::Plugin::Ajax is required for the -run-scripts option to work.

AUTHOR & COPYLEFT

Copyright (C) 2009, Father Chrysostomos (sprout at, um, cpan dot org)

This program is free software; you may redistribute or modify it (or both) under the same terms as perl.