NAME

scrape - command-line frontend to HTML::ListScraper

SYNOPSIS

 scrape --core=all sample.html

 scrape --core=list [ --min-count=10 ] [ --detail=all ] [ --shapeless ]
	[ --ignore=b,i,em,strong,wbr ] [ --export=seq.txt ] sample.html

 scrape --core=item --import=seq.txt sample.html

 scrape --whole sample.html

DESCRIPTION

This script processes an HTML page with HTML::ListScraper and shows the result as YAML (down to the tag sequences, which are YAML scalars formatted by HTML::ListScraper::Interactive). It is meant for interactive exploration of HTML::ListScraper results and for fine-tuning its settings for a specific scraping application.

For every invocation, the single source file is mandatory. All other command-line switches are optional and mutually independent. Note that with no switches, the script doesn't output anything. The switches are as follows:

--core

Shows the repeated sequences that were found. The value is a string, one of:

item (or just "i")

Show only the first sequence instance.

list (or just "l")

Show all instances of the first sequence.

all (or just "a")

Show all instances of all found sequences.

By default, no matches are shown. When they are shown, a YAML document, corresponding to an HTML::ListScraper::Sequence, has the sequence length as the YAML field len, the repeat count as the field count, and then a YAML sequence whose items correspond to HTML::ListScraper::Instance. Each item has a start field with the starting position and a match field with the actual tag sequence. The tag sequence is formatted by HTML::ListScraper::Interactive::format_tags, with formatting options that depend on the value of the --detail command-line switch.
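As an illustration of the three --core values and their one-letter forms listed above, here is a minimal Python sketch. The helper name normalize_core is hypothetical and is not part of scrape or HTML::ListScraper; it only models the documented choices and assumes that nothing other than the full name or its first letter is accepted.

```python
# Hypothetical helper, not part of scrape: models the documented --core values.
CORE_VALUES = ("item", "list", "all")


def normalize_core(value):
    """Return the full --core name for a full name or its one-letter form."""
    for name in CORE_VALUES:
        # Assumption: only the full name or its first letter is accepted.
        if value == name or value == name[0]:
            return name
    raise ValueError(f"unknown --core value: {value!r}")


print(normalize_core("l"))
```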

--shapeless

Boolean switch, sets HTML::ListScraper::shapeless to true.

--min-count

The value is an integer greater than 1, used to set HTML::ListScraper::min_count.

--detail

Specifies formatting of found tag sequences. Value is a string, one of

none

Don't show the matches at all. This is useful to see just how many sequences were found and how many instances they have.

tags

Show just the tags, without text and links. This is the default value.

text

Show tags and text.

attributes

Show tags with links.

all

Show all fields of HTML::ListScraper::Tag: tags, text and links.
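The --detail levels above can be summarized as a lookup from level to the parts of each tag that are shown. The following Python sketch only restates the descriptions in this section; the names DETAIL_FIELDS and shown_fields are hypothetical and do not exist in scrape.

```python
# Hypothetical summary table: parts shown at each documented --detail level.
DETAIL_FIELDS = {
    "none": set(),                       # matches are not shown at all
    "tags": {"tags"},                    # the default
    "text": {"tags", "text"},
    "attributes": {"tags", "links"},
    "all": {"tags", "text", "links"},
}
DEFAULT_DETAIL = "tags"


def shown_fields(detail=DEFAULT_DETAIL):
    """Return the set of parts shown for a given --detail value."""
    if detail not in DETAIL_FIELDS:
        raise ValueError(f"unknown --detail value: {detail!r}")
    return DETAIL_FIELDS[detail]


print(sorted(shown_fields("attributes")))
```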

--whole

Boolean switch. When used, scrape outputs the whole sequence maintained by HTML::ListScraper as the first YAML document, which contains a single YAML scalar. Note that the sequence is formatted without attributes, without text and with line numbers, irrespective of the value of --detail.

--ignore

A comma-separated list of tags the HTML parser should ignore. The list items must not contain slashes or angle brackets. For every name in the list, both the opening and the closing tag are ignored. The default is b, i, em, strong; when specifying the value explicitly, you probably want to include these tags in it.
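To make the rules for the --ignore value concrete, here is a small Python sketch that splits such a value and rejects names containing slashes or angle brackets. It is not how scrape itself parses the option: parse_ignore is a hypothetical name, and raising an error on an invalid name and dropping empty items are assumptions.

```python
# Hypothetical sketch of the --ignore rules; not scrape's actual parsing code.
DEFAULT_IGNORE = ["b", "i", "em", "strong"]


def parse_ignore(value=None):
    """Split a comma-separated tag list, falling back to the documented default."""
    if value is None:
        return list(DEFAULT_IGNORE)
    # Assumption: empty items (e.g. from a trailing comma) are dropped.
    names = [name for name in value.split(",") if name]
    for name in names:
        # Names are bare tag names: no slashes and no angle brackets.
        if any(ch in name for ch in "/<>"):
            raise ValueError(f"invalid tag name in --ignore: {name!r}")
    return names


# An explicit value replaces the default, so the default tags are repeated here.
print(parse_ignore("b,i,em,strong,wbr"))
```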

--export

Instructs scrape to dump the first found sequence into the file specified by the option's value. If the file already exists, it's overwritten. When no sequence is found, nothing is dumped. Note that the sequence is formatted with just tags, irrespective of the value of --detail.

--import

Instructs scrape to call HTML::ListScraper::get_known_sequence instead of HTML::ListScraper::get_sequences, with arguments read from the file specified by the option's value. Lines of that file are converted to tag names by HTML::ListScraper::Interactive::canonicalize_tags.
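The --export and --import switches pair naturally: one run dumps the first sequence found, and a later run reads it back. The Python sketch below only assembles the two command lines from the switches documented here, using the file names from the SYNOPSIS; the function names are hypothetical, and it does not run scrape.

```python
import shlex


# Hypothetical helpers: they build argument lists, they do not invoke scrape.
def export_command(html, seq_file, min_count=None):
    """Command line that dumps the first sequence found to seq_file."""
    argv = ["scrape", "--core=list"]
    if min_count is not None:
        argv.append(f"--min-count={min_count}")
    argv += [f"--export={seq_file}", html]
    return argv


def import_command(html, seq_file):
    """Command line that reads a known sequence back from seq_file."""
    return ["scrape", "--core=item", f"--import={seq_file}", html]


print(shlex.join(export_command("sample.html", "seq.txt", min_count=10)))
print(shlex.join(import_command("sample.html", "seq.txt")))
```

Because --export always writes just the tags, the exported file does not depend on the --detail value used in the first run.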

AUTHOR

Vaclav Barta, <vbar@comp.cz>

COPYRIGHT & LICENSE

Copyright 2007 Vaclav Barta, all rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.