NAME
Catmandu::Importer::MediaWiki - Catmandu importer that imports pages from MediaWiki
DESCRIPTION
This importer uses the MediaWiki query API to fetch the pages that match given criteria.
It retrieves the pages and their content by using MediaWiki generators:
http://www.mediawiki.org/wiki/API:Query#Generators
The default generator is 'allpages'.
The list of pages could also be retrieved with the module 'list':
http://www.mediawiki.org/wiki/API:Lists
But the module 'list' is very limited: it returns pages with only a small set of attributes (pageid, ns and title).
The module 'properties', on the other hand, lets you add properties to pages:
http://www.mediawiki.org/wiki/API:Properties
But its selection parameters (titles, pageids and revids) are too specific to run a query in one call: one would first have to execute a list query and then feed the resulting pageids to the module 'properties'.
Generators make it possible to execute a query and add properties to the resulting pages in a single call:
http://www.mediawiki.org/wiki/API:Query#Generators
These parameters are set automatically, and cannot be overwritten:
    action       = "query"
    indexpageids = 1
    generator    = <generate>
    format       = "json"
Additional parameters can be set in the constructor argument 'args'. The arguments for a generator originate from the list module with the same name, but must be prefixed with 'g'.
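The naming rule above can be sketched in plain Perl. This is only an illustration of how the fixed parameters and the 'g'-prefixed generator parameters combine into one query. It is not the module's internal code, and the helper name generator_params is hypothetical. The parameters aplimit and apprefix are those of the list module 'allpages'.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Catmandu::Importer::MediaWiki):
# turn list-module parameter names into generator parameter names
# by prefixing each name with 'g'.
sub generator_params {
    my (%list_params) = @_;
    return map { ("g$_" => $list_params{$_}) } keys %list_params;
}

my %query = (
    # parameters that the importer sets automatically
    action       => "query",
    indexpageids => 1,
    generator    => "allpages",
    format       => "json",
    # 'aplimit' and 'apprefix' become 'gaplimit' and 'gapprefix'
    generator_params(aplimit => 100, apprefix => "plato"),
    # the properties module is addressed without a prefix
    prop         => "revisions",
);

print "$_=$query{$_}\n" for sort keys %query;
```

Only the generator's own parameters get the prefix. Parameters of the properties module, such as prop and rvprop, keep their original names.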
ARGUMENTS
- generate
-
type: string
explanation: type of generator to use. For a complete list, see http://www.mediawiki.org/wiki/API:Lists. Because Catmandu::Iterable already defines 'generator', this parameter is named 'generate' instead.
default: 'allpages'.
- args
-
type: hash
explanation: extra arguments. These arguments are merged with the defaults.
default:
    {
        prop           => "revisions",
        rvprop         => "ids|flags|timestamp|user|comment|size|content",
        gaplimit       => 100,
        gapfilterredir => "nonredirects"
    }
which means:
    prop            add revisions to the list of page attributes
    rvprop          specific properties for the list of revisions
    gaplimit        limit for generator 'allpages' (every generator has its own limit)
    gapfilterredir  filter out redirect pages
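As a rough illustration of what "merged with the defaults" could look like, the sketch below does a shallow hash merge in plain Perl. That user-supplied keys take precedence over the defaults is an assumption here, and so is the idea that the module merges this way internally. The variable names are hypothetical.

```perl
use strict;
use warnings;

# Defaults as documented for the argument 'args'
my %defaults = (
    prop           => "revisions",
    rvprop         => "ids|flags|timestamp|user|comment|size|content",
    gaplimit       => 100,
    gapfilterredir => "nonredirects",
);

# Hypothetical user-supplied arguments
my %args = (
    gaplimit  => 10,
    gapprefix => "plato",
);

# Assumed shallow merge: keys listed later win, so %args overrides %defaults
my %merged = (%defaults, %args);

print "$_ => $merged{$_}\n" for sort keys %merged;
```

Under this assumption, passing only gaplimit and gapprefix leaves prop, rvprop and gapfilterredir at their default values.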
- lgname
-
type: string
explanation: login name. Only used when both lgname and lgpassword are set.
- lgpassword
-
type: string
explanation: login password. Only used when both lgname and lgpassword are set.
SYNOPSIS
    use Catmandu::Sane;
    use Catmandu::Importer::MediaWiki;

    binmode STDOUT, ":utf8";

    my $importer = Catmandu::Importer::MediaWiki->new(
        url      => "http://en.wikipedia.org/w/api.php",
        generate => "allpages",
        args     => {
            prop           => "revisions",
            rvprop         => "ids|flags|timestamp|user|comment|size|content",
            gaplimit       => 100,
            gapprefix      => "plato",
            gapfilterredir => "nonredirects"
        }
    );

    $importer->each(sub {
        my $r = shift;
        # content of the first revision in the record
        my $content = $r->{revisions}->[0]->{"*"};
        say $r->{title};
    });
AUTHORS
Nicolas Franck <nicolas.franck at ugent.be>