NAME

Parse::MediaWikiDump - Tools to process MediaWiki dump files

SYNOPSIS

use Parse::MediaWikiDump;

$source = 'dump_filename.ext';
$source = \*FILEHANDLE;

$pages = Parse::MediaWikiDump::Pages->new($source);
$links = Parse::MediaWikiDump::Links->new($source);

#get all the records from the dump files, one record at a time
while(defined($page = $pages->page)) {
  print "title '", $page->title, "' id ", $page->id, "\n";
}

while(defined($link = $links->link)) {
  print "link from ", $link->from, " to ", $link->to, "\n";
}

#information about the page dump file
$pages->sitename;
$pages->base;
$pages->generator;
$pages->case;
$pages->namespaces;

#information about a page record
$page->redirect;
$page->categories;
$page->title;
$page->id;
$page->revision_id;
$page->timestamp;
$page->username;
$page->userid;
$page->minor;
$page->text;

#information about a link
$link->from;
$link->to;

DESCRIPTION

This module provides the tools needed to process the contents of various MediaWiki dump files.

USAGE

To use this module you must create an instance of a parser for the type of dump file you are trying to parse. The current parsers are:

Parse::MediaWikiDump::Pages

Parse the contents of the page archive.

Parse::MediaWikiDump::Links

Parse the link list dump file. *WARNING* The dump format has changed and this software has not yet been updated to match it. Consequently, the most recent English Wikipedia links dump that can be parsed is from June 2005, which is out of sync with the current pages dump.

General

Both parsers require an argument to new that gives the location of the source data to parse; this argument can be either a filename or a reference to an already open filehandle. The entire software suite will die() on errors in the file, inconsistencies on the stack, and so on. If this concerns you, wrap the portion of your code that uses these calls in an eval block and check $@ afterwards.
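As a minimal sketch of trapping those die() calls, the script below runs the whole parse inside an eval block and reports the error instead of aborting. The helper name count_pages is hypothetical and not part of the module. Loading the module with require inside the eval is a design choice of this sketch, so that a missing installation is reported the same way as a parse failure.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): count the page records
# in $source. Returns ($count, undef) on success or (undef, $error)
# when anything inside the eval block called die().
sub count_pages {
  my ($source) = @_;
  my $count = 0;

  my $ok = eval {
    # Loading inside the eval means a missing module is trapped too.
    require Parse::MediaWikiDump;

    my $pages = Parse::MediaWikiDump::Pages->new($source);
    $count++ while defined($pages->page);
    1;    # reached only if nothing above died
  };
  my $error = $@;

  return ($count, undef) if $ok;
  return (undef, $error || 'unknown error');
}

if (@ARGV) {
  my ($count, $error) = count_pages($ARGV[0]);

  if (defined($error)) {
    chomp($error);
    print STDERR "parse failed: $error\n";
  } else {
    print "$count pages\n";
  }
}
```

Returning the error as a value keeps the decision about what to do next (retry, skip the file, exit) with the caller.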

Parse::MediaWikiDump::Pages

It is possible to create a Parse::MediaWikiDump::Pages object two ways:

$pages = Parse::MediaWikiDump::Pages->new($filename);
$pages = Parse::MediaWikiDump::Pages->new(\*FH);
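Because an open filehandle is accepted, a compressed dump can be streamed through an external decompressor rather than unpacked on disk first. The sketch below rests on several assumptions: a gzip binary is on the PATH, the helper name open_dump is hypothetical, and the parser accepts a lexical filehandle (itself a glob reference) in the same way as \*FH.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return a read filehandle for $path, decompressing
# on the fly when the name ends in .gz (assumes gzip is on the PATH).
sub open_dump {
  my ($path) = @_;
  my $fh;

  if ($path =~ /\.gz$/) {
    # List-form pipe open avoids passing the filename through a shell.
    open($fh, '-|', 'gzip', '-dc', $path)
      or die "could not start gzip for $path: $!";
  } else {
    open($fh, '<', $path) or die "could not open $path: $!";
  }

  return $fh;
}

if (@ARGV) {
  require Parse::MediaWikiDump;

  # Assumption: a lexical filehandle is accepted like \*FH.
  my $pages = Parse::MediaWikiDump::Pages->new(open_dump($ARGV[0]));
  print 'dump of ', $pages->sitename, "\n";
}
```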

After creation the following methods are available:

$pages->page

Returns the next available record from the dump file if it is available, otherwise returns undef. Records returned are instances of Parse::MediaWikiDump::page; see below for information on those objects.

$pages->sitename

Returns the plain-text name of the instance the dump is from.

$pages->base

Returns the base URL of the instance's website.

$pages->generator

Returns the version of the software that generated the file.

$pages->case

Returns the case-sensitivity configuration of the instance.

$pages->namespaces

Returns an array reference to the list of namespaces in the instance. Each namespace is stored as an array reference with two items: the first is the namespace number and the second is the namespace name. For namespace 0 the name is stored as an empty string ('').
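The nested structure can be walked as in the sketch below. The sample data is illustrative only and stands in for the value returned by $pages->namespaces, so the script runs without a dump file; the helper name namespace_map is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn the list of [number, name] pairs into a
# number => name hash for quick lookups.
sub namespace_map {
  my ($namespaces) = @_;
  my %name_for;

  foreach my $ns (@$namespaces) {
    my ($number, $name) = @$ns;
    $name_for{$number} = $name;
  }

  return \%name_for;
}

# Illustrative data in the documented shape; with a real dump this
# would be: my $namespaces = $pages->namespaces;
my $namespaces = [ [-1, 'Special'], [0, ''], [1, 'Talk'] ];
my $name_for = namespace_map($namespaces);

foreach my $number (sort { $a <=> $b } keys(%$name_for)) {
  my $name = $name_for->{$number};
  $name = '(main)' if $name eq '';    # namespace 0 has an empty name
  print "$number\t$name\n";
}
```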

Parse::MediaWikiDump::page

The Parse::MediaWikiDump::page object represents a distinct MediaWiki page, article, module, what have you. These objects are returned by the page method of a Parse::MediaWikiDump::Pages instance. The scalar returned is a reference to a hash that contains all the data of the page in a straightforward manner. While it is possible to access this hash directly, and doing so involves less overhead than using the methods below, it is beyond the scope of the interface and is undocumented.

Some of the methods below, such as namespace, redirect, and categories, require additional processing. In these cases the result is cached inside the object so the processing does not have to be redone. This is transparent to you; just know that you don't have to worry about optimizing calls to these methods to limit processing overhead.

The following methods are available:

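$page->id
$page->title
$page->namespace

Returns the name of the namespace the page belongs to, or an empty string ('') for the main namespace.

$page->text

Returns a reference to a scalar containing the plaintext of the page.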

$page->redirect

Returns the plain-text name of the article the page redirects to, or undef if the page is not a redirect.

$page->categories

Returns a reference to an array that contains a list of categories or undef if there are no categories.

$page->revision_id
$page->timestamp
$page->username
$page->userid
$page->minor
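Three of these accessors need some care: text returns a reference that must be dereferenced, and redirect and categories can return undef. The sketch below shows one way to handle all three. The helper describe_page and the stand-in class Local::MockPage are hypothetical; the mock exists only so the example runs without a dump file, and a real record from $pages->page would be passed in its place.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build a one-line summary of a page record.
sub describe_page {
  my ($page) = @_;

  my $text = $page->text;                # reference to a scalar
  my $length = length($$text);           # dereference to reach the text

  my $redirect = $page->redirect;        # undef unless a redirect
  my $categories = $page->categories;    # undef when there are none

  my $summary = $page->title . " ($length characters)";
  $summary .= ", redirects to '$redirect'" if defined($redirect);

  if (defined($categories)) {
    $summary .= ', categories: ' . join('; ', @$categories);
  } else {
    $summary .= ', uncategorized';
  }

  return $summary;
}

# Stand-in for a real page record, mimicking only the methods used above.
{
  package Local::MockPage;

  sub new {
    my ($class, %args) = @_;
    my $self = { %args };
    return bless($self, $class);
  }

  sub title      { return $_[0]->{title} }
  sub text       { return \$_[0]->{text} }
  sub redirect   { return $_[0]->{redirect} }
  sub categories { return $_[0]->{categories} }
}

my $mock = Local::MockPage->new(
  title      => 'Perl',
  text       => 'Perl is a programming language.',
  categories => ['Programming languages'],
);

print describe_page($mock), "\n";
```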

Parse::MediaWikiDump::Links

This module also takes either a filename or a reference to an already open filehandle. For example:

$links = Parse::MediaWikiDump::Links->new($filename);
$links = Parse::MediaWikiDump::Links->new(\*FH);

It is then possible to extract the links a single link at a time using the link method, which returns an instance of Parse::MediaWikiDump::link or undef when there is no more data. For instance:

while(defined($link = $links->link)) {
  print 'from ', $link->from, ' to ', $link->to, "\n";
}

Parse::MediaWikiDump::link

Instances of this class are returned by the link method of a Parse::MediaWikiDump::Links instance. The following methods are available:

$link->from
$link->to

These methods return the numerical id of the article that is linked from and to, respectively. It is possible to extract the values from the underlying data structure instead of using the object methods. While this can yield a speed increase, it is not part of the standard interface and so is undocumented.

EXAMPLES

Find uncategorized articles in the main namespace

#!/usr/bin/perl -w

use strict;
use Parse::MediaWikiDump;

my $file = shift(@ARGV) or die "must specify a MediaWiki dump file";
my $pages = Parse::MediaWikiDump::Pages->new($file);
my $page;

while(defined($page = $pages->page)) {
  #main namespace only           
  next unless $page->namespace eq '';

  print $page->title, "\n" unless defined($page->categories);
}

Find double redirects in the main namespace

#!/usr/bin/perl -w

use strict;
use Parse::MediaWikiDump;

my $file = shift(@ARGV) or die "must specify a MediaWiki dump file";
my $pages = Parse::MediaWikiDump::Pages->new($file);
my $page;
my %redirs;

while(defined($page = $pages->page)) {
  next unless $page->namespace eq '';
  next unless defined($page->redirect);

  my $title = $page->title;

  $redirs{$title} = $page->redirect;
}

foreach my $key (keys(%redirs)) {
  my $redirect = $redirs{$key};
  if (defined($redirs{$redirect})) {
    print "$key\n";
  }
}

Find the stub with the highest number of links to it

#!/usr/bin/perl -w

use strict;
use Parse::MediaWikiDump;

my $pages = Parse::MediaWikiDump::Pages->new(shift(@ARGV));
my $links = Parse::MediaWikiDump::Links->new(shift(@ARGV));
my %stubs;
my $page;
my $link;
my @list;

select(STDERR);
$| = 1;
print '';
select(STDOUT);

print STDERR "Locating stubs: ";

while(defined($page = $pages->page)) {
	next unless $page->namespace eq '';

	my $text = $page->text;

	next unless $$text =~ m/stub}}/i;

	my $title = $page->title;
	my $id = $page->id;

	$stubs{$id} = [$title, 0];
}

print STDERR scalar(keys(%stubs)), " stubs found\n";

print STDERR "Processing links: ";

while(defined($link = $links->link)) {
	my $to = $link->to;

	next unless defined($stubs{$to});

	$stubs{$to}->[1]++;
}

print STDERR "done\n";

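# collect the [title, count] pairs and sort by link count, highest first
@list = sort { $$b[1] <=> $$a[1] } values(%stubs);

# avoid reading from an empty list when the dump contains no stubs
die "no stubs found\n" unless @list;

my $stub = $list[0]->[0];
my $num_links = $list[0]->[1];

print "Most wanted stub: $stub with $num_links links\n";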

TODO

Optimization

It would be nice to increase the processing speed of the XML files, but short of an implementation using XS I'm not sure what to do.

Testing

This software has received only light testing consisting of multiple runs over the most recent English Wikipedia dump file: July 13, 2005.

AUTHOR

This module was created and documented by Tyler Riddle <triddle@gmail.org>.

BUGS

Please report any bugs or feature requests to bug-parse-mediawikidump@rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Parse-MediaWikiDump. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

Known Bugs

Parse::MediaWikiDump::Pages can only handle the current dumps, not the comprehensive dump files. For instance, http://download.wikimedia.org/wikipedia/en/20050713_pages_current.xml.gz is OK, but http://download.wikimedia.org/wikipedia/en/20050713_pages_full.xml.gz will most likely cause the program to abort early due to uncontrolled stack growth.

The format of the links dumps has changed and Parse::MediaWikiDump::Links cannot handle the new format. It will need to be modified for the new dump files.

COPYRIGHT & LICENSE

Copyright 2005 Tyler Riddle, all rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.