NAME

MediaWiki::DumpFile::Compat - Compatibility with Parse::MediaWikiDump

SYNOPSIS

use MediaWiki::DumpFile::Compat;

$pmwd = Parse::MediaWikiDump->new;

$pages = $pmwd->pages('pages-articles.xml');
$revisions = $pmwd->revisions('pages-articles.xml');
$links = $pmwd->links('links.sql');

ABOUT

This software suite provides the tools needed to process the contents of the XML page dump files and the SQL-based links dump file from a MediaWiki instance. This module is a compatibility layer between MediaWiki::DumpFile and Parse::MediaWikiDump: instead of "use Parse::MediaWikiDump;" you write "use MediaWiki::DumpFile::Compat;". The benefit of using the compatibility module is faster processing; see the MediaWiki::DumpFile::Benchmarks documentation for benchmark results.

MORE DOCUMENTATION

The original Parse::MediaWikiDump documentation is also available in this package; it has been updated to include new features introduced by MediaWiki::DumpFile. You can find the documentation in the following locations:

MediaWiki::DumpFile::Compat::Pages
MediaWiki::DumpFile::Compat::Revisions
MediaWiki::DumpFile::Compat::page

USAGE

This module is a factory class that allows you to create instances of the individual parser objects.

$pmwd->pages

Returns a Parse::MediaWikiDump::Pages object capable of parsing an article XML dump file with one revision per article.

$pmwd->revisions

Returns a Parse::MediaWikiDump::Revisions object capable of parsing an article XML dump file with multiple revisions per article.

$pmwd->links

Returns a Parse::MediaWikiDump::Links object capable of parsing an article links SQL dump file.
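As a minimal sketch of how a parser object created by the factory might be consumed: the helper name collect_titles is hypothetical, and the sketch assumes the next() and title() methods behave as described in the Parse::MediaWikiDump documentation, with next() returning undef once the dump is exhausted. The dump path is taken from the command line.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the title of every page in a dump.
# $pages is expected to be the object returned by $pmwd->pages(...).
# Assumes next() returns undef when no pages are left.
sub collect_titles {
    my ($pages) = @_;
    my @titles;
    while (defined(my $page = $pages->next)) {
        push @titles, $page->title;
    }
    return \@titles;
}

# Run only when a dump file is given on the command line.
if (@ARGV) {
    require MediaWiki::DumpFile::Compat;

    my $pmwd  = Parse::MediaWikiDump->new;
    my $pages = $pmwd->pages($ARGV[0]);
    print "$_\n" for @{ collect_titles($pages) };
}
```

The same loop shape should apply to the objects returned by revisions() and links(), although the accessors available on each returned item differ.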

General

All parser creation invocations require a location of source data to parse; this argument can be either a filename or a reference to an already open filehandle. This entire software suite will die() upon errors in the file or if internal inconsistencies are detected. If you need to recover from such failures, wrap the portion of your code that uses these calls in eval().
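The following sketch illustrates both points: passing an already open filehandle instead of a filename, and trapping a die() with eval(). The helper name parse_safely is hypothetical and not part of this package; the dump path comes from the command line.

```perl
use strict;
use warnings;

# Hypothetical helper: run a parsing callback and trap any die() raised
# by the parser. Returns (1, undef) on success or (0, $message) on failure.
sub parse_safely {
    my ($callback) = @_;
    my $ok = eval { $callback->(); 1 };
    return (1, undef) if $ok;
    my $error = $@;
    $error = 'unknown error' unless defined $error && length $error;
    return (0, $error);
}

# Run only when a dump file is given on the command line.
if (@ARGV) {
    require MediaWiki::DumpFile::Compat;

    # Pass a reference to an already open filehandle instead of a filename.
    open(my $fh, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";

    my ($ok, $error) = parse_safely(sub {
        my $pages = Parse::MediaWikiDump->new->pages($fh);
        while (defined(my $page = $pages->next)) {
            print $page->title, "\n";
        }
    });
    warn "Parsing failed: $error" unless $ok;
}
```

Appending "1" as the last statement of the eval block lets the caller distinguish a successful run from a failure without relying on $@ alone.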

COMPATIBILITY

Any deviation in the behavior of MediaWiki::DumpFile::Compat from Parse::MediaWikiDump that is not listed below is a bug. Please report it so that this package can act as a near-perfect stand-in for the original. Compatibility is verified by using the existing Parse::MediaWikiDump test suite with the following adjustments:

Parse::MediaWikiDump::Pages

  • Parse::MediaWikiDump did not need to load all revisions of an article into memory when processing dump files that contain more than one revision but this compatibility module does. The API does not change but the memory requirements for parsing those dump files certainly do. It is, however, highly unlikely that you will notice this as most of the documents with many revisions per article are so large that Parse::MediaWikiDump would not have been able to parse them in any reasonable timeframe.

  • The order of the results from namespaces() is now sorted by the namespace ID instead of being in document order.

Parse::MediaWikiDump::Links

  • The order of values from next() is now identical to the order in the SQL file.
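Code that relied on the document order of namespaces() can avoid depending on any ordering by building a lookup keyed by namespace ID. This is a sketch only: the helper name namespace_names is hypothetical, and it assumes namespaces() returns a reference to an array of [ID, name] pairs, as the Parse::MediaWikiDump documentation describes.

```perl
use strict;
use warnings;

# Hypothetical helper: map namespace IDs to namespace names.
# Assumes $pages->namespaces returns an array reference whose entries
# are themselves array references of the form [ $id, $name ].
sub namespace_names {
    my ($pages) = @_;
    my %name_for;
    for my $entry (@{ $pages->namespaces }) {
        my ($id, $name) = @$entry;
        $name_for{$id} = $name;
    }
    return \%name_for;
}

# Run only when a dump file is given on the command line.
if (@ARGV) {
    require MediaWiki::DumpFile::Compat;

    my $pages = Parse::MediaWikiDump->new->pages($ARGV[0]);
    my $names = namespace_names($pages);
    for my $id (sort { $a <=> $b } keys %$names) {
        print "$id\t$names->{$id}\n";
    }
}
```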

BUGS

  • The value of current_byte() wraps at around 2 gigabytes of input XML; see http://rt.cpan.org/Public/Bug/Display.html?id=56843

LIMITATIONS

  • This compatibility layer is not yet well tested.