NAME

WWW::phpBB - phpBB forum scraper

SYNOPSIS

use WWW::phpBB;

# scrape as guest
my $phpbb = WWW::phpBB->new(
    base_url => 'http://localhost/~stefan/forum1',
    db_host => 'localhost',
    db_user => 'stefan',
    db_passwd => 'somepass',
    db_database => 'stefan',
    db_prefix => 'phpbb2_',
 );

$phpbb->empty_tables();
$phpbb->get_users();
$phpbb->scrape_forum_common();

# scrape a German forum, logging in just to get the memberlist
my $phpbb = WWW::phpBB->new(
    base_url => 'http://localhost/~stefan/index.php?mforum=de',
    db_host => 'localhost',
    db_user => 'stefan',
    db_passwd => 'somepass',
    db_database => 'stefan',
    db_prefix => 'phpbb3_',
    post_date_format => qr/(\d+)\s+(\w+),\s+(\d+)\s+(\d+):(\d+)/,
    post_date_pos => [qw(day_of_month month_name year hour minutes)],
    forum_user => 'raDical',
    forum_passwd => 'lfdiugyh',
 );

$phpbb->empty_tables();
$phpbb->forum_login();
$phpbb->get_users();
$phpbb->forum_logout();
$phpbb->scrape_forum_common();

# update an already scraped forum, maybe as a daily cron job
# $phpbb->update_overwrite(1); # don't try to keep modified data
$phpbb->update_users();
$phpbb->update_forum_common();

DESCRIPTION

This module can be used to scrape a phpBB installation through its web interface. It requires a local phpBB setup, which will be overwritten, and it can only access what is available to a web browser (no private messages or user settings). Scraping is possible as a guest or as a logged-in member. If used with an administrator name and password, it copies all member e-mail addresses (not just the public ones), so members can request a new random password from the new installation and continue using the forum. The current implementation lacks search support, but this problem will disappear if you convert the forum to SMF. The "mforum" script is supported.

REQUIRED MODULES

WWW::Mechanize

Compress::Zlib

HTML::TokeParser::Simple

DBI

DBD::mysql
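
Before starting a scrape, it can help to confirm that the modules listed above are actually loadable. The following is a minimal sketch, not part of WWW::phpBB; the helper name missing_modules is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the names of the modules that fail to load.
sub missing_modules {
    my @names = @_;
    my @missing;
    for my $name (@names) {
        # Turn Foo::Bar into Foo/Bar.pm, the form require expects for a string.
        (my $file = "$name.pm") =~ s{::}{/}g;
        push @missing, $name unless eval { require $file; 1 };
    }
    return @missing;
}

my @required = qw(WWW::Mechanize Compress::Zlib HTML::TokeParser::Simple DBI DBD::mysql);
my @missing  = missing_modules(@required);
print @missing
    ? "Missing: @missing\n"
    : "All required modules are installed.\n";
```

Any module reported as missing can then be installed with the CPAN client of your choice.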

EXPORT

None.

CONSTRUCTOR

new()

Creates a new WWW::phpBB object.

Required parameters:

  • base_url => $forum_url

    URL of the original forum.

  • db_host => $mysql_server

    Host of the MySQL server that the forum will be copied to.

  • db_user => $mysql_user

  • db_passwd => $mysql_pass

  • db_database => $mysql_db

    Database with an already installed phpBB forum.

  • db_prefix => $

    Prefix used by the local installation.

Optional parameters:

  • db_compression => [0|1]

    Compress MySQL traffic (only useful when using a remote server).

  • max_rows => $value

    Maximum number of rows kept in memory. When the storage array reaches this size, the data is committed to the database.

  • months => [qw(jan feb mar apr may jun jul aug sep oct nov dec)]

    Month names as used by the forum. They vary with the translation used. The default is for the English version.

  • post_date_format => regex

    Date format used in posts. The default is qr/(\w+)\s+(\d+),\s+(\d+)\s+(\d+):(\d+)\s+(\w\w)/ and matches strings like "Tue May 30, 2006 5:17 pm" - note that the leading day of the week is ignored as it's not necessary to compute the timestamp.

  • post_date_pos => [qw(month_name day_of_month year hour minutes am_pm)]

    Position of the elements in the date string. The number of items must match the number of capture groups (parentheses) in "post_date_format". Valid field names are:

    am_pm - [am|pm] - case insensitive

    month_name - must be one of the values in "months"

    month - month number, from 1 to 12

    day_of_month

    year

    hour

    minutes

    seconds

  • reg_date_format => regex

  • reg_date_pos => []

    Same requirements as for the post date, except that they refer to the registration date as it appears in the memberlist.

  • max_tries => $value

    How many times to try fetching a forum page before giving up.

  • db_empty => [qw(users categories forums topics posts posts_text vote_desc vote_results)]

    Tables that will be emptied before scraping. The administrator of the local forum will be kept; everything else is deleted. This parameter is not used when updating.

  • db_insert => [0|1]

    Insert scraped data into the database. Defaults to 1.

  • update_overwrite => [0|1]

    Overwrite existing data when updating. Defaults to 0.
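
To illustrate how a date regex, a position list and a month list fit together, here is a minimal sketch that pairs each capture group with the field name at the same position and builds a timestamp. It is not the module's actual implementation: the function parse_forum_date is hypothetical, the timestamp is computed as UTC for simplicity, and the month lookup is assumed to be a case-insensitive exact match.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Hypothetical helper, not part of WWW::phpBB.
# Returns an epoch timestamp (UTC assumed) or nothing if the string
# does not match the format.
sub parse_forum_date {
    my ($string, $format, $positions, $months) = @_;

    my @captures = $string =~ $format
        or return;
    return unless @captures == @$positions;

    # Pair each captured value with the field name at the same position.
    my %field;
    @field{@$positions} = @captures;

    # Resolve the month by number, or by name via the month list.
    my $month = $field{month};
    if (defined $field{month_name}) {
        my $wanted = lc $field{month_name};
        for my $i (0 .. $#$months) {
            $month = $i + 1 if lc($months->[$i]) eq $wanted;
        }
    }
    return unless defined $month
        && defined $field{day_of_month}
        && defined $field{year};

    # Convert a 12-hour clock to 24 hours when am_pm is present.
    my $hour = $field{hour} // 0;
    if (defined $field{am_pm}) {
        my $is_pm = lc($field{am_pm}) eq 'pm';
        $hour = $hour % 12 + ($is_pm ? 12 : 0);
    }

    return timegm(
        $field{seconds} // 0,
        $field{minutes} // 0,
        $hour,
        $field{day_of_month},
        $month - 1,          # timegm expects months numbered from 0
        $field{year},
    );
}

# The default format and positions described above.
my $ts = parse_forum_date(
    'Tue May 30, 2006 5:17 pm',
    qr/(\w+)\s+(\d+),\s+(\d+)\s+(\d+):(\d+)\s+(\w\w)/,
    [qw(month_name day_of_month year hour minutes am_pm)],
    [qw(jan feb mar apr may jun jul aug sep oct nov dec)],
);
print scalar(gmtime($ts)), "\n";
```

Because the regex is not anchored, the leading day of the week is skipped and the first capture group starts at the month name. A translated forum only needs a different regex, position list and month list.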

ACCESSORS

The accessors have the same names as the constructor parameters. Called without an argument, an accessor returns the current value; called with an argument, it sets the value.

$phpbb->max_rows(100);
print $phpbb->max_tries, "\n";
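
For readers unfamiliar with combined getter/setter accessors, the pattern can be sketched as follows. This is an illustration only, not the module's source: the class My::Scraper and its default values are hypothetical, and returning the new value from a setter call is an assumption.

```perl
use strict;
use warnings;

package My::Scraper;    # hypothetical class standing in for WWW::phpBB

sub new {
    my ($class, %args) = @_;
    # Hypothetical defaults, overridden by the constructor arguments.
    return bless { max_rows => 1000, max_tries => 5, %args }, $class;
}

# Generate one combined getter/setter per field name.
for my $field (qw(max_rows max_tries)) {
    no strict 'refs';
    *{"My::Scraper::$field"} = sub {
        my $self = shift;
        $self->{$field} = shift if @_;    # set only when an argument is given
        return $self->{$field};
    };
}

package main;

my $scraper = My::Scraper->new(max_tries => 3);
$scraper->max_rows(100);
print $scraper->max_rows, "\n";
print $scraper->max_tries, "\n";
```

Generating the accessors in a loop keeps each one tied to the constructor parameter of the same name.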

PUBLIC METHODS

$phpbb->empty_tables()

Empties the tables of a local phpBB installation. It leaves the admin account untouched.

$phpbb->forum_login()

Log in to the original forum. Useful when guest access is restricted.

$phpbb->forum_logout()

Log out of the original forum.

$phpbb->get_users()

Scrape user data from the memberlist and profile pages.

$phpbb->scrape_forum_common()

Scrape categories, forums, topics and posts.

$phpbb->update_users()

Update the users for an already scraped forum.

$phpbb->update_forum_common()

Update categories, forums, topics and posts for an already scraped forum.

AUTHOR

Stefan Talpalaru, <stefantalpalaru@yahoo.com>

COPYRIGHT AND LICENSE

Copyright (C) 2006 by Stefan Talpalaru

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.