NAME

WWW::phpBB - phpBB2 forum scraper

SYNOPSIS

use WWW::phpBB;

# scrape as guest
my $phpbb = WWW::phpBB->new(
    base_url => 'http://localhost/~stefan/forum1',
    db_host => 'localhost',
    db_user => 'stefan',
    db_passwd => 'somepass',
    db_database => 'stefan',
    db_prefix => 'phpbb2_',
);

$phpbb->empty_tables();
$phpbb->get_users();
$phpbb->scrape_forum_common();

# scrape a German forum with a non-standard date format and a custom GET var
my $phpbb = WWW::phpBB->new(
    base_url => 'http://localhost/~stefan/index.php?mforum=de',
    db_host => 'localhost',
    db_user => 'stefan',
    db_passwd => 'somepass',
    db_database => 'stefan',
    db_prefix => 'phpbb2_',
    post_date_format => qr/(\d+)\s+(\w+),\s+(\d+)\s+(\d+):(\d+)/,
    post_date_pos => [qw(day_of_month month_name year hour minutes)],
    forum_user => 'raDical',
    forum_passwd => 'lfdiugyh',
);

# log in to access the private memberlist and some private forums
$phpbb->empty_tables();
$phpbb->forum_login();
$phpbb->get_users();
$phpbb->scrape_forum_common();
$phpbb->forum_logout();

# update an already scraped forum, maybe as a daily cron job
# $phpbb->update_overwrite(1); # don't try to keep modified data
$phpbb->update_users();
$phpbb->update_forum_common();

FANCY EXAMPLE

    use WWW::phpBB;

    # custom subclass
    package WWW::phpBB::custom;
    use base 'WWW::phpBB';

    # override some methods
    sub forum_url_for_page {
        my $self = shift;
        my ($url, $forum_id, $page) = @_;

        $url =~ s%[^/]*$%%;
        $url .= "forum,$forum_id,$page.html";
        return $url;
    }

    sub topic_url_for_page {
        my $self = shift;
        my ($url, $topic_id, $page) = @_;

        $url =~ s%[^/]*$%%;
        $url .= "topic,$topic_id,$page.html";
        return $url;
    }


    my $phpbb = WWW::phpBB::custom->new(
     base_url => 'http://foobar.foren-city.de',
     db_host => 'localhost',
     db_user => '****',
     db_passwd => '****',
     db_database => '****',
     db_prefix => 'phpbb_',
     verbose => 1,
     months => [qw(jan feb mär apr mai jun jul aug sep okt nov dez)],
     forum_user => '****',
     forum_passwd => '****',
     post_date_format => qr/(\d+)\s+(\w+)\s+(\d+)\s+(\d+):(\d+)/,
     post_date_pos => [qw(day_of_month month_name year hour minutes)],
     reg_date_format => qr/(\d+)\.(\d+)\.(\d+)/,
     reg_date_pos => [qw(day_of_month month year)],
     quote_string => "hat folgendes geschrieben",
     forum_link_regex => qr/forum,(\d+),/,
     topic_link_regex_p => qr/topic,.*#(\d+)/,
     topic_link_regex_t => qr/topic,(\d+),/,
     topic_link1 => "topic,%d.html",
     topic_link2 => "",
     profile_string_occupation => "beruf",
     alternative_page_number_regex_forum => qr/forum,\d+,(\d+)/,
     alternative_page_number_regex_topic => qr/topic,\d+,(\d+)/,
    );

    $phpbb->empty_tables();
    $phpbb->forum_login();
    $phpbb->get_users();
    $phpbb->scrape_forum_common();
    $phpbb->forum_logout();
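
To see what the overridden URL methods above produce, the rewriting step can be tried on its own. The following is a minimal sketch, not part of the module: rewritten_forum_url is a hypothetical standalone copy of the body of forum_url_for_page, and the input URL (script name, forum id and page value) is made up for illustration. It assumes the URL passed in ends with a page or script name after the last slash.

```perl
use strict;
use warnings;

# Standalone copy of the rewriting logic from forum_url_for_page,
# written as a plain function so it runs without WWW::phpBB installed.
sub rewritten_forum_url {
    my ($url, $forum_id, $page) = @_;

    $url =~ s%[^/]*$%%;                    # drop everything after the last slash
    $url .= "forum,$forum_id,$page.html";  # append the static-style page name
    return $url;
}

# Hypothetical input: the host comes from the example, the rest is invented.
print rewritten_forum_url('http://foobar.foren-city.de/index.php', 7, 2), "\n";
```

The same substitution is used in topic_url_for_page, with "topic" in place of "forum".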

DESCRIPTION

This module scrapes a phpBB2 installation through its web interface. It requires a local phpBB2 setup, which will be overwritten (the old 2.x versions can be downloaded from http://sourceforge.net/projects/phpbb/files/phpBB%202/ ), and it can only access what is available to a web browser, i.e. no private messages or user settings. Make sure the username used during the local installation doesn't exist in the remote forum.

Scraping is possible as a guest or as a logged-in member. If used with an administrator name and password, it will copy all the member e-mails (not just the public ones), allowing members to request a new random password from the new installation site and continue using the forum.

The current implementation lacks search support, but this can be fixed by converting the forum to phpBB3 or SMF. The "mforum" script is supported.

REQUIRED MODULES

WWW::Mechanize

Compress::Zlib

HTML::TokeParser::Simple

DBI

DBD::mysql

EXPORT

None.

CONSTRUCTOR

new()

Creates a new WWW::phpBB object.

Required parameters:

  • base_url => $forum_url

    URL of the original forum.

  • db_host => $mysql_server

    Location of the MySQL server the forum will be copied to.

  • db_user => $mysql_user

  • db_passwd => $mysql_pass

  • db_database => $mysql_db

    Database with an already installed phpBB forum.

  • db_prefix => $table_prefix

    Prefix used by the local installation.

Optional parameters:

  • db_compression => [0|1]

    Compress MySQL traffic (only useful when using a remote server).

  • max_rows => $value

    Maximum number of rows kept in memory. When the storage array reaches this value, the data is committed to the database.

  • months => [qw(jan feb mar apr may jun jul aug sep oct nov dec)]

    Month names as used by the forum. They vary with the translation used. The default is for the English version.

  • post_date_format => regex

    Date format used in posts. The default is qr/(\w+)\s+(\d+),\s+(\d+)\s+(\d+):(\d+)\s+(\w\w)/ and matches strings like "Tue May 30, 2006 5:17 pm" - note that the leading day of the week is ignored as it's not necessary to compute the timestamp.

  • post_date_pos => [qw(month_name day_of_month year hour minutes am_pm)]

    Position of the elements in the date string. The number of items must match the number of capturing parentheses in "post_date_format". Valid field names are:

    am_pm - [am|pm] - case insensitive

    month_name - must be one of the values in "months"

    month - month number, from 1 to 12

    day_of_month

    year

    hour

    minutes

    seconds

  • reg_date_format => regex

  • reg_date_pos => []

    Same requirements as for the post date, only that they refer to the registration date as it appears in the memberlist.

  • forum_link_regex => regex

    Regex for the forum link with the forum id. Defaults to qr/f=(\d+)/

  • topic_link_regex_p => regex

    Regex for the topic link with the post id. Defaults to qr/viewtopic.*p=(\d+)/

  • topic_link_regex_t => regex

    Regex for the topic link with the topic id. Defaults to qr/viewtopic.*t=(\d+)/

  • topic_link1 => string

    First part of the topic page link. The topic id will be inserted with sprintf if "%d" is found. Defaults to "viewtopic.php".

  • topic_link2 => string

    Second part of the topic page link, consisting of GET vars. The topic id will be inserted with sprintf if "%d" is found. Defaults to "t=%d&postorder=asc".

  • verbose => [0|1]

    Verbosity. Defaults to 0.

  • max_tries => $value

    How many times to try fetching a forum page until giving up. Defaults to 50.

  • max_children => $value

    How many parallel processes should be used for fetching. Defaults to 1.

  • db_empty => [qw(users categories forums topics posts posts_text vote_desc vote_results)]

    Tables that will be emptied before scraping. The administrator of the local forum will be kept, anything else is deleted. This parameter is not used when updating.

  • db_insert => [0|1]

    Insert scraped data into the database. Defaults to 1.

  • update_overwrite => [0|1]

    Overwrite existing data when updating. Defaults to 0.
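
The interplay between "post_date_format", "post_date_pos" and "months" is easier to follow with a small experiment. The sketch below is an illustration only and not the module's own code: parse_forum_date is a hypothetical helper that maps the regex captures onto the field names and builds an epoch value, and it assumes the date is in UTC (how the module itself treats time zones is not covered here).

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Hypothetical helper: pair each capture of $format with the field name at
# the same index in $pos, then build a UTC epoch value from the fields.
sub parse_forum_date {
    my ($string, $format, $pos, $months) = @_;

    my @captures = $string =~ $format
        or return;
    my %field;
    @field{@$pos} = @captures;

    # Resolve a month name through the months list (case-insensitive).
    if (defined $field{month_name}) {
        my $name = lc $field{month_name};
        for my $i (0 .. $#$months) {
            $field{month} = $i + 1 if lc $months->[$i] eq $name;
        }
    }
    return unless defined $field{month};

    # Convert a 12-hour clock to 24 hours when am_pm is present.
    my $hour = $field{hour} // 0;
    if (defined $field{am_pm}) {
        $hour %= 12;
        $hour += 12 if lc $field{am_pm} eq 'pm';
    }

    return timegm(
        $field{seconds} // 0, $field{minutes} // 0, $hour,
        $field{day_of_month}, $field{month} - 1, $field{year},
    );
}

# The documented defaults, applied to the sample string from the text.
my $epoch = parse_forum_date(
    'Tue May 30, 2006 5:17 pm',
    qr/(\w+)\s+(\d+),\s+(\d+)\s+(\d+):(\d+)\s+(\w\w)/,
    [qw(month_name day_of_month year hour minutes am_pm)],
    [qw(jan feb mar apr may jun jul aug sep oct nov dec)],
);
print scalar(gmtime $epoch), "\n";
```

Because the fields are looked up by name, a forum with a different date layout only needs a different regex and a reordered position list, as in the German examples in the SYNOPSIS.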

ACCESSORS

The accessors have the same names as the constructor parameters. Called without an argument, they return the current value; called with an argument, they set it.

$phpbb->max_rows(100);
print $phpbb->max_tries, "\n";
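
For readers unfamiliar with this combined get/set convention, here is a minimal sketch of how such accessors can be written. Demo::Settings is a hypothetical class invented for this illustration, not part of WWW::phpBB; the max_rows default of 1000 is arbitrary, and the setter returning the new value is an assumption.

```perl
use strict;
use warnings;

# Hypothetical class, used only to illustrate the get/set convention.
package Demo::Settings;

sub new {
    my ($class, %args) = @_;
    # max_tries => 50 mirrors the documented default; max_rows is arbitrary.
    return bless { max_rows => 1000, max_tries => 50, %args }, $class;
}

# Generate one combined getter/setter per field name.
for my $field (qw(max_rows max_tries)) {
    no strict 'refs';
    *{__PACKAGE__ . "::$field"} = sub {
        my $self = shift;
        $self->{$field} = shift if @_;   # called with an argument: set
        return $self->{$field};          # always return the current value
    };
}

package main;

my $settings = Demo::Settings->new();
$settings->max_rows(100);
print $settings->max_rows, "\n";
print $settings->max_tries, "\n";
```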

PUBLIC METHODS

$phpbb->empty_tables()

Empties the tables of a local phpBB installation. It leaves the admin account untouched.

$phpbb->forum_login()

Log in to the original forum. Useful when access is restricted for guests.

$phpbb->forum_logout()

Log out of the original forum.

$phpbb->get_users()

Scrape user data from the memberlist and profile pages.

$phpbb->scrape_forum_common()

Scrape categories, forums, topics and posts.

$phpbb->update_users()

Update the users for an already scraped forum.

$phpbb->update_forum_common()

Update categories, forums, topics and posts for an already scraped forum.

AUTHOR

Stefan Talpalaru, <stefantalpalaru@yahoo.com>

COPYRIGHT AND LICENSE

Copyright (c) 2006-2011 by Stefan Talpalaru

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.12.2 or, at your option, any later version of Perl 5 you may have available.
