NAME
WWW::phpBB - phpBB2 forum scraper
SYNOPSIS
use WWW::phpBB;
# scrape as guest
my $phpbb = WWW::phpBB->new(
base_url => 'http://localhost/~stefan/forum1',
db_host => 'localhost',
db_user => 'stefan',
db_passwd => 'somepass',
db_database => 'stefan',
db_prefix => 'phpbb2_',
);
$phpbb->empty_tables();
$phpbb->get_users();
$phpbb->scrape_forum_common();
# scrape a German forum with a non-standard date format and a custom GET var
my $phpbb = WWW::phpBB->new(
base_url => 'http://localhost/~stefan/index.php?mforum=de',
db_host => 'localhost',
db_user => 'stefan',
db_passwd => 'somepass',
db_database => 'stefan',
db_prefix => 'phpbb2_',
post_date_format => qr/(\d+)\s+(\w+),\s+(\d+)\s+(\d+):(\d+)/,
post_date_pos => [qw(day_of_month month_name year hour minutes)],
forum_user => 'raDical',
forum_passwd => 'lfdiugyh',
);
# login to access the private memberlist and some private forums
$phpbb->empty_tables();
$phpbb->forum_login();
$phpbb->get_users();
$phpbb->scrape_forum_common();
$phpbb->forum_logout();
# update an already scraped forum, maybe as a daily cron job
# $phpbb->update_overwrite(1); # don't try to keep modified data
$phpbb->update_users();
$phpbb->update_forum_common();
FANCY EXAMPLE
use WWW::phpBB;
# custom subclass
package WWW::phpBB::custom;
use base 'WWW::phpBB';
# override some methods
sub forum_url_for_page {
my $self = shift;
my ($url, $forum_id, $page) = @_;
$url =~ s%[^/]*$%%;
$url .= "forum,$forum_id,$page.html";
return $url;
}
sub topic_url_for_page {
my $self = shift;
my ($url, $topic_id, $page) = @_;
$url =~ s%[^/]*$%%;
$url .= "topic,$topic_id,$page.html";
return $url;
}
my $phpbb = WWW::phpBB::custom->new(
base_url => 'http://foobar.foren-city.de',
db_host => 'localhost',
db_user => '****',
db_passwd => '****',
db_database => '****',
db_prefix => 'phpbb_',
verbose => 1,
months => [qw(jan feb mär apr mai jun jul aug sep okt nov dez)],
forum_user => '****',
forum_passwd => '****',
post_date_format => qr/(\d+)\s+(\w+)\s+(\d+)\s+(\d+):(\d+)/,
post_date_pos => [qw(day_of_month month_name year hour minutes)],
reg_date_format => qr/(\d+)\.(\d+)\.(\d+)/,
reg_date_pos => [qw(day_of_month month year)],
quote_string => "hat folgendes geschrieben",
forum_link_regex => qr/forum,(\d+),/,
topic_link_regex_p => qr/topic,.*#(\d+)/,
topic_link_regex_t => qr/topic,(\d+),/,
topic_link1 => "topic,%d.html",
topic_link2 => "",
profile_info => 0,
alternative_page_number_regex_forum => qr/forum,\d+,(\d+)/,
alternative_page_number_regex_topic => qr/topic,\d+,(\d+)/,
);
$phpbb->empty_tables();
$phpbb->forum_login();
$phpbb->get_users();
$phpbb->scrape_forum_common();
$phpbb->forum_logout();
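The effect of the two overridden methods can be checked in isolation. The following is a minimal sketch that copies the substitution from the subclass above into a plain function, so it runs without WWW::phpBB or a database. The function name rewrite_url, the input URL and the ids are made up for illustration; which URL the module actually passes to these methods is an assumption.

```perl
use strict;
use warnings;

# Same rewriting as forum_url_for_page/topic_url_for_page above,
# written as one plain function (hypothetical name).
sub rewrite_url {
    my ($kind, $url, $id, $page) = @_;
    $url =~ s%[^/]*$%%;                       # drop everything after the last slash
    return $url . "$kind,$id,$page.html";
}

# Hypothetical input: a page URL directly below the forum root.
my $forum_url = rewrite_url('forum', 'http://foobar.foren-city.de/index.php', 3, 2);
my $topic_url = rewrite_url('topic', 'http://foobar.foren-city.de/index.php', 17, 1);

print "$forum_url\n";   # http://foobar.foren-city.de/forum,3,2.html
print "$topic_url\n";   # http://foobar.foren-city.de/topic,17,1.html
```

The substitution removes whatever follows the last slash, so a URL without any path component would lose its host name. Whether the module ever passes such a URL is not stated here.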
DESCRIPTION
This module can be used to scrape a phpBB2 installation using the web interface. It requires a local phpBB2 setup (you can download the old 2.x versions from http://sourceforge.net/projects/phpbb/files/phpBB%202/ ) that will be overwritten, and it can only access what is available to the web browser (i.e. no private messages or user settings). Make sure the username used during the local installation doesn't exist in the remote forum.
Scraping is possible as a guest or as a logged-in member. If used with an administrator name and password, it will copy all the member e-mails (not just the public ones), allowing members to request a new random password from the new installation site and continue using the forum.
The current implementation lacks search support, but this can be fixed by converting the forum to phpBB3 or SMF. The "mforum" script is supported.
REQUIRED MODULES
EXPORT
None.
CONSTRUCTOR
new()
Creates a new WWW::phpBB object.
Required parameters:
base_url => $forum_url
URL of the original forum.
db_host => $mysql_server
Location of the MySQL server where the forum will be copied to.
db_user => $mysql_user
db_passwd => $mysql_pass
db_database => $mysql_db
Database with an already installed phpBB forum.
db_prefix => $
Prefix used by the local installation.
Optional parameters:
db_compression => [0|1]
Compress MySQL traffic (only useful when using a remote server).
max_rows => $value
Maximum number of rows kept in memory. When the storage array reaches this value, the data is committed to the database.
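The buffering described for max_rows can be pictured with a small stand-alone sketch. It is not the module's code: the names add_row, flush_rows and @committed are hypothetical, and an in-memory array stands in for the database.

```perl
use strict;
use warnings;

my $max_rows = 3;    # small value so the flushes are visible
my @buffer;          # rows waiting to be written
my @committed;       # stands in for the database: one entry per flush

# Write out everything collected so far and empty the buffer.
sub flush_rows {
    return unless @buffer;
    push @committed, [@buffer];
    @buffer = ();
}

# Queue one row; flush as soon as the buffer holds $max_rows rows.
sub add_row {
    my ($row) = @_;
    push @buffer, $row;
    flush_rows() if @buffer >= $max_rows;
}

add_row("post $_") for 1 .. 7;
flush_rows();        # commit the remainder at the end

print scalar(@{$_}), " rows committed\n" for @committed;
```

A lower value keeps memory use down at the cost of more frequent writes.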
months => [qw(jan feb mar apr may jun jul aug sep oct nov dec)]
Month names as used by the forum. They vary with the translation used. The default is for the English version.
post_date_format => regex
Date format used in posts. The default is qr/(\w+)\s+(\d+),\s+(\d+)\s+(\d+):(\d+)\s+(\w\w)/ and matches strings like "Tue May 30, 2006 5:17 pm" - note that the leading day of the week is ignored as it's not necessary to compute the timestamp.
post_date_pos => [qw(month_name day_of_month year hour minutes am_pm)]
Position of the elements in the date string. The number of items must match the number of capturing parentheses in "post_date_format". Valid field names are:
am_pm - [am|pm] - case insensitive
month_name - must be one of the values in "months"
month - number of the month, from 1 to 12
day_of_month
year
hour
minutes
seconds
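How a date regex, a position list and the month names fit together can be shown with a short stand-alone sketch. It uses the documented defaults, but parse_date is a hypothetical helper and not part of WWW::phpBB. The case-insensitive month lookup and the 12-hour conversion are assumptions about what such a parser has to do, not a description of the module's internals.

```perl
use strict;
use warnings;

my @months = qw(jan feb mar apr may jun jul aug sep oct nov dec);
my $format = qr/(\w+)\s+(\d+),\s+(\d+)\s+(\d+):(\d+)\s+(\w\w)/;
my @pos    = qw(month_name day_of_month year hour minutes am_pm);

# Hypothetical helper: pair each capture group with the field name
# at the same index, then normalise month and hour.
sub parse_date {
    my ($string) = @_;
    my @captures = $string =~ $format or return;
    my %f;
    @f{@pos} = @captures;

    # month_name -> month number 1..12 (case-insensitive lookup, an assumption)
    if (defined $f{month_name}) {
        my ($index) = grep { $months[$_] eq lc($f{month_name}) } 0 .. $#months;
        return unless defined $index;
        $f{month} = $index + 1;
    }

    # 12-hour clock -> 24-hour clock
    if (defined $f{am_pm}) {
        $f{hour} %= 12;
        $f{hour} += 12 if lc($f{am_pm}) eq 'pm';
    }
    return \%f;
}

my $date = parse_date('Tue May 30, 2006 5:17 pm');
printf "%04d-%02d-%02d %02d:%02d\n",
    @{$date}{qw(year month day_of_month hour minutes)};   # 2006-05-30 17:17
```

For a translated forum, only @months, $format and @pos would change, for example the German values shown in FANCY EXAMPLE.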
reg_date_format => regex
reg_date_pos => []
Same requirements as for the post date, only that they refer to the registration date as it appears in the memberlist.
forum_link_regex => regex
default: qr/f=(\d+)/
topic_link_regex_p => regex
Regex for the topic link with the post id. Defaults to qr/viewtopic.*p=(\d+)/
topic_link_regex_t => regex
Regex for the topic link with the topic id. Defaults to qr/viewtopic.*t=(\d+)/
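The three link regexes each capture one id. A minimal sketch of the defaults applied to made-up hrefs in the style of a stock phpBB2 page, followed by the path-style patterns from FANCY EXAMPLE. The hrefs are assumptions for illustration, not output of a real forum.

```perl
use strict;
use warnings;

# Documented defaults.
my %default = (
    forum   => qr/f=(\d+)/,
    topic_p => qr/viewtopic.*p=(\d+)/,
    topic_t => qr/viewtopic.*t=(\d+)/,
);

# Hypothetical hrefs as they might appear in a forum page.
my ($forum_id) = 'viewforum.php?f=7'       =~ $default{forum};
my ($post_id)  = 'viewtopic.php?p=123#123' =~ $default{topic_p};
my ($topic_id) = 'viewtopic.php?t=45'      =~ $default{topic_t};

# Path-style links, using the patterns from FANCY EXAMPLE.
my ($custom_topic_id) = 'topic,45,0.html'     =~ qr/topic,(\d+),/;
my ($custom_post_id)  = 'topic,45,0.html#123' =~ qr/topic,.*#(\d+)/;

print "forum $forum_id, post $post_id, topic $topic_id\n";   # forum 7, post 123, topic 45
print "custom topic $custom_topic_id, custom post $custom_post_id\n";
```

Because ".*" is greedy, the default topic patterns capture the last matching "p=" or "t=" in the link, which matters when a link carries additional GET vars.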
topic_link1 => string
First part of the topic page link. The topic id will be inserted with sprintf if "%d" is found. Defaults to "viewtopic.php".
topic_link2 => string
Second part of the topic page link, consisting of GET vars. The topic id will be inserted with sprintf if "%d" is found. Defaults to "t=%d&postorder=asc".
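The two link parts can be combined as in the sketch below. The helper build_topic_url is hypothetical, and joining the parts with "/" and "?" is an assumption about how the final URL is assembled; only the sprintf substitution on "%d" is taken from the descriptions above.

```perl
use strict;
use warnings;

# Hypothetical helper: fill the topic id into whichever part contains "%d",
# then join the parts (the "/" and "?" separators are an assumption).
sub build_topic_url {
    my ($base, $link1, $link2, $topic_id) = @_;
    my $path  = $link1 =~ /%d/ ? sprintf($link1, $topic_id) : $link1;
    my $query = $link2 =~ /%d/ ? sprintf($link2, $topic_id) : $link2;
    my $url   = "$base/$path";
    $url .= "?$query" if length $query;
    return $url;
}

# Defaults: the id goes into the GET vars.
my $default_url = build_topic_url(
    'http://localhost/~stefan/forum1', 'viewtopic.php', 't=%d&postorder=asc', 42);

# Values from FANCY EXAMPLE: the id goes into the path, no GET vars.
my $custom_url = build_topic_url(
    'http://foobar.foren-city.de', 'topic,%d.html', '', 42);

print "$default_url\n";   # http://localhost/~stefan/forum1/viewtopic.php?t=42&postorder=asc
print "$custom_url\n";    # http://foobar.foren-city.de/topic,42.html
```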
verbose => [0|1]
Verbosity. Defaults to 0.
max_tries => $value
How many times to try fetching a forum page until giving up. Defaults to 50.
max_children => $value
How many parallel processes should be used for fetching. Defaults to 1.
db_empty => [qw(users categories forums topics posts posts_text vote_desc vote_results)]
Tables that will be emptied before scraping. The administrator of the local forum will be kept; anything else is deleted. This parameter is not used when updating.
db_insert => [0|1]
Insert scraped data into the database. Defaults to 1.
update_overwrite => [0|1]
Overwrite existing data when updating. Defaults to 0.
ACCESSORS
The accessors have the same names as the constructor parameters. Called without an argument, an accessor returns the current value. Called with an argument, it sets the value.
$phpbb->max_rows(100);
print $phpbb->max_tries, "\n";
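For readers unfamiliar with combined getter/setter accessors, the pattern looks roughly like this. The class Sketch::Scraper is hypothetical and only stands in for WWW::phpBB; how the module generates its own accessors is not documented here.

```perl
use strict;
use warnings;

package Sketch::Scraper;    # hypothetical class, not part of WWW::phpBB

sub new {
    my ($class, %args) = @_;
    return bless {%args}, $class;
}

# Generate one combined getter/setter per field name.
for my $field (qw(max_rows max_tries verbose)) {
    no strict 'refs';
    *{ __PACKAGE__ . "::$field" } = sub {
        my $self = shift;
        $self->{$field} = shift if @_;    # with an argument: set
        return $self->{$field};           # always: return the current value
    };
}

package main;

my $obj = Sketch::Scraper->new(max_tries => 50);
$obj->max_rows(100);
print $obj->max_rows,  "\n";    # 100
print $obj->max_tries, "\n";    # 50
```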
PUBLIC METHODS
$phpbb->empty_tables()
Empties the tables of a local phpBB installation. It leaves the admin account untouched.
$phpbb->forum_login()
Log in to the original forum. Useful when access is restricted for a guest.
$phpbb->forum_logout()
$phpbb->get_users()
Scrape user data from the memberlist and profile pages.
$phpbb->scrape_forum_common()
Scrape categories, forums, topics and posts.
$phpbb->update_users()
Update the users for an already scraped forum.
$phpbb->update_forum_common()
Update categories, forums, topics and posts for an already scraped forum.
AUTHOR
Stefan Talpalaru, <stefantalpalaru@yahoo.com>
COPYRIGHT AND LICENSE
Copyright (c) 2006-2011 by Stefan Talpalaru
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.12.2 or, at your option, any later version of Perl 5 you may have available.