NAME
XML::RSS::FromHTML - simple framework for making RSS out of HTML
SYNOPSIS
### create your own sub-class, with these four methods
package MyModule;
use base 'XML::RSS::FromHTML';
sub init {
my $self = shift;
# set your configurations here
$self->name('MyRSS');
$self->url('http://foo.com/headlines.html');
}
sub defineRSS {
my $self = shift;
my $xmlrss = shift;
# define your RSS using XML::RSS->channel method
$xmlrss->channel(
title => 'foo.com headlines feed',
description => 'generated from http://foo.com headlines'
);
}
sub makeItemList {
my $self = shift;
my $html = shift;
# parse HTML and make an item list
my @list;
while ($html =~ m|<li><a href="(.+?)">(.+?)</a></li>|g){
push(@list,{
link => $1,
title => $2
});
}
return \@list;
}
sub addNewItem {
my $self = shift;
my ($xmlrss,$eachItem) = @_;
# make your item using XML::RSS->add_item method
$xmlrss->add_item(
title => $eachItem->{title},
link => $eachItem->{link},
description => 'this is '. $eachItem->{title},
);
}
#### and from your main routine...
package main;
use MyModule;
my $rss = MyModule->new;
$rss->update;
# an updated RSS file './MyRSS.xml' will be created.
# run this script every day, and your RSS will always
# be up-to-date.
DESCRIPTION
This module is a simple framework for periodically creating RSS out of HTML. Plenty of web sites still don't supply RSS feeds, even though it would be nice if they did. This module helps you build RSS feeds for those sites yourself and keep their contents up to date. The core features are as follows:
retrieving HTML text from url
throttling of short-interval access to url
caching of update records (to keep access to url to a minimum)
framework that offers minimum coding to developers
It is mostly focused on not being an annoyance to the target url/web site (and, of course, on developer-friendliness). We don't want to be seen as spammers, but it would be nice if we could show them the value of RSS feeds.
USAGE
BASIC
This module is not intended to work by itself. You need to create a sub-class of it and define these four methods, customized for your target url/web site.
FOUR METHODS
init()
sub init {
my $self = shift;
# set your configurations here
$self->name('Test');
$self->url('http://foo.com/headlines.html');
$self->cacheDir('./cache');
$self->feedDir('./feed');
return 1;
}
Called within the constructor, this method should initialize the property values of your choice. See the PROPERTIES section for a description of the available properties.
defineRSS()
Define your RSS feed's description and other channel information here, using the XML::RSS->channel method.
sub defineRSS {
my $self = shift;
my $xmlrss = shift;
# define your RSS using XML::RSS->channel method
$xmlrss->channel(
title => 'foo.com headlines feed',
description => 'generated from http://foo.com headlines'
);
# you can also define images with XML::RSS->image method
$xmlrss->image(
title => 'foo.com headlines feed',
url => 'http://mysite/image/logo.gif',
link => 'http://foo.com/headlines.html'
);
return 1;
}
makeItemList()
Given the whole HTML string (supplied as an argument), use whatever means you like (e.g. regexp) to create a data structure of items. Later on, you'll use this information to create feed items.
sub makeItemList {
my $self = shift;
my $html = shift;
# parse HTML and make an item list
my @list;
while ($html =~ m| .. some mumbling regexp here .. |g){
push(@list,{
link => $1,
title => $2,
category => $3,
id => $4,
...
});
}
return \@list;
}
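As a concrete illustration, the sketch below applies the regular expression from the SYNOPSIS to a small, made-up HTML fragment. The sample markup and the standalone function name parse_items are assumptions for demonstration only; a real page will need its own pattern.

```perl
use strict;
use warnings;

# Hypothetical HTML fragment standing in for a downloaded page.
my $html = <<'HTML';
<ul>
<li><a href="http://foo.com/a.html">First headline</a></li>
<li><a href="http://foo.com/b.html">Second headline</a></li>
</ul>
HTML

# Same parsing logic as a makeItemList() body, written as a plain function
# (parse_items is a hypothetical name, not part of the module).
sub parse_items {
    my $html = shift;
    my @list;
    while ($html =~ m|<li><a href="(.+?)">(.+?)</a></li>|g) {
        push @list, { link => $1, title => $2 };
    }
    return \@list;
}

my $items = parse_items($html);
printf "%s -> %s\n", $_->{title}, $_->{link} for @$items;
```

Returning an array reference of hash references keeps each item's fields addressable by name when addNewItem() later receives them one at a time.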
addNewItem()
The framework checks the list created by the method above (makeItemList) for updates and calls this method once for each new item. The argument $eachItem is therefore one element of the @list created in $self->makeItemList. Use the XML::RSS->add_item method to add a new item to the RSS feed. You can also fetch additional information about the item, for example from its description page, and add that to the feed too.
sub addNewItem {
my $self = shift;
my ($xmlrss,$eachItem) = @_;
# fetch additional information if you want to
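require LWP::Simple;
# require does not import get(), so call it by its full name
my $html = LWP::Simple::get("http://foo.com/archives/$eachItem->{id}.html");
# get() returns undef on failure, so fall back to an empty string
$html = '' unless defined $html;
my ($desc) = ($html =~ m|<p class="desc">(.+?)</p>|);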
# make your rss item using XML::RSS->add_item method
$xmlrss->add_item(
title => $eachItem->{title},
link => $eachItem->{link},
category => $eachItem->{category},
description => $desc,
);
return 1;
}
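How the framework decides which items are new is not described here. Purely as an illustration of the idea, the following sketch treats an item as new when its link has not been seen before. The helper name find_new_items and the link-based comparison are assumptions, not the module's actual logic.

```perl
use strict;
use warnings;

# Hypothetical helper: return the items whose link was not seen before.
sub find_new_items {
    my ($items, $seen) = @_;    # arrayref of item hashes, hashref of known links
    return [ grep { !$seen->{ $_->{link} } } @$items ];
}

# Links recorded on a previous run (made-up data).
my %seen = ('http://foo.com/a.html' => 1);

# Items as makeItemList() might return them (made-up data).
my @items = (
    { link => 'http://foo.com/a.html', title => 'Old headline' },
    { link => 'http://foo.com/b.html', title => 'New headline' },
);

my $new = find_new_items(\@items, \%seen);
print "new: $_->{title}\n" for @$new;
```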
HOW TO USE
Basically, all you need to do is load your sub-class module, create a new instance, and call the update method. The update method returns a boolean value:
1 : RSS feed re-written. There were some updates.
0 : No update, for some reason.
The $self->updateStatus method gives you a status message describing what happened.
use MyModule;
my $rss = MyModule->new;
my $hasNewItem = $rss->update;
if($hasNewItem){
print "RSS updated with some new items\n";
}else{
# i.e. "still under check interval time period"
print $rss->updateStatus, "\n";
}
PROPERTIES
These are all the properties available for configuration within the $self->init method.
name
Identification string, used for feed file name and cache file name. Default value is 'myrss'.
url
The URL of the target web page.
cacheDir
Path of the directory where the cache files are stored. Default is '.' (current dir).
feedDir
Path of the directory where the RSS feed file will be saved. Default is '.' (current dir).
minInterval
Minimum interval period in seconds. If $self->update is called more than once within this interval, the extra calls are silently ignored, which restricts unnecessary access to the target url. Default is 300 (= 5 minutes).
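The interval rule can be pictured with a small standalone check. This is only a sketch of the idea: the helper name interval_elapsed and the way the last run time is passed in are assumptions, and the module's own bookkeeping may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: true when at least $min_interval seconds have passed
# since $last_run. $now defaults to the current time.
sub interval_elapsed {
    my ($last_run, $min_interval, $now) = @_;
    $now = time unless defined $now;
    return 1 unless defined $last_run;    # no previous run recorded
    return ($now - $last_run) >= $min_interval ? 1 : 0;
}

my $min_interval = 300;    # the documented default

# 100 seconds after the last run: too early, so the call would be skipped.
print interval_elapsed(1000, $min_interval, 1100) ? "update\n" : "skip\n";

# 300 seconds after the last run: the interval has passed.
print interval_elapsed(1000, $min_interval, 1300) ? "update\n" : "skip\n";
```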
maxItemCount
The maximum number of items the RSS feed may contain. If exceeded, older items are deleted from the feed. Default is 30.
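As a rough picture of the trimming rule, the sketch below keeps only the first N entries of a list. It assumes the newest items come first, which is an assumption made for this example; the helper name trim_items is hypothetical as well.

```perl
use strict;
use warnings;

# Hypothetical helper: keep at most $max items, assuming newest come first.
sub trim_items {
    my ($items, $max) = @_;
    my @kept = @$items;                    # copy, so the input is left untouched
    splice(@kept, $max) if @kept > $max;   # drop everything from index $max on
    return \@kept;
}

my $trimmed = trim_items([qw(newest newer old oldest)], 2);
print join(',', @$trimmed), "\n";
```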
unicodeDowngrade
Parsing RSS files with XML::RSS (actually XML::Parser) results in utf-8 flagged strings. Setting this to a true value takes all these utf-8 flags off, which is sometimes helpful when handling non-ASCII text without the 'encoding' pragma.
passthru
A hashref of optional values to pass to the XML::RSS->new() method. Default is {} (empty). For example, setting this:
$self->passthru({ version => '2.0' });
will work as
XML::RSS->new( version => '2.0' );
in every place XML::RSS->new is called internally.
outFileName
If supplied, this is used as the name of the output file (feed xml file) instead of $self->name. (Intended for custom usage only.)
debug
If set to a true value, some useful debugging information (files) will be created in the $self->cacheDir directory each time the $self->update method is called.
OTHER USEFUL PROPERTIES
updateStatus
As described above (section HOW TO USE), this property contains a helpful message about the update sequence. The current messages are:
'update not executed yet'
default message before $self->update is called.
'still under check interval time period'
$self->minInterval seconds have not passed yet since the last update.
'makeItemList returned with 0 item - html parse failure'
the parsing logic is not working properly, most likely because the HTML structure has changed.
'updated with $n new items'
successfully updated with $n new items.
'there was no new item'
the HTML hasn't changed a bit.
newItems
An array reference to all the items that were counted as new. Sometimes useful after a $self->update method call.
$rss->update;
print "there were " . scalar(@{ $rss->newItems }) . " new items.\n";
foreach (@{ $rss->newItems }){
print "title: $_->{title}\n";
}
OTHER USEFUL METHODS
as_string()
Returns the RSS feed as an XML string.
as_object()
Returns the XML::RSS object of the current RSS feed.
getDateTime()
Returns the current date and time in an RFC 1123 style GMT ASCII format, like this:
Sun, 06 Nov 1994 08:49:37 GMT
Useful for date/time related elements within the RSS feed (e.g. pubDate). If a date-time string is passed as an argument, it will try its best to parse the string and return it in the same GMT ASCII format.
print $self->getDateTime('19940203T141529Z');
# will print 'Thu, 03 Feb 1994 14:15:29 GMT'
It uses HTTP::Date internally, so see the documentation of HTTP::Date's parse_date() function for the available (parse-able) formats.
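To make the output format itself concrete, here is a minimal sketch that builds the same kind of string using only core Perl. It is not how the module works (that is done through HTTP::Date); the helper name rfc1123 is hypothetical.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

my @DAYS   = qw(Sun Mon Tue Wed Thu Fri Sat);
my @MONTHS = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

# Hypothetical helper: format an epoch time as an RFC 1123 style GMT string.
sub rfc1123 {
    my $epoch = shift;
    my ($sec, $min, $hour, $mday, $mon, $year, $wday) = gmtime($epoch);
    return sprintf('%s, %02d %s %04d %02d:%02d:%02d GMT',
        $DAYS[$wday], $mday, $MONTHS[$mon], $year + 1900, $hour, $min, $sec);
}

# 1994-02-03 14:15:29 GMT, the timestamp used in the example above.
# timegm() takes a zero-based month, so February is 1.
my $epoch = timegm(29, 15, 14, 3, 1, 1994);
print rfc1123($epoch), "\n";    # prints 'Thu, 03 Feb 1994 14:15:29 GMT'
```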
TIPS
RETRIEVING HTML FROM SESSION REQUIRED WEB SITE
Some web sites require a valid session id in your browser cookie or query string before they let you retrieve their contents. The session id is usually issued the first time you visit their TOP PAGE or, of course, when you go through the LOGIN process.
If you need to retrieve HTML from pages that require such session ids, override the $self->getHTML method with your own customization. For example, for a web site that issues a session id when you access its top.cgi page, the getHTML method would look like this:
sub getHTML {
my $self = shift;
my $url = shift;
require LWP::UserAgent;
my $ua = LWP::UserAgent->new;
$ua->cookie_jar({ file => $self->cacheDir.'/'.$self->name.'.cookie' });
$ua->get('http://foo.com/top.cgi'); # set session-id in cookie
my $res = $ua->get($url); # send with session-id cookie
return $res->content;
}
BUGS
Nothing that I'm aware of, yet.
AUTHOR
Toshimasa Ishibashi
CPAN ID: BASHI
bashi@cpan.org
http://iandeth.dyndns.org/mt/ian/
COPYRIGHT
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The full text of the license can be found in the LICENSE file included with this module.
SEE ALSO
perl(1). XML::RSS