
NAME

Text::Corpus::VoiceOfAmerica::Document - Parse a VOA article for research.

SYNOPSIS

use Cwd;
use File::Spec;
use Text::Corpus::VoiceOfAmerica;
use Data::Dump qw(dump);
use Log::Log4perl qw(:easy);
Log::Log4perl->easy_init ($INFO);
my $corpusDirectory = File::Spec->catdir (getcwd(), 'corpus_voa');
my $corpus = Text::Corpus::VoiceOfAmerica->new (corpusDirectory => $corpusDirectory);
$corpus->update (verbose => 1);
my $document = $corpus->getDocument (index => 0);
dump $document->getBody;
dump $document->getCategories;
dump $document->getContent;
dump $document->getDate;
dump $document->getDescription;
dump $document->getTitle;
dump $document->getUri;

DESCRIPTION

Text::Corpus::VoiceOfAmerica::Document provides methods for accessing the content of VOA news articles for researching and testing information processing techniques. Read the Voice of America's Terms of Use statement to ensure you abide by it when using this module.

CONSTRUCTOR

new

The constructor new creates an instance of the Text::Corpus::VoiceOfAmerica::Document class with the following parameters:

htmlContent
htmlContent => '...'

htmlContent is a string containing the HTML of the document to be parsed.

uri
uri => '...'

uri is the URL of the HTML content provided by htmlContent; it is also returned as the document's unique identifier by getUri.

METHODS

getBody

getBody ()

getBody returns an array reference of strings of sentences that are the body of the article.

getCategories

getCategories ()

getCategories returns an array reference of strings of categories assigned to the article. They are the phrases and words from the /html/head/meta[@name="KEYWORDS"] field in the HTML of the document.
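As a rough illustration of how a KEYWORDS field can map to a list of categories, the sketch below splits a keyword string on commas and trims the whitespace around each phrase. It assumes the field is comma-separated; the helper name split_keywords is hypothetical and is not part of this module, which reads the field from the parsed HTML.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::Corpus::VoiceOfAmerica::Document.
# Assumes the KEYWORDS content attribute is a comma-separated string.
sub split_keywords {
    my ($keywords) = @_;
    return [] unless defined $keywords;
    my @categories;
    foreach my $phrase (split /,/, $keywords) {
        $phrase =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        push @categories, $phrase if length $phrase;    # skip empty entries
    }
    return \@categories;
}

my $categories = split_keywords('Africa, Health , ,Science and Technology');
print join('|', @$categories), "\n";    # Africa|Health|Science and Technology
```

Like getCategories, the helper returns an array reference, so callers dereference it with @$categories.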

getContent

getContent ()

getContent returns an array reference of strings of sentences that form the content of the article: its title and body.
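The relationship between getTitle, getBody, and getContent can be sketched as follows. This is only an illustration under two assumptions: that the title sentences come before the body sentences, and that the combination is a plain concatenation. The helper content_from_parts and the stand-in class My::FakeDocument are hypothetical; the stand-in exists only so the sketch runs without fetching any articles.

```perl
use strict;
use warnings;

# Hypothetical helper: rebuilds a content list from the title and body.
# Assumes title sentences come first; works on any object that provides
# getTitle and getBody methods returning array references.
sub content_from_parts {
    my ($document) = @_;
    return [ @{ $document->getTitle }, @{ $document->getBody } ];
}

# Minimal stand-in class (not the real Document class).
package My::FakeDocument {
    sub new      { my ($class, %args) = @_; return bless {%args}, $class }
    sub getTitle { return $_[0]{title} }
    sub getBody  { return $_[0]{body} }
}

my $document = My::FakeDocument->new(
    title => ['Example headline'],
    body  => ['First sentence.', 'Second sentence.'],
);

# Print one sentence per line.
print "$_\n" for @{ content_from_parts($document) };
```

With a real document object, calling getContent directly is the simpler choice; the sketch only shows how the three accessors relate.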

getDate

getDate (format => '%g')

getDate returns the date and time of the article in the format specified by format, which uses the printf directives of Date::Manip::Date. The default is to return the date and time in RFC2822 format.
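For readers unfamiliar with the RFC2822 layout, the sketch below formats an epoch time in that style using only core Perl. It is independent of this module and of Date::Manip::Date; the helper name rfc2822_utc is hypothetical, and the fixed +0000 offset is an assumption made to keep the example deterministic.

```perl
use strict;
use warnings;
use POSIX qw(strftime setlocale LC_TIME);

# Use the C locale so day and month names are the English abbreviations
# that RFC 2822 requires.
setlocale(LC_TIME, 'C');

# Hypothetical helper, independent of the module: format an epoch time
# (seconds since 1970-01-01 UTC) as an RFC 2822 style date in UTC.
sub rfc2822_utc {
    my ($epoch) = @_;
    return strftime('%a, %d %b %Y %H:%M:%S +0000', gmtime($epoch));
}

print rfc2822_utc(0), "\n";    # Thu, 01 Jan 1970 00:00:00 +0000
```

The time zone reported by getDate for a real article may differ from the UTC offset used here.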

getDescription

getDescription ()

getDescription returns an array reference of strings of sentences, usually one, that describe the article's content. It is taken from the /html/head/meta[@name="description"] field in the HTML of the document.

getTitle

getTitle ()

getTitle returns an array reference of strings, usually one, of the title of the article.

getUri

getUri ()

getUri returns the URL of the document.

INSTALLATION

For installation instructions see Text::Corpus::VoiceOfAmerica.

AUTHOR

Jeff Kubina <jeff.kubina@gmail.com>

COPYRIGHT

Copyright (c) 2009 Jeff Kubina. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

The full text of the license can be found in the LICENSE file included with this module.

KEYWORDS

information processing, english corpus, voa, voice of america

SEE ALSO

Read the Voice of America's Terms of Use statement to ensure you abide by it when using this module.

CHI, HTML::TreeBuilder::XPath, Lingua::EN::Sentence, Log::Log4perl, Text::Corpus::VoiceOfAmerica