NAME

create_summary_corpus.pl - Script to create corpus for summary testing.

SYNOPSIS

create_summary_corpus.pl [-d corpusDirectory -l languageCode -p maxProcesses -r -h -t n]

DESCRIPTION

The script create_summary_corpus.pl makes a corpus for summarization testing using the featured articles of various Wikipedias.

All errors and warnings are logged using Log::Log4perl to the file corpusDirectory/languageCode/log/log.txt.

OPTIONS

`-d corpusDirectory`

The option -d sets the directory in which the corpus of documents is stored; the directory is created if it does not exist. The default is the current working directory.

A language subdirectory is created at corpusDirectory/languageCode that will contain the directories log, html, unparsable, text, and xml. The directory log will contain the file log.txt, to which all errors, warnings, and informational messages are logged using Log::Log4perl. The directory html will contain copies of the HTML versions of the featured article pages fetched using LWP. The directory text will contain two files for each article: one file will end with _body.txt and contain the body text of the article; the other will end with _summary.txt and contain the summary. The directory unparsable will contain the HTML files that could not be parsed into body and summary sections. The XML files are UTF-8 encoded; the text and HTML files are saved as UTF-8 octets.
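The layout described above can be sanity-checked after a run. The following is a minimal sketch, not part of the script itself: the function name check_pairs is hypothetical, and it assumes only the text directory and the _body.txt / _summary.txt naming stated above.

```shell
# Report articles whose body file has no matching summary file.
# $1: corpus directory (the -d value), $2: language code (the -l value).
check_pairs() {
  text_dir="$1/$2/text"
  for body in "$text_dir"/*_body.txt; do
    [ -e "$body" ] || continue   # the glob matched nothing
    summary="${body%_body.txt}_summary.txt"
    [ -e "$summary" ] || printf 'missing summary: %s\n' "$summary"
  done
}
```

For example, check_pairs . en inspects ./en/text and prints one line per body file that lacks a summary file.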

`-l languageCode`

The option -l sets the language code of the Wikipedia from which the corpus of featured articles is to be created. The supported language codes are af:Afrikaans, ar:Arabic, az:Azerbaijani, bg:Bulgarian, bs:Bosnian, ca:Catalan, cs:Czech, de:German, el:Greek, en:English, eo:Esperanto, es:Spanish, eu:Basque, fa:Persian, fi:Finnish, fr:French, he:Hebrew, hr:Croatian, hu:Hungarian, id:Indonesian, it:Italian, ja:Japanese, jv:Javanese, ka:Georgian, kk:Kazakh, km:Khmer, ko:Korean, li:Limburgish, lv:Latvian, ml:Malayalam, mr:Marathi, ms:Malay, mzn:Mazandarani, nl:Dutch, nn:Norwegian (Nynorsk), no:Norwegian (Bokmål), pl:Polish, pt:Portuguese, ro:Romanian, ru:Russian, sh:Serbo-Croatian, simple:Simple English, sk:Slovak, sl:Slovenian, sr:Serbian, sv:Swedish, sw:Swahili, ta:Tamil, th:Thai, tl:Tagalog, tr:Turkish, tt:Tatar, uk:Ukrainian, ur:Urdu, vi:Vietnamese, vo:Volapük, and zh:Chinese. If the language code is all, then the corpus for each supported language is created, which takes a long time. The default is en.
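Because a run for a whole Wikipedia can take a long time, it may be worth validating the language code before starting. The sketch below is an illustration only: the names SUPPORTED_LANGS and is_supported_lang are hypothetical, and the list is copied from the codes given above rather than queried from the script.

```shell
# Language codes listed in this documentation; "all" is accepted as well.
SUPPORTED_LANGS="af ar az bg bs ca cs de el en eo es eu fa fi fr he hr hu id it ja jv ka kk km ko li lv ml mr ms mzn nl nn no pl pt ro ru sh simple sk sl sr sv sw ta th tl tr tt uk ur vi vo zh"

# Succeeds if $1 is a single supported language code or "all".
is_supported_lang() {
  case "$1" in
    ''|*' '*) return 1 ;;        # reject empty or multi-word input
  esac
  case " $SUPPORTED_LANGS all " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}
```

A caller could then write is_supported_lang "$lang" || exit 1 ahead of the actual invocation.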

`-p maxProcesses`

The option -p sets the maximum number of processes that can run simultaneously to parse the files. Parsing the files for the summary and body sections may be computationally intensive, so the module Forks::Super is used for parallelization. The default is one.
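One way to choose a value for -p is to match the number of online CPUs. This is a sketch under assumptions: the helper name cpu_count is hypothetical, getconf _NPROCESSORS_ONLN is a common but not universal query, and whether one process per CPU is the best setting for this script has not been measured here.

```shell
# Print the number of online CPUs, or 1 if it cannot be determined.
cpu_count() {
  n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=1
  case "$n" in
    ''|*[!0-9]*) n=1 ;;          # not a plain non-negative integer
  esac
  [ "$n" -ge 1 ] || n=1
  printf '%s\n' "$n"
}

# Example invocation (not run here):
#   create_summary_corpus.pl -l en -p "$(cpu_count)"
```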

`-r`

The option -r recreates only the text and XML files from the HTML files that have already been fetched; no new files are downloaded.

`-h`

The option -h prints this documentation.

`-t 0`

The option -t initiates testing mode, in which only the specified number of pages is fetched and parsed. The default is zero, which disables testing mode so that all available pages are fetched and parsed.
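Testing mode is useful as a quick check before a full run. The wrapper below is a minimal sketch: the names run, smoke_test, and DRY_RUN are hypothetical, the default of 5 pages is arbitrary, and it assumes create_summary_corpus.pl is on the PATH. It uses only the -d, -l, and -t options documented above.

```shell
# Hypothetical helper: with DRY_RUN=1 it prints the command instead of running it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Fetch and parse only a few pages before committing to a full run.
# $1: corpus directory (-d), $2: language code (-l), $3: page limit (-t), default 5.
smoke_test() {
  run create_summary_corpus.pl -d "$1" -l "$2" -t "${3:-5}"
}
```

Setting DRY_RUN=1 before calling smoke_test ./corpus de 10 shows the command that would be executed without fetching anything.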

BUGS

This script creates corpora by parsing Wikipedia pages. The XPath expressions used to extract links and text will become invalid as the format of the various pages changes, causing some corpora not to be created.

Please email bug reports or feature requests to bug-text-corpus-summaries-wikipedia@rt.cpan.org, or submit them through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Text-Corpus-Summaries-Wikipedia. The author will be notified, and you can be automatically notified of progress on the bug fix or feature request.

AUTHOR

Jeff Kubina <jeff.kubina@gmail.com>

COPYRIGHT

The full text of the license can be found in the LICENSE file included with this module.

KEYWORDS

corpus, information processing, summaries, summarization, wikipedia

Links to the featured article pages for the supported language codes: af:Afrikaans, ar:Arabic, az:Azerbaijani, bg:Bulgarian, bs:Bosnian, ca:Catalan, cs:Czech, de:German, el:Greek, en:English, eo:Esperanto, es:Spanish, eu:Basque, fa:Persian, fi:Finnish, fr:French, he:Hebrew, hr:Croatian, hu:Hungarian, id:Indonesian, it:Italian, ja:Japanese, jv:Javanese, ka:Georgian, kk:Kazakh, km:Khmer, ko:Korean, li:Limburgish, lv:Latvian, ml:Malayalam, mr:Marathi, ms:Malay, mzn:Mazandarani, nl:Dutch, nn:Norwegian (Nynorsk), no:Norwegian (Bokmål), pl:Polish, pt:Portuguese, ro:Romanian, ru:Russian, sh:Serbo-Croatian, simple:Simple English, sk:Slovak, sl:Slovenian, sr:Serbian, sv:Swedish, sw:Swahili, ta:Tamil, th:Thai, tl:Tagalog, tr:Turkish, tt:Tatar, uk:Ukrainian, ur:Urdu, vi:Vietnamese, vo:Volapük, and zh:Chinese.

Copies of the data sets generated in May 2010 and February 2013 can be downloaded here.

