NAME
create_summary_corpus.pl - Script to create a corpus for summarization testing.
SYNOPSIS
create_summary_corpus.pl [-d corpusDirectory -l languageCode -p maxProcesses -r -h -t n]
DESCRIPTION
The script create_summary_corpus.pl makes a corpus for summarization testing using the featured articles of various Wikipedias. All errors and warnings are logged using Log::Log4perl to the file corpusDirectory/languageCode/log/log.txt.
OPTIONS
-d corpusDirectory
The option -d sets the directory in which to store the corpus of documents; the directory is created if it does not exist. The default is the current working directory.
A language subdirectory is created at corpusDirectory/languageCode that will contain the directories log, html, unparsable, text, and xml. The directory log will contain the file log.txt, to which all errors, warnings, and informational messages are logged using Log::Log4perl. The directory html will contain copies of the HTML versions of the featured article pages fetched using LWP. The directory text will contain two files for each article: one file ends with _body.txt and contains the body text of the article; the other ends with _summary.txt and contains the summary. The directory unparsable will contain the HTML files that could not be parsed into body and summary sections. The XML files are UTF-8 encoded; the text and HTML files are saved as UTF-8 octets.
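The layout described above can be sketched as a short shell snippet. The values ./corpus and en are hypothetical stand-ins for the -d and -l arguments, and the helper name corpus_layout is invented for illustration; the subdirectory names are taken from the description above.

```shell
# Hypothetical values; substitute your own -d and -l arguments.
corpus_dir="./corpus"
lang_code="en"
lang_dir="$corpus_dir/$lang_code"

# Print the subdirectories that the documentation says are created
# for each language (corpus_layout is an illustrative helper name).
corpus_layout() {
  for sub in log html unparsable text xml; do
    printf '%s\n' "$lang_dir/$sub"
  done
}

corpus_layout

# Per the description, the log file lives inside the log directory.
printf '%s\n' "$lang_dir/log/log.txt"
```

Each article should then appear in the text directory as a pair of files ending in _body.txt and _summary.txt.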
-l languageCode
The option -l sets the language code of the Wikipedia from which the corpus of featured articles is to be created. The supported language codes are af:Afrikaans, ar:Arabic, az:Azerbaijani, bg:Bulgarian, bs:Bosnian, ca:Catalan, cs:Czech, de:German, el:Greek, en:English, eo:Esperanto, es:Spanish, eu:Basque, fa:Persian, fi:Finnish, fr:French, he:Hebrew, hr:Croatian, hu:Hungarian, id:Indonesian, it:Italian, ja:Japanese, jv:Javanese, ka:Georgian, kk:Kazakh, km:Khmer, ko:Korean, li:Limburgish, lv:Latvian, ml:Malayalam, mr:Marathi, ms:Malay, mzn:Mazandarani, nl:Dutch, nn:Norwegian (Nynorsk), no:Norwegian (Bokmål), pl:Polish, pt:Portuguese, ro:Romanian, ru:Russian, sh:Serbo-Croatian, simple:Simple English, sk:Slovak, sl:Slovenian, sr:Serbian, sv:Swedish, sw:Swahili, ta:Tamil, th:Thai, tl:Tagalog, tr:Turkish, tt:Tatar, uk:Ukrainian, ur:Urdu, vi:Vietnamese, vo:Volapük, and zh:Chinese. If the language code is all, then the corpus for each supported language is created (which takes a long time). The default is en.
-p maxProcesses
The option -p sets the maximum number of processes that can run simultaneously to parse the files. Parsing the files for the summary and body sections may be computationally intensive, so the module Forks::Super is used for parallelization. The default is one.
-r
Creates only the text and XML files, using the HTML files that have already been fetched; no new files are downloaded.
-h
Prints this documentation.
-t 0
The option -t initiates testing mode; only the specified number of pages are fetched and parsed. The default is zero, which disables testing mode, so all available pages are fetched and parsed.
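As a rough illustration of how the options combine, the following sketch assembles two command lines: a small test run and a rebuild from previously fetched HTML. The directory ./corpus, the process count, and the page count are hypothetical values, and the script is only invoked if it is found on the PATH.

```shell
# Hypothetical values chosen for illustration.
corpus_dir="./corpus"
lang_code="en"
max_procs=2
test_pages=10

# Test run: fetch and parse only a few pages, using two parser processes.
test_args="-d $corpus_dir -l $lang_code -p $max_procs -t $test_pages"

# Rebuild the text and XML files from already fetched HTML; nothing is downloaded.
rebuild_args="-d $corpus_dir -l $lang_code -r"

echo "create_summary_corpus.pl $test_args"
echo "create_summary_corpus.pl $rebuild_args"

# Run the test invocation only if the script is actually on the PATH.
if command -v create_summary_corpus.pl >/dev/null 2>&1; then
  # Word splitting of $test_args is intentional; the values contain no spaces.
  create_summary_corpus.pl $test_args
fi
```

Starting with a small -t value is a cheap way to check that pages for a given language still parse before committing to a full run, particularly with -l all.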
BUGS
This script creates corpora by parsing Wikipedia pages. The XPath expressions used to extract links and text will become invalid as the format of the various pages changes, causing some corpora not to be created.
Please email bug reports or feature requests to bug-text-corpus-summaries-wikipedia@rt.cpan.org, or submit them through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Text-Corpus-Summaries-Wikipedia. The author will be notified, and you can be automatically notified of progress on the bug fix or feature request.
AUTHOR
Jeff Kubina <jeff.kubina@gmail.com>
COPYRIGHT
Copyright (c) 2010 Jeff Kubina. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The full text of the license can be found in the LICENSE file included with this module.
KEYWORDS
corpus, information processing, summaries, summarization, wikipedia
SEE ALSO
Forks::Super, Log::Log4perl, Text::Corpus::Summaries::Wikipedia
Links to the featured article pages for the supported language codes: af:Afrikaans, ar:Arabic, az:Azerbaijani, bg:Bulgarian, bs:Bosnian, ca:Catalan, cs:Czech, de:German, el:Greek, en:English, eo:Esperanto, es:Spanish, eu:Basque, fa:Persian, fi:Finnish, fr:French, he:Hebrew, hr:Croatian, hu:Hungarian, id:Indonesian, it:Italian, ja:Japanese, jv:Javanese, ka:Georgian, kk:Kazakh, km:Khmer, ko:Korean, li:Limburgish, lv:Latvian, ml:Malayalam, mr:Marathi, ms:Malay, mzn:Mazandarani, nl:Dutch, nn:Norwegian (Nynorsk), no:Norwegian (Bokmål), pl:Polish, pt:Portuguese, ro:Romanian, ru:Russian, sh:Serbo-Croatian, simple:Simple English, sk:Slovak, sl:Slovenian, sr:Serbian, sv:Swedish, sw:Swahili, ta:Tamil, th:Thai, tl:Tagalog, tr:Turkish, tt:Tatar, uk:Ukrainian, ur:Urdu, vi:Vietnamese, vo:Volapük, and zh:Chinese.
Copies of the data sets generated in May 2010 and February 2013 can be downloaded here.