NAME
Text::Summarize::En - Routine to summarize English text.
SYNOPSIS
use strict;
use warnings;
use Text::Summarize::En;
use Data::Dump qw(dump);
my $summarizerEn = Text::Summarize::En->new();
my $text = 'All people are equal. All men are equal. All are equal.';
dump $summarizerEn->getSummaryUsingSumbasic(listOfText => [$text]);
DESCRIPTION
Text::Summarize::En contains routines for ranking the sentences in English text for inclusion in a summary using the sumBasic algorithm.
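To make the ranking idea concrete, here is a minimal, self-contained sketch of SumBasic-style sentence ranking in plain Perl. It is not the module's implementation: the function name rank_sentences_sumbasic is hypothetical, the input is assumed to be already filtered tokens, and the selection rule (highest average token probability, then squaring the probabilities of the tokens just used) follows the commonly described form of the algorithm.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::Summarize::En.
# Each argument is an array reference of already filtered tokens for one
# sentence; the return value is the list of sentence indices in the
# order a SumBasic-style ranking would select them.
sub rank_sentences_sumbasic {
    my @sentences = @_;

    # Token probability: occurrences of the token over all tokens.
    my (%prob, $total);
    for my $sentence (@sentences) {
        for my $token (@$sentence) {
            $prob{$token}++;
            $total++;
        }
    }
    return () unless $total;
    $_ /= $total for values %prob;

    # Sentences with no tokens cannot be scored and are left out.
    my %remaining = map { $_ => 1 }
        grep { @{ $sentences[$_] } } 0 .. $#sentences;

    my @order;
    while (%remaining) {
        # Score each remaining sentence by its average token probability;
        # ties go to the earliest sentence.
        my ($best, $bestScore);
        for my $i (sort { $a <=> $b } keys %remaining) {
            my $score = 0;
            $score += $prob{$_} for @{ $sentences[$i] };
            $score /= @{ $sentences[$i] };
            ($best, $bestScore) = ($i, $score)
                if !defined $bestScore || $score > $bestScore;
        }
        push @order, $best;
        delete $remaining{$best};

        # Square the probability of each distinct token just used, so
        # later picks favour sentences that cover other tokens.
        my %seen;
        $prob{$_} **= 2 for grep { !$seen{$_}++ } @{ $sentences[$best] };
    }
    return @order;
}

# Tokens loosely based on the SYNOPSIS text, filtered by hand.
my @sentences = ([qw(people equal)], [qw(men equal)], [qw(equal)]);
my @order = rank_sentences_sumbasic(@sentences);
print "@order\n";    # prints "2 0 1"
```

The squaring step is what discourages redundancy: once "equal" has been covered by the first pick, sentences that rely on it score lower in later rounds.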
CONSTRUCTOR
new
The method new creates an instance of the Text::Summarize::En class with the following parameters:
endingSentenceTag
-
endingSentenceTag => 'PP'
endingSentenceTag is the part-of-speech tag that should be used to indicate the end of a sentence. The default is 'PP'. The value of this tag must be a tag generated by the module Lingua::EN::Tagger.
listOfPOSTypesToKeep
-
listOfPOSTypesToKeep => [qw(CONTENT_WORDS)]
The sumBasic algorithm preprocesses the text so that only certain parts-of-speech (POS) are retained and used to rank the sentences. The module Lingua::EN::Tagger is used to tag the parts-of-speech of the text. The parts-of-speech to retain are specified by word types, where each type is one of 'ALL', 'ADJECTIVES', 'ADVERBS', 'CONTENT_ADVERBS', 'CONTENT_WORDS', 'NOUNS', 'PUNCTUATION', 'TEXTRANK_WORDS', or 'VERBS', and several types can be combined. The default is [qw(CONTENT_WORDS)], which equates to [qw(CONTENT_ADVERBS VERBS ADJECTIVES NOUNS)].
listOfPOSTagsToKeep
-
listOfPOSTagsToKeep => [...]
listOfPOSTagsToKeep provides finer control over the parts-of-speech to be retained when filtering the tagged text. For a list of all the possible tags, call getListOfPartOfSpeechTags().
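As an illustration of the constructor parameters described above, the following sketch builds a summarizer that keeps only nouns and verbs. The choice of [qw(NOUNS VERBS)] is only an example value; the sketch assumes nothing beyond the parameter names documented here, and it guards the call so that it still runs on a machine where the module is not installed.

```perl
use strict;
use warnings;

# Constructor arguments taken from the parameter descriptions above;
# the particular word types chosen here are just an example.
my %constructorArgs = (
    endingSentenceTag    => 'PP',
    listOfPOSTypesToKeep => [qw(NOUNS VERBS)],
);

# Only build the object if the module can be loaded, so the sketch
# prints a note instead of dying when it is missing.
if (eval { require Text::Summarize::En; 1 }) {
    my $summarizerEn = Text::Summarize::En->new(%constructorArgs);
    print ref($summarizerEn), "\n";
}
else {
    print "Text::Summarize::En is not installed\n";
}
```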
METHODS
getSummaryUsingSumbasic
getSummaryUsingSumbasic computes the summary of text using the sumBasic algorithm.
listOfStemmedTaggedSentences
-
listOfStemmedTaggedSentences => [...]
listOfStemmedTaggedSentences is an array reference containing the list of stemmed and part-of-speech tagged sentences from Text::StemTagPos. If listOfStemmedTaggedSentences is not defined, then the text to be processed should be provided via listOfText.
listOfText
-
listOfText => [...]
listOfText is an array reference containing the strings of text to be summarized. listOfText is only used if listOfStemmedTaggedSentences is undefined.
tokenWeight
-
tokenWeight => {}
tokenWeights is an optional hash reference that can provide the weights for the tokens provided by listOfStemmedTaggedSentences or listOfText. If tokenWeights is not defined, then the weight of a token is just its frequency of occurrence in the filtered text. If textRankParameters is defined, then the token weights are computed using Text::Categorize::Textrank.
textRankParameters
-
textRankParameters => undef
If textRankParameters is defined, then the token weights for the sumBasic algorithm are computed using Text::Categorize::Textrank. The parameters to use for Text::Categorize::Textrank, excluding the listOfTokens parameters, can be set using the hash reference defined by textRankParameters. For example, textRankParameters => {directedGraph => 1} would make the textrank weights be computed using a directed token graph.
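Putting the method parameters together, a call that summarizes raw text with textrank-based token weights might look like the sketch below. It reuses the text from the SYNOPSIS and the directedGraph example given above; the structure of the returned value is not described in this document, so the sketch only dumps it. The call is guarded so the sketch still runs when the modules are not installed.

```perl
use strict;
use warnings;

my $text = 'All people are equal. All men are equal. All are equal.';

# Arguments as documented above; directedGraph => 1 is the example
# value given for textRankParameters.
my %summaryArgs = (
    listOfText         => [$text],
    textRankParameters => { directedGraph => 1 },
);

if (eval { require Text::Summarize::En; require Data::Dump; 1 }) {
    my $summarizerEn = Text::Summarize::En->new();
    # In void context Data::Dump::dump prints the structure to STDERR.
    Data::Dump::dump($summarizerEn->getSummaryUsingSumbasic(%summaryArgs));
}
else {
    print "Text::Summarize::En or Data::Dump is not installed\n";
}
```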
INSTALLATION
Use CPAN to install the module and all its prerequisites:
perl -MCPAN -e shell
>install Text::Summarize
BUGS
Please email bug reports or feature requests to bug-text-summarize@rt.cpan.org, or submit them through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Text-Summarize. The author will be notified and you can be automatically notified of progress on the bug fix or feature request.
AUTHOR
Jeff Kubina <jeff.kubina@gmail.com>
COPYRIGHT
Copyright (c) 2009 Jeff Kubina. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The full text of the license can be found in the LICENSE file included with this module.
KEYWORDS
information processing, summary, summaries, summarization, summarize, sumbasic, textrank
SEE ALSO
Log::Log4perl, Text::Categorize::Textrank, Text::Summarize
The SumBasic algorithm for ranking sentences is from Beyond SumBasic: Task-Focused Summarization with Sentence Simplification and Lexical Expansion by L. Vanderwende, H. Suzuki, C. Brockett, and A. Nenkova.