NAME
Text::Summarize - Routine to compute summaries of text.
SYNOPSIS
use strict;
use warnings;
use Text::Summarize;
use Data::Dump qw(dump);
my $listOfSentences = [
{ id => 0, listOfTokens => [qw(all people are equal)] },
{ id => 1, listOfTokens => [qw(all men are equal)] },
{ id => 2, listOfTokens => [qw(all are equal)] },
];
dump getSumbasicRankingOfSentences(listOfSentences => $listOfSentences);
DESCRIPTION
Text::Summarize contains a routine that scores a list of sentences for inclusion in a summary of the text, using the SumBasic algorithm from the report Beyond SumBasic: Task-Focused Summarization with Sentence Simplification and Lexical Expansion by L. Vanderwende, H. Suzuki, C. Brockett, and A. Nenkova.
ROUTINES
getSumbasicRankingOfSentences
use Text::Summarize;
use Data::Dump qw(dump);
my $listOfSentences = [
{ id => 0, listOfTokens => [qw(all people are equal)] },
{ id => 1, listOfTokens => [qw(all men are equal)] },
{ id => 2, listOfTokens => [qw(all are equal)] },
];
dump getSumbasicRankingOfSentences(listOfSentences => $listOfSentences);
getSumbasicRankingOfSentences computes the SumBasic score of each sentence in the list provided. It returns an array reference containing the pairs [id, score], sorted in descending order of score, where id is the sentence identifier from listOfSentences.
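The ranking described above can be sketched in plain Perl. This is only an illustration of the general SumBasic idea under stated assumptions, not the module's actual implementation: the helper name sumbasic_ranking_sketch is hypothetical, token weights are assumed to start as relative frequencies, each selected sentence's tokens have their weights squared, and ties are broken by the sorted order of the ids.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::Summarize: a minimal SumBasic-style
# ranking that repeatedly selects the sentence with the highest average
# token weight and then squares the weights of the tokens it contains.
sub sumbasic_ranking_sketch {
    my ($listOfSentences) = @_;

    # Assumption: the initial weight of a token is its relative frequency.
    my (%weight, $total);
    for my $sentence (@$listOfSentences) {
        for my $token (@{ $sentence->{listOfTokens} }) {
            $weight{$token}++;
            $total++;
        }
    }
    return [] unless $total;
    $_ /= $total for values %weight;

    my %remaining = map { $_->{id} => $_->{listOfTokens} } @$listOfSentences;
    my @ranking;
    while (%remaining) {
        # Score each remaining sentence by the average weight of its tokens.
        my ($bestId, $bestScore);
        for my $id (sort keys %remaining) {
            my $tokens = $remaining{$id};
            my $score  = 0;
            $score += $weight{$_} for @$tokens;
            $score /= @$tokens if @$tokens;
            ($bestId, $bestScore) = ($id, $score)
                if !defined $bestScore || $score > $bestScore;
        }
        push @ranking, [$bestId, $bestScore];

        # Square the weight of each distinct token in the selected sentence.
        my %seen;
        $weight{$_} **= 2 for grep { !$seen{$_}++ } @{ $remaining{$bestId} };
        delete $remaining{$bestId};
    }
    return \@ranking;
}

my $ranking = sumbasic_ranking_sketch([
    { id => 0, listOfTokens => [qw(all people are equal)] },
    { id => 1, listOfTokens => [qw(all men are equal)] },
    { id => 2, listOfTokens => [qw(all are equal)] },
]);
print join(' ', map { $_->[0] } @$ranking), "\n";    # ids in selection order: 2 0 1
```

Because every weight is below one, squaring can only lower it, so the scores in this sketch come out in non-increasing order without a separate sort.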
listOfSentences
-
listOfSentences => [{id => '..', listOfTokens => [...]}, ..., {id => '..', listOfTokens => [...]}]
listOfSentences holds the list of sentences to be scored. Each item in the list is a hash reference of the form {id => '..', listOfTokens => [...]}, where id is a unique identifier for the sentence and listOfTokens is an array reference of the tokens comprising the sentence.
tokenWeight
-
tokenWeight => {}
tokenWeight is an optional hash reference that provides the weights of the tokens in listOfSentences. If tokenWeight is defined but has no entry for a token in a sentence, then that token's weight defaults to zero unless ignoreUndefinedTokens is true, in which case the token is ignored and not used to compute the average weight of the sentences containing it. If tokenWeight is undefined, then the weights of the tokens are either their frequency of occurrence in the filtered text or, if textRankParameters is defined, their textranks.
ignoreUndefinedTokens
-
ignoreUndefinedTokens => 0
If ignoreUndefinedTokens is true, then any tokens for which tokenWeight is undefined are ignored and not used to compute the average weight of a sentence; the default is false.
tokenWeightUpdateFunction
-
tokenWeightUpdateFunction => \&subroutine (currentTokenWeight, initialTokenWeight, token, selectedSentenceId, selectedSentenceWeight)
tokenWeightUpdateFunction is an optional parameter for defining the function that updates the weight of a token when it is contained in a selected sentence. Five parameters are passed to the subroutine: the token's current weight (float), the token's initial weight (float), the token (string), the id of the selected sentence (string), and the current average weight of the tokens in the selected sentence (float). The default is tokenWeightUpdateFunction_Squared.
textRankParameters
-
textRankParameters => undef
If textRankParameters is defined, then the token weights are computed using Text::Categorize::Textrank. The parameters to pass to Text::Categorize::Textrank, excluding the listOfTokens parameters, can be set using the hash reference given by textRankParameters. For example, textRankParameters => {directedGraph => 1} would make the textrank weights be computed using a directed token graph.
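The interaction between tokenWeight and ignoreUndefinedTokens described above can be illustrated with a small sketch. The helper average_token_weight_sketch is hypothetical and is not part of the module; returning zero for a sentence with no counted tokens is an assumption made here only to avoid a division by zero.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::Summarize: average weight of the
# tokens in one sentence, given a user-supplied tokenWeight hash reference.
sub average_token_weight_sketch {
    my ($listOfTokens, $tokenWeight, $ignoreUndefinedTokens) = @_;
    my ($sum, $count) = (0, 0);
    for my $token (@$listOfTokens) {
        my $weight = $tokenWeight->{$token};
        if (!defined $weight) {
            next if $ignoreUndefinedTokens;    # token is left out of the average
            $weight = 0;                       # token counts with weight zero
        }
        $sum += $weight;
        $count++;
    }
    # Assumption: a sentence with no counted tokens scores zero.
    return $count ? $sum / $count : 0;
}

my $tokenWeight = { all => 0.5, equal => 0.3 };
my @tokens      = qw(all men are equal);
printf "%.2f\n", average_token_weight_sketch(\@tokens, $tokenWeight, 0);    # 0.8 / 4 = 0.20
printf "%.2f\n", average_token_weight_sketch(\@tokens, $tokenWeight, 1);    # 0.8 / 2 = 0.40
```

With ignoreUndefinedTokens false, the unweighted tokens men and are pull the average down; with it true, only the two weighted tokens are averaged.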
tokenWeightUpdateFunction_Squared
Returns the token's current weight squared.
tokenWeightUpdateFunction_Multiplicative
Returns the token's current weight times its initial weight.
tokenWeightUpdateFunction_Sentence
Returns the token's current weight times the average weight of the tokens in the selected sentence.
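As a rough illustration of the three update rules, the following stand-alone subroutines mirror the descriptions above. They are sketches written for this example (the *_sketch names are hypothetical), not the module's own code; they only assume the five-argument order documented for tokenWeightUpdateFunction.

```perl
use strict;
use warnings;

# Argument order assumed from the tokenWeightUpdateFunction description:
# (currentTokenWeight, initialTokenWeight, token, selectedSentenceId,
#  selectedSentenceWeight)

# Sketch of the "squared" rule: current weight times itself.
sub update_squared_sketch {
    my ($current) = @_;
    return $current * $current;
}

# Sketch of the "multiplicative" rule: current weight times initial weight.
sub update_multiplicative_sketch {
    my ($current, $initial) = @_;
    return $current * $initial;
}

# Sketch of the "sentence" rule: current weight times the average weight
# of the tokens in the selected sentence.
sub update_sentence_sketch {
    my ($current, $initial, $token, $sentenceId, $sentenceWeight) = @_;
    return $current * $sentenceWeight;
}

my @args = (0.25, 0.5, 'equal', 2, 0.4);
printf "%.4f %.4f %.4f\n",
    update_squared_sketch(@args),
    update_multiplicative_sketch(@args),
    update_sentence_sketch(@args);    # prints "0.0625 0.1250 0.1000"
```

A custom rule with the same argument order could presumably be supplied through the tokenWeightUpdateFunction parameter as a subroutine reference.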
INSTALLATION
Use CPAN to install the module and all its prerequisites:
perl -MCPAN -e shell
>install Text::Summarize
BUGS
Please email bug reports or feature requests to bug-text-summarize@rt.cpan.org, or submit them through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Text-Summarize. The author will be notified, and you can be automatically notified of progress on the bug fix or feature request.
AUTHOR
Jeff Kubina <jeff.kubina@gmail.com>
COPYRIGHT
Copyright (c) 2009 Jeff Kubina. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The full text of the license can be found in the LICENSE file included with this module.
KEYWORDS
information processing, summary, summaries, summarization, summarize, sumbasic, textrank
SEE ALSO
Log::Log4perl, Text::Categorize::Textrank, Text::Summarize::En
The SumBasic algorithm for ranking sentences is from Beyond SumBasic: Task-Focused Summarization with Sentence Simplification and Lexical Expansion by L. Vanderwende, H. Suzuki, C. Brockett, and A. Nenkova.