NAME
Text::Categorize::Util - Method to get keywords and phrases of text.
SYNOPSIS
use strict;
use warnings;
use Text::Categorize::Textrank::En;
use Text::Categorize::Util qw(getKeywordsAndPhrases);
use Data::Dump qw(dump);
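# Create the English textrank engine and fetch the sample text shipped with the module.
my $textrankerEn = Text::Categorize::Textrank::En->new();
my $text = Text::Categorize::Util::getTestText();
print $text;
# Compute the textrank information for the text.
my $textrankInfo = $textrankerEn->getTextrankInfoOfText(listOfText => [$text]);
# Select the top keywords and the phrases associated with them.
my $keywordInfo = getKeywordsAndPhrases(
    %$textrankInfo,
    listOfStemmedTaggedDocuments => [ $textrankInfo->{listOfStemmedTaggedSentences} ],
    numberOfKeywords => 9
);
dump $keywordInfo;
# Collect the distinct phrase strings and print them in sorted order.
my %phrases = map { ($_->{phrase}, 1) } map { (@$_) } @{ $keywordInfo->{keyphrases} };
dump [ sort keys %phrases ];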
DESCRIPTION
Text::Categorize::Util provides a routine to select the keywords and related phrases from the results of the routine getTextrankInfoOfText in Text::Categorize::Textrank::En.
ROUTINES
getKeywordsAndPhrases
From the results of the routine getTextrankInfoOfText in Text::Categorize::Textrank::En, the routine getKeywordsAndPhrases selects the keywords for the text and their most common instance in the text (keywordOrderInstance), plus the keyphrases in the text associated with the selected keywords (keyphrases).
More precisely, if $results is the returned hash, then $results->{keywordOrderInstance} contains an array reference of the selected keywords in descending order of importance within the text; each item in the list is {keyword => '', instance => ''}, where keyword is the identifier used for the keyword and instance is the most common form or instance of the keyword in the text.
$results->{keyphrases} contains an array reference of hashes of the form {wordsOfPhrase => [], keywordsOfPhrase => [], phrase => ''}, where wordsOfPhrase is a list of the words from listOfStemmedTaggedSentences that comprise the phrase, keywordsOfPhrase is a list of the keywords that occur in the phrase, and phrase is the string of the phrase words.
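The shape of the returned hash can be illustrated with a small sketch. It does not call the module: $results below is hand-built with invented values, and the helpers instances_in_rank_order and phrases_by_keyword are hypothetical names used only for this illustration. Because the SYNOPSIS dereferences one extra array level when reading keyphrases, the sketch accepts either a flat list of hashes or a list of lists of hashes.

```perl
use strict;
use warnings;

# Hand-built stand-in for the hash returned by getKeywordsAndPhrases.
# All values are invented for illustration only.
my $results = {
    keywordOrderInstance => [
        { keyword => 'categor', instance => 'categorize' },
        { keyword => 'keyword', instance => 'keywords' },
    ],
    keyphrases => [
        {
            wordsOfPhrase    => [ 'categorize', 'keywords' ],
            keywordsOfPhrase => [ 'categor', 'keyword' ],
            phrase           => 'categorize keywords',
        },
        {
            wordsOfPhrase    => ['keywords'],
            keywordsOfPhrase => ['keyword'],
            phrase           => 'keywords',
        },
    ],
};

# Hypothetical helper: the most common instance of each keyword,
# in the order stored in keywordOrderInstance.
sub instances_in_rank_order {
    my ($res) = @_;
    return map { $_->{instance} } @{ $res->{keywordOrderInstance} };
}

# Hypothetical helper: map each keyword identifier to the phrases containing it.
sub phrases_by_keyword {
    my ($res) = @_;
    my %phrasesOf;
    # Accept a flat list of hashes or one extra level of array nesting.
    my @entries = map { ref $_ eq 'ARRAY' ? @$_ : $_ } @{ $res->{keyphrases} };
    for my $entry (@entries) {
        push @{ $phrasesOf{$_} }, $entry->{phrase} for @{ $entry->{keywordsOfPhrase} };
    }
    return \%phrasesOf;
}

print join(', ', instances_in_rank_order($results)), "\n";
my $phrasesOf = phrases_by_keyword($results);
for my $keyword (sort keys %$phrasesOf) {
    print "$keyword: ", join('; ', @{ $phrasesOf->{$keyword} }), "\n";
}
```

Grouping phrases by keywordsOfPhrase is just one convenient way to consume the result; the module itself returns only the two lists described above.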
listOfStemmedTaggedDocuments
listOfStemmedTaggedDocuments => [...]
listOfStemmedTaggedDocuments is an array reference where each item in the array is a list of stemmed and part-of-speech tagged sentences from Text::StemTagPos. If listOfStemmedTaggedDocuments is not defined, then the text to be processed should be provided via listOfText.
hashOfTextrankValues
hashOfTextrankValues => {}
hashOfTextrankValues holds the hash of the textrank values computed by getTextrankOfListOfTokens. Selected phrases will only begin and end with tokens for which hashOfTextrankValues is defined and positive.
useStemmedWords
useStemmedWords => 1
useStemmedWords should be set to the same value that was used when computing the textrank with the routine getTextrankInfoOfText in Text::Categorize::Textrank::En. The default is true.
numberOfKeywords
numberOfKeywords => 10
numberOfKeywords should be set to the number of keywords to select for the text. If it is greater than the number of values in hashOfTextrankValues, it is reduced to that number. The default is 10.
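The interaction between numberOfKeywords and hashOfTextrankValues can be sketched as follows. This is a minimal illustration, not the module's code: select_top_keywords is a hypothetical helper, the textrank values are invented, and it assumes that "importance" means the textrank value, with ties broken alphabetically only to keep the output stable.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::Categorize::Util.
sub select_top_keywords {
    my (%args) = @_;
    my $ranks = $args{hashOfTextrankValues} || {};
    # Documented default of 10 when numberOfKeywords is not given.
    my $count = defined $args{numberOfKeywords} ? $args{numberOfKeywords} : 10;

    # Clamp the request to the number of ranked tokens available.
    my $available = scalar keys %$ranks;
    $count = $available if $count > $available;
    $count = 0          if $count < 0;

    # Highest textrank first; alphabetical tie-break is an assumption.
    my @ordered = sort { $ranks->{$b} <=> $ranks->{$a} || $a cmp $b } keys %$ranks;
    return @ordered[ 0 .. $count - 1 ];
}

# Invented textrank values for three tokens.
my %ranks = ( text => 0.31, keyword => 0.27, phrase => 0.22 );

# Asking for 10 keywords when only 3 are ranked yields 3.
my @top = select_top_keywords( hashOfTextrankValues => \%ranks, numberOfKeywords => 10 );
print join( ', ', @top ), "\n";
```

Clamping rather than raising an error matches the behavior described above, where an oversized numberOfKeywords is silently reduced.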
INSTALLATION
To install the module run the following commands:
perl Makefile.PL
make
make test
make install
If you are on Windows, use 'nmake' rather than 'make'.
BUGS
Please email bug reports or feature requests to bug-text-categorize-util@rt.cpan.org, or submit them through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Text-Categorize-Util. The author will be notified, and you can be automatically notified of progress on the bug fix or feature request.
AUTHOR
Jeff Kubina <jeff.kubina@gmail.com>
COPYRIGHT
Copyright (c) 2009 Jeff Kubina. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The full text of the license can be found in the LICENSE file included with this module.
KEYWORDS
categorize, keywords, keyphrases, nlp, textrank