NAME

Text::Summarizer - Summarize Bodies of Text

SYNOPSIS

use Text::Summarizer;

my $summarizer = Text::Summarizer->new( articles_path => "articles/*" );

my $summary   = $summarizer->summarize_file("articles/article00.txt");
# or, to process files in bulk:
my @summaries = $summarizer->summarize_all("articles/*");

$summarizer->pretty_print($summary);
$summarizer->pretty_print($_) for (@summaries);

DESCRIPTION

This module summarizes bodies of text into a scored hash of sentences, phrase-fragments, and individual words drawn from the provided text. Each score reflects the weight (or precedence) of the corresponding text-fragment, i.e. how well it summarizes or reflects the overall nature of the text. All sentences and phrase-fragments are drawn from within the existing text, and are NOT procedurally generated.

$summarizer->summarize_text and $summarizer->summarize_file each return a hash-ref containing three array-refs ($summarizer->summarize_all returns a list of these hash-refs):

sentences: a list of full sentences from the given article, with composite scores of the words contained therein
fragments: a list of phrase fragments from the given article, scored as above
words: a list of all words in the article, scored by a three-factor system consisting of frequency of appearance, population standard deviation of word clustering, and use in selected phrase fragments

The $summarizer->pretty_print method prints a visually pleasing graph of the above three summary categories.
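Two of the word-scoring factors named above, frequency of appearance and the population standard deviation of where a word occurs, can be illustrated with a minimal sketch. The helper name position_stats and the use of token positions as the clustering measure are assumptions made for illustration; the documentation does not state how the module computes or combines these factors.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper (not part of Text::Summarizer's API): for each word,
# report how often it occurs and the population standard deviation of the
# token positions at which it occurs.
sub position_stats {
    my (@tokens) = @_;

    # Map each lower-cased word to the list of positions where it appears.
    my %positions;
    push @{ $positions{ lc $tokens[$_] } }, $_ for 0 .. $#tokens;

    my %stats;
    for my $word (keys %positions) {
        my @pos  = @{ $positions{$word} };
        my $n    = scalar @pos;
        my $mean = sum(@pos) / $n;
        # Population variance: divide by N rather than N - 1.
        my $var  = sum( map { ($_ - $mean) ** 2 } @pos ) / $n;
        $stats{$word} = { freq => $n, sigma => sqrt($var) };
    }
    return \%stats;
}

my @tokens = qw(the cat sat on the mat while the dog slept);
my $stats  = position_stats(@tokens);
for my $word (sort keys %$stats) {
    printf "%-6s freq=%d sigma=%.3f\n",
        $word, $stats->{$word}{freq}, $stats->{$word}{sigma};
}
```

A small sigma means a word's occurrences are tightly clustered, while a large one means they are spread across the text. How such a value would be weighted against frequency is left open here.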

About Fragments

Phrase fragments are in actuality short "scraps" of text (usually only two or three words) derived from the text via the following process:

the entirety of the text is tokenized and scored into a frequency table, keeping only tokens whose frequency exceeds a high-pass threshold of (number of tokens * user-defined scaling factor)
each sentence is tokenized and stored in an array
for each word within the frequency table, a table of phrase-fragments is derived by finding each occurrence of said word and tracking forward and backward by a user-defined "radius" of tokens (defaults to radius = 5, not counting the central key-word); each phrase-fragment is thus (by default) an 11-token string
all fragments for a given key-word are then compared to each other, and each word is deleted if it appears only once amongst all of the fragments (leaving only the tokens that occur more than once across the phrase-fragments A, B, ..., S)
what remains of each fragment is a list of "scraps" (strings of consecutive tokens), from which the longest scrap is chosen as a representation of the given phrase-fragment
when a shorter fragment-scrap is included in the text of a longer scrap (i.e. a different phrase-fragment), the shorter is deleted and its score is added to the score of the longer
when multiple fragments are equivalent (i.e. they consist of the same list of tokens when stopwords are excluded), they are condensed into a single scrap in the form of "(some|word|tokens)" such that the fragment now represents the tokens of the scrap (excluding stopwords) regardless of order
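The first steps of the process above (frequency threshold, radius windows, deletion of one-off tokens, longest scrap) can be sketched roughly as follows. This is an illustration under stated assumptions, not the module's implementation: the helper names tokenize, frequent_words, and longest_scraps are hypothetical, the tokenizer is deliberately naive, windows are taken over the whole token stream rather than per sentence, and the merging of overlapping or equivalent scraps is omitted.

```perl
use strict;
use warnings;

# All helper names here are hypothetical, not Text::Summarizer's API.

# Lower-cased word tokens (a deliberately naive tokenizer).
sub tokenize {
    my ($text) = @_;
    return map { lc $_ } ($text =~ /\w+/g);
}

# Words whose frequency exceeds (token count * scaling factor).
sub frequent_words {
    my ($tokens, $scale) = @_;
    my %freq;
    $freq{$_}++ for @$tokens;
    my $threshold = scalar(@$tokens) * $scale;
    return grep { $freq{$_} > $threshold } sort keys %freq;
}

# For one key-word, return the longest surviving scrap of each fragment.
sub longest_scraps {
    my ($tokens, $keyword, $radius) = @_;
    my $last = $#{$tokens};

    # Window of up to $radius tokens either side of each occurrence,
    # with the key-word itself in the middle.
    my @fragments;
    for my $i (grep { $tokens->[$_] eq $keyword } 0 .. $last) {
        my $lo = $i - $radius < 0     ? 0     : $i - $radius;
        my $hi = $i + $radius > $last ? $last : $i + $radius;
        push @fragments, [ @{$tokens}[ $lo .. $hi ] ];
    }

    # Count each token across all fragments for this key-word.
    my %count;
    $count{$_}++ for map { @$_ } @fragments;

    # Drop tokens seen only once, then keep the longest run of
    # consecutive surviving tokens in each fragment.
    my @scraps;
    for my $frag (@fragments) {
        my (@best, @run);
        for my $tok (@$frag, undef) {    # trailing undef flushes the last run
            if (defined $tok && $count{$tok} > 1) {
                push @run, $tok;
                next;
            }
            @best = @run if @run > @best;
            @run  = ();
        }
        push @scraps, join ' ', @best;
    }
    return @scraps;
}

my $text = 'The quick brown fox jumps. A lazy dog saw the quick brown fox run. '
         . 'Later the quick brown fox slept.';
my @tokens = tokenize($text);
for my $word (frequent_words(\@tokens, 0.1)) {
    my @scraps = longest_scraps(\@tokens, $word, 2);
    print "$word: ", join(' | ', @scraps), "\n";
}
```

A radius of 2 is used here only to keep the sample output short; the documented default is 5.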

AUTHOR

Faelin Landy (CPAN:FaeTheWolf) <faelin.landy@gmail.com>

COPYRIGHT AND LICENSE

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

