NAME

Text::Summarizer - Summarize Bodies of Text

SYNOPSIS

use Text::Summarizer;

my $summarizer = Text::Summarizer->new( articles_path => "articles/*" );

my $summary   = $summarizer->summarize_file("articles/article00.txt");
# or, if you want to process in bulk
my @summaries = $summarizer->summarize_all("articles/*");

$summarizer->pretty_print($summary);
$summarizer->pretty_print($_) for (@summaries);

DESCRIPTION

This module allows you to summarize bodies of text into a scored hash of sentences, phrase-fragments, and individual words from the provided text. These scores reflect the weight (or precedence) of the respective text-fragments, i.e. how well they summarize or reflect the overall nature of the text. All of the sentences and phrase-fragments are drawn from within the existing text, and are NOT procedurally generated.

$summarizer->summarize_text and $summarizer->summarize_file each return a hash-ref containing three array-refs ($summarizer->summarize_all returns a list of these hash-refs):

* sentences => a list of full sentences from the given article, with composite scores of the words contained therein

* fragments => a list of phrase fragments from the given article, scored as above

* words => a list of all words in the article, scored by a three-factor system consisting of frequency of appearance, population standard deviation of word clustering, and use in important phrase fragments
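
To make the "population standard deviation of word clustering" factor more concrete, here is a minimal sketch that computes the population standard deviation of the token positions at which a word occurs. The helper name `position_spread` and the whitespace tokenizer are assumptions made for illustration only; how the module itself tokenizes text, or how it combines this value with the other two factors, is not shown here.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper (not part of Text::Summarizer's documented API):
# population standard deviation of the token indexes at which $word occurs.
sub position_spread {
    my ($word, @tokens) = @_;
    my @positions = grep { $tokens[$_] eq $word } 0 .. $#tokens;
    return undef unless @positions;    # word never occurs

    my $mean     = sum(@positions) / @positions;
    my $variance = sum(map { ($_ - $mean) ** 2 } @positions) / @positions;
    return sqrt $variance;
}

# Naive whitespace tokenizer, assumed for this sketch only.
my @tokens = split ' ', lc 'the cat sat on the mat while the dog slept';
printf "%s: %.3f\n", $_, position_spread($_, @tokens) for qw(the cat);
```

A word whose occurrences sit close together yields a small value, while a word spread evenly through the text yields a larger one; a word that occurs once yields zero.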

The $summarizer->pretty_print method prints a visually pleasing graph of the above three summary categories.
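
When the printed graph is not what you need, the returned hash-ref can also be walked by hand. The sketch below uses a hand-built stand-in for a summary: only the three top-level keys come from the description above, while the placeholder strings inside each array-ref and the helper name `category_counts` are assumptions, since the exact shape of the scored entries is not spelled out here.

```perl
use strict;
use warnings;

# Stand-in for the hash-ref returned by summarize_text / summarize_file.
# The element contents are placeholders, not the module's real entries.
my $summary = {
    sentences => [ 'placeholder sentence one', 'placeholder sentence two' ],
    fragments => [ 'placeholder fragment' ],
    words     => [ 'placeholder', 'sentence', 'fragment' ],
};

# Hypothetical helper: number of entries held by each of the three categories.
sub category_counts {
    my ($summary) = @_;
    return map { ($_ => scalar @{ $summary->{$_} }) } qw(sentences fragments words);
}

my %counts = category_counts($summary);
printf "%-9s %d entries\n", $_, $counts{$_} for qw(sentences fragments words);
```

For the list returned by summarize_all, the same helper could be applied to each hash-ref in turn.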

## About Fragments

Phrase fragments are actually short "scraps" of text (usually only two or three words) that are derived from the text via the following process:

1. The entirety of the text is tokenized and scored into a frequency table, with a high-pass threshold that keeps only frequencies above (number of tokens * user-defined scaling factor).

2. Each sentence is tokenized and stored in an array.

3. For each word within the frequency table, a table of phrase-fragments is derived by finding each occurrence of said word and tracking forward and backward by a user-defined "radius" of tokens (defaults to radius = 5, not counting the central key-word); each phrase-fragment is thus compiled of (by default) an 11-token string.

4. All fragments for a given key-word are then compared to each other, and each word is deleted if it appears only once amongst all of the fragments (leaving only the words that occur more than once across the phrase-fragments A, B, ..., S).

5. What remains of each fragment is a list of "scraps" (strings of consecutive tokens), from which the longest scrap is chosen as a representation of the given phrase-fragment.

6. When a shorter fragment-scrap is included in the text of a longer scrap (i.e. a different phrase-fragment), the shorter is deleted and its score is added to the score of the longer.

7. When multiple fragments are equivalent (i.e. they consist of the same list of tokens when stopwords are excluded), they are condensed into a single scrap in the form of "(some|word|tokens)" such that the fragment now represents the tokens of the scrap (excluding stopwords) regardless of order.
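
The frequency threshold, the radius window, and the longest-scrap selection described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the module's implementation: the function names (`frequent_words`, `windows_for`, `longest_scraps`) are hypothetical, the text is treated as one flat list of whitespace-separated tokens rather than per-sentence arrays, and scoring, scrap merging, and stopword handling are left out.

```perl
use strict;
use warnings;

# Frequency table with a high-pass threshold: keep words whose count
# exceeds (number of tokens * $factor). $factor stands in for the
# user-defined scaling factor.
sub frequent_words {
    my ($factor, @tokens) = @_;
    my %count;
    $count{$_}++ for @tokens;
    my $threshold = @tokens * $factor;
    return grep { $count{$_} > $threshold } sort keys %count;
}

# One window per occurrence of $keyword: up to $radius tokens on each
# side, plus the keyword itself, clipped at the ends of the token list.
sub windows_for {
    my ($keyword, $radius, @tokens) = @_;
    my @windows;
    for my $i (grep { $tokens[$_] eq $keyword } 0 .. $#tokens) {
        my $lo = $i - $radius < 0        ? 0        : $i - $radius;
        my $hi = $i + $radius > $#tokens ? $#tokens : $i + $radius;
        push @windows, [ @tokens[ $lo .. $hi ] ];
    }
    return @windows;
}

# Drop tokens seen only once across all windows, then keep the longest
# run of consecutive surviving tokens in each window.
sub longest_scraps {
    my @windows = @_;
    my %seen;
    $seen{$_}++ for map { @$_ } @windows;

    my @scraps;
    for my $window (@windows) {
        my (@best, @run);
        for my $token (@$window, undef) {    # trailing undef flushes the last run
            if (defined $token && $seen{$token} > 1) {
                push @run, $token;
                next;
            }
            @best = @run if @run > @best;
            @run  = ();
        }
        push @scraps, join ' ', @best;
    }
    return @scraps;
}

my @tokens = split ' ',
    'red fox runs fast but a red fox runs far and a red fox sleeps';
print "key-words: @{[ frequent_words(0.15, @tokens) ]}\n";
print "scrap: $_\n" for longest_scraps(windows_for('fox', 2, @tokens));
```

With a radius of 2, each window holds at most 5 tokens; the default radius of 5 gives the 11-token fragments mentioned above.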

AUTHOR

Faelin Landy (CPAN:FaeTheWolf) <faelin.landy@gmail.com>

COPYRIGHT AND LICENSE

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

To install Text::Summarizer, copy and paste the appropriate command into your terminal.

cpanm

cpanm Text::Summarizer

CPAN shell

perl -MCPAN -e shell
install Text::Summarizer
