NAME

Lingua::EN::Splitter - Split text into words, paragraphs, segments, and tiles

SYNOPSIS

use Lingua::EN::Splitter qw(words paragraphs paragraph_breaks 
                            segment_breaks tiles set_tokens_per_tile);

my $text = <<EOT;
Lingua::EN::Splitter is a useful module that allows text to be split up 
into words, paragraphs, segments, and tiles.

Paragraphs are by default indicated by blank lines. Known segment breaks are
indicated by a line with only the word "segment_break" in it.

segment_break

This module does not make any attempt to guess segment boundaries. For that,
see L<Lingua::EN::Segmenter::TextTiling>.

EOT

# Set the number of tokens per tile to 20 (the default)
set_tokens_per_tile(20);

my @words = words $text;
my @paragraphs = paragraphs $text;
my @paragraph_breaks = paragraph_breaks $text;
my @segment_breaks = segment_breaks $text;
my @tiles = tile words $text;

print "@words[0..3,5]";     # Prints "lingua en splitter is useful"
print "@words[43..46,53]";  # Prints "this module does not guess"
print $paragraphs[2];       # Prints the third paragraph of the above text
print $paragraph_breaks[2]; # Prints which tile the 3rd paragraph starts on
print $segment_breaks[1];   # Prints which tile the 2nd segment starts on
print $tiles[1];            # Prints @words[20..39] filtered for stopwords 
                            # and stemmed

# This module can also be used in an object-oriented fashion
my $splitter = Lingua::EN::Splitter->new;
@words = $splitter->words($text);

DESCRIPTION

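Lingua::EN::Splitter splits English text into words, paragraphs, segments, and tiles. See the SYNOPSIS for usage.

The routines can be imported and called as plain functions, or called as methods on a Lingua::EN::Splitter object.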
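
To make the terms concrete, here is a minimal, self-contained sketch of paragraph splitting, word splitting, and tiling in plain Perl. It is not this module's implementation. The helper names (sketch_paragraphs, sketch_words, sketch_tiles) are hypothetical, and the sketch assumes that paragraphs are separated by blank lines, that words are lowercased runs of word characters, and that a tile is a run of consecutive words. The stopword filtering and stemming mentioned in the SYNOPSIS are left out.

```perl
use strict;
use warnings;

# Hypothetical helper: split text into paragraphs on blank lines.
sub sketch_paragraphs {
    my ($text) = @_;
    return grep { /\S/ } split /\n\s*\n/, $text;
}

# Hypothetical helper: lowercase the text and split on non-word characters.
sub sketch_words {
    my ($text) = @_;
    return grep { length } split /\W+/, lc $text;
}

# Hypothetical helper: group words into tiles of $size tokens each.
# The last tile may be shorter than $size.
sub sketch_tiles {
    my ($size, @words) = @_;
    die "tile size must be at least 1\n" if $size < 1;
    my @tiles;
    while (@words) {
        push @tiles, [ splice @words, 0, $size ];
    }
    return @tiles;
}

my $text       = "One two three.\n\nFour five.\n";
my @paragraphs = sketch_paragraphs($text);
my @words      = sketch_words($text);
my @tiles      = sketch_tiles(2, @words);

print scalar(@paragraphs), " paragraphs\n";
print "@words\n";
print join(" | ", map { "@$_" } @tiles), "\n";
```

With a tile size of 2, the five words in the sample text fall into three tiles, the last holding a single word. The module's own default is 20 tokens per tile, as set by set_tokens_per_tile in the SYNOPSIS.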

AUTHORS

David James <splice@cpan.org>

SEE ALSO

Lingua::EN::Segmenter::TextTiling, Class::Exporter, http://www.cs.toronto.edu/~james