NAME

Text::Shingle - Pure Perl implementation of shingles for pieces of text

SYNOPSIS

use Text::Shingle;
my $s = Text::Shingle->new;

my @shingles = $s->shingle_text("a rose is a rose");
# prints [ (a is) (is rose) (a rose) ]
print '[ (',join(') (',@shingles),') ]',"\n";

DESCRIPTION

The module provides a way to extract shingles from a piece of text. Shingles can then be used for other operations such as clustering, deduplication, etc.

Given a document, its w-shingles are the set of groups of w adjacent words in the text, with the words inside each group sorted. The parameter w is also called the width of the shingle. For instance, the sentence "a rose is a rose" contains the following shingles of width 2, or 2-shingles: (a is), (is rose) and (a rose). Although the shingle "a rose" occurs twice in the text, it appears only once in the set of shingles.
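The definition can be sketched in a few lines of plain Perl. This is only an illustration of the definition, not the code used by the module: the helper name sketch_shingles is hypothetical, tokens are obtained with a simple whitespace split, and neither sentence splitting nor normalization is applied.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::Shingle: returns the distinct
# w-shingles of a list of tokens, in order of first appearance.
sub sketch_shingles {
    my ($w, @tokens) = @_;
    my (%seen, @shingles);
    for my $i (0 .. @tokens - $w) {
        # Sort the w adjacent words so that their order inside the group is ignored.
        my $shingle = join ' ', sort @tokens[$i .. $i + $w - 1];
        # Keep each shingle only once: the result is a set.
        push @shingles, $shingle unless $seen{$shingle}++;
    }
    return @shingles;
}

my @shingles = sketch_shingles(2, split ' ', 'a rose is a rose');
# prints (a rose), (is rose), (a is)
print join(', ', map { "($_)" } @shingles), "\n";
```

The sketch lists the shingles in order of first appearance, which may differ from the order returned by the module; since the result is a set, the order carries no meaning.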

Since w-shingles are very close relatives of n-grams, this module is built on top of Text::NGrammer. It can therefore break the text into sentences before shingling, so that the shingles do not cross sentence boundaries. Moreover, the module can normalize the shingles so that tokens that look the same but are represented by different code points, e.g., composite accents vs. accented letters, collapse onto the same shingle. The normalization, enabled by default, is done through the module Unicode::Normalize and uses the NFKC normalization form (details in http://www.unicode.org/reports/tr15/).
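The effect of the normalization can be seen by calling Unicode::Normalize directly. The following sketch is independent of Text::Shingle and the two sample strings are chosen only for illustration: they render identically but use different code points, and compare equal only after NFKC.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKC);

my $precomposed = "caf\x{E9}";     # accented letter as the single code point U+00E9
my $decomposed  = "cafe\x{301}";   # "e" followed by U+0301 COMBINING ACUTE ACCENT

my $equal_before = $precomposed eq $decomposed ? 1 : 0;
my $equal_after  = NFKC($precomposed) eq NFKC($decomposed) ? 1 : 0;

print "equal before normalization: $equal_before\n";   # 0
print "equal after normalization:  $equal_after\n";    # 1
```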

The shingles in output are strings in which the tokens are joined by the space character U+0020, the common space character that is also available in the ASCII set. This choice has been made for two reasons: first, the shingles are usually used as tokens when computing distances, and plain strings make this a lot easier; second, breaking them back into their components only requires invoking split.
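As an illustration of both points, the sketch below treats shingle strings as hash keys to compute a set similarity and then splits one shingle back into its words. The Jaccard similarity is just one common choice, assumed here for the example; the helper name jaccard is hypothetical and the shingle lists are written by hand rather than produced by the module.

```perl
use strict;
use warnings;

# Hypothetical helper: Jaccard similarity (size of the intersection divided
# by size of the union) of two sets of shingle strings, passed as array refs.
sub jaccard {
    my ($first, $second) = @_;
    my %in_first  = map { $_ => 1 } @$first;
    my %in_second = map { $_ => 1 } @$second;
    my $common = grep { $in_second{$_} } keys %in_first;
    my $union  = keys(%in_first) + keys(%in_second) - $common;
    return $union ? $common / $union : 0;
}

# Hand-written shingle strings in the space-joined format.
my @doc_a = ('a is', 'is rose', 'a rose');
my @doc_b = ('a rose', 'is rose', 'is red', 'red rose');

my $similarity = jaccard(\@doc_a, \@doc_b);
print "similarity: $similarity\n";   # 2 shared shingles out of 5 distinct: 0.4

# Recover the single words of a shingle.
my @words = split / /, $doc_a[1];    # ('is', 'rose')
print join(',', @words), "\n";
```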

METHODS

new(%)

Creates a new Text::Shingle object and returns it. The accepted parameters are: w, the width of the shingles (default is 2); lang, the language to be passed to the tokenizer for the division into sentences (if no language is specified, English is assumed; the supported languages are the ones supported by Lingua::Sentence); norm, to specify whether the NFKC normalization has to be applied to the tokens or not (default is 1).

my $s = Text::Shingle->new ( lang => 'de', # German
                             w    => 3,    # width is 3
                             norm => 1,    # please normalize the tokens
                           );

my $t = Text::Shingle->new ( ); # defaults to English, width 2 and enables normalization

shingle_text($text)

Extracts from $text all the shingles of the width w given to the constructor. $text is first broken into sentences by the module Lingua::Sentence, so that the shingles do not cross sentence boundaries.

shingle_sentence($sentence)

Extracts from $sentence all the shingles of the width w given to the constructor. $sentence is not broken into sub-sentences but only into tokens representing single words.

shingle_array(@array)

Extracts from @array all the shingles of the width w given to the constructor. Exactly as in the case of shingle_sentence, the module Lingua::Sentence is not used.

HISTORY

0.01

Initial version of the module

0.02

Fixed dependencies

0.03

Fixed dependencies in Makefile.PL

0.04

Fixed bug in constructor

0.05

Fixed test

0.06

Updated dependency on Text::NGrammer to 0.06

AUTHOR

Francesco Nidito

COPYRIGHT

Copyright 2018 Francesco Nidito. All rights reserved.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

Unicode::Normalize, Lingua::Sentence, Text::NGrammer