NAME
Text::NGrammer - Pure Perl extraction of n-grams and skip-grams
SYNOPSIS
use Text::NGrammer;
my $n = Text::NGrammer->new;
# prints [ (a,rose) (rose,is) (is,a) (a,flower) ]
my @ngrams = $n->ngrams_text(2, "a rose is a flower");
print "[ ";
for my $ngram (@ngrams) {
print "(",$ngram->[0],",",$ngram->[1],") ";
}
print "]\n";
# prints [ (a,is) (rose,a) (is,flower) ]
my @skipgrams = $n->skipgrams_text(2, 1, "a rose is a flower");
print "[ ";
for my $skipgram (@skipgrams) {
print "(",$skipgram->[0],",",$skipgram->[1],") ";
}
print "]\n";
DESCRIPTION
The module provides a way to extract both n-grams and skip-grams from a text, a sentence or an array of tokens.
An n-gram is defined as an ordered sequence of n consecutive tokens in a piece of text. Some frequent n-grams have their own names: 2-grams, for example, are also called bigrams, and they represent all the ordered pairs of adjacent words in a text. For instance, the text "a rose is a flower" is composed of 4 bigrams: "a rose", "rose is", "is a", "a flower".
A skip-gram is defined as an ordered sequence of n tokens from a text in which consecutive tokens are separated by a predetermined interval of k tokens. For instance, the skip-grams with n=2 and k=1 for a piece of text are all the pairs of tokens with exactly one token skipped between them. The text "a rose is a flower" yields 3 skip-grams with n=2 and k=1: "a is", "rose a", "is flower". A skip-gram with k=0 is the same as an n-gram of the same size, e.g., a 2-skip-gram with k=0 is the same as a bigram.
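The definition above can be sketched in a few lines of plain Perl. This is only an illustration of the definition, not the module's actual implementation: the helper name skipgrams_sketch is hypothetical, tokens are assumed to be obtained by splitting on whitespace, and for n greater than 2 the interval k is assumed to apply between every pair of consecutive tokens.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of Text::NGrammer: returns the skip-grams of
# length $n with interval $k as a list of array references.
sub skipgrams_sketch {
    my ($n, $k, @tokens) = @_;
    return () if $n < 1 || $k < 0;

    my $step = $k + 1;                  # distance between two picked tokens
    my $span = ($n - 1) * $step + 1;    # tokens covered by one skip-gram

    my @grams;
    # The range is empty when there are fewer than $span tokens.
    for my $start (0 .. @tokens - $span) {
        push @grams, [ map { $tokens[$start + $_ * $step] } 0 .. $n - 1 ];
    }
    return @grams;
}

# Assumption: whitespace tokenization is enough for this example.
my @tokens    = split ' ', "a rose is a flower";
my @skipgrams = skipgrams_sketch(2, 1, @tokens);
my @bigrams   = skipgrams_sketch(2, 0, @tokens);   # k=0 gives plain bigrams

# prints (a,is) (rose,a) (is,flower)
print join(" ", map { "(" . join(",", @$_) . ")" } @skipgrams), "\n";
# prints (a,rose) (rose,is) (is,a) (a,flower)
print join(" ", map { "(" . join(",", @$_) . ")" } @bigrams), "\n";
```

Each skip-gram covers (n-1)*(k+1)+1 tokens, so a sequence shorter than that yields an empty list.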
A broader, and better, discussion on n-grams and skip-grams can be found at https://en.wikipedia.org/wiki/N-gram.
Behind the scenes, the module uses the Lingua::Sentence module to split the text into sentences, so that the n-grams and skip-grams never cross sentence boundaries. The module also provides ways to extract the n-grams and skip-grams from a single sentence, i.e., without invoking Lingua::Sentence, or from an array of tokens if the application wants to use a custom tokenization for the text. The language to be used for the sentencer must be specified in the constructor; if not present, English is used by default.
All the methods return the n-grams and skip-grams as arrays of references to arrays of length n, where n is specified as a parameter of the method. Sentences or, more generally, pieces of text are not divided into n-grams or skip-grams if they are not long enough to perform the operation. For instance, asking for all the n-grams of length 4 for the sentence "I am Francesco" returns an empty array of 4-grams because there are only 3 tokens in the sentence.
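The effect of the sentence boundaries can be illustrated with a small self-contained sketch. It does not use Lingua::Sentence: the helpers naive_sentences and bigrams_of_sentence are hypothetical stand-ins, the splitter simply breaks the text after ".", "!" or "?", and the tokenization (whitespace split after stripping the final punctuation) is an assumption that may differ from what the module does.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a real sentence splitter: breaks the text at
# whitespace that follows ".", "!" or "?".
sub naive_sentences {
    my ($text) = @_;
    return grep { length } split /(?<=[.!?])\s+/, $text;
}

# Hypothetical helper: bigrams of a single sentence. Tokens are obtained by
# a whitespace split after removing the trailing punctuation (an assumption).
sub bigrams_of_sentence {
    my ($sentence) = @_;
    (my $clean = $sentence) =~ s/[.!?]+$//;
    my @tokens = split ' ', $clean;
    return map { [ @tokens[ $_, $_ + 1 ] ] } 0 .. $#tokens - 1;
}

my $text    = "A rose is a flower. I am Francesco.";
my @bigrams = map { bigrams_of_sentence($_) } naive_sentences($text);

# prints (A,rose) (rose,is) (is,a) (a,flower) (I,am) (am,Francesco)
print join(" ", map { "(" . join(",", @$_) . ")" } @bigrams), "\n";
```

Because the bigrams are computed one sentence at a time, the pair (flower,I) that straddles the two sentences is never produced.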
my $ngrammer = Text::NGrammer->new();
my @ngrams = $ngrammer->ngrams_array(3, ("a", "b", "c", "d"));
my $ngram = $ngrams[0]; # the first ngram
print $ngram->[1]; # prints "b"
my @empty = $ngrammer->ngrams_array(5, ("a", "b", "c", "d"));
print "empty!" if (@empty == 0); # prints "empty!"
METHODS
- new(%)
  Creates a new Text::NGrammer object and returns it. The only parameter accepted by the constructor is the language for the sentencer. For instance, to create an NGrammer for German the syntax is the following:
  my $german_ngrammer = Text::NGrammer->new(lang => 'de');
  If no language is specified, English is assumed. The supported languages are the ones supported by Lingua::Sentence.
- skipgrams_text($n, $k, $text)
  Extracts all the skip-grams of length $n with interval $k from the $text. $text is broken into sentences by the module Lingua::Sentence in such a way that the skip-grams do not cross the sentence boundaries.
- skipgrams_sentence($n, $k, $sentence)
  Extracts all the skip-grams of length $n with interval $k from the $sentence. $sentence is not broken into sub-sentences but only into tokens representing single words.
- skipgrams_array($n, $k, @array)
  Extracts all the skip-grams of length $n with interval $k from the @array. Exactly as in the case of skipgrams_sentence, the module Lingua::Sentence is not used.
- ngrams_text($n, $text)
  Extracts all the n-grams of length $n from the $text. $text is broken into sentences by the module Lingua::Sentence in such a way that the n-grams do not cross the sentence boundaries. This is equivalent to skipgrams_text($n, 0, $text).
- ngrams_sentence($n, $sentence)
  Extracts all the n-grams of length $n from the $sentence. $sentence is not broken into sub-sentences but only into tokens representing single words. This is equivalent to skipgrams_sentence($n, 0, $sentence).
- ngrams_array($n, @array)
  Extracts all the n-grams of length $n from the @array. Exactly as in the case of ngrams_sentence, the module Lingua::Sentence is not used. This is equivalent to skipgrams_array($n, 0, @array).
HISTORY
AUTHOR
Francesco Nidito
COPYRIGHT
Copyright 2018 Francesco Nidito. All rights reserved.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.