NAME

Text::Ngram - Basis for n-gram analysis

SYNOPSIS

use Text::Ngram qw(ngram_counts add_to_counts);
my $text   = "abcdefghijklmnop";
my $hash_r = ngram_counts($text, 3); # Window size = 3
# $hash_r => { abc => 1, bcd => 1, ... }

add_to_counts($more_text, 3, $hash_r);

DESCRIPTION

N-gram analysis is a technique in textual analysis that uses sliding-window character sequences to aid topic analysis, language determination and so on. The n-gram spectrum of a document can be used to compare and filter documents in multiple languages, prepare word prediction networks, and perform spelling correction.

The neat thing about n-grams, though, is that they're really easy to determine. For n=3, for instance, we compute the n-gram counts like so:

the cat sat on the mat
---                     $counts{"the"}++;
 ---                    $counts{"he "}++;
  ---                   $counts{"e c"}++;
   ...
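The sliding window above can be sketched in a few lines of plain Perl. This is only an illustration of the counting idea, not the module's XS code; `sliding_counts` is a hypothetical helper name, and it does none of the cleaning (lowercasing, punctuation handling) that the configuration options described later control.

```perl
use strict;
use warnings;

# Pure-Perl illustration only; sliding_counts is a hypothetical helper,
# not part of Text::Ngram.
sub sliding_counts {
    my ($text, $n) = @_;
    my %counts;
    # Slide a window of $n characters across the text, one position at a time.
    # If the text is shorter than $n the range is empty and nothing is counted.
    for my $i (0 .. length($text) - $n) {
        $counts{ substr($text, $i, $n) }++;
    }
    return \%counts;
}

my $counts = sliding_counts("the cat sat on the mat", 3);
printf "%-5s %d\n", "'$_'", $counts->{$_} for sort keys %$counts;
```

A text of length L yields L - n + 1 windows, so the counts in the hash always sum to that number.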

This module provides an efficient XS-based implementation of n-gram spectrum analysis.

There are two functions which can be imported:

ngram_counts

This first function returns a hash reference with the n-gram histogram of the text for the given window size. The default window size is 5.

$href = ngram_counts(\%config, $text, $window_size);

The only necessary parameter is $text.

The possible values for \%config are:

flankbreaks

If set to 1 (default), breaks are flanked by spaces; if set to 0, they're not. Breaks are punctuation and other non-alphabetic characters, which, unless you use punctuation => 1 in your configuration, do not make it into the returned hash.

Here's an example, supposing you're using the default value for punctuation (0):

my $text = "Hello, world";
my $hash = ngram_counts($text, 5);

That produces the following ngrams:

{
  'hello' => 1,
  'ello ' => 1,
  ' worl' => 1,
  'world' => 1,
}

On the other hand, this:

my $text = "Hello, world";
my $hash = ngram_counts({flankbreaks => 0}, $text, 5);

Produces the following ngrams:

{
  'hello' => 1,
  ' worl' => 1,
  'world' => 1,
}
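One way to picture what flankbreaks does is the following pure-Perl model. It is a sketch under assumptions, not the module's implementation: `model_counts` is a hypothetical helper, the NUL character is an arbitrary choice of break marker, and the cleaning steps simply follow the documented defaults (lowercase => 1, punctuation => 0), so the keys come out lowercased.

```perl
use strict;
use warnings;

# Illustrative model only: model_counts is a hypothetical helper, and the
# real XS implementation may work differently.
sub model_counts {
    my ($text, $n, $flankbreaks) = @_;
    $text = lc $text;                          # models lowercase => 1
    $text =~ s/\s+/ /g;                        # collapse whitespace runs
    my $break = $flankbreaks ? " \0 " : "\0";  # "\0" marks a break
    $text =~ s/[^[:alpha:] ]+/$break/g;        # models punctuation => 0
    $text =~ tr/ //s;                          # squeeze repeated spaces
    my %counts;
    # Count n-grams within each segment, so no window spans a break.
    for my $segment (split /\0/, $text) {
        for my $i (0 .. length($segment) - $n) {
            $counts{ substr($segment, $i, $n) }++;
        }
    }
    return \%counts;
}

my $flanked   = model_counts("Hello, world", 5, 1);
my $unflanked = model_counts("Hello, world", 5, 0);
print join(", ", map { "'$_'" } sort keys %$flanked),   "\n";
print join(", ", map { "'$_'" } sort keys %$unflanked), "\n";
```

In this model the only difference between the two settings is whether the break marker carries a space on each side, which is what makes an n-gram such as 'ello ' appear or disappear.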

lowercase

If set to 0, casing is preserved. If set to 1, all letters are lowercased before counting ngrams. Default is 1.

# Get all ngrams of size 4 preserving case
$href_p = ngram_counts( {lowercase => 0}, $text, 4 );

punctuation

If set to 0, punctuation is removed before calculating the ngrams. Set to 1 to preserve it. Default is 0.

# Get all ngrams of size 2 preserving punctuation
$href_p = ngram_counts( {punctuation => 1}, $text, 2 );

spaces

If set to 0, no ngrams containing spaces will be returned. Default is 1.

# Get all ngrams of size 3 that do not contain spaces
$href = ngram_counts( {spaces => 0}, $text, 3);

If you're going to request both types of ngrams, then the best way to avoid calculating the same thing twice is probably this:

$href_with_spaces = ngram_counts($text, $window);   # $window is optional
$href_no_spaces   = { %$href_with_spaces };         # copy the hash, not just the reference
for (keys %$href_no_spaces) { delete $href_no_spaces->{$_} if / / }

Remember, the default configuration is:

{
  spaces      => 1,
  punctuation => 0,
  lowercase   => 1,
  flankbreaks => 1,
}

add_to_counts

This incrementally adds to the supplied hash; if $window is zero or undefined, then the window size is computed from the hash keys.

add_to_counts($more_text, $window, $href)
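The incremental behaviour, including the fallback to the key length when no window is given, can be modelled in plain Perl as follows. This is a hedged sketch: `add_counts_sketch` is a hypothetical stand-in, it skips all text cleaning, and dying on an empty hash is this sketch's own choice rather than documented module behaviour.

```perl
use strict;
use warnings;

# Hypothetical stand-in for illustration; not Text::Ngram's add_to_counts.
sub add_counts_sketch {
    my ($text, $window, $counts) = @_;
    # With no usable window size, infer it from the length of an existing key.
    if (!$window) {
        my ($any_key) = keys %$counts;
        die "cannot infer window size from an empty hash\n"
            unless defined $any_key;
        $window = length $any_key;
    }
    # Add this chunk's n-grams to the counts already in the hash.
    for my $i (0 .. length($text) - $window) {
        $counts->{ substr($text, $i, $window) }++;
    }
    return $counts;
}

my %counts;
add_counts_sketch("abcabc", 3,     \%counts);  # first chunk fixes the key length
add_counts_sketch("bcab",   undef, \%counts);  # window inferred from the keys
print "$_ => $counts{$_}\n" for sort keys %counts;
```

Note that n-grams straddling the boundary between two chunks are never counted in this model, since each call only sees its own text.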

TO DO

Look further into the tests. Sort them and add more.

AUTHOR

Maintained by Jose Castro, cog@cpan.org.

Originally created by Simon Cozens, simon@cpan.org.

COPYRIGHT AND LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

