NAME

Set::Similarity - similarity measures for sets

SYNOPSIS

use Set::Similarity::Dice;

# object method
my $dice = Set::Similarity::Dice->new;
my $similarity = $dice->similarity('Photographer','Fotograf');

# class method
my $dice = 'Set::Similarity::Dice';
my $similarity = $dice->similarity('Photographer','Fotograf');

# from 2-grams
my $width = 2;
my $similarity = $dice->similarity('Photographer','Fotograf',$width);

# from arrayref of tokens
my $similarity = $dice->similarity(['a','b'],['b']);

# from hashref of features
my $bird = {
  wings    => 1,
  eyes     => 1,
  feathers => 1,
  hairs    => 0,
  legs     => 1,
  arms     => 0,
};
my $mammal = {
  wings    => 0,
  eyes     => 1,
  feathers => 0,
  hairs    => 1,
  legs     => 1,
  arms     => 1,
};
my $similarity = $dice->similarity($bird,$mammal);

# from arrayref sets
my $bird = [qw(
  wings
  eyes
  feathers
  legs
)];
my $mammal = [qw(
  eyes
  hairs
  legs
  arms
)];
my $similarity = $dice->from_sets($bird,$mammal);

DESCRIPTION

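Set::Similarity is the base class. It mainly provides helper and convenience methods; the similarity measures themselves are implemented in child modules such as Set::Similarity::Dice.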

Overlap coefficient

The overlap coefficient is the size of the intersection divided by the size of the smaller set:

|A intersect B| / min(|A|, |B|)
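As a minimal sketch of the formula above, the following computes the overlap coefficient for two arrayrefs in plain Perl. The function name overlap_coefficient is hypothetical and not part of Set::Similarity, and returning 0 when either set is empty is an assumption made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Set::Similarity:
# overlap coefficient of two arrayrefs treated as sets.
sub overlap_coefficient {
    my ($set_a, $set_b) = @_;
    my %in_a = map { $_ => 1 } @$set_a;   # deduplicate set A
    my %in_b = map { $_ => 1 } @$set_b;   # deduplicate set B
    my $common  = grep { $in_b{$_} } keys %in_a;   # size of the intersection
    my $size_a  = scalar keys %in_a;
    my $size_b  = scalar keys %in_b;
    my $smaller = $size_a < $size_b ? $size_a : $size_b;
    return $smaller ? $common / $smaller : 0;   # assumption: 0 if a set is empty
}

my $overlap = overlap_coefficient([qw(a b c)], [qw(b c)]);
print "$overlap\n";   # 2 / min(3, 2) = 1
```

A value of 1 means that the smaller set is fully contained in the larger one.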

Jaccard Index

The Jaccard index measures the similarity between two sample sets. It is defined as the size of the intersection divided by the size of the union of the sets:

|A intersect B| / |A union B|

The Tanimoto coefficient is the ratio of the number of features common to both sets to the total number of distinct features in either set, i.e.

|A intersect B| / ( |A| + |B| - |A intersect B| ) # the same as Jaccard

The range is 0 to 1 inclusive.
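The Tanimoto form of the formula can be sketched in plain Perl as follows, using the bird and mammal feature lists from the SYNOPSIS. The names set_sizes and jaccard_index are hypothetical helpers for this illustration, not methods of the module, and returning 0 for two empty sets is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (|A intersect B|, |A|, |B|) for two arrayrefs.
sub set_sizes {
    my ($set_a, $set_b) = @_;
    my %in_a = map { $_ => 1 } @$set_a;
    my %in_b = map { $_ => 1 } @$set_b;
    my $common = grep { $in_b{$_} } keys %in_a;   # count of shared elements
    return ($common, scalar(keys %in_a), scalar(keys %in_b));
}

# Hypothetical helper: Jaccard index via |A| + |B| - |A intersect B|.
sub jaccard_index {
    my ($common, $size_a, $size_b) = set_sizes(@_);
    my $union = $size_a + $size_b - $common;      # size of the union
    return $union ? $common / $union : 0;         # assumption: 0 for two empty sets
}

my @bird   = qw(wings eyes feathers legs);
my @mammal = qw(eyes hairs legs arms);
printf "%.4f\n", jaccard_index(\@bird, \@mammal);   # 2 / (4 + 4 - 2) = 0.3333
```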

Dice coefficient

The Dice coefficient is the number of features common to both sets relative to the average size of the two sets, i.e.

|A intersect B| / ( 0.5 * ( |A| + |B| ) ) # the same as Sørensen

The factor 0.5 in the denominator averages the two set sizes; the formula is equivalent to 2 * |A intersect B| / ( |A| + |B| ). The range is 0 to 1 inclusive.
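A minimal plain-Perl sketch of the Dice formula, applied to the token lists ['a','b'] and ['b'] from the SYNOPSIS. The function name dice_coefficient is hypothetical and does not claim to match the implementation in Set::Similarity::Dice; returning 0 for two empty sets is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's implementation:
# Dice coefficient of two arrayrefs treated as sets.
sub dice_coefficient {
    my ($set_a, $set_b) = @_;
    my %in_a = map { $_ => 1 } @$set_a;
    my %in_b = map { $_ => 1 } @$set_b;
    my $common   = grep { $in_b{$_} } keys %in_a;             # |A intersect B|
    my $combined = scalar(keys %in_a) + scalar(keys %in_b);   # |A| + |B|
    return $combined ? $common / (0.5 * $combined) : 0;       # assumption: 0 for two empty sets
}

printf "%.4f\n", dice_coefficient(['a', 'b'], ['b']);   # 1 / (0.5 * 3) = 0.6667
```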

METHODS

All methods can be used as class or object methods.

new

$object = Set::Similarity->new();

similarity

my $similarity = $object->similarity($any1,$any2,$width);

$any1 and $any2 can each be an arrayref, a hashref or a string. Strings are tokenized into n-grams of width $width.

$width must be an integer; it defaults to 1.

from_tokens

my $similarity = $object->from_tokens(['a','b'],['b']);

from_sets

my $similarity = $object->from_sets(['a'],['b']);

Croaks if called directly. This method should be implemented in a child module.

intersection

my $intersection_size = $object->intersection(['a'],['b']);

uniq

my @uniq = $object->uniq(['a','b']);

Transforms an arrayref of strings into an array of unique elements.

combined_length

my $set_size_sum = $object->combined_length(['a'],['b']);

min

my $min_set_size = $object->min(['a'],['b']);

ngrams

my @monograms = $object->ngrams('abc');
my @bigrams = $object->ngrams('abc',2);
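For illustration, character n-grams of a string can be produced with substr as in the sketch below. The function char_ngrams is a hypothetical stand-in and not the module's ngrams method; in particular, returning an empty list for a string shorter than the width is an assumption, and the module may behave differently in that case.

```perl
use strict;
use warnings;

# Hypothetical stand-in for illustration, not the module's ngrams method:
# split a string into overlapping character n-grams.
sub char_ngrams {
    my ($word, $width) = @_;
    # fall back to width 1 unless a positive integer was given
    $width = 1 unless defined $width && $width =~ /^[1-9][0-9]*$/;
    # assumption: the range is empty, and so is the result,
    # when the string is shorter than the width
    return map { substr($word, $_, $width) } 0 .. length($word) - $width;
}

my @monograms = char_ngrams('abc');      # ('a', 'b', 'c')
my @bigrams   = char_ngrams('abc', 2);   # ('ab', 'bc')
print "@monograms\n";
print "@bigrams\n";
```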

_any

my $arrayref = $object->_any($any,$width);

SOURCE REPOSITORY

http://github.com/wollmers/Set-Similarity

AUTHOR

Helmut Wollmersdorfer, <helmut@wollmersdorfer.at>

COPYRIGHT AND LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
