NAME

Text::Soundex - Implementation of the Soundex Algorithm as Described by Knuth

SYNOPSIS

use Text::Soundex 'soundex';

$code = soundex($name);    # Get the soundex code for a name.
@codes = soundex(@names);  # Get the list of codes for a list of names.

# Redefine the value that soundex() will return if the input string
# contains no identifiable sounds within it.
$Text::Soundex::nocode = 'Z000';

DESCRIPTION

This module implements the soundex algorithm as described by Donald Knuth in Volume 3 of The Art of Computer Programming. The algorithm is intended to hash words (in particular surnames) into a small space using a simple model which approximates the sound of the word when spoken by an English speaker. Each word is reduced to a four-character string: an uppercase letter followed by three digits.
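The reduction can be sketched in pure Perl as shown here. This is an illustration only, not the module's own code: sketch_soundex is a hypothetical name, the sketch handles ASCII letters only, and it assumes that input without any letters should yield undef.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration; not part of Text::Soundex.
sub sketch_soundex {
    my ($word) = @_;

    # Keep ASCII letters only, in upper case.
    (my $code = uc $word) =~ tr/A-Z//cd;
    return undef unless length $code;   # assumption: no letters, no code

    # The first letter is kept verbatim in the result.
    my $first = substr($code, 0, 1);

    # Map each letter to its digit; vowels plus H, W and Y map to 0.
    # The /s modifier squeezes a run of identical digits into one.
    $code =~ tr/AEHIOUWYBFPVCGJKQSXZDTLMNR/00000000111122222222334556/s;

    # Drop the digit that stands for the first letter, then the 0 separators.
    ($code = substr($code, 1)) =~ tr/0//d;

    # Pad with zeros and truncate to one letter plus three digits.
    return substr($first . $code . '000', 0, 4);
}

# Prints K530, L300 and A226, one per line.
print sketch_soundex($_), "\n" for qw(Knuth Lloyd Ashcraft);
```

The zeros act purely as separators: they stop two equal digits from being squeezed together and are discarded before the result is padded.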

The value returned for strings which have no soundex encoding is defined using $Text::Soundex::nocode. The default value is undef; however, values such as 'Z000' are commonly used alternatives.

For backward compatibility with older versions of this module, the $Text::Soundex::nocode variable is exported into the caller's namespace as $soundex_nocode.
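A short usage sketch of the fallback value follows. It assumes Text::Soundex is installed and that a string containing no letters, such as '1234', counts as having no soundex encoding.

```perl
use strict;
use warnings;
use Text::Soundex 'soundex';

# A string without any letters is assumed to have no soundex encoding.
my $default = soundex('1234');
print defined $default ? $default : 'undef', "\n";

# Choose a placeholder code to be returned instead of undef.
$Text::Soundex::nocode = 'Z000';
my $placeholder = soundex('1234');
print "$placeholder\n";
```

A placeholder such as 'Z000' keeps every result a four-character string, which can be convenient when the codes are used as hash keys or database column values.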

In scalar context, soundex() returns the soundex code of its first argument. In list context, a list is returned in which each element is the soundex code for the corresponding argument passed to soundex(). For example, the following code assigns @codes the value ('M200', 'S320'):

@codes = soundex qw(Mike Stok);

To use Text::Soundex to generate codes that can be used to search one of the publicly available US Censuses, a variant of the soundex() subroutine must be used:

use Text::Soundex 'soundex_nara';
$code = soundex_nara($name);

The algorithm used by the US Censuses is slightly different from that defined by Knuth and others. The discrepancy shows up in names such as "Ashcraft":

use Text::Soundex qw(soundex soundex_nara);
print soundex("Ashcraft"), "\n";       # prints: A226
print soundex_nara("Ashcraft"), "\n";  # prints: A261
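The two results differ because of how H and W are treated: in the census variant, two letters with the same code that are separated only by H or W are coded once, whereas in soundex() an H or W separates them just as a vowel would. The sketch below illustrates that rule in pure Perl. The name sketch_code and its $hw_joins flag are hypothetical, and the code is not the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration; not part of Text::Soundex.
# A true $hw_joins selects the census-style treatment of H and W.
sub sketch_code {
    my ($word, $hw_joins) = @_;

    # Keep ASCII letters only, in upper case.
    (my $code = uc $word) =~ tr/A-Z//cd;
    return undef unless length $code;
    my $first = substr($code, 0, 1);

    # Vowels and Y map to 0; H and W map to 9 so they can be told apart.
    # The /s modifier squeezes a run of identical digits into one.
    $code =~ tr/AEIOUYHWBFPVCGJKQSXZDTLMNR/00000099111122222222334556/s;

    # Census-style rule: equal digits separated only by H or W merge.
    $code =~ s/(.)9\1/$1/g if $hw_joins;

    # Drop the first letter's digit and all separators, then pad and truncate.
    ($code = substr($code, 1)) =~ tr/09//d;
    return substr($first . $code . '000', 0, 4);
}

print sketch_code('Ashcraft', 0), "\n";   # prints: A226
print sketch_code('Ashcraft', 1), "\n";   # prints: A261
```

In "Ashcraft" the S and C both map to 2 and are separated only by H, so the census-style rule codes them once and leaves room for the F.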

EXAMPLES

Knuth's examples of various names and the soundex codes they map to are listed below:

Euler, Ellery -> E460
Gauss, Ghosh -> G200
Hilbert, Heilbronn -> H416
Knuth, Kant -> K530
Lloyd, Ladd -> L300
Lukasiewicz, Lissajous -> L222

so:

$code = soundex 'Knuth';         # $code contains 'K530'
@list = soundex qw(Lloyd Gauss); # @list contains 'L300', 'G200'

LIMITATIONS

As the soundex algorithm was originally used a long time ago in the US, it considers only the English alphabet and pronunciation. In particular, Unicode letters may be ignored, or considered to be sound breaks.

Since the soundex algorithm maps a large space (strings of arbitrary length) onto a small space (single letter plus 3 digits) no inference can be made about the similarity of two strings which end up with the same soundex code. For example, both Hilbert and Heilbronn end up with a soundex code of H416.
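One way to work within this limitation is to treat the code as a bucket key only, and to compare the names inside a bucket by some other means. The following sketch assumes Text::Soundex is installed; the %bucket hash and the list of names are illustrative.

```perl
use strict;
use warnings;
use Text::Soundex 'soundex';

# Bucket names by soundex code. Names that share a bucket are only
# candidates for a match and still need a closer comparison.
my %bucket;
for my $name (qw(Hilbert Heilbronn Knuth Kant Lloyd)) {
    push @{ $bucket{ soundex($name) } }, $name;
}

# List each code followed by the names that hash to it.
for my $code (sort keys %bucket) {
    print "$code: @{ $bucket{$code} }\n";
}
```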

MAINTAINER

This module is currently maintained by Mark Mielke (mark@mielke.cc).

HISTORY

Version 3 is a significant update to provide support for versions of Perl later than Perl 5.004. Specifically, the XS version of the soundex() subroutine understands strings that are encoded using UTF-8 (Unicode strings).

Version 2 of this module was a rewrite by Mark Mielke (mark@mielke.cc) to improve the speed of the subroutines. The XS version of the soundex() subroutine was introduced in 2.00.

Version 1 of this module was written by Mike Stok (mike@stok.co.uk) and was included in the Perl core library set.

Dave Carlsen (dcarlsen@csranet.com) made the request for the NARA algorithm to be included. The NARA soundex page can be viewed at: http://www.nara.gov/genealogy/soundex/soundex.html

Ian Phillips (ian@pipex.net) and Rich Pinder (rpinder@hsc.usc.edu) supplied ideas and spotted mistakes for v1.x.