NAME
Text::GenderFromName - Guess the gender of an American first name.
SYNOPSIS
use Text::GenderFromName;
print gender("Jon"); # prints 'm'
See EXAMPLES for additional uses.
DESCRIPTION
This module provides gender(), which takes a name and returns one of three values: 'm' for male, 'f' for female, or undef for unknown.
CHANGES
Version 0.30 is a significant departure from previous versions. By default, version 0.30 uses the U.S. Social Security Administration's "Most Popular Names of the 1980's" list of 1001 male first names and 1013 female first names. See CAVEATS below for details on this list.
Version 0.30 also allows for arbitrary female and male hashed lists to be provided at run-time, and includes several built-ins to provide matches based on exclusivity, weight, metaphones, and both version 0.20 and version 0.10 regexp-style matching. The user can also specify additional match subroutines and change the match order at run-time.
EXPORT
The single exported function is:
- gender ($name [, $looseness])
Returns one of three values: 'm' for male, 'f' for female, or undef for unknown.
gender() also accepts a "looseness" level: the higher the looseness value, the broader the match. See THE MATCH LIST below for details.
NON-EXPORT
The non-exported matching subs are:
- one_only ($name)
Returns 'm' or 'f' if and only if $name is found in only one of the two lists.
- either_weight ($name)
Returns 'm' or 'f' if $name is found in either list. If $name is in both lists, it returns the more heavily weighted of the two.
- one_only_metaphone ($name)
Uses Text::DoubleMetaphone for comparison. Returns 'm' or 'f' if and only if the metaphone for $name is found in only one of the two lists.
Note that this function builds a copy of the female/male name lists to speed up the metaphone lookup.
- either_weight_metaphone ($name)
Uses Text::DoubleMetaphone for comparison. Returns 'm' or 'f' if the metaphone for $name is found in either list. If it is found in both lists, the weights of all matching metaphones are summed for each list, and the gender with the larger sum is returned.
Note that this function builds a copy of the female/male name lists to speed up the metaphone lookup.
- v2_rules ($name)
Uses Jon Orwant's v0.20 rules for matching.
- v1_rules ($name)
Uses Jon Orwant's adaptation of Scott Pakin's awk script from v0.10 for matching.
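The difference between the two list-based rules above can be sketched in plain Perl. This is a minimal illustration under stated assumptions, not the module's actual code: the sub names one_only_sketch and either_weight_sketch, the toy hashes, and the choice to return undef on an exact tie are all hypothetical.

```perl
use strict;
use warnings;

# Toy data (hypothetical): lowercase names mapped to relative weights.
my %females = (josephine => 0.08, jordan => 0.10);
my %males   = (michael   => 3.41, jordan => 0.45);

# Strict rule: answer only when the name is in exactly one list.
sub one_only_sketch {
    my $name = lc shift;
    my $in_f = exists $females{$name};
    my $in_m = exists $males{$name};
    return 'f' if $in_f && !$in_m;
    return 'm' if $in_m && !$in_f;
    return;    # in both lists or in neither: no answer
}

# Looser rule: answer when the name is in either list; if it is in
# both, the larger weight wins.
sub either_weight_sketch {
    my $name = lc shift;
    my $f = $females{$name} // 0;
    my $m = $males{$name}   // 0;
    return if $f == $m;    # absent from both, or an exact tie (assumed)
    return $f > $m ? 'f' : 'm';
}

for my $name (qw(Josephine Jordan Zork)) {
    printf "%-10s one_only=%s either_weight=%s\n", $name,
        one_only_sketch($name)      // 'undef',
        either_weight_sketch($name) // 'undef';
}
```

A name found in both toy lists ("jordan") gets no answer from the strict rule but is resolved by weight under the looser one.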
If you wish to use your own hash refs containing names and weights, you should explicitly import:
- gender_init ($female_names_ref, $male_names_ref)
Initializes the male and female hashes. This package calls gender_init() internally: without arguments it uses the table provided by the U.S. Social Security Administration. Don't call this function unless you want to override the supplied lists. See THE FEMALE/MALE HASHES below for details.
THE MATCH LIST
@MATCH_LIST contains the list of subs that gender() uses to determine the gender of a given name.
By default, there are 6 items in @MATCH_LIST, corresponding to the non-exported functions above. Strictly matching subs should go first and loosely matching subs last, because gender() iterates over the list from index 0 up to the specified looseness value or the number of subs in @MATCH_LIST, whichever comes first.
You may override this like so:
@Text::GenderFromName::MATCH_LIST = ('main::my_matching_routine');
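The interaction between looseness and the match list can be sketched as follows. This is a simplified model, not the module's implementation: gender_sketch, strict_rule, and loose_rule are hypothetical names, and the default looseness of 1 is an assumption inferred from the strict example in EXAMPLES, where only the first sub appears to run.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for matching subs, strictest first.
sub strict_rule { my $n = shift; return $n eq 'josephine' ? 'f' : undef }
sub loose_rule  { my $n = shift; return $n =~ /a$/ ? 'f' : 'm' }

# Subs are listed by fully qualified name, as in @MATCH_LIST.
my @match_list = ('main::strict_rule', 'main::loose_rule');

# Try the first $looseness subs in order; stop at the first answer.
sub gender_sketch {
    my ($name, $looseness) = @_;
    $looseness = 1 unless defined $looseness;    # assumed default
    $name = lc $name;

    # Never run past the end of the list.
    my $limit = $looseness < @match_list ? $looseness : scalar @match_list;
    for my $i (0 .. $limit - 1) {
        my $sub   = \&{ $match_list[$i] };       # look the sub up by name
        my $guess = $sub->($name);
        return $guess if defined $guess;
    }
    return;    # unknown
}

for my $case (['Josephine'], ['Michael'], ['Michael', 2], ['Anna', 9]) {
    my $guess = gender_sketch(@$case);
    printf "%-18s => %s\n", join(', ', @$case), $guess // 'undef';
}
```

Because the loop stops at the first defined answer, placing a permissive sub early in the list would shadow every stricter sub after it.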
THE FEMALE/MALE HASHES
By default, these hashes are built using data from the U.S. SSA. You may override them by calling gender_init() with your own female and male hash refs, like so:
use Text::GenderFromName qw( :DEFAULT &gender_init );
my %females = ('barbly' => 4.1, 'bar' => 2.3, ...);
my %males = ('foobly' => 4.5, 'foo' => 1.3, ...);
&gender_init(\%females, \%males);
The hash keys are lowercase names, and their values are their relative weights. This allows for names that could be male or female, but are more often one or the other.
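If your own data consists of raw counts rather than weights, one way to prepare it is sketched below. The helper counts_to_weights and the sample counts are hypothetical, and expressing each weight as a percentage of its list's total is an assumption; the module only requires lowercase keys and comparable relative weights.

```perl
use strict;
use warnings;

# Hypothetical raw counts; real data would come from your own source.
my %female_counts = (Jordan => 30, Ashley  => 170);
my %male_counts   = (Jordan => 90, Michael => 310);

# Convert counts to lowercase-keyed weights, each a percentage of
# that list's total, so the two lists are comparable.
sub counts_to_weights {
    my ($counts) = @_;
    my $total = 0;
    $total += $_ for values %$counts;
    return {} unless $total;

    my %weights;
    $weights{ lc $_ } += 100 * $counts->{$_} / $total for keys %$counts;
    return \%weights;
}

my $females = counts_to_weights(\%female_counts);
my $males   = counts_to_weights(\%male_counts);

printf "jordan: F %.1f, M %.1f\n", $females->{jordan}, $males->{jordan};
```

The resulting hash refs have the shape gender_init() expects and could be passed as gender_init($females, $males).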
EXAMPLES
Very strict usage:
use Text::GenderFromName;
my @names = ('Josephine', 'Michael', 'Dondi', 'Jonny',
             'Pascal', 'Velvet', 'Eamon', 'FLKMLKSJN');
for (@names) {
    # Use strict matching
    my $gender = &gender($_) || '';
    if ($gender eq 'f') { print "$_: Female\n" }
    elsif ($gender eq 'm') { print "$_: Male\n" }
    else { print "$_: UNSURE\n" }
}
returns:
Josephine: Female
Michael: UNSURE
Dondi: UNSURE
Jonny: UNSURE
Pascal: UNSURE
Velvet: UNSURE
Eamon: UNSURE
FLKMLKSJN: UNSURE
Loose matching:
for (@names) {
    # Use loose matching
    my $gender = &gender($_, 9) || '';
    ...
returns:
Josephine: Female
Michael: Male
Dondi: Male
Jonny: Male
Pascal: Male
Velvet: Female
Eamon: UNSURE
FLKMLKSJN: UNSURE
Turn on debugging:
$Text::GenderFromName::DEBUG = 1;
returns:
Matching "josephine":
one_only...
==> HIT (f)
Matching "michael":
one_only...
either_weight...
F: 0.0271266376105491, M: 3.4091409099979
==> HIT (m)
Matching "dondi":
one_only...
either_weight...
one_only_metaphone...
M: dondi => dante => TNT: 0.020568
==> HIT (m)
Matching "jonny":
one_only...
either_weight...
one_only_metaphone...
F: jonny => jenna => JN: 0.193945
M: jonny => john => JN: 1.629871
either_weight_metaphone...
F: jonny => jenna => JN: 0.193945
F: jonny => joanna => JN: 0.118652
F: jonny => jenny => JN: 0.104875
...
M: jonny => john => JN: 1.629871
M: jonny => juan => JN: 0.309234
M: jonny => johnny => JN: 0.127193
...
==> HIT (m)
Matching "pascal":
one_only...
either_weight...
one_only_metaphone...
either_weight_metaphone...
v2_rules...
==> HIT (m)
Matching "velvet":
one_only...
either_weight...
one_only_metaphone...
either_weight_metaphone...
v2_rules...
v1_rules...
==> HIT (f)
Matching "eamon":
one_only...
either_weight...
one_only_metaphone...
either_weight_metaphone...
v2_rules...
v1_rules...
Matching "flkmlksjn":
one_only...
either_weight...
one_only_metaphone...
either_weight_metaphone...
v2_rules...
v1_rules...
Josephine: Female
Michael: Male
Dondi: Male
Jonny: Male
Pascal: Male
Velvet: Female
Eamon: UNSURE
FLKMLKSJN: UNSURE
Add your own match sub:
push @Text::GenderFromName::MATCH_LIST, 'main::eamon_hack';
sub eamon_hack {
    my $name = shift;
    return 'm' if $name =~ /^eamon/;
    return;    # no match: return nothing rather than a false value
}
returns:
...
Matching "eamon":
one_only...
either_weight...
one_only_metaphone...
either_weight_metaphone...
v2_rules...
v1_rules...
main::eamon_hack...
==> HIT (m)
Eamon: Male
Don't use metaphones:
@Text::GenderFromName::MATCH_LIST =
grep !/metaphone/, @Text::GenderFromName::MATCH_LIST;
Use your own female/male hash lists:
use Text::GenderFromName qw( :DEFAULT &gender_init );
my %females = ('josephine' => 2.1);
my %males = ('dondi' => 4.5);
&gender_init(\%females, \%males);
Use female/male hash lists from a database:
use Text::GenderFromName qw( :DEFAULT &gender_init );
use Tie::RDBM;
tie my %females, 'Tie::RDBM', {db    => 'mysql:common',
                               table => 'females',
                               key   => 'name',
                               value => 'weight'};
tie my %males, 'Tie::RDBM', {db    => 'mysql:common',
                             table => 'males',
                             key   => 'name',
                             value => 'weight'};
&gender_init(\%females, \%males);
COMPATIBILITY
To run v0.30 in a (mostly) backward compatible mode, override the MATCH_LIST like so:
@Text::GenderFromName::MATCH_LIST = ('v2_rules', 'v1_rules');
and set the looseness to any value greater than 1:
&gender($_, 9);
Note that v0.30 uses significantly different lists than before. If you'd like to use the v0.20 name lists, you may download a previous version of Text::GenderFromName, cut out the hashes, and pass them to gender_init(). To keep this module small, those lists are not included here.
CAVEATS
REGARDING THIS MODULE
Rules are now case-insensitive, which is a departure from earlier versions of this module. Also, Orwant's v0.20 rules no longer fall through, though v0.10's do.
Version 0.30 was a complete overhaul by someone who's never submitted a module to CPAN before. Please consider this fact when using the Text::GenderFromName module in a production environment.
Also note that the matching routines in this module are strongly biased toward American first names. None of the methods included in this module correctly identifies the v0.30 author's gender (m) from his first name (Eamon).
REGARDING THE DEFAULT LIST
From http://www.ssa.gov/OACT/babynames/1999/top1000of80s.html:
"The data comes from a 5% sampling of Social Security card applications with dates of birth from January 1980 through December 1989."
"All names which occurred at least five times in the sample are included in the table below. The total number of males in the sample is 977,255 and the total number of females is 936,349. Criteria to be included in the sample is simply that a Social Security card application was filed, that the year of birth was between 1980 and 1989, and that the birth was on US soil. As always each unique spelling is considered a unique name. It may be appropriate for purposes of ranking popularity of names to combine similar spellings of the same name. This kind of grouping, however, is subjective and time consuming, and is beyond the scope of this document. The 2000 edition of the World Almanac lists the top 10 names of each decade based on this data after combining different spellings of the same name."
"No effort has been made to edit the data and as a result some coding errors are obvious. For example initials like "A" are included in the lists. Another common problem, especially for the earlier decades is females coded as being male. For example Jessica is the ranked 647 among male names. Finally entries like "Unknown" and "Baby" are not removed from the lists."
REGARDING HENRY
m (0.111843889261247)
BUGS
Did I mention this module doesn't match the v0.30 author's name?
AUTHOR
Originally by Jon Orwant <orwant@readable.com>, v0.30 by Eamon Daly <eamon@eamondaly.com>.
This is an adaptation of an 8/91 awk script by Scott Pakin in the December 91 issue of Computer Language Monthly.
Small contributions by Andrew Langmead and John Strickler. Thanks to Bob Baldwin, Matt Bishop, Daniel Klein, and the U.S. SSA for their lists of names.