NAME

Lingua::Norms::SUBTLEX - Retrieve frequency values and frequency-based lists from the Brysbaert Subtitles Corpus

VERSION

Version 0.01

SYNOPSIS

# the basics:
use feature 'say'; # needed for the say() calls below
use Lingua::Norms::SUBTLEX;
my $subtlex = Lingua::Norms::SUBTLEX->new();
my $bool = $subtlex->is_normed(string => 'fuip'); # alias: isa_word
my $frq = $subtlex->freq(string => 'frog'); # freq. per million, or get log/zipf
my $href = $subtlex->freqhash(strings => [qw/frog fish ape/]); # freqs. for a list of words
say "$_ freq per mill = $href->{$_}" for keys %{$href};

# stats, parts-of-speech, orthographic relations:
say "mean freq per mill = ", $subtlex->mean_freq(strings => [qw/frog fish ape/]); # or median, std-dev.
say "frog part-of-speech = ", $subtlex->pos(string => 'frog');
my ($count, $orthons_aref) = $subtlex->on_count(string => 'frog'); # or scalar context for count only; or freq_max/mean
say "orthon of frog = $_" for @{$orthons_aref}; # e.g., "from"

# retrieve (list of) words to certain specs:
my $aref = $subtlex->list_words(freq => [2, 400], onc => [1,], length => [4, 4], cv_pattern => 'CCVC', regex => '^f');
my $string = $subtlex->random_word();

DESCRIPTION

The SUBTLEX-US word-frequency list comprises 74,286 letter-strings, with their frequencies of occurrence and parts-of-speech, based on a corpus of some 51 million words from film and television subtitles. For details, and to download the file, see http://expsy.ugent.be/subtlexus/ and REFERENCES. Only a small sample from the SUBTLEX-US list is included in the installation distribution, for testing purposes (otherwise the archive would be about 2 MB, and testing would take about 35 secs). The complete file should be downloaded, named "US_2007.csv", and placed in an appropriate directory, with its location specified in object construction, as described below. Files for other languages from this project might be supported by this module but have not been tested to date.

SUBROUTINES/METHODS

All methods are called on the object returned by new(), with named arguments (key => value pairs); where relevant, the key is usually string.

new

$subtlex = Lingua::Norms::SUBTLEX->new();
$subtlex = Lingua::Norms::SUBTLEX->new(dir => 'file_location'); # where US_2007.csv is located
$subtlex = Lingua::Norms::SUBTLEX->new(dir => 'file_location', filename => 'foo'); # where datafile is located

Returns a class object for accessing the other methods. An optional argument dir can be given to specify the directory in which the SUBTLEX table is stored. If this is not specified, then the "Lingua/Norms/SUBTLEX" directory within the 'sitelib' configured for the local Perl installation is assumed to be the location (i.e., using Config.pm; this is where the sample file should have been stored upon installation of the module). An optional argument filename names the data file within that directory if it is not the default, "US_2007.csv". The method will croak if the given or default file cannot be opened (or closed).
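
As a rough illustration of the default location described above, the following sketch builds the assumed path from the sitelib entry of Config.pm. The variable names are illustrative only, and the module's own path-building code may well differ.

```perl
use strict;
use warnings;
use Config;        # core module exposing the local Perl configuration
use File::Spec;    # core module for portable path handling

# Assumed default location: <sitelib>/Lingua/Norms/SUBTLEX/US_2007.csv
my $default_dir  = File::Spec->catdir( $Config{'sitelib'}, qw/Lingua Norms SUBTLEX/ );
my $default_file = File::Spec->catfile( $default_dir, 'US_2007.csv' );

print "Expected data file: $default_file\n";
print( ( -e $default_file ) ? "found\n" : "not found; pass dir => '...' to new()\n" );
```

If the file is not found there, passing dir (and, if needed, filename) to new() is the documented way to point the module at the downloaded file.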

Frequencies and POS for individual words or word-lists

is_normed

$bool = $subtlex->is_normed(string => $word);

Alias: isa_word

Returns a boolean value to specify whether or not the letter-string passed as string is represented in the SUBTLEX corpus - by simply going line-by-line through the datafile and checking if the given string is identical to the first comma-delimited string on each line. This might be thought of as a lexical decision ("is this string a word?") but note that some very low frequency letter-strings in the corpus would not be considered words in the average context.
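
The line-by-line check described above can be sketched as follows. The helper first_field_matches is hypothetical (it is not part of the module), the sample rows and column names are made up, and the comparison here is case-sensitive, which may or may not match the module's behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper: scan a filehandle line by line and report whether
# $string equals the first comma-delimited field of any data line.
sub first_field_matches {
    my ( $fh, $string ) = @_;
    my $header = <$fh>;    # skip the column headings
    while ( my $line = <$fh> ) {
        chomp $line;
        my ($first) = split /,/, $line;
        return 1 if defined $first and $first eq $string;
    }
    return 0;
}

# Made-up rows standing in for the real data file
my $sample = "Word,FREQcount\nfrog,10\nfish,20\n";
open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
print first_field_matches( $fh, 'frog' ) ? "normed\n" : "not normed\n";
close $fh or die "Cannot close in-memory sample: $!";
```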

freq

$frq = $subtlex->freq(string => 'aword');

Returns frequency per million for the word passed as string, or the empty-string if the string is not represented in the norms.

lfreq

$lfreq = $subtlex->lfreq(string => 'aword');

Returns log frequency per million for the word passed as string, or the empty-string if the string is not represented in the norms.

zipf

$zipf = $subtlex->zipf(string => 'aword');

Returns zipf frequency for the word passed as string, or the empty-string if the string is not represented in the norms. See Van Heuven et al. (in press) and http://crr.ugent.be/archives/1352.
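
For orientation, the Zipf scale of Van Heuven et al. is usually described as the base-10 log of frequency per billion words, which for a per-million value amounts to log10(fpm) + 3. The sketch below assumes that simplified form; the helper fpm_to_zipf is hypothetical, and whether the module computes the value this way, applies any smoothing, or reads it from the data file is not stated here.

```perl
use strict;
use warnings;

# Hypothetical helper: convert a frequency per million to the Zipf scale,
# assuming Zipf = log10(frequency per million) + 3.
sub fpm_to_zipf {
    my ($fpm) = @_;
    return if !defined $fpm or $fpm <= 0;    # log is undefined for zero/negative values
    return log($fpm) / log(10) + 3;
}

printf "%.2f\n", fpm_to_zipf(1);      # a word occurring once per million
printf "%.2f\n", fpm_to_zipf(100);    # a word occurring 100 times per million
```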

freqhash

$href = $subtlex->freqhash(strings => [qw/word1 word2/], scale => 'raw|log|zipf');

Returns a reference to a hash of frequencies for the words passed as strings; e.g., {string1 => number, string2 => number, ...}. By default, the values in the hash are corpus frequency per million. If the optional argument scale is defined and equals log, then the values are log frequencies; similarly, zipf yields Zipf frequencies.

pos

$pos_str = $subtlex->pos(string => 'aword');

Returns part-of-speech string for a given word, as per Brysbaert, New & Keuleers (2012). The return value is undefined if the word is not found.

Descriptive frequency statistics for lists

These methods return a descriptive statistic (mean, median or standard deviation) for a list of strings. Like freqhash, they take an optional argument scale to specify whether the returned values should be based on raw frequencies per million, log frequencies, or Zipf frequencies.

mean_freq

$mean = $subtlex->mean_freq(strings => [qw/word1 word2/], scale => 'raw|log|zipf');

Returns the arithmetic mean of the frequencies for the given words, or the mean of their log or Zipf frequencies if scale is set to log or zipf.

median_freq

$median = $subtlex->median_freq(strings => [qw/word1 word2/], scale => 'raw|log|zipf');

Returns the median of the frequencies for the given words, or the median of their log or Zipf frequencies if scale is set to log or zipf.

sd_freq

$sd = $subtlex->sd_freq(strings => [qw/word1 word2/], scale => 'raw|log|zipf');

Returns the standard deviation of the frequencies for the given words, or the standard deviation of their log or Zipf frequencies if scale is set to log or zipf.
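
To make the three statistics concrete, here is a minimal sketch over a made-up list of frequencies, using only core modules. The helper names are hypothetical, and the choice of the sample (n - 1) form of the standard deviation is an assumption; the module delegates to Statistics::Lite, whose exact behaviour should be checked in its own documentation.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helpers, not the module's own code
sub list_mean { my @v = @_; return sum(@v) / @v; }

sub list_median {
    my @s   = sort { $a <=> $b } @_;
    my $mid = int( @s / 2 );
    return @s % 2 ? $s[$mid] : ( $s[ $mid - 1 ] + $s[$mid] ) / 2;
}

# Sample standard deviation (n - 1 denominator); an assumption, see above
sub list_sd {
    my @v = @_;
    return 0 if @v < 2;
    my $m = list_mean(@v);
    return sqrt( sum( map { ( $_ - $m )**2 } @v ) / ( @v - 1 ) );
}

my @freqs = ( 2, 4, 9 );    # made-up frequencies per million
printf "mean = %.2f, median = %.2f, sd = %.2f\n",
    list_mean(@freqs), list_median(@freqs), list_sd(@freqs);
```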

Orthographic neighbourhood measures

These methods return statistics on the orthographic relatedness of a specified letter-string to words in the SUBTLEX corpus. Unless otherwise stated, an orthographic neighbour here means a letter-string that is identical except for a single-letter substitution, holding string-length constant, i.e., as counted by the Coltheart N of a letter-string, defined in Coltheart et al. (1977). These measures are calculated in real time; they are not listed in the datafile for look-up, so expect a noticeable delay before a value is returned.

on_count

$n = $subtlex->on_count(string => $letters);
($n, $orthons_aref) = $subtlex->on_count(string => $letters);

Returns orthographic neighbourhood count (Coltheart N) within the SUBTLEX corpus. Called in array context, also returns a reference to an array of the neighbours retrieved, if any.
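
The single-substitution definition of a neighbour can be sketched as below. The helper is_substitution_neighbour and the mini-lexicon are made up for illustration; the module itself searches the SUBTLEX data file and may implement the comparison differently.

```perl
use strict;
use warnings;

# Hypothetical helper: true if two strings have the same length and differ
# at exactly one letter position (a single-letter substitution).
sub is_substitution_neighbour {
    my ( $x, $y ) = @_;
    return 0 if length($x) != length($y);
    my $diffs = 0;
    for my $i ( 0 .. length($x) - 1 ) {
        $diffs++ if substr( $x, $i, 1 ) ne substr( $y, $i, 1 );
        return 0 if $diffs > 1;    # stop early once a second mismatch is seen
    }
    return $diffs == 1 ? 1 : 0;
}

my @lexicon = qw/frog from flog frag fog frogs fish/;    # made-up mini-lexicon
my @orthons = grep { is_substitution_neighbour( 'frog', $_ ) } @lexicon;
print scalar(@orthons), " neighbours: @orthons\n";
```

Note that the string itself, shorter or longer strings, and strings differing at more than one position are all excluded from the count.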

on_freq_max

$m = $subtlex->on_freq_max(string => $letters);

Returns the maximum SUBTLEX frequency per million among the orthographic neighbours (per Coltheart N) of a particular letter-string. If (unusually) all the frequencies are the same, then that value is returned. If the string has no (Coltheart-type) neighbours, undef is returned.

on_freq_mean

$m = $subtlex->on_freq_mean(string => $letters);

Returns the mean SUBTLEX frequency per million of the orthographic neighbours (per Coltheart N) of a particular letter-string. If the string has no (Coltheart-type) neighbours, undef is returned.

on_lfreq_mean

$m = $subtlex->on_lfreq_mean(string => $letters);

Returns the mean log of SUBTLEX frequencies of the orthographic neighbours (per Coltheart N) of a particular letter-string. If the string has no (Coltheart-type) neighbours, undef is returned.

on_zipf_mean

$m = $subtlex->on_zipf_mean(string => $letters);

Returns the mean zipf of SUBTLEX frequencies of the orthographic neighbours (per Coltheart N) of a given letter-string. If the string has no (Coltheart-type) neighbours, undef is returned.

on_ldist

$m = $subtlex->on_ldist(string => $letters, lim => 20);

Alias: ldist

Returns the mean Levenshtein Distance from a word to its lim closest orthographic neighbours. The default limit is 20, as defined in Yarkoni et al. (2008). The module uses the matrix-based calculation of Levenshtein Distance as implemented in this author's Lingua::Orthon module. No defined value is returned if no Levenshtein Distance is found (whereas zero would connote "identical to everything").
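
As an illustration of the measure only, the sketch below computes a textbook dynamic-programming Levenshtein Distance and averages the lim smallest distances from a target to the other strings in a made-up lexicon. Both helpers are hypothetical; the module relies on Lingua::Orthon for the distance itself, and its handling of ties, limits and the target string may differ.

```perl
use strict;
use warnings;
use List::Util qw(min sum);

# Textbook Levenshtein distance, keeping only two rows of the matrix
sub levenshtein {
    my ( $s, $t ) = @_;
    my @prev = ( 0 .. length($t) );
    for my $i ( 1 .. length($s) ) {
        my @cur = ($i);
        for my $j ( 1 .. length($t) ) {
            my $cost = substr( $s, $i - 1, 1 ) eq substr( $t, $j - 1, 1 ) ? 0 : 1;
            push @cur, min( $prev[$j] + 1, $cur[ $j - 1 ] + 1, $prev[ $j - 1 ] + $cost );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Hypothetical helper: mean distance to the $lim closest other strings;
# returns nothing (undef in scalar context) if there is nothing to compare
sub mean_ldist {
    my ( $string, $lim, @lexicon ) = @_;
    my @d = sort { $a <=> $b }
        map { levenshtein( $string, $_ ) } grep { $_ ne $string } @lexicon;
    return if !@d;
    $lim = @d if $lim > @d;
    return sum( @d[ 0 .. $lim - 1 ] ) / $lim;
}

printf "%.2f\n", mean_ldist( 'frog', 20, qw/from flog fog frogs fish/ );
```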

Retrieving letter-strings/words

list_strings

$aref = $subtlex->list_words(freq => [1, 20], onc => [0, 3], length => [4, 4], cv_pattern => 'CVCV', regex => '^f');
$aref = $subtlex->list_words(zipf => [0, 2], onc => [0, 3], length => [4, 4], cv_pattern => 'CVCV', regex => '^f');

Alias: list_words

Returns a reference to a list of words from the SUBTLEX corpus that satisfy certain criteria: minimum and/or maximum letter-length (specified by the named argument length), minimum and/or maximum frequency (argument freq) or Zipf frequency (argument zipf), minimum and/or maximum orthographic neighbourhood count (argument onc), a consonant-vowel pattern (argument cv_pattern), or a specific regular expression (argument regex).

For the minimum/maximum constrained criteria, the two limits are given as a referenced array where the first element is the minimum and the second element is the maximum. For example, [3, 7] would specify letter-strings of 3 to 7 letters in length; [4, 4] specifies letter-strings of only 4 letters in length. If only one of these is to be constrained, then the array would be given as, e.g., [3] to specify a minimum of 3 letters without constraining the maximum, or ['',7] for a maximum of 7 letters without constraining the minimum (an element counts as given only if it has content, as per hascontent in String::Util).
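
The optional-limit convention can be sketched as follows. The helper in_range is hypothetical, and its inline content test is only a rough stand-in for hascontent from String::Util (defined and containing a non-whitespace character).

```perl
use strict;
use warnings;

# Hypothetical helper: test $value against [min, max], where either limit
# may be omitted, undefined or an empty string (meaning "unconstrained").
sub in_range {
    my ( $value, $limits ) = @_;
    my ( $min, $max ) = @{$limits};
    my $has = sub { defined $_[0] && $_[0] =~ /\S/ };    # stand-in for hascontent
    return 0 if $has->($min) && $value < $min;
    return 0 if $has->($max) && $value > $max;
    return 1;
}

for my $spec ( [ 3, 7 ], [3], [ '', 7 ], [ 4, 4 ] ) {
    my $label = join ',', map { defined $_ ? $_ : '' } @{$spec};
    print "length 8 within [$label]? ", ( in_range( 8, $spec ) ? 'yes' : 'no' ), "\n";
}
```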

The consonant-vowel pattern is specified as a string by the usual convention, e.g., 'CCVCC' defines a 5-letter word starting and ending with pairs of consonants, the pairs separated by a vowel. 'Y' is defined here as a consonant.
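
A minimal sketch of deriving such a pattern from a letter-string, treating a, e, i, o and u as vowels and every other letter (including y) as a consonant, as stated above. The helper cv_pattern_of is hypothetical and is not the module's own routine.

```perl
use strict;
use warnings;

# Hypothetical helper: map a letter-string to its consonant-vowel pattern
sub cv_pattern_of {
    my ($string) = @_;
    my $pattern = lc $string;
    $pattern =~ tr/a-z//cd;     # drop anything that is not a plain letter
    $pattern =~ tr/aeiou/V/;    # vowels (not y) become V
    $pattern =~ tr/a-z/C/;      # remaining lower-case letters, including y, become C
    return $pattern;
}

print "$_ => ", cv_pattern_of($_), "\n" for qw/frog fuse fiji stand/;
```

A string then satisfies cv_pattern => 'CVCV' when its derived pattern is exactly 'CVCV'.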

A finer selection of particular letters can be made by giving a regular expression as a string to the regex argument. In the example above, only letter-strings starting with the letter 'f', followed by one or more other letters, are specified. Alternatively, for example, '[^aeiouy]$' specifies that the letter-strings must not end with a vowel (here including 'y'). The entire example for '^f', including the shown arguments for cv_pattern, freq, onc and length, would return only two words: fiji and fuse.

The selection procedure is particularly slow whenever onc is specified (as this has to be calculated in real time) and no arguments are given for cv_pattern or regex (which are tested ahead of any other criteria).

Syllable-counts might be added in future; existing algorithms in the Lingua family are not sufficiently reliable for the purposes to which the present module might often be put; an alternative is being worked on.

The value returned is always a reference to the list of words retrieved (or to an empty list if none was retrieved).

all_strings

$aref = $subtlex->all_strings();

Alias: all_words

Returns a reference to an array of all letter-strings in the corpus, in their given order.

random_string

$string = $subtlex->random_string();
@data = $subtlex->random_string();

Alias: random_word

Picks a random line from the corpus (excluding the top header line), using File::RandomLine. Returns the word in that line if called in scalar context; otherwise, the array of data for that line. (A future version might allow specifying criteria to match, self-aborting after trying X lines.)
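
The context-sensitive return described above can be sketched with wantarray. This example picks a line by single-pass reservoir sampling over a made-up in-memory file; that technique is swapped in purely for illustration and is not how File::RandomLine works, and the helper random_row is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: pick one data line at random (skipping the header)
# and return the first field in scalar context, or all fields in list context.
sub random_row {
    my ($fh) = @_;
    my $header = <$fh>;    # skip the column headings
    my ( $pick, $n ) = ( undef, 0 );
    while ( my $line = <$fh> ) {
        $n++;
        $pick = $line if rand($n) < 1;    # keep the n-th line with probability 1/n
    }
    return if !defined $pick;
    chomp $pick;
    my @fields = split /,/, $pick;
    return wantarray ? @fields : $fields[0];
}

my $sample = "Word,FREQcount\nfrog,10\nfish,20\nape,5\n";    # made-up rows
open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
my $word = random_row($fh);    # scalar context: the word only
seek $fh, 0, 0;                # rewind before sampling again
my @data = random_row($fh);    # list context: all fields of the line
close $fh or die "Cannot close in-memory sample: $!";
print "word: $word; data: @data\n";
```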

Miscellaneous

nlines

Returns the number of lines, less the column headings, in the installed US_2007.csv file that is read by the other methods.

DIAGNOSTICS

Value given to argument 'dir' (VALUE) in new() is not a directory

Croaked from new() if called with a value for the argument dir, and this value is not actually a directory/folder. This is where the main file, named US_2007.csv, should be located.

US_2007.csv does not exist within the directory 'VALUE'. Maybe you need to download the file (see POD) or re-locate it

Croaked from new() if the given or default directory exists, but the file 'US_2007.csv' cannot be found within it. This is the location of the file that should have been downloaded from the site: http://expsy.ugent.be/subtlexus/.

Cannot open SUBTLEX data file

Croaked when calling new() and a valid path to the US_2007.csv file is not available for opening (and similarly for closing).

No word(s) to test; pass a string to the function

Croaked by a number of methods that expect a value for the named argument string when no such value is given, or the string is empty. These methods require the letter-string to be passed as a key => value pair, with the key string and the value the string to test.

DEPENDENCIES

Statistics::Lite

Lingua::Orthon

String::Util

File::RandomLine

REFERENCES

Brysbaert, M., & New, B. (2009). Moving beyond Kucera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41, 977-990. doi: 10.3758/BRM.41.4.977.

Brysbaert, M., New, B., & Keuleers, E. (2012). Adding part-of-speech information to the SUBTLEX-US word frequencies. Behavior Research Methods, 44, 991-997.

Coltheart, M., Davelaar, E., Jonasson, J. T., & Besner, D. (1977). Access to the internal lexicon. In S. Dornic (Ed.), Attention and performance (Vol. 6, pp. 535-555). London, UK: Academic.

Van Heuven, W. J. B., Mandera, P., Keuleers, E., & Brysbaert, M. (in press). SUBTLEX-UK: A new and improved word frequency database for British English. Quarterly Journal of Experimental Psychology.

Yarkoni, T., Balota, D. A., & Yap, M. (2008). Moving beyond Coltheart's N: A new measure of orthographic similarity. Psychonomic Bulletin and Review, 15, 971-979. doi: 10.3758/PBR.15.5.971.

AUTHOR

Roderick Garton, <rgarton at cpan.org>

BUGS AND LIMITATIONS

Please report any bugs or feature requests to bug-lingua-norms-subtlfreq-0.01 at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Lingua-Norms-SUBTLEX-0.01. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

TO DO

Aliases

Alias for functions? frq, logf ... avoiding debate about how a string is a word or not. making whole module child of Class::Accessor or Moosify.

Language

Test with different language norms, adapt if necessary

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc Lingua::Norms::SUBTLEX

LICENSE AND COPYRIGHT

Copyright 2014 Roderick Garton.

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.