NAME
Lingua::Norms::SUBTLEX - Retrieve frequency values and frequency-based lists from Brysbaert Subtitles Corpus
VERSION
Version 0.03
SYNOPSIS
use feature qw(say);
use Lingua::Norms::SUBTLEX;
my $subtlex = Lingua::Norms::SUBTLEX->new();
my $bool = $subtlex->is_normed(string => 'fuip'); # isa_word ?
my $frq = $subtlex->freq(string => 'frog'); # freq. per million, or get log/zipf
my $href = $subtlex->freqhash(words => [qw/frog fish ape/]); # freqs. for a list of words
say "$_ freq per mill = $href->{$_}" for keys %{$href};
# stats, parts-of-speech, orthographic relations:
say "mean freq per mill = ", $subtlex->mean_freq(words => [qw/frog fish ape/]); # or median, std-dev.
say "frog part-of-speech = ", $subtlex->pos(string => 'frog');
my ($count, $orthons_aref) = $subtlex->on_count(string => 'frog'); # or scalar context for count only; or freq_max/mean
say "orthon of frog = $_" for @{$orthons_aref}; # e.g., from
# retrieve (list of) words to certain specs:
my $aref = $subtlex->list_words(freq => [2, 400], onc => [1,], length => [4, 4], cv_pattern => 'CCVC', regex => '^f');
my $string = $subtlex->random_word();
DESCRIPTION
The module facilitates access to raw data and descriptive statistics on word-frequency and parts-of-speech, as provided in the SUBTLEX-US study of Brysbaert and New (2009). This comprised a study of 74,286 letter-strings, with frequencies of occurrence within a corpus of some 30 million words from the subtitles of 8,388 film and television episodes. The frequency data obtained in this way have been shown to offer more psychologically predictive measures than those derived from books, newsgroup posts, and similar. See http://expsy.ugent.be/subtlexus/ for details, and to download the datafile and install it. See also the papers listed in REFERENCES.
Only a small sample from the SUBTLEX-US datafile is included for testing purposes in the installation distribution. The complete file should be downloaded, named "US_2007.csv", and placed in an appropriate directory, with its location specified in object construction, as described below. Other language files from this project might be supported by this module but have not been tested to date.
SUBROUTINES/METHODS
All methods are called via the class object, with named arguments (key => value pairs), usually string, where relevant.
new
$subtlex = Lingua::Norms::SUBTLEX->new();
$subtlex = Lingua::Norms::SUBTLEX->new(dir => 'file_location'); # where US_2007.csv is located
$subtlex = Lingua::Norms::SUBTLEX->new(dir => 'file_location', filename => 'foo'); # where datafile is located
Returns a class object for accessing other methods. An optional argument dir can be given to specify the directory in which the SUBTLEX table is stored. If this is not specified, then the "Lingua/Norms/SUBTLEX" directory within the 'sitelib' configured for the local Perl installation is assumed to be the location (i.e., using Config.pm, and where the sample file should have been stored upon installation of the module). The method will croak if the given filename, or the default "US_2007.csv", cannot be opened (or closed) in this directory.
Frequencies and POS for individual words or word-lists
is_normed
$bool = $subtlex->is_normed(string => $word);
Alias: isa_word
Returns a boolean value to specify whether or not the letter-string passed as string is represented in the SUBTLEX corpus - by simply going line-by-line through the datafile and checking if the given string is identical to the first comma-delimited string on each line. This might be thought of as a lexical decision ("is this string a word?") but note that some very low frequency letter-strings in the corpus would not be considered words in the average context.
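The line-by-line check described above can be sketched in plain Perl as follows. This is only an illustration under stated assumptions, not the module's own code: the sub name is_listed, the in-memory sample data and its column headings are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample in the shape described above: a header line, then
# one comma-delimited line per letter-string (all values are made up).
my $csv = <<'END_CSV';
Word,FREQcount
frog,10
fish,20
END_CSV

# Return 1 if $string is identical to the first comma-delimited field
# of any data line, else 0. The name is_listed is illustrative only.
sub is_listed {
    my ($string) = @_;
    open my $fh, '<', \$csv or die "Cannot open in-memory data: $!";
    my $header = <$fh>;    # skip the column headings
    my $found  = 0;
    while ( my $line = <$fh> ) {
        chomp $line;
        my ($first) = split /,/, $line;
        if ( defined $first && $first eq $string ) {
            $found = 1;
            last;
        }
    }
    close $fh or die "Cannot close in-memory data: $!";
    return $found;
}

print is_listed('frog') ? "yes\n" : "no\n";    # prints "yes"
print is_listed('fuip') ? "yes\n" : "no\n";    # prints "no"
```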
freq
$frq = $subtlex->freq(string => 'aword');
Returns frequency per million for the word passed as string, or the empty-string if the string is not represented in the norms.
lfreq
$lfreq = $subtlex->lfreq(string => 'aword');
Returns log frequency per million for the word passed as string, or the empty-string if the string is not represented in the norms.
zipf
$zipf = $subtlex->zipf(string => 'aword');
Returns Zipf frequency for the word passed as string, or the empty-string if the string is not represented in the norms. The Zipf scale ranges from 1 to 7, with values of 1-3 representing low frequency words, and values of 4-7 representing high frequency words. See Van Heuven et al. (2014) and http://crr.ugent.be/archives/1352.
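To show roughly how a Zipf value relates to frequency per million, the sketch below assumes the simplified conversion Zipf = log10(frequency per million) + 3 given by Van Heuven et al. (2014). The sub name zipf_from_fpm is hypothetical and not part of this module, whose own calculation may differ (e.g., in any smoothing applied to raw counts).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::Norms::SUBTLEX): convert a
# frequency per million words to the Zipf scale, assuming
# Zipf = log10(frequency per million) + 3.
sub zipf_from_fpm {
    my ($fpm) = @_;
    return if !defined $fpm || $fpm <= 0;    # log is undefined for non-positive values
    return log($fpm) / log(10) + 3;
}

printf "%.2f\n", zipf_from_fpm(1);      # prints 3.00
printf "%.2f\n", zipf_from_fpm(100);    # prints 5.00
```

On this assumption, a word occurring once per million words sits at Zipf 3, the top of the low frequency band.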
cd_pct
$cd = $subtlex->cd_pct(string => 'aword');
Returns a percentage measure, to two decimal places, of the films/TV episodes whose subtitles included the given string. This corresponds to the measure "SUBTLCD" described in Brysbaert and New (2009). Note: "cd" stands for "contextual diversity."
cd_log
Returns log10(cd_pct + 1) for the given string, with 4-digit precision. This corresponds to the measure "Lg10CD" described in Brysbaert and New (2009), where it is stated that "this is the best value to use if one wants to match words on word frequency" (p. 988). Note: "cd" stands for "contextual diversity," which is based on the number of films and TV episodes in which the string was represented.
freqhash
$href = $subtlex->freqhash(strings => [qw/word1 word2/], scale => 'raw|log|zipf');
Returns frequency as a reference to a hash for the words passed as strings; e.g., {string1 => number, string2 => number, ...}. By default, the values in the hash are corpus frequency per million. If the optional argument scale is defined, and it equals log, then the values are log-frequency; similarly, zipf yields Zipf frequency.
pos
$pos_str = $subtlex->pos(string => 'aword');
Returns part-of-speech string for a given word, as per Brysbaert, New & Keuleers (2012). The return value is undefined if the word is not found.
Descriptive frequency statistics for lists
These methods return a descriptive statistic (mean, median or standard deviation) for a list of strings. Like freqhash, they take an optional argument scale to specify if the returned values should be raw frequencies per million, log frequencies, or Zipf frequencies.
mean_freq
$mean = $subtlex->mean_freq(strings => [qw/word1 word2/], scale => 'raw|log|zipf');
Returns the arithmetic mean of the frequencies for the given words, or mean of the log frequencies if scale => 'log'.
median_freq
$median = $subtlex->median_freq(words => [qw/word1 word2/], scale => 'raw|log|zipf');
Returns the median of the frequencies for the given words, or median of the log frequencies if scale => 'log'.
sd_freq
$sd = $subtlex->sd_freq(words => [qw/word1 word2/], scale => 'raw|log|zipf');
Returns the standard deviation of the frequencies for the given words, or standard deviation of the log frequencies if scale => 'log'.
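For readers who want to check returned values by hand, the three statistics can be sketched in plain Perl as below. The sub names are hypothetical, and the sample (n - 1) form of the standard deviation is an assumption; this documentation does not say whether the module uses the sample or the population form.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Arithmetic mean of a list of numbers; nothing returned for an empty list.
sub mean {
    my @v = @_;
    return if !@v;
    return sum(@v) / scalar @v;
}

# Median: the middle value, or the mean of the two middle values.
sub median {
    my @s = sort { $a <=> $b } @_;
    return if !@s;
    my $mid = int( scalar(@s) / 2 );
    return scalar(@s) % 2 ? $s[$mid] : ( $s[ $mid - 1 ] + $s[$mid] ) / 2;
}

# Standard deviation, ASSUMING the sample (n - 1) denominator.
sub sd {
    my @v = @_;
    return if @v < 2;
    my $m = mean(@v);
    return sqrt( sum( map { ( $_ - $m )**2 } @v ) / ( scalar(@v) - 1 ) );
}

my @freqs = ( 2, 4, 9 );    # made-up frequencies per million
printf "mean = %.2f, median = %.2f, sd = %.2f\n",
  mean(@freqs), median(@freqs), sd(@freqs);
```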
Orthographic neighbourhood measures
These methods return statistics on the orthographic relatedness of a specified letter-string to words in the SUBTLEX corpus. Unless otherwise stated, an orthographic neighbour here means a letter-string that is identical except for a single-letter substitution while holding string-length constant; the number of such neighbours is the Coltheart N of a letter-string, as defined in Coltheart et al. (1977). These measures are calculated in real time; they are not listed in the datafile for look-up, so expect some delay in getting a returned value.
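The neighbour definition above (same length, exactly one substituted letter) can be sketched in plain Perl as follows. The sub name is_coltheart_neighbour and the toy word list are hypothetical; the module itself searches the SUBTLEX datafile rather than a short list.

```perl
use strict;
use warnings;

# True (1) if $x and $y have equal length and differ at exactly one position.
sub is_coltheart_neighbour {
    my ( $x, $y ) = @_;
    return 0 if length($x) != length($y);
    my $diffs = 0;
    for my $i ( 0 .. length($x) - 1 ) {
        $diffs++ if substr( $x, $i, 1 ) ne substr( $y, $i, 1 );
    }
    return $diffs == 1 ? 1 : 0;
}

my @lexicon = qw(from frog flog fog frogs grog);    # toy word list
my @orthons = grep { is_coltheart_neighbour( 'frog', $_ ) } @lexicon;
print scalar(@orthons), " neighbours: @orthons\n";  # prints "3 neighbours: from flog grog"
```

Note that the string itself ('frog'), the shorter 'fog' and the longer 'frogs' are all excluded, since only single-letter substitutions at constant length count.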
on_count
$n = $subtlex->on_count(string => $letters);
($n, $orthons_aref) = $subtlex->on_count(string => $letters);
Returns orthographic neighbourhood count (Coltheart N) within the SUBTLEX corpus. Called in array context, also returns a reference to an array of the neighbours retrieved, if any.
on_freq_max
$m = $subtlex->on_freq_max(string => $letters);
Returns the maximum SUBTLEX frequency per million among the orthographic neighbours (per Coltheart N) of a particular letter-string. If (unusually) all the frequencies are the same, then that value is returned. If the string has no (Coltheart-type) neighbours, undef is returned.
on_freq_mean
$m = $subtlex->on_freq_mean(string => $letters);
Returns the mean SUBTLEX frequency per million of the orthographic neighbours (per Coltheart N) of a particular letter-string. If the string has no (Coltheart-type) neighbours, undef is returned.
on_lfreq_mean
$m = $subtlex->on_lfreq_mean(string => $letters);
Returns the mean log of SUBTLEX frequencies of the orthographic neighbours (per Coltheart N) of a particular letter-string. If the string has no (Coltheart-type) neighbours, undef is returned.
on_zipf_mean
$m = $subtlex->on_zipf_mean(string => $letters);
Returns the mean zipf of SUBTLEX frequencies of the orthographic neighbours (per Coltheart N) of a given letter-string. If the string has no (Coltheart-type) neighbours, undef is returned.
on_ldist
$m = $subtlex->on_ldist(string => $letters, lim => 20);
Alias: ldist
Returns the mean Levenshtein Distance from a letter-string to its lim closest orthographic neighbours. The default limit is 20, as defined in Yarkoni et al. (2008). The module uses the matrix-based calculation of Levenshtein Distance as implemented in this author's Lingua::Orthon module. No defined value is returned if no Levenshtein Distance is found (whereas zero would connote "identical to everything").
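A minimal sketch of this kind of measure is given below. It is not the Lingua::Orthon implementation: the sub names levenshtein and mean_ldist and the toy word list are hypothetical, and excluding the target string itself from the comparison is an assumption.

```perl
use strict;
use warnings;
use List::Util qw(sum min);

# Matrix-based (dynamic programming) Levenshtein Distance, keeping only
# the previous and current rows of the matrix.
sub levenshtein {
    my ( $s, $t ) = @_;
    my @s    = split //, $s;
    my @t    = split //, $t;
    my @prev = ( 0 .. scalar @t );
    for my $i ( 1 .. scalar @s ) {
        my @curr = ($i);
        for my $j ( 1 .. scalar @t ) {
            my $cost = $s[ $i - 1 ] eq $t[ $j - 1 ] ? 0 : 1;
            $curr[$j] = min(
                $prev[$j] + 1,              # deletion
                $curr[ $j - 1 ] + 1,        # insertion
                $prev[ $j - 1 ] + $cost,    # substitution
            );
        }
        @prev = @curr;
    }
    return $prev[-1];
}

# Mean distance from $string to its $lim closest strings in @{$lexicon}
# (ASSUMPTION: the string itself is excluded); nothing is returned if
# there is nothing to compare against.
sub mean_ldist {
    my ( $string, $lexicon, $lim ) = @_;
    my @dists = sort { $a <=> $b }
      map { levenshtein( $string, $_ ) } grep { $_ ne $string } @{$lexicon};
    return if !@dists;
    $lim = scalar @dists if $lim > @dists;
    return sum( @dists[ 0 .. $lim - 1 ] ) / $lim;
}

my @lexicon = qw(frog from fog frogs dog);    # toy word list
print mean_ldist( 'frog', \@lexicon, 20 ), "\n";    # prints 1.25
```

With only four comparison strings available, the limit of 20 falls back to 4, giving (1 + 1 + 1 + 2) / 4.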
Retrieving letter-strings/words
list_strings
$aref = $subtlex->list_words(freq => [1, 20], onc => [0, 3], length => [4, 4], cv_pattern => 'CVCV', regex => '^f');
$aref = $subtlex->list_words(zipf => [0, 2], onc => [0, 3], length => [4, 4], cv_pattern => 'CVCV', regex => '^f');
Alias: list_words
Returns a list of words from the SUBTLEX corpus that satisfies certain criteria: minimum and/or maximum letter-length (specified by the named argument length), minimum and/or maximum frequency (argument freq) or Zipf frequency (argument zipf), minimum and/or maximum orthographic neighbourhood count (argument onc), a consonant-vowel pattern (argument cv_pattern), or a specific regular expression (argument regex).
For the minimum/maximum constrained criteria, the two limits are given as a referenced array where the first element is the minimum and the second element is the maximum. For example, [3, 7] would specify letter-strings of 3 to 7 letters in length; [4, 4] specifies letter-strings of only 4 letters in length. If only one of these is to be constrained, then the array would be given as, e.g., [3] to specify a minimum of 3 letters without constraining the maximum, or ['',7] for a maximum of 7 letters without constraining the minimum (each element is checked with hascontent, as per String::Util).
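The minimum/maximum convention can be sketched as follows. The sub name in_range is hypothetical, and the small closure only stands in for String::Util's hascontent (true when the value is defined and contains a non-whitespace character), so that the example runs without that module installed.

```perl
use strict;
use warnings;

# True (1) if $value lies within the limits given as [min, max]; a limit
# that is missing, undefined or an empty string is left unconstrained.
sub in_range {
    my ( $value, $limits ) = @_;
    my ( $min, $max ) = @{$limits};

    # Stand-in for String::Util's hascontent.
    my $has = sub { defined $_[0] && $_[0] =~ /\S/ };

    return 0 if $has->($min) && $value < $min;
    return 0 if $has->($max) && $value > $max;
    return 1;
}

print in_range( 5, [ 3, 7 ] ),  "\n";    # prints 1: within 3 to 7
print in_range( 9, [3] ),       "\n";    # prints 1: minimum only
print in_range( 8, [ '', 7 ] ), "\n";    # prints 0: above the maximum
```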
The consonant-vowel pattern is specified as a string by the usual convention, e.g., 'CCVCC' defines a 5-letter word starting and ending with pairs of consonants, the pairs separated by a vowel. 'Y' is defined here as a consonant.
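As an illustration of this convention, the sketch below derives the consonant-vowel pattern of a letter-string so that it can be compared with a value such as 'CCVC'. The sub name cv_pattern_of is hypothetical, and characters other than the letters a-z are left unchanged here, which may not match the module's behaviour.

```perl
use strict;
use warnings;

# Map each letter to 'V' (a, e, i, o, u) or 'C' (any other letter,
# including 'y'); e.g., 'frog' becomes 'CCVC'.
sub cv_pattern_of {
    my ($string) = @_;
    my $pattern = lc $string;
    $pattern =~ tr/aeiou/V/;    # vowels first
    $pattern =~ tr/a-z/C/;      # every remaining lowercase letter is a consonant
    return $pattern;
}

print cv_pattern_of('frog'), "\n";    # prints CCVC
print cv_pattern_of('fuse'), "\n";    # prints CVCV
print cv_pattern_of('yard'), "\n";    # prints CVCC ('y' counts as a consonant)
```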
A finer selection of particular letters can be made by giving a regular expression as a string to the regex argument. In the example above, only letter-strings starting with the letter 'f', followed by one or more other letters, are specified. Alternatively, for example, '[^aeiouy]$' specifies that the letter-strings must not end with a vowel (here including 'y'). The entire example for '^f', including the shown arguments for cv_pattern, freq, onc and length, would return only two words: fiji and fuse.
The selection procedure will be particularly slow wherever onc is specified (as this has to be calculated in real time) and no arguments are given for cv_pattern and regex (which are tested ahead of any other criteria).
Syllable-counts might be added in future; existing algorithms in the Lingua family are not sufficiently reliable for the purposes to which the present module might often be put; an alternative is being worked on.
The value returned is always a reference to the list of words retrieved (or to an empty list if none was retrieved).
all_strings
$aref = $subtlex->all_strings();
Alias: all_words
Returns a reference to an array of all letter-strings in the corpus, in their given order.
random_string
$string = $subtlex->random_string();
@data = $subtlex->random_string();
Alias: random_word
Picks a random line from the corpus (excluding the top header line), using File::RandomLine. Returns the word in that line if called in scalar context; otherwise, the array of data for that line. (A future version might allow specifying a match to specific criteria, self-aborting after trying X lines.)
Miscellaneous
nlines
$num = $subtlex->nlines();
Returns the number of lines, less the column headings, in the installed US_2007.csv file that is read by the other methods. Expects/accepts no arguments.
DIAGNOSTICS
- Value given to argument 'dir' (VALUE) in new() is not a directory
Croaked from new() if called with a value for the argument dir, and this value is not actually a directory/folder. This is where the main file, named US_2007.csv, should be located.
- US_2007.csv does not exist within the directory 'VALUE'. Maybe you need to download the file (see POD) or re-locate it
Croaked from new() if the given or default directory exists, but the file 'US_2007.csv' cannot be found within it. This is the location of the file that should have been downloaded from the site: http://expsy.ugent.be/subtlexus/.
- Cannot open SUBTLEX data file
Croaked when calling new and a valid path to the US_2007.csv file is not available for opening (and similarly for closing).
- No word(s) to test; pass a string to the function
Croaked by a number of methods that expect a value for the named argument string when no such value is given, or the string is empty. These methods require the letter-string to be passed to them as a key => value pair, with the key string and the value the string to test.
DEPENDENCIES
REFERENCES
Brysbaert, M., & New, B. (2009). Moving beyond Kucera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41, 977-990. doi: 10.3758/BRM.41.4.977
Brysbaert, M., New, B., & Keuleers, E. (2012). Adding part-of-speech information to the SUBTLEX-US word frequencies. Behavior Research Methods, 44, 991-997. doi: 10.3758/s13428-012-0190-4
Coltheart, M., Davelaar, E., Jonasson, J. T., & Besner, D. (1977). Access to the internal lexicon. In S. Dornic (Ed.), Attention and performance (Vol. 6, pp. 535-555). London, UK: Academic.
Van Heuven, W. J. B., Mandera, P., Keuleers, E., & Brysbaert, M. (2014). SUBTLEX-UK: A new and improved word frequency database for British English. Quarterly Journal of Experimental Psychology, 67, 1176-1190. doi: 10.1080/17470218.2013.850521
Yarkoni, T., Balota, D. A., & Yap, M. (2008). Moving beyond Coltheart's N: A new measure of orthographic similarity. Psychonomic Bulletin and Review, 15, 971-979. doi: 10.3758/PBR.15.5.971
AUTHOR
Roderick Garton, <rgarton at cpan.org>
BUGS AND LIMITATIONS
Please report any bugs or feature requests to bug-lingua-norms-subtlfreq-0.03 at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Lingua-Norms-SUBTLEX-0.03. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
TO DO
- Aliases
Alias for functions? frq, logf ... avoiding debate about whether a string is a word or not. Making the whole module a child of Class::Accessor, or Moosify.
- Language
Test with different language norms, and adapt if necessary.
SUPPORT
You can find documentation for this module with the perldoc command.
perldoc Lingua::Norms::SUBTLEX
You can also look for information at:
RT: CPAN's request tracker (report bugs here)
http://rt.cpan.org/NoAuth/Bugs.html?Dist=Lingua-Norms-SUBTLEX-0.03
AnnoCPAN: Annotated CPAN documentation
CPAN Ratings
Search CPAN
LICENSE AND COPYRIGHT
Copyright 2014-2015 Roderick Garton.
This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.
See http://dev.perl.org/licenses/ for more information.