NAME

Lingua::Norms::SUBTLEX - Retrieve frequency values and frequency-based lists for words from Subtitles Corpora

VERSION

Version 0.05

SYNOPSIS

use feature qw(say);
use Lingua::Norms::SUBTLEX;
my $subtlex = Lingua::Norms::SUBTLEX->new(lang => 'US'); # or NL, UK, DE
my $bool = $subtlex->is_normed(string => 'fuip'); # isa_word ? 
my $frq = $subtlex->frq_opm(string => 'frog'); # freq. per million, or get log/zipf
my $href = $subtlex->frq_hash(strings => [qw/frog fish ape/]); # freqs. for a list of words
say "$_ freq per mill = $href->{$_}" for keys %{$href};

# stats, parts-of-speech, orthographic relations:
say "mean freq per mill = ", $subtlex->frq_mean(strings => [qw/frog fish ape/]); # or frq_median, frq_sd
say "frog part-of-speech = ", $subtlex->pos(string => 'frog');
my ($count, $orthons_aref) = $subtlex->on_count(string => 'frog'); # or scalar context for count only; or freq_max/mean
say "orthon of frog = $_" for @{$orthons_aref}; # e.g., from

# retrieve (list of) words to certain specs:
my $aref = $subtlex->list_words(freq => [2, 400], onc => [1,], length => [4, 4], cv_pattern => 'CCVC', regex => '^f');
my $string = $subtlex->random_word();

DESCRIPTION

The module facilitates access to raw data and descriptive statistics on word-frequency and parts-of-speech, as provided in the SUBTLEX-DE, SUBTLEX-NL, SUBTLEX-UK and SUBTLEX-US databases (see REFERENCES). For example, the SUBTLEX-US database is based on a study of 74,286 letter-strings, with frequencies of occurrence within a corpus of some 30 million words from the subtitles of 8,388 film and television episodes. The frequency data obtained in this way have been shown to offer more psychologically predictive measures than those derived from books, newsgroup posts, and similar.

There are three groups of retrievable stats and sampling rules: (1) frequency; (2) contextual diversity (the number of films/episodes in which a string appeared); and (3) parts-of-speech. Depending on the source language, frequency is given as a count (frq_count), occurrences per million (frq_opm), logarithm of the opm (frq_log), and/or on the 7-point Zipf scale (frq_zipf); contextual diversity is given as a count (cd_count), a percentage (cd_pct), or a logarithm (cd_log). For parts-of-speech, pos returns a string giving the dominant part. Sampling rules use the same labels as keys, each taking minimum/maximum values (or, for parts-of-speech, a whitelist of acceptable parts).

A small sample from each of the databases is included in the installation distribution for testing purposes. The complete files need to be downloaded via the URLs given below. The local directory or the actual pathname of these files can be given in class construction (by the arguments dir and path); otherwise, the default location (the directory "SUBTLEX" alongside the module itself in the locally configured Perl sitelib) will be used, and the correct file is determined by the inclusion of the lang value within its filename. The original filenames of the files downloaded from these sites are supported in this way, and it does not matter whether (as varies between the files) the fields are comma-separated or tab-delimited.
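As a rough illustration of matching a datafile by the lang value in its filename, the sketch below tests a list of filenames against a case-insensitive pattern. The helper pick_subtlex_file and its pattern are hypothetical, written only for this example; the module's own matching rule may differ.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): return the first filename
# containing "SUBTLEX", optionally followed by "-" or "_", then the
# language code, ignoring case.
sub pick_subtlex_file {
    my ($lang, @filenames) = @_;
    for my $name (@filenames) {
        return $name if $name =~ /SUBTLEX[-_]?\Q$lang\E/i;
    }
    return;    # no filename matched
}

# The original filenames named in this documentation
my @files = (
    'SUBTLEXusExcel2007.csv',
    'SUBTLEX-UK.txt',
    'SUBTLEX-NL.with-pos.txt',
    'SUBTLEX-DE_cleaned_with_Google00.txt',
);

for my $lang (qw/US UK NL DE/) {
    my $found = pick_subtlex_file($lang, @files);
    print "$lang => $found\n";
}
```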

The four databases (one file per language) do not provide values for all methods. Only the methods frq_count, cd_count, cd_pct and pos are supported by all of the NL, UK and US databases; the DE database has no CD or POS data. Further details of unsupported methods per database/lang are given below.

SUBTLEX-US

For the American norms, install the file "SUBTLEXusExcel2007.csv" from expsy.ugent.be/subtlexus/. All methods are supported by this database.

SUBTLEX-UK

For the British norms, install the file "SUBTLEX-UK.txt" from within the "SUBTLEX-UK.zip" archive via psychology.nottingham.ac.uk/subtlex-uk/. This database does not define values for occurrences per million (or log occurrences per million); the methods for these stats will return an empty string.

SUBTLEX-NL

For the Dutch norms, install the file "SUBTLEX-NL.with-pos.txt" from within the archive "SUBTLEX-NL.with-pos.txt.zip" via crr.ugent.be. This database does not define a value for Zipf frequency, so the frq_zipf method will return an empty string if called with NL as the "lang".

SUBTLEX-DE

For the German norms, download the file "SUBTLEX-DE_cleaned_with_Google00.txt" via crr.ugent.be. There is no CD, POS or Zipf data at this point, so only the "frq_" methods and the "on_" methods (based on realtime calculation) work with this language. The file contains other information, including Google-based frequencies, for which this module does not provide retrieval at this time.

There are several other languages from this project which might be supported by this module in a later version (originally, only SUBTLEX-US was supported).

SUBROUTINES/METHODS

All methods are called via the class object, and with named (hash of) arguments, usually string, where relevant.

new

$subtlex = Lingua::Norms::SUBTLEX->new(lang => 'US'); # or 'UK', 'NL', 'DE' - looking in Perl sitelib
$subtlex = Lingua::Norms::SUBTLEX->new(lang => 'US', dir => 'file_location'); # where to look
$subtlex = Lingua::Norms::SUBTLEX->new(lang => 'US', path => 'actual_file');

Returns a class object for accessing other methods. The parameter lang should be set to specify the particular language database: DE (German), NL (Dutch), UK (British) or US (American); otherwise US (being the first published in the series) is the default. Optional arguments dir or path can be given to specify the location or actual file (respectively) of the database. The default location is within the "Lingua/Norms/SUBTLEX" directory within the 'sitelib' configured for the local Perl installation (as per Config.pm). The method will croak if the given path or default location cannot be found.

Frequencies and POS for individual words or word-lists

is_normed

$bool = $subtlex->is_normed(string => $word);

Alias: isa_word

Returns a boolean value to specify whether or not the letter-string passed as string is represented in the SUBTLEX corpus. This might be thought of as a lexical decision ("is this string a word?") but note that some very low frequency letter-strings in the corpus would not be considered words in the average context (perhaps, in part, because of misspelt subtitles).

frq_count

$val = $subtlex->frq_count(string => 'aword');

Returns the raw number of occurrences in all the films/TV episodes for the word passed as string, or the empty-string if the string is not represented in the norms.

frq_opm

$val = $subtlex->frq_opm(string => 'aword');

Alias: opm

Returns frequency per million for the word passed as string, or the empty-string if the string is not represented in the norms.

frq_log

$val = $subtlex->frq_log(string => 'aword');

Returns log frequency per million for the word passed as string, or the empty-string if the string is not represented in the norms.

frq_zipf

$val = $subtlex->frq_zipf(string => 'aword');

Returns Zipf frequency for the word passed as string, or the empty-string if the string is not represented in the norms. The Zipf scale ranges from 1 to 7, with values of 1-3 representing low frequency words, and values of 4-7 representing high frequency words. See Van Heuven et al. (2014) and crr.ugent.be/archives.
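To give a feel for the scale, the sketch below converts occurrences per million to a Zipf value using the commonly cited simplification Zipf = log10(frequency per million) + 3. This is an assumption for illustration only: the function zipf_from_opm is hypothetical, and the published norms apply additional smoothing, so values read from a datafile will not match exactly.

```perl
use strict;
use warnings;

# Hypothetical helper: Zipf value from occurrences per million (opm),
# assuming the simplified form log10(opm) + 3.
sub zipf_from_opm {
    my ($opm) = @_;
    # The log is undefined for zero or negative input; return the empty
    # string, mirroring the module's convention for missing values.
    return '' if !defined $opm || $opm <= 0;
    return log($opm) / log(10) + 3;
}

# One occurrence per million sits at the low/high boundary region of the scale
printf "opm %6.2f => Zipf %.2f\n", $_, zipf_from_opm($_) for (0.01, 1, 100);
```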

cd_count

$cd = $subtlex->cd_count(string => 'aword');

Returns the number of films/TV episodes in which the given string appeared. This corresponds to the column labelled "CDcount" in the datafile.

cd_pct

$cd = $subtlex->cd_pct(string => 'aword');

Returns, as a percentage to two decimal places, the proportion of films/TV episodes in which the given string appeared in the subtitles. This corresponds to the measure "SUBTLCD" described in Brysbaert and New (2009). Note that "cd" stands for "contextual diversity."
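The relation between the count and the percentage can be sketched as follows, assuming the percentage is simply the count divided by the number of films/episodes in the corpus (8,388 for SUBTLEX-US, as stated in the DESCRIPTION). The function name is hypothetical; the module itself looks the value up in the datafile rather than computing it.

```perl
use strict;
use warnings;

# Hypothetical helper: contextual diversity as a percentage, to two
# decimal places, from a film/episode count and the corpus total.
sub cd_pct_from_count {
    my ($cd_count, $n_films) = @_;
    return sprintf '%.2f', 100 * $cd_count / $n_films;
}

my $pct = cd_pct_from_count(4194, 8388);    # a string seen in half of the films
print "cd_pct = $pct\n";
```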

cd_log

$cd = $subtlex->cd_log(string => 'aword');

Returns log10(cd_count + 1) for the given string, with 4-digit precision. This corresponds to the measure "Lg10CD" described in Brysbaert and New (2009), where it is stated that "this is the best value to use if one wants to match words on word frequency" (p. 988). Note: "cd" stands for "contextual diversity," which is based on the number of films and TV episodes in which the string was represented.

frq_hash

$href = $subtlex->frq_hash(strings => [qw/word1 word2/], scale => 'opm|log|zipf');

Returns frequency as values within a reference to a hash keyed by the words passed as strings. By default, the values in the hash are corpus frequency per million. If the optional argument scale is defined and equals log, then the values are log-frequency; similarly, zipf yields Zipf frequency. Note, however, that some databases do not support all types of scales, in which case the returned value will be the empty string.
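Because unsupported or missing values come back as the empty string, it can help to filter the hash before doing arithmetic on it. The sketch below works on a stand-in hash reference with invented placeholder numbers (not real SUBTLEX values), so that it runs without the module or a datafile.

```perl
use strict;
use warnings;

# Stand-in for the hash reference returned by frq_hash; the numbers are
# invented placeholders, and '' marks a string with no value.
my $href = { frog => 1.5, fish => 80.25, fuip => '' };

# Keep only the entries that actually carry a value
my %numeric =
  map  { $_ => $href->{$_} }
  grep { defined $href->{$_} && length $href->{$_} }
  keys %{$href};

print "$_ => $numeric{$_}\n" for sort keys %numeric;
```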

pos

$pos_str = $subtlex->pos(string => 'aword');

Returns part-of-speech string for a given word. The return value is undefined if the word is not found.

Descriptive frequency statistics for lists

These methods return a descriptive statistic (mean, median or standard deviation) for a list of strings. Like frq_hash, they take an optional argument scale to specify if the returned values should be raw frequencies per million, log frequencies, or Zipf frequencies.

frq_mean

$mean = $subtlex->frq_mean(strings => [qw/word1 word2/], scale => 'raw|log|zipf');

Returns the arithmetic mean of the frequencies for the given words, or the mean of the log or Zipf frequencies if scale is set to log or zipf.

frq_median

$median = $subtlex->frq_median(strings => [qw/word1 word2/], scale => 'raw|log|zipf');

Returns the median of the frequencies for the given words, or the median of the log or Zipf frequencies if scale is set to log or zipf.

frq_sd

$sd = $subtlex->frq_sd(strings => [qw/word1 word2/], scale => 'raw|log|zipf');

Returns the standard deviation of the frequencies for the given words, or the standard deviation of the log or Zipf frequencies if scale is set to log or zipf.
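For reference, the three statistics can be sketched in plain Perl as below. This is not the module's code (which relies on Statistics::Lite); the function names are hypothetical, the input numbers are arbitrary, and the sample (N-1) form of the standard deviation is an assumption, so check which variant applies before comparing results.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Arithmetic mean; undef for an empty list
sub mean {
    my @v = @_;
    return undef unless @v;
    return sum(@v) / scalar(@v);
}

# Median: middle value, or mean of the two middle values for even counts
sub median {
    my @s = sort { $a <=> $b } @_;
    return undef unless @s;
    my $mid = int(scalar(@s) / 2);
    return scalar(@s) % 2 ? $s[$mid] : ($s[$mid - 1] + $s[$mid]) / 2;
}

# Sample standard deviation (N-1 denominator, assumed here)
sub sample_sd {
    my @v = @_;
    return undef if @v < 2;
    my $m = mean(@v);
    return sqrt(sum(map { ($_ - $m)**2 } @v) / (scalar(@v) - 1));
}

my @freqs = (2, 4, 4, 4, 5, 5, 7, 9);    # arbitrary example values
printf "mean=%.3f median=%.3f sd=%.3f\n",
  mean(@freqs), median(@freqs), sample_sd(@freqs);
```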

Orthographic neighbourhood measures

These methods return stats on the orthographic relatedness of a specified letter-string to words in the SUBTLEX corpus. Unless otherwise stated, the orthographic neighbours of a letter-string are here those letter-strings that are identical to it except for a single-letter substitution, holding string-length constant; their number is the Coltheart N of the letter-string, as defined in Coltheart et al. (1977). These measures are calculated in realtime; they are not listed in the datafile for look-up, so expect some additional delay in getting a returned value.

on_count

$n = $subtlex->on_count(string => $letters);
($n, $orthons_aref) = $subtlex->on_count(string => $letters);

Returns the orthographic neighbourhood count (Coltheart N) within the SUBTLEX corpus. Called in list context, also returns a reference to an array of the neighbours retrieved, if any.
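The neighbour definition used here (same length, exactly one substituted letter) can be sketched as below. The function names and the tiny word list are made up for illustration; the module's own implementation may differ, and it searches the full corpus rather than a short array.

```perl
use strict;
use warnings;

# True if $y differs from $x by exactly one letter substitution,
# with string length held constant.
sub is_coltheart_neighbour {
    my ($x, $y) = @_;
    return 0 if length($x) != length($y);
    my $diffs = 0;
    for my $i (0 .. length($x) - 1) {
        $diffs++ if substr($x, $i, 1) ne substr($y, $i, 1);
        return 0 if $diffs > 1;    # stop early: too many substitutions
    }
    return $diffs == 1 ? 1 : 0;    # identical strings are not neighbours
}

# All neighbours of $target within a word list
sub neighbours {
    my ($target, @lexicon) = @_;
    return grep { is_coltheart_neighbour($target, $_) } @lexicon;
}

my @lexicon = qw/frog from flog frag fog frogs/;    # illustrative word list
my @orthons = neighbours('frog', @lexicon);
print scalar(@orthons), " neighbours: @orthons\n";
```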

on_frq_max

$m = $subtlex->on_frq_max(string => $letters);

Returns the maximum SUBTLEX frequency per million among the orthographic neighbours (per Coltheart N) of a particular letter-string. If (unusually) all the frequencies are the same, then that value is returned. If the string has no (Coltheart-type) neighbours, undef is returned.

on_frq_opm_mean

$m = $subtlex->on_frq_opm_mean(string => $letters);

Returns the mean SUBTLEX frequency per million of the orthographic neighbours (per Coltheart N) of a particular letter-string. If the string has no (Coltheart-type) neighbours, undef is returned.

on_frq_log_mean

$m = $subtlex->on_frq_log_mean(string => $letters);

Returns the mean log of SUBTLEX frequencies of the orthographic neighbours (per Coltheart N) of a particular letter-string. If the string has no (Coltheart-type) neighbours, undef is returned.

on_frq_zipf_mean

$m = $subtlex->on_frq_zipf_mean(string => $letters);

Returns the mean Zipf frequency of the orthographic neighbours (per Coltheart N) of a given letter-string. If the string has no (Coltheart-type) neighbours, undef is returned.

on_ldist

$m = $subtlex->on_ldist(string => $letters, lim => 20);

Alias: ldist

Returns the mean Levenshtein Distance from a letter-string to its lim closest orthographic neighbours. The default limit is 20, as defined in Yarkoni et al. (2008). The module uses the matrix-based calculation of Levenshtein Distance as implemented in the Lingua::Orthon module. No defined value is returned if no Levenshtein Distance is found (whereas zero would connote "identical to everything").
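The measure can be sketched in plain Perl as below. This is an illustration only, not the Lingua::Orthon implementation: it uses a row-by-row dynamic-programming variant of the distance calculation, the function names are hypothetical, and excluding the target string itself from its own neighbours is an assumption.

```perl
use strict;
use warnings;
use List::Util qw(min sum);

# Levenshtein distance, keeping only the previous row of the matrix
sub levenshtein {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @curr = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @curr, min(
                $prev[$j] + 1,           # deletion
                $curr[$j - 1] + 1,       # insertion
                $prev[$j - 1] + $cost    # substitution (or match)
            );
        }
        @prev = @curr;
    }
    return $prev[-1];
}

# Mean distance from $target to its $lim closest strings in @lexicon;
# undef if there is nothing to compare against.
sub mean_ldist {
    my ($target, $lim, @lexicon) = @_;
    my @d = sort { $a <=> $b }
      map  { levenshtein($target, $_) }
      grep { $_ ne $target } @lexicon;
    return undef unless @d;
    splice @d, $lim if @d > $lim;    # keep only the $lim smallest distances
    return sum(@d) / scalar(@d);
}

my $m = mean_ldist('frog', 2, qw/from flog frogs dog/);
print "mean distance to 2 closest = $m\n";
```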

Retrieving letter-strings/words

list_strings

$aref = $subtlex->list_words(freq => [1, 20], onc => [0, 3], length => [4, 4], cv_pattern => 'CVCV', regex => '^f');
$aref = $subtlex->list_words(zipf => [0, 2], onc => [0, 3], length => [4, 4], cv_pattern => 'CVCV', regex => '^f');

Alias: list_words

Returns a list of words from the SUBTLEX corpus that satisfy certain criteria: minimum and/or maximum letter-length (specified by the named argument length), minimum and/or maximum frequency (argument freq) or Zipf frequency (argument zipf), minimum and/or maximum orthographic neighbourhood count (argument onc), a consonant-vowel pattern (argument cv_pattern), or a specific regular expression (argument regex).

For the minimum/maximum constrained criteria, the two limits are given as a referenced array where the first element is the minimum and the second element is the maximum. For example, [3, 7] would specify letter-strings of 3 to 7 letters in length; [4, 4] specifies letter-strings of only 4 letters in length. If only one of these is to be constrained, then the array would be given as, e.g., [3] to specify a minimum of 3 letters without constraining the maximum, or ['',7] for a maximum of 7 letters without constraining the minimum (a limit is applied only if its element has content, as per hascontent in String::Util).
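One way such limits might be interpreted is sketched below. The functions has_content and in_range are hypothetical stand-ins written for this example (has_content imitates the "defined and not just whitespace" test of String::Util's hascontent), so the sketch runs without any non-core modules.

```perl
use strict;
use warnings;

# Stand-in for String::Util's hascontent: defined and contains at least
# one non-whitespace character.
sub has_content {
    my ($v) = @_;
    return ( defined $v && $v =~ /\S/ ) ? 1 : 0;
}

# True if $value lies within the limits [min, max]; a limit that is
# missing or empty is not applied.
sub in_range {
    my ($value, $limits) = @_;
    my ($min, $max) = @{$limits};
    return 0 if has_content($min) && $value < $min;
    return 0 if has_content($max) && $value > $max;
    return 1;
}

for my $limits ([3, 7], [4, 4], [3], ['', 7]) {
    my $label = join ', ', map { has_content($_) ? $_ : 'none' } @{$limits}[0, 1];
    printf "length 8 within [%s]? %s\n", $label, in_range(8, $limits) ? 'yes' : 'no';
}
```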

The consonant-vowel pattern is specified as a string by the usual convention, e.g., 'CCVCC' defines a 5-letter word starting and ending with pairs of consonants, the pairs separated by a vowel. 'Y' is defined here as a consonant.
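A minimal sketch of deriving such a pattern from a word follows, treating a, e, i, o and u as vowels and every other letter (including 'y') as a consonant. The function cv_pattern_of is hypothetical and handles only unaccented ASCII letters, so it would need extending for, e.g., German umlauts.

```perl
use strict;
use warnings;

# Hypothetical helper: map a word to its consonant-vowel pattern.
sub cv_pattern_of {
    my ($word) = @_;
    my $pattern = lc $word;
    $pattern =~ s/[aeiou]/V/g;    # vowels first, marked in upper case
    $pattern =~ s/[a-z]/C/g;      # any remaining lower-case letter is a consonant
    return $pattern;
}

printf "%-5s => %s\n", $_, cv_pattern_of($_) for qw/frog fiji fuse yes/;
```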

A finer selection of particular letters can be made by giving a regular expression as a string to the regex argument. In the example above, only letter-strings starting with the letter 'f', followed by one or more other letters, are specified. Alternatively, for example, '[^aeiouy]$' specifies that the letter-strings must not end with a vowel (here including 'y'). The entire example for '^f', including the shown arguments for cv_pattern, freq, onc and length, would return only two words: fiji and fuse from SUBTLEX-US.

The selection procedure will be made particularly slow wherever onc is specified (as this has to be calculated in real-time) and no arguments are given for cv_pattern and regex (which are tested ahead of any other criteria).

Syllable-counts might be added in future; existing algorithms in the Lingua family are not sufficiently reliable for the purposes to which the present module might often be put; an alternative is being worked on.

The value returned is always a reference to the list of words retrieved (or to an empty list if none was retrieved).

all_strings

$aref = $subtlex->all_strings();

Alias: all_words

Returns a reference to an array of all letter-strings in the corpus, in their given order.

random_string

$string = $subtlex->random_string();
@data = $subtlex->random_string();

Alias: random_word

Picks a random line from the corpus (excluding the top header line), using File::RandomLine. Returns the word in that line if called in scalar context; otherwise, the array of data for that line. (A future version might allow specifying a match to specific criteria, self-aborting after trying X lines.)

Miscellaneous

nlines

$num = $subtlex->nlines();

Returns the number of lines, less the column headings, in the installed language file. Expects/accepts no arguments.

DIAGNOSTICS

Cannot determine field indices

When constructing the class object with new, the module needs to read in the contents of a file named "fields.csv" which should be housed within the SUBTLEX directory where the module itself is located (alongside the downloaded SUBTLEX files). This is necessary because the field indices for the various stats vary from one language file to the next. This should have been done with installation of the module itself. Check that this file is indeed within the Perl/site/lib/Lingua/Norms/SUBTLEX directory. If it is not, download and install the file to that location via the CPAN package of this module.

Value given to argument 'dir' (VALUE) in new() is not a directory

Croaked from new if called with a value for the argument dir, and this value is not actually a directory/folder. This is the directory/folder in which the actual SUBTLEX datafiles should be located.

Cannot find required database for language $lang

Croaked from new if none of the values given to the arguments lang, dir or path are valid, and even the default site/lib directory and US database are not accessible. Check that you do indeed have a file with the given value of lang (DE, NL, UK or US) within the Perl/site/lib/Lingua/Norms/SUBTLEX directory, or at least that the SUBTLEX-US file is located within it, and can be read via your script.

Cannot determine fields for given language

Croaked upon construction if no fields are recognized for the given language. The value given to lang must be one of DE, NL, UK or US.

No string to test; pass a string to the function

Croaked by several methods that expect a value for the named argument string when no such value is given. These methods require the letter-string to be passed to them as a key => value pair, with the key string followed by the value of the string to test.

No string(s) to test; pass one or more letter-strings named 'strings' as a referenced array

Same as above but specifically croaked by frq_hash which accepts more than one string in a single call.

Need to install and have access to module File::RandomLine

Croaked by method random_string if the module it depends on (File::RandomLine) is not installed or accessible. This should have been installed (if not already) upon installation of the present module. See CPAN to download and install this module manually.

DEPENDENCIES

File::RandomLine : needed to work random_string.

File::Slurp : handy for directory reading when calling new.

Lingua::Orthon : needed to calculate Levenshtein Distance, assessing orthographic neighbourhood.

List::AllUtils : handy none function.

Statistics::Lite : needed for the various statistical methods.

String::Util : utilities for determining valid string values.

Text::CSV::Separator : depended upon to determine the delimiter (comma or tab) within the datafiles.

REFERENCES

Brysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A.M., Boelte, J., & Boehl, A. (2011). The word frequency effect: A review of recent developments and implications for the choice of frequency estimates in German. Experimental Psychology, 58, 412-424. doi: 10.1027/1618-3169/a000123

Brysbaert, M., & New, B. (2009). Moving beyond Kucera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41, 977-990. doi: 10.3758/BRM.41.4.977

Brysbaert, M., New, B., & Keuleers, E. (2012). Adding part-of-speech information to the SUBTLEX-US word frequencies. Behavior Research Methods, 44, 991-997. doi: 10.3758/s13428-012-0190-4

Coltheart, M., Davelaar, E., Jonasson, J. T., & Besner, D. (1977). Access to the internal lexicon. In S. Dornic (Ed.), Attention and performance (Vol. 6, pp. 535-555). London, UK: Academic.

Keuleers, E., Brysbaert, M., & New, B. (2010). SUBTLEX-NL: A new frequency measure for Dutch words based on film subtitles. Behavior Research Methods, 42, 643-650. doi: 10.3758/BRM.42.3.643

Van Heuven, W. J. B., Mandera, P., Keuleers, E., & Brysbaert, M. (2014). SUBTLEX-UK: A new and improved word frequency database for British English. Quarterly Journal of Experimental Psychology, 67, 1176-1190. doi: 10.1080/17470218.2013.850521

Yarkoni, T., Balota, D. A., & Yap, M. (2008). Moving beyond Coltheart's N: A new measure of orthographic similarity. Psychonomic Bulletin and Review, 15, 971-979. doi: 10.3758/PBR.15.5.971

AUTHOR

Roderick Garton, <rgarton at cpan.org>

BUGS AND LIMITATIONS

Please report any bugs or feature requests to bug-lingua-norms-subtlfreq-0.05 at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Lingua-Norms-SUBTLEX-0.05. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc Lingua::Norms::SUBTLEX

LICENSE AND COPYRIGHT

Copyright 2014-2015 Roderick Garton.

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.

To the maximum extent permitted by applicable law, the author of this module disclaims all warranties, either express or implied, including but not limited to implied warranties of merchantability and fitness for a particular purpose, with regard to the software and the accompanying documentation.