NAME

Lingua::Norms::SUBTLEX - Retrieve word frequencies and related values and lists from subtitles corpora

VERSION

This is documentation for Version 0.06 of Lingua::Norms::SUBTLEX.

SYNOPSIS

use Lingua::Norms::SUBTLEX 0.06;
my $subtlex = Lingua::Norms::SUBTLEX->new(lang => 'UK');

# Is the string 'frog' in the subtitles corpus?
my $bool = $subtlex->is_normed(string => 'frog');

# Occurrences-per-million:
# - for a single string:
my $frq = $subtlex->frq_opm(string => 'frog'); # freq. per million; also count, log-f, Zipf

# - for a list of strings: 
my $href = $subtlex->frq_hash(strings => [qw/frog fish ape/]); # freqs. for a list of words
print "'$_' opm\t$href->{$_}\n" for keys %{$href};

# stats:
printf "mean opm\t%f\n", $subtlex->frq_mean(strings => [qw/frog fish ape/]); # or median, std-dev.

# parts-of-speech:
printf "'frog' part-of-speech = %s\n", $subtlex->pos_dom(string => 'frog');

# retrieve (list of) words to certain specs, e.g., min/max range:
my $aref = $subtlex->select_words(frq_opm => [2, 400], length => [4, 4], cv_pattern => 'CCVC', regex => '^f');
printf "Number of 4-letter CCVC strings with 2-400 opm starting with 'f' = %d\n", scalar @{$aref};

printf "A randomly selected subtitles string is '%s'\n", $subtlex->random_string();

DESCRIPTION

This module facilitates access to corpus frequency and other lexical attributes of character strings (generally, words), as provided in the various SUBTLEX and related projects (see REFERENCES) on the basis of the representation of these strings in film and television subtitles (see www.opensubtitles.org). Word frequencies obtained in this way have been shown to be generally more predictive of performance in word recognition tasks than frequencies derived from books, newsgroup posts, and similar sources (but see Herdagdelen & Marelli, 2017).

There are three main groups of measures that are potentially retrievable from the SUBTLEX datatables: (1) frequency; (2) contextual diversity (number of films/TV episodes in which the string appeared); and (3) parts-of-speech. The module tries to offer uniformly, across the available files, frequency as a count (frq_count), occurrences per million (frq_opm), logarithm of the opm or frequency count (frq_log), and/or the 7-point scaled Zipf frequency (frq_zipf). "Contextual diversity" is given as a count (cd_count), a percentage (cd_pct), and/or a logarithm (cd_log). For parts-of-speech, the module returns the dominant part-of-speech of the word (via pos_dom), as well as all defined parts-of-speech for a word (via pos_all).

However, not all these methods are available across all projects; e.g., SUBTLEX-NL does not define Zipf frequency, and SUBTLEX-DE does not define CD, POS or Zipf frequency. In these cases, the method in question will return an empty string.

CORPORA SPECS and SOURCES

The SUBTLEX files need to be downloaded via the URLs shown in the table below (only a small sample from each of the SUBTLEX corpora is included in the installation distribution for testing purposes). So, for example, for the American norms, install the file named "SUBTLEX-US frequency list with PoS and Zipf information.csv" via ugent.be/pp/experimentele-psychologie/.

The local directory location or actual pathname of these files can be given in class construction (by the arguments dir and path, respectively); or it will be sought from the default location--within the directory "SUBTLEX" alongside the module itself in the locally configured Perl sitelib--given the lang argument to new(), or to set_lang(). Files downloaded from the following sites should be found in this way under their original filenames, but each filename should uniquely include the "key" shown in the table. The module will attempt to identify the correct field separator for the file (which can be comma-separated or tab-delimited). Only the files specified in the table are likely to be reliably accessed at this time.

  

Language            Key     URL                          File
Dutch               NL_all  crr.ugent.be                 SUBTLEX-NL.with-pos.txt
                    NL_min  crr.ugent.be                 SUBTLEX-NL.cd-above2.with-pos.txt
English (American)  US      expsy.ugent.be/subtlexus     SUBTLEX-US frequency list with PoS and Zipf information.csv
English (British)   UK      psychology.nottingham.ac.uk  SUBTLEX-UK.txt
French              FR      lexique.org                  Lexique381.txt
German              DE      crr.ugent.be                 SUBTLEX-DE_cleaned_with_Google00.txt
Portuguese          PT      p-pal.di.uminho.pt           SUBTLEX-PT_Soares_et_al._QJEP.csv

Notes regarding these different corpora.

  • SUBTLEX-DE

    The file has separate entries for words starting with an uppercase and a lowercase letter (e.g., for when a letter-string is both a noun and an adjective).

  • Lexique (SUBTLEX-FR)

    If not giving the full path to this file, it should be renamed to include "FR" (e.g., "FR_Lexique.csv") and stored in the default directory. The file also includes frequencies from books.

  • SUBTLEX-PT

    The Portuguese subtitles data are available as an Excel file. This file needs to be saved as a (csv) text file to be usable here.

  • SUBTLEX-UK

    Words that might be spelled with a dash are included both with and without it; so there are separate entries for x-ray and xray, and for no-one and noone. The file includes some strings with apostrophes (e.g., howe'er, k'nex), but common contractions like he's, isn't and ain't do not appear as such; they are stripped of their apostrophes and listed as, e.g., hes, isnt and aint. All strings are in lower-case; so Africa is represented as africa.

  • SUBTLEX-US

    There are no strings with capitalized onsets in this file, or with punctuation marks, including apostrophes and dashes (e.g., Aaron and Freudian are represented as aaron and freudian; you've as youve, and x-ray as xray).

    The earlier, original file "SUBTLEXusExcel2007.csv" presents strings as they were originally capitalised: there is, e.g., Aaron and Hawkeye--but neither aaron nor hawkeye. This file does not provide part-of-speech or Zipf frequencies.

There are several other languages from this project which might be supported by this module in a later version (originally, only SUBTLEX-US was supported).

See the new() method as to how this module handles case-sensitivity and diacritical marks. For files where strings are UTF-8 encoded, the strings being looked up should also be UTF-8 encoded if they are diacritically marked, e.g., "embâcle" (see Encode).

If using Microsoft Excel to save any of these files, even in CSV format, Excel will turn the words "true" and "false" into the Boolean strings "TRUE" and "FALSE", and will set them aside from alphabetic sorting (placing them right at the bottom of an alphabetic sort). That will surely stuff up any neatly intended pattern-matching for these words.

SUBROUTINES/METHODS

All methods are called via the class object, and with named (hash of) arguments, usually string, where relevant.

new

$subtlex = Lingua::Norms::SUBTLEX->new(lang => 'DE'); # - looking in Perl sitelib
$subtlex = Lingua::Norms::SUBTLEX->new(lang => 'DE', dir => 'file_directory'); # folder in which file is located
$subtlex = Lingua::Norms::SUBTLEX->new(lang => 'DE', path => 'file/is/here.csv'); # complete path to file for given language

Returns a class object for accessing other methods. The argument lang is required, specifying the particular language datafile by a "key" as given in the above table. Optional arguments dir or path can be given to specify the location or filepath of the database. The default location is the "Lingua/Norms/SUBTLEX" directory within the 'sitelib' configured for the local Perl installation (as per Config.pm). The method will croak if the file cannot be found.

The optional argument match_level specifies how string comparison, as when looking up a given word in the SUBTLEX corpus, should be conducted, with the function used to test string equality being derived from the eq function in Unicode::Collate (part of the standard Perl distribution). This matching level applies to the look-up of strings within all methods, including those specifically assessing orthographic equality. This argument can take one of three values; see set_eq.

Frequencies and POS for individual words or word-lists

is_normed

$bool = $subtlex->is_normed(string => $word);

Alias: isa_word

Returns 1 or 0 as to whether or not the letter-string passed as string is represented in the subtitles file. For some files, this might be thought of as a lexical decision ("does this string spell a word?"); but others include misspelled words (e.g., "pyscho"), digit strings, abbreviations ...

frq_count

$int = $subtlex->frq_count(string => 'aword');

Returns the raw number of occurrences in all the films/TV episodes for the word passed as string, or 0 if the string is not found in the language file.

frq_opm

$val = $subtlex->frq_opm(string => 'aword');

Alias: opm

Returns frequency per million for the word passed as string, or 0 if the string is not found in the language file.

frq_log

$val = $subtlex->frq_log(string => 'aword');

Returns log frequency per million for the word passed as string, or the empty-string if the string is not represented in the norms.

frq_zipf

$val = $subtlex->frq_zipf(string => 'aword');

Returns Zipf frequency for the word passed as string, or the empty-string if the string is not represented in the language file. The Zipf scale ranges from about 1 to 7, with values of 1-3 generally representing low frequency words, and values of 4-7+ generally representing high frequency words, with respect to various recognition measures used in the study of word frequency effects. See Van Heuven et al. (2014) and crr.ugent.be/archives for more information.

frq_zipf_calc

$calc = $subtlex->frq_zipf_calc( string => 'favourite' );
$calc = $subtlex->frq_zipf_calc( string => 'favourite', corpus_size => POS_FLOAT_in_millions, n_wordtypes => POS_INT );

Returns an estimate of Zipf frequency by calculating its value from the given or retrievable frq_count or frq_opm, and the given or retrievable values of the corpus_size and n_wordtypes for the particular SUBTLEX project; i.e., the values of corpus_size and n_wordtypes can be provided as named arguments. As introduced by Van Heuven et al. (2014) (see also crr.ugent.be/archives):

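  Zipf = log10[ ( frq_count + 1 ) / ( corpus_size + n_wordtypes/1000000 ) ] + 3

where corpus_size is given in millions and n_wordtypes is the raw number of word types (so that both terms of the denominator are in millions).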

How well the returned value satisfies the "border relations" desired of the index (e.g., that up to 1 opm corresponds to Zipf of < 3) depends on the reliability of the corpus size and wordtype counts, and any rounding of these values (where relevant) and (if required) of the opm. Examinations of the returned values show that, when using the canned and reported values (which is the default here), they align with these definitions, and with any canned Zipf values, within the margins of about the third or fourth decimal place.
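The calculation can be sketched in plain Perl as follows. This is only an illustration of the formula as described by Van Heuven et al. (2014), not the module's own code: the function name zipf_estimate is hypothetical, and the corpus size and word-type count used in the example are made-up round figures, not values from any SUBTLEX report.

```perl
use strict;
use warnings;

# Estimate Zipf frequency from a raw frequency count.
# $corpus_size is in millions of tokens; $n_wordtypes is a raw count of types.
# (Hypothetical helper; not part of Lingua::Norms::SUBTLEX.)
sub zipf_estimate {
    my ( $frq_count, $corpus_size, $n_wordtypes ) = @_;

    # Bring both terms of the denominator into millions:
    my $denominator = $corpus_size + $n_wordtypes / 1_000_000;

    # Perl's log() is the natural logarithm; divide by log(10) for log10:
    return log( ( $frq_count + 1 ) / $denominator ) / log(10) + 3;
}

# Illustrative figures only: a 99-million-token corpus with 1,000,000 types.
printf "Zipf for a count of 99:  %.2f\n", zipf_estimate( 99,  99, 1_000_000 );
printf "Zipf for a count of 999: %.2f\n", zipf_estimate( 999, 99, 1_000_000 );
```

With these figures the denominator is exactly 100 (million), so a count of 99 gives a Zipf value of 3.00 and a count of 999 gives 4.00; each tenfold increase in frequency adds one Zipf unit.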

frq_opm2count

$int = $subtlex->frq_opm2count(string => STRING);

Returns the raw number of occurrences of a string (the frq_count) based on the number of occurrences per million (frq_opm), and the corpus size in millions. Returns 0 if the string is not found in the language file.

The frq_opm can be given as a named argument, or it will be retrieved by the respective frq_opm method, where this is defined for a particular language file. The corpus_size (in millions) can also be given as a named argument, or it will be retrieved from the specifications file (specs.csv in the module's directory), where this value has been obtainable from published reports.
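As a rough sketch of the conversion, assuming that the count is simply the opm multiplied by the corpus size in millions and rounded to the nearest integer (the rounding, and the function name opm_to_count, are assumptions made here for illustration, not taken from the module):

```perl
use strict;
use warnings;

# Convert occurrences-per-million to an estimated raw count.
# $corpus_size is in millions of tokens.
# (Hypothetical helper; rounding to the nearest integer is an assumption.)
sub opm_to_count {
    my ( $frq_opm, $corpus_size ) = @_;
    return int( $frq_opm * $corpus_size + 0.5 );
}

# Illustrative figures only: 2 opm in a 50-million-token corpus.
print opm_to_count( 2, 50 ), "\n";
```

Because published opm values are themselves rounded, a count recovered in this way may differ slightly from the count originally recorded for the string.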

cd_count

$cd = $subtlex->cd_count(string => STRING);

Returns the number of samples (films/TV episodes) in the corpus whose subtitles included the string; so-called "contextual diversity". Returns 0 if the string is not found in the language file.

cd_pct

$cd = $subtlex->cd_pct(string => 'aword');

Returns the percentage of samples (films/TV episodes) in the corpus whose subtitles included the string; so-called "contextual diversity". Returns 0 if the string is not found in the language file.

cd_log

$cd = $subtlex->cd_log(string => 'aword');

Returns log10(cd_pct + 1) for the given string, with 4-digit precision. Note: Brysbaert and New (2009) state that "this is the best value to use if one wants to match words on word frequency" (p. 988).
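The transformation described here amounts to the following, sketched with a hypothetical helper (cd_log_of is not a method of this module, and the input values are illustrative only):

```perl
use strict;
use warnings;

# log10(cd_pct + 1), formatted to 4 decimal places.
# (Hypothetical helper; not part of Lingua::Norms::SUBTLEX.)
sub cd_log_of {
    my ($cd_pct) = @_;
    return sprintf '%.4f', log( $cd_pct + 1 ) / log(10);
}

print cd_log_of(0), "\n";    # a string appearing in no samples
print cd_log_of(9), "\n";    # a string appearing in 9% of samples
```

Adding 1 before taking the logarithm keeps the value defined (as zero) for a percentage of zero.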

pos_dom

$pos_str = $subtlex->pos_dom(string => STRING, conform => BOOL);

Returns the dominant part-of-speech for the given string. The return value is undefined if the string is not found. If the field in the original file is actually for all possible parts-of-speech (as in SUBTLEX-PT), the first element in that field, once split by non-word characters, is returned (assuming, as in SUBTLEX-PT, that this is indeed the most frequent part-of-speech for the particular string).
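Taking the first element of a multi-valued POS field can be sketched as below. The helper name first_pos and the field value 'N.V.ADJ' are hypothetical; the actual separators and codes in a given SUBTLEX file may differ.

```perl
use strict;
use warnings;

# Return the first part-of-speech code in a field that lists several,
# splitting on any run of non-word characters.
# (Hypothetical helper; not part of Lingua::Norms::SUBTLEX.)
sub first_pos {
    my ($field) = @_;
    return if !defined $field;

    # grep drops the empty leading element produced when the field
    # begins with a separator:
    my ($first) = grep { length } split /\W+/, $field;
    return $first;
}

print first_pos('N.V.ADJ'), "\n";    # made-up field value
```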

For interpretation of the POS codes: for NL, see crr.ugent.be/archives/362 ("SPEC" is there defined as "often personal or geographical names" and so similar to "Name" in SUBTLEX-UK).

To transliterate the various codes into a common two-letter code, set conform => 1 (the default is not defined, returning the POS string as given in the original files). The two-letter codes are:

NN noun (common)
NM name (proper)
PN pronoun
VB verb
AJ adjective
AV adverb
PP preposition
CJ conjunction
IJ interjection
DA determiner or article
NB number
OT other
UK unknown

The "OT" code includes some rare POS values (e.g., "marker", "ONO"), anomalous values (e.g., "2"), and values not defined in the associated reports. The "UK" code ("unknown") is comprised of values specifically recorded as "unclassified" or similar, or where the POS field is empty.

pos_all

$pos_aref = $subtlex->pos_all(string => STRING, conform => BOOL);

Returns all parts-of-speech for the given string as a referenced array. The return value is an empty list if the string is not found. If the language file does not define this field, the returned value is simply the same as what would, if possible, be returned from pos_dom (i.e., if that value is defined), but now as a referenced array.

Multiple strings/values lists

The array given as measures to the following methods might include one or more of the following:

frq_count
frq_opm
frq_log
frq_zipf
cd_count
cd_pct
cd_log
pos_dom
pos_all

values_list

$aref = $subtlex->values_list(string => STRING, values => AREF);

Returns values for a single letter-string as a referenced array.

multi_list

$hashref = $subtlex->multi_list(strings => AREF_of_char_strings, measures => AREF_of_FIELD_NAMES);

$frq_hashref = $subtlex->multi_list(strings => [qw/ICH PEA CHOWDER ZEER AIME/], measures => [qw/frq_opm frq_zipf/]);
   # $frq_hashref = { 
   #        ICH => {
   #            frq_opm => 20000,
   #            frq_zipf => 7.01,
   #        },
   #        PEA => {
   #            frq_opm ...
   #        },
   #        ...
   #    }

Returns multiple values for a list of strings as a hashref of hashrefs. This is perhaps the most efficient method here for retrieving several values for several words, but only for a small number of words; it could take a long time to return given large lists.

So, given one or more words in the array ref strings, and several measures/values to find for each of them (such as 'frq_opm', 'pos_dom' or any other values defined for the particular language file) in the array measures, the method looks line-by-line through the file to check if the line's string is equal to any of those in strings. If so, it collates the relevant measures in a hash keyed by the string, whose values are themselves a hash of the measure-names keying each respective measure-value. The found string is then removed from the look-up list, and the next line is looked-up in the same way. The search stops as soon as there are no more strings in the look-up list (all have been found).

In this way, there is only one pass through the file for the entire search; no line is looked-up more than once for all strings or their respective measure values. The method could be used for looking up a single string and/or a single value, but the other methods for doing this avoid the overhead of checking an array of strings, and splitting the line against the delimiter; this is only done here to facilitate caching multiple values whereas other methods avoid doing this as they only need to find one value after a known number of delimiters.
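The single-pass strategy described above can be sketched in plain Perl as follows. This is a simplified illustration, not the module's implementation: the function name multi_lookup is hypothetical, the data are a made-up tab-delimited sample (with invented values) read from an in-memory filehandle, and strings are compared by exact hash-key match rather than by the configurable matching level.

```perl
use strict;
use warnings;

# Made-up, tab-delimited sample standing in for a SUBTLEX datafile.
my $data = join "\n",
    "Word\tfrq_opm\tfrq_zipf",
    "ape\t5.0\t3.7",
    "fish\t80.0\t4.9",
    "frog\t2.5\t3.4",
    "zebra\t1.0\t3.0",
    q{};

# Look up several measures for several strings in one pass through the file.
# (Hypothetical helper; not part of Lingua::Norms::SUBTLEX.)
sub multi_lookup {
    my ( $fh, $strings, $measures ) = @_;

    # Map each column name in the header line to its index:
    my $header = <$fh>;
    chomp $header;
    my @fields = split /\t/, $header;
    my %col;
    @col{@fields} = ( 0 .. $#fields );

    my %wanted = map { $_ => 1 } @{$strings};
    my %found;
    while ( my $line = <$fh> ) {
        chomp $line;
        my @vals = split /\t/, $line;
        next if !@vals;

        # delete() returns true only if the string was still being sought,
        # and removes it from the look-up list at the same time:
        next if !delete $wanted{ $vals[0] };
        $found{ $vals[0] } = { map { $_ => $vals[ $col{$_} ] } @{$measures} };

        last if !%wanted;    # stop once every string has been found
    }
    return \%found;
}

open my $fh, '<', \$data or die "Cannot open in-memory data: $!";
my $res = multi_lookup( $fh, [qw/frog fish/], [qw/frq_opm frq_zipf/] );
close $fh;

for my $str ( sort keys %{$res} ) {
    printf "%s\topm=%s\tzipf=%s\n", $str, @{ $res->{$str} }{qw/frq_opm frq_zipf/};
}
```

In this sample the loop exits after the "frog" line, so the "zebra" line is never read; the saving is larger the earlier the sought strings occur in the file.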

Descriptive frequency statistics for lists

These methods return a descriptive statistic (sum, mean, median or standard deviation) for a list of strings. Like frq_hash, they take an optional argument scale to specify if the returned values should be occurrences per million, log frequencies, or Zipf values. Providing this as an argument obviates the need to provide multiple methods for each different type of frequency measure, e.g., "mean_opm()", "mean_log_opm()", ...

Because not all types of frequency scales (count, opm, log, Zipf) are provided in all SUBTLEX corpora, these methods will croak if there are no canned stats for the particular scale called for.

It might be thought useful to allow any valid scale to be returned by, say, providing each method without a value for scale; a hash-ref of frequency values, keyed by scale-type, might be returned. However, this seems to be unrecommended; it assumes that users are blind as to what measures they want (as well as to what they can get).

frq_sum

$sum = $subtlex->frq_sum(strings => [qw/word1 word2/], scale => 'count|opm|log|zipf');

Returns the sum of the count, opm, log (usually opm) or Zipf frequency, depending on the value of scale.

frq_mean

$mean = $subtlex->frq_mean(strings => [qw/word1 word2/], scale => 'count|opm|log|zipf');

Returns the arithmetic average of the count, opm, log (usually opm) or Zipf frequency, depending on the value of scale.

frq_median

$median = $subtlex->frq_median(strings => [qw/word1 word2/], scale => 'count|opm|log|zipf');

Returns the median count, opm, log (usually opm) or Zipf frequency for the given strings, depending on the value of scale.

frq_sd

$sd = $subtlex->frq_sd(strings => [qw/word1 word2/], scale => 'count|opm|log|zipf');

Returns the standard deviation of the count, opm, log (usually opm) or Zipf frequency, depending on the value of scale.
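For orientation, the four statistics can be computed over a list of already-retrieved frequency values as sketched below. The helper names and the input values are hypothetical, and the use of the sample standard deviation (dividing by n - 1) is an assumption; the module itself delegates to Statistics::Lite, and its choice of estimator is not stated here.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Arithmetic mean of a list of numbers.
sub mean {
    my @v = @_;
    return sum(@v) / @v;
}

# Median: middle value, or the mean of the two middle values.
sub median {
    my @s   = sort { $a <=> $b } @_;
    my $mid = int( @s / 2 );
    return @s % 2 ? $s[$mid] : ( $s[ $mid - 1 ] + $s[$mid] ) / 2;
}

# Sample standard deviation (n - 1 in the denominator; an assumption).
sub sample_sd {
    my @v = @_;
    my $m = mean(@v);
    return sqrt( sum( map { ( $_ - $m )**2 } @v ) / ( @v - 1 ) );
}

my @opm = ( 1, 2, 3, 4 );    # made-up opm values for four strings
printf "sum=%s mean=%s median=%s sd=%.4f\n",
    sum(@opm), mean(@opm), median(@opm), sample_sd(@opm);
```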

Retrieving letter-strings/words

select_strings

$aref = $subtlex->select_strings(frq_opm => [1, 20], length => [4, 4], cv_pattern => 'CVCV', regex => '^f');
$aref = $subtlex->select_strings(frq_zipf => [0, 2], length => [4, 4], cv_pattern => 'CVCV', regex => '^f');

Alias: select_words

Returns a list of strings (presumably words) from the SUBTLEX corpus that satisfies certain criteria, as per the following arguments:

length

minimum and/or maximum length of the string (or "letter-length")

frq_opm, frq_log, cd_count, etc.

minimum and/or maximum frequency (as given in whatever unit offered by the datafile for the set language)

cv_pattern

a consonant-vowel pattern, given as a string by the usual convention, e.g., 'CCVCC' defines a 5-letter word starting and ending with pairs of consonants, the pairs separated by a vowel. 'Y' is defined here as a consonant. The tested strings are stripped of marks and otherwise ASCII transliterated (using Text::Unidecode) ahead of the check.

regex

a regular expression (perlretut). In the examples above, only letter-strings starting with the letter 'f', followed by one or more other letters, are specified for retrieval. Alternatively, for example, the regex value '[^aeiouy]$' specifies that the letter-strings to be returned must not end with a vowel (or 'y'). The tested strings are stripped of marks and otherwise ASCII transliterated (using Text::Unidecode) ahead of matching, so if the string in the file has, say, a 'u' with an Umlaut, it will match a 'u' in the regex.

For the minimum/maximum constrained criteria, the two limits are given as a referenced array where the first element is the minimum and the second element is the maximum. For example, [3, 7] would specify letter-strings of 3 to 7 letters in length; [4, 4] specifies letter-strings of only 4 letters in length. If only one of these is to be constrained, then the array would be given as, e.g., [3] to specify a minimum of 3 letters without constraining the maximum, or ['',7] for a maximum of 7 letters without constraining the minimum (checking if the element hascontent as per String::Util).

The value returned is always a reference to the list of words retrieved (or to an empty list if none was retrieved).
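How these criteria combine can be sketched in plain Perl as below. This is an illustration only, not the module's code: the helper names in_range and cv_pattern_of are hypothetical, the candidate words are a made-up list rather than corpus entries, and the sketch assumes ASCII-only input (the module first transliterates with Text::Unidecode).

```perl
use strict;
use warnings;

# Test a value against a [min, max] pair in which either limit may be
# omitted (undef or the empty string), as described above.
sub in_range {
    my ( $value, $range ) = @_;
    my ( $min, $max ) = @{$range};
    return 0 if defined $min && length $min && $value < $min;
    return 0 if defined $max && length $max && $value > $max;
    return 1;
}

# Reduce a string to its consonant-vowel pattern; 'y' counts as a consonant.
sub cv_pattern_of {
    my ($string) = @_;
    my $pattern = lc $string;
    $pattern =~ tr/a-z//cd;     # drop anything that is not an ASCII letter
    $pattern =~ tr/aeiou/V/;    # vowels become 'V'
    $pattern =~ tr/a-z/C/;      # every remaining letter becomes 'C'
    return $pattern;
}

my @candidates = qw/frog fish flag ape fern/;    # made-up list
my @selected   = grep {
           in_range( length($_), [ 4, 4 ] )
        && cv_pattern_of($_) eq 'CCVC'
        && /^f/
} @candidates;

print "@selected\n";
```

Of the five candidates, only "frog" and "flag" are four letters long, have the pattern CCVC and start with 'f'; "fish" and "fern" reduce to CVCC, and "ape" fails the length test.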

Calling this method as "list_strings" or "list_words" is deprecated, to avoid confusion with all_strings, which also returns a list of strings. As of version 0.06, these names issue a deprecation warning and wrap to this method; they will be removed in a subsequent version.

all_strings

$aref = $subtlex->all_strings();

Alias: all_words

Returns a reference to an array of all letter-strings in the corpus. These are culled of empty and duplicate strings, and then alphabetically sorted.

random_string

$string = $subtlex->random_string();
@data = $subtlex->random_string();

Alias: random_word

Picks a random line from the corpus (excluding the top header line), using File::RandomLine. Returns the word in that line if called in scalar context; otherwise, the array of data for that line. (A future version might allow specifying a match to specific criteria, self-aborting after trying X lines.)

Miscellaneous

n_lines

$num = $subtlex->n_lines();

Returns the number of lines, less the column headings and any lines with no content, in the installed language file. Expects/accepts no arguments.

pct_alpha

Returns the percentage of strings in the subtitles file that "look like words", relative to the number of lines (as per n_lines). Specifically, after ASCII transliteration of the string (per Text::Unidecode), does it match /[\p{XPosixAlpha}\-']/ (per perluniprops, but including apostrophes and dashes)?

set_lang

$lang = $subtlex->set_lang(lang => STR); # DE, FR, NL_all, NL_min, PT, UK or US
$lang = $subtlex->set_lang(lang => STR, path => 'this/is/the/file.csv');
$lang = $subtlex->set_lang(lang => STR, dir => 'file/is/in/here');

Set or guess location of datafile; see new. Naturally, the given value of lang (required)--which is used as a database ID--should correspond with any given path to the SUBTLEX datafile (optional but recommended). If only a dir value is given, the SUBTLEX datafile should be named so that it uniquely includes the specific value of lang.

get_lang

$str = $subtlex->get_lang();

Returns the language code (e.g., 'UK', 'FR') currently set for the module (which determines the file being looked up, if not explicitly given). The empty string is returned if the language has not been set.

get_path2db

$path = $subtlex->get_path2db();

Returns the path (directory and filename) from which the module's methods are currently set to look-up strings, frequencies, etc.

get_index

$int = $subtlex->get_index(measure => 'frq_opm');

Returns the index within the currently looked-up file that contains the given measure.

set_eq

$subtlex->set_eq(match_level => INT); # undef, 0, 1, 2 or 3

See Lingua::Orthon.

url2datafile

$url = $subtlex->url2datafile(lang => STRING);
%loc = $subtlex->url2datafile(lang => STRING);

Returns the URL (complete path) where the SUBTLEX file for a given language is stored, and from which it should be downloadable. These are locations as specified (at the time of releasing this version of the module) at expsy.ugent.be/subtlexus/ and/or crr.ugent.be, and so as listed in the CORPORA SPECS and SOURCES section. This could include an archive from within which the file needs to be retrieved. Called in list context, this method returns a hash with keys for 'www_dir', 'archive' (if the file is within an archive) and 'filename'. (This module does not fetch the file off the WWW itself; it should be installed and available on the local machine/network--see new.)

DIAGNOSTICS

  • Need a valid <lang> attribute

    When constructing the class object with new, the lang argument must have a valid value, as indicated in the table above. Also, the module needs to read in the contents of a file named "specs.csv" which should be located within the SUBTLEX directory where the module itself is located (alongside the downloaded SUBTLEX files). This file specifies the field indices for the various stats within each SUBTLEX datafile. Check that this file is indeed within the Perl/site/lib/Lingua/Norms/SUBTLEX directory. If it is not, download and install the file to that location via the CPAN package of this module.

  • Value given to argument 'dir' (VALUE) in new() is not a directory

    Croaked from new if called with a value for the argument dir, and this value is not actually a directory/folder. This is the directory/folder in which the actual SUBTLEX datafiles should be located.

  • Cannot find required database for language ...

    Croaked from new if none of the given values to arguments lang, dir or path are valid, and even the default site/lib directory and US database are not accessible. Check that you do indeed have a file with the given value of lang (DE, NL, UK or US) within the Perl/site/lib/Lingua/Norms/SUBTLEX directory, or at least that the SUBTLEX-US file is located within it, and can be read via your script.

  • Cannot determine fields for given language

    Croaked upon construction if no fields are recognized for the given language. The value given to lang must be one of DE, NL, UK or US.

  • The requested value is not defined for the ... SUBTLEX corpus

    Croaked when calling for a value for a statistic that is not defined for a given language, e.g., when requesting a value for the Zipf frequency in the NL corpus.

  • No string to test; pass a value for <string> to FUNCTION()

    Croaked by several methods that expect a value for the named argument string, and when no such value is given. These methods require the letter-string to be passed to it as a key => value pair, with the key string followed by the value of the string to test.

  • No string(s) to test; pass one or more letter-strings named 'strings' as a referenced array

    Same as above but specifically croaked by frq_hash which accepts more than one string in a single call.

  • Need to install and have access to module File::RandomLine

    Croaked by method random_string if the module it depends on (File::RandomLine) is not installed or accessible. This should have been installed (if not already) upon installation of the present module. See CPAN to download and install this module manually.

DEPENDENCIES

File::RandomLine : for random_string

Lingua::Orthon : for set_eq method

List::AllUtils : all, any, none, uniq and other functions

Number::Misc : is_numeric

Path::Tiny : for directory reading when calling new

Statistics::Lite : for various statistical methods

String::Trim : trim

String::Util : for determining valid string values

Text::CSV::Hashify : reads in the specs file

Text::CSV::Separator : for determining the field delimiter within the datafiles

Text::Unidecode : for plain ASCII transliterations of Unicode text

REFERENCES

Brysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A.M., Boelte, J., & Boehl, A. (2011). The word frequency effect: A review of recent developments and implications for the choice of frequency estimates in German. Experimental Psychology, 58, 412-424. doi: 10.1027/1618-3169/a000123

Brysbaert, M., & New, B. (2009). Moving beyond Kucera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41, 977-990. doi: 10.3758/BRM.41.4.977

Brysbaert, M., New, B., & Keuleers, E. (2012). Adding part-of-speech information to the SUBTLEX-US word frequencies. Behavior Research Methods, 44, 991-997. doi: 10.3758/s13428-012-0190-4

Herdagdelen, A., & Marelli, M. (2017). Social media and language processing: How Facebook and Twitter provide the best frequency estimates for studying word recognition. Cognitive Science, 41, 976-995. doi: 10.1111/cogs.12392

Keuleers, E., Brysbaert, M., & New, B. (2010). SUBTLEX-NL: A new frequency measure for Dutch words based on film subtitles. Behavior Research Methods, 42, 643-650. doi: 10.3758/BRM.42.3.643

New, B., Brysbaert, M., Veronis, J., & Pallier, C. (2007). The use of film subtitles to estimate word frequencies. Applied Psycholinguistics, 28, 661-677.

Soares, A. P., Machado, J., Costa, A., Comesaña, M., & Perea, M. (in press). On the advantages of frequency measures extracted from subtitles: The case of Portuguese. Quarterly Journal of Experimental Psychology.

Van Heuven, W. J. B., Mandera, P., Keuleers, E., & Brysbaert, M. (2014). SUBTLEX-UK: A new and improved word frequency database for British English. Quarterly Journal of Experimental Psychology, 67, 1176-1190. doi: 10.1080/17470218.2013.850521

AUTHOR

Roderick Garton, <rgarton at cpan.org>

BUGS AND LIMITATIONS

Please report any bugs or feature requests to bug-lingua-norms-subtlfreq-0.06 at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Lingua-Norms-SUBTLEX-0.06. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc Lingua::Norms::SUBTLEX

LICENSE AND COPYRIGHT

Copyright 2014-2018 Roderick Garton.

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.

To the maximum extent permitted by applicable law, the author of this module disclaims all warranties, either express or implied, including but not limited to implied warranties of merchantability and fitness for a particular purpose, with regard to the software and the accompanying documentation.