NAME
bin/freq.pl - term frequency counter for IAFA style templates
SYNOPSIS
freq.pl [-ad] [-f maxhits] [-m min-count] [-s sourcedir]
[-t tmpdir] [-A attrib1|attrib2|...|attribN]
DESCRIPTION
This Perl program looks at all the IAFA style templates in a given directory and counts how many times each term occurs in them. This has a number of uses, notably in determining an appropriate stoplist of words which should not be indexed, and in helping the user to devise an effective query.
Frequently appearing terms such as "a" and "the" are likely to cause large numbers of spurious hits when people search your database. To reduce the likelihood of this, we have added a "stoplist" feature to the ROADS search back end. This lets you arrange for certain search terms to be removed automatically, and we ship a sample stoplist with the ROADS distribution.
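The effect of a stoplist on a query can be illustrated with a short Python sketch. This is not the ROADS back end's code: the function name remove_stop_terms, the whitespace splitting and the case-insensitive comparison are all assumptions made for the example.

```python
def remove_stop_terms(query, stoplist):
    """Return the query terms that are not on the stoplist.

    Assumption: terms are separated by whitespace and compared
    case-insensitively; the real back end may behave differently.
    """
    stop = {word.lower() for word in stoplist}
    return [term for term in query.split() if term.lower() not in stop]


# Hypothetical stoplist and query, for illustration only.
print(remove_stop_terms("the history of a university", ["a", "the", "of"]))
```

Only the terms likely to discriminate between templates are left in the query, which is what cuts down on spurious hits.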
The default behaviour is to sort the frequency count into order, and return the top fifty terms. This can be overridden by a set of command-line options.
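The counting described above might be sketched in Python roughly as follows. This is an illustration, not a transcription of freq.pl: it assumes templates consist of "Attribute: value" lines with indented continuation lines, that a term is any run of letters or digits, and that case is preserved. The names count_terms and top_terms are hypothetical.

```python
import re
from collections import Counter


def count_terms(template_texts, attributes=None):
    """Count terms in the attribute values of IAFA style templates.

    attributes, if given, restricts counting to those attribute names
    (compare the -A option). Assumed template layout: 'Name: value'
    lines, with indented lines continuing the previous value.
    """
    wanted = {a.lower() for a in attributes} if attributes else None
    counts = Counter()
    for text in template_texts:
        current = None  # attribute the current line belongs to
        for line in text.splitlines():
            if line[:1] in (" ", "\t"):
                value = line  # continuation of the previous value
            elif ":" in line:
                current, value = line.split(":", 1)
                current = current.strip().lower()
            else:
                continue  # blank or unrecognised line
            if current is None:
                continue
            if wanted is not None and current not in wanted:
                continue
            counts.update(re.findall(r"[A-Za-z0-9]+", value))
    return counts


def top_terms(counts, maxhits=50, min_count=1):
    """Return 'count term' lines, most frequent first (compare -f and -m)."""
    return [f"{n} {term}"
            for term, n in counts.most_common(maxhits)
            if n >= min_count]


# Two made-up templates, for illustration only.
templates = [
    "Template-Type: DOCUMENT\n"
    "Title: Research mailing list\n"
    "Description: A mailing list for research\n"
    " staff and research students\n",
    "Template-Type: SERVICE\n"
    "Title: Library research service\n"
    "Keywords: research, library\n",
]
counts = count_terms(templates, attributes=["title", "description", "keywords"])
print("\n".join(top_terms(counts, maxhits=3)))
```

Holding the counts in memory is fine for a sketch; freq.pl itself relies on an external sort and a temporary directory, which suits collections too large to count this way.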
OPTIONS
-a
    send back a complete frequency count, rather than just the most frequently used terms
-d
    produce verbose debugging output
-f maxhits
    send back at most the top maxhits most frequently used terms, e.g. to see the top 100 with debugging info
        freq.pl -df 100
-m min-count
    stop once the frequency count falls below min-count, e.g. to get a list of all the terms which occur more than 999 times
        freq.pl -m 999 | cut -f2 -d' '
-s sourcedir
    look for the templates in the directory sourcedir, e.g. to use the templates in the directory /work2/WWW/roads and return a complete frequency breakdown
        freq.pl -as /work2/WWW/roads
-t tmpdir
    use tmpdir as temporary directory. This defaults to /tmp, but you may need to change the default if your machine does not have enough room in /tmp for any temporary files generated by freq.pl, e.g.
        freq.pl -t /var/tmp
-A attribute-list
    only produce frequency list for the attributes listed in attribute-list. attribute-list is a '|' (pipe) separated list of attribute names, e.g.
        freq.pl -A 'description|keywords'
OUTPUT FORMAT
The output of freq.pl consists of the frequency count for a term, followed by a single space character, followed by the term itself, e.g.
310 research
283 mailing
270 available
268 University
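Because each line is simply a count, a space and a term, the output is easy to post-process. The following Python sketch parses the sample lines above and picks out candidate stoplist terms; the function names and the threshold of 280 are made up for the example.

```python
def parse_freq_output(text):
    """Turn 'count term' lines into a list of (count, term) pairs."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        count, term = line.split(" ", 1)
        pairs.append((int(count), term))
    return pairs


def stoplist_candidates(pairs, min_count):
    """Return the terms whose count is at least min_count (hypothetical helper)."""
    return [term for count, term in pairs if count >= min_count]


sample = "310 research\n283 mailing\n270 available\n268 University\n"
pairs = parse_freq_output(sample)
print(stoplist_candidates(pairs, 280))  # prints ['research', 'mailing']
```

Whether a frequent term really belongs on the stoplist is still a judgement call, since a common term such as "research" may be exactly what users search for.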
DEPENDENCIES
An external program called "sort" is used to sort the frequency count into descending order. This is a standard feature of most (all?) implementations of Unix, but the command line options it takes may differ from version to version. Let us know if you find a version which does not understand -r, -n or -T!
TODO
Nothing ? :-)
SEE ALSO
COPYRIGHT
Copyright (c) 1988, Martin Hamilton <martinh@gnu.org> and Jon Knight <jon@net.lut.ac.uk>. All rights reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
It was developed by the Department of Computer Studies at Loughborough University of Technology, as part of the ROADS project. ROADS is funded under the UK Electronic Libraries Programme (eLib), the European Commission Telematics for Research Programme, and the TERENA development programme.
AUTHOR
Martin Hamilton <martinh@gnu.org>