NAME
linkTables - read an input file of links and tokens for a document set, and generate the token and document tables.
SYNOPSIS
linkTables [--docs|--linktext|--nocase|--noclean|--titletext] [--mincount N] [--stopfile FILE] LINK-FILE STEM
Options:
LINK-FILE Filename for the input link file, usually created by XSL
STEM Stem for the output files; files with several extensions are read and written
--docs only update the .docs file, all else remains fixed
--linktext add link text, delimited by spaces, to the text type
--mincount N only add tokens with at least this many occurrences
--nocase ignore case of URLs
--noclean don't use built-in URL cleaning
--stopfile FILE do not enter these words in text tables
--titletext add title text, delimited by spaces, to the text type
-h, --help display help message and exit.
--man print man page and exit.
DESCRIPTION
Input file of links, link text and redirects in the data format described next. Use the file name '-' to read from stdin. Builds the tables used in bag processing:
    STEM.tokens    N-th line is the token for items with index (N-1).
    STEM.words     A map for the token file; includes the token, its type and the hash code.
    STEM.docs      N-th line is the details for the N-th document.
    STEM.docfeats  Mapping of token index to document index.
The token-to-document index in .docfeats is derived after standardising the OUTGOING-URLs for a document and the document URLs themselves.
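Because the N-th line of STEM.tokens holds the token with index (N-1), reading the file into a list gives a zero-based lookup directly. The following is a minimal Python sketch and not part of linkTables itself; the helper name load_tokens, the "demo" stem and the UTF-8 encoding are assumptions made for illustration.

```python
import os
import tempfile
from typing import List


def load_tokens(stem: str) -> List[str]:
    """Return the lines of STEM.tokens so that result[i] is the token with index i."""
    # Assumption: the file is UTF-8 text with exactly one token per line.
    with open(stem + ".tokens", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


# Demonstration on a throwaway file; a real STEM.tokens is written by linkTables.
with tempfile.TemporaryDirectory() as workdir:
    stem = os.path.join(workdir, "demo")
    with open(stem + ".tokens", "w", encoding="utf-8") as handle:
        handle.write("alpha\nbeta\ngamma\n")
    tokens = load_tokens(stem)

print(tokens[1])  # second line of the file, i.e. the token with index 1
```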
DATA FORMAT
Input lines can have the R form for redirects:
    R <URL> <URL-REDIRECTED-TO>
These entries are ignored by this script, and should first be eliminated with linkRedir(1). The main input is the D form for documents and their links and link text:
    D <URL> <HASHID> <TITLE>
    <OUTGOING-URL> <LINK-TEXT>
    ...
    EOL
    <TYPE> <TOKEN>
    ...
    EOD
The text "EOD" acts as a document terminator and can be omitted if the document has no tokens. The text "EOL" is a link terminator. The <URL>s and <HASHID>s must not contain spaces, or the processing will get confused, since R and D records are split on spaces. Note that text at the end of a line is an exception. <HASHID> is any externally defined record identifier. The ALVIS default is a 32-character hexadecimal string from an MD5 hash of the text.
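As an illustration of the default identifier, a 32-character hexadecimal MD5 digest can be produced with Python's standard hashlib module. This is only a sketch: the function name default_hashid is hypothetical, and the choice of UTF-8 encoding and of exactly which text is hashed are assumptions, not something this page specifies.

```python
import hashlib


def default_hashid(text: str) -> str:
    """Return a 32-character hexadecimal MD5 digest of the given text."""
    # Assumption: the text is encoded as UTF-8 before hashing.
    return hashlib.md5(text.encode("utf-8")).hexdigest()


hashid = default_hashid("some document text")
print(len(hashid))  # an MD5 hex digest is always 32 characters long
```

The digest contains only the characters 0-9 and a-f, so it satisfies the requirement that a <HASHID> has no spaces.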
<TYPE> is intended to be a short bit of alphabetic text describing the type, such as 'person', 'company', etc. Reserved <TYPE>s are 'doc', which is a link to a document in the collection, 'link', which is a link out of the collection, and 'text', which is any text.
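The record format can be made concrete with a small reader. The sketch below is written in Python for illustration and is not the code used by linkTables. It assumes a line layout in which the D line carries the URL, hash ID and title, each following line up to "EOL" carries one outgoing URL and its link text, and each line after that up to "EOD" carries one type and token. It also assumes that no <TYPE> is literally "D" or "R", so that a missing "EOD" can be detected at the start of the next record. The names Document and parse_records and the example.* URLs are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Optional, Tuple


@dataclass
class Document:
    url: str
    hashid: str
    title: str
    links: List[Tuple[str, str]] = field(default_factory=list)   # (outgoing URL, link text)
    tokens: List[Tuple[str, str]] = field(default_factory=list)  # (type, token)


def parse_records(lines: Iterable[str]) -> Iterator[Document]:
    """Yield one Document per D record; R records are skipped."""
    doc: Optional[Document] = None  # document currently being read
    in_links = False                # True between a D line and its EOL
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if in_links:
            if line == "EOL":
                in_links = False
            else:
                # Only the URL is space-free; the rest of the line is link text.
                url, _, text = line.partition(" ")
                doc.links.append((url, text))
            continue
        if line == "EOD":
            if doc is not None:
                yield doc
                doc = None
            continue
        if line.startswith("D ") or line.startswith("R "):
            if doc is not None:
                yield doc  # the previous document omitted its EOD
                doc = None
            if line.startswith("D "):
                fields = line.split(" ", 3)  # the title may contain spaces
                if len(fields) < 3:
                    raise ValueError("malformed D line: " + line)
                title = fields[3] if len(fields) > 3 else ""
                doc = Document(fields[1], fields[2], title)
                in_links = True
            continue
        if doc is not None:
            token_type, _, token = line.partition(" ")
            doc.tokens.append((token_type, token))
    if doc is not None:
        yield doc  # input ended without a final EOD


sample = """\
R http://a.example/old http://a.example/new
D http://a.example/new 0123456789abcdef0123456789abcdef Example page title
http://a.example/other Other page
http://b.example/ External site
EOL
person Lovelace
text example
EOD
D http://a.example/other fedcba9876543210fedcba9876543210 Other page
EOL
"""
docs = list(parse_records(sample.splitlines()))
for d in docs:
    print(d.url, len(d.links), len(d.tokens))
```

The second document in the sample has no tokens and omits "EOD", which the reader handles by closing the record when the input ends or the next record begins.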
SEE ALSO
Alvis::URLs(3), linkBags(1), linkMpca(1), linkRedir(1), mpdata(1).
MPCA website is http://www.componentanalysis.org
AUTHOR
Wray Buntine
COPYRIGHT AND LICENSE
Copyright (C) 2005-2006 Wray Buntine
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.