NAME

linkTables - read a file of links and tokens for a document set, and generate token and document tables.

SYNOPSIS

linkTables [--docs|--linktext|--nocase|--noclean|--titletext] [--mincount N] [--stopfile FILE] LINK-FILE STEM

Options:

LINK-FILE           Filename for input link file usually created by XSL
STEM                Stem for output files; several extensions are read and written
--docs              only update the .docs file, all else remains fixed
--linktext          add link text, delimited by spaces, to the 'text' type
--mincount N        only add tokens with at least this many occurrences
--nocase            ignore case of URLs
--noclean           don't use built-in URL cleaning
--stopfile FILE     do not enter the words listed in FILE in the text tables
--titletext         add title text, delimited by spaces, to the 'text' type
-h, --help          display help message and exit.
--man               print man page and exit.
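
As a rough illustration of how --mincount and --stopfile might combine, the following Python sketch filters a list of tokens. It is not the linkTables implementation: the helper name filter_tokens, the reading of the count as a number of occurrences in the input, and the exact-match comparison against the stop words are all assumptions made for this example.

```python
from collections import Counter

def filter_tokens(tokens, mincount=1, stopwords=frozenset()):
    """Keep tokens seen at least `mincount` times and not in `stopwords`.

    Hypothetical helper: the counting rules used by linkTables may differ.
    """
    counts = Counter(tokens)
    kept = []
    # Counter preserves first-seen order (Python 3.7+), so output order is stable.
    for tok in counts:
        if counts[tok] >= mincount and tok not in stopwords:
            kept.append(tok)
    return kept

tokens = ["the", "alvis", "the", "mpca", "alvis", "rare"]
print(filter_tokens(tokens, mincount=2, stopwords={"the"}))  # ['alvis']
```

Here "the" is frequent enough but is dropped as a stop word, while "mpca" and "rare" fall below the minimum count.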

DESCRIPTION

Reads an input file of links, link text and redirects in the data format described under DATA FORMAT. Use the file name '-' to read from stdin. Builds the tables used in bag processing:

STEM.tokens         N-th line is the token for items with index (N-1)
STEM.words          a map for the token file; includes the token, its type and the hash code
STEM.docs           N-th line is the details for the N-th document
STEM.docfeats       mapping of token index to document index
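
The off-by-one between line numbers and indices in STEM.tokens can be seen in a short Python sketch. The function name load_tokens and the UTF-8 encoding are assumptions for the example; each line is kept whole, since the field layout within a line is not described here.

```python
import os
import tempfile

def load_tokens(stem):
    """Return a list where position i holds line i+1 of STEM.tokens."""
    with open(stem + ".tokens", encoding="utf-8") as fh:
        # Keep each line verbatim apart from its trailing newline.
        return [line.rstrip("\n") for line in fh]

# Demonstration with a throwaway file standing in for real linkTables output.
with tempfile.TemporaryDirectory() as tmp:
    stem = os.path.join(tmp, "demo")
    with open(stem + ".tokens", "w", encoding="utf-8") as fh:
        fh.write("first\nsecond\nthird\n")
    tokens = load_tokens(stem)

print(tokens[1])  # line 2 of the file, i.e. index 1: second
```

A zero-based list matches the stated convention directly, so no index arithmetic is needed after loading.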

The token-to-document index in .docfeats is inferred after standardising both the OUTGOING-URLs of each document and the document URLs themselves.

DATA FORMAT

Input lines can have the R form for redirects:

    R <URL> <URL-REDIRECTED-TO>

These entries are ignored by this script, and should first be eliminated with linkRedir(1). The main input is the D form for documents and their links and link text:

    D <URL> <HASHID> <TITLE>
    <OUTGOING-URL> <LINK-TEXT>
    ...
    EOL
    <TYPE> <TOKEN>
    ...
    EOD

The text "EOD" acts as a document terminator and can be omitted if no tokens exist. The text "EOL" is a link terminator. The <URL>s and <HASHID>s must not contain spaces, or processing will get confused, since R and D records are split on spaces. Text at the end of a line is an exception. <HASHID> is any externally defined record identifier; the ALVIS default is a 32-character hexadecimal string from an MD5 hash of the text.

<TYPE> is intended to be a short piece of alphabetic text describing the type, such as 'person', 'company', etc. Reserved <TYPE>s are 'doc', which is a link to a document in the collection, 'link', which is a link out of the collection, and 'text', which is any text.
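
A reader for this format might look like the Python sketch below. It is only an illustration and rests on several assumptions not confirmed above: that the D header, each link, each token, "EOL" and "EOD" sit on separate lines; that no token type is literally "R"; and that the names parse_documents and the dictionary keys are free choices. The sample URLs and hash are invented.

```python
def parse_documents(lines):
    """Yield one dict per D record; R (redirect) lines are skipped."""
    doc = None
    in_links = False
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("D "):
            if doc is not None:          # previous record had no EOD
                yield doc
            # Split on the first three spaces only: the title may contain spaces.
            _, url, hashid, title = (line.split(" ", 3) + [""] * 3)[:4]
            doc = {"url": url, "hashid": hashid, "title": title,
                   "links": [], "tokens": []}
            in_links = True
        elif line.startswith("R ") or doc is None:
            continue                     # redirects and stray lines are ignored
        elif line == "EOL":
            in_links = False             # links finished, tokens may follow
        elif line == "EOD":
            yield doc
            doc = None
        elif line:
            # First field has no spaces; the rest of the line is free text.
            head, _, rest = line.partition(" ")
            (doc["links"] if in_links else doc["tokens"]).append((head, rest))
    if doc is not None:
        yield doc

sample = """\
R http://old.example/ http://new.example/
D http://new.example/ 0123456789abcdef0123456789abcdef Example home page
http://new.example/about About us
http://other.example/ An external site
EOL
text example
person Buntine
EOD
""".splitlines()

docs = list(parse_documents(sample))
print(docs[0]["title"])
print(docs[0]["links"])
print(docs[0]["tokens"])
```

Splitting only on the leading spaces reflects the rule that URLs and hash identifiers cannot contain spaces while trailing text can.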

SEE ALSO

Alvis::URLs(3), linkBags(1), linkMpca(1), linkRedir(1), mpdata(1).

The MPCA website is http://www.componentanalysis.org

AUTHOR

Wray Buntine

COPYRIGHT AND LICENSE

Copyright (C) 2005-2006 Wray Buntine

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.