NAME
vector-input.pl - Builds the term index file and co-occurrence matrix that umls-similarity.pl uses to calculate vector relatedness.
SYNOPSIS
vector-input.pl takes bigram frequency input and builds the index and the co-occurrence matrix.
DESCRIPTION
We build the index and co-occurrence matrix for the vector method of UMLS-Similarity. The index file helps to locate each term's vector by recording the start position and the length of its vector. The matrix file records every term's vector.
See perldoc vector-input.pl
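To illustrate how a start position and a length can locate one term's vector, here is a minimal Python sketch. It is not the code used by UMLS-Similarity. The function name read_vector, the "term frequency term frequency ..." line layout, and the use of byte offsets are all assumptions made for the example; the actual file formats may differ.

```python
import io

def read_vector(matrix_in, start, length):
    """Read one term's vector from a matrix file using an index entry.

    Assumption: each matrix line holds whitespace-separated
    "co-occurring-term frequency" pairs, and (start, length) are byte
    offsets. Both are hypothetical choices for this sketch.
    """
    matrix_in.seek(start)                      # jump straight to the vector
    fields = matrix_in.read(length).decode("utf-8").split()
    # Pair up alternating term / frequency fields.
    return {fields[i]: int(fields[i + 1]) for i in range(0, len(fields), 2)}

# Hypothetical two-line matrix and the matching index entries.
matrix = io.BytesIO(b"b 2 c 1\na 3\n")
index = {"a": (0, 8), "b": (8, 4)}

vector_b = read_vector(matrix, *index["b"])
vector_a = read_vector(matrix, *index["a"])
```

Because only the indexed slice is read, the cost of a lookup does not depend on the size of the matrix file.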
USAGE
vector-input.pl INDEX MATRIX BIGRAMFILE
example: vector-input.pl Index.txt Matrix.txt BigramsList.txt
INPUT
Required Arguments:
INDEX
Output file of vector-input.pl. It records the index of each term, together with the start position and length of that term's vector in the co-occurrence matrix.
MATRIX
Output file of vector-input.pl. Each line is the vector for one term: its co-occurring terms and their frequencies.
BIGRAMFILE
Input to vector-input.pl should be a single flat file generated by huge-count.pl of the Text-NSP package. If the bigram list was generated by count.pl, please use count2huge.pl to convert the results to the huge-count.pl format; it sorts the bigrams in alphabetical order.

When vector-input.pl generates the index and co-occurrence matrix files, it requires that bigrams starting with the same term t1 be grouped together and listed next to each other, because at this step the bigrams are not stored in memory. When the first term of the bigrams changes, it prints the vector for the term t1 and the index position of that vector.

Sorting the bigrams alphabetically also makes it faster for the vector method of UMLS-Similarity to build vectors. For each concept, the method searches the co-occurrence matrix to build the second-order vector. If the terms of a vector are sorted, the vector method can search the co-occurrence matrix from beginning to end using the index position and length. If the co-occurrence matrix is a huge file, this can save a lot of execution time.
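The grouping requirement can be sketched in Python as follows. This is only an illustration of the streaming idea, not the actual implementation of vector-input.pl. The function names, the (t1, t2, frequency) tuple input, the output line layout, and the byte-offset index entries are assumptions made for the example.

```python
import io

def _flush(term, row, offset, matrix_out, index):
    """Write one term's vector and record where it landed (hypothetical layout)."""
    line = (" ".join(f"{t2} {freq}" for t2, freq in row) + "\n").encode("utf-8")
    matrix_out.write(line)
    index.append((term, offset, len(line)))    # term, start position, length
    return offset + len(line)

def build_index_and_matrix(bigrams, matrix_out):
    """Stream (t1, t2, freq) tuples, holding only one term's vector in memory.

    Assumption: bigrams with the same first term t1 are adjacent in the input.
    """
    index, offset = [], 0
    current, row = None, []
    for t1, t2, freq in bigrams:
        if current is not None and t1 != current:
            # First term changed: emit the finished vector and start a new one.
            offset = _flush(current, row, offset, matrix_out, index)
            row = []
        current = t1
        row.append((t2, freq))
    if current is not None:                    # emit the last pending vector
        _flush(current, row, offset, matrix_out, index)
    return index

# Hypothetical grouped input.
matrix = io.BytesIO()
index = build_index_and_matrix(
    [("a", "b", 2), ("a", "c", 1), ("b", "a", 3)], matrix
)
```

If the input were not grouped by first term, this approach would write several partial vectors for the same term, which is why the grouping is required when the bigrams are not held in memory.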
Other Options:
--stat
The bigram file is from statistics.pl rather than count.pl
--cutoff SCORE
Only use those ngrams whose score is greater than SCORE.
--help
Displays the help information.
--version
Displays the version information.
AUTHOR
Ying Liu, liux0395 at umn.edu
SEE ALSO
home page: www.tc.umn.edu/~liux0395
COPYRIGHT
Copyright (C) 2010, Ying Liu
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.