NAME
umls-similarity.pl - This program returns a semantic similarity score between two concepts.
SYNOPSIS
This is a utility that takes as input either two terms (DEFAULT) or two CUIs and returns the similarity between the two.
USAGE
Usage: umls-similarity.pl [OPTIONS] [CUI1|TERM1] [CUI2|TERM2]
INPUT
[CUI1|TERM1] [CUI2|TERM2]
The input is two terms or two CUIs associated with concepts in the UMLS.
OPTIONS
Optional command line arguments
General Options:
--config FILE
This is the configuration file. There are six configuration options that can be used depending on which measure you are using. The path, wup, cmatch, zhong, lch, lin, jcn and res measures require the SAB and REL options to be set while the vector and lesk measures require the SABDEF and RELDEF options.
The SAB and REL options are used to determine which sources and relations the path information is to be obtained from. The format of the configuration file is as follows:
SAB :: <include|exclude> <source1, source2, ... sourceN>
REL :: <include|exclude> <relation1, relation2, ... relationN>
For example, if we wanted to use the MSH vocabulary with only the RB/RN relations, the configuration file would be:
SAB :: include MSH
REL :: include RB, RN
or
SAB :: include MSH
REL :: exclude PAR, CHD
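As an illustration of the line format described above, the following is a minimal Python sketch that reads "OPTION :: include|exclude value1, value2" lines into a dictionary. It is not the parser used by umls-similarity.pl; the function name parse_config and the regular expression are assumptions made for this example only.

```python
import re

# Hypothetical helper, not part of UMLS-Similarity.
# Matches lines such as "SAB :: include MSH" or "REL :: include RB, RN".
CONFIG_LINE = re.compile(r"^\s*(\w+)\s*::\s*(include|exclude)\s+(.+?)\s*$")

def parse_config(text):
    """Return {option: (mode, [values])} for each recognized line."""
    options = {}
    for line in text.splitlines():
        match = CONFIG_LINE.match(line)
        if not match:
            continue  # skip blank or unrecognized lines
        key, mode, values = match.groups()
        options[key] = (mode, [v.strip() for v in values.split(",") if v.strip()])
    return options

example = """SAB :: include MSH
REL :: include RB, RN
"""
config = parse_config(example)
print(config)
```

The same sketch would read SABDEF and RELDEF lines, since they share the format.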
The SABDEF and RELDEF options are used to determine the sources and relations the extended definition is to be obtained from. We call the definition used by the measure the extended definition because it may include definitions from related concepts.
The format of the configuration file is as follows:
SABDEF :: <include|exclude> <source1, source2, ... sourceN>
RELDEF :: <include|exclude> <relation1, relation2, ... relationN>
The possible relations that can be included in RELDEF are:
1. all of the possible relations in MRREL such as PAR, CHD, ...
2. CUI which refers to the concept's definition
3. ST which refers to the definition of the concept's semantic types
4. TERM which refers to the concept's associated terms
For example, if we wanted to use the definitions from the MSH vocabulary and we only wanted the definition of the CUI and the definitions of the CUI's SIB relation, the configuration file would be:
SABDEF :: include MSH
RELDEF :: include CUI, SIB
Note: RELDEF takes any of MRREL relations and two special 'relations':
1. CUI which refers to the CUI's definition
2. TERM which refers to the terms associated with the CUI
If you go to the configuration file directory, there will be example configuration files for the different runs that you have performed.
For more information about the configuration options (including the RELA and RELADEF options) please see the README.
--realtime
This option will not create a database of the path information for all of the concepts in the specified set of sources and relations in the config file; instead, it obtains the information for just the input concepts.
This option is only available for the path and ic measures.
--forcerun
This option will bypass any command prompts such as asking if you would like to continue with the index creation.
This is also only necessary for the path and ic measures.
--measure MEASURE
Use the MEASURE module to calculate the semantic similarity. The available measures are:
1. Leacock and Chodorow (1998) referred to as lch
2. Wu and Palmer (1994) referred to as wup
3. Zhong, et al. (2002) referred to as zhong
4. The basic path measure referred to as path
5. The undirected path measure referred to as upath
6. Rada, et al. (1989) referred to as cdist
7. Nguyen and Al-Mubaid (2006) referred to as nam
8. Resnik (1996) referred to as res
9. Lin (1998) referred to as lin
10. Jiang and Conrath (1997) referred to as jcn
11. The vector measure referred to as vector
12. Pekar and Staab (2002) referred to as pks
13. Pirro and Euzenat (2010) referred to as faith
14. Maedche and Staab (2001) referred to as cmatch
15. Batet, et al. (2011) referred to as batet
16. Sánchez, et al. (2012) referred to as sanchez
--original
This returns the original score of the measures proposed by Rada, et al. (cdist), Nguyen & Al-Mubaid (nam) and Jiang & Conrath (jcn). The default returns the reciprocal of these measures in order to use them as similarity measures.
--precision N
Displays values up to N decimal places.
--allsenses
This option prints out all the possible CUI pairs and their semantic similarity scores if one of the inputs is a term that maps to more than one CUI. By default, only the most similar CUI pair is returned.
--help
Displays a quick summary of the program options.
--version
Displays the version information.
Input Options:
--infile FILE
A file containing pairs of concepts or terms in the following format:
term1<>term2
or
cui1<>cui2
or
cui1<>term2
or
term1<>cui2
If the --matrix option is chosen, the file is instead just a list of CUIs: cui1 cui2 cui3 ...
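To make the pair format concrete, here is a minimal Python sketch that reads "item1<>item2" lines and labels each item as a CUI or a term. It relies on the fact that UMLS CUIs are written as the letter C followed by seven digits; how umls-similarity.pl itself tells CUIs from terms is not stated here, so the regular expression test and the function name parse_pairs are assumptions for illustration.

```python
import re

# UMLS CUIs have the form 'C' followed by seven digits, e.g. C0018563.
CUI_PATTERN = re.compile(r"^C\d{7}$")

def parse_pairs(lines):
    """Yield ((kind1, value1), (kind2, value2)) for each 'a<>b' line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        left, sep, right = line.partition("<>")
        if not sep:
            raise ValueError(f"expected 'item1<>item2', got: {line!r}")
        yield tuple(
            ("cui" if CUI_PATTERN.match(item.strip()) else "term", item.strip())
            for item in (left, right)
        )

pairs = list(parse_pairs(["hand<>foot", "C0018563<>foot"]))
print(pairs)
```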
--matrix
This option returns a matrix of similarity scores given a file containing a list of CUIs. The file is passed using the --infile option.
--loadcache FILE
FILE containing similarity scores of cui pairs in the following format:
score<>CUI1<>CUI2
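The cache format can be read with a few lines of Python. The sketch below is illustrative only: the function names load_cache and lookup are made up for this example, and trying both orders of a pair assumes the measure is symmetric, which this document does not state.

```python
def load_cache(lines):
    """Map (cui1, cui2) -> score from 'score<>CUI1<>CUI2' lines."""
    cache = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        score, cui1, cui2 = line.split("<>")
        cache[(cui1, cui2)] = float(score)
    return cache

def lookup(cache, cui1, cui2):
    """Return the cached score for a pair, trying both orders.

    Assumption: the measure is symmetric, so (a, b) and (b, a) share a score.
    Returns None when the pair is not cached.
    """
    if (cui1, cui2) in cache:
        return cache[(cui1, cui2)]
    return cache.get((cui2, cui1))

# Made-up CUIs and score, for illustration only.
cache = load_cache(["0.5<>C0000001<>C0000002"])
print(lookup(cache, "C0000002", "C0000001"))
```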
--getcache FILE
Outputs the cache into FILE once the program has finished.
Debug Options:
--debug
Sets the UMLS-Interface debug flag on for testing
--info
Displays information about the concept if it doesn't exist in the source.
--verbose
This option will print out the table information to the config file that you specified.
Database Options:
--username STRING
Username is required to access the umls database on mysql
--password STRING
Password is required to access the umls database on mysql
--hostname STRING
Hostname where mysql is located. DEFAULT: localhost
--socket STRING
Socket where the mysql.sock or mysqld.sock is located. DEFAULT: mysql.sock
--database STRING
Database containing the UMLS. DEFAULT: umls
Path-based Options:
--undirected
The shortest path is undirected. This is only available with the path measure itself using the --realtime option. It is also currently limited to the RB/RN and/or PAR/CHD relations.
IC Measure Options:
--st
Uses the information content of the CUI's semantic type to calculate the res measure. If the --icpropagation or --icfrequency files are specified, they must contain the probability or frequency counts of the semantic types. If they aren't specified, the defaults will be used.
--intrinsic [seco|sanchez]
Uses the intrinsic information content of the CUIs as defined by Sanchez and Batet (2011) or Seco, et al. (2004).
--icpropagation FILE
FILE containing the propagation counts of the CUIs. This file must be in the following format:
CUI<>probability
where probability is the probability of the concept occurring.
See create-icpropagation.pl for more information.
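For orientation, information content is conventionally defined as the negative log of a concept's probability, IC(c) = -log P(c). The following Python sketch reads "CUI<>probability" lines and applies that definition. It is a sketch under stated assumptions rather than the tool's code: the function names are hypothetical, and the use of the natural logarithm is an assumption.

```python
import math

def load_probabilities(lines):
    """Parse 'CUI<>probability' lines into {cui: probability}."""
    probs = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        cui, value = line.split("<>")
        probs[cui] = float(value)
    return probs

def information_content(probability):
    """IC(c) = -log P(c), using the natural log (an assumption here)."""
    if probability <= 0:
        raise ValueError("probability must be positive")
    return -math.log(probability)

# Made-up CUI and probability, for illustration only.
probs = load_probabilities(["C0000001<>0.25"])
ic = information_content(probs["C0000001"])
print(ic)
```

Rarer concepts receive a higher information content, and a probability of zero has no defined value.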
--icfrequency FILE
FILE containing frequency counts of CUIs. This file must be in the following format:
CUI<>freq
where freq is the frequency in which the concept occurred in some text. See create-icfrequency.pl for more information.
--smooth
Incorporate Laplace smoothing, where the frequency count of each of the concepts in the taxonomy is incremented by one. The advantage of doing this is that it avoids having a concept that has a probability of zero. The disadvantage is that it can shift the overall probability mass of the concepts from what is actually seen in the corpus.
This can only be used in conjunction with the --icfrequency option.
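The effect of add-one (Laplace) smoothing can be seen in a small Python sketch. It works on a flat set of concepts and ignores any propagation of counts through the taxonomy, so it illustrates the idea only and is not the computation performed by the tool; the function name and the CUIs are made up.

```python
def smoothed_probabilities(freq, all_cuis):
    """Add-one smoothing: every concept in all_cuis gets its count plus one."""
    counts = {cui: freq.get(cui, 0) + 1 for cui in all_cuis}
    total = sum(counts.values())
    return {cui: count / total for cui, count in counts.items()}

# Made-up frequency counts; C0000003 was never observed in the corpus.
freq = {"C0000001": 3, "C0000002": 1}
taxonomy = ["C0000001", "C0000002", "C0000003"]
probs = smoothed_probabilities(freq, taxonomy)
print(probs)
```

The unseen concept ends up with a small non-zero probability, while the observed concepts give up some of their probability mass, which is the trade-off described above.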
Vector Measure Options:
--vectormatrix FILE
This is the matrix file that contains the vector information to use with the vector measure. This is required if you specify vector with the --measure option.
This file is generated by the vector-input.pl program. An example of this file can be found in the samples/ directory and is called matrix.
--vectorindex FILE
This is the index file that contains the vector information to use with the vector measure. This is required if you specify vector with the --measure option.
This file is generated by the vector-input.pl program. An example of this file can be found in the samples/ directory and is called index.
--debugfile FILE
This prints the vector information to file, FILE, for debugging purposes.
--dictfile FILE
This is a dictionary file for the vector measure. It contains the 'definitions' of a concept or term which would be used rather than the definitions from the UMLS. If you would like to use dictfile as an augmentation of the UMLS definitions, then use the --config option in conjunction with the --dictfile option.
The expected format for the --dictfile file is:
CUI: <definition>
CUI: <definition>
TERM: <definition>
TERM: <definition>
There are three different option configurations that you have with the --dictfile.
1. No --dictfile - which will use the UMLS definitions
umls-similarity.pl --measure lesk hand foot
2. --dictfile - which will just use the dictfile definitions
umls-similarity.pl --measure lesk --dictfile samples/dictfile hand foot
3. --dictfile + --config - which will use both the UMLS and dictfile definitions
umls-similarity.pl --measure lesk --dictfile samples/dictfile --config configuration hand foot
Keep in mind, when using this file with the --config option, if one of the CUIs or terms that you are obtaining the similarity for does not exist in the file the vector will be empty which will lead to strange similarity scores.
An example of this file can be found in the samples/ directory and is called dictfile.
--defraw
This is a flag for the vector measures. By default, the definitions used are 'cleaned'. If the --defraw flag is set, they will not be cleaned.
--stoplist FILE
A file containing a list of words to be excluded from the features in the lesk and vector method on a word by word basis. The format required is one stopword per line, words are in the regular expression format. For example:
/\b[a-zA-Z]\b/
/\b[aA]board\b/
/\b[aA]bout\b/
/\b[aA]bove\b/
/\b[aA]cross\b/
/\b[aA]fter\b/
/\b[aA]gain\b/
The sample file, stoplist-nsp.regex, is under the samples directory.
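As a rough illustration of how such a stoplist behaves, the Python sketch below strips the surrounding slashes from each line, compiles the pattern, and drops any word that matches. The simple patterns shown above are also valid in Python's re syntax, but the tool itself is written in Perl, so this is an analogy and not its implementation; the function names are hypothetical.

```python
import re

def load_stoplist(lines):
    """Compile one regex per '/pattern/' line, stripping the slashes."""
    patterns = []
    for line in lines:
        line = line.strip()
        if len(line) >= 2 and line.startswith("/") and line.endswith("/"):
            patterns.append(re.compile(line[1:-1]))
    return patterns

def remove_stopwords(words, patterns):
    """Keep only the words that match none of the stoplist patterns."""
    return [w for w in words if not any(p.search(w) for p in patterns)]

stoplist = load_stoplist([r"/\b[a-zA-Z]\b/", r"/\b[aA]bout\b/", r"/\b[aA]fter\b/"])
kept = remove_stopwords("a pain about the hand after surgery".split(), stoplist)
print(kept)
```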
--compoundfile FILE
This is a compound word file for the vector and lesk measures. It contains the compound words that we want to treat as a single word when we compare the relatedness. Each compound word is on its own line, and the words within a compound are separated by a space. When using this option with vector, make sure the vectormatrix and vectorindex files are based on a corpus preprocessed by replacing the compound words using the Text-NSP package. An example is under /sample/compoundword.txt
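The idea of treating a compound as one word can be sketched in Python as follows. Joining the words with an underscore is an assumption made for this example, since the joining convention is not stated here, and the function name merge_compounds is hypothetical.

```python
def merge_compounds(tokens, compounds):
    """Greedily replace known multi-word sequences with a single token."""
    # Try longer compounds first so they win over shorter overlapping ones.
    by_length = sorted({tuple(c.split()) for c in compounds}, key=len, reverse=True)
    merged, i = [], 0
    while i < len(tokens):
        for compound in by_length:
            n = len(compound)
            if n > 1 and tuple(tokens[i:i + n]) == compound:
                # Assumption: the joined form uses an underscore.
                merged.append("_".join(compound))
                i += n
                break
        else:
            merged.append(tokens[i])  # no compound starts at this position
            i += 1
    return merged

tokens = "chronic heart failure affects the heart".split()
result = merge_compounds(tokens, ["heart failure"])
print(result)
```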
--stem
This is a flag for the vector and lesk method. If the --stem flag is set, definition words are stemmed using the Lingua::Stem::En module.
SYSTEM REQUIREMENTS
Perl (version 5.8.5 or better) - http://www.perl.org
UMLS::Interface - http://search.cpan.org/dist/UMLS-Interface
UMLS::Similarity - http://search.cpan.org/dist/UMLS-Similarity
CONTACT US
If you have any trouble installing and using UMLS-Similarity,
please contact us via the users mailing list :
umls-similarity@yahoogroups.com
You can join this group by going to:
http://tech.groups.yahoo.com/group/umls-similarity/
You may also contact us directly if you prefer :
Bridget T. McInnes: bthomson at cs.umn.edu
Ted Pedersen : tpederse at d.umn.edu
AUTHOR
Bridget T. McInnes, University of Minnesota
COPYRIGHT
Copyright (c) 2007-2011,
Bridget T. McInnes, University of Minnesota
bthomson at cs.umn.edu
Ted Pedersen, University of Minnesota Duluth
tpederse at d.umn.edu
Siddharth Patwardhan, University of Utah, Salt Lake City
sidd at cs.utah.edu
Serguei Pakhomov, University of Minnesota Twin Cities
pakh0002 at umn.edu
Ying Liu, University of Minnesota Twin Cities
liux at umn.edu
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to:
The Free Software Foundation, Inc.,
59 Temple Place - Suite 330,
Boston, MA 02111-1307, USA.