NAME

create-icpropagation.pl - This program determines the probability of the CUIs in a specified set of sources and relations.

SYNOPSIS

This program determines the probability of the CUIs in a specified set of sources and relations.

USAGE

Usage: create-icpropagation.pl [OPTIONS] OUTPUTFILE ICFREQUENCY_FILE

OUTPUTFILE

File in which the probability of the CUIs will be stored.

The output file containing the probability of the CUIs has the following format:

SMOOTH :: <0|1>
SAB :: (include|exclude) <sources>
REL :: (include|exclude) <relations>
N :: NUMBER
REL :: <relations>
RELA :: <relas>  <- if any are specified in the config
CUI<>probability
CUI<>probability
...

ICFREQUENCY FILE

File containing the icfrequency counts.

The input file contains frequency counts for CUIs in the following format:

SAB :: (include|exclude) <sources>
REL :: (include|exclude) <relations>
N :: NUMBER
CUI<>freq
CUI<>freq
...

N is the total number of ngrams that occurred in the text used to create the icfrequency file.
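As an illustration of the format above, the following is a minimal Python sketch of a reader for an icfrequency file. It is not part of the package: the function name parse_icfrequency and the sample CUIs are hypothetical, and the sketch assumes that header lines use " :: ", that data lines use "<>", and that the counts are integers.

```python
def parse_icfrequency(lines):
    """Split icfrequency lines into header fields and per-CUI counts."""
    header = {}
    freqs = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if "<>" in line:
            # Data line: CUI<>freq (integer counts are an assumption)
            cui, count = line.split("<>", 1)
            freqs[cui.strip()] = int(count)
        elif "::" in line:
            # Header line: KEY :: value
            key, value = line.split("::", 1)
            header[key.strip()] = value.strip()
    return header, freqs


# Hypothetical sample input; the CUIs and counts are made up.
sample = [
    "SAB :: include MSH",
    "REL :: include PAR, CHD",
    "N :: 10",
    "C0000001<>3",
    "C0000002<>7",
]
header, freqs = parse_icfrequency(sample)
```

The header values are kept as raw strings so that a caller can decide how to interpret the include/exclude lists.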

Optional Arguments:

--st

This outputs the probability of the concepts' semantic types rather than of the concepts themselves. The frequencies for the semantic types are propagated up the semantic network and are therefore source independent. Note that the semantic types are expected in the icfrequency input file, which can be created using the create-icfrequency.pl program with the --st option.

If the program exits with an error caused by the header information at the top of the icfrequency file, try using the --disregard option.

--smooth

Incorporate Laplace smoothing, where the frequency count of each of the concepts in the taxonomy is incremented by one. The advantage of doing this is that it avoids having a concept with a probability of zero. The disadvantage is that it can shift the overall probability mass of the concepts away from what is actually seen in the corpus.

--config FILE

This is the configuration file. The format of the configuration file is as follows:

SAB :: <include|exclude> <source1, source2, ... sourceN>

REL :: <include|exclude> <relation1, relation2, ... relationN>

For example, if we wanted to use the MSH vocabulary with only the RB/RN relations, the configuration file would be:

SAB :: include MSH
REL :: include RB, RN

or

SAB :: include MSH
REL :: exclude PAR, CHD

If you go to the configuration file directory, there will be example configuration files for the different runs that you have performed.

Note: You can use relations other than PAR/CHD and RB/RN for propagation, but we do not recommend it. The PAR/CHD and RB/RN relations are considered the hierarchical relations in the UMLS, which are required for propagation to perform correctly.
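To make the configuration format concrete, here is a minimal Python sketch that reads such a file into a dictionary. It is an illustration only, not the parser used by the program: the function name parse_config is hypothetical, and the sketch assumes one directive per line with comma-separated items.

```python
def parse_config(text):
    """Map each key (e.g. SAB, REL) to a (mode, items) pair."""
    config = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # "SAB :: include MSH" -> key "SAB", rest "include MSH"
        key, _, rest = line.partition("::")
        # First word is include/exclude, the remainder is the item list
        mode, _, items = rest.strip().partition(" ")
        names = [item.strip() for item in items.split(",") if item.strip()]
        config[key.strip()] = (mode, names)
    return config


# Assumes each directive sits on its own line of the file.
example = "SAB :: include MSH\nREL :: include RB, RN"
config = parse_config(example)
```

Here config["REL"] holds the mode and the list of relations, so a caller can check whether only hierarchical relations were requested.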

--disregard

This ignores the SAB configuration that the icfrequency file was created with.

--precision N

Displays values up to N decimal places.

--username STRING

Username required to access the UMLS database in MySQL.

--password STRING

Password required to access the UMLS database in MySQL.

--hostname STRING

Hostname where MySQL is located. DEFAULT: localhost

--database STRING

Database containing the UMLS. DEFAULT: umls

--debug

Sets the UMLS-Interface debug flag on for testing.

--help

Displays a quick summary of the program options.

--version

Displays the version information.

PROPAGATION

The Information Content (IC) is defined as the negative log of the probability of a concept.
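As a small worked illustration of this definition, the Python sketch below computes the IC of a probability. The function name information_content is hypothetical, and the natural logarithm is an assumption, since the base of the log is not stated here.

```python
import math


def information_content(probability):
    """Return -log(p); the natural log is an assumption of this sketch."""
    if probability <= 0.0:
        # log(0) is undefined, so a zero-probability concept has no finite IC
        raise ValueError("probability must be greater than zero")
    return -math.log(probability)
```

A concept with probability 1 carries no information (IC of zero), and the IC grows as the concept becomes rarer. A probability of zero has no finite IC, which is the case that smoothing is meant to avoid.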

The probability of a concept, c, is determined by summing the probability of the concept (P(c)) occurring in some text plus the probability of its descendants (P(d)) occurring in some text, as seen below:

P(c*) = P(c) + \sum_{d \in descendant(c)} P(d)

The initial probability of a concept (P(c)) and of its descendants (P(d)) is obtained by dividing the number of times a concept is seen in the corpus (freq(d)) by the total number of concepts (N), as seen below:

P(d) = freq(d) / N
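The two formulas above can be sketched in Python as follows. This is an illustration under stated assumptions, not the package's implementation: the taxonomy is given as a hypothetical dictionary mapping each concept to its direct children, it is assumed to be acyclic, and each descendant is counted once even when it is reachable through several parents.

```python
def descendants(children, concept):
    """Collect every concept reachable below `concept`, each one once."""
    seen = set()
    stack = list(children.get(concept, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children.get(node, []))
    seen.discard(concept)  # guard against counting the concept itself
    return seen


def propagated_probability(children, freq, total, concept):
    """P(c*) = P(c) + sum of P(d) over descendants d, with P(x) = freq(x) / N."""
    own = freq.get(concept, 0) / total
    below = sum(freq.get(d, 0) / total for d in descendants(children, concept))
    return own + below


# Hypothetical taxonomy: D has two parents, B and C.
children = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
freq = {"A": 1, "B": 2, "C": 3, "D": 4}
total = 10
p_root = propagated_probability(children, freq, total, "A")
p_b = propagated_probability(children, freq, total, "B")
```

Collecting the descendants into a set is a design choice of this sketch: in a taxonomy with multiple inheritance it keeps the count of D from being added twice when computing the probability of A.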

Not all of the concepts in the taxonomy will be seen in the corpus. The package includes the option of using Laplace smoothing, where the frequency count of each of the concepts in the taxonomy is incremented by one. The advantage of doing this is that it avoids having a concept with a probability of zero. The disadvantage is that it can shift the overall probability mass of the concepts away from what is actually seen in the corpus.
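The effect of add-one smoothing can be sketched in Python as follows. The function name laplace_smooth and the sample counts are hypothetical. How the program adjusts N when smoothing is not described here; this sketch assumes the textbook form, in which N grows by the number of concepts in the taxonomy.

```python
def laplace_smooth(freq, concepts, total):
    """Add one to every concept's count, including unseen concepts."""
    smoothed = {c: freq.get(c, 0) + 1 for c in concepts}
    # Assumption: N is increased by the number of concepts in the taxonomy
    return smoothed, total + len(concepts)


# Hypothetical taxonomy of three concepts; C never occurs in the corpus.
concepts = ["A", "B", "C"]
freq = {"A": 3, "B": 1}
smoothed, n = laplace_smooth(freq, concepts, 4)
probs = {c: smoothed[c] / n for c in concepts}
```

The unseen concept C now has a small non-zero probability, while the share of A drops from 3/4 to 4/7, which is the shift in probability mass described above.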

For more information on how this is calculated please see the README file.

SYSTEM REQUIREMENTS

  • Perl (version 5.8.5 or better) - http://www.perl.org

  • UMLS::Interface - http://search.cpan.org/dist/UMLS-Interface

  • UMLS::Similarity - http://search.cpan.org/dist/UMLS-Similarity

  • Text::NSP - http://search.cpan.org/dist/Text-NSP

  • MetaMap - http://mmtx.nlm.nih.gov/

CONTACT US

If you have any trouble installing and using CreatePropagationFile, 
please contact us via the users mailing list :
  
    umls-similarity@yahoogroups.com
   
You can join this group by going to:
  
    http://tech.groups.yahoo.com/group/umls-similarity/
   
You may also contact us directly if you prefer :
  
    Bridget T. McInnes: bthomson at cs.umn.edu 

    Ted Pedersen : tpederse at d.umn.edu

AUTHOR

Bridget T. McInnes, University of Minnesota

COPYRIGHT

Copyright (c) 2007-2011,

Bridget T. McInnes, University of Minnesota
bthomson at cs.umn.edu
   
Ted Pedersen, University of Minnesota Duluth
tpederse at d.umn.edu


Siddharth Patwardhan, University of Utah, Salt Lake City
sidd@cs.utah.edu

Serguei Pakhomov, University of Minnesota Twin Cities
pakh0002@umn.edu

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to:

The Free Software Foundation, Inc.,
59 Temple Place - Suite 330,
Boston, MA  02111-1307, USA.