NAME

CUICollector.pl - Scrapes files from the MetaMapped Medline Baseline project to build a database of CUI bigram scores.

SYNOPSIS

$ perl CUICollector.pl --directory metamapped-baseline/2014/ 
CUICollector 0.04 - (C) 2015 Keith Herbert and Bridget McInnes, PhD
Released under the GNU GPL.
Connecting to CUI_DB on localhost
Parsing file: /home/share/data/metamapped-baseline/2014/text.out_01.gz
Parsing file: /home/share/data/metamapped-baseline/2014/text.out_02.gz
Parsing file: /home/share/data/metamapped-baseline/2014/text.out_03.gz
Parsing file: /home/share/data/metamapped-baseline/2014/text.out_02.gz
Parsing file: /home/share/data/metamapped-baseline/2014/text.out_03.gz
Entering scores into CUI_DB
...
Finished

USAGE

Usage: CUICollector.pl [DATABASE OPTIONS] [OTHER OPTIONS] [FILES | DIRECTORIES]

INPUT

Required Arguments:

[FILES | DIRECTORIES]

Specify a directory containing *ONLY* compressed MetaMapped Medline Baseline files: --directory /path/to/files/

Multiple directories may also be supplied: --directory /path/to/first/folder/ /path/to/second/folder/

Likewise, specify a list of individual files: --files text.out_01.txt.gz text_mm_out_42.txt.gz text_mm_out_314.txt.gz

Or a glob of files: --files /path/to/dir/*.gz

Or just one: --files text.out_01.txt.gz

Optional Arguments:

--database STRING

Database to contain the CUI bigram scores. DEFAULT: CUI_DB

If the database is not found in MySQL, CUICollector will create it for you.

--username STRING

A username is required to access the bigram database on MySQL. You will be prompted for it if it is not supplied as an argument.

--password STRING

A password is required to access the bigram database on MySQL. You will be prompted for it if it is not supplied as an argument.

--hostname STRING

Hostname where MySQL is located. DEFAULT: localhost

--port STRING

The port your MySQL server is using. DEFAULT: 3306

--file_step INTEGER

How many MetaMapped Medline Baseline files to read between writes to the database. DEFAULT: 5

MMO files can be rather large, so a low file_step reduces the memory footprint of the script. A higher file_step, however, reduces the number of write operations to the database.
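The trade-off behind file_step can be sketched as follows. This is an illustration in Python, not the script's actual Perl code: process_in_batches, parse_file and write_counts are hypothetical names, and the sketch assumes counts are held in memory until each write.

```python
from collections import Counter

def process_in_batches(files, file_step, parse_file, write_counts):
    """Parse files, writing accumulated counts out every `file_step` files."""
    counts = Counter()
    for i, path in enumerate(files, start=1):
        counts.update(parse_file(path))   # counts grow in memory
        if i % file_step == 0:
            write_counts(counts)          # one database write per batch
            counts = Counter()            # release the batch's memory
    if counts:
        write_counts(counts)              # write out the final partial batch

# Stand-ins for the real parser and database writer (assumptions).
def fake_parse(path):
    return Counter({("C0000001", "C0000002"): 1})

writes = []
paths = ["file_%d.gz" % n for n in range(7)]
process_in_batches(paths, 5, fake_parse, writes.append)
print(len(writes))  # 2: one write after five files, one for the last two
```

With seven files and a file_step of 5, only two writes occur, but up to five files' worth of counts sit in memory at once.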

--debug

Sets the debug flag for testing. NOTE: extremely verbose.

--verbose

Print the current status of the program to STDOUT. This indicates the files being processed and when the program is writing to the database. This is the default output setting.

--quiet

Don't print anything to STDOUT.

--help

Displays a quick summary of program options.

OUTPUT

By default, CUICollector prints the current status of the program as it works through the MetaMapped Medline Output (MMO) files (disable with `--quiet`). It creates a database (or connects to an existing one) and adds bigram scores of the CUIs it encounters in the MMO files.

The resulting database will have four tables:

N_11 (columns: cui_1, cui_2, n_11)

The count (n_11) of times a particular CUI (cui_1) is immediately followed by another particular CUI (cui_2) in an utterance.

N_1P (columns: cui_1, n_1p)

The count (n_1p) of times a particular CUI (cui_1) is immediately followed by any CUI in an utterance.

N_P1 (columns: cui_2, n_p1)

The count (n_p1) of times a particular CUI (cui_2) is immediately preceded by any CUI in an utterance.

N_PP (column: n_pp)

A single value: the total count of all cui_1, cui_2 bigram pairs.
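How the four counts relate can be shown with a minimal sketch. It is written in Python for illustration and is not the script's own code. It assumes each utterance has already been reduced to an ordered list of CUIs; how CUICollector extracts that list from the MetaMap output is not shown, and the CUI values are made up.

```python
from collections import Counter

def count_bigrams(utterances):
    """Tally the four count tables from lists of CUIs, one list per utterance."""
    n_11 = Counter()  # (cui_1, cui_2) -> times cui_2 immediately follows cui_1
    n_1p = Counter()  # cui_1 -> times it is the first member of any pair
    n_p1 = Counter()  # cui_2 -> times it is the second member of any pair
    n_pp = 0          # total number of pairs
    for cuis in utterances:
        # Pair each CUI with its immediate successor; pairs never
        # cross an utterance boundary.
        for cui_1, cui_2 in zip(cuis, cuis[1:]):
            n_11[(cui_1, cui_2)] += 1
            n_1p[cui_1] += 1
            n_p1[cui_2] += 1
            n_pp += 1
    return n_11, n_1p, n_p1, n_pp

# Hypothetical CUI sequences, for illustration only.
utterances = [
    ["C0000001", "C0000002", "C0000003"],
    ["C0000001", "C0000002"],
]
n_11, n_1p, n_p1, n_pp = count_bigrams(utterances)
print(n_11[("C0000001", "C0000002")], n_1p["C0000002"], n_p1["C0000002"], n_pp)
# prints: 2 1 2 3
```

Every pair adds one to each of the four tallies, so the n_11 values, the n_1p values and the n_p1 values each sum to n_pp.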

AUTHOR

Keith Herbert, Virginia Commonwealth University

COPYRIGHT

Copyright (c) 2015, Keith Herbert, Virginia Commonwealth University herbertkb at vcu edu

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to:

The Free Software Foundation, Inc.,
59 Temple Place - Suite 330,
Boston, MA  02111-1307, USA.