NAME
CUICollector.pl - Scrapes MetaMap Machine Output (MMO) files to build a database of CUI bigram scores.
SYNOPSIS
$ perl CUICollector.pl --directory metamapped-baseline/2014/
CUICollector 0.04 - (C) 2015 Keith Herbert and Bridget McInnes
Released under the GNU GPL.
Connecting to database CUI_Bigrams on localhost
Parsing file: /home/share/data/metamapped-baseline/2014/text.out_01.gz
Parsing file: /home/share/data/metamapped-baseline/2014/text.out_02.gz
Parsing file: /home/share/data/metamapped-baseline/2014/text.out_03.gz
Parsing file: /home/share/data/metamapped-baseline/2014/text.out_02.gz
Parsing file: /home/share/data/metamapped-baseline/2014/text.out_03.gz
Entering scores into CUI_Bigrams
...
Finished
USAGE
Usage: CUICollector.pl [DATABASE OPTIONS] [OTHER OPTIONS] [FILES | DIRECTORIES]
INPUT
Required Arguments:
[FILES | DIRECTORIES]
Specify a directory containing *ONLY* compressed MetaMapped MEDLINE Baseline files: --directory /path/to/files/
Multiple directories may also be supplied: --directory /path/to/first/folder/ /path/to/second/folder/
Likewise, specify a list of individual files: --files text.out_01.txt.gz text_mm_out_42.txt.gz text_mm_out_314.txt.gz
Or a glob of files: --files /path/to/dir/*.gz
Or just one: --files text.out_01.txt.gz
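The way directory and file arguments resolve to a list of inputs can be sketched as follows. This is a minimal Python illustration, not CUICollector's own Perl code: the function names collect_input_files and read_lines are hypothetical, and the sketch assumes every regular file in a supplied directory is treated as a gzip-compressed input, which is why the directory must contain only such files.

```python
import gzip
from pathlib import Path


def collect_input_files(directories=(), files=()):
    """Return the input paths named by --directory and --files style arguments.

    Assumption: every regular file found in a directory is an input file,
    so nothing is filtered by extension.
    """
    paths = []
    for directory in directories:
        # Sort so that files are visited in a predictable order.
        paths.extend(sorted(p for p in Path(directory).iterdir() if p.is_file()))
    # Individual files (or a shell-expanded glob) are taken as given.
    paths.extend(Path(f) for f in files)
    return paths


def read_lines(path):
    """Yield the text lines of one gzip-compressed file, without newlines."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            yield line.rstrip("\n")
```

Note that a glob such as /path/to/dir/*.gz is expanded by the shell before the program starts, so the sketch only ever sees a plain list of file names.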
Optional Arguments:
--database STRING
Database to contain the CUI bigram scores. DEFAULT: CUI_Bigrams
If the database is not found in MySQL, CUICollector will create it for you.
--username STRING
A username is required to access the CUI bigram database in MySQL. You will be prompted for it if it is not supplied as an argument.
--password STRING
A password is required to access the CUI bigram database in MySQL. You will be prompted for it if it is not supplied as an argument.
--hostname STRING
Hostname of the machine running MySQL. DEFAULT: localhost
--port STRING
The port your MySQL server is using. DEFAULT: 3306
--file_step INTEGER
How many MetaMap files to read between writes to the database. DEFAULT: 5
MMO files can be rather large, so a low file_step reduces the memory footprint of the script. A higher file_step, however, reduces the number of write operations to the database.
--debug
Sets the debug flag for testing. NOTE: extremely verbose.
--verbose
Print the current status of the program to STDOUT. This indicates the files being processed and when the program is writing to the database. This is the default output setting.
--quiet
Don't print anything to STDOUT.
--help
Displays a quick summary of program options.
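The trade-off behind --file_step can be sketched as a batching loop: counts accumulate in memory for file_step files, are written out, and the in-memory tally is reset. This is a minimal Python illustration under stated assumptions, not the script's actual Perl implementation. The names process_in_batches, parse_file, and write_counts are hypothetical; parse_file is assumed to return a mapping of bigram to count for one file, and write_counts stands in for the database update.

```python
from collections import Counter


def process_in_batches(files, parse_file, write_counts, file_step=5):
    """Parse files and flush accumulated counts every `file_step` files.

    Returns the number of write operations performed.
    """
    pending = Counter()      # counts held in memory since the last write
    files_in_batch = 0
    writes = 0
    for path in files:
        pending.update(parse_file(path))   # add this file's counts
        files_in_batch += 1
        if files_in_batch == file_step:
            write_counts(pending)          # hypothetical database write
            pending = Counter()            # release the memory
            files_in_batch = 0
            writes += 1
    if files_in_batch:
        # Flush whatever is left when the file count is not a multiple
        # of file_step.
        write_counts(pending)
        writes += 1
    return writes
```

With seven input files and a file_step of 3, this sketch performs three writes (after the third file, after the sixth, and once more for the remaining file). A larger file_step means fewer writes but a larger tally held in memory between them.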
OUTPUT
By default, CUICollector prints the current status of the program as it works through the MetaMap Machine Output (MMO) files (disable with `--quiet`). It creates a database (or connects to an existing one) and adds bigram scores of the CUIs it encounters in the MMO files.
The resulting database will have four tables:
- N_11
Columns: cui_1, cui_2, n_11
This shows the count (n_11) for every time a particular CUI (cui_1) is immediately followed by another particular CUI (cui_2) in an utterance.
- N_1P
Columns: cui_1, n_1p
This shows the count (n_1p) for every time a particular CUI (cui_1) is immediately followed by any CUI in an utterance.
- N_P1
Columns: cui_2, n_p1
This shows the count (n_p1) for every time a particular CUI (cui_2) is immediately preceded by any CUI in an utterance.
- N_PP
Columns: n_pp
This single value is the total count of all cui_1, cui_2 bigram pairs.
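The relationship between the four tables can be sketched in a few lines of Python. This is an illustration of the counting scheme described above, not CUICollector's own code: the function name count_cui_bigrams is hypothetical, the input is assumed to be one ordered list of CUIs per utterance, and the CUIs in the usage lines are placeholders. How the real script chooses among multiple candidate CUIs for the same text is not covered here.

```python
from collections import Counter


def count_cui_bigrams(utterances):
    """Count adjacent CUI pairs and their marginal totals.

    `utterances` is an iterable of CUI lists, one list per utterance,
    in the order the CUIs occur (an assumed input shape).
    """
    n_11 = Counter()   # (cui_1, cui_2) -> times cui_1 is followed by cui_2
    n_1p = Counter()   # cui_1 -> times cui_1 is followed by any CUI
    n_p1 = Counter()   # cui_2 -> times cui_2 is preceded by any CUI
    n_pp = 0           # total number of bigrams seen
    for cuis in utterances:
        # Pair each CUI with its immediate successor in the same utterance.
        for first, second in zip(cuis, cuis[1:]):
            n_11[(first, second)] += 1
            n_1p[first] += 1
            n_p1[second] += 1
            n_pp += 1
    return n_11, n_1p, n_p1, n_pp


# Placeholder CUIs, two utterances.
example = [
    ["C0000001", "C0000002", "C0000003"],
    ["C0000001", "C0000002"],
]
n_11, n_1p, n_p1, n_pp = count_cui_bigrams(example)
# n_11[("C0000001", "C0000002")] is 2, n_1p["C0000001"] is 2,
# n_p1["C0000002"] is 2, and n_pp is 3.
```

Because bigrams never cross utterance boundaries in this sketch, the last CUI of an utterance contributes to n_p1 but not to n_1p, and the first contributes to n_1p but not to n_p1. Summing n_1p or n_p1 over all CUIs gives n_pp in either case.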
AUTHOR
Keith Herbert, Virginia Commonwealth University
Amy Olex, Virginia Commonwealth University
COPYRIGHT
Copyright (c) 2015, Keith Herbert, Virginia Commonwealth University herbertkb at vcu dot edu
Amy Olex, Virginia Commonwealth University alolex at vcu dot edu
Bridget McInnes, Virginia Commonwealth University btmcinnes at vcu dot edu
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to:
The Free Software Foundation, Inc.,
59 Temple Place - Suite 330,
Boston, MA 02111-1307, USA.