NAME

CUICollector.pl - Scrapes MetaMap Machine Output (MMO) files to build a database of CUI bigram scores.

SYNOPSIS

$ perl CUICollector.pl --directory metamapped-baseline/2014/ 
CUICollector 0.04 - (C) 2015 Keith Herbert and Bridget McInnes
Released under the GNU GPL.
Connecting to database CUI_Bigrams on localhost
Parsing file: /home/share/data/metamapped-baseline/2014/text.out_01.gz
Parsing file: /home/share/data/metamapped-baseline/2014/text.out_02.gz
Parsing file: /home/share/data/metamapped-baseline/2014/text.out_03.gz
Parsing file: /home/share/data/metamapped-baseline/2014/text.out_02.gz
Parsing file: /home/share/data/metamapped-baseline/2014/text.out_03.gz
Entering scores into CUI_Bigrams
...
Finished

USAGE

Usage: CUICollector.pl [DATABASE OPTIONS] [OTHER OPTIONS] [FILES | DIRECTORIES]

INPUT

Required Arguments:

[FILES | DIRECTORIES]

Specify a directory containing *ONLY* compressed MetaMapped Medline Baseline files: --directory /path/to/files/

Multiple directories may also be supplied: --directory /path/to/first/folder/ /path/to/second/folder/

Likewise, specify a list of individual files: --files text.out_01.txt.gz text_mm_out_42.txt.gz text_mm_out_314.txt.gz

Or a glob of files: --files /path/to/dir/*.gz

Or just one file: --files text.out_01.txt.gz

Optional Arguments:

--database STRING

Database to contain the CUI bigram scores. DEFAULT: CUI_Bigrams

If the database is not found in MySQL, CUICollector will create it for you.

--username STRING

A username is required to access the CUI bigram database on MySQL. You will be prompted for it if it is not supplied as an argument.

--password STRING

A password is required to access the CUI bigram database on MySQL. You will be prompted for it if it is not supplied as an argument.

--hostname STRING

Hostname of the machine where the MySQL server is running. DEFAULT: localhost

--port STRING

The port your MySQL server is listening on. DEFAULT: 3306

--file_step INTEGER

How many MetaMap files to read between writes to the database. DEFAULT: 5

MMO files can be rather large, so a low file_step reduces the memory footprint of the script. A higher file_step, on the other hand, reduces the number of write operations to the database.
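
The trade-off can be illustrated with a minimal Python sketch. This is not CUICollector's own (Perl) code; `batches`, `process`, `parse_file` and `flush` are hypothetical names chosen for illustration. Counts are accumulated in memory for at most file_step files, then handed to a single write step.

```python
from collections import Counter
from itertools import islice


def batches(paths, file_step=5):
    """Yield successive lists of at most file_step paths."""
    it = iter(paths)
    while True:
        chunk = list(islice(it, file_step))
        if not chunk:
            return
        yield chunk


def process(paths, parse_file, flush, file_step=5):
    """Parse files in batches and flush the accumulated counts once per batch.

    parse_file(path) is assumed to return an iterable of (cui_1, cui_2) pairs;
    flush(counts) stands in for the database write.
    """
    for chunk in batches(paths, file_step):
        counts = Counter()  # only one batch of counts is held in memory
        for path in chunk:
            counts.update(parse_file(path))
        flush(counts)  # one write operation per batch
```

With 12 input files and file_step set to 5, this sketch performs three writes (after 5, 5 and 2 files). A smaller file_step means smaller in-memory counters but more writes.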

--debug

Sets the debug flag for testing. NOTE: extremely verbose.

--verbose

Print the current status of the program to STDOUT. This indicates the files being processed and when the program is writing to the database. This is the default output setting.

--quiet

Don't print anything to STDOUT.

--help

Displays a quick summary of program options.

OUTPUT

By default, CUICollector prints the current status of the program as it works through the MetaMap Machine Output (MMO) files (disable with `--quiet`). It creates a database (or connects to an existing one) and adds bigram scores of the CUIs it encounters in the MMO files.

The resulting database will have four tables:

N_11
cui_1   cui_2   n_11

This shows the count (n_11) for every time a particular CUI (cui_1) is immediately followed by another particular CUI (cui_2) in an utterance.

N_1P
cui_1   n_1p

This shows the count (n_1p) for every time a particular CUI (cui_1) is immediately followed by any CUI in an utterance.

N_P1
cui_2   n_p1

This shows the count (n_p1) for every time a particular CUI (cui_2) is immediately preceded by any CUI in an utterance.

N_PP
n_pp

This single value is the total count of all cui_1, cui_2 bigram pairs.
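
To make the relationship between the four tables concrete, here is a minimal Python sketch of the counting they describe. It is an illustration only, not CUICollector's Perl implementation. It assumes each utterance has already been reduced to an ordered list of CUIs, and the CUI strings are placeholders. The `contingency` helper is also an assumption: it shows one common use of such marginal counts, deriving the 2x2 table for a pair.

```python
from collections import Counter


def count_bigrams(utterances):
    """Return (n11, n1p, np1, npp) for a list of CUI sequences."""
    n11, n1p, np1 = Counter(), Counter(), Counter()
    npp = 0
    for cuis in utterances:
        # Pair each CUI with the one immediately following it.
        for first, second in zip(cuis, cuis[1:]):
            n11[(first, second)] += 1  # N_11: this exact bigram
            n1p[first] += 1            # N_1P: first is followed by any CUI
            np1[second] += 1           # N_P1: second is preceded by any CUI
            npp += 1                   # N_PP: all bigrams
    return n11, n1p, np1, npp


def contingency(pair, n11, n1p, np1, npp):
    """Derive the 2x2 contingency table cells for one (cui_1, cui_2) pair."""
    a = n11[pair]
    b = n1p[pair[0]] - a             # cui_1 followed by something else
    c = np1[pair[1]] - a             # cui_2 preceded by something else
    d = npp - n1p[pair[0]] - np1[pair[1]] + a  # neither position matches
    return a, b, c, d


# Placeholder CUIs, for illustration only.
utterances = [
    ["C0000001", "C0000002", "C0000003"],
    ["C0000001", "C0000002"],
]
n11, n1p, np1, npp = count_bigrams(utterances)
print(n11[("C0000001", "C0000002")], n1p["C0000001"], np1["C0000002"], npp)
# prints: 2 2 2 3
```

Note that n_pp equals the sum of all n_11 values, and also the sum of all n_1p values and of all n_p1 values, since every bigram contributes exactly once to each table.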

AUTHOR

Keith Herbert, Virginia Commonwealth University

COPYRIGHT

Copyright (c) 2015, Keith Herbert, Virginia Commonwealth University herbertkb at vcu edu

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to:

The Free Software Foundation, Inc.,
59 Temple Place - Suite 330,
Boston, MA  02111-1307, USA.