NAME
Word2vec::Xmltow2v - Medline XML-To-W2V Module.
SYNOPSIS
use Word2vec::Xmltow2v;
# Parameters: Debug Output = True, Write Log = False, StoreTitle = True, StoreAbstract = True, Quick Parse = True, CompoundifyText = True, Use Multi-Threading (2 Threads)
my $xmlconv = Word2vec::Xmltow2v->new( 1, 0, 1, 1, 1, 1, 2 ); # Note: Specifying no parameters implies default settings.
$xmlconv->SetWorkingDir( "Medline/XML/Directory/Here" );
$xmlconv->SetSavePath( "textcorpus.txt" );
$xmlconv->SetStoreTitle( 1 );
$xmlconv->SetStoreAbstract( 1 );
$xmlconv->SetBeginDate( "01/01/2004" );
$xmlconv->SetEndDate( "08/13/2016" );
$xmlconv->SetOverwriteExistingFile( 1 );
# If Compound Word File Exists, Store It In Memory And Create Compound Word Binary Search Tree
$xmlconv->ReadCompoundWordDataFromFile( "compoundword.txt", 1 );
$xmlconv->CreateCompoundWordBST();
# Parse XML Files or Directory Of Files
$xmlconv->ConvertMedlineXMLToW2V( "/xmlDirectory/" );
undef( $xmlconv );
DESCRIPTION
Word2vec::Xmltow2v is an XML-to-text module that converts Medline XML article title and abstract data, within a given date range, into a plain text corpus for use with Word2vec::Interface. When given a compound word file, it also "compoundifies" text during corpus compilation.
Main Functions
new
Description:
Returns a new 'Word2vec::Xmltow2v' module object.
Note: Specifying no parameters implies default options.
Default Parameters:
debugLog = 0
writeLog = 0
storeTitle = 1
storeAbstract = 1
quickParse = 0
compoundifyText = 0
storeAsSentencePerLine = 0
numOfThreads = Number of CPUs/CPU cores (1 thread per core/CPU)
workingDir = Current Directory
savePath = Current Directory
beginDate = "00/00/0000"
endDate = "99/99/9999"
xmlStringToParse = "(null)"
textCorpusString = ""
twigHandler = 0
parsedCount = 0
tempDate = ""
tempStr = ""
outputFileName = "textcorpus.txt"
compoundWordAry = ()
compoundWordBST = Word2vec::Bst->new()
maxCompoundWordLength = 0
overwriteExistingFile = 0
Input:
$debugLog -> Instructs module to print debug statements to the console. (1 = True / 0 = False)
$writeLog -> Instructs module to print debug statements to a log file. (1 = True / 0 = False)
$storeTitle -> Instructs module to store Medline article titles during text corpus compilation. (1 = True / 0 = False)
$storeAbstract -> Instructs module to store Medline article abstracts during text corpus compilation. (1 = True / 0 = False)
$quickParse -> Instructs module to utilize quick XML parsing functions for known Medline article title and abstract tags. (1 = True / 0 = False)
$compoundifyText -> Instructs module to compoundify text on the fly given a compound word file. This is automatically set
when reading the compound word file to memory regardless of user setting. (1 = True / 0 = False)
$storeAsSentencePerLine -> Instructs module to store parsed Medline data either as one long single sentence or as separate sentences on new lines, split on the period character. (1 = True / 0 = False)
$numOfThreads -> Specifies the number of worker threads which parse Medline XML files simultaneously to create the text corpus.
Text corpus generation speeds up roughly in proportion to the number of physical cores used on a given machine. (Positive integer value)
e.g. Using four threads on an Intel i7 machine makes text corpus generation roughly four times faster than single-threaded operation.
$workingDir -> Specifies the current working directory. (String)
$savePath -> Specifies the save path for text corpus generation. (String)
$beginDate -> Specifies the beginning date range for Medline article text corpus composition. (Format: XX/XX/XXXX)
$endDate -> Specifies the ending date range for Medline article text corpus composition. (Format: XX/XX/XXXX)
$xmlStringToParse -> Storage location for the current Medline XML file in memory. (String)
$textCorpusString -> Temporary storage location for text corpus generation in memory. (String)
$twigHandler -> XML::Twig object location.
$parsedCount -> Number of parsed Medline articles during text corpus generation.
$tempDate -> Temporary storage location for current Medline article date during text corpus compilation.
$tempStr -> Temporary storage location for current Medline article title/abstract during text corpus compilation.
$outputFileName -> Output file path/name.
$compoundWordAry -> Storage location for compound words, used to compoundify text. (Array) <- Deprecated
$compoundWordBST -> Storage location for compound words, used to compoundify text. (Binary Search Tree) <- Supersedes '$compoundWordAry'
$maxCompoundWordLength -> Maximum number of words able to be compoundified in one phrase. e.g. "six_sea_snakes_were_sailing" = 5 compoundified words.
The compounding algorithm will attempt to compoundify no more than this many words, even though the compound word list could
contain longer compounded phrases.
$overwriteExistingFile -> Instructs the module to either overwrite any existing text corpus files or append to the existing file.
Note: It is not recommended to specify all new() parameters, as it has not been thoroughly tested. Maximum recommended parameters to be specified include:
"debugLog, writeLog, storeTitle, storeAbstract, quickParse, compoundifyText, numOfThreads, workingDir, savePath, beginDate, endDate"
Output:
Word2vec::Xmltow2v object.
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new(); # Note: Specifying no parameters implies default settings as listed above.
undef( $xmlconv );
# Or
use Word2vec::Xmltow2v;
# Parameters: Debug Output = True, Write Log = False, StoreTitle = True, StoreAbstract = True, Quick Parse = True, CompoundifyText = True, Use Multi-Threading (2 Threads)
my $xmlconv = Word2vec::Xmltow2v->new( 1, 0, 1, 1, 1, 1, 2 );
undef( $xmlconv );
DESTROY
Description:
Removes module objects and variables from memory.
Input:
None
Output:
None
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->DESTROY();
undef( $xmlconv );
ConvertMedlineXMLToW2V
Description:
Parses the specified Medline XML file or directory of files, creating a text corpus. Returns 0 if successful or -1 on error.
Note: Supports plain Medline XML or gzipped XML files.
Input:
$filePath -> XML file path to parse. (This can be a single file or directory of XML/XML.gz files).
Output:
$value -> '0' = Successful / '-1' = Un-successful
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new(); # Note: Specifying no parameters implies default settings
$xmlconv->SetSavePath( "testCorpus.txt" );
$xmlconv->SetStoreTitle( 1 );
$xmlconv->SetStoreAbstract( 1 );
$xmlconv->SetBeginDate( "01/01/2004" );
$xmlconv->SetEndDate( "08/13/2016" );
$xmlconv->SetOverwriteExistingFile( 1 );
$xmlconv->ConvertMedlineXMLToW2V( "/xmlDirectory/" );
undef( $xmlconv );
_ThreadedConvert
Description:
Multi-Threaded Medline XML to text corpus conversion function.
Input:
$directory -> File or directory of files to parse.
Output:
$value -> '0' = Successful / '-1' = Un-successful
Example:
Warning: This is a private function called by 'ConvertMedlineXMLToW2V()'. It should not be called outside of xmltow2v module.
_ParseXMLString
Description:
Parses passed string parameter for Medline XML article title and abstract data and appends found data to the text corpus.
Input:
$string -> Medline XML string data to parse.
Output:
None
Example:
Warning: This is a private function called by "ConvertMedlineXMLToW2V()" and "_ThreadedConvert()". It should not be called outside of xmltow2v module.
_CheckParseRequirements
Description:
Checks passed string parameter to see if it contains relevant data and XML::Twig handler is initialized.
Input:
$string -> String data to check
Output:
$value -> '0' = Successful / '-1' = Un-successful
Example:
Warning: This is a private function called by "_ParseXMLString()". It should not be called outside of xmltow2v module.
_CheckForNullData
Description:
Checks passed string parameter for "(null)" string.
Input:
$string -> String data to be checked.
Output:
$value -> '1' = True/Null data or '0' = False/Valid data
Example:
Warning: This is a private function called by "new()" and "_ParseXMLString()". It should not be called outside of xmltow2v module.
_RemoveXMLVersion
Description:
Removes the XML Version string prior to parsing the XML string data. (Deprecated)
Input:
$string -> Medline XML string data
Output:
None
Example:
Warning: This is a private function called by "new()" and "_ParseXMLString()". It should not be called outside of xmltow2v module.
_ParseMedlineCitationSet
Description:
Parses 'MedlineCitationSet' tag data in Medline XML file.
Input:
$twigHandler -> XML::Twig handler
$root -> Beginning of XML directory to parse. ( Directory in Medline XML string data )
Output:
None
Example:
Warning: This is a private function and is called by xmltow2v's XML::Twig handler. It should not be called outside of xmltow2v module.
_ParseMedlineArticle
Description:
Parses 'MedlineArticle' tag data in Medline XML file.
Input:
$medlineArticle -> Current Medline article directory in XML data (XML::Twig directory)
Output:
$value -> '1' = Finished parsing Medline article.
Example:
Warning: This is a private function and is called by xmltow2v's XML::Twig handler. It should not be called outside of xmltow2v module.
_ParseDateCreated
Description:
Parses 'DateCreated' tag data in Medline XML file.
Input:
$article -> Current Medline article in XML data (XML::Twig directory)
Output:
$date -> 'XX/XX/XXXX' (Month/Day/Year)
Example:
Warning: This is a private function and is called by xmltow2v's XML::Twig handler. It should not be called outside of xmltow2v module.
_ParseArticle
Description:
Parses 'Article' tag data in Medline XML file. Fetches 'ArticleTitle', 'Journal' and 'Abstract' XML tags.
Input:
$article -> Current Medline article in XML data (XML::Twig directory)
Output:
None
Example:
Warning: This is a private function and is called by xmltow2v's XML::Twig handler. It should not be called outside of xmltow2v module.
_ParseJournal
Description:
Parses 'Journal' tag data in Medline XML file. Fetches 'Title' XML tag.
Input:
$journalRoot -> Current Medline journal directory in XML data (XML::Twig directory)
Output:
None
Example:
Warning: This is a private function and is called by xmltow2v's XML::Twig handler. It should not be called outside of xmltow2v module.
_ParseOtherAbstract
Description:
Parses 'Abstract' tag data in Medline XML file. Fetches 'AbstractText' XML tag.
Input:
$abstractRoot -> Current Medline abstract directory in XML data (XML::Twig directory)
Output:
None
Example:
Warning: This is a private function and is called by xmltow2v's XML::Twig handler. It should not be called outside of xmltow2v module.
_QuickParseDateCreated
Description:
Parses 'DateCreated' tag data in Medline XML file. Used when 'QuickParse' member variable is enabled. Sets $tempDate member variable to parsed 'DateCreated' tag data.
Input:
$twigHandler -> 'XML::Twig' handler
$article -> Current Medline article directory in XML data (XML::Twig directory)
Output:
None
Example:
Warning: This is a private function and is called by xmltow2v's XML::Twig handler. It should not be called outside of xmltow2v module.
_QuickParseJournal
Description:
Parses 'Journal' tag data in Medline XML file. Fetches 'Title' XML tag. Used when 'QuickParse' member variable is enabled.
Sets $tempStr to parsed data and stores in text corpus.
Input:
$twigHandler -> 'XML::Twig' handler.
$journalRoot -> Current Medline journal directory in XML data (XML::Twig directory)
Output:
None
Example:
Warning: This is a private function and is called by xmltow2v's XML::Twig handler. It should not be called outside of xmltow2v module.
_QuickParseArticle
Description:
Parses 'Article' tag data in Medline XML file. Fetches 'ArticleTitle' and 'Abstract' XML tags. Used when 'QuickParse' member variable is enabled.
Sets $tempStr to parsed data and stores in text corpus.
Input:
$twigHandler -> 'XML::Twig' handler.
$article -> Current Medline article directory in XML data (XML::Twig directory)
Output:
None
Example:
Warning: This is a private function and is called by xmltow2v's XML::Twig handler. It should not be called outside of xmltow2v module.
_QuickParseOtherAbstract
Description:
Parses 'Abstract' tag data in Medline XML file. Fetches 'AbstractText' XML tag. Used when 'QuickParse' member variable is enabled.
Sets $tempStr to parsed data and stores in text corpus.
Input:
$twigHandler -> 'XML::Twig' handler.
$abstractRoot -> Current Medline abstract directory in XML data (XML::Twig directory)
Output:
None
Example:
Warning: This is a private function and is called by xmltow2v's XML::Twig handler. It should not be called outside of xmltow2v module.
CreateCompoundWordBST
Description:
Creates a binary search tree using compound word data in memory and stores root node.
Warning: Compound word file must be loaded into memory using ReadCompoundWordDataFromFile() prior to calling this method. This function
also deletes the compound word array upon completion, as it is no longer necessary.
Input:
None
Output:
$value -> '0' = Successful / '-1' = Un-successful
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->ReadCompoundWordDataFromFile( "samples/compoundword.txt" );
$xmlconv->CreateCompoundWordBST();
CompoundifyString
Description:
Compoundifies string parameter based on compound word data in memory using the compound word binary search tree.
Warning: Compound word file must be loaded into memory using ReadCompoundWordDataFromFile() prior to calling this method.
Input:
$string -> String to compoundify
Output:
$string -> Compounded string or "(null)" if string parameter is not defined.
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->ReadCompoundWordDataFromFile( "samples/compoundword.txt" );
$xmlconv->CreateCompoundWordBST();
my $compoundedString = $xmlconv->CompoundifyString( "String to compoundify" );
print( "Compounded String: $compoundedString\n" );
undef( $xmlconv );
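The idea behind compoundifying, bounded by a maximum phrase length, can be sketched without the module. The following is a minimal illustration only: it uses a plain hash lookup instead of the module's binary search tree, and the helper name compoundify, its parameters and the sample phrases are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: joins known multi-word phrases with underscores.
# $phrases is a hash reference keyed by space-separated phrases.
# $max_len caps how many words one compound phrase may contain.
sub compoundify {
    my ( $text, $phrases, $max_len ) = @_;
    my @words = split ' ', $text;
    my @out;
    my $i = 0;
    while ( $i < @words ) {
        my $matched = 0;
        # Never look past the end of the word list.
        my $limit = $max_len;
        $limit = @words - $i if $limit > @words - $i;
        # Try the longest candidate first so longer phrases win.
        for ( my $len = $limit; $len >= 2; $len-- ) {
            my @slice = @words[ $i .. $i + $len - 1 ];
            if ( exists $phrases->{ join ' ', @slice } ) {
                push @out, join( '_', @slice );
                $i += $len;
                $matched = 1;
                last;
            }
        }
        next if $matched;
        # No phrase starts here: keep the single word unchanged.
        push @out, $words[$i];
        $i++;
    }
    return join ' ', @out;
}

my %phrases = ( "heart attack" => 1, "acute myocardial infarction" => 1 );
print compoundify( "heart attack is also known as acute myocardial infarction", \%phrases, 3 ), "\n";
```

With a maximum length of 3 both phrases are joined; with a maximum length of 2 the three-word phrase is left untouched, which mirrors the cap described for maxCompoundWordLength.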
_CompoundifySearch
Description:
Recursive method used by CompoundifyString() to fetch compound word data in binary search tree.
Warning: This function requires specific parameters and should not be called outside of CompoundifyString() method.
Input:
$stringArrayRef -> Array reference containing string data
$oldNode -> Last 'Word2vec::Node' data match was found
$searchStr -> Search phrase
$index -> Current string array index
Output:
Word2vec::Node -> Last node containing positive search phrase match
Example:
Warning: This is a private function and is called by 'CompoundifyString()'. It should not be called outside of xmltow2v module.
ReadCompoundWordDataFromFile
Description:
Reads compound word file and stores in memory. $autoSetMaxCompWordLength parameter is not required to be set. This
parameter instructs the method to auto set the maximum compound word length dependent on the longest compound word found.
Note: $autoSetMaxCompWordLength options: defined = True and Undefined = False.
Input:
$filePath -> Compound word file path
$autoSetMaxCompWordLength -> If defined, sets the maximum compound word length to the length of the longest compound phrase found in the file. (Optional)
Note: Calling this method with $autoSetMaxCompWordLength defined will automatically set the maxCompoundWordLength variable to the longest compound phrase.
Output:
$value -> '0' = Successful / '-1' = Un-successful
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->ReadCompoundWordDataFromFile( "samples/compoundword.txt", 1 );
undef( $xmlconv );
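Auto-setting the maximum compound word length amounts to finding the phrase with the most words. The sketch below assumes each compound phrase is a string of space-separated words; the actual file format and the helper name longest_phrase_length are assumptions, not taken from the module.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the word count of the longest phrase given.
sub longest_phrase_length {
    my ( @phrases ) = @_;
    my $max = 0;
    for my $phrase ( @phrases ) {
        my @words = split ' ', $phrase;
        # An array in numeric context yields its element count.
        $max = @words if @words > $max;
    }
    return $max;
}

print longest_phrase_length( "heart attack", "acute myocardial infarction" ), "\n";
```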
SaveCompoundWordListToFile
Description:
Saves compound word data in memory to a specified file location.
Input:
$savePath -> Path to save compound word list to file.
Output:
$value -> '0' = Successful / '-1' = Un-successful
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->ReadCompoundWordDataFromFile( "samples/compoundword.txt" );
$xmlconv->SaveCompoundWordListToFile( "samples/newcompoundword.txt" );
undef( $xmlconv );
ReadTextFromFile
Description:
Reads a plain text file with utf8 encoding into memory. Returns string data if successful and "(null)" if unsuccessful.
Input:
$filePath -> Text file to read into memory
Output:
$string -> String data if successful or "(null)" if un-successful.
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
my $textData = $xmlconv->ReadTextFromFile( "samples/textcorpus.txt" );
print( "Text Data: $textData\n" );
undef( $xmlconv );
SaveTextToFile
Description:
Saves a plain text file with utf8 encoding in a specified location.
Input:
$savePath -> Path to save string data.
$string -> String to save
Output:
$value -> '0' = Successful / '-1' = Un-successful
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
my $result = $xmlconv->SaveTextToFile( "text.txt", "Hello world!" );
print( "File saved\n" ) if $result == 0;
print( "File unable to save\n" ) if $result == -1;
undef( $xmlconv );
_ReadXMLDataFromFile
Description:
Reads an XML file from a specified location. Returns string in memory if successful and "(null)" if unsuccessful.
Input:
$filePath -> File to read given path
Output:
$value -> '0' = Successful / '-1' = Un-successful
Example:
Warning: This is a private function and is called by XML::Twig parsing functions. It should not be called outside of xmltow2v module.
_SaveTextCorpusToFile
Description:
Saves text corpus data to specified file path. This method will append to any existing file if $appendToFile parameter
is defined or "overwrite" option is disabled. Enabling "overwrite" option will overwrite any existing files.
Input:
$savePath -> Path to save the text corpus
$appendToFile -> Specifies whether the module will overwrite any existing data or append to existing text corpus data.
Note: Leaving this variable undefined will fetch the "Overwrite" member variable and set the value to this parameter.
Output:
$value -> '0' = Successful / '-1' = Un-successful
Example:
Warning: This is a private function and is called by XML::Twig parsing functions. It should not be called outside of xmltow2v module.
IsDateInSpecifiedRange
Description:
Checks to see if $date is within $beginDate and $endDate range. Returns 1 if true and 0 if false.
Note: Date Format: XX/XX/XXXX (Month/Day/Year)
Input:
$date -> Date to check against minimum and maximum date range. (String)
$beginDate -> Minimum date range (String)
$endDate -> Maximum date range (String)
Output:
$value -> '1' = True/Date is within specified range Or '0' = False/Date is not within specified range.
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
print( "Is \"01/01/2004\" within the date range: \"02/21/1985\" to \"08/13/2016\"?\n" );
print( "Yes\n" ) if $xmlconv->IsDateInSpecifiedRange( "01/01/2004", "02/21/1985", "08/13/2016" ) == 1;
print( "No\n" ) if $xmlconv->IsDateInSpecifiedRange( "01/01/2004", "02/21/1985", "08/13/2016" ) == 0;
undef( $xmlconv );
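One way such a range check can work is to reorder each Month/Day/Year string into a sortable YYYYMMDD number and compare numerically. The sketch below is illustrative and not taken from the module's source: date_key and is_date_in_range are hypothetical helpers, and treating both bounds as inclusive is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: converts "MM/DD/YYYY" into the number YYYYMMDD.
# Returns undef when the string does not match the expected format.
sub date_key {
    my ( $date ) = @_;
    return undef unless defined $date && $date =~ m{^(\d{2})/(\d{2})/(\d{4})$};
    return "$3$1$2" + 0;
}

# Hypothetical helper: 1 if $date lies within [$begin, $end], else 0.
# Assumption: both bounds are inclusive.
sub is_date_in_range {
    my ( $date, $begin, $end ) = @_;
    my ( $key, $low, $high ) = map { date_key( $_ ) } ( $date, $begin, $end );
    return 0 unless defined $key && defined $low && defined $high;
    return ( $key >= $low && $key <= $high ) ? 1 : 0;
}

print is_date_in_range( "01/01/2004", "02/21/1985", "08/13/2016" ), "\n";
print is_date_in_range( "01/01/1980", "02/21/1985", "08/13/2016" ), "\n";
```

Under this scheme the default bounds "00/00/0000" and "99/99/9999" sort below and above every real date, so they act as an open range.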
IsFileOrDirectory
Description:
Checks to see if specified path is a file or directory.
Input:
$path -> File or directory path. (String)
Output:
$string -> Returns: "file" = file, "dir" = directory and "unknown" if the path is not a file or directory (undefined).
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
my $path = "path/to/a/directory";
print( "Is \"$path\" a file or directory? " . $xmlconv->IsFileOrDirectory( $path ) . "\n" );
$path = "path/to/a/file.file";
print( "Is \"$path\" a file or directory? " . $xmlconv->IsFileOrDirectory( $path ) . "\n" );
undef( $xmlconv );
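A check of this kind can be built on Perl's built-in file test operators -f and -d. The sketch below is an assumption about how the three return values could be produced, not the module's actual code; path_kind is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: classifies a path as "file", "dir" or "unknown".
sub path_kind {
    my ( $path ) = @_;
    return "unknown" unless defined $path;
    return "file" if -f $path;    # plain file
    return "dir"  if -d $path;    # directory
    return "unknown";             # missing path or another file type
}

print path_kind( "." ), "\n";
print path_kind( "path/that/does/not/exist" ), "\n";
```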
RemoveSpecialCharactersFromString
Description:
Removes special characters from string parameter, removes extra spaces and converts text to lowercase.
Note: This method is called when parsing and compiling Medline title/abstract data.
Input:
$string -> String passed to remove special characters from and convert to lowercase.
Output:
$string -> String with all special characters removed and converted to lowercase.
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
my $str = "Heart Attack is$ an!@ also KNOWN as an Acute MYOCARDIAL inFARCTion!";
print( "Original String: $str\n" );
$str = $xmlconv->RemoveSpecialCharactersFromString( $str );
print( "Modified String: $str\n" );
undef( $xmlconv );
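The kind of normalization described here can be approximated with a few substitutions. This is a sketch under stated assumptions: it treats every character other than an ASCII letter, digit or whitespace as "special", which may differ from the module's actual rules, and normalize_text is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: lowercases text, strips special characters
# and collapses runs of whitespace into single spaces.
sub normalize_text {
    my ( $str ) = @_;
    return "" unless defined $str;
    $str = lc $str;
    $str =~ s/[^a-z0-9\s]//g;    # drop anything that is not a letter, digit or whitespace
    $str =~ s/\s+/ /g;           # collapse whitespace runs
    $str =~ s/^ | $//g;          # trim leading and trailing space
    return $str;
}

print normalize_text( 'Heart Attack is$ an!@ also KNOWN as an Acute MYOCARDIAL inFARCTion!' ), "\n";
```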
GetFileType
Description:
Returns file data type (string).
Input:
$filePath -> File to check located at file path
Output:
$string -> File type
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
my $fileType = $xmlconv->GetFileType( "samples/textcorpus.txt" );
undef( $xmlconv );
_DateCheck
Description:
Checks specified begin and end date strings for formatting and logic errors.
Input:
None
Output:
$value -> "0" = Passed Checks / "-1" = Failed Checks
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
print "Passed Date Checks\n" if ( $xmlconv->_DateCheck() == 0 );
print "Failed Date Checks\n" if ( $xmlconv->_DateCheck() == -1 );
undef( $xmlconv );
Accessor Functions
GetDebugLog
Description:
Returns the _debugLog member variable set during Word2vec::Xmltow2v object initialization of new function.
Input:
None
Output:
$value -> '0' = False, '1' = True
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
my $debugLog = $xmlconv->GetDebugLog();
print( "Debug Logging Enabled\n" ) if $debugLog == 1;
print( "Debug Logging Disabled\n" ) if $debugLog == 0;
undef( $xmlconv );
GetWriteLog
Description:
Returns the _writeLog member variable set during Word2vec::Xmltow2v object initialization of new function.
Input:
None
Output:
$value -> '0' = False, '1' = True
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
my $writeLog = $xmlconv->GetWriteLog();
print( "Write Logging Enabled\n" ) if $writeLog == 1;
print( "Write Logging Disabled\n" ) if $writeLog == 0;
undef( $xmlconv );
GetStoreTitle
Description:
Returns the _storeTitle member variable set during Word2vec::Xmltow2v object instantiation of new function.
Input:
None
Output:
$value -> '1' = True / '0' = False
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
my $storeTitle = $xmlconv->GetStoreTitle();
print( "Store Title Option: Enabled\n" ) if $storeTitle == 1;
print( "Store Title Option: Disabled\n" ) if $storeTitle == 0;
undef( $xmlconv );
GetStoreAbstract
Description:
Returns the _storeAbstract member variable set during Word2vec::Xmltow2v object instantiation of new function.
Input:
None
Output:
$value -> '1' = True / '0' = False
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
my $storeAbstract = $xmlconv->GetStoreAbstract();
print( "Store Abstract Option: Enabled\n" ) if $storeAbstract == 1;
print( "Store Abstract Option: Disabled\n" ) if $storeAbstract == 0;
undef( $xmlconv );
GetQuickParse
Description:
Returns the _quickParse member variable set during Word2vec::Xmltow2v object instantiation of new function.
Input:
None
Output:
$value -> '1' = True / '0' = False
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
my $quickParse = $xmlconv->GetQuickParse();
print( "Quick Parse Option: Enabled\n" ) if $quickParse == 1;
print( "Quick Parse Option: Disabled\n" ) if $quickParse == 0;
undef( $xmlconv );
GetCompoundifyText
Description:
Returns the _compoundifyText member variable set during Word2vec::Xmltow2v object instantiation of new function.
Input:
None
Output:
$value -> '1' = True / '0' = False
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
my $compoundify = $xmlconv->GetCompoundifyText();
print( "Compoundify Text Option: Enabled\n" ) if $compoundify == 1;
print( "Compoundify Text Option: Disabled\n" ) if $compoundify == 0;
undef( $xmlconv );
GetStoreAsSentencePerLine
Description:
Returns the _storeAsSentencePerLine member variable set during Word2vec::Xmltow2v object instantiation of new function.
Input:
None
Output:
$value -> '1' = True / '0' = False
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
my $storeAsSentencePerLine = $xmlconv->GetStoreAsSentencePerLine();
print( "Store As Sentence Per Line: Enabled\n" ) if $storeAsSentencePerLine == 1;
print( "Store As Sentence Per Line: Disabled\n" ) if $storeAsSentencePerLine == 0;
undef( $xmlconv );
GetNumOfThreads
Description:
Returns the _numOfThreads member variable set during Word2vec::Xmltow2v object instantiation of new function.
Input:
None
Output:
$value -> Number of threads
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
my $numOfThreads = $xmlconv->GetNumOfThreads();
print( "Number of threads: $numOfThreads\n" );
undef( $xmlconv );
GetWorkingDir
Description:
Returns the _workingDir member variable set during Word2vec::Xmltow2v object instantiation of new function.
Input:
None
Output:
$string -> Working directory string
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
my $workingDirectory = $xmlconv->GetWorkingDir();
print( "Working Directory: $workingDirectory\n" );
undef( $xmlconv );
GetSavePath
Description:
Returns the _saveDir member variable set during Word2vec::Xmltow2v object instantiation of new function.
Input:
None
Output:
$string -> Save directory string
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
my $savePath = $xmlconv->GetSavePath();
print( "Save Directory: $savePath\n" );
undef( $xmlconv );
GetBeginDate
Description:
Returns the _beginDate member variable set during Word2vec::Xmltow2v object instantiation of new function.
Input:
None
Output:
$date -> Beginning date range - Format: XX/XX/XXXX (Mon/Day/Year)
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
my $date = $xmlconv->GetBeginDate();
print( "Date: $date\n" );
undef( $xmlconv );
GetEndDate
Description:
Returns the _endDate member variable set during Word2vec::Xmltow2v object instantiation of new function.
Input:
None
Output:
$date -> End date range - Format: XX/XX/XXXX (Mon/Day/Year).
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
my $date = $xmlconv->GetEndDate();
print( "Date: $date\n" );
undef( $xmlconv );
GetXMLStringToParse
Description:
Returns the _xmlStringToParse member variable (the XML data string to be parsed) set during Word2vec::Xmltow2v object instantiation of new function.
Input:
None
Output:
$string -> Medline XML data string
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
my $xmlStr = $xmlconv->GetXMLStringToParse();
print( "XML String: $xmlStr\n" );
undef( $xmlconv );
GetTextCorpusStr
Description:
Returns the _textCorpusStr member variable set during Word2vec::Xmltow2v object instantiation of new function.
Input:
None
Output:
$string -> Text corpus string
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
my $str = $xmlconv->GetTextCorpusStr();
print( "Text Corpus: $str\n" );
undef( $xmlconv );
GetFileHandle
Description:
Returns the _fileHandle member variable set during Word2vec::Xmltow2v object instantiation of new function.
Warning: This is a private function. File handle is used by WriteLog() method. Do not manipulate this file handle as errors can result.
Input:
None
Output:
$fileHandle -> Returns file handle for WriteLog() method.
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
my $fileHandle = $xmlconv->GetFileHandle();
undef( $xmlconv );
GetTwigHandler
Description:
Returns the _twigHandler member variable (the XML::Twig handler) set during Word2vec::Xmltow2v object instantiation of new function.
Warning: This is a private function and should not be called or manipulated.
Input:
None
Output:
$twigHandler -> XML::Twig handler.
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
my $xmlHandler = $xmlconv->GetTwigHandler();
undef( $xmlconv );
GetParsedCount
Description:
Returns the _parsedCount member variable set during Word2vec::Xmltow2v object instantiation of new function.
Input:
None
Output:
$value -> Number of parsed Medline articles.
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
my $numOfParsed = $xmlconv->GetParsedCount();
print( "Number of parsed Medline articles: $numOfParsed\n" );
undef( $xmlconv );
GetTempStr
Description:
Returns the _tempStr member variable set during Word2vec::Xmltow2v object instantiation of new function.
Warning: This is a private function and should not be called or manipulated. Used by module as a temporary storage
location for parsed Medline 'Title' and 'Abstract' tag string data.
Input:
None
Output:
$string -> Temporary string storage location.
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
my $tempStr = $xmlconv->GetTempStr();
print( "Temp String: $tempStr\n" );
undef( $xmlconv );
GetTempDate
Description:
Returns the _tempDate member variable set during Word2vec::Xmltow2v object instantiation of new function.
Used by module as a temporary storage location for parsed Medline 'DateCreated' tag string data.
Input:
None
Output:
$date -> Date string - Format: XX/XX/XXXX (Mon/Day/Year).
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
my $date = $xmlconv->GetTempDate();
print( "Temp Date: $date\n" );
undef( $xmlconv );
GetCompoundWordAry
Description:
Returns the _compoundWordAry member array reference set during Word2vec::Xmltow2v object instantiation of new function.
Warning: Compound word data must be loaded in memory first via ReadCompoundWordDataFromFile().
Input:
None
Output:
$arrayReference -> Compound word array reference.
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
my $arrayReference = $xmlconv->GetCompoundWordAry();
my @compoundWord = @{ $arrayReference };
print( "Compound Word Array: @compoundWord\n" );
undef( $xmlconv );
GetCompoundWordBST
Description:
Returns the _compoundWordBST member variable set during Word2vec::Xmltow2v object instantiation of new function.
Input:
None
Output:
$bst -> Compound word binary search tree.
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
my $bst = $xmlconv->GetCompoundWordBST();
undef( $xmlconv );
GetMaxCompoundWordLength
Description:
Returns the _maxCompoundWordLength member variable set during Word2vec::Xmltow2v object instantiation of new function.
Note: If not defined, it is automatically set to and returns 20.
Input:
None
Output:
$value -> Maximum number of compound words in a given phrase.
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
my $compoundWordLength = $xmlconv->GetMaxCompoundWordLength();
print( "Maximum Compound Word Length: $compoundWordLength\n" );
undef( $xmlconv );
GetOverwriteExistingFile
Description:
Returns the _overwriteExisitingFile member variable set during Word2vec::Xmltow2v object instantiation of new function.
Enables overwriting of existing text corpus if set to '1' or appends to the existing text corpus if set to '0'.
Input:
None
Output:
$value -> '1' = Overwrite existing file / '0' = Append to existing file.
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
my $overwriteExistingFile = $xmlconv->GetOverwriteExistingFile();
print( "Overwrite Existing File? YES\n" ) if ( $overwriteExistingFile == 1 );
print( "Overwrite Existing File? NO\n" ) if ( $overwriteExistingFile == 0 );
undef( $xmlconv );
Mutator Functions
SetStoreTitle
Description:
Sets member variable to passed integer parameter. Instructs module to store article title if true or omit if false.
Input:
$value -> '1' = Store Titles / '0' = Omit Titles
Output:
None
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->SetStoreTitle( 1 );
undef( $xmlconv );
SetStoreAbstract
Description:
Sets member variable to passed integer parameter. Instructs module to store article abstracts if true or omit if false.
Input:
$value -> '1' = Store Abstracts / '0' = Omit Abstracts
Output:
None
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->SetStoreAbstract( 1 );
undef( $xmlconv );
SetWorkingDir
Description:
Sets member variable to passed string parameter. Represents the working directory.
Input:
$string -> Working directory string
Output:
None
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->SetWorkingDir( "/samples/" );
undef( $xmlconv );
SetSavePath
Description:
Sets member variable to passed string parameter. Represents the text corpus save path.
Input:
$string -> Text corpus save path
Output:
None
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->SetSavePath( "samples/textcorpus.txt" );
undef( $xmlconv );
SetQuickParse
Description:
Sets member variable to passed integer parameter. Instructs module to utilize quick parse
routines to speed up text corpus compilation. This method is somewhat less accurate due to its non-exhaustive nature.
Input:
$value -> '1' = Enable Quick Parse / '0' = Disable Quick Parse
Output:
None
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->SetQuickParse( 1 );
undef( $xmlconv );
SetCompoundifyText
Description:
Sets member variable to passed integer parameter. Instructs module to utilize 'compoundify' option if true.
Warning: This requires compound word data to be loaded into memory with ReadCompoundWordDataFromFile() method prior
to executing text corpus compilation.
Input:
$value -> '1' = Compoundify text / '0' = Do not compoundify text
Output:
None
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->SetCompoundifyText( 1 );
undef( $xmlconv );
SetStoreAsSentencePerLine
Description:
Sets member variable to passed integer parameter. Instructs module to utilize 'storeAsSentencePerLine' option if true.
Input:
$value -> '1' = Store as sentence per line / '0' = Do not store as sentence per line
Output:
None
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->SetStoreAsSentencePerLine( 1 );
undef( $xmlconv );
SetNumOfThreads
Description:
Sets member variable to passed integer parameter. Sets the requested number of threads to parse Medline XML files
and compile the text corpus.
Input:
$value -> Integer (Positive value)
Output:
None
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->SetNumOfThreads( 4 );
undef( $xmlconv );
SetBeginDate
Description:
Sets member variable to passed string parameter. Sets beginning date range for earliest articles to store, by
'DateCreated' Medline tag, within the text corpus during compilation.
Note: Expected format - "XX/XX/XXXX" (Mon/Day/Year)
Input:
$string -> Date string - Format: "XX/XX/XXXX"
Output:
None
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->SetBeginDate( "01/01/2004" );
undef( $xmlconv );
SetEndDate
Description:
Sets member variable to passed string parameter. Sets ending date range for latest article to store, by
'DateCreated' Medline tag, within the text corpus during compilation.
Note: Expected format - "XX/XX/XXXX" (Mon/Day/Year)
Input:
$string -> Date string - Format: "XX/XX/XXXX"
Output:
None
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->SetEndDate( "08/13/2016" );
undef( $xmlconv );
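The begin and end dates together describe an inclusive date window. The following is a minimal sketch of how such a window could be checked, assuming the "MM/DD/YYYY" format described above. The helper names date_key and in_date_range are hypothetical and are not part of Word2vec::Xmltow2v; the module's own comparison logic is not shown in this documentation and may differ.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Word2vec::Xmltow2v):
# converts "MM/DD/YYYY" into a sortable "YYYYMMDD" string.
sub date_key {
    my ( $date ) = @_;
    my ( $month, $day, $year ) = $date =~ m{^(\d{2})/(\d{2})/(\d{4})$}
        or return undef;
    return "$year$month$day";
}

# Hypothetical helper: returns 1 if $date lies within [$begin, $end], else 0.
sub in_date_range {
    my ( $date, $begin, $end ) = @_;
    my ( $key, $low, $high ) = map { date_key( $_ ) } ( $date, $begin, $end );
    return 0 unless defined( $key ) && defined( $low ) && defined( $high );

    # Equal-length digit strings compare correctly as strings.
    return ( $key ge $low && $key le $high ) ? 1 : 0;
}

print( "In range\n" ) if in_date_range( "06/15/2010", "01/01/2004", "08/13/2016" );
```

Under this scheme the default values "00/00/0000" and "99/99/9999" behave as an unbounded range, since every real date sorts between them.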
SetXMLStringToParse
Description:
Sets member variable to passed string parameter. This string normally consists of Medline XML data to be
parsed for text corpus compilation.
Warning: This is a private function and should not be called or manipulated.
Input:
$string -> String
Output:
None
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->SetXMLStringToParse( "Hello World!" );
undef( $xmlconv );
SetTextCorpusStr
Description:
Sets member variable to passed string parameter. Overwrites any stored text corpus data in memory to the string parameter.
Warning: This is a private function and should not be called or manipulated.
Input:
$string -> String
Output:
None
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->SetTextCorpusStr( "Hello World!" );
undef( $xmlconv );
AppendStrToTextCorpus
Description:
Appends string parameter to text corpus string in memory.
Warning: This is a private function and should not be called or manipulated.
Input:
$string -> String
Output:
None
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->AppendStrToTextCorpus( "Hello World!" );
undef( $xmlconv );
ClearTextCorpus
Description:
Clears text corpus data in memory.
Warning: This is a private function and should not be called or manipulated.
Input:
None
Output:
None
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->ClearTextCorpus();
undef( $xmlconv );
SetTempStr
Description:
Sets member variable to passed string parameter. Sets temporary member string to passed string parameter.
(Temporary placeholder for Medline Title and Abstract data).
Note: This removes special characters and converts all characters to lowercase.
Warning: This is a private function and should not be called or manipulated.
Input:
$string -> String
Output:
None
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->SetTempStr( "Hello World!" );
undef( $xmlconv );
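The note above says the stored string is stripped of special characters and lowercased. The following is a rough sketch of that kind of normalization. Which characters the module treats as "special" is not stated here, so the character class below (anything other than letters, digits and whitespace) is an assumption, and normalize_sketch is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Word2vec::Xmltow2v).
sub normalize_sketch {
    my ( $str ) = @_;
    $str = lc( $str );                 # convert all characters to lowercase
    $str =~ s/[^a-z0-9\s]+/ /g;        # assumption: "special" = not a letter, digit or whitespace
    $str =~ s/\s+/ /g;                 # collapse runs of whitespace
    $str =~ s/^\s+|\s+$//g;            # trim leading and trailing whitespace
    return $str;
}

print( normalize_sketch( "Hello World!" ), "\n" );
```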
AppendToTempStr
Description:
Appends string parameter to temporary member string in memory.
Note: This removes special characters and converts all characters to lowercase.
Warning: This is a private function and should not be called or manipulated.
Input:
$string -> String
Output:
None
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->AppendToTempStr( "Hello World!" );
undef( $xmlconv );
ClearTempStr
Description:
Clears the temporary string storage in memory.
Warning: This is a private function and should not be called or manipulated.
Input:
None
Output:
None
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->ClearTempStr();
undef( $xmlconv );
SetTempDate
Description:
Sets member variable to passed string parameter. Sets temporary date string to passed string.
Note: Date Format - "XX/XX/XXXX" (Mon/Day/Year)
Warning: This is a private function and should not be called or manipulated.
Input:
$string -> Date string - Format: "XX/XX/XXXX"
Output:
None
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->SetTempDate( "08/13/2016" );
undef( $xmlconv );
ClearTempDate
Description:
Clears the temporary date storage location in memory.
Warning: This is a private function and should not be called or manipulated.
Input:
None
Output:
None
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->ClearTempDate();
undef( $xmlconv );
SetCompoundWordAry
Description:
Sets member variable to de-referenced passed array reference parameter. Stores compound word array by
de-referencing array reference parameter.
Note: Clears previous data if existing.
Warning: This is a private function and should not be called or manipulated.
Input:
$arrayReference -> Array reference of compound words
Output:
None
Example:
use Word2vec::Xmltow2v;
my @compoundWordAry = ( "big dog", "respiratory failure", "seven large masses" );
my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->SetCompoundWordAry( \@compoundWordAry );
undef( $xmlconv );
ClearCompoundWordAry
Description:
Clears compound word array in memory.
Warning: This is a private function and should not be called or manipulated.
Input:
None
Output:
None
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->ClearCompoundWordAry();
undef( $xmlconv );
SetCompoundWordBST
Description:
Sets member variable to passed Word2vec::Bst parameter. Sets compound word binary search tree to passed binary tree parameter.
Note: Un-defines previous binary tree if existing.
Warning: This is a private function and should not be called or manipulated.
Input:
Word2vec::Bst -> Binary Search Tree
Output:
None
Example:
use Word2vec::Xmltow2v;
my @compoundWordAry = ( "big dog", "respiratory failure", "seven large masses" );
@compoundWordAry = sort( @compoundWordAry );
my $arySize = @compoundWordAry;
my $bst = Word2vec::Bst->new();
$bst->CreateTree( \@compoundWordAry, 0, $arySize, undef );
my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->SetCompoundWordBST( $bst );
undef( $xmlconv );
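For readers unfamiliar with why the array is sorted before the tree is created, the following is an illustrative sketch of building a balanced binary search tree from a sorted array and searching it. It uses plain hash references and the hypothetical functions build_tree and tree_contains. It is not the Word2vec::Bst implementation, whose internals are not described in this document.

```perl
use strict;
use warnings;

# Hypothetical sketch: builds a balanced tree from the half-open
# index range [$lo, $hi) of an already sorted array reference.
sub build_tree {
    my ( $sorted, $lo, $hi ) = @_;
    return undef if $lo >= $hi;

    # The middle element becomes the root, so both halves stay balanced.
    my $mid = int( ( $lo + $hi ) / 2 );
    return {
        value => $sorted->[ $mid ],
        left  => build_tree( $sorted, $lo, $mid ),
        right => build_tree( $sorted, $mid + 1, $hi ),
    };
}

# Hypothetical sketch: returns 1 if $key is stored in the tree, else 0.
sub tree_contains {
    my ( $node, $key ) = @_;
    while ( defined( $node ) ) {
        my $cmp = $key cmp $node->{value};
        return 1 if $cmp == 0;
        $node = $cmp < 0 ? $node->{left} : $node->{right};
    }
    return 0;
}

my @compoundWordAry = sort( "big dog", "respiratory failure", "seven large masses" );
my $root = build_tree( \@compoundWordAry, 0, scalar( @compoundWordAry ) );
print( "Found\n" ) if tree_contains( $root, "respiratory failure" );
```

Sorting first lets each lookup discard half of the remaining words per comparison, which matters when the compound word list is large.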
ClearCompoundWordBST
Description:
Clears/Un-defines existing compound word binary search tree from memory.
Warning: This is a private function and should not be called or manipulated.
Input:
None
Output:
None
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->ClearCompoundWordBST();
undef( $xmlconv );
SetMaxCompoundWordLength
Description:
Sets member variable to passed integer parameter. Sets maximum number of compound words in a phrase for comparison.
i.e., "medical campus of Virginia Commonwealth University" can be interpreted as a compound word of 6 words.
Setting this variable to 3 will only attempt to compoundify a maximum of three words.
The result would be "medical_campus_of Virginia commonwealth university" even though an exact representation
of this compounded string can exist. Setting this variable to 6 will result in compounding all six words if
they exist in the compound word array/bst.
Warning: This is a private function and should not be called or manipulated.
Input:
$value -> Integer
Output:
None
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->SetMaxCompoundWordLength( 8 );
undef( $xmlconv );
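The effect of the maximum length described above can be illustrated with a greedy longest-match sketch. This is not the module's implementation: compoundify_sketch is a hypothetical function, a plain hash stands in for the compound word binary search tree, and the compound word list is assumed to contain both "medical campus of" and the full six-word phrase.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Word2vec::Xmltow2v): joins the longest
# known compound phrase starting at each position, up to $maxLength words.
sub compoundify_sketch {
    my ( $text, $compoundWords, $maxLength ) = @_;
    my %lookup = map { lc( $_ ) => 1 } @{ $compoundWords };
    my @words  = split( ' ', $text );
    my @result;
    my $i = 0;

    while ( $i < @words ) {
        my $matched = 0;
        my $limit   = $maxLength;
        $limit = @words - $i if $limit > @words - $i;

        # Try the longest candidate phrase first, then shrink the window.
        for ( my $len = $limit; $len >= 2; $len-- ) {
            my @window = @words[ $i .. $i + $len - 1 ];
            next unless exists( $lookup{ lc( join( ' ', @window ) ) } );
            push( @result, join( '_', @window ) );
            $i += $len;
            $matched = 1;
            last;
        }

        # No compound word starts here: keep the single word unchanged.
        push( @result, $words[ $i++ ] ) unless $matched;
    }

    return join( ' ', @result );
}

my @compoundWords = ( "medical campus of",
                      "medical campus of virginia commonwealth university" );
my $text = "medical campus of virginia commonwealth university";
print( compoundify_sketch( $text, \@compoundWords, 3 ), "\n" );
print( compoundify_sketch( $text, \@compoundWords, 6 ), "\n" );
```

With a limit of 3 the six-word entry is never tested, so only the first three words are joined. With a limit of 6 the whole phrase is matched in one step.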
SetOverwriteExistingFile
Description:
Sets member variable to passed integer parameter. Sets option to overwrite existing text corpus during compilation
if 1 or append to existing text corpus if 0.
Input:
$value -> '1' = Overwrite existing text corpus / '0' = Append to existing text corpus during compilation.
Output:
None
Example:
use Word2vec::Xmltow2v;
my $xmltow2v = Word2vec::Xmltow2v->new();
$xmltow2v->SetOverwriteExistingFile( 1 );
undef( $xmltow2v );
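The overwrite and append behaviours correspond to the two write modes of Perl's built-in open. The following sketch shows that distinction in isolation. The function write_corpus_sketch and the file name are hypothetical, and this is only an assumption about how such an option could be honoured, not a copy of the module's file handling.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Word2vec::Xmltow2v).
sub write_corpus_sketch {
    my ( $path, $text, $overwrite ) = @_;

    # '>' truncates an existing file, '>>' appends to it.
    my $mode = $overwrite ? '>' : '>>';
    open( my $fh, $mode, $path ) or die( "Cannot open '$path': $!" );
    print {$fh} $text;
    close( $fh ) or die( "Cannot close '$path': $!" );
    return;
}

# Usage (file name is illustrative only):
# write_corpus_sketch( "textcorpus.txt", "first article text\n", 1 );   # overwrite
# write_corpus_sketch( "textcorpus.txt", "second article text\n", 0 );  # append
```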
Debug Functions
GetTime
Description:
Returns current time string in "Hour:Minute:Second" format.
Input:
None
Output:
$string -> XX:XX:XX ("Hour:Minute:Second")
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
my $time = $xmlconv->GetTime();
print( "Current Time: $time\n" ) if defined( $time );
undef( $xmlconv );
GetDate
Description:
Returns current month, day and year string in "Month/Day/Year" format.
Input:
None
Output:
$string -> XX/XX/XXXX ("Month/Day/Year")
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
my $date = $xmlconv->GetDate();
print( "Current Date: $date\n" ) if defined( $date );
undef( $xmlconv );
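For reference, strings in the two formats documented for GetTime and GetDate can be produced with the core POSIX module. This sketch is independent of Word2vec::Xmltow2v; the function names are hypothetical, and zero-padded fields are an assumption based on the "XX:XX:XX" and "XX/XX/XXXX" patterns above.

```perl
use strict;
use warnings;
use POSIX qw( strftime );

# Hypothetical helpers (not the module's own code).
sub time_sketch { return strftime( "%H:%M:%S", localtime() ); }    # "Hour:Minute:Second"
sub date_sketch { return strftime( "%m/%d/%Y", localtime() ); }    # "Month/Day/Year"

print( "Current Time: ", time_sketch(), "\n" );
print( "Current Date: ", date_sketch(), "\n" );
```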
WriteLog
Description:
Prints passed string parameter to the console, log file or both depending on user options.
Note: A new line character is printed after the string unless the printNewLine parameter is 0.
An undefined printNewLine parameter also prints the new line character.
Input:
$string -> String to print to the console/log file.
$value -> 0 = Do not print newline character after string, all else prints new line character including 'undef'.
Output:
None
Example:
use Word2vec::Xmltow2v;
my $xmlconv = Word2vec::Xmltow2v->new();
$xmlconv->WriteLog( "Hello World" );
undef( $xmlconv );
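The new line rule described above can be summarized in a few lines. This is a sketch of the documented behaviour only: format_log_line is a hypothetical function, and the routing of output to the console or log file is left out.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Word2vec::Xmltow2v): appends a new line
# unless the second argument is defined and equal to 0.
sub format_log_line {
    my ( $string, $printNewLine ) = @_;
    my $suppress = defined( $printNewLine ) && $printNewLine eq '0';
    return $suppress ? $string : "$string\n";
}

print( format_log_line( "Parsing file... ", 0 ) );   # no new line, output continues on this line
print( format_log_line( "done" ) );                  # new line added because parameter is undefined
```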
Author
Clint Cuffy, Virginia Commonwealth University
COPYRIGHT
Copyright (c) 2016
Bridget T McInnes, Virginia Commonwealth University
btmcinnes at vcu dot edu
Clint Cuffy, Virginia Commonwealth University
cuffyca at vcu dot edu
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to:
The Free Software Foundation, Inc.,
59 Temple Place - Suite 330,
Boston, MA 02111-1307, USA.