NAME

Word2vec::Word2vec - word2vec wrapper module.

SYNOPSIS

use Word2vec::Word2vec;

# Parameters: Enabled Debug Logging, Disabled Write Logging
my $w2v = Word2vec::Word2vec->new( 1, 0 );             # Note: Specifying no parameters implies default settings.

$w2v->SetTrainFilePath( "textCorpus.txt" );
$w2v->SetOutputFilePath( "vectors.bin" );
$w2v->SetWordVecSize( 200 );
$w2v->SetWindowSize( 8 );
$w2v->SetSample( 0.0001 );
$w2v->SetNegative( 25 );
$w2v->SetHSoftMax( 0 );
$w2v->SetBinaryOutput( 0 );
$w2v->SetNumOfThreads( 20 );
$w2v->SetNumOfIterations( 12 );
$w2v->SetUseCBOW( 1 );
$w2v->SetOverwriteOldFile( 0 );

$w2v->ExecuteTraining();

undef( $w2v );

# or

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();             # Note: Specifying no parameters implies default settings.

$w2v->ExecuteTraining( $trainFilePath, $outputFilePath, $vectorSize, $windowSize, $minCount, $sample, $negative,
                       $alpha, $hs, $binary, $numOfThreads, $iterations, $useCBOW, $classes, $readVocab,
                       $saveVocab, $debug, $overwrite );

undef( $w2v );

DESCRIPTION

Word2vec::Word2vec is a wrapper around the word2vec tool. It trains word2vec models on text corpus data, provides several ways to compute cosine similarity, supports manipulation of word vectors, and converts word2vec's binary format to human-readable text.

Main Functions

new

Description:

Returns a new "Word2vec::Word2vec" module object.

Note: Specifying no parameters implies default options.

Default Parameters:
   debugLog                    = 0
   writeLog                    = 0
   trainFileName               = ""
   outputFileName              = ""
   wordVecSize                 = 100
   sample                      = 0.001
   hSoftMax                    = 0
   negative                    = 5
   numOfThreads                = 12
   numOfIterations             = 5
   minCount                    = 5
   alpha                       = 0.05 (CBOW) or 0.025 (Skip-Gram)
   classes                     = 0
   debug                       = 2
   binaryOutput                = 1
   saveVocab                   = ""
   readVocab                   = ""
   useCBOW                     = 1
   workingDir                  = Current Directory
   hashRefOfWordVectors        = ()
   overwriteOldFile            = 0

Input:

$debugLog                    -> Instructs module to print debug statements to the console. (1 = True / 0 = False)
$writeLog                    -> Instructs module to print debug statements to a log file. (1 = True / 0 = False)
$trainFileName               -> Specifies the training text corpus file path. (String)
$outputFileName              -> Specifies the word2vec post training output file path. (String)
$wordVecSize                 -> Specifies word2vec word vector size parameter value. (Integer)
$sample                      -> Specifies word2vec sample parameter value. (Float)
$hSoftMax                    -> Specifies word2vec HSoftMax parameter value. (Integer)
$negative                    -> Specifies word2vec negative parameter value. (Integer)
$numOfThreads                -> Specifies word2vec number of threads parameter value. (Integer)
$numOfIterations             -> Specifies word2vec number of iterations parameter value. (Integer)
$minCount                    -> Specifies word2vec min-count parameter value. (Integer)
$alpha                       -> Specifies word2vec alpha parameter value. (Float)
$classes                     -> Specifies word2vec classes parameter value. (Integer)
$debug                       -> Specifies word2vec debug training parameter value. (Integer: '0' = No Debug, '1' = Debug, '2' = Even more debug info)
$binaryOutput                -> Specifies word2vec binary output mode parameter value. (Integer: '1' = Binary, '0' = Plain Text)
$saveVocab                   -> Specifies word2vec save vocabulary file path. (String)
$readVocab                   -> Specifies word2vec read vocabulary file path. (String)
$useCBOW                     -> Specifies word2vec CBOW algorithm parameter value. (Integer: '1' = CBOW, '0' = Skip-Gram)
$workingDir                  -> Specifies module working directory. (String)
$hashRefOfWordVectors        -> Storage location for loaded word2vec trained vector data file in memory. (Hash)
$overwriteOldFile            -> Instructs the module whether to overwrite an existing file with the same output file name and path. ( '1' = True / '0' = False )

Note: It is not recommended to specify all new() parameters, as it has not been thoroughly tested.

Output:

Word2vec::Word2vec object.

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();

undef( $w2v );

DESTROY

Description:

Removes member variables and file handle from memory.

Input:

None

Output:

None

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->DESTROY();

undef( $w2v );

ExecuteTraining

Description:

Executes word2vec training based on parameters. Parameter variables have higher precedence
than member variables. Any parameter specified will override its respective member variable.

Note: If no parameters are specified, this module executes word2vec training based on preset
member variables. Returns a value indicating training status (see Output).

Input:

$trainFilePath  -> Specifies word2vec text corpus training file in a given path. (String)
$outputFilePath -> Specifies word2vec trained output data file name and save path. (String)
$vectorSize     -> Size of word2vec word vectors. (Integer)
$windowSize     -> Maximum skip length between words. (Integer)
$minCount       -> Disregard words that appear less than $minCount times. (Integer)
$sample         -> Threshold for occurrence of words. Those that appear with higher frequency in the training data will be randomly down-sampled. (Float)
$negative       -> Number of negative examples. (Integer)
$alpha          -> Sets the starting learning rate. (Float)
$hs             -> Hierarchical Soft-max (Integer)
$binary         -> Save trained data as binary mode. (Integer)
$numOfThreads   -> Number of word2vec training threads. (Integer)
$iterations     -> Number of training iterations to run prior to completion of training. (Integer)
$useCBOW        -> Enable Continuous Bag Of Words model or Skip-Gram model. (Integer)
$classes        -> Output word classes rather than word vectors. (Integer)
$readVocab      -> Read vocabulary from file path without constructing from training data. (String)
$saveVocab      -> Save vocabulary to file path. (String)
$debug          -> Set word2vec debug mode. (Integer)
$overwrite      -> Instructs the module to either overwrite any existing text corpus files or append to the existing file. ( '1' = True / '0' = False )

Note: It is not recommended to specify all parameters, as this has not been thoroughly tested.

Output:

$value          -> '0' = Successful / '-1' = Un-successful

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->SetTrainFilePath( "textcorpus.txt" );
$w2v->SetOutputFilePath( "vectors.bin" );
$w2v->SetWordVecSize( 200 );
$w2v->SetWindowSize( 8 );
$w2v->SetSample( 0.0001 );
$w2v->SetNegative( 25 );
$w2v->SetHSoftMax( 0 );
$w2v->SetBinaryOutput( 0 );
$w2v->SetNumOfThreads( 20 );
$w2v->SetNumOfIterations( 15 );
$w2v->SetUseCBOW( 1 );
$w2v->SetOverwriteOldFile( 0 );
$w2v->ExecuteTraining();

undef( $w2v );

# or

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->ExecuteTraining( "textcorpus.txt", "vectors.bin", 200, 8, 5, 0.001, 25, 0.05, 0, 0, 20, 15, 1, 0, "", "", 2, 0 );

undef( $w2v );
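
The parameters above correspond to command-line options of the original word2vec executable. As a rough sketch only, the snippet below shows how such settings could be assembled into a word2vec command line. Whether the module builds its command this way is an assumption. The helper name build_word2vec_command, the settings hash keys and the executable path "./word2vec" are hypothetical and are not part of Word2vec::Word2vec.

```perl
use strict;
use warnings;

# Hypothetical helper: maps training settings to word2vec command-line flags.
# The flag names are those of the original word2vec C tool.
sub build_word2vec_command
{
    my ( $exePath, %settings ) = @_;

    # Ordered list of [ setting name, word2vec flag ] pairs.
    my @flagMap = (
        [ trainFilePath  => "-train"     ],
        [ outputFilePath => "-output"    ],
        [ vectorSize     => "-size"      ],
        [ windowSize     => "-window"    ],
        [ minCount       => "-min-count" ],
        [ sample         => "-sample"    ],
        [ negative       => "-negative"  ],
        [ alpha          => "-alpha"     ],
        [ hs             => "-hs"        ],
        [ binary         => "-binary"    ],
        [ numOfThreads   => "-threads"   ],
        [ iterations     => "-iter"      ],
        [ useCBOW        => "-cbow"      ],
    );

    my @command = ( $exePath );

    for my $pair ( @flagMap )
    {
        my ( $name, $flag ) = @{ $pair };

        # Settings left undefined are skipped so word2vec uses its own default.
        next if !defined( $settings{ $name } );
        push( @command, $flag, $settings{ $name } );
    }

    return join( " ", @command );
}

my $command = build_word2vec_command( "./word2vec",
                                      trainFilePath  => "textcorpus.txt",
                                      outputFilePath => "vectors.bin",
                                      vectorSize     => 200,
                                      windowSize     => 8,
                                      useCBOW        => 1 );

# Prints: ./word2vec -train textcorpus.txt -output vectors.bin -size 200 -window 8 -cbow 1
print( "$command\n" );
```

The sketch only builds the command string and does not run it, so it can be tried without a word2vec executable installed.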

ExecuteStringTraining

Description:

Executes word2vec training on the given string based on parameters. Parameter variables have higher
precedence than member variables. Any parameter specified will override its respective member variable.

Note: If no optional parameters are specified, this module executes word2vec training based on preset
member variables. Returns a value indicating training status (see Output).

Input:

$trainingStr    -> String to train with word2vec.
$outputFilePath -> Specifies word2vec trained output data file name and save path. (String)
$vectorSize     -> Size of word2vec word vectors. (Integer)
$windowSize     -> Maximum skip length between words. (Integer)
$minCount       -> Disregard words that appear less than $minCount times. (Integer)
$sample         -> Threshold for occurrence of words. Those that appear with higher frequency in the training data will be randomly down-sampled. (Float)
$negative       -> Number of negative examples. (Integer)
$alpha          -> Sets the starting learning rate. (Float)
$hs             -> Hierarchical Soft-max (Integer)
$binary         -> Save trained data as binary mode. (Integer)
$numOfThreads   -> Number of word2vec training threads. (Integer)
$iterations     -> Number of training iterations to run prior to completion of training. (Integer)
$useCBOW        -> Enable Continuous Bag Of Words model or Skip-Gram model. (Integer)
$classes        -> Output word classes rather than word vectors. (Integer)
$readVocab      -> Read vocabulary from file path without constructing from training data. (String)
$saveVocab      -> Save vocabulary to file path. (String)
$debug          -> Set word2vec debug mode. (Integer)
$overwrite      -> Instructs the module to either overwrite any existing text corpus files or append to the existing file. ( '1' = True / '0' = False )

Note: It is not recommended to specify all parameters, as this has not been thoroughly tested.

Output:

$value          -> '0' = Successful / '-1' = Un-successful

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->SetOutputFilePath( "vectors.bin" );
$w2v->SetWordVecSize( 200 );
$w2v->SetWindowSize( 8 );
$w2v->SetSample( 0.0001 );
$w2v->SetNegative( 25 );
$w2v->SetHSoftMax( 0 );
$w2v->SetBinaryOutput( 0 );
$w2v->SetNumOfThreads( 20 );
$w2v->SetNumOfIterations( 15 );
$w2v->SetUseCBOW( 1 );
$w2v->SetOverwriteOldFile( 0 );
$w2v->ExecuteStringTraining( "string to train here" );

undef( $w2v );

# or

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->ExecuteStringTraining( "string to train here", "vectors.bin", 200, 8, 5, 0.001, 25, 0.05, 0, 0, 20, 15, 1, 0, "", "", 2, 0 );

undef( $w2v );

ComputeCosineSimilarity

Description:

Computes cosine similarity between two words using trained word2vec vector data. Returns
float value or undefined if one or more words are not in the dictionary.

Note: Supports single words only and requires vector data to be in memory with ReadTrainedVectorDataFromFile() prior to function execution.

Input:

$string -> Single string word
$string -> Single string word

Output:

$value  -> Float or Undefined

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->ReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );
print "Cosine similarity between words: \"of\" and \"the\": " . $w2v->ComputeCosineSimilarity( "of", "the" ) . "\n";

undef( $w2v );
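
To illustrate what the returned value measures, the following standalone sketch computes cosine similarity (the dot product divided by the product of the vector magnitudes) for two plain Perl arrays. It does not use Word2vec::Word2vec. The function name cosine_similarity and the sample values are hypothetical, and the module's own implementation may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: cosine similarity of two equal-length numeric arrays.
# Returns undef when the lengths differ or either vector has zero magnitude.
sub cosine_similarity
{
    my ( $vecA, $vecB ) = @_;

    return undef if @{ $vecA } != @{ $vecB };

    my ( $dot, $normA, $normB ) = ( 0, 0, 0 );

    for my $i ( 0 .. $#{ $vecA } )
    {
        $dot   += $vecA->[$i] * $vecB->[$i];   # running dot product
        $normA += $vecA->[$i] ** 2;            # squared magnitude of A
        $normB += $vecB->[$i] ** 2;            # squared magnitude of B
    }

    return undef if $normA == 0 || $normB == 0;
    return $dot / sqrt( $normA * $normB );
}

my @vectorA = ( 1, 2, 3 );   # made-up values, not real word2vec output
my @vectorB = ( 2, 4, 6 );   # same direction as @vectorA, twice the length

my $similarity = cosine_similarity( \@vectorA, \@vectorB );
print( "Cosine similarity: $similarity\n" ) if defined( $similarity );
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score 0, regardless of their lengths.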

ComputeAvgOfWordsCosineSimilarity

Description:

Computes cosine similarity between two words or compound words using trained word2vec vector data.
Returns float value or undefined.

Note: Supports multiple words concatenated by ' ' and requires vector data to be in memory prior
to method execution. This method will not error out when a word is not located within the dictionary.
Instead, it averages the vectors of all found words for each parameter, then computes the cosine
similarity of the two averaged vectors.

Input:

$string -> string of single or multiple words separated by ' ' (space).
$string -> string of single or multiple words separated by ' ' (space).

Output:

$value  -> Float or Undefined

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->ReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );
print "Cosine similarity between words: \"heart attack\" and \"acute myocardial infarction\": " .
      $w2v->ComputeAvgOfWordsCosineSimilarity( "heart attack", "acute myocardial infarction" ) . "\n";

undef( $w2v );
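
The averaging behaviour described in the note can be sketched as follows. This is a minimal standalone illustration, not the module's code: the %vocabulary hash, its made-up three-dimensional vectors and all function names are hypothetical.

```perl
use strict;
use warnings;

# Made-up 3-dimensional vectors standing in for loaded word2vec data.
my %vocabulary = (
    "heart"      => [ 1.0, 0.0, 1.0 ],
    "attack"     => [ 0.0, 1.0, 1.0 ],
    "myocardial" => [ 1.0, 1.0, 2.0 ],
);

# Averages the vectors of every word found in %vocabulary and skips unknown
# words. Returns an array reference, or undef when no word is found.
sub average_of_found_words
{
    my ( $phrase ) = @_;
    my @sum;
    my $found = 0;

    for my $word ( split( ' ', $phrase ) )
    {
        my $vector = $vocabulary{ $word };
        next if !defined( $vector );

        $sum[$_] += $vector->[$_] for 0 .. $#{ $vector };
        $found++;
    }

    return undef if $found == 0;
    return [ map { $_ / $found } @sum ];
}

# Cosine similarity of two equal-length array references.
sub cosine_similarity
{
    my ( $vecA, $vecB ) = @_;
    my ( $dot, $normA, $normB ) = ( 0, 0, 0 );

    for my $i ( 0 .. $#{ $vecA } )
    {
        $dot   += $vecA->[$i] * $vecB->[$i];
        $normA += $vecA->[$i] ** 2;
        $normB += $vecB->[$i] ** 2;
    }

    return undef if $normA == 0 || $normB == 0;
    return $dot / sqrt( $normA * $normB );
}

# Averages each phrase first, then compares the two averaged vectors.
sub avg_of_words_cosine_similarity
{
    my ( $phraseA, $phraseB ) = @_;

    my $avgA = average_of_found_words( $phraseA );
    my $avgB = average_of_found_words( $phraseB );

    return undef if !defined( $avgA ) || !defined( $avgB );
    return cosine_similarity( $avgA, $avgB );
}

# "acute" and "infarction" are not in %vocabulary and are simply skipped.
my $similarity = avg_of_words_cosine_similarity( "heart attack", "acute myocardial infarction" );
print( "Cosine similarity: $similarity\n" ) if defined( $similarity );
```

In this sketch a phrase only yields undef when none of its words are found, which mirrors the lenient behaviour described above.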

ComputeMultiWordCosineSimilarity

Description:

Computes cosine similarity between two words or compound words using trained word2vec vector data.
Returns float value or undefined if one or more words are not in the dictionary.

Note: Supports multiple words concatenated by ' ' and requires vector data to be in memory prior to method execution.
This function will error out when a specified word is not found and return undefined.

Input:

$string -> string of single or multiple words separated by ' ' (space).
$string -> string of single or multiple words separated by ' ' (space).

Output:

$value  -> Float or Undefined

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->ReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );
print "Cosine similarity between words: \"heart attack\" and \"acute myocardial infarction\": " .
      $w2v->ComputeMultiWordCosineSimilarity( "heart attack", "acute myocardial infarction" ) . "\n";

undef( $w2v );

ComputeCosineSimilarityOfWordVectors

Description:

Computes cosine similarity between two word vectors.
Returns float value or undefined if one or more words are not in the dictionary.

Note: Function parameters require actual word vector data with words removed.

Input:

$string -> string of word vector representation data separated by ' ' (space).
$string -> string of word vector representation data separated by ' ' (space).

Output:

$value  -> Float or Undefined

Example:

use Word2vec::Word2vec;

my $word2vec = Word2vec::Word2vec->new();
$word2vec->ReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );
my $vectorAData = $word2vec->GetWordVector( "heart" );
my $vectorBData = $word2vec->GetWordVector( "attack" );

# Remove Words From Data
$vectorAData = $word2vec->RemoveWordFromWordVectorString( $vectorAData );
$vectorBData = $word2vec->RemoveWordFromWordVectorString( $vectorBData );

print "Cosine similarity between words: \"heart\" and \"attack\": " .
      $word2vec->ComputeCosineSimilarityOfWordVectors( $vectorAData, $vectorBData ) . "\n";

undef( $word2vec );

CosSimWithUserInput

Description:

Computes cosine similarity between two words using trained word2vec vector data based on user input.

Note: No compound word support.

Warning: Requires vector data to be in memory prior to method execution.

Input:

None

Output:

None

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->ReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );
$w2v->CosSimWithUserInput();

undef( $w2v );

MultiWordCosSimWithUserInput

Description:

Computes cosine similarity between two words or compound words using trained word2vec vector data based on user input.

Note: Supports multiple words concatenated by ':'.

Warning: Requires vector data to be in memory prior to method execution.

Input:

None

Output:

None

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->ReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );
$w2v->MultiWordCosSimWithUserInput();

undef( $w2v );

ComputeAverageOfWords

Description:

Computes the average of the word vectors of all found words, given an array reference of
plain text words. Returns the averaged values (string) or undefined.

Warning: Requires vector data to be in memory prior to method execution.

Input:

$arrayReference -> Array reference of words

Output:

$string         -> String of word2vec word average values

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->ReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );
my @words = ( "of", "the", "and" );
my $data = $w2v->ComputeAverageOfWords( \@words );
print( "Computed Average Of Words: $data" ) if defined( $data );

undef( $w2v );

AddTwoWords

Description:

Adds two word vectors and returns the result.

Warning: Requires vector data to be in memory prior to method execution.

Input:

$string -> Word to add
$string -> Word to add

Output:

$string -> String of word2vec summed word values

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->ReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );

my $data = $w2v->AddTwoWords( "heart", "attack" );
print( "Computed Sum Of Words: $data" ) if defined( $data );

undef( $w2v );

SubtractTwoWords

Description:

Subtracts two word vectors and returns the result.

Warning: Requires vector data to be in memory prior to method execution.

Input:

$string -> Word to subtract
$string -> Word to subtract

Output:

$string -> String of word2vec difference between word values

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->ReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );

my $data = $w2v->SubtractTwoWords( "king", "man" );
print( "Computed Difference Of Words: $data" ) if defined( $data );

undef( $w2v );

AddTwoWordVectors

Description:

Adds two vector data strings and returns the result.

Warning: Text word must be removed from vector data prior to calling this method. This method
also requires vector data to be in memory prior to method execution.

Input:

$string -> Word2vec word vector data (with string word removed)
$string -> Word2vec word vector data (with string word removed)

Output:

$string -> String of word2vec summed word values

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->ReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );
my $wordAData = $w2v->GetWordVector( "of" );
my $wordBData = $w2v->GetWordVector( "the" );

# Removing Words From Vector Data Strings
$wordAData = $w2v->RemoveWordFromWordVectorString( $wordAData );
$wordBData = $w2v->RemoveWordFromWordVectorString( $wordBData );

my $data = $w2v->AddTwoWordVectors( $wordAData, $wordBData );
print( "Computed Sum Of Words: $data" ) if defined( $data );

undef( $w2v );

SubtractTwoWordVectors

Description:

Subtracts two vector data strings and returns the result.

Warning: Text word must be removed from vector data prior to calling this method. This method
also requires vector data to be in memory prior to method execution.

Input:

$string -> Word2vec word vector data (with string word removed)
$string -> Word2vec word vector data (with string word removed)

Output:

$string -> String of word2vec difference between word values

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->ReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );
my $wordAData = $w2v->GetWordVector( "of" );
my $wordBData = $w2v->GetWordVector( "the" );

# Removing Words From Vector Data Strings
$wordAData = $w2v->RemoveWordFromWordVectorString( $wordAData );
$wordBData = $w2v->RemoveWordFromWordVectorString( $wordBData );

my $data = $w2v->SubtractTwoWordVectors( $wordAData, $wordBData );
print( "Computed Difference Of Words: $data" ) if defined( $data );

undef( $w2v );

AverageOfTwoWordVectors

Description:

Computes the average of two vectors and returns the result.

Warning: Text word must be removed from vector data prior to calling this method. This method
also requires vector data to be in memory prior to method execution.

Input:

$string -> Word2vec word vector data (with string word removed)
$string -> Word2vec word vector data (with string word removed)

Output:

$string -> String of word2vec average between word values

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->ReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );
my $wordAData = $w2v->GetWordVector( "of" );
my $wordBData = $w2v->GetWordVector( "the" );

# Removing Words From Vector Data Strings
$wordAData = $w2v->RemoveWordFromWordVectorString( $wordAData );
$wordBData = $w2v->RemoveWordFromWordVectorString( $wordBData );

my $data = $w2v->AverageOfTwoWordVectors( $wordAData, $wordBData );
print( "Computed Average Of Words: $data" ) if defined( $data );

undef( $w2v );
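
AddTwoWordVectors, SubtractTwoWordVectors and AverageOfTwoWordVectors all operate element by element on vector data strings. The standalone sketch below illustrates that idea on space-separated number strings with the word label already removed. The helper combine_vector_strings and the sample strings are hypothetical, and the module's own handling of mismatched lengths or number formatting may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: applies $operation element by element to two vector
# strings of space-separated numbers (word label already removed).
# Returns the result as a space-separated string, or undef on length mismatch.
sub combine_vector_strings
{
    my ( $vectorAStr, $vectorBStr, $operation ) = @_;

    my @vectorA = split( ' ', $vectorAStr );
    my @vectorB = split( ' ', $vectorBStr );

    return undef if @vectorA != @vectorB;

    my @result = map { $operation->( $vectorA[$_], $vectorB[$_] ) } 0 .. $#vectorA;
    return join( " ", @result );
}

my $vectorA = "1 2 3";   # made-up values, not real word2vec output
my $vectorB = "3 2 1";

my $sum        = combine_vector_strings( $vectorA, $vectorB, sub { $_[0] + $_[1] } );
my $difference = combine_vector_strings( $vectorA, $vectorB, sub { $_[0] - $_[1] } );
my $average    = combine_vector_strings( $vectorA, $vectorB, sub { ( $_[0] + $_[1] ) / 2 } );

print( "Sum: $sum\n" );                 # prints "Sum: 4 4 4"
print( "Difference: $difference\n" );   # prints "Difference: -2 0 2"
print( "Average: $average\n" );         # prints "Average: 2 2 2"
```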

GetWordVector

Description:

Searches dictionary in memory for the specified string argument and returns the vector data.
Returns undefined if not found.

Warning: Requires vector data to be in memory prior to method execution.

Input:

$string -> Word to locate in word2vec vocabulary/dictionary

Output:

$string -> Found word2vec word + word vector data or undefined.

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->ReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );
my $wordData = $w2v->GetWordVector( "of" );
print( "Word2vec Word Data: $wordData\n" ) if defined( $wordData );

undef( $w2v );

IsVectorDataInMemory

Description:

Checks to see if vector data has been loaded in memory.

Input:

None

Output:

$value -> '1' = True / '0' = False

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
my $result = $w2v->IsVectorDataInMemory();

print( "No vector data in memory\n" ) if $result == 0;
print( "Yes vector data in memory\n" ) if $result == 1;

$w2v->ReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );
$result = $w2v->IsVectorDataInMemory();

print( "No vector data in memory\n" ) if $result == 0;
print( "Yes vector data in memory\n" ) if $result == 1;

undef( $w2v );

IsVectorDataSorted

Description:

Checks to see if vector data header is signed as sorted in memory.

Input:

None

Output:

$value -> '1' = True / '0' = False

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->ReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );

my $result = $w2v->IsVectorDataSorted();

print( "No, vector data is not sorted\n" ) if $result == 0;
print( "Yes, vector data is sorted\n" ) if $result == 1;

undef( $w2v );

CheckWord2VecDataFileType

Description:

Checks specified file to see if vector data is in binary or plain text format. Returns 'text'
for plain text and 'binary' for binary data.

Input:

$string -> File path

Output:

$string -> File Type ( "text" = Plain text file / "binary" = Binary data file )

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
my $fileType = $w2v->CheckWord2VecDataFileType( "samples/samplevectors.bin" );

print( "FileType: $fileType\n" ) if defined( $fileType );

undef( $w2v );

ReadTrainedVectorDataFromFile

Description:

Reads trained vector data from the specified file path into memory.

Input:

$string     -> Word2vec trained vector data file path

Output:

$value      -> '0' = Successful / '-1' = Un-successful

Example:

# Loading data in a Binary Search Tree
use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
my $result = $w2v->ReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );

print( "Success Loading Data\n" ) if $result == 0;
print( "Un-successful, Data Not Loaded\n" ) if $result == -1;

undef( $w2v );

# or

# Loading data in an array
use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
my $result = $w2v->ReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );

print( "Success Loading Data\n" ) if $result == 0;
print( "Un-successful, Data Not Loaded\n" ) if $result == -1;

undef( $w2v );

SaveTrainedVectorDataToFile

Description:

Saves trained vector data at the location specified. Setting the 'binaryFormat' parameter to 1 will
save in word2vec's binary format.

Input:

$string       -> Save Path
$binaryFormat -> Integer ( '1' = Save data in word2vec binary format / '0' = Save as plain text )

Note: Leaving $binaryFormat as undefined will save the file in plain text format.

Warning: If the vector data is stored as a binary search tree, this method will error out gracefully.

Output:

$value        -> '0' = Successful / '-1' = Un-successful

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();

$w2v->ReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );
$w2v->SaveTrainedVectorDataToFile( "samples/newvectors.bin" );

undef( $w2v );

StringsAreEqual

Description:

Compares two strings to check for equality, ignoring case.

Note: This method is not case-sensitive, i.e. "string" equals "StRiNg".

Input:

$string -> String to compare
$string -> String to compare

Output:

$value  -> '1' = Strings are equal / '0' = Strings are not equal

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
my $result = $w2v->StringsAreEqual( "hello world", "HeLlO wOrLd" );

print( "Strings are equal!\n" ) if $result == 1;
print( "Strings are not equal!\n" ) if $result == 0;

undef( $w2v );
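
A case-insensitive comparison of this kind can be sketched in a few lines of plain Perl. The function strings_are_equal below is a hypothetical stand-in and not the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical helper: case-insensitive string comparison.
# Returns 1 when the strings match regardless of case, otherwise 0.
sub strings_are_equal
{
    my ( $strA, $strB ) = @_;
    return ( lc( $strA ) eq lc( $strB ) ) ? 1 : 0;
}

print( strings_are_equal( "hello world", "HeLlO wOrLd" ), "\n" );   # prints 1
print( strings_are_equal( "hello", "world" ), "\n" );               # prints 0
```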

ConvertRawSparseTextToVectorDataAry

Description:

Converts sparse vector string to a dense vector format data array.

Input:

$string          -> Vector data string.

Output:

$arrayReference  -> Reference to array of vector data.

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
my $str = "cookie 1 0.234 9 0.0002 13 0.234 17 -0.0023 19 1.0000";

my @vectorData = @{ $w2v->ConvertRawSparseTextToVectorDataAry( $str ) };

print( "Data conversion successful!\n" ) if @vectorData > 0;
print( "Data conversion un-successful!\n" ) if @vectorData == 0;

undef( $w2v );
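
The sparse-to-dense idea can be illustrated with the standalone sketch below. It rests on several assumptions that the text above does not confirm: the string is read as a word followed by index/value pairs, indices are treated as zero-based, unlisted positions are filled with 0, and the vector length is supplied by the caller. The helper sparse_text_to_dense_array and the length of 20 are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: expands "word index value index value ..." into a
# dense array of $dimensions elements. Assumes zero-based indices; positions
# that are not listed are filled with 0. Returns ( word, array reference ),
# or an empty list when the input is malformed.
sub sparse_text_to_dense_array
{
    my ( $sparseStr, $dimensions ) = @_;

    my ( $word, @pairs ) = split( ' ', $sparseStr );
    return if !defined( $word ) || @pairs % 2 != 0;

    my @dense = ( 0 ) x $dimensions;

    while ( my ( $index, $value ) = splice( @pairs, 0, 2 ) )
    {
        return if $index < 0 || $index >= $dimensions;   # index outside the vector
        $dense[$index] = $value + 0;                     # store as a number
    }

    return ( $word, \@dense );
}

my $str = "cookie 1 0.234 9 0.0002 13 0.234 17 -0.0023 19 1.0000";
my ( $word, $dense ) = sparse_text_to_dense_array( $str, 20 );

print( "$word: @{ $dense }\n" ) if defined( $dense );
```

The dense form trades memory for direct positional access, which is convenient for element-wise arithmetic such as cosine similarity.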

ConvertRawSparseTextToVectorDataHash

Description:

Converts sparse vector string to a dense vector format data hash.

Input:

$string          -> Vector data string.

Output:

$hashReference   -> Reference to hash of vector data.

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
my $str = "cookie 1 0.234 9 0.0002 13 0.234 17 -0.0023 19 1.0000";

my %vectorData = %{ $w2v->ConvertRawSparseTextToVectorDataHash( $str ) };

print( "Data conversion successful!\n" ) if ( keys %vectorData ) > 0;
print( "Data conversion un-successful!\n" ) if ( keys %vectorData ) == 0;

undef( $w2v );

GetOSType

Description:

Returns (string) operating system type.

Input:

None

Output:

$string -> Operating System String

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
my $os = $w2v->GetOSType();

print( "Operating System: $os\n" );

undef( $w2v );

Accessor Functions

GetDebugLog

Description:

Returns the _debugLog member variable, which is set when the Word2vec::Word2vec object is created by new().

Input:

None

Output:

$value -> '0' = False, '1' = True

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
my $debugLog = $w2v->GetDebugLog();

print( "Debug Logging Enabled\n" ) if $debugLog == 1;
print( "Debug Logging Disabled\n" ) if $debugLog == 0;


undef( $w2v );

GetWriteLog

Description:

Returns the _writeLog member variable, which is set when the Word2vec::Word2vec object is created by new().

Input:

None

Output:

$value -> '0' = False, '1' = True

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
my $writeLog = $w2v->GetWriteLog();

print( "Write Logging Enabled\n" ) if $writeLog == 1;
print( "Write Logging Disabled\n" ) if $writeLog == 0;

undef( $w2v );

GetFileHandle

Description:

Returns the _fileHandle member variable, which is set when the Word2vec::Word2vec object is created by new().

Warning: This is a private function. File handle is used by WriteLog() method. Do not manipulate this file handle as errors can result.

Input:

None

Output:

$fileHandle -> Returns file handle for WriteLog() method or undefined.

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
my $fileHandle = $w2v->GetFileHandle();

undef( $w2v );

GetTrainFilePath

Description:

Returns the _trainFilePath member variable, which is set when the Word2vec::Word2vec object is created by new().

Input:

None

Output:

$string -> Returns word2vec training text corpus file path.

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
my $filePath = $w2v->GetTrainFilePath();
print( "Training File Path: $filePath\n" );

undef( $w2v );

GetOutputFilePath

Description:

Returns the _outputFilePath member variable, which is set when the Word2vec::Word2vec object is created by new().

Input:

None

Output:

$string -> Returns post word2vec training output file path.

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
my $filePath = $w2v->GetOutputFilePath();
print( "File Path: $filePath\n" );

undef( $w2v );

GetWordVecSize

Description:

Returns the _wordVecSize member variable, which is set when the Word2vec::Word2vec object is created by new().

Input:

None

Output:

$value -> Returns (integer) size of word2vec word vectors. Default value = 100

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
my $value = $w2v->GetWordVecSize();
print( "Word Vector Size: $value\n" );

undef( $w2v );

GetWindowSize

Description:

Returns the _windowSize member variable, which is set when the Word2vec::Word2vec object is created by new().

Input:

None

Output:

$value -> Returns (integer) word2vec window size. Default value = 5

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
my $value = $w2v->GetWindowSize();
print( "Window Size: $value\n" );

undef( $w2v );

GetSample

Description:

Returns the _sample member variable, which is set when the Word2vec::Word2vec object is created by new().

Input:

None

Output:

$value -> Returns (float) word2vec sample threshold. Default value = 0.001

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
my $value = $w2v->GetSample();
print( "Sample: $value\n" );

undef( $w2v );

GetHSoftMax

Description:

Returns the _hSoftMax member variable, which is set when the Word2vec::Word2vec object is created by new().

Input:

None

Output:

$value -> Returns (integer) word2vec HSoftMax value. Default = 0

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
my $value = $w2v->GetHSoftMax();
print( "HSoftMax: $value\n" );

undef( $w2v );

GetNegative

Description:

Returns the _negative member variable, which is set when the Word2vec::Word2vec object is created by new().

Input:

None

Output:

$value -> Returns (integer) word2vec negative value. Default = 5

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
my $value = $w2v->GetNegative();
print( "Negative: $value\n" );

undef( $w2v );

GetNumOfThreads

Description:

Returns the _numOfThreads member variable, which is set when the Word2vec::Word2vec object is created by new().

Input:

None

Output:

$value -> Returns (integer) word2vec number of threads to use during training. Default = 12

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
my $value = $w2v->GetNumOfThreads();
print( "Number of threads: $value\n" );

undef( $w2v );

GetNumOfIterations

Description:

Returns the _iterations member variable, which is set when the Word2vec::Word2vec object is created by new().

Input:

None

Output:

$value -> Returns (integer) number of word2vec training iterations. Default = 5

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
my $value = $w2v->GetNumOfIterations();
print( "Number of iterations: $value\n" );

undef( $w2v );

GetMinCount

Description:

Returns the _minCount member variable, which is set when the Word2vec::Word2vec object is created by new().

Input:

None

Output:

$value -> Returns (integer) word2vec min-count value. Default = 5

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
my $value = $w2v->GetMinCount();
print( "Min Count: $value\n" );

undef( $w2v );

GetAlpha

Description:

Returns the _alpha member variable, which is set when the Word2vec::Word2vec object is created by new().

Input:

None

Output:

$value -> Returns (float) word2vec alpha value. Default = 0.05 for CBOW and 0.025 for Skip-Gram.

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
my $value = $w2v->GetAlpha();
print( "Alpha: $value\n" );

undef( $w2v );

GetClasses

Description:

Returns the _classes member variable, which is set when the Word2vec::Word2vec object is created by new().

Input:

None

Output:

$value -> Returns (integer) word2vec classes value. Default = 0

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
my $value = $w2v->GetClasses();
print( "Classes: $value\n" );

undef( $w2v );

GetDebugTraining

Description:

Returns the _debug member variable, which is set when the Word2vec::Word2vec object is created by new().

Note: 0 = No debug output, 1 = Enable debug output, 2 = Even more debug output

Input:

None

Output:

$value -> Returns (integer) word2vec debug value. Default = 2

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
my $value = $w2v->GetDebugTraining();
print( "Debug: $value\n" );

undef( $w2v );

GetBinaryOutput

Description:

Returns the _binaryOutput member variable, which is initialized when the Word2vec::Word2vec object is created by the new function.

Note: 1 = Save trained vector data in binary format, 0 = Save trained vector data in plain text format.

Input:

None

Output:

$value -> Returns (integer) word2vec binary flag. Default = 0

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
my $value = $w2v->GetBinaryOutput();
print( "Binary Output: $value\n" );

undef( $w2v );

GetReadVocabFilePath

Description:

Returns the _readVocab member variable, which is initialized when the Word2vec::Word2vec object is created by the new function.

Input:

None

Output:

$string -> Returns (string) word2vec read vocabulary file name or empty string if not set.

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
my $str = $w2v->GetReadVocabFilePath();
print( "Read Vocab File Path: $str\n" );

undef( $w2v );

GetSaveVocabFilePath

Description:

Returns the _saveVocab member variable, which is initialized when the Word2vec::Word2vec object is created by the new function.

Input:

None

Output:

$string -> Returns (string) word2vec save vocabulary file name or empty string if not set.

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
my $str = $w2v->GetSaveVocabFilePath();
print( "Save Vocab File Path: $str\n" );

undef( $w2v );

GetUseCBOW

Description:

Returns the _useCBOW member variable, which is initialized when the Word2vec::Word2vec object is created by the new function.

Note: 0 = Skip-Gram Model, 1 = Continuous Bag Of Words Model.

Input:

None

Output:

$value -> Returns (integer) word2vec Continuous-Bag-Of-Words flag. Default = 1

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
my $value = $w2v->GetUseCBOW();
print( "Use CBOW?: $value\n" );

undef( $w2v );

GetWorkingDir

Description:

Returns the _workingDir member variable, which is initialized when the Word2vec::Word2vec object is created by the new function.

Input:

None

Output:

$value -> Returns (string) working directory path or current directory if not specified.

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
my $str = $w2v->GetWorkingDir();
print( "Working Directory: $str\n" );

undef( $w2v );
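
Note: The fallback to the current directory described above can be sketched in plain Perl. This is an illustration under assumptions, not the module's implementation; the helper name ResolveWorkingDir is hypothetical, and getcwd comes from the core Cwd module.

```perl
use strict;
use warnings;
use Cwd qw( getcwd );

# Hypothetical helper (not part of Word2vec::Word2vec): returns the given
# directory, or the current directory when none is specified.
sub ResolveWorkingDir
{
    my ( $dir ) = @_;
    return $dir if defined( $dir ) && $dir ne "";
    return getcwd();
}

print( "Working Directory: " . ResolveWorkingDir( undef )      . "\n" );
print( "Working Directory: " . ResolveWorkingDir( "/samples" ) . "\n" );
```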

GetWord2VecExeDir

Description:

Returns the _word2VecExeDir member variable, which is initialized when the Word2vec::Word2vec object is created by the new function.

Input:

None

Output:

$value -> Returns (string) word2vec executable directory path or empty string if not specified.

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
my $str = $w2v->GetWord2VecExeDir();
print( "Word2Vec Executable File Directory: $str\n" );

undef( $w2v );

GetVocabularyHash

Description:

Returns the _hashRefOfWordVectors member variable, which is initialized when the Word2vec::Word2vec object is created by the new function.

Input:

None

Output:

$value -> Returns array of vocabulary/dictionary words. (Word2vec trained data in memory)

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
my @vocabulary = $w2v->GetVocabularyHash();

undef( $w2v );

GetOverwriteOldFile

Description:

Returns the _overwriteOldFile member variable, which is initialized when the Word2vec::Word2vec object is created by the new function.

Input:

None

Output:

$value -> Returns (integer) overwrite flag: 1 = True (overwrite existing file) or 0 = False.

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
my $value = $w2v->GetOverwriteOldFile();
print( "Overwrite Existing File?: $value\n" );

undef( $w2v );

Mutator Functions

SetTrainFilePath

Description:

Sets member variable to string parameter. Sets training file path.

Input:

$string -> Text corpus training file path

Output:

None

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->SetTrainFilePath( "samples/textcorpus.txt" );

undef( $w2v );

SetOutputFilePath

Description:

Sets member variable to string parameter. Sets output file path.

Input:

$string -> Post word2vec training save file path

Output:

None

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->SetOutputFilePath( "samples/tempvectors.bin" );

undef( $w2v );

SetWordVecSize

Description:

Sets member variable to integer parameter. Sets word2vec word vector size.

Input:

$value -> Word2vec word vector size

Output:

None

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->SetWordVecSize( 100 );

undef( $w2v );

SetWindowSize

Description:

Sets member variable to integer parameter. Sets word2vec window size.

Input:

$value -> Word2vec window size

Output:

None

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->SetWindowSize( 8 );

undef( $w2v );

SetSample

Description:

Sets member variable to numeric parameter. Sets word2vec sample (sub-sampling threshold) value.

Input:

$value -> Word2vec sample value (sub-sampling threshold for frequent words)

Output:

None

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->SetSample( 0.0001 );

undef( $w2v );

SetHSoftMax

Description:

Sets member variable to integer parameter. Sets word2vec HSoftMax value.

Input:

$value -> Word2vec hierarchical softmax flag. ( '1' = Use hierarchical softmax / '0' = Do not use )

Output:

None

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->SetHSoftMax( 1 );

undef( $w2v );

SetNegative

Description:

Sets member variable to integer parameter. Sets word2vec negative value.

Input:

$value -> Word2vec negative value

Output:

None

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->SetNegative( 12 );

undef( $w2v );

SetNumOfThreads

Description:

Sets member variable to integer parameter. Sets word2vec number of training threads to specified value.

Input:

$value -> Word2vec number of threads value

Output:

None

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->SetNumOfThreads( 12 );

undef( $w2v );

SetNumOfIterations

Description:

Sets member variable to integer parameter. Sets word2vec iterations value.

Input:

$value -> Word2vec number of iterations value

Output:

None

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->SetNumOfIterations( 12 );

undef( $w2v );

SetMinCount

Description:

Sets member variable to integer parameter. Sets word2vec min-count value.

Input:

$value -> Word2vec min-count value

Output:

None

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->SetMinCount( 7 );

undef( $w2v );

SetAlpha

Description:

Sets member variable to float parameter. Sets word2vec alpha value.

Input:

$value -> Word2vec alpha value. (Float)

Output:

None

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->SetAlpha( 0.0012 );

undef( $w2v );

SetClasses

Description:

Sets member variable to integer parameter. Sets word2vec classes value.

Input:

$value -> Word2vec classes value.

Output:

None

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->SetClasses( 0 );

undef( $w2v );

SetDebugTraining

Description:

Sets member variable to integer parameter. Sets word2vec debug parameter value.

Input:

$value -> Word2vec debug training value. ( '0' = No debug output / '1' = Enable debug output / '2' = Even more debug output )

Output:

None

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->SetDebugTraining( 0 );

undef( $w2v );

SetBinaryOutput

Description:

Sets member variable to integer parameter. Sets word2vec binary parameter value.

Input:

$value -> Word2vec binary output mode value. ( '1' = Binary Output / '0' = Plain Text )

Output:

None

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->SetBinaryOutput( 1 );

undef( $w2v );

SetSaveVocabFilePath

Description:

Sets member variable to string parameter. Sets word2vec save vocabulary file name.

Input:

$string -> Word2vec save vocabulary file name and path.

Output:

None

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->SetSaveVocabFilePath( "samples/vocab.txt" );

undef( $w2v );

SetReadVocabFilePath

Description:

Sets member variable to string parameter. Sets word2vec read vocabulary file name.

Input:

$string -> Word2vec read vocabulary file name and path.

Output:

None

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->SetReadVocabFilePath( "samples/vocab.txt" );

undef( $w2v );

SetUseCBOW

Description:

Sets member variable to integer parameter. Sets word2vec CBOW parameter value.

Input:

$value -> Word2vec CBOW mode value. ( '1' = Continuous Bag Of Words Model / '0' = Skip-Gram Model )

Output:

None

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->SetUseCBOW( 1 );

undef( $w2v );

SetWorkingDir

Description:

Sets member variable to string parameter. Sets working directory.

Input:

$string -> Working directory

Output:

None

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->SetWorkingDir( "/samples" );

undef( $w2v );

SetWord2VecExeDir

Description:

Sets member variable to string parameter. Sets word2vec executable file directory.

Input:

$string -> Word2vec directory

Output:

None

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->SetWord2VecExeDir( "/word2vec" );

undef( $w2v );
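
Note: The mutators above collect values that the word2vec command-line tool accepts as flags. A minimal plain Perl sketch of assembling such settings into an argument list follows. It is an illustration only and not the module's implementation: the helper name BuildWord2vecArguments and the settings hash keys are hypothetical, while the flags are those listed in the word2vec tool's usage text.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Word2vec::Word2vec): maps a settings
# hash reference onto word2vec command-line flags, in a fixed order.
sub BuildWord2vecArguments
{
    my ( $settings ) = @_;

    # [ hypothetical setting name, word2vec flag ]
    my @flagMap = (
        [ "trainFilePath",   "-train"     ],
        [ "outputFilePath",  "-output"    ],
        [ "wordVecSize",     "-size"      ],
        [ "windowSize",      "-window"    ],
        [ "sample",          "-sample"    ],
        [ "hSoftMax",        "-hs"        ],
        [ "negative",        "-negative"  ],
        [ "numOfThreads",    "-threads"   ],
        [ "numOfIterations", "-iter"      ],
        [ "minCount",        "-min-count" ],
        [ "alpha",           "-alpha"     ],
        [ "binaryOutput",    "-binary"    ],
        [ "useCBOW",         "-cbow"      ],
    );

    my @args = ();
    for my $pair ( @flagMap )
    {
        my ( $name, $flag ) = @$pair;

        # Settings left undefined are skipped so word2vec applies its own defaults.
        push( @args, $flag, $settings->{ $name } ) if defined( $settings->{ $name } );
    }

    return @args;
}

my @args = BuildWord2vecArguments( { trainFilePath  => "textCorpus.txt",
                                     outputFilePath => "vectors.bin",
                                     wordVecSize    => 200,
                                     useCBOW        => 1 } );
print( join( " ", "word2vec", @args ) . "\n" );
```

Returning a list rather than a single string lets the caller pass the arguments to system() without shell quoting concerns.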

SetVocabularyHash

Description:

Sets vocabulary/dictionary array to de-referenced array reference parameter.

Warning: This will overwrite any existing vocabulary/dictionary array data.

Input:

$arrayReference -> Vocabulary/Dictionary array reference of word2vec word vectors.

Output:

None

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->ReadTrainedVectorDataFromFile( "samples/samplevectors.bin" );
my @vocab = $w2v->GetVocabularyHash();
$w2v->SetVocabularyHash( \@vocab );

undef( $w2v );

ClearVocabularyHash

Description:

Clears vocabulary/dictionary array.

Input:

None

Output:

None

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->ClearVocabularyHash();

undef( $w2v );

AddWordVectorToVocabHash

Description:

Adds word vector string to vocabulary/dictionary.

Input:

$string -> Word2vec word vector string

Output:

None

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();

# Note: This is representational data of word2vec's word vector format and not actual data.
$w2v->AddWordVectorToVocabHash( "of 0.4346 -0.1235 0.5789 0.2347 -0.0056 -0.0001" );

undef( $w2v );
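
Note: The string passed above is a word followed by its space-separated vector components. A minimal plain Perl sketch of splitting such a string follows; the helper name ParseWordVectorString is hypothetical and is not part of Word2vec::Word2vec, and the sketch does not show how the module stores the data internally.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Word2vec::Word2vec): splits a plain text
# word vector string into the word and an array reference of its components.
sub ParseWordVectorString
{
    my ( $str ) = @_;
    return if !defined( $str ) || $str =~ /^\s*$/;

    # split( ' ', ... ) splits on runs of whitespace and ignores leading whitespace.
    my ( $word, @components ) = split( ' ', $str );
    return ( $word, \@components );
}

my ( $word, $vecRef ) = ParseWordVectorString( "of 0.4346 -0.1235 0.5789 0.2347 -0.0056 -0.0001" );
print( "Word: $word, Dimensions: " . scalar( @$vecRef ) . "\n" );
```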

SetOverwriteOldFile

Description:

Sets member variable to integer parameter. Enables overwriting output file if one already exists.

Input:

$value -> '1' = Overwrite existing file / '0' = Graceful termination when file with same name exists

Output:

None

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->SetOverwriteOldFile( 1 );

undef( $w2v );
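
Note: The overwrite rule can be sketched in plain Perl as a check made before writing the output file. This is an illustration under assumptions, not the module's implementation; the helper name CanWriteOutputFile is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Word2vec::Word2vec): returns 1 when the
# output file may be written, 0 when an existing file must be left alone.
sub CanWriteOutputFile
{
    my ( $filePath, $overwrite ) = @_;

    return 1 if !( -e $filePath );    # No existing file, nothing to overwrite
    return ( defined( $overwrite ) && $overwrite == 1 ) ? 1 : 0;
}

my $outputPath = "vectors.bin";

if( CanWriteOutputFile( $outputPath, 0 ) )
{
    print( "Writing output to \"$outputPath\"\n" );
}
else
{
    print( "\"$outputPath\" already exists, terminating without overwriting\n" );
}
```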

Debug Functions

GetTime

Description:

Returns current time string in "Hour:Minute:Second" format.

Input:

None

Output:

$string -> XX:XX:XX ("Hour:Minute:Second")

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
my $time = $w2v->GetTime();

print( "Current Time: $time\n" ) if defined( $time );

undef( $w2v );
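
Note: A string in this format can be produced in plain Perl with localtime and sprintf. The sketch below is not the module's implementation; the helper name FormatTimeString is hypothetical, and zero-padded two-digit fields are an assumption based on the XX:XX:XX output shown above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Word2vec::Word2vec): formats hour, minute
# and second as a zero-padded "Hour:Minute:Second" string.
sub FormatTimeString
{
    my ( $hour, $min, $sec ) = @_;
    return sprintf( "%02d:%02d:%02d", $hour, $min, $sec );
}

# localtime() in list context returns ( sec, min, hour, ... ).
my ( $sec, $min, $hour ) = localtime();
print( "Current Time: " . FormatTimeString( $hour, $min, $sec ) . "\n" );
```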

GetDate

Description:

Returns current month, day and year string in "Month/Day/Year" format.

Input:

None

Output:

$string -> XX/XX/XXXX ("Month/Day/Year")

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
my $date = $w2v->GetDate();

print( "Current Date: $date\n" ) if defined( $date );

undef( $w2v );
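
Note: A "Month/Day/Year" string can be produced in plain Perl with strftime from the core POSIX module. The sketch below is not the module's implementation; the helper name FormatDateString is hypothetical, and zero-padded month and day fields are an assumption based on the XX/XX/XXXX output shown above.

```perl
use strict;
use warnings;
use POSIX qw( strftime );

# Hypothetical helper (not part of Word2vec::Word2vec): formats time fields,
# in the order returned by localtime(), as "Month/Day/Year".
sub FormatDateString
{
    my @timeFields = @_;
    return strftime( "%m/%d/%Y", @timeFields );
}

print( "Current Date: " . FormatDateString( localtime() ) . "\n" );
```

The time fields follow localtime conventions: the month is zero-based and the year is an offset from 1900.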

WriteLog

Description:

Prints passed string parameter to the console, log file or both depending on user options.

Note: A newline character is printed after the string unless the printNewLine parameter is 0.
Leaving the parameter undefined also prints the newline character.

Input:

$string -> String to print to the console/log file.
$value  -> 0 = Do not print newline character after string, all else prints new line character including 'undef'.

Output:

None

Example:

use Word2vec::Word2vec;

my $w2v = Word2vec::Word2vec->new();
$w2v->WriteLog( "Hello World" );

undef( $w2v );
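
Note: The newline rule can be sketched in plain Perl as follows. This is an illustration only and not the module's implementation; the helper name FormatLogString is hypothetical, and it returns the text instead of printing it to the console or a log file.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Word2vec::Word2vec): applies the newline
# rule described for WriteLog and returns the resulting text.
sub FormatLogString
{
    my ( $string, $printNewLine ) = @_;

    # Only an explicit 0 suppresses the newline; undef and all other values append it.
    return $string if defined( $printNewLine ) && $printNewLine eq "0";
    return $string . "\n";
}

print( FormatLogString( "Hello " , 0 ) );    # No newline, next output continues the line
print( FormatLogString( "World" ) );         # Undefined parameter, newline appended
```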

AUTHOR

Clint Cuffy, Virginia Commonwealth University

COPYRIGHT

Copyright (c) 2016

Bridget T McInnes, Virginia Commonwealth University
btmcinnes at vcu dot edu

Clint Cuffy, Virginia Commonwealth University
cuffyca at vcu dot edu

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to:

The Free Software Foundation, Inc.,
59 Temple Place - Suite 330,
Boston, MA  02111-1307, USA.