
Algorithm::VSM --- A pure-Perl implementation for constructing a Vector Space Model (VSM) or a Latent Semantic Analysis Model (LSA) of a software library and for using such a model for efficient retrieval of files in response to search words.

SYNOPSIS

# FOR CONSTRUCTING A VSM MODEL FOR RETRIEVAL:

      use Algorithm::VSM;

      my $corpus_dir = "corpus";
      my @query = qw/ program listiterator add arraylist args /;
      my $stop_words_file = "stop_words.txt";  
      my $corpus_vocab_db = "corpus_vocab_db";
      my $doc_vectors_db  = "doc_vectors_db"; 
      my $vsm = Algorithm::VSM->new( 
                         corpus_directory         => $corpus_dir,
                         corpus_vocab_db          => $corpus_vocab_db,
                         doc_vectors_db           => $doc_vectors_db, 
                         stop_words_file          => $stop_words_file,
                         max_number_retrievals    => 10,
                         want_stemming            => 1,  
      #                  debug                    => 1,
      );
      $vsm->get_corpus_vocabulary_and_word_counts();
      $vsm->generate_document_vectors();
      $vsm->display_corpus_vocab();
      $vsm->display_doc_vectors();
      my $retrievals = $vsm->retrieve_with_vsm( \@query );
      $vsm->display_retrievals( $retrievals );

   The constructor parameter 'corpus_directory' names the root of the
   directory whose VSM model you wish to construct.  The parameters
   'corpus_vocab_db' and 'doc_vectors_db' name the disk-based databases
   in which the VSM model will be stored.  Subsequently, these databases
   can be used for much faster retrieval from the same corpus.  Setting
   the parameter 'want_stemming' causes the words in the documents to be
   stemmed to their root forms before the VSM model is constructed.
   Stemming will reduce words such as 'programming,' 'programs,' and
   'program' to the same root word 'program.'

   The functions display_corpus_vocab() and display_doc_vectors() are
   there only for testing purposes with small corpora.  If you must use
   them for large libraries/corpora, you might wish to redirect the
   output to a file.  The 'debug' option, when turned on, will output a
   large number of intermediate results in the calculation of the model.
   It is best to redirect the output to a file if 'debug' is on.
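
   One way to capture that output in a file is a select()-based
   redirection, sketched below.  This assumes the display methods write
   with plain print calls; the dump filename is hypothetical.

      open my $dump, '>', 'vocab_dump.txt' or die "Cannot open dump file: $!";
      my $old_fh = select $dump;         # route print output to the file
      $vsm->display_corpus_vocab();
      $vsm->display_doc_vectors();
      select $old_fh;                    # restore the previous default filehandle
      close $dump;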



# FOR CONSTRUCTING AN LSA MODEL FOR RETRIEVAL:

      my $corpus_dir = "corpus";
      my @query = qw/ program listiterator add arraylist args /;
      my $stop_words_file = "stop_words.txt";
      my $corpus_vocab_db = "corpus_vocab_db";
      my $doc_vectors_db  = "doc_vectors_db";
      my $lsa_doc_vectors_db = "lsa_doc_vectors_db";
      my $vsm = Algorithm::VSM->new( 
                         corpus_directory         => $corpus_dir,
                         corpus_vocab_db          => $corpus_vocab_db,
                         doc_vectors_db           => $doc_vectors_db,
                         lsa_doc_vectors_db       => $lsa_doc_vectors_db,
                         stop_words_file          => $stop_words_file,
                         want_stemming            => 1,
                         lsa_svd_threshold        => 0.01, 
                         max_number_retrievals    => 10,
      #                  debug                    => 1,
      );
      $vsm->get_corpus_vocabulary_and_word_counts();
      $vsm->generate_document_vectors();
      $vsm->display_corpus_vocab();           # only on a small corpus
      $vsm->display_doc_vectors();            # only on a small corpus
      $vsm->construct_lsa_model();
      my $retrievals = $vsm->retrieve_with_lsa( \@query );
      $vsm->display_retrievals( $retrievals );

  In the calls above, the constructor parameter lsa_svd_threshold
  determines how many of the singular values will be retained after we
  have carried out an SVD decomposition of the term-frequency matrix for
  the documents in the corpus.  Singular values smaller than this
  threshold fraction of the largest value are rejected.  The parameters
  that end in '_db' name the database files in which the LSA model will
  be stored.  We have already mentioned the roles played by the
  parameters 'corpus_vocab_db' and 'doc_vectors_db' (see the explanation
  that goes with the previous constructor-call example).  The
  database-related parameter 'lsa_doc_vectors_db' names the file in
  which we will store the reduced-dimensionality document vectors for
  the LSA model.  This allows fast LSA-based search to be carried out
  subsequently.
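
  As a concrete illustration of the thresholding rule (the numbers here
  are made up for the example):

      my @singular_values = (42.0, 17.3, 0.9, 0.5, 0.02);
      my $cutoff   = 0.01 * $singular_values[0];              # threshold fraction of the largest
      my @retained = grep { $_ > $cutoff } @singular_values;  # keeps the first four values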



# FOR USING A PREVIOUSLY CONSTRUCTED VSM MODEL FOR RETRIEVAL:

      my @query = qw/ program listiterator add arraylist args /;
      my $corpus_vocab_db = "corpus_vocab_db";
      my $doc_vectors_db  = "doc_vectors_db";
      my $vsm = Algorithm::VSM->new( 
                         corpus_vocab_db           => $corpus_vocab_db, 
                         doc_vectors_db            => $doc_vectors_db,
                         max_number_retrievals     => 10,
      #                  debug                     => 1,
      );
      $vsm->upload_vsm_model_from_disk();
      $vsm->display_corpus_vocab();            # only on a small corpus
      $vsm->display_doc_vectors();             # only on a small corpus
      my $retrievals = $vsm->retrieve_with_vsm( \@query );
      $vsm->display_retrievals( $retrievals );



# FOR USING A PREVIOUSLY CONSTRUCTED LSA MODEL FOR RETRIEVAL:

      my @query = qw/ program listiterator add arraylist args /;
      my $corpus_vocab_db = "corpus_vocab_db";
      my $doc_vectors_db  = "doc_vectors_db";
      my $lsa_doc_vectors_db = "lsa_doc_vectors_db";
      my $vsm = Algorithm::VSM->new( 
                         corpus_vocab_db          => $corpus_vocab_db,
                         doc_vectors_db           => $doc_vectors_db,
                         lsa_doc_vectors_db       => $lsa_doc_vectors_db,
                         max_number_retrievals    => 10,
      #                  debug               => 1,
      );
      $vsm->upload_lsa_model_from_disk();
      $vsm->display_corpus_vocab();          # only on a small corpus
      $vsm->display_doc_vectors();           # only on a small corpus 
      my $retrievals = $vsm->retrieve_with_lsa( \@query );
      $vsm->display_retrievals( $retrievals );



# FOR MEASURING PRECISION VERSUS RECALL FOR VSM:

      my $corpus_dir = "corpus";   
      my $stop_words_file = "stop_words.txt";  
      my $query_file      = "test_queries.txt";  
      my $relevancy_file   = "relevancy.txt";   # All relevancy judgments
                                                # will be stored in this file
      my $vsm = Algorithm::VSM->new( 
                         corpus_directory    => $corpus_dir,
                         stop_words_file     => $stop_words_file,
                         query_file          => $query_file,
                         want_stemming       => 1,
                         relevancy_threshold => 5, 
                         relevancy_file      => $relevancy_file, 
      #                  debug               => 1,
      );

      $vsm->get_corpus_vocabulary_and_word_counts();
      $vsm->generate_document_vectors();
      $vsm->estimate_doc_relevancies($query_file);
      $vsm->display_corpus_vocab();                  # used only for testing
      $vsm->display_doc_relevancies();               # used only for testing
      $vsm->precision_and_recall_calculator('vsm');
      $vsm->display_precision_vs_recall_for_queries();
      $vsm->display_map_values_for_queries();

    Measuring precision and recall requires a set of queries.  These are
    supplied through the constructor parameter 'query_file'.  The format
    of this file must follow that of the sample file 'test_queries.txt'
    in the 'examples' directory.  The module estimates the relevancies
    of the documents to the queries and dumps the relevancies in a file
    named by the 'relevancy_file' constructor parameter.  The
    constructor parameter 'relevancy_threshold' is used in deciding
    which of the documents are considered to be relevant to a query.  A
    document must contain at least 'relevancy_threshold' occurrences of
    query words in order to be considered relevant to a query.
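
    The counting rule behind 'relevancy_threshold' can be pictured with
    the following sketch.  Here @doc_words, holding the words of one
    document, is a hypothetical variable; the module's actual
    bookkeeping is internal to it:

      my %query_words = map { $_ => 1 } @query;
      my $occurrences = grep { $query_words{$_} } @doc_words;  # count of query-word hits
      my $is_relevant = $occurrences >= 5;                     # relevancy_threshold of 5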



# FOR MEASURING PRECISION VERSUS RECALL FOR LSA:

      my $corpus_dir = "corpus";    
      my $stop_words_file = "stop_words.txt";  
      my $query_file      = "test_queries.txt"; 
      my $relevancy_file   = "relevancy.txt";  

      my $vsm = Algorithm::VSM->new( 
                         corpus_directory    => $corpus_dir,
                         stop_words_file     => $stop_words_file,
                         query_file          => $query_file,
                         want_stemming       => 1,
                         lsa_svd_threshold   => 0.01,
                         relevancy_threshold => 5,
                         relevancy_file      => $relevancy_file,
      #                   debug               => 1,
      );

      $vsm->get_corpus_vocabulary_and_word_counts();
      $vsm->generate_document_vectors();
      $vsm->construct_lsa_model();
      $vsm->estimate_doc_relevancies($query_file);
      $vsm->display_doc_relevancies();
      $vsm->precision_and_recall_calculator('lsa');
      $vsm->display_precision_vs_recall_for_queries();
      $vsm->display_map_values_for_queries();

    We have already explained the purpose of the constructor parameter
    'query_file' and the constraints on the format of the queries in the
    file named through this parameter.  As mentioned earlier, the module
    estimates the relevancies of the documents to the queries and dumps
    the relevancies in a file named by the 'relevancy_file' constructor
    parameter.  The constructor parameter 'relevancy_threshold' is used
    in deciding which of the documents are considered to be relevant to
    a query.  A document must contain at least 'relevancy_threshold'
    occurrences of query words in order to be considered relevant to a
    query.  We have previously explained the role of the constructor
    parameter 'lsa_svd_threshold'.


# FOR MEASURING PRECISION VERSUS RECALL FOR VSM USING FILE-BASED RELEVANCE JUDGMENTS:

      my $corpus_dir = "corpus";  
      my $stop_words_file = "stop_words.txt";
      my $query_file      = "test_queries.txt";
      my $relevancy_file   = "relevancy.txt";  

      my $vsm = Algorithm::VSM->new( 
                 corpus_directory    => $corpus_dir,
                 stop_words_file     => $stop_words_file,
                 query_file          => $query_file,
                 want_stemming       => 1,
                 relevancy_file      => $relevancy_file,
      #        debug               => 1,
      );

      $vsm->get_corpus_vocabulary_and_word_counts();
      $vsm->generate_document_vectors();
      $vsm->upload_document_relevancies_from_file();  
      $vsm->display_doc_relevancies();
      $vsm->precision_and_recall_calculator('vsm');
      $vsm->display_precision_vs_recall_for_queries();
      $vsm->display_map_values_for_queries();

  Now the filename supplied through the constructor parameter
  'relevancy_file' must contain relevance judgments for the queries that
  are named in the file supplied through the parameter 'query_file'.  The
  format of these two files must be according to what is shown in the
  sample files 'test_queries.txt' and 'relevancy.txt' in the 'examples'
  directory.



# FOR MEASURING PRECISION VERSUS RECALL FOR LSA USING FILE-BASED RELEVANCE JUDGMENTS:

      my $corpus_dir = "corpus";  
      my $stop_words_file = "stop_words.txt";
      my $query_file      = "test_queries.txt";
      my $relevancy_file  = "relevancy.txt";  
      my $corpus_vocab_db = "corpus_vocab_db";
      my $doc_vectors_db  = "doc_vectors_db";

      my $vsm = Algorithm::VSM->new( 
                 corpus_directory    => $corpus_dir,
                 corpus_vocab_db     => $corpus_vocab_db,
                 doc_vectors_db      => $doc_vectors_db,
                 stop_words_file     => $stop_words_file,
                 query_file          => $query_file,
                 want_stemming       => 1,
                 lsa_svd_threshold   => 0.01,
                 relevancy_file      => $relevancy_file,
      #        debug               => 1,
      );

      $vsm->get_corpus_vocabulary_and_word_counts();
      $vsm->generate_document_vectors();
      $vsm->display_corpus_vocab();           # only on a small corpus
      $vsm->display_doc_vectors();            # only on a small corpus
      $vsm->construct_lsa_model();
      $vsm->upload_document_relevancies_from_file();  
      $vsm->display_doc_relevancies();
      $vsm->precision_and_recall_calculator('lsa');
      $vsm->display_precision_vs_recall_for_queries();
      $vsm->display_map_values_for_queries();

  As mentioned for the previous code block, the filename supplied through
  the constructor parameter 'relevancy_file' must contain relevance
  judgments for the queries that are named in the file supplied through
  the parameter 'query_file'.  The format of this file must be according
  to what is shown in the sample file 'relevancy.txt' in the 'examples'
  directory.  We have already explained the roles played by the
  constructor parameters such as 'lsa_svd_threshold'.

DESCRIPTION

Algorithm::VSM is a perl5 module for constructing a Vector Space Model (VSM) or a Latent Semantic Analysis Model (LSA) of a collection of documents, usually referred to as a corpus, and then retrieving the documents in response to search words in a query.

VSM and LSA models have been around for a long time in the Information Retrieval (IR) community. More recently such models have been shown to be effective in retrieving files/documents from software libraries. For an account of this research that was presented by Shivani Rao and the author of this module at the 2011 Mining Software Repositories conference, see http://portal.acm.org/citation.cfm?id=1985451.

VSM modeling consists of the following steps:

(1) Extracting the vocabulary used in a corpus.

(2) Stemming the extracted words and eliminating the designated stop words from the vocabulary. Stemming means that closely related words like 'programming' and 'programs' are reduced to the common root word 'program'; the stop words are the non-discriminating words that can be expected to exist in virtually all the documents.

(3) Constructing document vectors for the individual files in the corpus. The document vectors taken together constitute what is usually referred to as a 'term-frequency' matrix for the corpus.

(4) Constructing a query vector for the search query, after the query is subject to the same stemming and stop-word elimination rules that were applied to the corpus.

(5) Using a similarity metric to return the set of documents that are most similar to the query vector. The commonly used similarity metric is one based on the cosine distance between two vectors. Note that all the vectors mentioned here are of the same size, namely the size of the vocabulary extracted from the corpus; an element of a vector is the frequency of occurrence of the word corresponding to that position in the vector.
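
To make step (5) concrete, here is a minimal, self-contained sketch of the cosine similarity computation between two equal-sized term-frequency vectors. This is plain illustrative Perl, not the module's internal code:

use strict;
use warnings;

# cosine similarity between two term-frequency vectors of equal size
sub cosine_similarity {
    my ($vec1, $vec2) = @_;                  # two array references
    my ($dot, $norm1, $norm2) = (0, 0, 0);
    for my $i (0 .. $#$vec1) {
        $dot   += $vec1->[$i] * $vec2->[$i];
        $norm1 += $vec1->[$i] ** 2;
        $norm2 += $vec2->[$i] ** 2;
    }
    return 0 unless $norm1 && $norm2;        # guard against all-zero vectors
    return $dot / (sqrt($norm1) * sqrt($norm2));
}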

LSA modeling is a small variation on VSM modeling. You take VSM modeling one step further by subjecting the term-frequency matrix for the corpus to singular value decomposition (SVD). By retaining only a subset of the singular values (usually the N largest for some value of N), you can construct reduced-dimensionality vectors for the documents and the queries. In VSM, as mentioned above, the size of the document and the query vectors equals the size of the vocabulary. For large corpora, this size may run into tens of thousands of elements, which can slow down both VSM model construction and retrieval. So you are very likely to get faster performance with retrieval based on LSA modeling, especially if you store the model once constructed in a database file on the disk and carry out retrievals using the disk-based model.
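
Since this module computes its SVD with PDL (see REQUIRED), the reduction step can be sketched with PDL directly. The toy matrix below stands in for the real vocabulary-by-documents term-frequency matrix, and the 0.01 mirrors the lsa_svd_threshold constructor parameter:

use PDL;    # brings in PDL::MatrixOps::svd among others

# a toy 3x3 term-frequency matrix standing in for the real one
my $tf = pdl([ [2, 0, 1],
               [0, 1, 0],
               [1, 1, 3] ]);

my ($U, $S, $V) = svd($tf);

# retain only the singular values above the threshold fraction of the largest
my $keep = which($S > 0.01 * $S->max);
print "number of retained singular values: ", $keep->nelem, "\n";

In the actual module, the reduced-dimensionality document vectors obtained from such a decomposition are what get stored in the file named by lsa_doc_vectors_db.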

CAN THIS MODULE BE USED FOR GENERAL TEXT RETRIEVAL?

This module has only been tested for software retrieval. For more general text retrieval, you would need to replace the simple stemmer used in the module by one based on, say, Porter's Stemming Algorithm. You would also need to vastly expand the list of stop words appropriate to the text corpora of interest to you. As previously mentioned, the stop words are the commonly occurring words that do not carry much discriminatory power from the standpoint of distinguishing between the documents. See the file 'stop_words.txt' in the 'examples' directory for how such a file must be formatted.

HOW DOES ONE DEAL WITH VERY LARGE LIBRARIES/CORPORA?

It is not uncommon for large software libraries to consist of tens of thousands of documents that include source-code files, documentation files, README files, configuration files, etc. The bug-localization work presented recently by Shivani Rao and this author at the 2011 Mining Software Repositories conference (MSR11) was based on a relatively small iBUGS dataset involving 6546 documents and a vocabulary size of 7553 unique words. (Here is a link to this work: http://portal.acm.org/citation.cfm?id=1985451. Also note that the iBUGS dataset was originally put together by V. Dallmeier and T. Zimmermann for the evaluation of automated bug detection and localization tools.) If V is the size of the vocabulary and M the number of documents in the corpus, each document vector will be of size V and the term-frequency matrix for the entire corpus will be of size V x M. So if you were to duplicate the bug-localization experiments in http://portal.acm.org/citation.cfm?id=1985451, you would be dealing with vectors of size 7553 and a term-frequency matrix of size 7553 x 6546 --- roughly 49 million entries for even this modest corpus. Extrapolating these numbers to really large libraries/corpora, we are obviously talking about very large matrices for SVD decomposition. For large libraries/corpora, it would be best to store away the model in a disk file and to base all subsequent retrievals on the disk-stored models. The 'examples' directory contains scripts that carry out retrievals on the basis of disk-based models. Further speedup in retrieval can be achieved by using LSA to create reduced-dimensionality representations for the documents and by basing retrievals on the stored versions of such reduced-dimensionality representations.

ESTIMATING RETRIEVAL PERFORMANCE WITH PRECISION VS. RECALL CALCULATIONS

The performance of a retrieval algorithm is typically measured by two properties, Precision and Recall, at a given rank r. As explained in the http://portal.acm.org/citation.cfm?id=1985451 publication, Precision at rank r is the ratio of the number of retrieved documents that are relevant to the total number of retrieved documents up to that rank. Along the same lines, Recall at rank r is the ratio of the number of retrieved documents that are relevant to the total number of relevant documents. The area under the Precision--Recall curve is called the Average Precision for a query. When the Average Precision is averaged over all the queries, we obtain what is known as Mean Average Precision (MAP). For an oracle, the value of MAP should be 1.0. On the other hand, for purely random retrieval from a corpus, the value of MAP will be inversely proportional to the size of the corpus. (See the discussion in http://RVL4.ecn.purdue.edu/~kak/SignifanceTesting.pdf for further explanation of these performance evaluators.) This module includes methods that allow you to carry out these performance measurements using relevancy judgments supplied through a disk file. If human-supplied relevancy judgments are not available, the module will estimate relevancies for you simply by counting the number of query words that exist in a document. Note, however, that relevancy judgments estimated in this manner cannot be trusted, because ultimately it is the humans who are the best judges of the relevancies of documents to queries. Humans bring to bear semantic considerations on the relevancy determination problem that are beyond the scope of this module.
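
For intuition, the Average Precision of a single query can be computed from a ranked retrieval list along the following lines. This is a standalone sketch, not the module's own precision_and_recall_calculator():

use strict;
use warnings;

# average precision for one query: the precision at each rank where a
# relevant document shows up, averaged over the total number of relevant docs
sub average_precision {
    my ($ranked_docs, $relevant) = @_;       # arrayref of names, hashref of relevant names
    my $total_relevant = scalar keys %$relevant;
    return 0 unless $total_relevant;
    my ($hits, $sum) = (0, 0);
    for my $rank (1 .. @$ranked_docs) {
        next unless $relevant->{ $ranked_docs->[$rank - 1] };
        $hits++;
        $sum += $hits / $rank;               # precision at this rank
    }
    return $sum / $total_relevant;
}

MAP is then just the mean of average_precision() over all the test queries.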

METHODS

The module provides the following methods for constructing VSM and LSA models of a corpus, for using the models thus constructed for retrieval, and for carrying out precision versus recall calculations for the determination of retrieval accuracy on the corpora of interest to you.

new():

A call to new() constructs a new instance of the Algorithm::VSM class:

my $vsm = Algorithm::VSM->new( 
                 corpus_directory    => "",
                 corpus_vocab_db     => "corpus_vocab_db",
                 doc_vectors_db      => "doc_vectors_db",
                 lsa_doc_vectors_db  => "lsa_doc_vectors_db",  
                 stop_words_file     => "", 
                 want_stemming       => 1,
                 min_word_length     => 4,
                 lsa_svd_threshold   => 0.01, 
                 query_file          => "",  
                 relevancy_threshold => 5, 
                 relevancy_file      => $relevancy_file,
                 max_number_retrievals    => 10,
                 debug               => 0,
                             );       

The values shown to the right of the fat arrows (=>) are the default values for the parameters. The following list describes each of the constructor parameters:

corpus_directory:

The parameter corpus_directory points to the root of the directory of documents for which you want to create a VSM or LSA model.

corpus_vocab_db:

The parameter corpus_vocab_db is for naming the DBM in which the corpus vocabulary will be stored after it is subject to stemming and the elimination of stop words. Once a disk-based VSM model is created and stored away in the file named by this parameter and the parameter to be described next, it can subsequently be used directly for speedier retrieval.
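
Because SDBM_File is among the module's dependencies (see REQUIRED), a stored vocabulary database can presumably be reopened with an SDBM tie, as sketched below. Treat this as an assumption about the on-disk format rather than a documented interface; the record layout is internal to the module:

use Fcntl;
use SDBM_File;

# peek at a previously created vocabulary database (assumed SDBM format)
tie my %vocab, 'SDBM_File', 'corpus_vocab_db', O_RDONLY, 0644
    or die "Cannot tie to corpus_vocab_db: $!";
print "stored vocabulary entries: ", scalar keys %vocab, "\n";
untie %vocab;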

doc_vectors_db:

The database named by doc_vectors_db stores the document vector representation for each document in the corpus. Each document vector has the same size as the corpus-wide vocabulary; each element of such a vector is the number of occurrences of the word that corresponds to that position in the vocabulary vector.
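
In other words, each stored document vector is a word-count histogram over the corpus vocabulary. A toy illustration of that layout:

# a document vector over a 5-word vocabulary; the counts are made up
my @vocabulary  = qw/ add args arraylist listiterator program /;
my %word_counts = ( program => 3, add => 1 );            # counts seen in one document
my @doc_vector  = map { $word_counts{$_} || 0 } @vocabulary;
# @doc_vector is now (1, 0, 0, 0, 3)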

lsa_doc_vectors_db:

The database named by lsa_doc_vectors_db stores the reduced-dimensionality vectors for each of the corpus documents. These vectors are created for LSA modeling of a corpus.

stop_words_file:

The parameter stop_words_file is for naming the file that contains the stop words that you do not wish to include in the corpus vocabulary. The format of this file must be as shown in the sample file stop_words.txt in the 'examples' directory.

want_stemming:

The boolean parameter want_stemming determines whether or not the words extracted from the documents would be subject to stemming. As mentioned elsewhere, stemming means that related words like 'programming' and 'programs' would both be reduced to the root word 'program'.

min_word_length:

The parameter min_word_length sets the minimum number of characters a word must have in order for it to be included in the corpus vocabulary.

lsa_svd_threshold:

The parameter lsa_svd_threshold is used for rejecting singular values that are smaller than this threshold fraction of the largest singular value. This plays a critical role in creating reduced-dimensionality document vectors in LSA modeling of a corpus.

query_file:

The parameter query_file points to a file that contains the queries to be used for calculating retrieval performance with Precision and Recall numbers. The format of the query file must be as shown in the sample file test_queries.txt in the 'examples' directory.

relevancy_threshold:

The constructor parameter relevancy_threshold is used for the automatic determination of document relevancies to queries on the basis of the number of occurrences of query words in a document. A document is considered relevant to a query only when it contains at least relevancy_threshold occurrences of the query words.

max_number_retrievals:

The constructor parameter max_number_retrievals sets the maximum number of top-ranked documents that are shown when retrieval results are displayed.

debug:

Finally, when you set the boolean parameter debug, the module outputs a very large number of intermediate results generated during model construction and during the matching of a query with the document vectors.


get_corpus_vocabulary_and_word_counts():

After you have constructed a new instance of the Algorithm::VSM class, you must now scan the corpus documents for constructing the corpus vocabulary. This you do by:

$vsm->get_corpus_vocabulary_and_word_counts();

The only time you do NOT need to call this method is when you are using a previously constructed disk-stored VSM or LSA model for retrieval.

display_corpus_vocab():

If you would like to see the corpus vocabulary as constructed by the previous call, make the call

$vsm->display_corpus_vocab();

Note that this is a useful thing to do only on small test corpora. If you must call this method on a large corpus, you might wish to direct the output to a file. The corpus vocabulary is shown automatically when the debug option is turned on.

generate_document_vectors():

This is a necessary step after the vocabulary used by a corpus is constructed. (Of course, if you will be doing document retrieval through a disk-stored VSM or LSA model, then you do not need to call this method.) You construct document vectors through the following call:

$vsm->generate_document_vectors();

display_doc_vectors():

If you would like to see the document vectors constructed by the previous call, make the call:

$vsm->display_doc_vectors();

Note that this is a useful thing to do only on small test corpora. If you must call this method on a large corpus, you might wish to direct the output to a file. The document vectors are shown automatically when the debug option is turned on.

retrieve_with_vsm():

After you have constructed a VSM model, you call this method for document retrieval for a given query @query. The call syntax is:

my $retrievals = $vsm->retrieve_with_vsm( \@query );

The argument, @query, is simply a list of words that you wish to use for retrieval. The method returns a reference to a hash whose keys are the document names and whose values are the similarity distances between the documents and the query. As is commonly the case with VSM, this module uses the cosine similarity distance when comparing a document vector with the query vector.
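
If you prefer to walk the returned hash yourself instead of calling display_retrievals(), a typical pattern would be the following sketch:

# rank the retrieved documents by decreasing cosine similarity
for my $doc (sort { $retrievals->{$b} <=> $retrievals->{$a} } keys %$retrievals) {
    printf "%-40s  %.4f\n", $doc, $retrievals->{$doc};
}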

display_retrievals( $retrievals ):

You can display the retrieved document names by calling this method using the syntax:

$vsm->display_retrievals( $retrievals );

where $retrievals is a reference to the hash returned by a call to one of the retrieve methods. The display method shown here respects the retrieval size constraints expressed by the constructor parameter max_number_retrievals.

construct_lsa_model():

If, after you have extracted the corpus vocabulary and constructed the document vectors, you want to carry out your retrievals with LSA modeling, you need to make the following call:

$vsm->construct_lsa_model();

The SVD decomposition that is carried out during LSA model construction uses the constructor parameter lsa_svd_threshold to decide how many of the singular values to retain for the LSA model. A singular value is retained only if it is larger than the lsa_svd_threshold fraction of the largest singular value.

retrieve_with_lsa():

After you have built an LSA model through the call to construct_lsa_model(), you can retrieve the document names most similar to the query by:

my $retrievals = $vsm->retrieve_with_lsa( \@query );

Subsequently, you can display the retrievals by calling the display_retrievals($retrieval) method described previously.

upload_vsm_model_from_disk():

When you invoke the methods get_corpus_vocabulary_and_word_counts() and generate_document_vectors(), the VSM model is automatically deposited in the database files named by the constructor parameters corpus_vocab_db and doc_vectors_db. Subsequently, you can carry out retrieval directly with this disk-based VSM model for speedier performance. To do so, you must upload the disk-based model by

$vsm->upload_vsm_model_from_disk();

Subsequently you call

my $retrievals = $vsm->retrieve_with_vsm( \@query );
$vsm->display_retrievals( $retrievals );

for retrieval and for displaying the results.

upload_lsa_model_from_disk():

When you invoke the methods get_corpus_vocabulary_and_word_counts(), generate_document_vectors() and construct_lsa_model(), the LSA model is automatically deposited in the database files named by the constructor parameters corpus_vocab_db, doc_vectors_db and lsa_doc_vectors_db. Subsequently, you can carry out retrieval directly with this disk-based LSA model for speedier performance. To do so, you must upload the disk-based model by

$vsm->upload_lsa_model_from_disk();

Subsequently you call

my $retrievals = $vsm->retrieve_with_lsa( \@query );
$vsm->display_retrievals( $retrievals );

for retrieval and for displaying the results.

estimate_doc_relevancies($query_file):

Before you can carry out precision and recall calculations to test the accuracy of VSM and LSA based retrievals from a corpus, you need to have available the relevancy judgments for the queries. (A relevancy judgment for a query is simply the list of documents relevant to that query.) Relevancy judgments are commonly supplied by the humans who are familiar with the corpus. But if such human-supplied relevance judgments are not available, you can invoke the following method to estimate them:

$vsm->estimate_doc_relevancies("test_queries.txt");

For the above method call, a document is considered to be relevant to a query if it contains several of the query words. As to the minimum number of query words that must exist in a document in order for the latter to be considered relevant, that is determined by the relevancy_threshold parameter in the VSM constructor.

But note that this estimation of document relevancies to queries is NOT for serious work. The reason for that is because ultimately it is the humans who are the best judges of the relevancies of documents to queries. The humans bring to bear semantic considerations on the relevancy determination problem that are beyond the scope of this module.

The generated relevancies are deposited in a file named by the constructor parameter relevancy_file.

display_doc_relevancies():

If you would like to see the document relevancies generated by the previous method, you can call

$vsm->display_doc_relevancies();

precision_and_recall_calculator():

After you have created or obtained the relevancy judgments for your test queries, you can make the following call to calculate Precision@rank and Recall@rank:

$vsm->precision_and_recall_calculator('vsm');

or

$vsm->precision_and_recall_calculator('lsa');

depending on whether you are testing VSM-based retrieval or LSA-based retrieval.

display_precision_vs_recall_for_queries():

A call to precision_and_recall_calculator() is normally followed by the call

$vsm->display_precision_vs_recall_for_queries();

for displaying the Precision@rank and Recall@rank values.

display_map_values_for_queries():

The area under the precision vs. recall curve for a given query is called Average Precision for that query. When this area is averaged over all the queries, you get MAP (Mean Average Precision) as a measure of the accuracy of the retrieval algorithm. The Average Precision values for the queries and the overall MAP can be printed out by calling

$vsm->display_map_values_for_queries();

upload_document_relevancies_from_file():

When human-supplied relevancies are available, you can upload them into the program by calling

$vsm->upload_document_relevancies_from_file();

These relevance judgments will be read from a file that is named with the relevancy_file constructor parameter.

REQUIRED

This module requires the following modules:

SDBM_File
Storable
PDL
PDL::IO::Storable

The first two of these are needed for creating disk-based database records for the VSM and LSA models. The third is needed for calculating the SVD of the term-frequency matrix. (PDL stands for Perl Data Language.) The last is needed for disk storage of the reduced-dimensionality vectors produced during LSA calculations.
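
A quick way to confirm that all four prerequisites are installed is the following loop; it is only a convenience check and not part of the module:

# warn about any missing prerequisite module
for my $module (qw(SDBM_File Storable PDL PDL::IO::Storable)) {
    eval "use $module; 1" or warn "$module not found: $@";
}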

EXAMPLES

See the 'examples' directory in the distribution for the scripts listed below:

For Basic VSM-Based Retrieval:

For basic VSM-based model construction and retrieval, run the script:

retrieve_with_VSM.pl

For Basic LSA-Based Retrieval:

For basic LSA-based model construction and retrieval, run the script:

retrieve_with_LSA.pl

Both of the above scripts will store the corpus models created in disk-based databases.

For VSM-Based Retrieval with a Disk-Stored Model:

If you have previously run a script like retrieve_with_VSM.pl and no intervening code has modified the disk-stored VSM model of the corpus, you can run the script

retrieve_with_disk_based_VSM.pl

This would obviously work faster at retrieval since the VSM model would NOT need to be constructed for each new query.

For LSA-Based Retrieval with a Disk-Stored Model:

If you have previously run a script like retrieve_with_LSA.pl and no intervening code has modified the disk-stored LSA model of the corpus, you can run the script

retrieve_with_disk_based_LSA.pl

The retrieval performance of such a script would be faster since the LSA model would NOT need to be constructed for each new query.

For Precision and Recall Calculations with VSM:

To experiment with precision and recall calculations for VSM retrieval, run the script:

calculate_precision_and_recall_for_VSM.pl

Note that this script will carry out its own estimation of relevancy judgments --- which in most cases would not be a safe thing to do.

For Precision and Recall Calculations with LSA:

To experiment with precision and recall calculations for LSA retrieval, run the script:

calculate_precision_and_recall_for_LSA.pl

Note that this script will carry out its own estimation of relevancy judgments --- which in most cases would not be a safe thing to do.

For Precision and Recall Calculations for VSM with Human-Supplied Relevancies:

Precision and recall calculations for retrieval accuracy determination are best carried out with human-supplied judgments of relevancies of the documents to queries. If such judgments are available, run the script:

calculate_precision_and_recall_from_file_based_relevancies_for_VSM.pl

This script will print out the average precisions for the different test queries and calculate the MAP metric of retrieval accuracy.

For Precision and Recall Calculations for LSA with Human-Supplied Relevancies:

If human-supplied relevancy judgments are available and you wish to experiment with precision and recall calculations for LSA-based retrieval, run the script:

calculate_precision_and_recall_from_file_based_relevancies_for_LSA.pl

This script will print out the average precisions for the different test queries and calculate the MAP metric of retrieval accuracy.

EXPORT

None by design.

BUGS

Please notify the author if you encounter any bugs. When sending email, please place the string 'VSM' in the subject line to get past my spam filter.

INSTALLATION

The usual

perl Makefile.PL
make
make test
make install

if you have root access. If not,

perl Makefile.PL PREFIX=/some/other/directory/
make
make test
make install

THANKS

Many thanks are owed to Shivani Rao for sharing with me her deep insights in IR-based retrieval. She was also of much help with the debugging of this module by bringing to bear on its output her amazing software forensic skills.

AUTHOR

Avinash Kak, kak@purdue.edu

If you send email, please place the string "VSM" in your subject line to get past my spam filter.

COPYRIGHT

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

Copyright 2011 Avinash Kak
