NAME

WebService::GoogleHack - Perl package that ties together all GoogleHack modules.

SYNOPSIS

use WebService::GoogleHack;

my $google = new WebService::GoogleHack;

#Initializing the object to the contents of the configuration file
# API Key, GoogleSearch.wsdl file location.

$google->initConfig("initconfig.txt");

#Printing the contents of the configuration file
$google->printConfig();

#Measure the semantic relatedness between the words "white house" and 
#"president".

$measure=$google->measureSemanticRelatedness("white house","president");

print "\nRelatedness measure between white house and president is: ";
print $measure."\n";

#Going to search for words that are related to "toyota" and "ford" 
my @terms=();
push(@terms,"toyota");
push(@terms,"ford");

#The parameters are the search terms, the number of web page results to look
#at, the frequency cutoff, the number of iterations, the output file, and
#"true", which indicates that the diagnostic data should be stored in the
#file "results.txt"

$results=$google->wordClusterInPage(\@terms,10,25,1,"results.txt","true");

print $results;

DESCRIPTION

WebService::GoogleHack is a Perl package that interacts with the Google API and provides basic functionality for querying Google and retrieving results. It also has some Natural Language Processing capabilities, such as the ability to predict the semantic orientation of words, build word clusters, and find words that are common to a pair of words.

Related Modules:

WebService::GoogleHack::Text

WebService::GoogleHack::Search

WebService::GoogleHack::Rate

WebService::GoogleHack::Spelling

Required Packages

Brill Tagger

 Installation file and instructions @ : 

 http://www.cs.jhu.edu/~brill/RBT1_14.tar.Z

Required Perl Modules

   SOAP::Lite;

   Set::Scalar;

   Text::English;

   LWP::Simple;

   URI::URL;

   LWP::UserAgent;

   HTML::LinkExtor;

   Data::Dumper;

PACKAGE METHODS

__PACKAGE__->new()

Purpose: This function creates an object of type GoogleHack and returns a blessed reference.

__PACKAGE__->initConfig(configLocation)

Purpose: This function is used to read a configuration file containing information such as the Google API key, the words list, etc.

Valid arguments are :

  • configLocation

    string. Location of the configuration file.

returns : Returns an object which contains the parsed information.

__PACKAGE__->printConfig()

Purpose: This function is used to print the information read from a configuration file

No arguments.

__PACKAGE__->setMaxResults(maxResults)

Purpose: This function sets the maximum number of results retrieved

Valid arguments are :

  • maxResults

    Number. The maximum number of results we want to be able to retrieve from a Google search. Should be less than 10.

__PACKAGE__->setlr(lr)

Purpose: This function is used to set the language restriction.

Valid arguments are :

  • lr

    string. Language restriction, e.g. "lang_eng". This will restrict the Google search to web pages in English.

__PACKAGE__->setStartPos(StartPos)

Purpose: This function sets the start position for the search results. This should be an integer between 0 and 1000.

Valid arguments are :

  • StartPos

    string.

__PACKAGE__->setRestrict(Restrict)

Purpose: This function restricts the search to a specific domain.

Valid arguments are :

  • Restrict

    string. For example, "UncleSam" restricts the search to the US Government.

__PACKAGE__->setSafeSearch(Restrict)

Purpose: This function enables safe search, which restricts the search to non-abusive material.

Valid arguments are :

  • Restrict

    Boolean. "True" or "False".

__PACKAGE__->measureSemanticRelatedness(searchString1,searchString2)

Purpose: This function is used to measure the relatedness between two words. It calls the measureSemanticRelatedness function in WebService::GoogleHack::Rate.

Valid arguments are :

  • searchString1

    string. The search string which can be a phrase or word

  • searchString2

    string. The search string which can be a phrase or word

Returns: Returns the object containing the PMI measure. ($search->{'PMI'}).
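The PMI (pointwise mutual information) idea can be sketched in plain Perl. This is only an illustration of the textbook formula computed from hit counts; this document does not state the exact formula WebService::GoogleHack::Rate uses, and the helper name pmi_from_counts, its parameters, and the counts below are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of WebService::GoogleHack.
# Computes log2( p(a,b) / (p(a) * p(b)) ), where each probability is a
# hit count divided by $total_pages (an assumed index size).
sub pmi_from_counts {
    my ($hits_a, $hits_b, $hits_both, $total_pages) = @_;

    # PMI is undefined when any of the counts is zero.
    return unless $hits_a && $hits_b && $hits_both && $total_pages;

    my $ratio = ($hits_both * $total_pages) / ($hits_a * $hits_b);
    return log($ratio) / log(2);    # Perl's log() is the natural logarithm
}

# Made-up counts, for illustration only.
my $score = pmi_from_counts(1_000, 2_000, 16, 1_000_000);
printf "PMI: %.3f\n", $score;
```

A higher value means the two terms appear together more often than their individual frequencies would predict; a value near zero means they look independent.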

__PACKAGE__->predictSemanticOrientation(reviewfile,positive_inference,negative_inference,trace_file)

Purpose: This function tries to predict the semantic orientation of a paragraph of text.

Valid arguments are :

  • reviewfile

    string. The location of the review file

  • positive_inference

    string. Positive inference such as excellent

  • negative_inference

    string. Negative inference such as poor

  • trace_file

    string. The location of the trace file. If a file_name is given, the results are stored in this file

Returns : the PMI measure and the prediction which is 0 or 1.
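As a rough illustration of how a PMI-based orientation score can be turned into a 0 or 1 prediction, here is a minimal sketch in the style of the SO-PMI measure from the "Thumbs up or Down" paper. It is an assumption that the module works this way; the helper names, the smoothing constant, the hit counts, and the mapping of 1 to "positive" are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: semantic orientation of one phrase from hit counts.
#   SO = log2( (near_pos * hits_neg) / (near_neg * hits_pos) )
# near_pos / near_neg: hits for the phrase near the positive / negative word
# hits_pos / hits_neg: hits for the positive / negative word alone
sub semantic_orientation {
    my (%c) = @_;
    return 0 unless $c{hits_pos} && $c{hits_neg};    # nothing to compare against

    my $smooth = 0.01;    # avoids division by zero for rare phrases
    my $num = ($c{near_pos} + $smooth) * $c{hits_neg};
    my $den = ($c{near_neg} + $smooth) * $c{hits_pos};
    return log($num / $den) / log(2);
}

# Average the per-phrase scores.
# Assumed mapping: 1 = positive orientation, 0 = negative orientation.
sub predict_orientation {
    my (@scores) = @_;
    return 0 unless @scores;
    my $sum = 0;
    $sum += $_ for @scores;
    return ($sum / @scores) > 0 ? 1 : 0;
}

# Made-up counts, for illustration only.
my $so = semantic_orientation(
    near_pos => 200, near_neg => 50, hits_pos => 1_000, hits_neg => 1_000,
);
printf "orientation score: %.3f, prediction: %d\n", $so, predict_orientation($so);
```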

__PACKAGE__->phraseSpelling(searchString)

Purpose: This function is used to retrieve a spelling suggestion from Google.

Valid arguments are :

  • searchString

    string. Need to pass the search string, which can be a single word

Returns: Returns the suggested spelling if there is one, otherwise returns "No Spelling Suggested".

__PACKAGE__->Search(searchString,num_results)

Purpose: This function is used to query Google.

Valid arguments are :

  • searchString

    string. Need to pass the search string, which can be a single word or phrase, maximum ten words

  • num_results

    integer. The number of results you want to retrieve. The default is 10 and the maximum is 1000.

Returns: Returns a GoogleHack object containing the search results.

__PACKAGE__->getSearchSnippetWords(searchString,numResults,trace_file)

Purpose: Given a search word, this function tries to retrieve the text surrounding the search word in the retrieved snippets.

Valid arguments are :

  • searchString

    string. The search string which can be a word or phrase

  • numResults

    string. The number of results to be processed from google.

  • trace_file

    string. The location of the trace file. If a file_name is given, the results are stored in this file

  • proximity

    string. The number of words surrounding the searchString (not yet implemented).

returns : Returns an object which contains the parsed information

__PACKAGE__->getCachedSurroundingWords(searchString,trace_file)

Purpose: Given a search word, this function tries to retrieve the text surrounding the search word in the retrieved cached web pages. It does the search and passes the search results to the WebService::GoogleHack::Text::getCachedSurroundingWords function.

Valid arguments are :

  • searchString

    string. The search string which can be a word or phrase

  • trace_file

    string. The location of the trace file. If a file_name is given, the results are stored in this file

returns : Returns a hash with the keys being the words and the values being the frequency of occurrence.
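The shape of the returned hash (word as key, frequency as value) can be illustrated with a small stand-alone sketch. It works on a plain string rather than on cached pages, handles single-word terms only, and uses a hypothetical helper name and a hypothetical window size; it is not the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical helper: count the words that appear within $window tokens
# of each occurrence of $term in $text. Matching is case-insensitive.
sub surrounding_word_counts {
    my ($text, $term, $window) = @_;
    my @tokens = map { lc $_ } $text =~ /(\w+)/g;
    my %freq;

    for my $i (0 .. $#tokens) {
        next unless $tokens[$i] eq lc $term;

        # Clamp the window to the bounds of the token list.
        my $lo = $i - $window < 0        ? 0        : $i - $window;
        my $hi = $i + $window > $#tokens ? $#tokens : $i + $window;

        for my $j ($lo .. $hi) {
            next if $j == $i;    # skip the search term itself
            $freq{ $tokens[$j] }++;
        }
    }
    return %freq;
}

my %freq = surrounding_word_counts(
    "The white house is near the house garden", "house", 1,
);
print "$_ => $freq{$_}\n" for sort keys %freq;
```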

__PACKAGE__->getSearchSnippetSentences(searchString,trace_file)

Purpose: Given a search word, this function tries to retrieve the sentences in the snippet. It does the search and passes the search results to the WebService::GoogleHack::Text::getSnippetSentences function.

Valid arguments are :

  • searchString

    string. The search string which can be a word or phrase

  • trace_file

    string. The location of the trace file. If a file_name is given, the results are stored in this file

returns : Returns an array of strings.
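To give a feel for what splitting a snippet into an array of sentence strings involves, here is a minimal sketch. The helper name is hypothetical and the splitting rule (break after ".", "!" or "?") is a simplifying assumption; the logic in WebService::GoogleHack::Text may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: strip markup from a snippet and split it into sentences.
sub snippet_sentences {
    my ($snippet) = @_;

    # Drop markup such as <b>...</b> around highlighted search terms.
    (my $plain = $snippet) =~ s/<[^>]+>//g;

    # Split after sentence-ending punctuation followed by whitespace.
    my @sentences = grep { /\w/ } split /(?<=[.!?])\s+/, $plain;

    # Trim leading and trailing whitespace in place.
    s/^\s+|\s+$//g for @sentences;
    return @sentences;
}

my @sentences = snippet_sentences(
    "The <b>White House</b> is in Washington. It has many rooms! Really?"
);
print "$_\n" for @sentences;
```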

__PACKAGE__->getCachedSurroundingSentences(searchString,trace_file)

Purpose: Given a search word, this function tries to retrieve the sentences in the cached web page.

Valid arguments are :

  • searchString

    string. The search string which can be a word or phrase

  • trace_file

    string. The location of the trace file. If a file_name is given, the results are stored in this file

returns : Returns a hash which contains the parsed sentences as values and the key being the web URL.

__PACKAGE__->getSearchCommonWords(searchString1,searchString2,trace_file,stemmer)

Purpose: Given two search words, this function tries to retrieve the common text/words surrounding the search strings in the retrieved snippets.

Valid arguments are :

  • searchString1

    string. The search string which can be a word or phrase

  • searchString2

    string. The search string which can be a word or phrase

  • trace_file

    string. The location of the trace file. If a file_name is given, the results are stored in this file

  • stemmer

    bool. Porter Stemmer on or off.

returns : Returns a hash which contains the intersecting words.
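The intersection step can be sketched as follows. The helper name is hypothetical, stemming is left out, and storing the summed frequency as the value is an assumption made for the example; this document only states that the returned hash contains the intersecting words.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the words present in both frequency hashes.
# The value stored for each common word is the sum of its two frequencies
# (an assumption for this sketch).
sub common_words {
    my ($freq_a, $freq_b) = @_;
    my %common;
    for my $word (keys %$freq_a) {
        next unless exists $freq_b->{$word};
        $common{$word} = $freq_a->{$word} + $freq_b->{$word};
    }
    return %common;
}

# Made-up snippet word counts for two search strings.
my %toyota = (car => 4, engine => 2, japan   => 1);
my %ford   = (car => 3, engine => 1, detroit => 2);

my %shared = common_words(\%toyota, \%ford);
print "$_ => $shared{$_}\n" for sort keys %shared;
```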

__PACKAGE__->getWordClustersInSnippets(searchString1,iterations,number,trace_file)

Purpose: Given a search string, this function retrieves the top frequency words, does a search on those words, and builds a list of words that can be regarded as a cluster of related words.

Valid arguments are :

  • searchString1

    string. The search string which can be a word or phrase

  • iterations

    number. The number of iterations that you want the function to search and build cluster on.

  • trace_file

    string. The location of the trace file. If a file_name is given, the results are stored in this file

returns : Returns a set of words as a hash.

__PACKAGE__->getClustersInSnippets(searchString1,searchString2,iterations,number,trace_file)

Purpose: Given two search strings, this function retrieves the snippets for each string, finds the intersection of words, and then repeats the search with the intersection of words.

Valid arguments are :

  • searchString1

    string. The search string which can be a word or phrase

  • searchString2

    string. The search string which can be a word or phrase

  • iterations

    number. The number of iterations that you want the function to search and build cluster on.

  • trace_file

    string. The location of the trace file. If a file_name is given, the results are stored in this file

returns : Returns a hash which contains the intersecting words as keys and the values being the frequency of occurrence.

__PACKAGE__->getText(searchString,iterations,number,path_to_data_directory)

Purpose: Given a search string, this function will retrieve the resulting URLs from Google, follow those links, and retrieve the text from there. The function will then clean up the text and store it in a file along with the URL, date and time of retrieval. The file will be stored under the name of the search string.

Valid arguments are :

  • searchString

    string. The search string which can be a word or phrase.

  • iterations

    number. The number of iterations that you want the function to search and build cluster on.

  • path_to_data_directory

    string. The location where the file containing the retrieved information has to be stored.

returns : Returns nothing.

__PACKAGE__->getWordsInPage(searchTerms,numResults,frequencyCutoff,iteration,numberofSearchTerms,bigrams,trace_file_path)

Purpose: Given a set of search terms, this function will retrieve the resulting URLs from Google, follow those links, and retrieve the text from there. Once all the text is collected, the function finds the intersecting or co-occurring words between the top N results. This function is used by the function wordClusterInPage.

Valid arguments are :

  • searchTerms

    string. An array which contains each search term (each can only be a word, not a phrase).

  • numResults

    number. The number of web pages results to be looked at.

  • frequencyCutoff

    number. Words occurring less than the frequencyCutoff would be excluded from results.

  • iteration

    number. The current iteration number.

  • numberofSearchTerms

    number. The number of search terms in the initial set.

  • bigrams

    number. The bigram cutoff. Set to 0 to exclude bigrams.

  • trace_file_path

    string. The location of the trace file.

returns : Returns nothing.
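One possible reading of "intersecting words between the top N results" combined with frequencyCutoff is sketched below: keep a word only if it occurs on every page and its total count reaches the cutoff. Both the reading and the helper name are assumptions, and the sketch works on in-memory strings instead of fetched web pages.

```perl
use strict;
use warnings;

# Hypothetical helper: words that occur on every page and at least
# $cutoff times in total across all pages.
sub words_in_pages {
    my ($pages, $cutoff) = @_;
    my (%total, %pages_seen);

    for my $text (@$pages) {
        my %in_page;
        for my $word (map { lc $_ } $text =~ /(\w+)/g) {
            $total{$word}++;
            $in_page{$word} = 1;
        }
        $pages_seen{$_}++ for keys %in_page;    # count each page once per word
    }

    my %kept;
    for my $word (keys %total) {
        next if $pages_seen{$word} < scalar @$pages;    # not on every page
        next if $total{$word} < $cutoff;                # too infrequent
        $kept{$word} = $total{$word};
    }
    return %kept;
}

# Made-up page texts, for illustration only.
my @pages = ("toyota car car engine", "ford car truck engine car");
my %kept  = words_in_pages(\@pages, 3);
print "$_ => $kept{$_}\n" for sort keys %kept;
```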

__PACKAGE__->wordClusterInPage(searchTerms,numResults,frequencyCutoff,numIterations,path_to_data_directory, html)

Purpose: Given two or more words, this function tries to find a set of related words. This is the Google-Hack baseline algorithm 1.

  • searchTerms

    string. The array of search terms (each can only be a word).

  • numResults

    number. The number of web page results to be looked at.

  • frequencyCutoff

    number. Words occurring less than the frequencyCutoff would be excluded from results.

  • numIterations

    number. The number of iterations that you want the function to search and build cluster on.

  • path_to_data_directory

    string. The location where the file containing the retrieved information has to be stored.

returns : Returns an html or text version of the results.

__PACKAGE__->Algorithm2(searchTerms,numResults,frequencyCutoff,bigramCutoff,numIterations,scoreType,scoreCutOff,path_to_data_directory, html)

Purpose: Given two or more words, this function tries to find a set of related words. This is the Google-Hack baseline algorithm 1.

  • searchTerms

    string. The array of search terms (each can only be a word).

  • numResults

    number. The number of web page results to be looked at.

  • frequencyCutoff

    number. Words occurring less than the frequencyCutoff would be excluded from results.

  • bigramCutoff

    number. Bigrams occurring less than the bigramCutoff would be excluded from results.

  • numIterations

    number. The number of iterations that you want the function to search and build cluster on.

  • scoreType

    number. Takes on the values 1,2 or 3 indicating the relatedness measure to be used.

  • scoreCutOff

    number. Words and Bigrams with relatedness score greater than the scoreCutOff would be excluded from results.

  • path_to_data_directory

    string. The location where the file containing the retrieved information has to be stored.

returns : Returns an html or text version of the results.
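The bigramCutoff parameter can be illustrated with a short sketch that counts adjacent word pairs and drops those below a cutoff. The helper name and the tokenization rule are assumptions made for the example, not the module's own code.

```perl
use strict;
use warnings;

# Hypothetical helper: count adjacent word pairs (bigrams) in $text and
# keep only those that occur at least $cutoff times.
sub bigram_counts {
    my ($text, $cutoff) = @_;
    my @tokens = map { lc $_ } $text =~ /(\w+)/g;

    my %count;
    $count{"$tokens[$_] $tokens[$_ + 1]"}++ for 0 .. $#tokens - 1;

    # Remove bigrams that fall below the cutoff.
    delete @count{ grep { $count{$_} < $cutoff } keys %count };
    return %count;
}

my %bigrams = bigram_counts("new york is big and new york is old", 2);
print "$_ => $bigrams{$_}\n" for sort keys %bigrams;
```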

__PACKAGE__->predictWordSentiment(infile,positive_inference,negative_inference,$htmlFlag,$traceFile)

Purpose: Given a file containing text, this function tries to find the positive and negative words.

  • infile

    string. The input file

  • positive_inference

    string. A positive word such as "Excellent"

  • negative_inference

    string. A negative word such as "Bad"

  • htmlFlag

    string. Set to "true" if you want the results to be HTML formatted

  • traceFile

    string. Set to a file if you want the results to be written to the given filename.

returns : Returns an html or text version of the results.

__PACKAGE__->predictPhraseSentiment(infile,positive_inference,negative_inference,$htmlFlag,$traceFile)

Purpose: Given a file containing text, this function tries to find the positive and negative phrases. The function selects phrases based on the patterns given in the "Thumbs up or Down" paper.

  • infile

    string. The input file

  • positive_inference

    string. A positive word such as "Excellent"

  • negative_inference

    string. A negative word such as "Bad"

  • htmlFlag

    string. Set to "true" if you want the results to be HTML formatted

  • traceFile

    string. Set to a file if you want the results to be written to the given filename.

returns : Returns an html or text version of the results.

AUTHOR

Pratheepan Raveendranathan, <rave0029@d.umn.edu>

Ted Pedersen, <tpederse@d.umn.edu>

BUGS

SEE ALSO

WebService::GoogleHack home page

Pratheepan Raveendranathan

Ted Pedersen

Google-Hack Mailing List <google-hack-users@lists.sourceforge.net>

COPYRIGHT AND LICENSE

Copyright (c) 2003 by Pratheepan Raveendranathan, Ted Pedersen

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

The Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
