NAME

URI::ParseSearchString - parse Apache referrer logs and extract search engine query strings.

VERSION

Version 2.5 (nearly at YAPC2008 version)

SYNOPSIS

use URI::ParseSearchString;

my $uparse = URI::ParseSearchString->new();

my $query_terms =
    $uparse->se_term('http://www.google.com/search?hl=en&q=a+simple+test&btnG=Google+Search');
my $engine_name =
    $uparse->se_name('http://www.google.com/search?hl=en&q=a+simple+test&btnG=Google+Search');
my $engine_hostname =
    $uparse->se_host('http://www.google.com/search?hl=en&q=a+simple+test&btnG=Google+Search');

FUNCTIONS

new

Creates a new instance object of the module.

my $uparse = URI::ParseSearchString->new();

parse_search_string

This module provides a simple function to parse and extract search engine query strings. It was designed and tested with Apache referrer logs in mind. It can be used for a wide range of purposes, including tracking down which keywords people use on popular search engines before they land on a site. It uses URI::Split to extract the query string and URI::Escape to un-escape the encoded characters in it. Although a number of modules and scripts exist for this purpose, the majority of them are outdated and rely on obsolete search strings for each engine.

The default exported function is "parse_search_string", which accepts an unquoted referrer string as input and returns the search engine query contained within it. It works with both escaped and un-escaped queries, decoding escaped search terms before returning them. In all other cases, including errors, the function returns undef.

For example:

$string = 
   $uparse->parse_search_string('http://www.google.com/search?hl=en&q=a+simple+test&btnG=Google+Search');

would return 'a simple test'

whereas

$string = 
   $uparse->parse_search_string('http://www.mamma.com/Mamma?utfout=1&qtype=0&query=a+more%21+complex_+search%24&Submit=%C2%A0%C2%A0Search%C2%A0%C2%A0');

would return 'a more! complex_ search$'
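The extraction described above can be approximated with the two modules the text names. The following is a minimal sketch, not the module's actual code: it assumes the URL is split with uri_split from URI::Split and decoded with uri_unescape from URI::Escape. The helper name extract_query_param is hypothetical, and the parameter names 'q' and 'query' are simply read off the two example URLs.

```perl
use strict;
use warnings;
use URI::Split  qw(uri_split);
use URI::Escape qw(uri_unescape);

# Hypothetical helper (not part of URI::ParseSearchString): return the
# decoded value of one query parameter, or undef if the URL has no query
# string or the parameter is absent.
sub extract_query_param {
    my ($url, $param) = @_;
    return unless defined $url && defined $param;

    # uri_split returns (scheme, authority, path, query, fragment).
    my (undef, undef, undef, $query) = uri_split($url);
    return unless defined $query;

    for my $pair (split /[&;]/, $query) {
        my ($key, $value) = split /=/, $pair, 2;
        next unless defined $key && defined $value && $key eq $param;

        $value =~ tr/+/ /;              # '+' encodes a space in query strings
        return uri_unescape($value);    # decode %XX escapes
    }
    return;
}

print extract_query_param(
    'http://www.google.com/search?hl=en&q=a+simple+test&btnG=Google+Search', 'q'
), "\n";    # prints "a simple test"
```

A real parser also needs to know which parameter each engine uses, which is why the list of supported engines matters.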

Currently supported search engines include:

  • Abacho

  • AOL (UK)

  • AOLSEARCH

  • AllTheWeb

  • ASK.com

  • Blueyonder (UK)

  • BBC search

  • Categorico (IT)

  • Conduit

  • Cuil

  • Feedster Blog Search

  • Fireball (DE)

  • Froogle

  • Froogle (UK)

  • Google & 231 other TLDs

  • Google Blog Search

  • Godado

  • Godado (IT)

  • HotBot

  • Ice Rocket Blog Search

  • ICQ.com

  • ilMotore.com

  • Ithaki.net

  • Kataweb (IT)

  • Lycos

  • Lycos (ES)

  • Lycos (IT)

  • Libero (IT)

  • Mamma

  • Megasearching.net

  • Mirago (UK)

  • MyWebSearch.com

  • MSN

  • Microsoft live.com

  • MyWay

  • Netscape

  • NTLworld

  • Orange

  • Ozu ES

  • Starware

  • Sweetim

  • Simpatico (IT)

  • Soso

  • Technorati Blog Search

  • Tesco Google search

  • Terra (ES)

  • Tiscali (UK)

  • TheSpider (IT)

  • VirginMedia

  • Web.de (DE)

  • Yahoo

  • Yahoo Japan

se_term

Same as parse_search_string().
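The referrer handed to se_term usually has to be pulled out of an access-log line first. The following is a minimal sketch of that step, assuming Apache's "combined" log format. The helper name referrer_from_log_line and the sample log line are hypothetical, and the pattern does not handle escaped quotes inside the request line.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the referrer field out of one access-log line.
# Assumes the "combined" log format; returns undef when the line does not
# match or the referrer is logged as "-".
sub referrer_from_log_line {
    my ($line) = @_;
    return unless defined $line;

    my ($referrer) = $line =~ m{
        ^\S+ \s+ \S+ \s+ \S+ \s+     # host, ident, user
        \[ [^\]]+ \] \s+             # [timestamp]
        " [^"]* " \s+                # "request line"
        \d{3} \s+ \S+ \s+            # status, bytes
        " ([^"]*) "                  # "referrer" (captured)
    }x;

    return unless defined $referrer && $referrer ne '-' && $referrer ne '';
    return $referrer;
}

# Made-up sample line for illustration only.
my $line = '192.0.2.1 - - [10/Oct/2008:13:55:36 +0000] "GET / HTTP/1.1" 200 2326 '
         . '"http://www.google.com/search?hl=en&q=a+simple+test&btnG=Google+Search" "Mozilla/5.0"';

print referrer_from_log_line($line), "\n";
```

The returned string can then be passed to $uparse->se_term(), skipping lines for which the helper returns undef.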

findEngine

Returns the search engine hostname and name extracted from the supplied referrer URL.
 
 my $engine = 
    $uparse->findEngine('http://www.google.com/search?hl=en&q=a+simple+test&btnG=Google+Search') ;

This will return "google.com" as the search engine hostname and 'Google' as the name. Currently supports 231 Google TLDs and all the above-mentioned search engines.
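To illustrate the idea of mapping a referrer's host to an engine name, here is a minimal sketch. It is not the module's implementation: the helper lookup_engine, the two-entry table, the www-stripping rule, and the (hostname, name) list return are all assumptions made for illustration.

```perl
use strict;
use warnings;
use URI::Split qw(uri_split);

# Illustrative two-entry table; this is an assumption, not the module's data.
my %ENGINE_NAME = (
    'google.com' => 'Google',
    'mamma.com'  => 'Mamma',
);

# Hypothetical helper: return (hostname, name) for a known engine,
# or an empty list when the host is not in the table.
sub lookup_engine {
    my ($url) = @_;
    return unless defined $url;

    # uri_split returns (scheme, authority, path, query, fragment).
    my (undef, $authority) = uri_split($url);
    return unless defined $authority;

    my $host = lc $authority;
    $host =~ s/^.*\@//;     # drop any userinfo
    $host =~ s/:\d+$//;     # drop any port
    $host =~ s/^www\.//;    # treat www.example.com as example.com

    return unless exists $ENGINE_NAME{$host};
    return ($host, $ENGINE_NAME{$host});
}

my ($host, $name) = lookup_engine(
    'http://www.google.com/search?hl=en&q=a+simple+test&btnG=Google+Search'
);
print "$host ($name)\n";    # prints "google.com (Google)"
```

A flat hash keeps the lookup simple, but covering hundreds of country-specific hostnames would need either a much larger table or pattern matching on the host.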

se_host

Wrapper around findEngine. Returns the search engine hostname.

se_name

Wrapper around findEngine. Returns the search engine canonical name.

AUTHOR

Spiros Denaxas, <s.denaxas at gmail.com>

BUGS

This is my first CPAN module, so I encourage you to send all comments, especially critical ones, to my email address.

This would not have been possible without the support of my co-workers at http://nestoria.co.uk - the easiest way of finding UK property.

SUPPORT

For more information, you could also visit my blog:

http://idaru.blogspot.com

COPYRIGHT & LICENSE

Copyright 2008 Spiros Denaxas, all rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.