
NAME

X-Search -- Automated Web Searching and Search History Indexing

SYNOPSIS

X-Search [optional configuration file name/path argument]

Search commands are read from a configuration file.

DESCRIPTION

X-Search reads a series of search commands from a plain text configuration file, retrieves the results from the specified search engine, and stores them in individual dated files named qid/YYYYMMDD.html. Each such file is a detailed web page record of the search results for that day's date. Summaries of each search (if you have print summaries turned on), as well as a link history to each qid/YYYYMMDD.html file, are maintained in one index.html file.

Any new search results for a 24 hour period are written to both the qid/YYYYMMDD.html and index.html files. If qid/YYYYMMDD.html already exists with previous search results for the date, newer results are appended to it in chronological order. If there is nothing new, nothing is written.
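The dated file naming and the append behavior can be sketched as follows. This is a minimal illustration, not the code X-Search itself uses: the sub names dated_page_path and append_results are hypothetical, and each result is assumed to be a single line of text.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Build the dated page path for a qid, for example "tanks/20000115.html".
# $time is epoch seconds and defaults to the current time.
# (Hypothetical helper, not a sub from X-Search.)
sub dated_page_path {
    my ($qid, $time) = @_;
    $time = time unless defined $time;
    my $stamp = strftime('%Y%m%d', localtime($time));
    return "$qid/$stamp.html";
}

# Append new results to the dated page. If there is nothing new,
# the file is not touched at all. Returns the number of lines written.
sub append_results {
    my ($path, @new_results) = @_;
    return 0 unless @new_results;
    open my $fh, '>>', $path or die "Cannot open $path: $!";
    print {$fh} "$_\n" for @new_results;
    close $fh or die "Cannot close $path: $!";
    return scalar @new_results;
}
```

Opening the file in append mode ('>>') keeps earlier results for the day in place, so later runs add to the end of the page in chronological order.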

X-Search stores the URLs from search results in a data file so it can track what it has already seen. This ensures that the results of subsequent searches are unique. It also lets you paste blocks of undesirable URLs into this file to prevent X-Search from recording them if they are ever encountered in a future search.
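The idea of tracking seen URLs in a data file can be sketched like this. The layout of the real data file is not described here, so the sketch assumes one URL per line, and the sub names load_seen and filter_new are hypothetical.

```perl
use strict;
use warnings;

# Load previously recorded URLs into a lookup hash.
# Assumption: the data file holds one URL per line.
sub load_seen {
    my ($file) = @_;
    my %seen;
    return \%seen unless -e $file;
    open my $fh, '<', $file or die "Cannot read $file: $!";
    while (my $line = <$fh>) {
        chomp $line;
        $seen{$line} = 1 if length $line;
    }
    close $fh;
    return \%seen;
}

# Return only the URLs that have not been seen before, and record
# them in the data file so later runs skip them.
sub filter_new {
    my ($file, @urls) = @_;
    my $seen = load_seen($file);
    my @new = grep { !$seen->{$_}++ } @urls;   # also drops repeats within @urls
    if (@new) {
        open my $fh, '>>', $file or die "Cannot append to $file: $!";
        print {$fh} "$_\n" for @new;
        close $fh or die "Cannot close $file: $!";
    }
    return @new;
}
```

Under this assumption, blocking a URL in advance is just a matter of adding it as a line to the data file before the next run.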

X-Search is ideal for maintaining records of frequent news events and can safely be run as many times a day as desired to find and index new items that match the user's search requirements. For instance, you could track any number of newsgroups three times daily for new posts by passing the search option "groups=". In the option field of the configuration file you could insert |groups=alt.some.group| or, to search all groups related to perl, |groups=*perl*|

X-Search offers the option of verifying each URL to determine whether it is valid. Any URLs found not to be valid (for example, moved or not found) are ignored.
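One way to express such a check is shown below, using the core module HTTP::Tiny. This is only a sketch under stated assumptions: the source does not say which HTTP library or request method X-Search uses, and the sub name url_is_valid is hypothetical. Redirects are deliberately not followed, so that a moved URL counts as invalid.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Return 1 only if the URL answers a HEAD request directly with a 2xx
# status. With max_redirect set to 0, a moved (3xx) URL is reported as
# not valid, as is a 404 or a connection failure.
# $http is optional and mainly useful for substituting a test double.
sub url_is_valid {
    my ($url, $http) = @_;
    $http ||= HTTP::Tiny->new(max_redirect => 0, timeout => 10);
    my $response = $http->head($url);
    return $response->{success} ? 1 : 0;
}
```

A caller would then keep only the results that pass, for example with grep { url_is_valid($_) } @urls. Some servers do not answer HEAD requests properly, so a real checker might fall back to GET.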

X-Search is flexible enough to be used in all sorts of neat applications.

CONFIGURATION FILE

X-Search is controlled by a configuration file. This file can have any name you want. There are two ways to tell X-Search which configuration file to use and where to find it.

Method 1: Simply define $qconfig in the script to point to the configuration file. By default it looks for a file called "query.ini" located in the same directory as X-Search.

Method 2: Command line argument defining path and name of the configuration file. Example:

X-Search /home/xsearch/search.conf

X-Search would read /home/xsearch/search.conf for its search commands. This makes it easy to use multiple configuration files for different search setups.
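The two methods amount to "use the command line argument if one is given, otherwise fall back to the default". A minimal sketch of that logic follows; the sub name config_path is hypothetical, and only the default file name "query.ini" comes from the text above.

```perl
use strict;
use warnings;

# Choose the configuration file: the first command line argument if
# one was given, otherwise the default "query.ini".
sub config_path {
    my (@args) = @_;
    return (@args && length($args[0])) ? $args[0] : 'query.ini';
}

my $qconfig = config_path(@ARGV);
print "Using configuration file: $qconfig\n";
```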

This file is read to get the following user defined search commands:

1) The search engine to use
2) A descriptive name for the search, printed as a heading
   on the web pages (like AutoSearch $query_name)
3) The query words for the search, separated by spaces
4) Any search options to pass to the search engine. This is optional
   and can be left blank.
5) Max results to return
6) The qid (query information directory), the directory name
   to create to store the dated web pages created from the search.

A typical configuration file would have one or more lines that follow this structure:

SEARCH ENGINE|SEARCH NAME|SEARCH WORDS|OPTIONS|MAX_TO_RETURN|QID|

The individual values are separated with a | and a | must be found at the end of each line. There is no limit to how many searches you can define in the configuration file, but you may want to keep the number reasonable. To aid in managing multiple searches, there is the option of turning on or off the summaries displayed in index.html.

Here is a sample of what a typical configuration file should look like:

------cut------------------

HotBot|Military|tank armor|RD=DM&Domain=.mil|40|tanks|
Google|Tech News|parallel processing||200|parallel|
Excite::News|News From Home|Palm Springs California||100|myhome|
AltaVista|AZ Fishing|arizona lakes||60|lakes|

--------end-----------------

The Google command line would search the engine Google, print a nice list heading titled "Tech News", search and display results pertaining to "parallel processing", with no options, return a max of 200 results and store the dated search history pages in a directory called "parallel".
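To make the line format concrete, here is a small sketch that splits one configuration line into its six fields. It is an illustration only: the sub name parse_search_line and the hash keys are hypothetical, and the script's own parsing may differ.

```perl
use strict;
use warnings;

# Field names in the order they appear on a configuration line.
# (Hypothetical key names chosen for this sketch.)
my @FIELDS = qw(engine name words options max qid);

# Parse "ENGINE|NAME|WORDS|OPTIONS|MAX|QID|" into a hash reference.
# Returns nothing for blank lines, lines without the closing "|",
# or lines that do not have exactly six fields.
sub parse_search_line {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;                 # trim surrounding whitespace
    return unless length($line) && $line =~ /\|$/;

    my @values = split /\|/, $line, -1;      # -1 keeps empty fields such as OPTIONS
    pop @values;                             # drop the empty piece after the final "|"
    return unless @values == @FIELDS;

    my %search;
    @search{@FIELDS} = @values;
    return \%search;
}

my $search = parse_search_line('Google|Tech News|parallel processing||200|parallel|');
print "$search->{engine}: '$search->{words}' (max $search->{max}) -> $search->{qid}/\n";
```

The negative limit passed to split matters here: without it, Perl discards trailing empty fields, and a line with an empty OPTIONS field is easier to validate when every field position is preserved.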

Obviously, you want to define a different qid name for each of your searches so that hot dog searches don't end up mixed with apple searches.

Note About Options

Multiple option pairs must be separated with '&'. See the HotBot search example above.
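As an illustration of the option syntax, the sketch below breaks an option string into key/value pairs. The sub name parse_options is hypothetical, and how X-Search actually hands options to the search engine is not shown here.

```perl
use strict;
use warnings;

# Split an option string such as "RD=DM&Domain=.mil" into a list of
# [key, value] pairs, keeping the order in which they were written.
# An empty or undefined option string yields an empty list.
sub parse_options {
    my ($options) = @_;
    my @pairs;
    for my $pair (split /&/, defined $options ? $options : '') {
        next unless length $pair;
        my ($key, $value) = split /=/, $pair, 2;   # split on the first "=" only
        push @pairs, [ $key, defined $value ? $value : '' ];
    }
    return @pairs;
}

for my $pair (parse_options('RD=DM&Domain=.mil')) {
    print "$pair->[0] => $pair->[1]\n";
}
```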

AUTO SEARCHING

X-Search can be run from a cron job to automate searching even more.

Example to run X-Search each Monday at 3:00 AM:

0 3 * * 1 /home/xsearch/X-Search

or if you want to specify a configuration file:

0 3 * * 1 /home/xsearch/X-Search /home/xsearch/config.conf

CHANGING THE APPEARANCE OF THE WEB PAGES

X-Search web pages are easily customized by changing the html in the subs "print_ihead" and "print_dhead". The sub print_ihead produces the html for the index.html file. You can add whatever body tags you desire, such as background colors, images, fonts, etc. The sub print_dhead controls the html that goes into the qid/YYYYMMDD.html files.

There is also a "print_footer" sub that prints a footer for all the pages, and I ask that my name and e-mail address remain intact if you decide to customize the footer as well. (Publicity is my only payment from this :-)

USER SETTINGS

There are a number of user settings that control the behavior of X-Search. They are hard coded into the script.

$verbose

This just prints messages to the screen while the script is running. This is nice for manual operation but not needed when run by cron.

$ck_url

$ck_url = "1";

$ck_url controls whether URLs are verified as good or bad. 0=No 1=Yes. Enabling $ck_url can slow the search down, depending on how many bad URLs are encountered.

$iDIR

$iDIR = "$dir/index";

Define a sub-directory name that will store the main index.html file. The name "index" should be just fine. qid directories will be created below this directory.

$print_summaries

$print_summaries = "0";

1=Yes 0=No

If you have many search events defined and running, you may want to turn off printing summary results to keep the index.html file size within reason. Only links to the detailed qid/YYYYMMDD.html pages will then be printed. Turn it on if you want summaries to be displayed in index.html.

$oURLS

$oURLS = "urls.dat";

Define the name of our URL record file. Without this we are lost.

$qconfig

$qconfig = "query.ini";

Define the path and name of the query configuration file. This file stores the search commands, such as engines to use, search string, qid directory, max to return and so forth. See the bottom of this script for more details. You may also pass this value as an argument so you can run multiple configuration files.

$host $port

$port = ""; $host = "";

Define a host/port if required (most don't need to)

$sTEMP

$sTEMP = "TEMP";

This just defines a name for the temporary working file X-Search uses to build an index.html file. No need to mess with it.

CHANGES

Version 1.03

- Created a hack to track Dejanews articles properly
- Added escaping and unescaping of ?'s in urls because
  they would wreak havoc with my regex, leading to urls
  being printed over and over

AUTHOR

X-Search was written entirely by Jim Smyser <jsmyser@bigfoot.com>.

BUGS

Shouldn't be any... but you never know! Report them to me.

COPYRIGHT

Copyright (c) 2000 by Jim Smyser All rights reserved.

You may use this program source provided that the above copyright notice and this paragraph are duplicated in all such forms and that any documentation, advertising materials, and other materials related to your distribution of this source code and use acknowledge Jim Smyser as the author/developer.

THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
