NAME

cgi-bin/search.pl - user/admin CGI front end to ROADS search

SYNOPSIS

cgi-bin/search.pl [-C charset] [-L language] [-d]
  [-f form] [-l logfile] [-o protocols] [-u url]
  [-v view] [-w waylay_url]

aka admin.pl

DESCRIPTION

The search.pl program is a Common Gateway Interface (CGI) program used to provide an end user search front end to ROADS databases. When accessed with no CGI query, the program can return an HTML form to the user to fill in to make a query. This form can be customized by the ROADS administrator and can include a number of options.

When the ROADS software is installed, a symbolic link to the program is made from the ROADS admin-cgi directory under the name admin.pl. You may find that following symbolic links is disabled by default on your server for security reasons, though this can usually be overridden on a per directory basis. We used to actually copy cgi-bin/search.pl over to admin-cgi/admin.pl, but this made maintenance unnecessarily complex.

It is desirable to differentiate between the search program running as an admin user (who will be able to edit, create and delete records) and the search program running as an end user (who will only be able to search for and view records). This differentiation is done in practice by checking the name by which the program was invoked.
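As a rough illustration of this name-based check, the following Python sketch (not the Perl code in search.pl itself) derives the mode from the final component of the path under which the program was invoked. The function name run_mode and the example paths are hypothetical.

```python
import posixpath


def run_mode(invoked_as):
    """Return 'admin' when invoked under the name admin.pl, else 'user'."""
    # Only the final path component matters: a symbolic link such as
    # admin-cgi/admin.pl resolves to the same file as cgi-bin/search.pl,
    # so the name is the only thing that tells the two roles apart.
    name = posixpath.basename(invoked_as)
    return "admin" if name == "admin.pl" else "user"


# Hypothetical installation paths, for illustration only.
print(run_mode("/usr/local/roads/admin-cgi/admin.pl"))
print(run_mode("/usr/local/roads/cgi-bin/search.pl"))
```

Because the decision rests on the invocation name alone, access to the admin-cgi directory still needs to be controlled by the HTTP server.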

USAGE

The ROADS software comes with its own search subsystem, which is capable of dealing with small to medium size databases of tens of thousands of records. This consists of a Common Gateway Interface (CGI) based WWW front end, and as the back end, a WHOIS++ server which uses a simple inverted index. Whilst using our WHOIS++ implementation has benefits for distributed searching, it's not essential that you use this - e.g. we also provide tools to convert your ROADS data into a variety of other formats, such as the Summary Object Interchange Format (SOIF) used by Harvest and Glimpse, the Generic Record Syntax (GRS-1) format used by some Z39.50 servers, and the input format used by Bunyip's Digger WHOIS++ server.

The basic model for searching using the ROADS software is as follows:

1. Query submitted by end user via HTML form.
2. WWW search front end parses query and passes on to any number of back end WHOIS++ servers.
3. WHOIS++ servers return search results.
4. WWW search front end parses results and constructs an HTML document from them.
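The four steps above can be sketched in a few lines of Python. This is only a model of the data flow: the back ends are stood in for by plain callables rather than real WHOIS++ connections, and the names search and backends are hypothetical.

```python
import html


def search(query, backends):
    """Fan a query out to every back end and build one HTML fragment."""
    hits = []
    for name, lookup in backends.items():
        # Steps 2 and 3: pass the query on and collect what comes back.
        for title in lookup(query):
            hits.append((name, title))
    # Step 4: construct an HTML document from the combined results.
    items = "".join(
        f"<li>{html.escape(title)} ({html.escape(name)})</li>"
        for name, title in hits
    )
    return f"<ul>{items}</ul>"


# A stand-in back end that knows about a single record (illustrative data).
fake_backends = {
    "local": lambda q: ["Podule <guide>"] if q == "podule" else [],
}
print(search("podule", fake_backends))
```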

Queries may consist of:

  • single terms, e.g. podule, which will be matched against the value (the right hand side) of any attribute in the records they occur in.

  • attribute/value pairs, e.g. title=podule, which will be matched in the specific attribute's value.

  • phrases, such as podule module, which match the words supplied if they occur in the value component of a record with no other intervening text.

These may be combined in Boolean expressions, e.g. template=DOCUMENT and title=podule, each component of which will be evaluated separately and the results combined. Brackets may be used to group Booleans together. Boolean support extends to the AND, OR and NOT operations, though NOT is not recommended and may not work entirely as expected in complex expressions.

In addition, it is possible to constrain a search to a particular WHOIS++ server, search case sensitively or insensitively, display only the titles of the results, rank the results according to relevance, use stemming to match other similar words in addition to the ones supplied in the query, and perform query expansion using a thesaurus. All but the last of these options are configurable at search-time, whereas the thesaurus support is either enabled or disabled for searches as a whole.
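To give a feel for ranking by relevance, here is a minimal Python sketch that orders records by how often the query words occur in them. It is an assumption-laden stand-in, not the algorithm implemented by the ROADS ranking code: records are treated as plain strings, matching is case insensitive, and the function name rank is hypothetical.

```python
def rank(records, query):
    """Order records by how often the query words occur in them."""
    terms = query.lower().split()

    def score(record):
        words = record.lower().split()
        # Count every occurrence of every query word in this record.
        return sum(words.count(term) for term in terms)

    # sorted() is stable, so records with equal scores keep their
    # original relative order.
    return sorted(records, key=score, reverse=True)


records = ["a podule", "podule podule module", "nothing here"]
print(rank(records, "podule"))
```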

SEARCHING ACROSS MULTIPLE WHOIS++ SERVERS

The default ROADS configuration assumes that you are going to run a single WHOIS++ server to make your ROADS database available for searching. A side effect of the use of WHOIS++ in searching is that it is also possible to search other WHOIS++ servers. You can tell your ROADS installation about other WHOIS++ servers by editing the file config/databases, to add their names and addresses.

With the default configuration we ship, the names of the other WHOIS++ servers your ROADS installation knows about will appear on the HTML returned by the ROADS search tool search.pl. You may wish to alter the HTML outline for this page to list only those WHOIS++ servers you want to make visible to end users. The end user can choose to have their search directed to some or all of these, and search.pl will combine the results and present them in the form of HTML. See also "wig.pl" in bin for a more advanced way of searching across multiple servers using centroids.

Note that whilst you may be able to see multiple WHOIS++ servers using the admin search tool admin.pl, you can only edit ROADS database entries which are held locally.

ALTERING SEARCH BEHAVIOUR

Within the ROADS search subsystem there are a number of possibilities for local customisation (without modifying any code), substitution of locally written code for individual modules of the search subsystem, and enabling or disabling search features:

  • The CGI based search front end comes with a number of configurable options which are exposed to the end user by default. You may wish to hide some or all of these, e.g. to create ``simple'' and ``advanced'' search forms. This can be done by editing the outline HTML in config/multilingual/*/search/search.html and also config/multilingual/*/admin/search.html. We suggest that you make form elements which you would like to hide from your end users into hidden fields.

  • You can alter the way the search results are rendered into HTML by editing the outline HTML files in config/multilingual/*/search-views and config/multilingual/*/admin-views.

  • When search results are rendered, URLs using certain schemes can be redirected to help pages. You can control which schemes are treated this way by putting their names in the file config/protocols. By default this is done for mailto, wais and any other URL scheme apart from ftp, gopher and http. The outline HTML files can be found in config/multilingual/*/waylay.

  • The code used in ranking search results and rendering them into HTML lives in separate modules ROADS::Rank and ROADS::Render respectively, and can easily be updated or replaced as necessary.

  • The search and retrieval capability provided by the ROADS WHOIS++ server may be augmented by an external thesaurus module - this can be any arbitrary piece of code which will take a word or words, and perform query expansion on them. We also provide an internal query expansion capability which is intended for use with a small number of commonly occurring words, e.g. to expand ``colour'' to match ``color''. This is configured by editing the file config/expansions.

  • The search component of the WHOIS++ server may effectively be replaced by any piece of code which implements the WHOIS++ Gateway Interface, described in a separate document. This provides a way to use alternative back end databases with the ROADS software without having to do any network programming. WGI is also used to implement the external thesaurus feature. The locations of these WGI programs may be specified in ROADS.pm as the variables WGIPath and WGIThesaurus respectively.
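The internal query expansion mentioned above can be pictured with the following Python sketch. The format of config/expansions is not described on this page, so an in-memory table stands in for it; the names EXPANSIONS and expand_term are hypothetical, and the bracketed "or" form simply reuses the Boolean query syntax described under USAGE.

```python
# Hypothetical expansion table; the real one is read from config/expansions,
# whose file format is not shown here.
EXPANSIONS = {"colour": ["color"]}


def expand_term(term, table=EXPANSIONS):
    """Rewrite a single term as a bracketed OR of itself and its variants."""
    alternatives = [term] + table.get(term.lower(), [])
    if len(alternatives) == 1:
        # Nothing known about this word, so leave it untouched.
        return term
    return "(" + " or ".join(alternatives) + ")"


print(expand_term("colour"))
print(expand_term("podule"))
```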

SEARCH RESTRICTIONS

It is assumed that you will not want to make all of the information in your templates visible to the world at large. The attributes which can be searched on and the information which appears when a template is rendered into HTML are limited to those attributes and templates which are listed in the file config/search-restrict under the top level ROADS installation directory. The admin.pl program has its own list of restrictions in the file config/admin-restrict.

The defaults shipped have entries for a small subset of the attributes which may be found in the DOCUMENT, SERVICE and USER templates. If you want your users to be able to search on or see the contents of any other templates, you will need to add them to one or both of these lists. More information is provided on the admin.pl manual page and the search.pl manual page.
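The effect of these restrictions can be sketched in Python as a simple filter. The layout of config/search-restrict is not shown on this page, so a dictionary of template type to permitted attribute names stands in for it; RESTRICT, visible and the sample record are all hypothetical.

```python
# Hypothetical stand-in for the contents of config/search-restrict.
RESTRICT = {"DOCUMENT": {"title", "description"}}


def visible(template_type, record, restrict=RESTRICT):
    """Keep only the attributes permitted for this template type."""
    allowed = restrict.get(template_type, set())
    # Template types with no entry at all expose nothing.
    return {
        name: value
        for name, value in record.items()
        if name.lower() in allowed
    }


record = {"Title": "Podule", "Admin-Email": "someone@example.org"}
print(visible("DOCUMENT", record))
print(visible("USER", record))
```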

OPTIONS

-C charset

Character set to use.

-L language

Language to use.

-d

Whether to run in debug mode or not - default is not.

-f form

The default HTML form to return to the end user.

-l logfile

Log file in which to record search requests and results.

-o protocols

Protocols to override using the waylay.pl program.

-u url

The URL of this program.

-v view

The search results view to use.

-w waylay_url

The URL of the waylay.pl program. See its documentation for more information.

CGI VARIABLES

There are a number of inputs that the form must have for the program to execute correctly; these are listed below. Note that the end user need not necessarily be presented with these on their browser if an input type of "hidden" is used.

It is important to note that there are two ways of composing queries - one way is to use a simple text entry box query, and the other is to use up to three attribute/value pairs, e.g. attrib1 and term1 would comprise one attribute/value pair. In the HTML form which the user fills in to generate a query, the attributes, the values, or even both, may be generated using a combination of HTML elements such as drop down lists and text entry boxes. This can be used to provide (for example) a way of selecting the attribute to search on using an HTML SELECT menu, or to constrain the value being searched for similarly.
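The second way of composing a query can be sketched in Python as follows. This is an illustration of the idea, not the code in search.pl: the function name compose_query is hypothetical, and falling back to "and" when the boolean variable is missing or unrecognised is an assumption.

```python
def compose_query(form):
    """Join up to three attribute/value pairs with one Boolean operator."""
    pairs = []
    for i in (1, 2, 3):
        attrib = form.get(f"attrib{i}", "").strip()
        term = form.get(f"term{i}", "").strip()
        if term:
            # A term with no attribute is passed through as a bare term.
            pairs.append(f"{attrib}={term}" if attrib else term)
    operator = form.get("boolean", "and").lower()
    if operator not in ("and", "or"):
        operator = "and"  # assumed fallback, see the text above
    return f" {operator} ".join(pairs)


form = {
    "attrib1": "template", "term1": "DOCUMENT",
    "attrib2": "title", "term2": "podule",
    "boolean": "and",
}
print(compose_query(form))
```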

attrib[123]

When constructing the query out of attribute/value pairs, these variables are the attributes corresponding to the terms term[123].

boolean

When constructing the query out of a combination of attrib[123] and term[123], this CGI variable specifies the Boolean operator which should be used. The only sensible choices for this are "and" and "or".

caseful

This is a Boolean variable that specifies whether a search should be case sensitive or not. The value "on" specifies that the search should take notice of the case of the terms; any other value (or none at all) implies that the search will be case insensitive.

charset

The character set to use.

database

This is a CGI variable that allows the database(s) that are to be searched for the query in this form to be specified. A fake database name of "ALL" tells the search.pl program to search through all the databases it knows about.

debug

This is a Boolean variable which specifies whether the search.pl program should operate in debug mode - in debug mode it generates copious extra HTML documenting its progress.

form

The HTML form to return to the end user if no query is supplied. The default form is search.html. This will be the name of a file in the config/multilingual/*/search/ directory, or the config/multilingual/*/admin/ directory.

headlines

This is a Boolean variable that specifies whether a search should return headlines instead of full template descriptions. It is included for compatibility with previous versions of ROADS, and actually has the effect of setting the results "view" to "headlines".

highlight

This is a Boolean variable which specifies whether matches for the original query should be highlighted (rendered in bold) in the search results.

language

The language to use.

query

This is the query as entered by the user. This will typically be a text input element in the form. See also the CGI variables attrib[123] and term[123].

ranking

This is a Boolean variable which specifies whether the results should be ranked into order, based on the frequency with which the words in the query occur in the records which were returned as a result of the search.

referrals

A Boolean variable specifying whether or not the search.pl program should follow referrals generated in the process of carrying out a WHOIS++ search.

stemming

This is a Boolean variable which indicates to search.pl whether the query terms should be stemmed when searching the database. The ROADS software currently implements the Porter stemming algorithm, with hooks for user supplied stemming or thesaurus lookup. If the value "on" is returned, the software will use stemming, otherwise the search terms will be used as is.

templatetype

This CGI variable permits the end user or ROADS administrator to limit the returned resources down to those that are in an IAFA template of the specified type. A special template type of "ALL" is understood by search.pl to mean all template types. All the template types should be in upper case.

term[123]

When constructing the query out of attribute/value pairs, these fields are the values corresponding to the attributes attrib[123].

view

The name of a "view" to use when rendering the search results into HTML. The default view is "default". This will be the name of a subdirectory of config/multilingual/*/search-views/ or of config/multilingual/*/admin-views/.
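As an illustration of how these variables fit together, the Python sketch below assembles a GET request URL from them. The host name in BASE and the function name search_url are hypothetical, and sending "ALL" for database and templatetype by default is a choice made for the example rather than documented behaviour of search.pl.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical location of the program on your HTTP server.
BASE = "http://www.example.org/cgi-bin/search.pl"


def search_url(query, **options):
    """Build a search.pl URL from the CGI variables described above."""
    params = {"query": query, "database": "ALL", "templatetype": "ALL"}
    # Boolean variables such as caseful or stemming are switched on by
    # passing the value "on".
    params.update(options)
    return BASE + "?" + urlencode(params)


url = search_url("title=podule", caseful="on", view="headlines")
print(url)
# Decode the query string again to check what the program would receive.
print(parse_qs(urlsplit(url).query))
```

Note that urlencode takes care of escaping characters such as "=" inside the query value.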

FILES

config/databases - known WHOIS++ servers.

config/protocols - protocols to override using waylay.pl.

config/multilingual/*/search/nohits.html - HTML sent to end user when a search produces no hits.

config/multilingual/*/search/noconnect.html - HTML sent to end user when a connection to a WHOIS++ server cannot be made.

config/multilingual/*/search/nosearchterm.html - HTML sent to end user when no search term is supplied.

config/multilingual/*/search/search.html - default HTML form sent to end user when no query is specified.

config/multilingual/*/search/syntax.html - default HTML form sent to end user when no query is specified.

config/multilingual/*/search-views/* - outline HTML used to render search results, one subdirectory per view.

logs/search-hits - searches carried out and result details.

All of the search and search-views files and directories have admin and admin-views equivalents when the program is run as admin.pl.

FILE FORMATS

The format of the search-hits and admin-hits logfiles is as per the WWW Common Log File format, with the following fields:

Client domain name

If domain name lookups are enabled on the HTTP server; otherwise the client's IP address.

Remote user name (IDENT)

As returned by an AUTH/IDENT lookup, if enabled on the HTTP server.

Remote user name (HTTP)

As provided by HTTP authentication, if authentication is required by the HTTP server configuration.

Date of the request

The query string itself

The number of local hits

i.e. hits resulting from local records on the WHOIS++ servers being queried.

The number of referral hits

i.e. hits resulting from referrals sent back by the WHOIS++ servers being queried.

This file can be used to assess which terms are being searched for most frequently, how many searches are not matching anything in the available database and other statistics which may provide useful feedback to the ROADS administrator.
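A sketch of this kind of analysis in Python follows. The exact layout of a log line is not shown on this page, so the pattern below ASSUMES one request per line in the order host, ident user, authenticated user, bracketed date, quoted query, local hits, referral hits; adjust it to match your own logs/search-hits. The sample lines are made up for illustration.

```python
import re
from collections import Counter

# ASSUMED layout (not confirmed by this page):
#   host ident authuser [date] "query" local_hits referral_hits
LINE = re.compile(r'^(\S+) (\S+) (\S+) \[([^\]]*)\] "(.*)" (\d+) (\d+)$')


def summarise(lines):
    """Return (queries by frequency, number of searches with no hits)."""
    queries = Counter()
    misses = 0
    for line in lines:
        match = LINE.match(line.strip())
        if not match:
            continue  # skip anything that does not fit the assumed layout
        query = match.group(5)
        local, referral = int(match.group(6)), int(match.group(7))
        queries[query] += 1
        if local + referral == 0:
            misses += 1
    return queries.most_common(), misses


# Made-up sample entries in the assumed layout.
sample = [
    'host.example.org - - [01/Jan/1998:12:00:00 +0000] "podule" 3 0',
    'host.example.org - - [01/Jan/1998:12:01:00 +0000] "podule" 3 0',
    '192.0.2.1 - alice [01/Jan/1998:12:02:00 +0000] "title=widget" 0 0',
]
print(summarise(sample))
```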

SEE ALSO

"mktemp.pl" in admin-cgi, "tempbyhand.pl" in cgi-bin

COPYRIGHT

Copyright (c) 1988, Martin Hamilton <martinh@gnu.org> and Jon Knight <jon@net.lut.ac.uk>. All rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

It was developed by the Department of Computer Studies at Loughborough University of Technology, as part of the ROADS project. ROADS is funded under the UK Electronic Libraries Programme (eLib), the European Commission Telematics for Research Programme, and the TERENA development programme.

AUTHOR

Martin Hamilton <martinh@gnu.org>, Jon Knight <jon@net.lut.ac.uk>