NAME
KinoSearch::Docs::Tutorial::Simple - Bare-bones search app.
Setup
Copy/move the directory containing the html presentation of the US Constitution from the sample
directory of the KinoSearch distribution to the base level of your web server's htdocs
directory.
$ mv sample/us_constitution /usr/local/apache2/htdocs/
Indexing: invindexer.plx
Our first task will be to create an application called invindexer.plx
which builds a searchable "inverted index" from a collection of documents.
We'll start by creating a KinoSearch::Simple object, telling it where we'd like the index to be located and the language of the source material.
#!/usr/bin/perl
use strict;
use warnings;
use KinoSearch::Simple;
use File::Spec;
# (change these as needed)
my $source_dir = '/usr/local/apache2/htdocs/us_constitution';
my $index_loc = '/path/to/index';
my $base_url = '/us_constitution';
my $simple = KinoSearch::Simple->new(
    path     => $index_loc,
    language => 'en',
);
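The language parameter determines how Simple analyzes text (tokenizing, stemming, and so on). As a hypothetical variation, a Spanish collection would be indexed the same way with a different ISO code, assuming 'es' is among the languages your KinoSearch build supports:
my $simple_es = KinoSearch::Simple->new(
    path     => $index_loc,
    language => 'es',    # Spanish analysis instead of English
);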
Next, we'll add a subroutine which reads in and extracts plain text from an html source file. KinoSearch::Simple won't be of any help here. For the most part, KinoSearch is not equipped to deal with source files directly -- it remains deliberately ignorant on the vast subject of file formats, preferring to focus instead on its core competencies of indexing and search.
# Parse an HTML file from our US Constitution collection and return a
# hashref with three keys: title, content, and url.
sub slurp_and_parse_file {
    my $filename = shift;
    my $filepath = File::Spec->catfile( $source_dir, $filename );
    open( my $fh, '<', $filepath )
        or die "Can't open '$filepath': $!";
    my $raw = do { local $/; <$fh> };

    # build up a document hash
    my $url = "$base_url/$filename";
    my %doc = ( url => $url );
    $raw =~ m#<title>(.*?)</title>#s
        or die "couldn't isolate title in '$filepath'";
    $doc{title} = $1;
    $raw =~ m#<div id="bodytext">(.*?)</div><!--bodytext-->#s
        or die "couldn't isolate bodytext in '$filepath'";
    $doc{content} = $1;
    $doc{content} =~ s/<.*?>/ /gsm;    # quick and dirty tag stripping

    return \%doc;
}
Note that parsing HTML using regexes is generally an awful idea, and that this ultra-simple parsing sub only works because the source material's formatting is 100% controlled by us. Under most circumstances, you want HTML::Parser or the like instead of regexes.
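For such circumstances, here's a minimal sketch of the same extraction built on HTML::TreeBuilder, a CPAN module layered on HTML::Parser. The sub name is hypothetical, and the element lookups assume the same markup that the regexes above target:
use HTML::TreeBuilder;

sub parse_with_treebuilder {
    my $filepath = shift;
    my $tree     = HTML::TreeBuilder->new_from_file($filepath);
    my $title    = $tree->look_down( _tag => 'title' );
    my $body     = $tree->look_down( _tag => 'div', id => 'bodytext' );
    my %doc = (
        title   => $title ? $title->as_text : '',
        content => $body  ? $body->as_text  : '',
    );
    $tree->delete;    # TreeBuilder trees must be freed explicitly
    return \%doc;
}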
Add some more generic directory reading code...
# Collect names of source files.
opendir( my $source_dh, $source_dir )
    or die "Couldn't opendir '$source_dir': $!";
my @filenames;
for my $filename ( readdir $source_dh ) {
    next unless $filename =~ /\.html$/;
    next if $filename eq 'index.html';
    push @filenames, $filename;
}
closedir $source_dh or die "Couldn't closedir '$source_dir': $!";
... and now we're ready for the meat of invindexer.plx:
foreach my $filename (@filenames) {
    my $doc = slurp_and_parse_file($filename);
    $simple->add_doc($doc);    # ta-da!
}
That's it.
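Once the paths at the top of the script point somewhere sensible, building the index is a single command:
$ perl invindexer.plx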
Search: search.cgi
As with our indexing app, the bulk of the code in our search script won't be KinoSearch-specific.
The beginning is dedicated to CGI processing and configuration.
#!/usr/bin/perl -T
use strict;
use warnings;
### In order for search.cgi to work, $index_loc must be modified so
### that it points to the invindex created by invindexer.plx, and
### $base_url may have to change to reflect where a web-browser should
### look for the us_constitution directory.
my $index_loc = '';
my $base_url = '/us_constitution';
use CGI;
use List::Util qw( max min );
use POSIX qw( ceil );
use KinoSearch::Simple;
my $cgi = CGI->new;
my $q = $cgi->param('q');
my $offset = $cgi->param('offset');
my $hits_per_page = 10;
$q = '' unless defined $q;
$offset = 0 unless defined $offset;
Once we have those tasks out of the way, we create our KinoSearch::Simple object and feed it a query string.
my $simple = KinoSearch::Simple->new(
    path     => $index_loc,
    language => 'en',
);
my $hit_count = $simple->search(
    query      => $q,
    offset     => $offset,
    num_wanted => $hits_per_page,
);
The value returned by search() is the total number of documents in the collection which matched the query. Our script uses the parameters offset and num_wanted along with this hit count to break up results into "pages" of manageable size.
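For example, with 10 hits per page, the offset for a given page falls out of a little arithmetic (a standalone illustration, not part of the script):
# page 1 starts at offset 0, page 3 at offset 20 (hits 21-30)
my $page        = 3;
my $page_offset = ( $page - 1 ) * $hits_per_page;    # 20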
Calling search() on our Simple object turns it into an iterator. Invoking fetch_hit_hashref() now returns our stored documents (augmented with a score), starting with the most relevant.
# create result list
my $report = '';
while ( my $hit = $simple->fetch_hit_hashref ) {
    my $score = sprintf( "%0.3f", $hit->{score} );
    $report .= qq|
        <p>
          <a href="$hit->{url}"><strong>$hit->{title}</strong></a>
          <em>$score</em>
          <br>
          <span class="excerptURL">$hit->{url}</span>
        </p>
        |;
}
The rest of the script is just text wrangling.
#---------------------------------------------------------------#
# No tutorial material below this point - just html generation. #
#---------------------------------------------------------------#

# generate paging links and hit count, print and exit
my $paging_links = generate_paging_info( $q, $hit_count );
blast_out_content( $q, $report, $paging_links );
exit;

# Create html fragment with links for paging through results n-at-a-time.
sub generate_paging_info {
    my ( $query_string, $total_hits ) = @_;
    $query_string = CGI::escapeHTML($query_string);
    my $paging_info;
    if ( !length $query_string ) {
        # no query, no display
        $paging_info = '';
    }
    elsif ( $total_hits == 0 ) {
        # alert the user that their search failed
        $paging_info
            = qq|<p>No matches for <strong>$query_string</strong></p>|;
    }
    else {
        # calculate the nums for the first and last hit to display
        my $last_result  = min( ( $offset + $hits_per_page ), $total_hits );
        my $first_result = min( ( $offset + 1 ), $last_result );

        # display the result nums, start paging info
        $paging_info = qq|
            <p>
                Results <strong>$first_result-$last_result</strong>
                of <strong>$total_hits</strong> for
                <strong>$query_string</strong>.
            </p>
            <p>
                Results Page:
            |;

        # calculate first and last hits pages to display / link to
        my $current_page = int( $first_result / $hits_per_page ) + 1;
        my $last_page    = ceil( $total_hits / $hits_per_page );
        my $first_page   = max( 1, ( $current_page - 9 ) );
        $last_page = min( $last_page, ( $current_page + 10 ) );

        # create a url for use in paging links
        my $href = $cgi->url( -relative => 1 ) . "?" . $cgi->query_string;
        $href .= ";offset=0" unless $href =~ /offset=/;

        # generate the "Prev" link
        if ( $current_page > 1 ) {
            my $new_offset = ( $current_page - 2 ) * $hits_per_page;
            $href =~ s/(?<=offset=)\d+/$new_offset/;
            $paging_info .= qq|<a href="$href">&lt;= Prev</a>\n|;
        }

        # generate paging links
        for my $page_num ( $first_page .. $last_page ) {
            if ( $page_num == $current_page ) {
                $paging_info .= qq|$page_num \n|;
            }
            else {
                my $new_offset = ( $page_num - 1 ) * $hits_per_page;
                $href =~ s/(?<=offset=)\d+/$new_offset/;
                $paging_info .= qq|<a href="$href">$page_num</a>\n|;
            }
        }

        # generate the "Next" link
        if ( $current_page != $last_page ) {
            my $new_offset = $current_page * $hits_per_page;
            $href =~ s/(?<=offset=)\d+/$new_offset/;
            $paging_info .= qq|<a href="$href">Next =&gt;</a>\n|;
        }

        # close tag
        $paging_info .= "</p>\n";
    }
    return $paging_info;
}
# Print content to output.
sub blast_out_content {
    my ( $query_string, $hit_list, $paging_info ) = @_;
    $query_string = CGI::escapeHTML($query_string);
    print "Content-type: text/html\n\n";
    print qq|
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
  <meta http-equiv="Content-type"
    content="text/html;charset=ISO-8859-1">
  <link rel="stylesheet" type="text/css" href="$base_url/uscon.css">
  <title>KinoSearch: $query_string</title>
</head>
<body>
  <div id="navigation">
    <form id="usconSearch" action="">
      <strong>
        Search the <a href="$base_url/index.html">US Constitution</a>:
      </strong>
      <input type="text" name="q" id="q" value="$query_string">
      <input type="submit" value="=>">
      <input type="hidden" name="offset" value="0">
    </form>
  </div><!--navigation-->
  <div id="bodytext">
    $hit_list
    $paging_info
    <p style="font-size: smaller; color: #666">
      <em>Powered by
        <a href="http://www.rectangular.com/kinosearch/">
          KinoSearch
        </a>
      </em>
    </p>
  </div><!--bodytext-->
</body>
</html>
|;
}
OK... now what?
KinoSearch::Simple is perfectly adequate for some tasks, but it's not very flexible. Many people will find that it lacks at least one or two features they can't live without.
In our next tutorial chapter, BeyondSimple, we'll rewrite our indexing and search scripts using the classes that KinoSearch::Simple hides from view, opening up possibilities for expansion. We'll then spend the rest of the tutorial chapters exploring those possibilities.
COPYRIGHT
Copyright 2005-2007 Marvin Humphrey
LICENSE, DISCLAIMER, BUGS, etc.
See KinoSearch version 0.20.