NAME

SWISH::3 - Perl interface to libswish3

SYNOPSIS

use SWISH::3 qw(:constants);
my $handler = sub {
   my $s3_data   = shift;
   my $props     = $s3_data->properties;
   my $prop_hash = $s3_data->config->get_properties;

   print "Properties\n";
   for my $p ( sort keys %$props ) {
       print " key: $p\n";
       my $prop = $prop_hash->get($p);
       printf( "    <%s type='%s'>%s</%s>\n",
           $prop->name, $prop->type, $s3_data->property($p), $prop->name );
   }    

   print "Doc\n";
   for my $d (SWISH_DOC_FIELDS) {
       printf( "%15s: %s\n", $d, $s3_data->doc->$d );
   }    

   print "TokenList\n";
   my $tokens = $s3_data->tokens;
   while ( my $token = $tokens->next ) {
       print '-' x 50, "\n";
       for my $field (SWISH_TOKEN_FIELDS) {
           printf( "%15s: %s\n", $field, $token->$field );
       }    
   }
};
my $swish3 = SWISH::3->new(
    config  => 'path/to/config.xml',
    handler => $handler,
    regex   => qr/\w+(?:'\w+)*/,
);
$swish3->parse( 'path/to/file.xml' )
   or die "failed to parse file: " . $swish3->error;

printf "libxml2 version %s\n", $swish3->xml2_version;
printf "libswish3 version %s\n", $swish3->version;

DESCRIPTION

SWISH::3 is a Perl interface to the libswish3 C library.

CONSTANTS

All the SWISH_* constants defined in libswish3.h are available and can optionally be imported with the :constants tag.

use SWISH::3 qw(:constants);

See the SWISH::3::Constants section below.

In addition, the SWISH::3 Perl class defines some Perl-only constants:

SWISH_DOC_FIELDS

An array of method names that can be called on a SWISH::3::Doc object in your handler method.

SWISH_TOKEN_FIELDS

An array of method names that can be called on a SWISH::3::Token object.

SWISH_DOC_FIELDS_MAP

A hashref mapping method names to integer id values. The integer values are assigned in libswish3.h.

SWISH_DOC_PROP_MAP

A hashref mapping built-in property names to docinfo attribute names. The values of SWISH_DOC_PROP_MAP are the keys of SWISH_DOC_FIELDS_MAP.
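
As a minimal sketch of how the two maps relate, the loop below looks up each SWISH_DOC_PROP_MAP value in SWISH_DOC_FIELDS_MAP. It assumes SWISH::3 is installed and that the Perl-only constants are exported by the :constants tag, as SWISH_DOC_FIELDS is in SYNOPSIS. The output layout is illustrative only.

```perl
use strict;
use warnings;
use SWISH::3 qw(:constants);

# Both constants are documented as returning hashrefs.
my $prop_map  = SWISH_DOC_PROP_MAP;
my $field_map = SWISH_DOC_FIELDS_MAP;

# Each built-in property name maps to a docinfo attribute name,
# which in turn maps to the integer id assigned in libswish3.h.
my $missing = 0;
for my $prop ( sort keys %$prop_map ) {
    my $field = $prop_map->{$prop};
    if ( exists $field_map->{$field} ) {
        printf "%-20s => %-10s (id %s)\n", $prop, $field, $field_map->{$field};
    }
    else {
        $missing++;
        print "$prop => $field (no id found)\n";
    }
}
```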

FUNCTIONS

default_handler

The handler used if you do not specify one. By default it simply prints the contents of the SWISH::3::Data object to stderr.

CLASS METHODS

new( args )

args should be a list of key/value pairs. See SYNOPSIS.

Returns a new SWISH::3 instance.

xml2_version

Returns the libxml2 version used by libswish3.

version

Returns the libswish3 version.

refcount( object )

Returns the Perl reference count for object.

wc_report( codepoint )

Prints an isw* summary to stderr for codepoint. codepoint should be a positive integer representing a Unicode codepoint.

This prints a report similar to the swish_isw.c example script.

slurp( filename )

Returns the contents of filename as a scalar string. May also be called as an object method.

OBJECT METHODS

get_file_ext( filename )

Returns the file extension for filename.

get_mime( filename )

Returns the configured MIME type for filename based on file extension.

get_real_mime( filename )

Returns the configured MIME type for filename, ignoring any .gz extension. See looks_like_gz.

looks_like_gz( filename )

Returns true if filename has a file extension indicating it is gzip'd. Wraps the swish_fs_looks_like_gz() C function.
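
The four file-name helpers above can be compared side by side. This is a sketch under two assumptions: that SWISH::3 is installed, and that new() with no arguments falls back to the default configuration. The file names are hypothetical and are used only for their extensions. The values printed depend on your configuration, so none are shown here.

```perl
use strict;
use warnings;
use SWISH::3;

# Assumption: with no arguments, new() uses the default configuration.
my $s3 = SWISH::3->new;

# Print a placeholder instead of warning if a method returns undef.
sub show { return defined $_[0] ? $_[0] : '(undef)' }

my @report;
for my $file (qw( docs/index.html data/feed.xml.gz )) {
    push @report, sprintf(
        "%s: ext=%s mime=%s real_mime=%s gz=%s",
        $file,
        show( $s3->get_file_ext($file) ),
        show( $s3->get_mime($file) ),
        show( $s3->get_real_mime($file) ),    # ignores any .gz extension
        ( $s3->looks_like_gz($file) ? 'yes' : 'no' ),
    );
}
print "$_\n" for @report;
```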

parse( filename_or_filehandle_or_string )

Wrapper around parse_file(), parse_buffer() and parse_fh() that tries to Do the Right Thing.

parse_file( filename )

Calls the C function of the same name on filename.

parse_buffer( str )

Calls the C function of the same name on str. Note that str should contain the API headers.

parse_fh( filehandle )

Not yet implemented.

error

Returns the error message from the last call to parse(), parse_file(), parse_buffer() or parse_fh(). If there was no error on the last call to one of those methods, returns undef.
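
A minimal sketch of checking error() after each parse, assuming SWISH::3 is installed and that parse_file() returns false on failure in the same way parse() does in SYNOPSIS. File names are taken from the command line. parse_file() is used instead of parse() so that a mistyped name is not guessed to be something other than a file.

```perl
use strict;
use warnings;
use SWISH::3;

my $parsed = 0;
my $s3     = SWISH::3->new(
    handler => sub {
        my $data = shift;
        $parsed++;
        print "parsed: ", $data->doc->uri, "\n";
    },
);

# Keep going after a failure and collect the names that did not parse.
my @failed;
for my $file (@ARGV) {
    next if $s3->parse_file($file);
    push @failed, $file;
    warn "failed to parse $file: " . $s3->error . "\n";
}
printf "%d parsed, %d failed\n", $parsed, scalar @failed;
```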

set_config( swish_3_config )

Set the Config object.

get_config

Returns SWISH::3::Config object.

config

Alias for get_config().

set_analyzer( swish_3_analyzer )

Set the Analyzer object.

get_analyzer

Returns SWISH::3::Analyzer object.

analyzer

Alias for get_analyzer().

set_parser( swish_3_parser )

Set the Parser object.

get_parser

Returns SWISH::3::Parser object.

parser

Alias for get_parser().

set_handler( \&handler )

Set the parser handler CODE ref.

get_handler

Returns a CODE ref for the handler.

set_data_class( class_name )

Default class_name is SWISH::3::Data.

get_data_class

Returns class name.

set_parser_class( class_name )

Default class_name is SWISH::3::Parser.

get_parser_class

Returns class name.

set_config_class( class_name )

Default class_name is SWISH::3::Config.

get_config_class

Returns class name.

set_analyzer_class( class_name )

Default class_name is SWISH::3::Analyzer.

get_analyzer_class

Returns class name.

set_regex( qr/\w+(?:'\w+)*/ )

Set the regex used in tokenize().

get_regex

Returns the regex used in tokenize().

regex

Alias for get_regex().
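
To see what the pattern shown for set_regex() matches, the following sketch applies it with plain Perl and does not need libswish3. It illustrates only the pattern itself. Anything else tokenize() does with the matches (positions, metanames, context) is not modelled here, and the sample text is made up.

```perl
use strict;
use warnings;

# A run of word characters, optionally followed by apostrophe-joined
# runs, so that contractions such as "don't" stay in one piece.
my $regex = qr/\w+(?:'\w+)*/;

my $text   = "Don't panic: it's only a test-case.";
my @tokens = $text =~ /($regex)/g;

# prints: Don't|panic|it's|only|a|test|case
print join( '|', @tokens ), "\n";
```

Note that the hyphen is not a word character, so "test-case" yields two matches.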

get_stash

Returns the SWISH::3::Stash object used internally by the SWISH::3 object. You typically do not need to access this object as a user of SWISH::3, but if you are developing code that needs to access other objects within a handler function, you can put them in the Stash object and then retrieve them later.

Example:

my $s3    = SWISH::3->new( handler => \&handler );
my $stash = $s3->get_stash();
$stash->set('my_indexer' => $indexer);

# later..
sub handler {
    my $data  = shift;
    my $indexer = $data->s3->get_stash->get('my_indexer');
    $indexer->add_doc( $data );
}

tokenize( string [, metaname, context ] )

Returns a SWISH::3::TokenIterator object representing string. The tokenizer uses the regex defined in set_regex().

tokenize_native( string [, metaname, context ] )

Returns a SWISH::3::TokenIterator object representing string. The tokenizer uses the built-in libswish3 tokenizer, not a regex.
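
The two tokenizers can be compared by running both over the same string and walking each iterator with next(). This is a sketch that assumes SWISH::3 is installed. The sample string is arbitrary, and no output is shown because the tokens depend on the regex and on the libswish3 build.

```perl
use strict;
use warnings;
use SWISH::3;

my $s3     = SWISH::3->new( regex => qr/\w+(?:'\w+)*/ );
my $string = "Don't panic";

# Count the tokens each method produces for the same input.
my %count;
for my $method (qw( tokenize tokenize_native )) {
    print "$method\n";
    $count{$method} = 0;
    my $tokens = $s3->$method($string);
    while ( my $token = $tokens->next ) {
        $count{$method}++;
        printf "  pos %s: %s\n", $token->pos, $token->value;
    }
}
```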

DEVELOPER METHODS

ref_cnt

Returns the internal reference count for the underlying C struct pointer.

debug([n])

Get/set the internal debugging level.

describe( object )

Like calling Devel::Peek::Dump on object.

mem_debug

Calls the C function swish_memcount_debug().

get_memcount

Returns the global C malloc counter value.

dump

A wrapper around describe() and Data::Dump::dump().

SWISH::3::Analyzer

new( swish_3_config )

Returns a new SWISH::3::Analyzer instance.

set_regex( qr/\w+/ )

Set the regex used in SWISH::3->tokenize().

get_regex

Returns a qr// regex object.

get_tokenize

Get the tokenize flag. Default is true.

set_tokenize( 0|1 )

Toggle the tokenize flag. Default is true (tokenize contents when file is parsed).
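
If a handler only needs document metadata, the tokenize flag can be switched off before parsing. A minimal sketch, assuming SWISH::3 is installed and that the analyzer returned by analyzer() is the one used when files are parsed:

```perl
use strict;
use warnings;
use SWISH::3;

my $s3 = SWISH::3->new(
    handler => sub {
        my $data = shift;
        printf "%s (%s bytes)\n", $data->doc->uri, $data->doc->size;
    },
);

# Turn tokenizing off, then read the flag back to confirm.
my $analyzer = $s3->analyzer;
$analyzer->set_tokenize(0);
my $flag = $analyzer->get_tokenize;
print "tokenize flag is ", ( $flag ? "on" : "off" ), "\n";
```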

SWISH::3::Config

set_default

set_properties

Not yet implemented.

get_properties

Returns SWISH::3::PropertyHash object.

set_metanames

Not yet implemented.

get_metanames

Returns SWISH::3::MetaNameHash object.

set_mimes

Not yet implemented.

get_mimes

Returns SWISH::3::xml2Hash object.

set_parsers

Not yet implemented.

get_parsers

Returns SWISH::3::xml2Hash object.

set_aliases

Not yet implemented.

get_aliases

Returns SWISH::3::xml2Hash object.

set_index

Not yet implemented.

get_index

Returns SWISH::3::xml2Hash object.

set_misc

Not yet implemented.

get_misc

Returns SWISH::3::xml2Hash object.

debug

add(file_or_xml)

An alias for add() is merge().

delete

Not yet implemented.

read( filename )

Returns SWISH::3::Config object.

write( filename )

SWISH::3::Data

s3

Get the parent SWISH::3 object.

config

Get the parent SWISH::3::Config object.

property( name )

Returns the string value of PropertyName name.

metaname( name )

Returns the string value of MetaName name.

properties

Returns a hashref of name/value pairs.

metanames

Returns a hashref of name/value pairs.

doc

Returns a SWISH::3::Doc object.

tokens

Returns a SWISH::3::TokenIterator object.

SWISH::3::Doc

mtime

Returns the last modified time as epoch int.

size

Returns the size in bytes.

nwords

Returns the number of tokenized words in the Doc.

encoding

Returns the string encoding of Doc.

uri

Returns the URI value.

ext

Returns the file extension.

mime

Returns the MIME type.

parser

Returns the name of the parser used (TXT, HTML, or XML).

action

Returns the intended action (e.g., add, delete, update).

SWISH::3::MetaName

new( name )

Returns a new SWISH::3::MetaName instance.

TODO: there are no set methods so this isn't of much use.

id

Returns the id integer.

name

Returns the name string.

bias

Returns the bias integer.

alias_for

Returns the alias_for string.

SWISH::3::MetaNameHash

get( name )

Get the SWISH::3::MetaName object for name.

set( name, swish_3_metaname )

Set the SWISH::3::MetaName for name.

keys

Returns array of names.

SWISH::3::Property

id

Returns the id integer.

name

Returns the name string.

ignore_case

Returns the ignore_case boolean.

type

Returns the type integer.

verbatim

Returns the verbatim boolean.

max

Returns the max integer.

sort

Returns the sort boolean.

alias_for

Returns the alias_for string.

SWISH::3::PropertyHash

get( name )

Get the SWISH::3::Property object for name.

set( name, swish_3_property )

Set the SWISH::3::Property for name.

keys

Returns array of names.

SWISH::3::Stash

get( key )

set( key, value )

keys

values

SWISH::3::Token

value

Returns the value string.

meta

Returns the SWISH::3::MetaName object for the Token.

meta_id

Returns the id integer for the related MetaName.

context

Returns the context string.

pos

Returns the position integer.

len

Returns the length in bytes of the Token.

SWISH::3::TokenIterator

next

Returns the next SWISH::3::Token.

SWISH::3::xml2Hash

get( key )

set( key, value )

keys

SWISH::3::Constants

The following constants are imported directly from libswish3 and are defined there.

SWISH_ALIAS
SWISH_BODY_TAG
SWISH_BUFFER_CHUNK_SIZE
SWISH_CASCADE_META_CONTEXT
SWISH_CLASS_ATTRIBUTES
SWISH_CONTRACTIONS
SWISH_DATE_FORMAT_STRING
SWISH_DEFAULT_ENCODING
SWISH_DEFAULT_METANAME
SWISH_DEFAULT_MIME
SWISH_DEFAULT_PARSER
SWISH_DEFAULT_PARSER_TYPE
SWISH_DEFAULT_VALUE
SWISH_DOM_CHAR
SWISH_DOM_STR
SWISH_ENCODING_ERROR
SWISH_ESTRAIER_FORMAT
SWISH_EXT_SEP
SWISH_FALSE
SWISH_FOLLOW_XINCLUDE
SWISH_HEADER_FILE
SWISH_HEADER_ROOT
SWISH_IGNORE_XMLNS
SWISH_INCLUDE_FILE
SWISH_INDEX
SWISH_INDEX_FILEFORMAT
SWISH_INDEX_FILENAME
SWISH_INDEX_FORMAT
SWISH_INDEX_LOCALE
SWISH_INDEX_STEMMER_LANG
SWISH_INDEX_NAME
SWISH_KINOSEARCH_FORMAT
SWISH_LATIN1_ENCODING
SWISH_LOCALE
SWISH_LUCY_FORMAT
SWISH_MAXSTRLEN
SWISH_MAX_FILE_LEN
SWISH_MAX_HEADERS
SWISH_MAX_SORT_STRING_LEN
SWISH_MAX_WORD_LEN
SWISH_META
SWISH_MIME
SWISH_MIN_WORD_LEN
SWISH_PARSERS
SWISH_PARSER_HTML
SWISH_PARSER_TXT
SWISH_PARSER_XML
SWISH_PATH_SEP_STR
SWISH_PREFIX_MTIME
SWISH_PREFIX_URL
SWISH_PROP
SWISH_PROP_DATE
SWISH_PROP_DBFILE
SWISH_PROP_DESCRIPTION
SWISH_PROP_DOCID
SWISH_PROP_DOCPATH
SWISH_PROP_ENCODING
SWISH_PROP_INT
SWISH_PROP_MIME
SWISH_PROP_MTIME
SWISH_PROP_NWORDS
SWISH_PROP_PARSER
SWISH_PROP_RANK
SWISH_PROP_RECCNT
SWISH_PROP_SIZE
SWISH_PROP_STRING
SWISH_PROP_TITLE
SWISH_RD_BUFFER_SIZE
SWISH_SPECIAL_ARG
SWISH_STACK_SIZE
SWISH_SWISH_FORMAT
SWISH_TITLE_METANAME
SWISH_TITLE_TAG
SWISH_TOKENIZE
SWISH_TOKENPOS_BUMPER
SWISH_TOKEN_LIST_SIZE
SWISH_TRUE
SWISH_UNDEFINED_METATAGS
SWISH_UNDEFINED_XML_ATTRIBUTES
SWISH_URL_LENGTH
SWISH_VERSION
SWISH_WORDS
SWISH_XAPIAN_FORMAT

BUGS AND LIMITATIONS

libswish3 is not yet ported to Windows.

AUTHOR

Peter Karman perl@peknet.com

COPYRIGHT

Copyright 2010 Peter Karman.

This file is part of libswish3.

libswish3 is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

libswish3 is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

SEE ALSO

http://swish3.dezi.org/, http://swish-e.org/

SWISH::Prog, Dezi::App