NAME

Seq - A search and set manipulation engine for text documents

ABSTRACT

THIS IS ALPHA SOFTWARE

Seq is a text indexer, search utility, and document set manipulator written in Perl and C. It has several features not found in most comparable programs, owing to a special set of requirements. Searches consist of arbitrarily complex nested queries, with conjunction, disjunction, and negation of words, phrases, or otherwise designated whole document sets. Phrases support proximity operators, specifying either exact distances or upper limits. Any document or result set may be included in a subsequent query, as if it were a subquery. There are also no stop words, so any query, even one consisting of very common words (e.g. <with the> or <for the>), will generate a result set.

The index format draws some ideas from the Lucene search engine, with some simplifications and enhancements. The index segments are stored in an SQL database, defaulting to the zero-configuration SQLite.

SEARCH/SET MANIPULATION SYNTAX

Search and set operations are internally identical, so any search query can be nested inside a set operation query, and vice versa.

Phrases are enclosed in angle brackets '<' and '>'. Alternations (disjunctions) are enclosed in square brackets '[' and ']'. Conjunctions are enclosed in parentheses '(' and ')'. Any delimited entity inside another is called a subquery. Any number of entities (subqueries or words) may appear in a subquery, and these may be nested without limit, as long as the nesting makes logical sense. "<the quick [brown grey] fox>" is a simple valid phrase. When a subquery occurs in the context of a phrase, it is called "positional", as opposed to occurring in overall document set context. For instance, in the query "[dog cat]" the words have no positional context, so it does not matter where they occur in a document. In "<the [dog cat]>" the subquery "[dog cat]" has positional context, so it matters where the words occur (namely, they must come right after the word "the").

Negation of an entity is indicated with a preceding '^'. Any entity may be negated, including words, phrases, disjunctions, and conjunctions. Negation must not occur in a void context: there must be a positive component to the query at the same level as the negation. For instance, the whole query cannot be negated, so "^<quick brown fox>" is not allowed, but "(^<quick brown fox> <lazy dog>)" is.

Search terms may be arbitrarily complex. For example:

"<At [a the] council [of for] the [gods <deities we [love <believe in>] with all our [hearts strength]>] on olympus [hercules apollo athena <some guys>] [was were] [praised <condemned to eternal suffering in the underworld>]>"

"(<frankly my dear> ^(<whatever shall I do> <wherever shall I go>))"

Two operators are available for proximity searches within phrases. '#wN' allows *at most* N skips between words, where the number of skips is the number of intervening words plus one. Thus "<The #w8 dog>" matches the text "The quick brown fox jumped over the lazy dog"; with #w7 or less it would not match, while with #w9 or greater it would still match. There is also the '#dN' operator, which requires *exactly* N skips. For the text above, "<The #d8 dog>", and no other value of N, would match. These operators can be used within phrases only.

INDEXING

Several programs are provided for indexing documents. A typical index is created like this:

# cat textcorpus.txt | tokenizestream | indexstream index_dir
# optimize index_dir

The indexstream program writes index segments, which are then folded together into one by the optimize program.

At minimum, documents must be marked up with <DOCID> and <TEXT> sections. If raw text files are to be indexed, the "docize" program can take a list of filenames and write the files to stdout with the appropriate markup, using each file path as its docid. For example:

# docize dir_of_textfiles/ | tokenizestream | indexstream index_dir
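
For illustration, a marked-up document on the stream might look like this (a hypothetical sketch: only the DOCID and TEXT sections are specified above, so the exact tag layout may differ):

<DOCID>dir_of_textfiles/file1.txt</DOCID>
<TEXT>
The quick brown fox jumped over the lazy dog.
</TEXT>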

TCP/IP SERVICE API

There is a program included which implements a socket-based service. To start it, run:

# bin/seqsvc listen-port index_dir

The listener takes requests in the form:

function-name
arg1
...
argn

with two newlines terminating the request. For instance, to issue a search request with the query "<foo bar>" and retrieve result documents 20 through 29 (offset 20, slice size 10), open a socket and write

search
<foo bar>
20
10

The seqsvc program responds with the results and closes the socket.

Results are returned in JSON format (see http://www.crockford.com/JSON/index.html). This is a simple format easily parsed by most programming languages. In particular, if only lists are used, the representation is identical to a Perl list (of lists of lists, etc.) and can therefore simply be eval'd in Perl.
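
For example, a minimal Perl client might look like this (the host and port are placeholders for this sketch):

use strict;
use warnings;
use IO::Socket::INET;

# Connect to a running seqsvc instance.
my $sock = IO::Socket::INET->new(
    PeerAddr => 'localhost',
    PeerPort => 7777,
    Proto    => 'tcp',
) or die "connect: $!";

# The function name and its arguments go one per line, with a
# blank line (two newlines) terminating the request.
print $sock "search\n<foo bar>\n20\n10\n\n";

# seqsvc writes the results and closes the socket, so read to EOF.
my $json = do { local $/; <$sock> };
print $json;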

The structure of a particular result depends on the function called, but the most common case is the search() function, whose results are in the form of two lists. The first list contains:

query elapsed time
a return code (1 for success, 0 for failure)
result set unique ID
canonicalized query
result document count
result match count

The second list contains the slice of results requested. These are of the form:

[ docid, match1, runlength1, ... matchN, runlengthN ]

Matches are indicated as deltas: each represents the distance in tokens from the previous match, or from the beginning of the document for the first match. The runlength is the length in tokens of the match. If no offset and slice size are given, the first ten result documents are returned.
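
As a worked example of the delta encoding, take the second result row from the example below. This sketch assumes each delta is measured from the position of the previous match, consistent with the description above:

# Decode one result row into absolute token positions.
my @row   = ( '333920', 81, 1, 7, 1 );
my $docid = shift @row;
my $pos   = 0;
while (@row) {
    my ($delta, $runlength) = splice(@row, 0, 2);
    $pos += $delta;
    print "doc $docid: match at token $pos, length $runlength\n";
}
# Under this assumption the matches fall at tokens 81 and 88.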

Here is a complete example of a call and response:

search
<for the>
0
10

[
  [
    0.255496,
    1,
    '@0',
    '< for the >',
    1698,
    8534
  ],
  [
    [ '333916', 232, 1, 59, 1, 171, 1, 399, 1, 107, 1 ],
    [ '333920', 81, 1, 7, 1 ],
    [ '333921', 70, 1 ],
    [ '333924', 120, 1, 320, 1, 111, 1, 67, 1 ],
    [ '333925', 347, 1, 67, 1 ],
    [ '333926', 114, 1, 7, 1, 212, 1, 168, 1 ],
    [ '333927', 297, 1 ],
    [ '333930', 44, 1 ],
    [ '333935', 181, 1, 62, 1, 290, 1, 16, 1, 29, 1, 72, 1, 83, 1 ],
    [ '333943', 52, 1 ]
  ]
]

Other functions, such as for sampling, are in development.

INTERNALS

Each query is compiled into a syntax tree and evaluated recursively. The operations all boil down to merging two posting lists, the inverted index data structure for words/result sets. In the following I'll describe the posting lists for single words (tokens), although the format is nearly the same as for general result sets.

A posting list is referred to in the code as an "isr" for historical reasons. The data structure is an array containing, most importantly, a list of documents, and a list of position lists. A position list contains all the positions in a particular document where the word appears.
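
As a rough picture (illustrative only; the actual array layout in the code may differ), an isr for a single token might look like:

# Two parallel lists: docids, and the token's positions per doc.
$isr = [
    [ 12, 45, 97 ],                    # documents containing the token
    [ [ 3, 40 ], [ 18 ], [ 5, 61 ] ],  # positions within each document
];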

The process of executing a query reduces to merging two posting lists into their left complement, intersection, and right complement (think of a Venn diagram of two intersecting circles, creating three regions). All of the set and search operations are performed by doing this merge and then applying different combinations of functions to the documents depending on which region they fall in. For instance, consider two posting lists for tokens A and B. Merging them gives the left complement (all documents where A occurs but B does not), the intersection (where both A and B occur), and the right complement (where B occurs but A does not). If the operation is conjunction (intersection), both complements are simply discarded, and the intersection is taken as the result after the individual position lists are merged. If the operation is disjunction (union), all documents found are returned, with the position lists of the complements essentially unchanged and those in the intersection merged. If the operation is a sequence (phrase search), the two complements are discarded and the intersection's position lists are checked, with a document returned only if A and B occur in the correct relation. As another example, if the operation involves a negated token, as in "(A ^B)", then both the intersection and the right complement are discarded, and the left complement is taken as the result set.
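
The document-level part of this merge can be sketched as follows (not the actual Seq code; names and structures are illustrative, and the real isrs carry position lists alongside the docids):

use strict;
use warnings;

# Split two sorted docid lists into left complement,
# intersection, and right complement.
sub merge_regions {
    my ($pa, $pb) = @_;           # array refs of sorted docids
    my (@left, @both, @right);
    my ($i, $j) = (0, 0);
    while ($i < @$pa && $j < @$pb) {
        if    ($pa->[$i] < $pb->[$j]) { push @left,  $pa->[$i++] }
        elsif ($pa->[$i] > $pb->[$j]) { push @right, $pb->[$j++] }
        else  { push @both, $pa->[$i]; $i++; $j++ }
    }
    push @left,  @{$pa}[$i .. $#$pa];   # whatever remains of A
    push @right, @{$pb}[$j .. $#$pb];   # whatever remains of B
    return (\@left, \@both, \@right);
}

# Conjunction keeps only @both; disjunction keeps all three
# regions; a negation such as "(A ^B)" keeps only @left.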

Where possible, the postings are merged in the optimal order, going from least frequent to most frequent tokens.

TO DO

Not all behavior is exactly as described in this document. 1 day.

There is a bug in sequencing causing matches to be missed in rare cases. The algorithm needs redesigning. 3 days.

Positional negation, e.g. "<what ^ever>". 2-3 days.

Error handling, for instance syntax checks. 2-3 days.

The database backend to be generalized and network-based rather than on local disk. The socket service to be made multiprocess, as in a typical pre-forking HTTP server. 4-5 days.

Partial result sets for speedier response. 3 days.

Design a parallel database querying system which produces Seq result sets in the Seq backend. ? days.

PROGRAMMING API

Seq is a library; you can use it directly in a Perl program.

use Seq;

# open for indexing
$index = Seq->open_write( "indexname" );
$index->index_document( "docname", $string );
$index->close_index();

Seq::fold_segments("indexname"); # fold all segments together

# open for searching
$index = Seq->open_read( "indexname" );

# Find all docs containing a phrase
$result = $index->search( "<this phrase and no other phrase>" );

$result contains two lists. The first list is data about the result set:

[ return-code, setid, canonical-query, ndocs, nmatches ],

The second list is a slice of the results, by default results 0-9:

[ [ docid1, match1, length1, match2, length2, ... matchN, lengthN ],
  [ docid2, match1, length1, ... ] ]

... where each 'match' is the delta of the token location of a match within that doc.
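
A minimal sketch of walking a result (this assumes search() returns a reference holding the two lists described above; the exact return convention may differ):

my ($info, $hits) = @$result;
my ($rc, $setid, $query, $ndocs, $nmatches) = @$info;
die "search failed\n" unless $rc;

for my $hit (@$hits) {
    my ($docid, @pairs) = @$hit;
    printf "%s: %d match(es)\n", $docid, @pairs / 2;
}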

AUTHOR

Ira Joseph Woodhead, ira at sweetpota dot to

SEE ALSO

Lucene, Plucene, Inline