NAME

btparse - BibTeX parsing and processing library

SYNOPSIS

#include <btparse.h>

void bt_initialize (void);
void bt_free_ast (AST *ast);
void bt_cleanup (void);

void   bt_set_filename (char *filename);
void   bt_set_stringopts (bt_metatype_t metatype, ushort options);
ushort bt_parse_entry_s (char *  entry_text, 
                         ushort  options,
                         int     line,
                         AST **  top);
ushort bt_parse_entry   (FILE *  infile,
                         ushort  options,
                         AST **  top);
ushort bt_parse_file    (char *  filename, 
                         ushort  options,
                         AST **  top);

bt_metatype_t bt_entry_metatype (AST *entry);
char *bt_entry_type (AST *entry);
char *bt_cite_key (AST *entry);
AST * bt_next_field (AST *entry, AST *prev, char **name);
AST * bt_next_macro (AST *entry, AST *prev, char **name);
AST * bt_next_value (AST *head, 
                    AST *prev,
                    bt_nodetype_t *nodetype,
                    char **text);
char *bt_get_text (AST *node);

bt_stringlist * bt_split_list (char *   string,
                               char *   delim,
                               char *   filename,
                               int      line,
                               char *   description);
void bt_free_list (bt_stringlist *list);
bt_name * bt_split_name (char *  name,
                         char *  filename, 
                         int     line,
                         int     name_num);

bt_tex_tree * bt_build_tex_tree (char * string);
void          bt_free_tex_tree (bt_tex_tree **top);
void          bt_dump_tex_tree (bt_tex_tree *node, int depth, FILE *stream);
char *        bt_flatten_tex_tree (bt_tex_tree *top);

DESCRIPTION

btparse is a C library for parsing and processing BibTeX files. It provides a lexical scanner and LL parser (constructed by PCCTS) that are efficient and provide good error detection and recovery; a set of functions for traversing the AST (abstract syntax tree) generated by the parser; and utility functions for manipulating strings according to BibTeX conventions. (Note that nothing in the library assumes that you're using BibTeX files for their original purpose of bibliographic data for scholarly publications; you could use the file format for any conceivable purpose that fits it. However, there is some code in the library that is really only appropriate for use with strings meant to be processed in the same way that BibTeX itself does. This is all entirely optional, though.)

Note that the interface provided by btparse, while complete, is fairly low-level. If you have more sophisticated needs, you might be interested in my Text::BibTeX module for Perl 5 (available on CPAN).

CONCEPTS AND TERMINOLOGY

To understand this document and use btparse, you should already be familiar with the BibTeX language---more specifically, the BibTeX data description language. (BibTeX being the complex beast that it is, one can conceive of the term applying to the program, the data language, the particular database structure described in the original BibTeX documentation, the ".bst" formatting language, and the set of conventions embodied in the standard styles included with the BibTeX distribution. In this document, I'll stick to the first two meanings---the data language because that's what btparse deals with, and the program because it's occasionally necessary to explain differences between my parser and BibTeX's.)

In particular, you should have a good idea what's going on in the following:

@string{and = { and },
        joe = "Blow, Joe",
        john = "John Smith"}

@book(ourbook,
      author = joe # and # john,
      title = {Our Little Book})

If this looks like something you want to parse, but don't want to have to write your own parser for, you've come to the right place.

Before going much further, though, you're going to have to learn some of the terminology I use for describing BibTeX data. Most of it's the same as you'll find in any BibTeX documentation, but it's important to be sure that we're talking about the same things here. So, some definitions:

top-level

All text in a BibTeX file from the start of the file to the start of the first entry, and between entries thereafter.

name

A string of letters, digits, and the following characters:

: + ' . - _

A name must start with a letter. (This is a radically different definition from that used by BibTeX, which lists certain characters not allowed in names. Thus, for BibTeX, @!^631 is a perfectly valid name; btparse rejects this.) Examples from above: string, and.

entry

A chunk of text starting with an "at" sign (@) at top-level, followed by a name (the entry type), followed by an entry delimiter ({ or (), and proceeding to the matching closing delimiter. Also, the data structure that results from parsing this chunk of text.

entry type

The name that comes right after an @ at top-level. Examples from above: string, book.

entry metatype

A classification of entry types that allows us to group all "regular" entries (i.e., anything other than string, comment, or preamble) together. Also, the metatype corresponding to the "string" entry type is called "macro definition," to avoid confusion between BibTeX strings and "string" entries.

entry delimiters

{ and }, or ( and ): the pair of characters that (almost) mark the boundaries of an entry. "Almost" because the start of an entry is marked by an @, not by the "entry open" delimiter.

citation key

(Or just key when it's clear what we're speaking of.) The name immediately following the entry open delimiter in a regular entry. Example from above: ourbook.

field

A name to the left of an equals sign in a regular or macro-definition entry. In the latter context, might also be called a macro name. Examples from above: joe, author.

field list

In a regular entry, everything between the entry delimiters except for the citation key. In a macro definition entry, everything between the entry delimiters (possibly also called a macro list).

compound value

(Usually just "value".) The text that follows an equals sign (=) in a regular or macro definition entry, up to a comma or the entry close delimiter; a list of one or more simple values joined by hash signs (#).

simple value

A string, macro, or number.

string

(Or, sometimes, "quoted string.") A chunk of text between quotes (") or braces ({ and }). Braces must balance: {this is a {string} is not a BibTeX string, but {this is a {string}} is. ("this is a {string" is also illegal, mainly to avoid the possibility of generating bogus TeX code--which BibTeX will do in certain cases.)

macro

A name that appears on the right-hand side of an equals sign (i.e. as one simple value in a compound value). Implies that this name was defined as a macro in an earlier macro definition entry, but this is only checked if btparse is being asked to expand macros to their full definitions.

number

An unquoted string of digits.
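
To make the definitions of name, string, number, and macro concrete, here is a minimal sketch of a classifier for a single simple value, written from the definitions above. It is not btparse's lexer: the sv_* names are hypothetical, and the rule that a quote-delimited string may not contain a bare quote at brace depth zero is an assumption of the sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical result codes for this sketch (not btparse types). */
typedef enum { SV_INVALID, SV_STRING, SV_NUMBER, SV_MACRO } sv_kind;

/* Name: a letter, then letters, digits, or any of  : + ' . - _  */
int sv_is_name (const char *s)
{
   assert (s != NULL);
   if (!isalpha ((unsigned char) *s))
      return 0;
   for (s++; *s != '\0'; s++)
      if (!isalnum ((unsigned char) *s) && strchr (":+'.-_", *s) == NULL)
         return 0;
   return 1;
}

/* Check the text between the delimiters of a string: braces must
   balance, and a quote-delimited string may not contain a bare quote
   at brace depth zero (a simplifying assumption of this sketch). */
int sv_body_ok (const char *body, size_t len, int quoted)
{
   int depth = 0;
   size_t i;
   for (i = 0; i < len; i++)
   {
      if (body[i] == '{')
         depth++;
      else if (body[i] == '}')
      {
         if (--depth < 0)
            return 0;
      }
      else if (body[i] == '"' && quoted && depth == 0)
         return 0;
   }
   return depth == 0;
}

/* Classify one simple value as it would appear in the input text. */
sv_kind sv_classify (const char *tok)
{
   size_t n, i;
   assert (tok != NULL);
   n = strlen (tok);
   if (n >= 2 && tok[0] == '{' && tok[n - 1] == '}')
      return sv_body_ok (tok + 1, n - 2, 0) ? SV_STRING : SV_INVALID;
   if (n >= 2 && tok[0] == '"' && tok[n - 1] == '"')
      return sv_body_ok (tok + 1, n - 2, 1) ? SV_STRING : SV_INVALID;
   if (n > 0)
   {
      for (i = 0; i < n && isdigit ((unsigned char) tok[i]); i++)
         ;
      if (i == n)
         return SV_NUMBER;          /* nothing but digits */
   }
   return sv_is_name (tok) ? SV_MACRO : SV_INVALID;
}
```

With this sketch, sv_classify ("joe") yields SV_MACRO and sv_classify ("1997") yields SV_NUMBER, while the unbalanced {this is a {string} is rejected as SV_INVALID.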

Working with btparse generally consists of passing the library some BibTeX data (or a source for some BibTeX data, such as a filename or a file pointer), which it then lexically scans and parses, constructing an abstract-syntax tree (AST) from it. It returns this AST to you, and you call other btparse functions to traverse and query the tree.

The contents of AST nodes are the private domain of the library, and you shouldn't go poking into them. This being C, though, there's nothing to prevent you from doing so except good manners (and the possibility that I might change the AST structure in future releases, breaking any badly-behaved code). Also, it's not necessary to know the structural relationships between nodes in the AST---that's taken care of by the query/traversal functions.

However, it's useful to know some of the things that btparse deposits in the AST and returns to you through those query/traversal functions. First off, each node has a "node type," which records the syntactic element corresponding to each node. For instance, the entry

@book{mybook, author = "Joe Blow", title = "My Little Book"}

is rooted by an "entry" node; under this would be found a "key" node (for the citation key) and two "field" nodes (for the "author" and "title" fields), and associated with each field node would be a "string" node. The only time this concerns you is when you ask the library for a simple value; just looking at the text is not enough to distinguish quoted strings, numbers, and macro names, so btparse returns the nodetype as well.

In addition to the nodetype, btparse records the metatype of each "entry" node. This allows you (and the library) to distinguish, say, regular entries from comment entries. The two have very different structures, so the library must traverse them differently; moreover, certain traversal functions make no sense on certain entry metatypes---thus it's necessary for you to be able to make the distinction as well.

That said, everything you need to know to work with the AST is explained in bt_traversal.

DATA TYPES AND MACROS

btparse defines several types required for the external interface. First, it trivially defines a boolean type (along with TRUE and FALSE macros). This might affect you when including the btparse.h header in your own code---since it's not possible for the code to detect if there is already a boolean type defined, you might have to define the HAVE_BOOLEAN pre-processor token to deactivate btparse.h's typedef of boolean.

Next, two enumeration types are defined: bt_metatype_t and bt_nodetype_t. Both of these are used extensively in the library itself, and are made available to users of the library because they can be found in nodes of the btparse AST (abstract syntax tree). (I.e., querying the AST can give you bt_metatype_t and bt_nodetype_t values, so the typedefs must be available to your code.)

Entry metatype enum

bt_metatype_t has the following values:

BTE_UNKNOWN
BTE_REGULAR
BTE_COMMENT
BTE_PREAMBLE
BTE_MACRODEF

which are determined by the "entry type" token. (@string entries have the BTE_MACRODEF metatype; @comment and @preamble correspond to BTE_COMMENT and BTE_PREAMBLE; and any other entry type has the BTE_REGULAR metatype.)
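
The mapping just described can be sketched as a small function. This is an illustration only, not the library's code: the EX_* enum is a local stand-in for bt_metatype_t, the ex_* function names are hypothetical, and two behaviours are assumptions of the sketch (entry types are compared without regard to case, and an empty entry type maps to the "unknown" value).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Local stand-ins for the bt_metatype_t values listed above; the real
   enum comes from btparse.h, which this sketch deliberately avoids. */
typedef enum
{
   EX_UNKNOWN, EX_REGULAR, EX_COMMENT, EX_PREAMBLE, EX_MACRODEF
} ex_metatype;

/* Case-insensitive string equality (assumption: entry types are
   compared without regard to case). */
int ex_same_word (const char *a, const char *b)
{
   while (*a != '\0' && *b != '\0')
   {
      if (tolower ((unsigned char) *a) != tolower ((unsigned char) *b))
         return 0;
      a++;
      b++;
   }
   return *a == *b;                 /* both must end together */
}

/* Map an entry type (the name after the @) to a metatype. */
ex_metatype ex_metatype_of (const char *entry_type)
{
   assert (entry_type != NULL);
   if (*entry_type == '\0')
      return EX_UNKNOWN;            /* assumption of this sketch */
   if (ex_same_word (entry_type, "string"))
      return EX_MACRODEF;
   if (ex_same_word (entry_type, "comment"))
      return EX_COMMENT;
   if (ex_same_word (entry_type, "preamble"))
      return EX_PREAMBLE;
   return EX_REGULAR;               /* any other entry type */
}
```

In real code you would not compute this yourself; bt_entry_metatype() reports the metatype that the parser recorded for an entry.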

AST nodetype enum

bt_nodetype_t has the following values:

BTAST_UNKNOWN
BTAST_ENTRY
BTAST_KEY
BTAST_FIELD
BTAST_STRING
BTAST_NUMBER
BTAST_MACRO

Of these, you'll only ever deal with the last three. They are returned when you query the AST for a simple value---just seeing the text isn't enough to distinguish between a quoted string, a number, and a macro, so the AST nodetype is supplied along with the text.
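
To see why the nodetype has to travel with the text, consider a sketch that pastes the simple values of a compound value together, expanding macros along the way. The text "and" means one thing as a macro and another as a quoted string, and only the nodetype tells them apart. Everything here (the ex_* types, the macro table, the ex_paste function) is hypothetical and stands apart from btparse's own data structures; it also assumes that delimiters have already been stripped and that whitespace inside a macro's text is kept as-is.

```c
#include <assert.h>
#include <string.h>

/* Stand-ins for the three nodetypes a caller sees. */
typedef enum { EX_STRING, EX_NUMBER, EX_MACRO } ex_nodetype;

typedef struct { ex_nodetype type; const char *text; } ex_value;
typedef struct { const char *name; const char *text; } ex_macro;

/* Paste the simple values into out, replacing each macro by its
   definition.  Returns 0 on success, -1 if a macro is undefined or
   out is too small. */
int ex_paste (const ex_value *vals, size_t nvals,
              const ex_macro *macros, size_t nmacros,
              char *out, size_t outsize)
{
   size_t used = 0, i, j;

   assert (out != NULL && outsize > 0);
   out[0] = '\0';
   for (i = 0; i < nvals; i++)
   {
      const char *piece = vals[i].text;
      size_t len;

      if (vals[i].type == EX_MACRO)
      {
         piece = NULL;              /* look the macro up by name */
         for (j = 0; j < nmacros; j++)
            if (strcmp (macros[j].name, vals[i].text) == 0)
               piece = macros[j].text;
         if (piece == NULL)
            return -1;              /* undefined macro */
      }
      len = strlen (piece);
      if (used + len >= outsize)
         return -1;                 /* no room for piece plus '\0' */
      memcpy (out + used, piece, len + 1);
      used += len;
   }
   return 0;
}
```

Given the macros from the earlier @string example (with "and" defined as " and ", spaces included), pasting the macro values joe, and, john produces "Blow, Joe and John Smith", whereas a single string value whose text is "and" is copied through unchanged.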

String processing option macros

Since BibTeX is essentially a system for gluing strings together in a wide variety of ways, the processing done to its strings is fairly important. Most of the string transformations are done outside of the lexer/parser; this reduces their complexity, and makes it easier to switch different transformations on and off. This switching is done with an "options" bitmap which can be specified on a per-entry-metatype basis. (That is, you can have one set of transformations done to the strings in all regular entries, another set done to the strings in all macro definition entries, and so on.) If you need finer control than that, it's currently unavailable outside of the library (but it's just a matter of making a couple of functions available and documenting them---so bug me if you need this feature).

There are four basic macros for constructing this bitmap:

BTO_DELQUOTES

Strip quotes ({ and } or ") from strings.

BTO_EXPAND

Expand macro invocations to the full macro text.

BTO_PASTE

Paste simple values together.

BTO_COLLAPSE

Collapse whitespace according to the BibTeX rules.

For instance, supplying BTO_DELQUOTES | BTO_EXPAND as the string options bitmap for the BTE_REGULAR metatype means that strings in all "regular" entries will have quotes stripped and macros expanded, but nothing else. See bt_set_stringopts() and the bt_parse_*() functions for more information on the various parsing/post-processing options.
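
The idea of a per-metatype options bitmap can be sketched in a few lines. This is not how btparse stores its options; it only illustrates the bit arithmetic. The EX_* bit values, the MT_* indices, and the ex_* functions are all hypothetical, and the real BTO_* macros in btparse.h may well use different bits.

```c
#include <assert.h>

typedef unsigned short ex_ushort;

/* Hypothetical bit values standing in for the BTO_* macros. */
#define EX_DELQUOTES  (1 << 0)
#define EX_EXPAND     (1 << 1)
#define EX_PASTE      (1 << 2)
#define EX_COLLAPSE   (1 << 3)

/* Hypothetical indices standing in for the entry metatypes. */
enum { MT_REGULAR, MT_COMMENT, MT_PREAMBLE, MT_MACRODEF, MT_COUNT };

/* One bitmap per metatype; all options start switched off. */
static ex_ushort ex_opts[MT_COUNT];

/* Replace the whole bitmap for one metatype. */
void ex_set_stringopts (int metatype, ex_ushort options)
{
   assert (metatype >= 0 && metatype < MT_COUNT);
   ex_opts[metatype] = options;
}

/* Is a given transformation switched on for this metatype? */
int ex_option_enabled (int metatype, ex_ushort flag)
{
   assert (metatype >= 0 && metatype < MT_COUNT);
   return (ex_opts[metatype] & flag) != 0;
}
```

After ex_set_stringopts (MT_REGULAR, EX_DELQUOTES | EX_EXPAND), quote stripping and macro expansion are on for regular entries, pasting and whitespace collapsing are off, and the other metatypes are unaffected.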

USING THE LIBRARY

The following code is a skeletal example of using the btparse library:

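#include <stdlib.h>
#include <btparse.h>

int main (void)
{
   /* must come before any other btparse call */
   bt_initialize ();

   /* process some data */

   /* frees the memory allocated by bt_initialize() */
   bt_cleanup ();
   exit (0);
}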

Please note the call to bt_initialize(); this is very important! Without it, the library may crash or fail mysteriously. You must call bt_initialize() before calling any other btparse functions. bt_cleanup() just frees the memory allocated by bt_initialize(); if you are careful to call it before exiting, and bt_free_ast() on any abstract-syntax trees generated by btparse when you are done with them, then your program shouldn't have any memory leaks. (Unless they're due to your own code, of course!)

SEE ALSO

The various functions available to btparse users are documented in the bt_input, bt_traversal, and bt_strings man pages; the language recognized by the parser is more formally described in the bt_language man page.

AUTHOR

Greg Ward <greg@bic.mni.mcgill.ca>

COPYRIGHT

Copyright (c) 1996-97 by Gregory P. Ward.

This library is free software; you can redistribute it and/or modify it under the terms of the GNU Library General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Library General Public License for more details.

You should have received a copy of the GNU Library General Public License along with this library; if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.