NAME
btparse - C library for parsing and processing BibTeX data files
SYNOPSIS
#include <btparse.h>
/* Basic library initialization / cleanup */
void bt_initialize (void);
void bt_free_ast (AST *ast);
void bt_cleanup (void);
/* Input / interface to parser */
void bt_set_stringopts (bt_metatype_t metatype, btshort options);
AST * bt_parse_entry_s (char * entry_text,
char * filename,
int line,
btshort options,
boolean * status);
AST * bt_parse_entry (FILE * infile,
char * filename,
btshort options,
boolean * status);
AST * bt_parse_file (char * filename,
btshort options,
boolean * overall_status);
/* AST traversal/query */
AST * bt_next_entry (AST * entry_list,
AST * prev_entry);
AST * bt_next_field (AST *entry, AST *prev, char **name);
AST * bt_next_value (AST *head,
AST *prev,
bt_nodetype_t *nodetype,
char **text);
bt_metatype_t bt_entry_metatype (AST *entry);
char *bt_entry_type (AST *entry);
char *bt_entry_key (AST *entry);
char *bt_get_text (AST *node);
/* Splitting names and lists of names */
bt_stringlist * bt_split_list (char * string,
char * delim,
char * filename,
int line,
char * description);
void bt_free_list (bt_stringlist *list);
bt_name * bt_split_name (char * name,
char * filename,
int line,
int name_num);
void bt_free_name (bt_name * name);
/* Formatting names */
bt_name_format * bt_create_name_format (char * parts, boolean abbrev_first);
void bt_free_name_format (bt_name_format * format);
void bt_set_format_text (bt_name_format * format,
bt_namepart part,
char * pre_part,
char * post_part,
char * pre_token,
char * post_token);
void bt_set_format_options (bt_name_format * format,
bt_namepart part,
boolean abbrev,
bt_joinmethod join_tokens,
bt_joinmethod join_part);
char * bt_format_name (bt_name * name, bt_name_format * format);
/* Construct tree from TeX groups */
bt_tex_tree * bt_build_tex_tree (char * string);
void bt_free_tex_tree (bt_tex_tree **top);
void bt_dump_tex_tree (bt_tex_tree *node, int depth, FILE *stream);
char * bt_flatten_tex_tree (bt_tex_tree *top);
/* Miscellaneous string utilities */
void bt_purify_string (char * string, btshort options);
void bt_change_case (char transform, char * string, btshort options);
DESCRIPTION
btparse is a C library for parsing and processing BibTeX files. It
provides a lexical scanner and LL(k) parser (constructed by PCCTS), both of which
are efficient and offer good error detection and recovery; a set of functions
for traversing the AST (abstract syntax tree) generated by the parser; and
utility functions for manipulating strings according to BibTeX conventions.
(Note that nothing in the library assumes that you're using BibTeX files for
their original purpose of bibliographic data for scholarly publications; you
could use the file format for any conceivable purpose that fits it. However,
there is some code in the library that is really only appropriate for use with
strings meant to be processed in the same way that BibTeX itself does. This is
all entirely optional, though.)
Note that the interface provided by btparse, while complete, is fairly
low-level. If you have more sophisticated needs, you might be interested in my
"Text::BibTeX" module for Perl 5 (available on CPAN).
CONCEPTS AND TERMINOLOGY
To understand this document and use
btparse, you should already be
familiar with the BibTeX language---more specifically, the BibTeX data
description language. (BibTeX being the complex beast that it is, one can
conceive of the term applying to the program, the data language, the
particular database structure described in the original BibTeX documentation,
the ".bst" formatting language, and the set of conventions embodied
in the standard styles included with the BibTeX distribution. In this
document, I'll stick to the first two meanings---the data language because
that's what
btparse deals with, and the program because it's
occasionally necessary to explain differences between my parser and BibTeX's.)
In particular, you should have a good idea what's going on in the following:
@string{and = { and },
joe = "Blow, Joe",
john = "John Smith"}
@book(ourbook,
author = joe # and # john,
title = {Our Little Book})
If this looks like something you want to parse, but don't want to have to write
your own parser for, you've come to the right place.
Before going much further, though, you're going to have to learn some of the
terminology I use for describing BibTeX data. Most of it's the same as you'll
find in any BibTeX documentation, but it's important to be sure that we're
talking about the same things here. So, some definitions:
- top-level
- All text in a BibTeX file from the start of the file to the
start of the first entry, and between entries thereafter.
- name
- A string of letters, digits, and the following characters:
! $ & * + - . / : ; < > ? [ ] ^ _ ` |
A "name" is a catch-all used for entry types, entry keys, and
field and macro names. For BibTeX compatibility, there are slightly
different rules for these four entities; currently, the only such rule
actually implemented is that field and macro names may not begin with a
digit. Some names in the above example: "string",
"and".
- entry
- A chunk of text starting with an "at" sign
("@") at top-level, followed by a name (the entry type),
an entry delimiter ("{" or "("), and proceeding
to the matching closing delimiter. Also, the data structure that results
from parsing this chunk of text. There are two entries in the above
example.
- entry type
- The name that comes right after an "@" at
top-level. Examples from above: "string", "book".
- entry metatype
- A classification of entry types that allows us to group one
or more entry types under the same heading. With the standard BibTeX
database structure, "article", "book",
"inbook", etc. all fall under the "regular entry"
metatype. Other metatypes are "macro definition" (for
"string" entries), "preamble" (for
"preamble" entries), and "comment" (for
"comment" entries). In fact, any entry whose type is not one of
"string", "preamble", or "comment" is called
a "regular" entry.
- entry delimiters
- "{" and "}", or "(" and
")": the pair of characters that (almost) mark the boundaries of
an entry. "Almost" because the start of an entry is marked by an
"@", not by the "entry open" delimiter.
- entry key
- (Or just key when it's clear what we're speaking
of.) The name immediately following the entry open delimiter in a regular
entry, which uniquely identifies the entry. Example from above:
"ourbook". Only regular entries have keys.
- field
- A name to the left of an equals sign in a regular or
macro-definition entry. In the latter context, might also be called a
macro name. Examples from above: "joe", "author".
- field list
- In a regular entry, everything between the entry delimiters
except for the entry key. In a macro definition entry, everything between
the entry delimiters (possibly also called a macro list).
- compound value
- (Usually just "value".) The text that follows an
equals sign ("=") in a regular or macro definition entry, up to
a comma or the entry close delimiter; a list of one or more simple values
joined by hash signs ("#").
- simple value
- A string, macro, or number.
- string
- (Or, sometimes, "quoted string.") A chunk of text
between quotes (""") or braces ("{" and
"}"). Braces must balance: "{this is a {string}" is
not a BibTeX string, but "{this is a {string}}" is. ("this
is a {string" is also illegal, mainly to avoid the possibility of
generating bogus TeX code--which BibTeX will do in certain cases.)
- macro
- A name that appears on the right-hand side of an equals
sign (i.e. as one simple value in a compound value). Implies that this
name was defined as a macro in an earlier macro definition entry, but this
is only checked if btparse is being asked to expand macros to their
full definitions.
- number
- An unquoted string of digits.
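The "name" and "string" definitions above can be checked mechanically. The following is a minimal sketch, not code from btparse: the helpers is_name_char(), is_valid_name(), and braces_balanced() are hypothetical names invented for illustration, and treating an empty name as invalid is an assumption the definitions do not state.

```c
#include <ctype.h>
#include <string.h>

/* Punctuation permitted in a "name", copied from the definition above. */
const char NAME_PUNCT[] = "!$&*+-./:;<>?[]^_`|";

/* Return 1 if c may appear in a name: a letter, a digit, or one of the
   listed punctuation characters. */
int is_name_char(char c)
{
    if (c == '\0')
        return 0;
    return isalnum((unsigned char) c) || strchr(NAME_PUNCT, c) != NULL;
}

/*
 * Return 1 if s is a non-empty name (rejecting the empty string is an
 * assumption of this sketch).  When no_leading_digit is non-zero, also
 * apply the rule for field and macro names: no digit in first position.
 */
int is_valid_name(const char *s, int no_leading_digit)
{
    if (s == NULL || *s == '\0')
        return 0;
    if (no_leading_digit && isdigit((unsigned char) *s))
        return 0;
    for (; *s != '\0'; s++)
        if (!is_name_char(*s))
            return 0;
    return 1;
}

/* Return 1 if every '{' in s is closed by a later '}' and no '}' appears
   before its matching '{'. */
int braces_balanced(const char *s)
{
    int depth = 0;
    for (; *s != '\0'; s++) {
        if (*s == '{')
            depth++;
        else if (*s == '}' && --depth < 0)
            return 0;
    }
    return depth == 0;
}
```

For example, is_valid_name("ourbook", 0) accepts the entry key from the example, while is_valid_name("1999", 1) rejects a field name that starts with a digit. braces_balanced() accepts "{this is a {string}}" and rejects "{this is a {string}".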
Working with btparse generally consists of passing the library some
BibTeX data (or a source for some BibTeX data, such as a filename or a file
pointer), which it lexically scans and parses to construct an abstract
syntax tree (AST). It returns this AST to you, and you call other
btparse functions to traverse and query the tree.
The contents of AST nodes are the private domain of the library, and you
shouldn't go poking into them. This being C, though, there's nothing to
prevent you from doing so except good manners and the possibility that I might
change the AST structure in future releases, breaking any badly-behaved code.
Also, it's not necessary to know the structural relationships between nodes in
the AST---that's taken care of by the query/traversal functions.
However, it's useful to know some of the things that
btparse deposits in
the AST and returns to you through those query/traversal functions. First off,
each node has a "node type," which records the syntactic element
corresponding to each node. For instance, the entry
@book{mybook, author = "Joe Blow", title = "My Little Book"}
is rooted by an "entry" node; under this would be found a
"key" node (for the entry key), two "field" nodes (for the
"author" and "title" fields); and associated with each
field node would be a "string" node. The only time this concerns you
is when you ask the library for a simple value; just looking at the text is
not enough to distinguish quoted strings, numbers, and macro names, so
btparse returns the nodetype as well.
In addition to the nodetype,
btparse records the metatype of each
"entry" node. This allows you (and the library) to distinguish, say,
regular entries from comment entries. Not only do they have very different
structures and must therefore be traversed differently by the library, but
certain traversal functions make no sense on certain entry metatypes---thus
it's necessary for you to be able to make the distinction as well.
That said, everything you need to know to work with the AST is explained in
bt_traversal.
DATA TYPES AND MACROS
btparse defines several types required for the external interface. First,
it trivially defines a "boolean" type (along with "TRUE"
and "FALSE" macros). This might affect you when including the
btparse.h header in your own code---since it's not possible for the
code to detect if there is already a "boolean" type defined, you
might have to define the "HAVE_BOOLEAN" pre-processor token to
deactivate
btparse.h's "typedef" of "boolean".
Next, two enumeration types are defined: "bt_metatype" and
"bt_nodetype". Both of these are used extensively in the library
itself, and are made available to users of the library because they can be
found in nodes of the "btparse" AST (abstract syntax tree). (I.e.,
querying the AST can give you "bt_metatype" and
"bt_nodetype" values, so the "typedef"s must be available
to your code.)
Entry metatype enum
"bt_metatype_t" has the following values:
- "BTE_UNKNOWN"
- "BTE_REGULAR"
- "BTE_COMMENT"
- "BTE_PREAMBLE"
- "BTE_MACRODEF"
which are determined by the "entry type" token. (@string entries have
the "BTE_MACRODEF" metatype; @comment and @preamble correspond to
"BTE_COMMENT" and "BTE_PREAMBLE"; and any other entry type
has the "BTE_REGULAR" metatype.)
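The mapping from entry type to metatype described above is simple enough to sketch. This is an illustration only, not the library's implementation: demo_metatype and demo_classify() are hypothetical stand-ins (the real enum comes from btparse.h), and the comparison is exact and lowercase-only, since how the library treats letter case is not covered here.

```c
#include <string.h>

/* Stand-in for the library's metatype enum; the DEMO_ prefix marks these
   names as hypothetical, not taken from btparse.h. */
typedef enum {
    DEMO_REGULAR,
    DEMO_COMMENT,
    DEMO_PREAMBLE,
    DEMO_MACRODEF
} demo_metatype;

/* Map an entry type (the name right after '@') to its metatype.
   Assumption: the caller passes the type in lowercase. */
demo_metatype demo_classify(const char *entry_type)
{
    if (strcmp(entry_type, "string") == 0)
        return DEMO_MACRODEF;
    if (strcmp(entry_type, "comment") == 0)
        return DEMO_COMMENT;
    if (strcmp(entry_type, "preamble") == 0)
        return DEMO_PREAMBLE;
    return DEMO_REGULAR;    /* "article", "book", and anything else */
}
```

With the real library you would not classify entry types yourself; the synopsis lists bt_entry_metatype() for querying an entry's metatype.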
AST nodetype enum
"bt_nodetype" has the following values:
- "BTAST_UNKNOWN"
- "BTAST_ENTRY"
- "BTAST_KEY"
- "BTAST_FIELD"
- "BTAST_STRING"
- "BTAST_NUMBER"
- "BTAST_MACRO"
Of these, you'll only ever deal with the last three. They are returned when you
query the AST for a simple value---just seeing the text isn't enough to
distinguish between a quoted string, a number, and a macro, so the AST
nodetype is supplied along with the text.
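To see why the nodetype matters, consider how the three kinds of simple value differ in the source text. There the lexical form tells them apart: a string starts with a quote or brace, a number is all digits, and anything else is a macro name. The paragraph above implies that the text handed back no longer carries that distinction. The sketch below classifies raw source tokens using only the definitions given earlier. demo_nodetype and demo_classify_value() are hypothetical names, not part of btparse, and the function does not check that the token is well formed.

```c
#include <ctype.h>

/* Stand-in for the three nodetypes a caller deals with, plus an
   "unknown" value for empty input.  Hypothetical names, not from
   btparse.h. */
typedef enum {
    DEMO_AST_UNKNOWN,
    DEMO_AST_STRING,
    DEMO_AST_NUMBER,
    DEMO_AST_MACRO
} demo_nodetype;

/* Classify one simple value as it appears in the source text. */
demo_nodetype demo_classify_value(const char *raw)
{
    const char *p;

    if (raw[0] == '\0')
        return DEMO_AST_UNKNOWN;
    if (raw[0] == '"' || raw[0] == '{')
        return DEMO_AST_STRING;         /* quoted or braced string */
    for (p = raw; *p != '\0'; p++)
        if (!isdigit((unsigned char) *p))
            return DEMO_AST_MACRO;      /* any non-digit: a macro name */
    return DEMO_AST_NUMBER;             /* digits only */
}
```

Under these assumptions, the tokens 1997 and "1997" classify as a number and a string respectively, even though both carry the same four digits as their text.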
String processing option macros
Since BibTeX is essentially a system for glueing strings together in a wide
variety of ways, the processing done to its strings is fairly important. Most
of the string transformations are done outside of the lexer/parser; this
reduces their complexity, and makes it easier to switch different
transformations on and off. This switching is done with an "options"
bitmap which can be specified on a per-entry-metatype basis. (That is, you can
have one set of transformations done to the strings in all regular entries,
another set done to the strings in all macro definition entries, and so on.)
If you need finer control than that, it's currently unavailable outside of the
library (but it's just a matter of making a couple functions available and
documenting them---so bug me if you need this feature).
There are four basic macros for constructing this bitmap:
- "BTO_CONVERT"
- Convert "number" values to strings. (The
conversion is trivial, involving changing the type of the AST node
representing the number from "BTAST_NUMBER" to
"BTAST_STRING". "Number" values are stored as strings
of digits, just as they are in the input data.)
- "BTO_EXPAND"
- Expand macro invocations to the full macro text.
- "BTO_PASTE"
- Paste simple values together.
- "BTO_COLLAPSE"
- Collapse whitespace according to the BibTeX rules.
For instance, supplying "BTO_CONVERT | BTO_EXPAND" as the string
options bitmap for the "BTE_REGULAR" metatype means that all simple
values in "regular" entries will be converted to strings: numbers
will simply have their "nodetype" changed, and macros will be
expanded. Nothing else will be done to the simple values, though---they will
not be concatenated, nor will whitespace be collapsed. See the
"bt_set_stringopts()" and "bt_parse_*()" functions in
bt_input for more information on the various options for parsing; see
bt_postprocess for details on the post-processing.
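The effect of combining these options can be sketched on a toy model of a compound value. Everything here is a stand-in for illustration and none of it is the library's code: the DEMO_* bit values, the demo_value struct, the hard-coded macro table in demo_lookup(), and demo_process() are all hypothetical, and the real BTO_* values live in btparse.h. Whitespace collapsing is left out. Pasting only when every value is already a string is an assumption of this sketch.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in option bits (hypothetical values, not those of BTO_*). */
#define DEMO_CONVERT 1u
#define DEMO_EXPAND  2u
#define DEMO_PASTE   4u

typedef enum { DEMO_STR, DEMO_NUM, DEMO_MAC } demo_type;

typedef struct {
    demo_type   type;
    const char *text;
} demo_value;

/* Toy macro table matching the @string entry shown earlier. */
static const char *demo_lookup(const char *name)
{
    if (strcmp(name, "and") == 0)  return " and ";
    if (strcmp(name, "joe") == 0)  return "Blow, Joe";
    if (strcmp(name, "john") == 0) return "John Smith";
    return "";      /* undefined macro: empty text in this sketch */
}

/*
 * Apply the selected transformations to vals[0..n-1] in place and return
 * how many simple values remain.  If DEMO_PASTE is set, every value is a
 * string, and the result fits in out, the values are joined into out,
 * vals[0] points at it, and 1 is returned.  Otherwise n is returned.
 */
size_t demo_process(demo_value *vals, size_t n, unsigned opts,
                    char *out, size_t outsize)
{
    size_t i, used = 0;

    for (i = 0; i < n; i++) {
        if (vals[i].type == DEMO_NUM && (opts & DEMO_CONVERT)) {
            vals[i].type = DEMO_STR;            /* text is unchanged */
        } else if (vals[i].type == DEMO_MAC && (opts & DEMO_EXPAND)) {
            vals[i].text = demo_lookup(vals[i].text);
            vals[i].type = DEMO_STR;
        }
    }
    if (!(opts & DEMO_PASTE) || n == 0 || outsize == 0)
        return n;
    for (i = 0; i < n; i++)
        if (vals[i].type != DEMO_STR)
            return n;                           /* leave unpasted */
    out[0] = '\0';
    for (i = 0; i < n; i++) {
        size_t len = strlen(vals[i].text);
        if (used + len >= outsize)
            return n;                           /* does not fit */
        memcpy(out + used, vals[i].text, len + 1);
        used += len;
    }
    vals[0].type = DEMO_STR;
    vals[0].text = out;
    return 1;
}
```

Feeding in the three macros from the earlier example (joe # and # john) with DEMO_CONVERT | DEMO_EXPAND leaves three separate string values. Adding DEMO_PASTE joins them into one. With the real library, the per-metatype setting would go through bt_set_stringopts(), as listed in the synopsis.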
USING THE LIBRARY
The following code is a skeletal example of using the
btparse library:
#include <stdlib.h>   /* for exit() */
#include <btparse.h>
int main (void)
{
   bt_initialize ();
   /* process some data */
   bt_cleanup ();
   exit (0);
}
Please note the call to "bt_initialize()"; this is very important!
Without it, the library may crash or fail mysteriously. You must call
"bt_initialize()" before calling any other btparse functions.
"bt_cleanup()" just frees the memory allocated by
"bt_initialize()"; if you are careful to call it before exiting, and
"bt_free_ast()" on any abstract syntax trees generated by
btparse when you are done with them, then your program shouldn't have
any memory leaks. (Unless they're due to your own code, of course!)
BUGS AND LIMITATIONS
btparse has several inherent limitations that are due to the lexical
scanner and parser generated by PCCTS 1.x. In short, the scanner and parser
are both heavily dependent on global variables, meaning that thread safety --
or even the ability to have two files open and being parsed at the same time
-- is well-nigh impossible. This will not change until I get with the times
and adopt ANTLR 2.0, the successor to PCCTS -- presuming of course that it can
generate more modular C scanners and parsers.
Another limitation that is due to PCCTS: entries with a large number of fields
(more than about 90, if each field value is just a single string) will cause
the parser to crash. This is unavoidable due to the parser using
statically-allocated stacks for attributes and abstract-syntax tree nodes. I
could increase the static allocation, but that would just decrease the
likelihood of encountering the problem, not make it go away. Again, the
chances of this changing as long as I'm using PCCTS 1.x are nil.
Apart from those inherent limitations, there are no known bugs in
btparse. Any segmentation faults or bus errors from the library should
be considered bugs. They probably result from using the library incorrectly
(e.g., attempting to interleave the parsing of two files), but I do make an
attempt to catch all such mistakes, and if I've missed any I'd like to know
about it.
Any memory leaks from the library are also a concern; as long as you are
conscientious about calling the cleanup functions ("bt_free_ast()"
and "bt_cleanup()"), then the library shouldn't leak.
SEE ALSO
To read and parse BibTeX data files, see bt_input.
To traverse the syntax tree that results, see bt_traversal.
To learn what is done to values in parsed entries, and how to customize that
munging, see bt_postprocess.
To learn how
btparse deals with strings, see bt_strings (oops, I haven't
written this one yet!).
To manipulate and access the
btparse macro table, see bt_macros.
For splitting author names and lists "the BibTeX way" using
btparse, see bt_split_names.
To put author names back together again, see bt_format_names.
Miscellaneous functions for processing strings "the BibTeX way":
bt_misc.
A semi-formal language definition is in bt_language.
AUTHOR
Greg Ward <gward@python.net>
COPYRIGHT
Copyright (c) 1996-97 by Gregory P. Ward.
This library is free software; you can redistribute it and/or modify it under
the terms of the GNU Library General Public License as published by the Free
Software Foundation; either version 2 of the License, or (at your option) any
later version.
This library is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR
A PARTICULAR PURPOSE. See the GNU Library General Public License for more
details.
You should have received a copy of the GNU Library General Public License along
with this library; if not, write to the Free Software Foundation, Inc., 675
Mass Ave, Cambridge, MA 02139, USA.
AVAILABILITY
The btOOL home page, where you can get up-to-date information about
btparse (and download the latest version) is
http://starship.python.net/~gward/btOOL/
You will also find the latest version of Text::BibTeX, the Perl library
that provides a high-level front-end to btparse, there. btparse
is needed to build "Text::BibTeX", and must be downloaded
separately.
Both libraries are also available on CTAN (the Comprehensive TeX Archive
Network, "http://www.ctan.org/tex-archive/") and CPAN (the
Comprehensive Perl Archive Network, "http://www.cpan.org/"). Look in
biblio/bibtex/utils/btOOL/ on CTAN, and authors/Greg_Ward/ on
CPAN. For example,
http://www.ctan.org/tex-archive/biblio/bibtex/utils/btOOL/
http://www.cpan.org/authors/Greg_Ward
will both get you to the latest version of "Text::BibTeX" and
btparse -- but of course, you should always access busy sites like CTAN
and CPAN through a mirror.