/* ================================================================= *\ bibindex -- a program to index bibtex files, used in conjunction with biblook This program was specifically developed for use with the computational geometry bibliographic database, and assumes the rules thereof. The database may be obtained by anonymous ftp from cs.usask.ca in the file "pub/geometry/geombib.tar.Z". Version 1.0 written by Jeff Erickson , 27 Mar 92 Version 2.0 written by Jeff Erickson , 17 Jun 92 This program is in the public domain. You may use it or modify it to your heart's content, at your own risk. Bouquets, brickbats, and bug fixes may be sent to Jeff Erickson, jeffe@cs.berkeley.edu. %Make% gcc -O -o bibindex bibindex.c Usage: bibindex bibfile [-i field ...] ----------------------------------------------------------------- HOW IT WORKS: The bibtex file is read field by field. The file offset beginning each record and each record's citation key are recorded. A list of words is extracted from each field. These words are placed into tables, which remember which records contain them in their respective fields. Once the file has been completely read, the hash tables are compacted and sorted. The hash tables are extensible, since we have to maintain one for each possible field type, and static tables would be way too big. Initially, each table holds 1K entries, and the tables are doubled whenever they reach 75% capacity. Each table entry is at least 24 bytes. If the resulting hash tables use too much memory, the entries should be changed to pointers, allocated on the fly. The entry lists associated with each word are implemented as extensible arrays. Initially, each list holds eight entries. If a new entry is inserted into a full list, the list is doubled first. The index file has the following format (loosely): version info # entries array of offsets into bib file -- one per entry # field types array of field names -- one per field type array of -- one per field type # words array of -- one per word word -- in alphabetical order # locations array of entry #s -- one per location There are advantages and disadvantages of having multiple hash tables instead of a single table. I am starting with the premise that the lookup program should be very fast. Consequently, I can't make it determine which fields contain a given word. Doing so would require putting ALL of the entry-parsing code into the lookup program. It would also mean potentially parsing a lot of extraneous entries to find relatively common words in relatively uncommon places (eg, "title edelsbrunner"). If there were a single word table, each word list would have to include bitmasks to indicate the appropriate fields along with the entry numbers. Assuming between 16 and 32 field types (the CG bib uses about 24), this would triple the size of each entry. On the average, each word occurs in less than two field types. The bitmask approach would also require knowledge of the field names in advance; the multiple table approach does not. ----------------------------------------------------------------- VERSION HISTORY: 1.0 3/26/92 Initial version completed 1.1 3/27/92 Words restricted to letters only; special rules added for apostrophes, so that words like "didn't" and "O'Rourke" can be handled correctly. 1.2 3/30/92 Faster hash function; now compressing hash tables before sorting them. Execution time on the CG bib reduced by a factor of thirty-five. 1.3 4/2/92 Toyed with the hash function some more, trying to reduce collisions, with limited success. 1.4 4/17/92 Added exit(0) at the end of main() [I thought that was supposed to be automatic!] 2.0 6/12/92 First major revision completed. 1. Major change in file format -- word tables for every field instead of just author, title, and keywords. 2. Word hash tables made extensible. 3. Fixed bug in GetNextWord that would cause the function to return inside nested braces. 4. Fixed bug in MungeField that would kill everything in an entry after an abbreviation. Abbreviations now go into the appropriate hash table with the other words. 5. Made GetNextWord consider numbers and math expressions to be words. 6. Added WriteWord, resulting in 40% savings in disk space. 7. Added -i flag and black holes. Ignoring "oldlabel" gives another 15% savings (6/92 version of CGbib). 2.1 7/9/92 Minor bug fixes. 2.2 Nelson H. F. Beebe 03-Oct-1992 Testing with >100K lines of .bib files led to these changes: 1. Add support for complete BibTeX keyword name syntax with iskeychar() function. 2. Add support for trailing comma after last key = "value" entry. 3. Record input line number for display in error messages. 4. Record initial line number of each BibTeX entry for display in error messages to better localize error. 5. Add test for word buffer overflow in MungeField() to prevent run-time core dumps, and increase supported word size from 15 to 31 (a word-length histogram of a 116921-word dictionary found words up to 28 characters long, with 1% longer than 15). File version increased to 2 because of word size change. 6. Add typecasts in calls to qsort() and comparisons of unsigned short with short, change main() from void to int, remove variable from initializer of msg[2], and add void to IndexBibFile() definition to allow compilation with C++ as well as C for better compile-time checking. 7. In MungeEntry(), do an ungetc() after key name collection. Otherwise, a key="value" pair without a blank before the = will not be recognized because the = read on lookahead has not been put back. 8. Revise table compression code in OutputTables() and code in FreeTables() to avoid duplicate frees, which is a fatal error on many systems, and specified to result in undefined behavior by the 1989 ANSI/ISO C Standard. 9. Define bcopy() as a macro to invoke standard memcpy() instead. 10. Include time.h, unistd.h, and malloc.h to get proper function declarations for library routines needed here. 11. Add DEBUG_MALLOC support. 12. Change several char* types to const char*. 13. Correct some typographical errors in comment. 14. Write UNIX manual pages for bibindex and biblook. 15. Allow command-line to specify a filename with .bib extension. 16. Add help support to biblook. 17. Correct error in FreeTables() in biblook.c; inner loop incremented i instead of j. 2.3 Bill Jones 93/01/29 1. Declarations common to bibindex.c and biblook.c factored out to new file biblook.h. 2. Index type of (signed) short overflows early; created typedef Index_t, defined as unsigned short. 3. Changed hash tables to extend at 75% capacity rather than 50%. 2.4 Nelson H. F. Beebe [01-Jun-1993] 1. Remove some mixed-mode arithmetic. 2. Increase MAXFIELDS from 64 to 127 to deal with TeX User Group bibliography collection 3. Correct error in GetHashTable(); older versions got into an infinite loop if MAXFIELDS field names were already stored, and a new one was encountered. \* ================================================================= */ #include "biblook.h" static long line_number = 1L; /* for debug messages */ static long initial_line_number = 1L; /* ======================= UTILITY FUNCTIONS ======================= */ /* ----------------------------------------------------------------- *\ | void die(const char *msg1, const char *msg2) -- print an error message and die \* ----------------------------------------------------------------- */ void die(const char *msg1, const char *msg2) { fprintf(stderr, "\nError:\tin BibTeX entry starting at line %ld, \ error detected at line %ld:\n\t%s %s\n", initial_line_number, line_number, msg1, msg2); exit(1); } /* ----------------------------------------------------------------- *\ | char safegetc(FILE *fp, const char *where) | | Get the next character safely. Used by routines that assume that | they won't run into the end of file. \* ----------------------------------------------------------------- */ char safegetc(FILE *fp, const char *where) { char c; if (feof(fp)) die("Unexpected end of file in", where); c = getc(fp); if (c == '\n') line_number++; return (c); } /* ----------------------------------------------------------------- *\ | void *safemalloc(unsigned howmuch, const char *msg1, const char *msg2) | | Allocate memory safely. Used by routines that assume they won't | run out of memory. \* ----------------------------------------------------------------- */ void *safemalloc(unsigned howmuch, const char *msg1, const char *msg2) { register void *tmp = NULL; tmp = malloc(howmuch); if (tmp == NULL) die(msg1, msg2); return tmp; } /* ====================== HASH TABLE FUNCTIONS ===================== *\ The hash tables start small and double whenever they reach 75% capacity. Hashing is performed by going through the string one character at a time, multiplying by a constant and adding in the new character value each time. The constant is defined to give the same even spread (about size/sqrt(2)) between successive multiples, as long as the hash table size is a power of two. Collisions are resolved by double hashing. Since the hash table size is always a power of two, the secondary hash value has to be odd to avoid loops. The field tables are non-extensible hash tables, otherwise handled the same way. It is probably well worth the effort to fine tune the field table hash function in order to avoid collisions. The field tables associated with ignored fields are black holes. Everything is the same, except that InsertEntry doesn't actually DO anything. \* ================================================================= */ #define MAXFIELDS 127 /* intentionally way too much */ /* NB: limited by char type of numfields */ #define INIT_HASH_SIZE 256 #define HASH_CONST 1482907 /* prime close to 2^{20.5} */ typedef struct /* Hash table entry */ { Word theword; /* the hashed word */ Index_t number; /* number of references in the list */ Index_t size; /* real size of reference list */ Index_t *refs; /* actual list of references */ } HashCell, *HashPtr; typedef struct /* Extensible hash table */ { Word thefield; /* the field type */ Index_t number; /* number of words in the hash table */ Index_t size; /* real size of the hash table */ HashPtr words; /* the actual hash table */ } ExHashTable; static ExHashTable fieldtable[MAXFIELDS]; /* the field tables */ static char numfields; /* number of fields */ /* ----------------------------------------------------------------- *\ | void InitTables(void) | | Initialize the field tables \* ----------------------------------------------------------------- */ void InitTables(VOID) { register unsigned short i; numfields = 0; for (i=0; inumber = 0; htable->size = INIT_HASH_SIZE; htable->words = (HashPtr) safemalloc(INIT_HASH_SIZE*sizeof(HashCell), "Can't create hash table for", htable->thefield); for (i=0; iwords[i].theword[0] = 0; htable->words[i].number = 0; htable->words[i].size = 0; htable->words[i].refs = NULL; } } /* ----------------------------------------------------------------- *\ | void FreeTables(void) | | Free the tables \* ----------------------------------------------------------------- */ void FreeTables(VOID) { register unsigned short i; Index_t j; for (i=0; i < (unsigned short)numfields; i++) { if (fieldtable[i].words) { for (j=0; j < fieldtable[i].number; j++) { if (fieldtable[i].words[j].refs) free(fieldtable[i].words[j].refs); } free(fieldtable[i].words); } } } /* ----------------------------------------------------------------- *\ | ExHashTable *GetHashTable(char *field) | | Get the hash table associated with the given field. If the table | is unclaimed, claim it and initialize it. \* ----------------------------------------------------------------- */ ExHashTable *GetHashTable(char *field) { register unsigned long hash=0; /* primary hash value */ register unsigned long skip=1; /* secondary hash value */ register int i; for (i=0; field[i]; i++) { hash = (hash*HASH_CONST + field[i]) % MAXFIELDS; skip += 2*hash; } while (fieldtable[hash].thefield[0] && /* cell not empty and */ strcmp(fieldtable[hash].thefield, field)) /* not the right field */ { hash = (hash+skip) % MAXFIELDS; } if (!fieldtable[hash].thefield[0]) { strcpy(fieldtable[hash].thefield, field); InitOneField(fieldtable+hash); numfields++; if (numfields >= MAXFIELDS) /* NB: NOT > because that produces */ /* an infinite while() loop above on */ /* next entry! */ die("too many field names",field); } return fieldtable+hash; } /* ----------------------------------------------------------------- *\ | void InitBlackHole(char *field) | | Intitialize a black hole for the given field. \* ----------------------------------------------------------------- */ void InitBlackHole(char *field) { ExHashTable *hole; hole = GetHashTable(field); free(hole->words); hole->words = NULL; } /* ----------------------------------------------------------------- *\ | HashPtr GetHashCell(ExHashTable *table, char *word) | | Get the hash table cell associated with the given word. If the cell | is unclaimed, claim it, initialize it, and update the table's word | count. \* ----------------------------------------------------------------- */ HashPtr GetHashCell(ExHashTable *htable, char *word) { register HashPtr table, cell; register unsigned long hash=0; /* primary hash value */ register unsigned long skip=1; /* secondary hash value */ register int i; table = htable->words; for (i=0; word[i]; i++) { hash = (hash*HASH_CONST + word[i]) % htable->size; skip += 2*hash; } while (table[hash].theword[0] && /* cell not empty and */ strcmp(table[hash].theword, word)) /* not the right word */ { hash = (hash+skip) % htable->size; } cell = table+hash; if (!cell->theword[0]) { strcpy(cell->theword, word); cell->size = 8; cell->refs = (Index_t *) safemalloc(cell->size * sizeof(Index_t), "Can't create entry list for", word); htable->number++; } return cell; } /* ----------------------------------------------------------------- *\ | void ExtendHashTable(ExHashTable *htable) | | Double the size of the hash table and rehash everything. \* ----------------------------------------------------------------- */ void ExtendHashTable(ExHashTable *htable) { register HashPtr newcell; register HashPtr oldtable; Index_t i; Index_t oldsize; oldsize = htable->size; oldtable = htable->words; htable->number = 0; htable->size *= 2; htable->words = (HashPtr) safemalloc(sizeof(HashCell)*htable->size, "Can't extend hash table for", htable->thefield); for (i=0; i < htable->size; i++) { htable->words[i].theword[0] = 0; htable->words[i].number = 0; htable->words[i].size = 0; htable->words[i].refs = NULL; } for (i=0; i< oldsize; i++) { if (oldtable[i].theword[0]) { newcell = GetHashCell(htable, oldtable[i].theword); *newcell = oldtable[i]; } } free(oldtable); } /* ----------------------------------------------------------------- *\ | void InsertEntry(ExHashTable *htable, char *word, Index_t entry) | | Insert the word/entry pair into the hash table, unless it's | already there. \* ----------------------------------------------------------------- */ void InsertEntry(ExHashTable *htable, char *word, Index_t entry) { register HashPtr cell; Index_t *newlist; if (htable->words == NULL) /* Is it a black hole? */ return; if ((Index_t)(htable->number * 4) > (Index_t)(htable->size * 3)) ExtendHashTable(htable); cell = GetHashCell(htable, word); if (cell->number && (cell->refs[cell->number-1] == entry)) return; if (cell->number == cell->size) /* expand the array */ { cell->size *= 2; newlist = (Index_t *) safemalloc(cell->size * sizeof(Index_t), "Can't extend entry list for", word); bcopy(cell->refs, newlist, cell->number * sizeof(Index_t)); free(cell->refs); cell->refs = newlist; } cell->refs[cell->number++] = entry; if (cell->number <= 0) die("hash type overflow", htable->thefield); } /* ================================================================= */ /* ----------------------------------------------------------------- *\ | void WriteWord(FILE *ofp, Word word) | | Output the word in "Pascal" string format -- length byte followed | by characters. This saves some disk space over writing MAXWORD+1 bytes | in all cases. \* ----------------------------------------------------------------- */ void WriteWord(FILE *ofp, Word word) { unsigned char length = strlen(word); fwrite((void *) &length, sizeof(char), 1, ofp); fwrite((void *) word, sizeof(char), length, ofp); } /* ----------------------------------------------------------------- *\ | void OutputTables(FILE *ofp) | | Compress and output the tables, with lots of user feedback. | KLUDGE -- Passing strcmp to qsort assumes intimate knowledge of | the HashCell and ExhashTable structs. \* ----------------------------------------------------------------- */ void OutputTables(FILE *ofp) { register HashPtr words; register int i, k, count; Index_t m, n; printf("Writing index tables..."); fflush(stdout); numfields = 0; /* recount, ignoring black holes */ for (i=0; i numfields) { fieldtable[numfields] = fieldtable[i]; /* copy i-th table */ fieldtable[i].number = 0; /* then clear i-th table */ fieldtable[i].size = 0; /* to avoid duplicate free() later */ fieldtable[i].words = NULL; } numfields++; } } qsort(fieldtable, (size_t)numfields, sizeof(ExHashTable), (int (*)(const void*,const void*))strcmp); fwrite((void *) &numfields, sizeof(char), 1, ofp); for (i=0; i n) { words[n] = words[m]; /* copy m-th table to n-th */ words[m].number = 0; /* then clear m-th table */ words[m].size = 0; /* to avoid duplicate free() later */ words[m].refs = (Index_t*)0; } n++; } } qsort(words, (size_t)fieldtable[k].number, sizeof(HashCell), (int (*)(const void*,const void*))strcmp); fwrite((void *) &(fieldtable[k].number), sizeof(Index_t), 1, ofp); count = 0; for (m=0; mthefield); ch = safegetc(ifp, "MungeField"); while (isspace(ch)) ch = safegetc(ifp, "MungeField"); if (ch != '{' && ch != '"') /* handle abbreviations */ { for (i=0; !strchr(NONABBREV, ch); i++) { if (i >= MAXWORD) die("word buffer overflow", htable->thefield); nextword[i] = tolower(ch); ch = safegetc(ifp, "MungeField"); } nextword[i] = 0; nextword[MAXWORD] = 0; InsertEntry(htable, nextword, entry); } else { while (GetNextWord(ifp, nextword)) { if (nextword[1] == 0) /* ignore single letters */ continue; for (bad=0, i=0; !bad && badwords[i]; i++) if (!strcmp(nextword, badwords[i])) bad = 1; if (!bad) { nextword[MAXWORD] = 0; /* truncate to fit */ InsertEntry(htable, nextword, entry); } } ch = safegetc(ifp, "MungeField"); /* close quote/brace */ ch = safegetc(ifp, "MungeField"); /* comma (maybe) */ } while (isspace(ch)) ch = safegetc(ifp, "MungeField"); ungetc(ch, ifp); /* comma or close paren/brace */ } /* ----------------------------------------------------------------- *\ | int iskeychar(char c,int first_char) | | Return 1 if c is a valid keyword character, and 0 otherwise. | The rules are different for the first character, so first_char | must be set non-zero to select that case. | | The code in bibtex.web in the section on character set dependencies | creates a character classification array, id_class[], which permits | the following characters (verified by experiment too) in field | names: | | A-Z a-z 0-9 | ! $ & + - . / : ; < > ? @ [ \ ] ^ _ ` | ~ | \* ----------------------------------------------------------------- */ int iskeychar(char c,int first_char) { if (first_char) return (isalpha(c) ? 1 : 0); else if (isalpha(c) || isdigit(c)) return (1); else if (strchr("!$&+-./:;<>?@[\\]^_`|~\177",(int)c) != (char*)NULL) return (1); else return (0); } /* ----------------------------------------------------------------- *\ | void MungeEntry(FILE *ifp, Index_t entry) | | Wander through the entry, extracting words and depositing them in | the appropriate hash tables. \* ----------------------------------------------------------------- */ void MungeEntry(FILE *ifp, Index_t entry) { register char ch; Word thefield; short i; ch = safegetc(ifp, "MungeEntry"); while (ch == ',') { ch = safegetc(ifp, "MungeEntry"); while (isspace(ch)) ch = safegetc(ifp, "MungeEntry"); if ((ch == '}') || (ch == ')')) /* allow trailing comma after */ return; /* last key = "value" entry */ if (!isalpha(ch)) { char msg[2]; msg[0] = ch; msg[1] = 0; die("Nonalpha character in field descriptor:", msg); } for (i=0; iskeychar(ch,i==0); i++) { if (i >= MAXWORD) { thefield[i] = 0; die("field name buffer overflow", thefield); } thefield[i] = tolower(ch); ch = safegetc(ifp, "MungeEntry"); } ungetc(ch, ifp); /* put back lookahead char */ thefield[i] = 0; MungeField(ifp, GetHashTable(thefield), entry); ch = safegetc(ifp, "MungeEntry"); } } /* ============================= OUTPUT ============================ */ /* ----------------------------------------------------------------- *\ | IndexBibFile(FILE *ifp, FILE *ofp, char *filename) | | Index a bibtex file. Input comes from ifp; output goes to ofp. | Filename is the name of the bibliography, with no prefix. \* ----------------------------------------------------------------- */ void IndexBibFile(FILE *ifp, FILE *ofp, char *filename) { Index_t count=0; long curoffset; long *offsets; long *oldoff; Index_t offsize; time_t now = time(0); printf("Indexing %s.bib...", filename); fflush(stdout); offsize = 128; /* MINIMUM OFFSET LIST SIZE */ offsets = (long *) malloc(offsize * sizeof(long)); while (!feof(ifp)) { curoffset = FindNextEntry(ifp); if (curoffset == (unsigned long) -1) break; offsets[count] = curoffset; MungeEntry(ifp, count++); if (count == offsize) /* expand full offset array */ { oldoff = offsets; offsize *= 2; offsets = (long *) malloc(offsize * sizeof(long)); bcopy(oldoff, offsets, count * sizeof(long)); free(oldoff); } if (!(count % 500)) /* feedback */ { printf("%d...", count); fflush(stdout); } } printf("done.\n"); fprintf(ofp, "bibindex %d %d %d %s", FILE_VERSION, MAJOR_VERSION, MINOR_VERSION, ctime(&now)); printf("Writing file offsets..."); fflush(stdout); fwrite((void *) &count, sizeof(Index_t), 1, ofp); fwrite((void *) offsets, sizeof(long), count, ofp); free(offsets); printf("[%d entries]\n", count); OutputTables(ofp); printf("All done!\n"); } /* ----------------------------------------------------------------- *\ | The main program \* ----------------------------------------------------------------- */ int main(int argc, char **argv) { FILE *ifp; FILE *ofp; char infile[256]; char outfile[256]; char *p; int i; #if DEBUG_MALLOC malloc_debug(2); #endif /* DEBUG_MALLOC */ if (argc < 2) die("Usage: bibindex bib [-i field...]", ""); if (((p = strrchr(argv[1],'.')) != (char*)NULL) && (strcmp(p, ".bib") == 0)) *p = '\0'; /* remove any .bib extension */ sprintf(infile, "%s.bib", argv[1]); sprintf(outfile, "%s.bix", argv[1]); ifp = fopen(infile, "r"); if (!ifp) die("Can't read", infile); ofp = fopen(outfile, "w"); if (!ofp) die("Can't write", outfile); InitTables(); if ((argc > 2) && (!strcmp(argv[2], "-i"))) for (i=3; i