% This file is part of the Stanford GraphBase (c) Stanford University 1992 \def\title{GB\_\thinspace IO} @i boilerplate.w %<< legal stuff: PLEASE READ IT BEFORE MAKING ANY CHANGES! @* Introduction. This is |gb_io|, the input/output module used by all GraphBase routines to access data~files. It doesn't actually do any output; but somehow `input/output' sounds like a more useful title than just `input'. All files of GraphBase data are designed to produce identical results on almost all existing computers and operating systems. Each line of each file contains at most 79 characters. Each character is either a blank or a digit or an uppercase letter or a lowercase letter or a standard punctuation mark. Blank characters at the end of each line are ``invisible''; they have no perceivable effect, hence identical results will be obtained on record-oriented systems that pad every line with blanks. The data is carefully sum-checked so that defective input files have little chance of being accepted. @ Changes might be needed when these routines are ported to different systems. Sections of the program that are most likely to require such changes are listed under `system dependencies' in the index. A validation program is provided so that installers can tell if |gb_io| is working properly. To make the test, simply run |test_io|. @(test_io.c@>= #include #ifdef SYSV #include #else #include #endif #include "gb_io.h" /* all users of |gb_io| should include this header file */ #define exit_test(m) /* we invoke this macro if something goes wrong */@+@=\@> {@+fprintf(stderr,"%s!\n(Error code = %ld)\n",m,io_errors);@+return -1;@+} @t\2@>@/ main() { @; @; @; printf("OK, the gb_io routines seem to work!\n"); } @ The external variable |io_errors| mentioned in the previous section will be set nonzero if any anomalies are detected. Errors won't occur in normal use of GraphBase programs, so no attempt has been made to provide a user-friendly way to decode the nonzero values that |io_errors| may assume. Information is simply gathered in binary form; system wizards who may need to do a bit of troubleshooting should be able to decode |io_errors| without great pain. @d cant_open_file 0x1 /* bit set in |io_errors| if |fopen| fails */ @d cant_close_file 0x2 /* bit set if |fclose| fails */ @d bad_first_line 0x4 /* bit set if the data file's first line isn't legit */ @d bad_second_line 0x8 /* bit set if the second line doesn't pass muster */ @d bad_third_line 0x10 /* bit set if the third line is awry */ @d bad_fourth_line 0x20 /* guess when this bit is set */ @d file_ended_prematurely 0x40 /* bit set if |fgets| fails */ @d missing_newline 0x80 /* bit set if line is too long or |'\n'| is missing */ @d wrong_number_of_lines 0x100 /* bit set if the line count is wrong */ @d wrong_checksum 0x200 /* bit set if the check sum is wrong */ @d no_file_open 0x400 /* bit set if user tries to close an unopened file */ @d bad_last_line 0x800 /* bit set if final line has incorrect form */ @ The \Cee\ code for |gb_io| doesn't have a main routine; it's just a bunch of subroutines to be incorporated into programs at a higher level, via the system loading routine. Here is the general outline of \.{gb\_io.c}: @p @
@; @@; @@; @@; @ @ Every external variable is declared twice in this \.{CWEB} file: once for |gb_io| itself (the ``real'' declaration for storage allocation purposes), and once in \.{gb\_io.h} (for cross-references by |gb_io| users). @= long io_errors; /* record of anomalies noted by |gb_io| routines */ @ @(gb_io.h@>= extern long io_errors; /* record of anomalies noted by |gb_io| routines */ @ We will stick to standard \Cee-type input conventions. We'll also have occasion to use some of the standard string operations. @= #include #ifdef SYSV #include #else #include #endif @* Inputting a line. The |gb_io| routines get their input from an array called |buffer|. This array is internal to |gb_io|---its contents are hidden from user programs. We make it 81 characters long, since the data is supposed to have at most 79 characters per line, followed by newline and null. @= static char buffer[81]; /* the current line of input */ static char *cur_pos=buffer; /* the current character of interest */ static FILE *cur_file; /* current file, or |NULL| is none is open */ @ Here's a basic subroutine to fill the |buffer|. The main feature of interest is the removal of trailing blanks. We assume that |cur_file| is open. Notice that a line of 79 characters (followed by |'\n'|) will just fit into the buffer, and will cause no errors. A line of 80 characters will also fit; but it will be split into two lines and the |missing_newline| message will occur, because of the way |fgets| is defined. A |missing_newline| error will also occur if the file ends in the middle of a line, or if a null character (|'\0'|) occurs within a line. @= static fill_buf() {@+register char *p; if (!fgets(buffer,81,cur_file)) { io_errors |= file_ended_prematurely; buffer[0]=more_data=0; } for (p=buffer; *p; p++) ; /* advance to first null character */ if (p--==buffer || *p!='\n') { io_errors |= missing_newline; p++; } while (--p>=buffer && *p==' ') ; /* move back over trailing blanks */ *++p='\n'; *++p=0; /* newline and null are always present at end of line */ cur_pos=buffer; /* get ready to read |buffer[0]| */ } @* Checksums. Each data file has a ``magic number,'' which is defined to be $$\biggl(\sum_l 2^l c_l\biggr) \bmod p\,;$$ here $p$ is a large prime number, and $c_l$ denotes the internal code corresponding to the $l$th-from-last data character read (including newlines but not nulls). The ``internal codes'' $c_l$ are computed in a system-independent way: Each character |c| in the actual encoding scheme being used has a corresponding |icode| which is the same on all systems. For example, the |icode| of |'0'| is zero, regardless of whether |'0'| is actually represented in ASCII or EBCDIC or some other scheme. (We assume that every modern computer system is capable of printing at least 95 different characters, including a blank space.) We will accept a data file as error-free if it has the correct number of lines and ends with the proper magic number. @= static char icode[256]; /* mapping of characters to internal codes */ static long checksum_prime=(1<<30)-83; /* large prime such that $2p+100$ won't overflow */ static long magic; /* current checksum value */ static int line_no; /* current line number in file */ static long final_magic; /* desired final magic number */ static long tot_lines; /* total number of data lines */ static char more_data; /* is there data still waiting to be read? */ @ The |icode| mapping is defined by a single string, |imap|, such that character |imap[k]| has |icode| value~|k|. There are 96 characters in |imap|, namely the 94 standard visible ASCII codes plus space and newline. If EBCDIC code is used instead of ASCII, the cents sign \rlap{\.{\kern.05em/}}\.c should take the place of single-left-quote \.{\char`\`}, and \.{\char5}~should take the place of\/~\.{\char`\~}. Characters that do not appear in |imap| all are given the same |icode| value, called |unexpected_char|. Such characters should be avoided in GraphBase files whenever possible. (If they do appear, they can still get into a user's data, but we don't distinguish them from each other for checksumming purposes.) The |icode| table actually plays a dual role, because we've rigged it so that codes 0--15 come from the characters |"0123456789ABCDEF"|. This facilitates conversion of decimal and hexadecimal data, and we can also use it for radices higher than 16. @d unexpected_char 100 /* default |icode| value */ @= static char *imap="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ\ abcdefghijklmnopqrstuvwxyz_^~&@@,;.:?!%#$+-*/|\\<=>()[]{}`'\" \n"; @ Users of |gb_io| can look at the |imap|, but they can't change it. @= char imap_chr(d) int d; { return d<0 || d>strlen(imap)? '\0': imap[d]; } @# int imap_ord(c) char c; { @; return icode[c]; } @ @(gb_io.h@>= #define unexpected_char @t\quad@> 100 extern char imap_chr(); /* the character that maps to |d| */ extern int imap_ord(); /* the ordinal number of a given character */ @ @= if (!icode['1']) icode_setup(); @ @= static icode_setup() {@+register int k; register char *p; for (k=0;k<256;k++) icode[k]=unexpected_char; for (p=imap,k=0; *p; p++,k++) icode[*p]=k; } @ Now we're ready to specify the first external subroutine for |gb_io| users. Calling |gb_newline()| will read the next line of data into |buffer| and update the magic number accordingly. @(gb_io.h@>= extern void gb_newline(); /* advance to next line of the data file */ extern long new_checksum(); /* compute change in magic number */ @ The magic checksum is not affected by lines that begin with \.*. @= gb_newline() { if (++line_no>tot_lines) more_data=0; if (more_data) { fill_buf(); if (buffer[0]!='*') magic=new_checksum(buffer,magic); } } @ Users can compute checksums like |gb_newline| does, but they can't change the (private) value of |magic|. @= long new_checksum(s,old_checksum) char *s; /* a string */ long old_checksum; {@+register long a=old_checksum; register char*p; for (p=s; *p; p++) a=(a+a+icode[*p]) % checksum_prime; return a; } @ Another simple routine allows a user to read (but not write) the variable |more_data|. @(gb_io.h@>= extern int gb_eof(); /* has the data all been read? */ @ @= int gb_eof() { return !more_data; } @* Parsing a line. The user can input characters from the buffer in several ways. First, there's a basic |gb_char()| routine, which returns a single character. The character is |'\n'| if the last character on the line has already been read (and it continues to be |'\n'| until the user calls |gb_newline|). The current position in the line, |cur_pos|, always advances when |gb_char| is called, unless |cur_pos| was already at the end of the line. There's also a |gb_backup()| routine, which moves |cur_pos| one place to the left unless it was already at the beginning. @(gb_io.h@>= extern char gb_char(); /* get next character of current line, or |'\n'| */ extern void gb_backup(); /* move back ready to scan a character again */ @ @= char gb_char() { if (*cur_pos) return (*cur_pos++); return '\n'; } @# gb_backup() { if (cur_pos>buffer) cur_pos--; } @ There are two ways to read numerical data. The first, |gb_digit(d)|, expects to read a single character in radix~|d|, using |icode| values to specify digits greater than~9. (Thus, for example, |'A'| represents the digit 10.) If the next character is a valid |d|-git, |cur_pos| moves to the next character and the numerical value is returned. Otherwise |cur_pos| stays in the same place and $-1$ is returned. The second routine, |gb_number(d)|, reads characters and forms an unsigned radix-|d| number until the first non-digit is encountered. The resulting number is returned; it is zero if no digits were found. No errors are possible with this routine, because it uses |unsigned long| arithmetic. @(gb_io.h@>= extern int gb_digit(); /* |gb_digit(d)| reads a digit between 0 and |d-1| */ extern unsigned long gb_number(); /* |gb_number(d)| reads a radix-|d| number */ @ The value of |d| should be at most 127, if users want their programs to be portable, because \Cee\ does not treat larger |char| values in a well-defined manner. In most applications, |d| is of course either 10 or 16. @= int gb_digit(d) char d; { if (icode[*cur_pos]= #define STR_BUF_LENGTH 160 extern char str_buf[]; /* safe place to receive output of |gb_string| */ extern char *gb_string(); /* |gb_string(p,c)| reads a string delimited by |c| into bytes starting at |p| */ @ @d STR_BUF_LENGTH 160 @= char str_buf[STR_BUF_LENGTH]; /* users can put strings here if they wish */ char *gb_string(p,c) char *p; /* where to put the result */ char c; /* character following the string */ { while (*cur_pos && *cur_pos!=c) *p++=*cur_pos++; *p++=0; return p; } @ Here's how we test those routines in |io_test|: The first line of test data consists of 79 characters, beginning with 64 zeroes and ending with `\.{123456789ABCDEF}'. The second line is completely blank. The third and final line says `\.{Oops:(intentional mistake)}'. @= if (gb_number(10)!=123456789) io_errors |= 1<<20; /* decimal number not working */ if (gb_digit(16)!=10) io_errors |= 1<<21; /* we missed the \.A following the decimal number */ gb_backup();@+ gb_backup(); /* get set to read `\.{9A}' again */ if (gb_number(16)!=0x9ABCDEF) io_errors |= 1<<22; /* hexadecimal number not working */ gb_newline(); /* now we should be scanning a blank line */ if (gb_char()!='\n') io_errors |= 1<<23; /* newline not inserted at end */ if (gb_char()!='\n') io_errors |= 1<<24; /* newline not implied after end */ if (gb_number(60)!=0) io_errors |= 1<<25; /* number should stop at null character */ {@+char temp[100]; if (gb_string(temp,'\n')!=temp+1) io_errors |= 1<<26; /* string should be null after end of line */ gb_newline(); if (gb_string(temp,':')!=temp+5 || strcmp(temp,"Oops")) io_errors |= 1<<27; /* string not read properly */ } if (io_errors) exit_test("Sorry, it failed. Look at the error code for clues"); if (gb_digit(10)!=-1) exit_test("Digit error not detected"); if (gb_char()!=':') io_errors |= 1<<28; /* lost synch after |gb_string| and |gb_digit| */ if (gb_eof()) io_errors |= 1<<29; /* premature end-of-file indication */ gb_newline(); if (!gb_eof()) io_errors |= 1<<30; /* postmature end-of-file indication */ @* Opening a file. The call |gb_weak_open("foo")| will open file |"foo"| and initialize the checksumming process. If the file cannot be opened, |io_errors| will be set to |cant_open_file|, otherwise |io_errors| will be initialized to zero. The call |gb_open("foo")| is a stronger version of |gb_weak_open|, which is used for standard GraphBase data files like |"words.dat"| to make doubly sure that they have not been corrupted. It returns the current value of |io_errors|, which will be nonzero if any problems were detected at the beginning of the file. @= if (gb_open("test.dat")!=0) exit_test("Can't open test.dat"); @ @(gb_io.h@>= extern void gb_weak_open(); /* open a file for GraphBase input */ extern int gb_open(); /* open a GraphBase data file; return 0 if OK */ @ @= void gb_weak_open(f) char *f; { @; @; if (cur_file) { io_errors=0; more_data=1; line_no=magic=0; tot_lines=0x7fffffff; /* allow ``infinitely many'' lines */ fill_buf(); } else io_errors=cant_open_file; } @ Here's a possibly system-dependent part of the code: We try first to open the data file by using the file name itself as the path name; failing that, we try to prefix the file name with the name of the standard directory for GraphBase data, found in the header file \.{localdefs.h}. @^system dependencies@> @= #include "localdefs.h" @ @= cur_file=fopen(f,"r"); @^system dependencies@> if (!cur_file && (strlen(DATA_DIRECTORY)+strlen(f)= int gb_open(f) char *f; { strncpy(file_name,f,19); /* save the name for use by |gb_close| */ gb_weak_open(f); if (cur_file) { @; @; @; @; gb_newline(); /* the first line of real data is now in the buffer */ } return io_errors; } @ @= static char file_name[20]; /* name of the data file, without a prefix */ @ The first four lines of a typical data file should look something like this: $$\halign{\hskip5em\.{#}\hfill\cr * File "words.dat" from the Stanford GraphBase (C) 1992 Stanford University\cr * A database of English 5-letter words\cr * This file may be freely copied but please do not change it in any way!\cr * (Checksum parameters 5678,78934448)\cr}$$ We actually verify only that the first four lines of a data file named |"foo"| begin with the characters $$\halign{\hskip5em\.{#}\hfill\cr * File "foo"\cr *\cr *\cr * (Checksum parameters $l,m$)\cr}$$ respectively, where $l$ and $m$ are decimal numbers. The values of $l$ and~$m$ are stored away as |tot_lines| and |final_magic|, to be matched at the end of the file. @= sprintf(str_buf,"* File \"%s\"",f); if (strncmp(buffer,str_buf,strlen(str_buf))) return (io_errors |= bad_first_line); @ @= fill_buf(); if (*buffer!='*') return (io_errors |= bad_second_line); @ @= fill_buf(); if (*buffer!='*') return (io_errors |= bad_third_line); @ @= fill_buf(); if (strncmp(buffer,"* (Checksum parameters ",23)) return (io_errors |= bad_fourth_line); cur_pos +=23; tot_lines=gb_number(10); if (gb_char()!=',') return (io_errors |= bad_fourth_line); final_magic=gb_number(10); if (gb_char()!=')') return (io_errors |= bad_fourth_line); @* Closing a file. At the end, we check that the file was open and that it had the correct number of lines, the correct magic number, and a correct final line. The subroutine |gb_close|, like |gb_open|, returns the value of |io_errors|, which will be nonzero if at least one problem was noticed. @= if (gb_close()!=0) exit_test("Bad checksum, or difficulty closing the file"); @ @= int gb_close() { if (!cur_file) return (io_errors |= no_file_open); fill_buf(); sprintf(str_buf,"* End of file \"%s\"",file_name); if (strncmp(buffer,str_buf,strlen(str_buf))) io_errors |= bad_last_line; more_data=buffer[0]=0; /* now the |gb_io| routines are effectively shut down */ /* we have |cur_pos=buffer| */ if (fclose(cur_file)!=0) return (io_errors |= cant_close_file); cur_file=NULL; if (line_no!=tot_lines+1) return (io_errors |= wrong_number_of_lines); if (magic!=final_magic) return (io_errors |= wrong_checksum); return io_errors; } @ There is also a less paranoid routine, |gb_weak_close|, that closes user-generated files. It simply closes the current file, if any, and returns the value of the |magic| checksum. Example: The |restore_graph| subroutine in module |gb_save| uses |gb_weak_open| and |gb_weak_close| to provide system-independent input that is almost as foolproof as the reading of standard GraphBase data. @ @(gb_io.h@>= extern int gb_close(); /* close a GraphBase data file; return 0 if OK */ extern long gb_weak_close(); /* close file and return the checksum */ @ @= long gb_weak_close() { if (cur_file) { fclose(cur_file); more_data=buffer[0]=0; cur_pos=buffer; cur_file=NULL; } return magic; } @* Index. Here is a list that shows where the identifiers of this program are defined and used.