@q Copyright 2012-2024, Alexander Shibakov@> @q This file is part of SPLinT@> @q SPLinT is free software: you can redistribute it and/or modify@> @q it under the terms of the GNU General Public License as published by@> @q the Free Software Foundation, either version 3 of the License, or@> @q (at your option) any later version.@> @q SPLinT is distributed in the hope that it will be useful,@> @q but WITHOUT ANY WARRANTY; without even the implied warranty of@> @q MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the@> @q GNU General Public License for more details.@> @q You should have received a copy of the GNU General Public License@> @q along with SPLinT. If not, see .@> @** Forcing \eatone{bison}\bison\ and \eatone{flex}\flex\ to output \TeX. Instead of implementing a \bison\ (or \flex) `plugin' for outputting \TeX\ parser, the code that follows produces a separate executable that outputs all the required tables after the inclusion of an ordinary \Cee\ parser produced by \bison\ (or a scanner produced by \flex). The actions in both \bison\ parser and \flex\ scanner are assumed to be merely |printf()| statements that output the `real' \TeX\ actions. The code below simply cycles through all such actions to output an `action switch' appropriate for use with \TeX. In every other respect, the included parser or scanner can use any features allowed in `real' parsers and scanners. \def\action#1{\hbox{$\hbox{\\{action}}_{#1}$}} \def\actionn{\action{n}} @s action1 TeX @s actionn TeX @*1 Common routines. The `top' level of the scanner and parser `drivers' is very similar, and is therefore separated into a few sections that are common to both drivers. The layout is fairly typical and follows a standard `initialize-input-process-output-clean up' scheme. The logic behind each section of the program will be explained in detail below. The section below is called |@<\Cee\ postamble@>| because the output of the tables can happen only after the \bison\ (or \flex) generated \.{.c} file is included and all the data structures are known. The actual `assembly' of each driver has to be done separately due to some `singularities' of the \CWEB\ system and the design of this software. All the essential routines are presented in the sections below, though. @<\Cee\ postamble@>= @@; int main( int argc, char **argv ) { @@; @@; @@; @@; switch( mode ) { @@; default: break; } fprintf( stderr, "Outputting tables and actions\n" ); if ( tables_out ) { fprintf( stderr, " tables ... " ); @@; fprintf( stderr, "actions ... " ); @@; } else { fprintf( stderr, "No output, exiting\n" ); exit(0); } fprintf( stderr, "done, cleaning up\n" ); @@; return 0; } @ Not all the code can be supplied at this stage (most of the routines here are at the `top' level so the specifics have to be `filled-in' by each driver), so many of the sections above are placeholders for the code provided by a specific driver. However, we still need to supply a trivial definition here to placate \CWEAVE\ whenever this portion of the code is used isolated in documentation. @= @ Standard library declarations for memory management routines, some syntactic sugar, command line processing, and variadic functions are all that is needed. @= #include #include #include #include #include @ This code snippet is a payment for some poor (in my view) philosophy on the part of the \bison\ and \flex\ developers. There used to be an option in \bison\ to output just the tables and the action code but it had never worked correctly and it was simply dropped in the latest version. Instead, one can only get access to \bison's goodies as part of a tangled mess of |@[#define@]|'s and error processing code. Had the tables and the parser function itself been considered separate, well isolated sections of \bison's output, there would simply be no reason for dirty tricks like the one below, one would be able to write custom error processing functions, unicorns would roam the Earth and pixies would hand open sourced tablets to everyone. At a minimum, it would have been a much cleaner, modular approach. As of version~\.{3.0} of \bison\ some critical arrays, namely, |yyprhs| and |yyrhs| are no longer generated (even internally) which significantly reduces \bison's useability as a parser generator. As an example, the |yyrthree| array, which is necessary for processing `inline' actions is computed in \.{bs.w} using the two arrays mentioned in the previous sentence. There does not seem to be any other way to access this information. A number of tools (GNU and otherwise) have taken the path of narrowing the field of application to a few use cases envisioned by the maintainers. This includes compilers, as well. There is a strange reluctance on the part of the \gcc\ team to output any intermediate code other than the results of preprocessing and assembly. I have seen an argument that involves some sort of appeal to making the code difficult to close source but the logic of it escaped me completely (well, there {\it is\/} logic to it, however choosing poor design in order to punish a few bad players seems like a rather inferior option). Ideally, there should be no such thing as a parser generator, or a compiler, for that matter: all of these are just basic table driven rewriting routines. Tables are hard but table driven code should not be. If one had access to the tables themselves, and some canonical examples of code driven by such tables, like |yyparse()| and |yylex()|, the flexibility of these tools would improve tremendously. Barring that, this is what we have to do {\it now}. There are several ways to gain write access to the data declared |const| in \Cee, like passing its address to a function with no prototype. All these methods have one drawback: the loopholes that make them possible have been steadily getting on the chopping block of the \Cee\ standards committee. Indeed, |const| data should be constant. Even if one succeeds in getting access, there is no reason to believe that the data is not allocated in a write-only region of the memory. The cleanest way to get write access then is to eliminate |const| altogether. The code should have the same semantics after that, and the trick is only marginally bad. The last two definitions are less innocent (and, at least the second one, are prohibited by the ISO standard (clause 6.10.8(2), see~\cite[ISO/C11])) but \gcc\ does not seem to mind, and it gets rid of warnings about dropping a |const| qualifier whenever an |assert| is encountered. Since the macro is not recursively expanded, this will only work if $\ldots$|FUNCTION__| is treated as a pseudo-variable, as it is in \gcc, not a macro. @d const @d __PRETTY_FUNCTION__ @[(char *)__PRETTY_FUNCTION__@] @d __FUNCTION__ @[(char *)__FUNCTION__@] @ The output file has to be known to both parts of the code, so it is declared at the very beginning of the program. We also add some syntactic sugar for loops. @q The line below is simply irresistible, one should put it on a T-shirt@> @s FOREVER TeX @d FOREVER for(;;) @= #include FILE *tables_out; @ The clean-up portion of the code can be left empty, as all it does is close the output file, which can be left to the operating system but we take care of it ourselves to keep out code `clean'\footnote{In case the reader has not noticed yet, this is a weak attempt at humor to break the monotony of going through the lines of \CTANGLE'd code}. @= fclose( tables_out ); @ There is a descriptor controlling the output of the program as a whole. The code below is an example of a literate programming technique that will be used repeatedly to maintain large structures that can grow during the course of the program design. Note that the name of each table is only mentioned once, the rest of the code is generic. Technically speaking, all of this can be done with \Cee\ preprocessor macros of moderate complexity, taking advantage of its expansion rules but it is not nearly as transparent as the \CWEB\ approach. @= struct output_d { @@; }; struct output_d output_desc = { @ }; @ To declare each table field in the global output descriptor, all one has to do is to provide a general pattern. @= #define _register_table_d(name) @[bool output_##name:1;@] @@; #undef _register_table_d @ Same for assigning default values to each field. @= #define _register_table_d(name) @[.output_##name = 0,@] /* do not output any tables by default */ @
@; #undef _register_table_d @ Each descriptor is populated using the same approach. @= #define _register_table_d(name) @[struct table_d name##_desc = {0};@] @
@; #undef _register_table_d @ The flag \.{--optimize-tables} affects the way tables are output. @= static int optimize_tables = 0; @ It is set using the command line option below. @= register_option_("optimize-tables", no_argument, &optimize_tables, 1, "")@; @ The reason to implement the table output routine as a macro is to avoid writing separate functions for tables of different types of data (stings as well as integers). The output is controlled by each table's {\it descriptor\/} defined below. A more sophisticated approach is possible but this code is merely a `patch' so we are not after full generality\footnote{A somewhat cleaner way to achieve the same effect is to use the \.{\_Generic} facility of \Cee11.}. @d output_table(table_desc, table_name, stream) if ( output_desc.output_##table_name ) { int i, j = 0; if ( optimize_tables ) { fprintf( stream, "\\setoptopt{%s}%%\n", table_desc.name ); if ( !table_desc.optimized_numeric ) { fprintf( stream, "\\beginoptimizednonnumeric{%s}%%\n", table_desc.name ); } for( i = 0; i < sizeof(table_name)/sizeof(table_name[0]) - 1; i++) { if ( table_desc.formatter ) { table_desc.formatter( stream, i ); } else { fprintf( stream, table_desc.optimized_numeric, table_desc.name, i, table_name[i] ); } } if ( table_desc.formatter ) { table_desc.formatter( stream, -i ); } else { fprintf( stream, table_desc.optimized_numeric, table_desc.name, i, table_name[i] ); } if ( table_desc.cleanup ) { table_desc.cleanup( &table_desc ); } } else { fprintf( stream, table_desc.preamble, table_desc.name ); for( i = 0; i < sizeof(table_name)/sizeof(table_name[0]) - 1; i++) { if ( table_desc.formatter ) { j += table_desc.formatter( stream, i ); } else { if ( table_name[i] ) { j += fprintf( stream, table_desc.separator, table_name[i] ); } else { j += fprintf( stream, "%s", table_desc.null ); } } if ( j > MAX_PRETTY_LINE && table_desc.prettify ) { fprintf( stream, "\n" ); j = 0; } } if ( table_desc.formatter ) { table_desc.formatter( stream, -i ); } else { if ( table_name[i] ) { fprintf( stream, table_desc.postamble, table_name[i] ); } else { fprintf( stream, "%s", table_desc.null_postamble ); } } if ( table_desc.cleanup ) { table_desc.cleanup( &table_desc ); } } } @= struct table_d { @@; }; @ @= char *name; char *preamble; char *separator; char *postamble; char *null_postamble; char *null; char *optimized_numeric; bool prettify; int (*formatter)( FILE *, int ); void (*cleanup)( struct table_d * ); @ Tables are output first. The action output code must come last since it changes the values of the tables to achieve its goals. Again, a different approach is possible, that saves the data first but simplicity was deemed more important than total generality at this point. @= @@; @ One more application of `gather the names first then process' technique. @= #define _register_table_d(name) @[output_table(name##_desc, name, tables_out);@] @
@; #undef _register_table_d @ Tables will be output by each driver. Placeholder here, for \CWEAVE's piece of mind. @
= @ Action output invokes a totally new level of dirty code. If tables, constants, and tokens are just data structures, actions are executable commands. We can only hope to cycle through all the actions, which is enough to successfully use \bison\ and \flex\ to generate \TeX. The |switch| statement containing the actions is embedded in the parser function so to get access to each action one has to coerce |yyparse()| to jump to each case. Here is where we need the table manipulation. The appropriate code is highly specific to the program used (since \bison\ and \flex\ parsing and scanning functions had to be `reverse engineered' to make them do what we want), so at this point we simply declare the options controlling the level of detail and the type of actions output. @= static int bare_actions = 0; /* (|static| for local variables) and |int| to pacify the compiler (for a constant initializer and compatible type) */ static int optimize_actions = 0; @ The first of the following options allows one to output an action switch without the actions themselves. It is useful when one needs to output a \TeX\ parser for a grammar file that is written in \Cee. In this case it will be impossible to cycle through actions (as no setup code has been executed), so the parser invocation is omitted. The second option splits the action switch into several macros to speed up the processing of the action code. The last argument of the `flexible' macro below is supposed to be an extended description of each option which can be later utilized by a |usage()| function. @= register_option_("bare-actions", no_argument, &bare_actions, 1, "")@; register_option_("optimize-actions", no_argument, &optimize_actions, 1, "")@; @ The rest of the action output code mimics that for table output, starting with the descriptor. To make the output format more flexible, this descriptor should probably be turned into a specialized routine. @= struct action_d { char *preamble; char *act_setup; char *act_suffix; char *action1; char *actionn; char *postamble; void (*print_rule)( int ); void (*cleanup)( struct action_d * ); }; @ @= bool output_actions:1; @ Nothing is output by default, including actions. @= @[@].output_actions = 0,@[@]@; @ @= struct action_d action_desc = {0}; @ Each function below outputs the \TeX\ code of the appropriate action when the action is `run' by the action output switch. The main concern in designing these functions is to make the code easier to look at. Further explanation is given in the grammar file. If the parser is doing its job, this is the only place where one would actually see these as functions (or, rather, macros). In case one wishes to use the `native' \bison\ way of referencing terms (i.e.\ something along the lines of~\.{\\the\char`\$[term]}) these macros should be used with a trailing underscore (say, \.{TeXa\_}) to let the postprocessor know that the code inside should be postprocessed. The postprocessor will then create three files: one, destined for \CWEAVE, will use the same macro withough the underscore (i.e.\ \.{TeXa} to continue our example, and have the native \bison\ terms replaced wih the appropriate \TeX\ commands. Another file is intended for \CTANGLE, where the whole macro will be replaced first with a special identifier, which in turn, after \CTANGLE\ finishes, will be replaced by an appropriately constructed call to \.{TeX\_\_}. The third file will contain the substitutions. In compliance with paragraph 6.10.8(2)\footnote{[$\ldots$] {\it Any other predefined macro names shall begin with a leading underscore followed by an uppercase letter or a second underscore.}} of the \ISO\ \Cee11 standard the names of these macros do not start with an underscore, since the first letter of \.{TeX} is uppercase\footnote{One might wonder why one of these functions is defined as a \CWEB\ macro while the other is put into the preamble `by hand'. It really makes no difference, however, the reason the second macro is defined explicitly is \CWEB's lack of awareness of `variadic' macros which produces undesirable typesetting artefacts.}. \def\TeXx{\hbox{\.{TeX\_}}} \def\TeXa{\hbox{\.{TeXa}}} \def\TeXb{\hbox{\.{TeXb}}} \def\TeXf{\hbox{\.{TeXf}}} \def\TeXfo{\hbox{\.{TeXfo}}} \def\TeXao{\hbox{\.{TeXao}}} \def\TeXxx{\hbox{\.{TeX\_\_}}} @s TeX__ TeX @d TeX_( string ) fprintf( tables_out, " %s%%\n", string ) @d TeXb( string ) TeX_( string ) @d TeXa( string ) TeX_( string ) @d TeXf( string ) TeX_( string ) @d TeXfo( string ) TeX_( string ) @d TeXao( string ) TeX_( string ) @d YY_FATAL_ERROR( message ) fprintf( tables_out, " /yylexcomplain{%s}/yylexerrterminate%%\n", message ) @q \CWEB\ is not aware of variadic macros, so this has to be done the old way@> @<\Cee\ preamble@>= #define TeX__( string, ... ) @[fprintf( tables_out, " " string "%%\n", __VA_ARGS__ )@] @ If a full parser is not needed, the lexing mechanism is not required. To satisfy the compiler and the linker, the lexer and other functions still have to be declared and defined, since these functions are referred to in the body of the parser. The details of these declarations can be found in the driver code. @<\Cee\ preamble@>= @; @@; @@; @ We begin with a few macros to facilitate the output of tables in the format that \TeX\ can understand. As there is no perfect way to represent an array in \TeX\ a rather weak compromise was settled upon. Further explanation of this choice is given in the \TeX\ file that implements the \TeX\ parser for the \bison\ input grammar. @d tex_table_generic(table_name) table_name##_desc.preamble = "\\newtable{%s}{%%\n"; table_name##_desc.separator = "%d\\or "; table_name##_desc.postamble = "%d}%%\n"; table_name##_desc.null_postamble = "0}%%\n"; table_name##_desc.null = "0\\or "; table_name##_desc.optimized_numeric = "\\expandafter\\def\\csname %s\\parsernamespace %d\\endcsname{%d}%%\n"; table_name##_desc.prettify = true; table_name##_desc.formatter = NULL; table_name##_desc.cleanup = NULL; output_desc.output_##table_name = 1; @d tex_table(table_name) tex_table_generic(table_name); table_name##_desc.name = #table_name; @ An approach paralleling the table output scheme is taken with constants. Since constants are \Cee\ {\it macros\/} one has to be careful to avoid the temptation of using constant {\it names\/} directly as names for fields in structures. They will simply be replaced by the constants' values. When the names are concatenated with other tokens, however, the \Cee\ preprocessor postpones the macro expansion until the concatenation is complete (see clauses 6.10.3.1, 6.10.3.2, and 6.10.3.3 of the \ISO\ \Cee\ Standard, \cite[ISO/C11]). Unless the result of the concatenation is still expandable, the expansion will halt\footnote{Another trick to halt expansion is to use {\it function macros} which will expand only when they are followed by parentheses. Since we do not have control over how constants are defined by \bison, we cannot take advantage of this feature of the \Cee\ preprocessor.}. @= struct const_d { char *format; char *name; int value; }; @ @= #define _register_const_d(c_name,...) @[struct const_d c_name##_desc;@] @@; #undef _register_const_d @ @= #define _register_const_d(c_name,...) @[bool output_##c_name:1;@] @@; #undef _register_const_d @ @= #define _register_const_d(c_name,...) @[@[@].output_##c_name = 0,@[@]@] @@; #undef _register_const_d @ @= fprintf( tables_out, "%%\n%% constant definitions\n%%\n" ); @@; @ @= { int any_constants = 0; #define _register_const_d(c_name,...) \ \ if ( output_desc.output_##c_name ) { \ const_out( tables_out, c_name##_desc)@; \ any_constants = 1; \ } @@; #undef _register_const_d if ( any_constants ); /* this is merely a placeholder statement */ } @ Constants are very driver specific, so to make \CWEAVE\ happy $\ldots$ @= @ A macro to help with constant output. @d const_out(stream, c_desc) fprintf(stream, c_desc.format, c_desc.name, c_desc.value); @ Action switch output routines modify the automata tables and therefore have to be output last. Since action output is highly automaton specific, we leave this section blank here, to pacify \CWEAVE\ in case this file is typeset by itself. @= @*2 Error codes. @= enum err_codes{ @@, LAST_ERROR }; @ @= NO_MEMORY, BAD_STRING, BAD_MIX_FORMAT,@[@] @ A lot more care is necessary to output the token table. A number of precautions are taken to ensure that a maximum possible range of names can be passed safely to \TeX. This involves some manipulation of \.{\\catcode}'s and control characters. The complicated part is left to \TeX\ so the output code can be kept simple. The helper function below is used to `combine' two strings. @d MAX_PRETTY_LINE 100 @= char *mix_string( char *format, ... ); @ @= char *mix_string( char *format, ... ) { char *buffer; size_t size = 0; int length = 0; int written = 0; char *formatp = format; va_list ap, ap_save; va_start( ap, format ); va_copy( ap_save, ap ); size = strnlen( format, MAX_PRETTY_LINE * 5 ); if ( size >= MAX_PRETTY_LINE * 5 ) { fprintf( stderr, "%s: runaway string?\n", __func__ ); exit(BAD_STRING); } while ( (formatp = strstr( formatp, "%" )) ) { switch( formatp[1] ) { case 's':@; length = strnlen( va_arg( ap, char * ), MAX_PRETTY_LINE * 5 ); if ( length >= MAX_PRETTY_LINE * 5 ) { fprintf( stderr, "%s: runaway string?\n", __func__ ); exit(BAD_STRING); } size += length; size -=2; formatp++; break; case '%':@; size--; formatp +=2; default: printf( "%s: cannot handle %%%c in mix string format\n", __func__, formatp[1] ); exit( BAD_MIX_FORMAT ); } } buffer = (char *)malloc( sizeof(char) * size + 1 ); if ( buffer ) { written = vsnprintf( buffer, size + 1, format, ap_save ); if ( written < 0 || written > size ) { fprintf( stderr, "%s: runaway string?\n", __func__ ); exit(BAD_STRING); } } else { fprintf( stderr, "%s: failed to allocate memory for the output string\n", __func__ ); exit(NO_MEMORY); } va_end( ap ); va_end( ap_save ); return buffer; } @*2Initial setup. Depending on the output mode (right now only \TeX\ and `tokens only' (in the \bison\ `driver') are supported) the format of each table, action field and token has to be set up. @= enum output_mode {@@, LAST_OUT}; @ And to calm down \CWEAVE\ $\ldots$ @= @ \TeX\ is the main output mode. @= enum output_mode mode = TEX_OUT; @*2Command line processing. This program uses a standard way of parsing the command line, based on |getopt_long|. At the heart of the setup are the array below with a couple of supporting variables. @= #include #include #include @ @= const char *usage = "%s [options] output_file\n"; @ @= int c, option_index = 0;@# enum higher_options{NON_OPTION = 0xFF, @@, LAST_HIGHER_OPTION}; static struct option long_options[] = { @/@[ @@;@/ {0, 0, 0, 0} @] };@# @ The main loop of the command line option processing follows. This can be used as a template for setting up the option processing. The specific cases are added to in the course of adding new features. @= opterr = 0; /* we do our own error reporting */ FOREVER { c = getopt_long (argc, argv, ( char [] )@[{':'@t, @>@}@], long_options, &option_index); if (c == -1) break; switch (c) { case 0:@; /* it is a flag, the name is kept in |long_options[option_index].name|, and the value can be found in |long_options[option_index].val| */ break; @t}\4{@>@; @t}\4{@>@; case '?':@; fprintf (stderr, "Unknown option: `%s', see `Usage' below\n\n", argv[optind - 1]); fprintf(stderr, usage, argv[0]); exit(1); break; case ':':@; fprintf (stderr, "Missing argument for `%s'\n\n", argv[optind - 1]); fprintf(stderr, usage, argv[0]); exit(1); break; default:@; printf ("warning: feature `%c' is not yet implemented\n", c); } } if (optind >= argc) { fprintf( stderr, "No output file specified!\n" ); } else { tables_out = fopen( argv[optind++], "w" ); } if (optind < argc) { printf ("script files to be loaded: "); while (optind < argc) printf ("%s ", argv[optind++]); putchar ('\n'); } @ @= #define register_option_(name, arg_flag, loc, val, exp) @[{name, arg_flag, loc, val},@[@]@] @@; @@; @@; #undef register_option_ @ In addition to spelling out the full command line option name (such as \.{--help}) |getopt_long| gives the user a choice of using a shortcut (say, \.{-h}). As individual options are treated in drivers themselves, there are no shortcuts to supply at this point. We leave this section (and a number of others) empty to be filled in with the driver specific code to pacify \CWEAVE. @= #define dd_optional_argument @[@[@], ':', ':'@] #define dd_required_argument @[@[@], ':'@] #define dd_no_argument #define register_option_(name, arg_flag, loc, val, ...) @[@[@], val dd_##arg_flag@] @@; #undef register_option_ #undef dd_optional_argument #undef dd_required_argument #undef dd_no_argument @ Some options have one-letter `shortcuts', whereas others only exist in `fully spelled-out' form. To easily keep track of the latter, a special enumerated list is declared. To add to this list, simply add to the \CWEB\ section below. @= #define register_option_(name, arg_flag, loc, val, ...) @[val,@[@]@] @@; #undef register_option_ @ @= @ @= @ @= @ @=