% Norman Ramsey (nr@notecnirp) --- CS 320
\def\sizedboxit#1#2{\vtop{\vbox{\hrule\hbox{\vrule\kern #2%
\vtop{\vbox{\kern #2\hbox{#1}}\kern #2}\kern #2\vrule}}\hrule}}
\def\boxit#1{\sizedboxit{#1}{1pt}}
\def\token#1{\boxit{\strut\tt #1}}
\setcounter{secnumdepth}{0}
% l2h ignore boxit
% l2h argblock token tt
@ In this assignment I print intermediate code.
<<parser functions>>=
void print6 (Exp e) {
  extern int yylineno;
  printf("line %d:\n", yylineno);
  if (e.type == error_type)
    printf("# ERROR\n");
  else
    printTree(e.tree, stdout);
}
<<parser prototypes>>=
void print6(Exp e);
@ \section{The {\tt yacc} value stack}

Here are all the objects I use as synthesized attributes:
<<yacc declarations>>=
%union {
    char *string;
    Type type;
    Product product;
    Symbol symbol;
    Exp exp;
    Explist explist;
    int bit;
}
@ \section{3. Vocabulary and representation}

\subsection{Representation of terminal symbols}

{\tt yacc} and {\tt lex} use integers to represent terminal symbols
(tokens).
Single-character tokens are represented by their ASCII codes.
Longer tokens are declared using {\tt yacc}'s [[%token]] declaration;
{\tt yacc} chooses an integer representation of each such token.
These representations are made available to the lexer via [[y.tab.h]].
I use the standard trick from Kernighan and Pike ([[x.tab.h]] instead
of [[y.tab.h]]) to avoid remaking the lexer unnecessarily.
<<lex includes>>=
#include "x.tab.h"
@ \subsection{1. Identifiers}

Since {\tt lex} is notoriously slow at using patterns to recognize
reserved words, I look up every identifier in a table of reserved
words.
[[idcategory(id)]] returns the category of the reserved word [[id]]
when [[id]] is in fact a reserved word.
When [[id]] is not a reserved word, [[idcategory(id)]] returns
[[IDENT]].
<<lex definitions>>=
letter [A-Za-z]
digit [0-9]
@ The offensive [[<INITIAL>]] below has to do with handling comments
(q.v.).
Both identifiers and reserved words are saved in the string table and
put on the {\tt yacc} value stack.
<<lex rules>>=
<INITIAL>{letter}({letter}|{digit})* { yylval.string = strsave(yytext);
                                       return idcategory(yytext); }
<<lex prototypes>>=
extern char *strsave(char *s);
<<yacc declarations>>=
%token IDENT
%type <string> IDENT
<<grammar>>=
ident : IDENT
      ;
<<yacc declarations>>=
%type <string> ident
<<parser prototypes>>=
extern char yytext[];
@ \subsection{2. Numbers}

I have to use two different [[ScaleFactor]] definitions so I can tell
the difference between long and short reals.
Notice that the scale factor is {\em not} optional for the long real.
<<lex rules>>=
<INITIAL>{digit}+|{digit}({hexDigit}*)[Hh] { yylval.string = strsave(yytext);
                                             return INTEGER; }
<INITIAL>{digit}+"."{digit}*{EScaleFactor}? { yylval.string = strsave(yytext);
                                              return REAL; }
<INITIAL>{digit}+"."{digit}*{DScaleFactor} { yylval.string = strsave(yytext);
                                             return LONGREAL; }
<<yacc declarations>>=
%token <string> INTEGER REAL LONGREAL
@ I permit lower case where Wirth insists on upper case.
This will be convenient later on.
Besides, Hanson does it.

I need parentheses around things like [[EScaleFactor]] because I'm not
really defining a regular expression---I'm using a macro facility.
<<lex definitions>>=
hexDigit [0-9A-Fa-f]
EScaleFactor ([eE][+\-]?{digit}+)
DScaleFactor ([dD][+\-]?{digit}+)
@ \subsection{3,4. Strings and character constants}

Single-character strings [["x"]] become character constants, not
strings, thanks to the {\tt lex} disambiguation rules.
<<lex rules>>=
<INITIAL>"\'"{nonquote}"\'"        { yylval.string=strsave(yytext); return CHARCONST; }
<INITIAL>{digit}{hexDigit}*[Xx]    { yylval.string=strsave(yytext); return CHARCONST; }
<INITIAL>"\""{nondoublequote}*"\"" { yylval.string=strsave(yytext); return STRING; }
<<yacc declarations>>=
%token <string> CHARCONST STRING
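@ To check the rules above in isolation, it is handy to dump the token
stream.
The driver below is a sketch only: no root chunk uses it, and it
assumes linking against the generated lexer, [[strsave]], and the
{\tt lex} library (which supplies [[yywrap]]).
It borrows the shared list of {\tt\#include} files defined under
``Putting it all together,'' since the value-stack types must be
visible before [[x.tab.h]] can be included.
<<token dump (sketch, unused)>>=
<<includes>>
#include "x.tab.h"
YYSTYPE yylval;         /* normally defined in y.tab.c; defined here to link */
extern int yylex(void);
extern char yytext[];

int main(void) {
    int tok;
    while ((tok = yylex()) != 0)        /* yylex returns 0 at end of file */
        printf("%3d\t%s\n", tok, yytext);
    return 0;
}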
@ The character set for string literals is awkward, because I want to
include backslash escapes.
I use the ANSI standard backslash escapes from section A2.5.2 of the
second edition of Kernighan and Ritchie.
Because the lexical analyzer is probably not the right place to handle
illegal backslash escapes, I allow any reasonable character to follow
the backslash.
I define [[nonoctalx]] to be those characters that can't start an
octal or hexadecimal constant (when following a backslash).
Then I can recognize octal and hexadecimal character constants like
[["\012"]].
I {\em don't} insist that at least one [[hexDigit]] follow [[\x]],
because again that should be handled downstream of the lexical
analyzer.
<<lex definitions>>=
plainnonquote [ \t\]\"!#$%&()*+,\-./0-9:;<=>?@A-Z[^_`a-z{|}~]
plainnondoublequote [ \t\]\'!#$%&()*+,\-./0-9:;<=>?@A-Z[^_`a-z{|}~]
escapedchar (\\({nonoctalx}|{octal}{octal}?{octal}?|x{hexDigit}*))
nonoctalx [ \]\'\"!#$%&()*+,\-./89:;<=>?@A-Z[\\^_`a-wyz{|}~]
nonquote ({plainnonquote}|{escapedchar})
nondoublequote ({plainnondoublequote}|{escapedchar})
octal [0-7]
@ I also need to handle strings that don't terminate.
When I see one, I gobble up the whole line on which it sits---that
should make it easier for the parser to recover.
(The alternative is trying to tokenize the characters following the
open quote.)
<<lex rules>>=
<INITIAL>("\""{nondoublequote}*|"\'"{nonquote}*) <<complain about unterminated string>>
@ I print the first few characters of a nonterminated string, followed
by an ellipsis.
I return the string anyway because that way there's a chance that the
parser can just ignore the error.
<<complain about unterminated string>>=
{ yyerror("Unterminated string %.8s%s", yytext, yyleng > 8 ? "..." : "");
  yylval.string = strsave("");
  return STRING;
}
@ I include prototypes for the string functions, to keep {\tt lcc -A}
from complaining about missing prototypes.
<<lex includes>>=
#include <string.h>
@ \subsection{5. Operators and delimiters}

Here are [[%token]] declarations for all the multicharacter tokens,
including the reserved words.
I use [[yyBEGIN]] because [[BEGIN]] means something special to
{\tt lex}.
<<yacc declarations>>=
%token ARROW INC DEC LSHIFT RSHIFT LEQ GEQ EQ NEQ AND OR
       /* ->  ++  --  @<<  @>>  <=  >=  ==  !=  &&  || */
@ I make sure the lexer always returns strings for identifiers and
reserved words.
@ Recognizing the operators and delimiters is straightforward:
<<lex rules>>=
<INITIAL>"->"  return ARROW;
<INITIAL>"++"  return INC;
<INITIAL>"--"  return DEC;
<INITIAL>"@<<" return LSHIFT;
<INITIAL>"@>>" return RSHIFT;
<INITIAL>"<="  return LEQ;
<INITIAL>">="  return GEQ;
<INITIAL>"=="  return EQ;
<INITIAL>"!="  return NEQ;
<INITIAL>"&&"  return AND;
<INITIAL>"||"  return OR;
<INITIAL>[\]+!\-*/~&.,;|()[{}^=#<>:] return *yytext;
@ \paragraph{Reserved word search}

Recall that, instead of having the {\tt lex}-generated automaton
recognize reserved words, I wanted to look up each identifier to see
if it is a reserved word.
I put the reserved words into an array and use binary search to find
their categories.
A word that isn't in the list has category [[IDENT]].

The list itself is excruciating to read.
I use a trick I got from Dave Hanson---I put the list in a header file
as calls to the [[kw]] (keyword) macro.
Then I include the header with appropriate macros wherever I need it.
<<reserved words>>=
kw(INT, "int")
@ A binary search table of reserved words:
<<reserved word table>>=
static struct reserved {char *s; int category;} reservedwords[] = {
#define kw(VAL,STRING) {STRING,VAL},
<<reserved words>>
#undef kw
};
static int numreservedwords = (sizeof(reservedwords)/sizeof(struct reserved));
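@ To make the macro trick concrete: if the header held a second,
hypothetical entry [[kw(WHILE, "while")]], the preprocessor would turn
the table into the sketch below.
One caution the macro hides: binary search is correct only if the
[[kw]] entries appear in increasing [[strcmp]] order, so the header
file must be kept sorted.
<<expanded reserved word table (sketch, unused)>>=
static struct reserved {char *s; int category;} reservedwords[] = {
    {"int",   INT},
    {"while", WHILE},           /* hypothetical entry, for illustration */
};
static int numreservedwords = (sizeof(reservedwords)/sizeof(struct reserved));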
@ [[idcategory]] is just binary search.
<<lex functions>>=
<<reserved word table>>
static int idcategory (char *id) {
  int low = 0, high = numreservedwords - 1, mid;
  int compare;
  while (low <= high) {
    /* Invariant: if id is in the initial range low...high,
       it is in the current range low...high */
    mid = (low + high) / 2;     /* note low <= mid <= high */
    compare = strcmp(id, reservedwords[mid].s);
    if (compare > 0)
      low = mid + 1;
    else if (compare < 0)
      high = mid - 1;
    else
      return reservedwords[mid].category;
  }
  return IDENT;                 /* id is not a reserved word */
}
<<lex prototypes>>=
static int idcategory(char *);
@ \subsection{Comments}

I use the standard trick of defining a special state just for
comments.
A begin comment sends the lexer into state [[<COMMENT>]], and an end
comment returns it to state [[<INITIAL>]].
All tokens that aren't comments are recognized only in state
[[<INITIAL>]], which explains the offensive [[<INITIAL>]] prepended to
every rule.
<<lex definitions>>=
%S COMMENT
<<lex rules>>=
<INITIAL>"/*" BEGIN COMMENT;
<COMMENT>"*/" BEGIN INITIAL;
<COMMENT>"\n" ;
<COMMENT>.    ;
@ \subsection{Whitespace and bad characters}

<<lex rules>>=
<INITIAL>[ \t\n]+ ;             /* ignore whitespace */
<INITIAL>.        <<complain about bad character>>
@ The error message we print is different for printable and
nonprintable characters.
<<complain about bad character>>=
{ if (*yytext >= ' ' && *yytext < 0177)
    yyerror("bad character `%c'", *yytext);
  else
    yyerror("bad character `\\%03o'", *yytext);
}
@ \section{8. Expressions}

\subsection{8.1 Operands (designators and constants)}

There are no qualified identifiers, so this simplifies the parsing of
designators.
It is a bit awkward to distinguish variables and parameters from
constant identifiers.
There is also an awkwardness with rvalues---boolean expressions have
to be converted from ``test'' to values using [[BOOL]].

Following a suggestion of Hanson's, I use three nonterminals: [[exp]]
is an expression (possibly a test); [[rvalue]] is an rvalue (never a
test), and [[lvalue]] is an lvalue.
I introduce [[complex_lvalue]] because I need to distinguish
identifiers from all other lvalues (otherwise I get a reduce-reduce
conflict when converting lvalues to expressions).
@
<<yacc declarations>>=
%type <exp> arraydes lvalue complex_lvalue rvalue exp
@ Here are productions for all the C literals.
I use [[make_constval(type,string)]] to convert a string to a value of
the type desired.
I issue warnings for long reals, since they aren't supported in
Oberon/320.
<<grammar>>=
exp : INTEGER   { $$ = make_constval(integer_type, $1); }
    | REAL      { $$ = make_constval(real_type, $1); }
    | LONGREAL  { warning("Long reals not supported (replaced with zero)");
                  $$ = make_constval(real_type, strsave("0.0")); }
    | CHARCONST { $$ = make_constval(char_type, $1); }
    | STRING    { $$ = make_constval(build_array(0, stringchar_type), $1); }
    ;
@ \subsection{8.2 Operators}

I use {\tt yacc} precedence declarations, which define both precedence
and associativity.
My task is much simplified because unary and binary [[-]] have exactly
the same precedence.
<<yacc declarations>>=
%left OR
%left AND
%left '|'
%left '^'
%left '&'
%left EQ NEQ
%left '<' '>' LEQ GEQ
%left LSHIFT RSHIFT
%left '+' '-'
%left '*' '/' '%'
%right '!' '~' INC DEC          /* unary operators */
%left ARROW '.'
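@ The remark about unary minus is worth a sketch.
Had unary [[-]] needed to bind tighter than binary [[-]] (as it does
in C), the standard dodge is a phantom token and a [[%prec]]
annotation, along these lines; the production and [[negate]] are
illustrations only, not part of this grammar.
<<unary minus via %prec (sketch, unused)>>=
/* declared after the line for '*' and '/', so it binds tighter: */
%nonassoc UMINUS            /* phantom token: never returned by the lexer */
/* and in the grammar: */
exp : '-' exp %prec UMINUS  { $$ = negate($2); }
    ;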
@ [[binop]] checks types and generates intermediate code.
Consult the chapter on typechecking for the description of the correct
operation of [[binop]] and the meanings of the various permissions
[[p_xxx]].
<<grammar>>=
complex_lvalue : lvalue '.' ident { $$ = find_field($1, $3); }
               | exp ARROW ident  { $$ = find_field(deref($1), $3); }
               | '*' rvalue       { $$ = deref($1); }
               | arraydes ']'     { $$ = $1; }
               ;
arraydes : rvalue '[' exp { $$ = subscript($1, $3); }
         ;
lvalue : complex_lvalue { $$ = $1; }
       | ident          { $$ = lookup_lvalue($1); }
       ;
exp : complex_lvalue { $$.type = $1.type;
                       $$.tree = tMEM($1.type->size, $1.tree); }
    | ident          { $$ = lookup_exp($1); }
    ;
rvalue : exp { $$ = boolval($1); }
       ;
<<grammar>>=
exp : exp EQ exp  { $$ = binop(OSeq,  $1, $3, boolean_type, p_equality); }
    | exp NEQ exp { $$ = binop(OSneq, $1, $3, boolean_type, p_equality); }
    | exp '<' exp { $$ = binop(OSlt,  $1, $3, boolean_type, p_relational); }
    | exp LEQ exp { $$ = binop(OSleq, $1, $3, boolean_type, p_relational); }
    | exp '>' exp { $$ = binop(OSgt,  $1, $3, boolean_type, p_relational); }
    | exp GEQ exp { $$ = binop(OSgeq, $1, $3, boolean_type, p_relational); }
    ;
exp : exp '+' exp { $$ = binop(OSplus,  $1, $3, 0, p_numeric); }
    | exp '-' exp { $$ = binop(OSminus, $1, $3, 0, p_numeric); }
    | exp OR exp  { $$ = binop(OSor,    $1, $3, 0, p_boolean); }
    ;
exp : exp '*' exp { $$ = binop(OSmul, $1, $3, 0, p_numeric); }
    | exp AND exp { $$ = binop(OSand, $1, $3, 0, p_boolean); }
    ;
exp : '(' exp ')' { $$ = $2; }
    ;
@ \subsection{Function calls}

Calls to functions (procedures having a return type) may occur
{\em only} as factors in the production given below.
The bottom of p.~678 of the Oberon report makes it clear that
the~[[()]] are required even if the function has no parameters.
<<grammar>>=
exp : ident ActualParameters { $$ = fcall($1, $2); }
    ;
ActualParameters : '(' ExpList ')' { $$ = $2; }
                 | '(' ')'         { $$ = 0; }
                 ;
ExpList : rvalue ',' ExpList { $$ = explist($1, $3); }
        | rvalue             { $$ = explist($1, 0); }
        ;
<<yacc declarations>>=
%type <explist> ActualParameters ExpList
@ \section{11. Compilation units}

<<grammar>>=
module : exp { compile($1); }
<<yacc declarations>>=
%start module
@ \section{Error recovery}

Here is a simple error production that might help the parser continue.
It is a stab in the dark; I hope it gives the parser a chance to
recover from errors in expressions.
<<grammar>>=
exp : error { $$.type = error_type; }
@ \section{Putting it all together}

Here are the necessary {\tt\#include} files:
<<includes>>=
#include <stdio.h>
#include <stdlib.h>
#include "types.h"
#include "predef.h"
#include "tree.h"
#include "typecheck.h"
#include "codegen.h"
#include "symbol.h"
#include "declarations.h"
#include "constants.h"
#include "errors.h"
@ There are no include files used exclusively by the parser.
This is because the lexer sees {\tt y.tab.h}, and so has to know
everything the parser knows.
<<parser includes>>=
@ This is boilerplate for every {\tt lex} specification ever written:
<<lexer>>=
%{
<<includes>>
<<lex includes>>
<<lex prototypes>>
%}
<<lex definitions>>
%%
<<lex rules>>
%%
<<lex functions>>
@ And this is boilerplate for every {\tt yacc} specification ever
written:
<<parser>>=
%{
<<includes>>
<<parser includes>>
<<parser prototypes>>
extern int yylex(void);
%}
<<yacc declarations>>
%%
<<grammar>>
%%
#define lint /* keep lcc from barking about no reference to yaccpar_sccsid */
<<parser functions>>
@ \section{Indices}

\subsection{Chunks}

\nowebchunks

\subsection{Identifiers}

\nowebindex