<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"> <HTML> <HEAD> <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> <META name="GENERATOR" content="hevea 1.06-7 of 2001-11-14"> <TITLE> Lexer and parser generators (ocamllex, ocamlyacc) </TITLE> </HEAD> <BODY TEXT=black BGCOLOR=white> <A HREF="manual025.html"><IMG SRC ="previous_motif.gif" ALT="Previous"></A> <A HREF="index.html"><IMG SRC ="contents_motif.gif" ALT="Contents"></A> <A HREF="manual027.html"><IMG SRC ="next_motif.gif" ALT="Next"></A> <HR> <TABLE CELLPADDING=0 CELLSPACING=0 WIDTH="100%"> <TR><TD BGCOLOR="#2de52d"><DIV ALIGN=center><TABLE> <TR><TD><A NAME="htoc120"><B><FONT SIZE=6>Chapter 12</FONT></B></A></TD> <TD WIDTH="100%" ALIGN=center><B><FONT SIZE=6>Lexer and parser generators (ocamllex, ocamlyacc)</FONT></B></TD> </TR></TABLE></DIV></TD> </TR></TABLE> <A NAME="c:ocamlyacc"></A> <BR> This chapter describes two program generators: <TT>ocamllex</TT>, that produces a lexical analyzer from a set of regular expressions with associated semantic actions, and <TT>ocamlyacc</TT>, that produces a parser from a grammar with associated semantic actions.<BR> <BR> These program generators are very close to the well-known <TT>lex</TT> and <TT>yacc</TT> commands that can be found in most C programming environments. This chapter assumes a working knowledge of <TT>lex</TT> and <TT>yacc</TT>: while it describes the input syntax for <TT>ocamllex</TT> and <TT>ocamlyacc</TT> and the main differences with <TT>lex</TT> and <TT>yacc</TT>, it does not explain the basics of writing a lexer or parser description in <TT>lex</TT> and <TT>yacc</TT>. Readers unfamiliar with <TT>lex</TT> and <TT>yacc</TT> are referred to ``Compilers: principles, techniques, and tools'' by Aho, Sethi and Ullman (Addison-Wesley, 1986), or ``Lex & Yacc'', by Levine, Mason and Brown (O'Reilly, 1992).<BR> <BR> <TABLE CELLPADDING=0 CELLSPACING=0 WIDTH="100%"> <TR><TD BGCOLOR="#66ff66"><DIV ALIGN=center><TABLE> <TR><TD><A NAME="htoc121"><B><FONT SIZE=5>12.1</FONT></B></A></TD> <TD WIDTH="100%" ALIGN=center><B><FONT SIZE=5>Overview of <TT>ocamllex</TT></FONT></B></TD> </TR></TABLE></DIV></TD> </TR></TABLE><BR> The <TT>ocamllex</TT> command produces a lexical analyzer from a set of regular expressions with attached semantic actions, in the style of <TT>lex</TT>. Assuming the input file is <I>lexer</I><TT>.mll</TT>, executing <PRE> ocamllex <I>lexer</I>.mll </PRE> produces Caml code for a lexical analyzer in file <I>lexer</I><TT>.ml</TT>. This file defines one lexing function per entry point in the lexer definition. These functions have the same names as the entry points. Lexing functions take as argument a lexer buffer, and return the semantic attribute of the corresponding entry point.<BR> <BR> Lexer buffers are an abstract data type implemented in the standard library module <TT>Lexing</TT>. The functions <TT>Lexing.from_channel</TT>, <TT>Lexing.from_string</TT> and <TT>Lexing.from_function</TT> create lexer buffers that read from an input channel, a character string, or any reading function, respectively. (See the description of module <TT>Lexing</TT> in chapter <A HREF="manual034.html#c:stdlib">20</A>.)<BR> <BR> When used in conjunction with a parser generated by <TT>ocamlyacc</TT>, the semantic actions compute a value belonging to the type <TT>token</TT> defined by the generated parsing module. (See the description of <TT>ocamlyacc</TT> below.)<BR> <BR> <TABLE CELLPADDING=0 CELLSPACING=0 WIDTH="100%"> <TR><TD BGCOLOR="#66ff66"><DIV ALIGN=center><TABLE> <TR><TD><A NAME="htoc122"><B><FONT SIZE=5>12.2</FONT></B></A></TD> <TD WIDTH="100%" ALIGN=center><B><FONT SIZE=5>Syntax of lexer definitions</FONT></B></TD> </TR></TABLE></DIV></TD> </TR></TABLE><BR> The format of lexer definitions is as follows: <PRE> { <I>header</I> } let <I>ident</I> = <I>regexp</I> ... rule <I>entrypoint</I> = parse <I>regexp</I> { <I>action</I> } | ... | <I>regexp</I> { <I>action</I> } and <I>entrypoint</I> = parse ... and ... { <I>trailer</I> } </PRE> Comments are delimited by <TT>(*</TT> and <TT>*)</TT>, as in Caml.<BR> <BR> <TABLE CELLPADDING=0 CELLSPACING=0 WIDTH="100%"> <TR><TD BGCOLOR="#7fff7f"><DIV ALIGN=center><TABLE> <TR><TD><A NAME="htoc123"><B><FONT SIZE=4>12.2.1</FONT></B></A></TD> <TD WIDTH="100%" ALIGN=center><B><FONT SIZE=4>Header and trailer</FONT></B></TD> </TR></TABLE></DIV></TD> </TR></TABLE> The <I>header</I> and <I>trailer</I> sections are arbitrary Caml text enclosed in curly braces. Either or both can be omitted. If present, the header text is copied as is at the beginning of the output file and the trailer text at the end. Typically, the header section contains the <CODE>open</CODE> directives required by the actions, and possibly some auxiliary functions used in the actions.<BR> <BR> <TABLE CELLPADDING=0 CELLSPACING=0 WIDTH="100%"> <TR><TD BGCOLOR="#7fff7f"><DIV ALIGN=center><TABLE> <TR><TD><A NAME="htoc124"><B><FONT SIZE=4>12.2.2</FONT></B></A></TD> <TD WIDTH="100%" ALIGN=center><B><FONT SIZE=4>Naming regular expressions</FONT></B></TD> </TR></TABLE></DIV></TD> </TR></TABLE><BR> Between the header and the entry points, one can give names to frequently-occurring regular expressions. This is written <TT><FONT COLOR=blue>let</FONT></TT> <TT><I><FONT COLOR=maroon>ident</FONT></I></TT> <TT><FONT COLOR=blue>=</FONT></TT> <TT><I><FONT COLOR=maroon>regexp</FONT></I></TT>. In following regular expressions, the identifier <I>ident</I> can be used as shorthand for <I>regexp</I>.<BR> <BR> <TABLE CELLPADDING=0 CELLSPACING=0 WIDTH="100%"> <TR><TD BGCOLOR="#7fff7f"><DIV ALIGN=center><TABLE> <TR><TD><A NAME="htoc125"><B><FONT SIZE=4>12.2.3</FONT></B></A></TD> <TD WIDTH="100%" ALIGN=center><B><FONT SIZE=4>Entry points</FONT></B></TD> </TR></TABLE></DIV></TD> </TR></TABLE><BR> The names of the entry points must be valid identifiers for Caml values (starting with a lowercase letter). Each entry point becomes a Caml function that takes one argument of type <TT>Lexing.lexbuf</TT>. Characters are read from the <TT>Lexing.lexbuf</TT> argument and matched against the regular expressions provided in the rule, until a prefix of the input matches one of the rule. The corresponding action is then evaluated and returned as the result of the function.<BR> <BR> If several regular expressions match a prefix of the input, the ``longest match'' rule applies: the regular expression that matches the longest prefix of the input is selected. In case of tie, the regular expression that occurs earlier in the rule is selected.<BR> <BR> <TABLE CELLPADDING=0 CELLSPACING=0 WIDTH="100%"> <TR><TD BGCOLOR="#7fff7f"><DIV ALIGN=center><TABLE> <TR><TD><A NAME="htoc126"><B><FONT SIZE=4>12.2.4</FONT></B></A></TD> <TD WIDTH="100%" ALIGN=center><B><FONT SIZE=4>Regular expressions</FONT></B></TD> </TR></TABLE></DIV></TD> </TR></TABLE><BR> The regular expressions are in the style of <TT>lex</TT>, with a more Caml-like syntax. <DL COMPACT=compact><DT><B><TT><FONT COLOR=blue>'</FONT></TT></B><B> <TT><I><FONT COLOR=maroon>char</FONT></I></TT></B><B> <TT><FONT COLOR=blue>'</FONT></TT></B><DD> A character constant, with the same syntax as Objective Caml character constants. Match the denoted character.<BR> <BR> <DT><B><TT>_</TT></B><DD> (Underscore.) Match any character.<BR> <BR> <DT><B><TT>eof</TT></B><DD> Match the end of the lexer input.<BR> <B>Note:</B> On some systems, with interactive input, an end-of-file may be followed by more characters. However, <TT>ocamllex</TT> will not correctly handle regular expressions that contain <TT>eof</TT> followed by something else.<BR> <BR> <DT><B><TT><FONT COLOR=blue>"</FONT></TT></B><B> <TT><I><FONT COLOR=maroon>string</FONT></I></TT></B><B> <TT><FONT COLOR=blue>"</FONT></TT></B><DD> A string constant, with the same syntax as Objective Caml string constants. Match the corresponding sequence of characters.<BR> <BR> <DT><B><TT><FONT COLOR=blue>[</FONT></TT></B><B> <TT><I><FONT COLOR=maroon>character-set</FONT></I></TT></B><B> <TT><FONT COLOR=blue>]</FONT></TT></B><DD> Match any single character belonging to the given character set. Valid character sets are: single character constants <TT><FONT COLOR=blue>'</FONT></TT> <TT><I><FONT COLOR=maroon>c</FONT></I></TT> <TT><FONT COLOR=blue>'</FONT></TT>; ranges of characters <TT><FONT COLOR=blue>'</FONT></TT> <TT><I><FONT COLOR=maroon>c</FONT></I></TT><SUB><FONT SIZE=2>1</FONT></SUB> <TT><FONT COLOR=blue>'</FONT></TT> <TT><FONT COLOR=blue>-</FONT></TT> <TT><FONT COLOR=blue>'</FONT></TT> <TT><I><FONT COLOR=maroon>c</FONT></I></TT><SUB><FONT SIZE=2>2</FONT></SUB> <TT><FONT COLOR=blue>'</FONT></TT> (all characters between <I>c</I><SUB><FONT SIZE=2>1</FONT></SUB> and <I>c</I><SUB><FONT SIZE=2>2</FONT></SUB>, inclusive); and the union of two or more character sets, denoted by concatenation.<BR> <BR> <DT><B><TT><FONT COLOR=blue>[</FONT></TT></B><B> <TT><FONT COLOR=blue>^</FONT></TT></B><B> <TT><I><FONT COLOR=maroon>character-set</FONT></I></TT></B><B> <TT><FONT COLOR=blue>]</FONT></TT></B><DD> Match any single character not belonging to the given character set.<BR> <BR> <DT><B><TT><I><FONT COLOR=maroon>regexp</FONT></I></TT></B><B> <TT><FONT COLOR=blue>*</FONT></TT></B><DD> (Repetition.) Match the concatenation of zero or more strings that match <TT><I><FONT COLOR=maroon>regexp</FONT></I></TT>. <BR> <BR> <DT><B><TT><I><FONT COLOR=maroon>regexp</FONT></I></TT></B><B> <TT><FONT COLOR=blue>+</FONT></TT></B><DD> (Strict repetition.) Match the concatenation of one or more strings that match <TT><I><FONT COLOR=maroon>regexp</FONT></I></TT>.<BR> <BR> <DT><B><TT><I><FONT COLOR=maroon>regexp</FONT></I></TT></B><B> <TT><FONT COLOR=blue>?</FONT></TT></B><DD> (Option.) Match either the empty string, or a string matching <TT><I><FONT COLOR=maroon>regexp</FONT></I></TT>.<BR> <BR> <DT><B><TT><I><FONT COLOR=maroon>regexp</FONT></I></TT></B><SUB><B><FONT SIZE=2>1</FONT></B></SUB><B> <TT><FONT COLOR=blue>|</FONT></TT></B><B> <TT><I><FONT COLOR=maroon>regexp</FONT></I></TT></B><SUB><B><FONT SIZE=2>2</FONT></B></SUB><DD> (Alternative.) Match any string that matches either <TT><I><FONT COLOR=maroon>regexp</FONT></I></TT><SUB><FONT SIZE=2>1</FONT></SUB> or <TT><I><FONT COLOR=maroon>regexp</FONT></I></TT><SUB><FONT SIZE=2>2</FONT></SUB><BR> <BR> <DT><B><TT><I><FONT COLOR=maroon>regexp</FONT></I></TT></B><SUB><B><FONT SIZE=2>1</FONT></B></SUB><B> <TT><I><FONT COLOR=maroon>regexp</FONT></I></TT></B><SUB><B><FONT SIZE=2>2</FONT></B></SUB><DD> (Concatenation.) Match the concatenation of two strings, the first matching <TT><I><FONT COLOR=maroon>regexp</FONT></I></TT><SUB><FONT SIZE=2>1</FONT></SUB>, the second matching <TT><I><FONT COLOR=maroon>regexp</FONT></I></TT><SUB><FONT SIZE=2>2</FONT></SUB>.<BR> <BR> <DT><B><TT><FONT COLOR=blue>(</FONT></TT></B><B> <TT><I><FONT COLOR=maroon>regexp</FONT></I></TT></B><B> <TT><FONT COLOR=blue>)</FONT></TT></B><DD> Match the same strings as <TT><I><FONT COLOR=maroon>regexp</FONT></I></TT>.<BR> <BR> <DT><B><TT><I><FONT COLOR=maroon>ident</FONT></I></TT></B><DD> Reference the regular expression bound to <TT><I><FONT COLOR=maroon>ident</FONT></I></TT> by an earlier <TT><FONT COLOR=blue>let</FONT></TT> <TT><I><FONT COLOR=maroon>ident</FONT></I></TT> <TT><FONT COLOR=blue>=</FONT></TT> <TT><I><FONT COLOR=maroon>regexp</FONT></I></TT> definition.</DL> Concerning the precedences of operators, <TT>*</TT> and <TT>+</TT> have highest precedence, followed by <TT>?</TT>, then concatenation, then <TT>|</TT> (alternation).<BR> <BR> <TABLE CELLPADDING=0 CELLSPACING=0 WIDTH="100%"> <TR><TD BGCOLOR="#7fff7f"><DIV ALIGN=center><TABLE> <TR><TD><A NAME="htoc127"><B><FONT SIZE=4>12.2.5</FONT></B></A></TD> <TD WIDTH="100%" ALIGN=center><B><FONT SIZE=4>Actions</FONT></B></TD> </TR></TABLE></DIV></TD> </TR></TABLE><BR> The actions are arbitrary Caml expressions. They are evaluated in a context where the identifier <TT>lexbuf</TT> is bound to the current lexer buffer. Some typical uses for <TT>lexbuf</TT>, in conjunction with the operations on lexer buffers provided by the <TT>Lexing</TT> standard library module, are listed below. <DL COMPACT=compact><DT> <B><TT>Lexing.lexeme lexbuf</TT></B><DD> Return the matched string.<BR> <BR> <DT><B><TT>Lexing.lexeme_char lexbuf </TT><I>n</I></B><DD> Return the <I>n</I><SUP><FONT SIZE=2>th</FONT></SUP> character in the matched string. The first character corresponds to <I>n</I> = 0.<BR> <BR> <DT><B><TT>Lexing.lexeme_start lexbuf</TT></B><DD> Return the absolute position in the input text of the beginning of the matched string. The first character read from the input text has position 0.<BR> <BR> <DT><B><TT>Lexing.lexeme_end lexbuf</TT></B><DD> Return the absolute position in the input text of the end of the matched string. The first character read from the input text has position 0.<BR> <BR> <DT><B><I>entrypoint</I></B><B> <TT>lexbuf</TT></B><DD> (Where <I>entrypoint</I> is the name of another entry point in the same lexer definition.) Recursively call the lexer on the given entry point. Useful for lexing nested comments, for example.</DL> <TABLE CELLPADDING=0 CELLSPACING=0 WIDTH="100%"> <TR><TD BGCOLOR="#7fff7f"><DIV ALIGN=center><TABLE> <TR><TD><A NAME="htoc128"><B><FONT SIZE=4>12.2.6</FONT></B></A></TD> <TD WIDTH="100%" ALIGN=center><B><FONT SIZE=4>Reserved identifiers</FONT></B></TD> </TR></TABLE></DIV></TD> </TR></TABLE><BR> All identifiers starting with <TT>__ocaml_lex</TT> are reserved for use by <TT>ocamllex</TT>; do not use any such identifier in your programs.<BR> <BR> <TABLE CELLPADDING=0 CELLSPACING=0 WIDTH="100%"> <TR><TD BGCOLOR="#66ff66"><DIV ALIGN=center><TABLE> <TR><TD><A NAME="htoc129"><B><FONT SIZE=5>12.3</FONT></B></A></TD> <TD WIDTH="100%" ALIGN=center><B><FONT SIZE=5>Overview of <TT>ocamlyacc</TT></FONT></B></TD> </TR></TABLE></DIV></TD> </TR></TABLE><BR> The <TT>ocamlyacc</TT> command produces a parser from a context-free grammar specification with attached semantic actions, in the style of <TT>yacc</TT>. Assuming the input file is <I>grammar</I><TT>.mly</TT>, executing <PRE> ocamlyacc <I>options</I> <I>grammar</I>.mly </PRE> produces Caml code for a parser in the file <I>grammar</I><TT>.ml</TT>, and its interface in file <I>grammar</I><TT>.mli</TT>.<BR> <BR> The generated module defines one parsing function per entry point in the grammar. These functions have the same names as the entry points. Parsing functions take as arguments a lexical analyzer (a function from lexer buffers to tokens) and a lexer buffer, and return the semantic attribute of the corresponding entry point. Lexical analyzer functions are usually generated from a lexer specification by the <TT>ocamllex</TT> program. Lexer buffers are an abstract data type implemented in the standard library module <TT>Lexing</TT>. Tokens are values from the concrete type <TT>token</TT>, defined in the interface file <I>grammar</I><TT>.mli</TT> produced by <TT>ocamlyacc</TT>.<BR> <BR> <TABLE CELLPADDING=0 CELLSPACING=0 WIDTH="100%"> <TR><TD BGCOLOR="#66ff66"><DIV ALIGN=center><TABLE> <TR><TD><A NAME="htoc130"><B><FONT SIZE=5>12.4</FONT></B></A></TD> <TD WIDTH="100%" ALIGN=center><B><FONT SIZE=5>Syntax of grammar definitions</FONT></B></TD> </TR></TABLE></DIV></TD> </TR></TABLE><BR> Grammar definitions have the following format: <PRE> %{ <I>header</I> %} <I>declarations</I> %% <I>rules</I> %% <I>trailer</I> </PRE> Comments are enclosed between <CODE>/*</CODE> and <CODE>*/</CODE> (as in C) in the ``declarations'' and ``rules'' sections, and between <CODE>(*</CODE> and <CODE>*)</CODE> (as in Caml) in the ``header'' and ``trailer'' sections.<BR> <BR> <TABLE CELLPADDING=0 CELLSPACING=0 WIDTH="100%"> <TR><TD BGCOLOR="#7fff7f"><DIV ALIGN=center><TABLE> <TR><TD><A NAME="htoc131"><B><FONT SIZE=4>12.4.1</FONT></B></A></TD> <TD WIDTH="100%" ALIGN=center><B><FONT SIZE=4>Header and trailer</FONT></B></TD> </TR></TABLE></DIV></TD> </TR></TABLE><BR> The header and the trailer sections are Caml code that is copied as is into file <I>grammar</I><TT>.ml</TT>. Both sections are optional. The header goes at the beginning of the output file; it usually contains <TT>open</TT> directives and auxiliary functions required by the semantic actions of the rules. The trailer goes at the end of the output file.<BR> <BR> <TABLE CELLPADDING=0 CELLSPACING=0 WIDTH="100%"> <TR><TD BGCOLOR="#7fff7f"><DIV ALIGN=center><TABLE> <TR><TD><A NAME="htoc132"><B><FONT SIZE=4>12.4.2</FONT></B></A></TD> <TD WIDTH="100%" ALIGN=center><B><FONT SIZE=4>Declarations</FONT></B></TD> </TR></TABLE></DIV></TD> </TR></TABLE><BR> Declarations are given one per line. They all start with a <CODE>%</CODE> sign. <DL COMPACT=compact><DT><B><TT><FONT COLOR=blue>%token</FONT></TT></B><B> <TT><I><FONT COLOR=maroon>symbol</FONT></I></TT></B><B> ... <TT><I><FONT COLOR=maroon>symbol</FONT></I></TT></B><DD> Declare the given symbols as tokens (terminal symbols). These symbols are added as constant constructors for the <TT>token</TT> concrete type.<BR> <BR> <DT><B><TT><FONT COLOR=blue>%token</FONT></TT></B><B> <TT><FONT COLOR=blue><</FONT></TT></B><B> <TT><I><FONT COLOR=maroon>type</FONT></I></TT></B><B> <TT><FONT COLOR=blue>></FONT></TT></B><B> <TT><I><FONT COLOR=maroon>symbol</FONT></I></TT></B><B> ... <TT><I><FONT COLOR=maroon>symbol</FONT></I></TT></B><DD> Declare the given symbols as tokens with an attached attribute of the given type. These symbols are added as constructors with arguments of the given type for the <TT>token</TT> concrete type. The <TT><I><FONT COLOR=maroon>type</FONT></I></TT> part is an arbitrary Caml type expression, except that all type constructor names must be fully qualified (e.g. <TT>Modname.typename</TT>) for all types except standard built-in types, even if the proper <CODE>open</CODE> directives (e.g. <CODE>open Modname</CODE>) were given in the header section. That's because the header is copied only to the <TT>.ml</TT> output file, but not to the <TT>.mli</TT> output file, while the <TT><I><FONT COLOR=maroon>type</FONT></I></TT> part of a <CODE>%token</CODE> declaration is copied to both.<BR> <BR> <DT><B><TT><FONT COLOR=blue>%start</FONT></TT></B><B> <TT><I><FONT COLOR=maroon>symbol</FONT></I></TT></B><B> ... <TT><I><FONT COLOR=maroon>symbol</FONT></I></TT></B><DD> Declare the given symbols as entry points for the grammar. For each entry point, a parsing function with the same name is defined in the output module. Non-terminals that are not declared as entry points have no such parsing function. Start symbols must be given a type with the <CODE>%type</CODE> directive below.<BR> <BR> <DT><B><TT><FONT COLOR=blue>%type</FONT></TT></B><B> <TT><FONT COLOR=blue><</FONT></TT></B><B> <TT><I><FONT COLOR=maroon>type</FONT></I></TT></B><B> <TT><FONT COLOR=blue>></FONT></TT></B><B> <TT><I><FONT COLOR=maroon>symbol</FONT></I></TT></B><B> ... <TT><I><FONT COLOR=maroon>symbol</FONT></I></TT></B><DD> Specify the type of the semantic attributes for the given symbols. This is mandatory for start symbols only. Other nonterminal symbols need not be given types by hand: these types will be inferred when running the output files through the Objective Caml compiler (unless the <CODE>-s</CODE> option is in effect). The <TT><I><FONT COLOR=maroon>type</FONT></I></TT> part is an arbitrary Caml type expression, except that all type constructor names must be fully qualified, as explained above for <TT>%token</TT>.<BR> <BR> <DT><B><TT><FONT COLOR=blue>%left</FONT></TT></B><B> <TT><I><FONT COLOR=maroon>symbol</FONT></I></TT></B><B> ... <TT><I><FONT COLOR=maroon>symbol</FONT></I></TT></B><DD> <DT><B><TT><FONT COLOR=blue>%right</FONT></TT></B><B> <TT><I><FONT COLOR=maroon>symbol</FONT></I></TT></B><B> ... <TT><I><FONT COLOR=maroon>symbol</FONT></I></TT></B><DD> <DT><B><TT><FONT COLOR=blue>%nonassoc</FONT></TT></B><B> <TT><I><FONT COLOR=maroon>symbol</FONT></I></TT></B><B> ... <TT><I><FONT COLOR=maroon>symbol</FONT></I></TT></B><DD><BR> <BR> Associate precedences and associativities to the given symbols. All symbols on the same line are given the same precedence. They have higher precedence than symbols declared before in a <CODE>%left</CODE>, <CODE>%right</CODE> or <CODE>%nonassoc</CODE> line. They have lower precedence than symbols declared after in a <CODE>%left</CODE>, <CODE>%right</CODE> or <CODE>%nonassoc</CODE> line. The symbols are declared to associate to the left (<CODE>%left</CODE>), to the right (<CODE>%right</CODE>), or to be non-associative (<CODE>%nonassoc</CODE>). The symbols are usually tokens. They can also be dummy nonterminals, for use with the <CODE>%prec</CODE> directive inside the rules.</DL> <TABLE CELLPADDING=0 CELLSPACING=0 WIDTH="100%"> <TR><TD BGCOLOR="#7fff7f"><DIV ALIGN=center><TABLE> <TR><TD><A NAME="htoc133"><B><FONT SIZE=4>12.4.3</FONT></B></A></TD> <TD WIDTH="100%" ALIGN=center><B><FONT SIZE=4>Rules</FONT></B></TD> </TR></TABLE></DIV></TD> </TR></TABLE><BR> The syntax for rules is as usual: <PRE> <I>nonterminal</I> : <I>symbol</I> ... <I>symbol</I> { <I>semantic-action</I> } | ... | <I>symbol</I> ... <I>symbol</I> { <I>semantic-action</I> } ; </PRE> Rules can also contain the <CODE>%prec </CODE><I>symbol</I> directive in the right-hand side part, to override the default precedence and associativity of the rule with the precedence and associativity of the given symbol.<BR> <BR> Semantic actions are arbitrary Caml expressions, that are evaluated to produce the semantic attribute attached to the defined nonterminal. The semantic actions can access the semantic attributes of the symbols in the right-hand side of the rule with the <CODE>$</CODE> notation: <CODE>$1</CODE> is the attribute for the first (leftmost) symbol, <CODE>$2</CODE> is the attribute for the second symbol, etc.<BR> <BR> The rules may contain the special symbol <TT>error</TT> to indicate resynchronization points, as in <TT>yacc</TT>.<BR> <BR> Actions occurring in the middle of rules are not supported.<BR> <BR> Nonterminal symbols are like regular Caml symbols, except that they cannot end with <TT>'</TT> (single quote).<BR> <BR> <TABLE CELLPADDING=0 CELLSPACING=0 WIDTH="100%"> <TR><TD BGCOLOR="#7fff7f"><DIV ALIGN=center><TABLE> <TR><TD><A NAME="htoc134"><B><FONT SIZE=4>12.4.4</FONT></B></A></TD> <TD WIDTH="100%" ALIGN=center><B><FONT SIZE=4>Error handling</FONT></B></TD> </TR></TABLE></DIV></TD> </TR></TABLE><BR> Error recovery is supported as follows: when the parser reaches an error state (no grammar rules can apply), it calls a function named <TT>parse_error</TT> with the string <TT>"syntax error"</TT> as argument. The default <TT>parse_error</TT> function does nothing and returns, thus initiating error recovery (see below). The user can define a customized <TT>parse_error</TT> function in the header section of the grammar file.<BR> <BR> The parser also enters error recovery mode if one of the grammar actions raises the <TT>Parsing.Parse_error</TT> exception.<BR> <BR> In error recovery mode, the parser discards states from the stack until it reaches a place where the error token can be shifted. It then discards tokens from the input until it finds three successive tokens that can be accepted, and starts processing with the first of these. If no state can be uncovered where the error token can be shifted, then the parser aborts by raising the <TT>Parsing.Parse_error</TT> exception.<BR> <BR> Refer to documentation on <TT>yacc</TT> for more details and guidance in how to use error recovery.<BR> <BR> <TABLE CELLPADDING=0 CELLSPACING=0 WIDTH="100%"> <TR><TD BGCOLOR="#66ff66"><DIV ALIGN=center><TABLE> <TR><TD><A NAME="htoc135"><B><FONT SIZE=5>12.5</FONT></B></A></TD> <TD WIDTH="100%" ALIGN=center><B><FONT SIZE=5>Options</FONT></B></TD> </TR></TABLE></DIV></TD> </TR></TABLE><BR> The <TT>ocamlyacc</TT> command recognizes the following options: <DL COMPACT=compact><DT> <B><TT>-v</TT></B><DD> Generate a description of the parsing tables and a report on conflicts resulting from ambiguities in the grammar. The description is put in file <I>grammar</I><TT>.output</TT>.<BR> <BR> <DT><B><TT>-b</TT><I>prefix</I></B><DD> Name the output files <I>prefix</I><TT>.ml</TT>, <I>prefix</I><TT>.mli</TT>, <I>prefix</I><TT>.output</TT>, instead of the default naming convention.</DL> At run-time, the <TT>ocamlyacc</TT>-generated parser can be debugged by setting the <TT>p</TT> option in the <TT>OCAMLRUNPARAM</TT> environment variable (see section <A HREF="manual024.html#ocamlrun-options">10.2</A>). This causes the pushdown automaton executing the parser to print a trace of its action (tokens shifted, rules reduced, etc). The trace mentions rule numbers and state numbers that can be interpreted by looking at the file <I>grammar</I><TT>.output</TT> generated by <TT>ocamlyacc -v</TT>.<BR> <BR> <TABLE CELLPADDING=0 CELLSPACING=0 WIDTH="100%"> <TR><TD BGCOLOR="#66ff66"><DIV ALIGN=center><TABLE> <TR><TD><A NAME="htoc136"><B><FONT SIZE=5>12.6</FONT></B></A></TD> <TD WIDTH="100%" ALIGN=center><B><FONT SIZE=5>A complete example</FONT></B></TD> </TR></TABLE></DIV></TD> </TR></TABLE><BR> The all-time favorite: a desk calculator. This program reads arithmetic expressions on standard input, one per line, and prints their values. Here is the grammar definition: <PRE> /* File parser.mly */ %token <int> INT %token PLUS MINUS TIMES DIV %token LPAREN RPAREN %token EOL %left PLUS MINUS /* lowest precedence */ %left TIMES DIV /* medium precedence */ %nonassoc UMINUS /* highest precedence */ %start main /* the entry point */ %type <int> main %% main: expr EOL { $1 } ; expr: INT { $1 } | LPAREN expr RPAREN { $2 } | expr PLUS expr { $1 + $3 } | expr MINUS expr { $1 - $3 } | expr TIMES expr { $1 * $3 } | expr DIV expr { $1 / $3 } | MINUS expr %prec UMINUS { - $2 } ; </PRE>Here is the definition for the corresponding lexer: <PRE> (* File lexer.mll *) { open Parser (* The type token is defined in parser.mli *) exception Eof } rule token = parse [' ' '\t'] { token lexbuf } (* skip blanks *) | ['\n' ] { EOL } | ['0'-'9']+ { INT(int_of_string(Lexing.lexeme lexbuf)) } | '+' { PLUS } | '-' { MINUS } | '*' { TIMES } | '/' { DIV } | '(' { LPAREN } | ')' { RPAREN } | eof { raise Eof } </PRE>Here is the main program, that combines the parser with the lexer: <PRE> (* File calc.ml *) let _ = try let lexbuf = Lexing.from_channel stdin in while true do let result = Parser.main Lexer.token lexbuf in print_int result; print_newline(); flush stdout done with Lexer.Eof -> exit 0 </PRE>To compile everything, execute: <PRE> ocamllex lexer.mll # generates lexer.ml ocamlyacc parser.mly # generates parser.ml and parser.mli ocamlc -c parser.mli ocamlc -c lexer.ml ocamlc -c parser.ml ocamlc -c calc.ml ocamlc -o calc lexer.cmo parser.cmo calc.cmo </PRE> <TABLE CELLPADDING=0 CELLSPACING=0 WIDTH="100%"> <TR><TD BGCOLOR="#66ff66"><DIV ALIGN=center><TABLE> <TR><TD><A NAME="htoc137"><B><FONT SIZE=5>12.7</FONT></B></A></TD> <TD WIDTH="100%" ALIGN=center><B><FONT SIZE=5>Common errors</FONT></B></TD> </TR></TABLE></DIV></TD> </TR></TABLE> <DL COMPACT=compact><DT><B>ocamllex: transition table overflow, automaton is too big</B><DD><BR> <BR> The deterministic automata generated by <TT>ocamllex</TT> are limited to at most 32767 transitions. The message above indicates that your lexer definition is too complex and overflows this limit. This is commonly caused by lexer definitions that have separate rules for each of the alphabetic keywords of the language, as in the following example. <PRE> rule token = parse "keyword1" { KWD1 } | "keyword2" { KWD2 } | ... | "keyword100" { KWD100 } | ['A'-'Z' 'a'-'z'] ['A'-'Z' 'a'-'z' '0'-'9' '_'] * { IDENT(Lexing.lexeme lexbuf) } </PRE>To keep the generated automata small, rewrite those definitions with only one general ``identifier'' rule, followed by a hashtable lookup to separate keywords from identifiers: <PRE> { let keyword_table = Hashtbl.create 53 let _ = List.iter (fun (kwd, tok) -> Hashtbl.add keyword_table kwd tok) [ "keyword1", KWD1; "keyword2", KWD2; ... "keyword100", KWD100 ] } rule token = parse ['A'-'Z' 'a'-'z'] ['A'-'Z' 'a'-'z' '0'-'9' '_'] * { let id = Lexing.lexeme lexbuf in try Hashtbl.find keyword_table s with Not_found -> IDENT s } </PRE></DL> <HR> <A HREF="manual025.html"><IMG SRC ="previous_motif.gif" ALT="Previous"></A> <A HREF="index.html"><IMG SRC ="contents_motif.gif" ALT="Contents"></A> <A HREF="manual027.html"><IMG SRC ="next_motif.gif" ALT="Next"></A> </BODY> </HTML>