Sophie

Sophie

distrib > Mageia > 7 > armv7hl > media > core-release > by-pkgid > fb18813323b88f9a6e869238ab603257 > files > 24

ocaml-doc-4.07.1-2.mga7.noarch.rpm

<!DOCTYPE html>
<html>
<head>

<meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
<meta name="generator" content="hevea 2.32">

  <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1">
<link rel="stylesheet" type="text/css" href="manual.css">
<title>Chapter&#XA0;13&#XA0;&#XA0;Lexer and parser generators (ocamllex, ocamlyacc)</title>
</head>
<body>
<a href="native.html"><img src="previous_motif.svg" alt="Previous"></a>
<a href="index.html"><img src="contents_motif.svg" alt="Up"></a>
<a href="depend.html"><img src="next_motif.svg" alt="Next"></a>
<hr>
<h1 class="chapter" id="sec313">Chapter&#XA0;13&#XA0;&#XA0;Lexer and parser generators (ocamllex, ocamlyacc)</h1>
<ul>
<li><a href="lexyacc.html#sec314">13.1&#XA0;&#XA0;Overview of <span class="c003">ocamllex</span></a>
</li><li><a href="lexyacc.html#sec316">13.2&#XA0;&#XA0;Syntax of lexer definitions</a>
</li><li><a href="lexyacc.html#sec325">13.3&#XA0;&#XA0;Overview of <span class="c003">ocamlyacc</span></a>
</li><li><a href="lexyacc.html#sec326">13.4&#XA0;&#XA0;Syntax of grammar definitions</a>
</li><li><a href="lexyacc.html#sec331">13.5&#XA0;&#XA0;Options</a>
</li><li><a href="lexyacc.html#sec332">13.6&#XA0;&#XA0;A complete example</a>
</li><li><a href="lexyacc.html#sec333">13.7&#XA0;&#XA0;Common errors</a>
</li></ul>
<p>
<a id="c:ocamlyacc"></a>

</p><p>This chapter describes two program generators: <span class="c003">ocamllex</span>, that
produces a lexical analyzer from a set of regular expressions with
associated semantic actions, and <span class="c003">ocamlyacc</span>, that produces a parser
from a grammar with associated semantic actions.</p><p>These program generators are very close to the well-known <span class="c003">lex</span> and
<span class="c003">yacc</span> commands that can be found in most C programming environments.
This chapter assumes a working knowledge of <span class="c003">lex</span> and <span class="c003">yacc</span>: while
it describes the input syntax for <span class="c003">ocamllex</span> and <span class="c003">ocamlyacc</span> and the
main differences with <span class="c003">lex</span> and <span class="c003">yacc</span>, it does not explain the basics
of writing a lexer or parser description in <span class="c003">lex</span> and <span class="c003">yacc</span>. Readers
unfamiliar with <span class="c003">lex</span> and <span class="c003">yacc</span> are referred to &#X201C;Compilers:
principles, techniques, and tools&#X201D; by Aho, Sethi and Ullman
(Addison-Wesley, 1986), or &#X201C;Lex &amp; Yacc&#X201D;, by Levine, Mason and
Brown (O&#X2019;Reilly, 1992).</p>
<h2 class="section" id="sec314">13.1&#XA0;&#XA0;Overview of <span class="c003">ocamllex</span></h2>
<p>The <span class="c003">ocamllex</span> command produces a lexical analyzer from a set of regular
expressions with attached semantic actions, in the style of
<span class="c003">lex</span>. Assuming the input file is <span class="c009">lexer</span><span class="c003">.mll</span>, executing
</p><pre>
        ocamllex <span class="c009">lexer</span>.mll
</pre><p>
produces OCaml code for a lexical analyzer in file <span class="c009">lexer</span><span class="c003">.ml</span>.
This file defines one lexing function per entry point in the lexer
definition. These functions have the same names as the entry
points. Lexing functions take as argument a lexer buffer, and return
the semantic attribute of the corresponding entry point.</p><p>Lexer buffers are an abstract data type implemented in the standard
library module <span class="c003">Lexing</span>. The functions <span class="c003">Lexing.from_channel</span>,
<span class="c003">Lexing.from_string</span> and <span class="c003">Lexing.from_function</span> create
lexer buffers that read from an input channel, a character string, or
any reading function, respectively. (See the description of module
<span class="c003">Lexing</span> in chapter&#XA0;<a href="stdlib.html#c%3Astdlib">26</a>.)</p><p>When used in conjunction with a parser generated by <span class="c003">ocamlyacc</span>, the
semantic actions compute a value belonging to the type <span class="c003">token</span> defined
by the generated parsing module. (See the description of <span class="c003">ocamlyacc</span>
below.)</p>
<h3 class="subsection" id="sec315">13.1.1&#XA0;&#XA0;Options</h3>
<p>
The following command-line options are recognized by <span class="c003">ocamllex</span>.</p><dl class="description"><dt class="dt-description"><span class="c006">-ml</span></dt><dd class="dd-description">
Output code that does not use OCaml&#X2019;s built-in automata
interpreter. Instead, the automaton is encoded by OCaml functions.
This option mainly is useful for debugging <span class="c003">ocamllex</span>, using it for
production lexers is not recommended.</dd><dt class="dt-description"><span class="c013"><span class="c003">-o</span> <span class="c009">output-file</span></span></dt><dd class="dd-description">
Specify the name of the output file produced by <span class="c003">ocamllex</span>.
The default is the input file name with its extension replaced by <span class="c003">.ml</span>.</dd><dt class="dt-description"><span class="c006">-q</span></dt><dd class="dd-description">
Quiet mode. <span class="c003">ocamllex</span> normally outputs informational messages
to standard output. They are suppressed if option <span class="c003">-q</span> is used.</dd><dt class="dt-description"><span class="c013"><span class="c003">-v</span> or <span class="c003">-version</span></span></dt><dd class="dd-description">
Print version string and exit.</dd><dt class="dt-description"><span class="c006">-vnum</span></dt><dd class="dd-description">
Print short version number and exit.</dd><dt class="dt-description"><span class="c013"><span class="c003">-help</span> or <span class="c003">--help</span></span></dt><dd class="dd-description">
Display a short usage summary and exit.
</dd></dl>
<h2 class="section" id="sec316">13.2&#XA0;&#XA0;Syntax of lexer definitions</h2>
<p>The format of lexer definitions is as follows:
</p><pre>
{ <span class="c009">header</span> }
let <span class="c009">ident</span> = <span class="c009">regexp</span> &#X2026;
[refill { <span class="c009">refill-handler</span> }]
rule <span class="c009">entrypoint</span> [<span class="c009">arg</span><sub>1</sub>&#X2026; <span class="c009">arg</span><sub><span class="c009">n</span></sub>] =
  parse <span class="c009">regexp</span> { <span class="c009">action</span> }
      | &#X2026;
      | <span class="c009">regexp</span> { <span class="c009">action</span> }
and <span class="c009">entrypoint</span> [<span class="c009">arg</span><sub>1</sub>&#X2026; <span class="c009">arg</span><sub><span class="c009">n</span></sub>] =
  parse &#X2026;
and &#X2026;
{ <span class="c009">trailer</span> }
</pre><p>
Comments are delimited by <span class="c003">(*</span> and <span class="c003">*)</span>, as in OCaml.
The <span class="c003">parse</span> keyword, can be replaced by the <span class="c003">shortest</span> keyword, with
the semantic consequences explained below.</p><p>Refill handlers are a recent (optional) feature introduced in 4.02,
documented below in subsection&#XA0;<a href="#ss%3Arefill-handlers">13.2.7</a>.</p>
<h3 class="subsection" id="sec317">13.2.1&#XA0;&#XA0;Header and trailer</h3>
<p>
The <span class="c009">header</span> and <span class="c009">trailer</span> sections are arbitrary OCaml
text enclosed in curly braces. Either or both can be omitted. If
present, the header text is copied as is at the beginning of the
output file and the trailer text at the end. Typically, the
header section contains the <span class="c003">open</span> directives required
by the actions, and possibly some auxiliary functions used in the
actions.</p>
<h3 class="subsection" id="sec318">13.2.2&#XA0;&#XA0;Naming regular expressions</h3>
<p>Between the header and the entry points, one can give names to
frequently-occurring regular expressions. This is written
<span class="c004">let</span> <a class="syntax" href="lex.html#ident"><span class="c010">ident</span></a> <span class="c004">=</span> &#XA0;<a class="syntax" href="#regexp"><span class="c010">regexp</span></a>.
In regular expressions that follow this declaration, the identifier
<span class="c009">ident</span> can be used as shorthand for <span class="c009">regexp</span>.</p>
<h3 class="subsection" id="sec319">13.2.3&#XA0;&#XA0;Entry points</h3>
<p>The names of the entry points must be valid identifiers for OCaml
values (starting with a lowercase letter).
Similarily, the arguments <span class="c005">arg</span><sub>1</sub><span class="c003">&#X2026;
<span class="c009">arg</span></span><sub><span class="c009">n</span></sub> must be valid identifiers for OCaml.
Each entry point becomes an
OCaml function that takes <span class="c009">n</span>+1 arguments,
the extra implicit last argument being of type <span class="c003">Lexing.lexbuf</span>.
Characters are read from the <span class="c003">Lexing.lexbuf</span> argument and matched
against the regular expressions provided in the rule, until a prefix
of the input matches one of the rule. The corresponding action is
then evaluated and returned as the result of the function.</p><p>If several regular expressions match a prefix of the input, the
&#X201C;longest match&#X201D; rule applies: the regular expression that matches
the longest prefix of the input is selected. In case of tie, the
regular expression that occurs earlier in the rule is selected.</p><p>However, if lexer rules are introduced with the <span class="c003">shortest</span> keyword in
place of the <span class="c003">parse</span> keyword, then the &#X201C;shortest match&#X201D; rule applies:
the shortest prefix of the input is selected. In case of tie, the
regular expression that occurs earlier in the rule is still selected.
This feature is not intended for use in ordinary lexical analyzers, it
may facilitate the use of <span class="c003">ocamllex</span> as a simple text processing tool.</p>
<h3 class="subsection" id="sec320">13.2.4&#XA0;&#XA0;Regular expressions</h3>
<p>The regular expressions are in the style of <span class="c003">lex</span>, with a more
OCaml-like syntax.
</p><div class="syntax"><table class="display dcenter"><tr class="c019"><td class="dcell"><table class="c001 cellpading0"><tr><td class="c018">
<a class="syntax" id="regexp"><span class="c010">regexp</span></a></td><td class="c015">::=</td><td class="c017">
&#X2026;
</td></tr>
</table></td></tr>
</table></div><dl class="description"><dt class="dt-description"><span class="c004">'</span> <span class="c010">regular-char</span> &#X2223;  <a class="syntax" href="lex.html#escape-sequence"><span class="c010">escape-sequence</span></a> <span class="c004">'</span></dt><dd class="dd-description">
A character constant, with the same syntax as OCaml character
constants. Match the denoted character.</dd><dt class="dt-description"><span class="c006">_</span></dt><dd class="dd-description">
(underscore) Match any character.</dd><dt class="dt-description"><span class="c004">eof</span></dt><dd class="dd-description">
Match the end of the lexer input.<br>
<span class="c013">Note:</span> On some systems, with interactive input, an end-of-file
may be followed by more characters. However, <span class="c003">ocamllex</span> will not
correctly handle regular expressions that contain <span class="c003">eof</span> followed by
something else.</dd><dt class="dt-description"><span class="c004">"</span> { <a class="syntax" href="lex.html#string-character"><span class="c010">string-character</span></a> } <span class="c004">"</span></dt><dd class="dd-description">
A string constant, with the same syntax as OCaml string
constants. Match the corresponding sequence of characters.</dd><dt class="dt-description"><span class="c002"><span class="c003">[</span> <span class="c010">character-set</span> <span class="c003">]</span></span></dt><dd class="dd-description">
Match any single character belonging to the given
character set. Valid character sets are: single
character constants <span class="c002"><span class="c003">'</span> <span class="c010">c</span> <span class="c003">'</span></span>; ranges of characters
<span class="c004">'</span> <span class="c010">c</span><sub>1</sub> <span class="c002"><span class="c003">'</span> <span class="c003">-</span> <span class="c003">'</span></span> <span class="c010">c</span><sub>2</sub> <span class="c004">'</span> (all characters between <span class="c009">c</span><sub>1</sub> and <span class="c009">c</span><sub>2</sub>,
inclusive); and the union of two or more character sets, denoted by
concatenation.</dd><dt class="dt-description"><span class="c002"><span class="c003">[</span> <span class="c003">^</span> <span class="c010">character-set</span> <span class="c003">]</span></span></dt><dd class="dd-description">
Match any single character not belonging to the given character set.</dd><dt class="dt-description"><a class="syntax" href="#regexp"><span class="c010">regexp</span></a><sub>1</sub> <span class="c004">#</span> &#XA0;<a class="syntax" href="#regexp"><span class="c010">regexp</span></a><sub>2</sub></dt><dd class="dd-description">
(difference of character sets)
Regular expressions <a class="syntax" href="#regexp"><span class="c010">regexp</span></a><sub>1</sub> and <a class="syntax" href="#regexp"><span class="c010">regexp</span></a><sub>2</sub> must be character sets
defined with <span class="c004">[</span>&#X2026; <span class="c004">]</span> (or a single character expression or
underscore <span class="c003">_</span>).
Match the difference of the two specified character sets.</dd><dt class="dt-description"><a class="syntax" href="#regexp"><span class="c010">regexp</span></a> <span class="c004">*</span></dt><dd class="dd-description">
(repetition) Match the concatenation of zero or more
strings that match <a class="syntax" href="#regexp"><span class="c010">regexp</span></a>.</dd><dt class="dt-description"><a class="syntax" href="#regexp"><span class="c010">regexp</span></a> <span class="c004">+</span></dt><dd class="dd-description">
(strict repetition) Match the concatenation of one or more
strings that match <a class="syntax" href="#regexp"><span class="c010">regexp</span></a>.</dd><dt class="dt-description"><a class="syntax" href="#regexp"><span class="c010">regexp</span></a> <span class="c004">?</span></dt><dd class="dd-description">
(option) Match the empty string, or a string matching <a class="syntax" href="#regexp"><span class="c010">regexp</span></a>.</dd><dt class="dt-description"><a class="syntax" href="#regexp"><span class="c010">regexp</span></a><sub>1</sub> <span class="c004">|</span> &#XA0;<a class="syntax" href="#regexp"><span class="c010">regexp</span></a><sub>2</sub></dt><dd class="dd-description">
(alternative) Match any string that matches <a class="syntax" href="#regexp"><span class="c010">regexp</span></a><sub>1</sub> or <a class="syntax" href="#regexp"><span class="c010">regexp</span></a><sub>2</sub></dd><dt class="dt-description"><a class="syntax" href="#regexp"><span class="c010">regexp</span></a><sub>1</sub> &#XA0;<a class="syntax" href="#regexp"><span class="c010">regexp</span></a><sub>2</sub></dt><dd class="dd-description">
(concatenation) Match the concatenation of two strings, the first
matching <a class="syntax" href="#regexp"><span class="c010">regexp</span></a><sub>1</sub>, the second matching <a class="syntax" href="#regexp"><span class="c010">regexp</span></a><sub>2</sub>.</dd><dt class="dt-description"><span class="c004">(</span> <a class="syntax" href="#regexp"><span class="c010">regexp</span></a> <span class="c004">)</span></dt><dd class="dd-description">
Match the same strings as <a class="syntax" href="#regexp"><span class="c010">regexp</span></a>.</dd><dt class="dt-description"><a class="syntax" href="lex.html#ident"><span class="c010">ident</span></a></dt><dd class="dd-description">
Reference the regular expression bound to <a class="syntax" href="lex.html#ident"><span class="c010">ident</span></a> by an earlier
<span class="c004">let</span> <a class="syntax" href="lex.html#ident"><span class="c010">ident</span></a> <span class="c004">=</span> &#XA0;<a class="syntax" href="#regexp"><span class="c010">regexp</span></a> definition.</dd><dt class="dt-description"><a class="syntax" href="#regexp"><span class="c010">regexp</span></a> <span class="c004">as</span> &#XA0;<a class="syntax" href="lex.html#ident"><span class="c010">ident</span></a></dt><dd class="dd-description">
Bind the substring matched by <a class="syntax" href="#regexp"><span class="c010">regexp</span></a> to identifier <a class="syntax" href="lex.html#ident"><span class="c010">ident</span></a>.
</dd></dl><p>Concerning the precedences of operators, <span class="c003">#</span> has the highest precedence,
followed by <span class="c003">*</span>, <span class="c003">+</span> and <span class="c003">?</span>,
then concatenation, then <span class="c003">|</span> (alternation), then <span class="c003">as</span>.</p>
<h3 class="subsection" id="sec321">13.2.5&#XA0;&#XA0;Actions</h3>
<p>The actions are arbitrary OCaml expressions. They are evaluated in
a context where the identifiers defined by using the <span class="c003">as</span> construct
are bound to subparts of the matched string.
Additionally, <span class="c003">lexbuf</span> is bound to the current lexer
buffer. Some typical uses for <span class="c003">lexbuf</span>, in conjunction with the
operations on lexer buffers provided by the <span class="c003">Lexing</span> standard library
module, are listed below.</p><dl class="description"><dt class="dt-description">
<span class="c006">Lexing.lexeme lexbuf</span></dt><dd class="dd-description">
Return the matched string.</dd><dt class="dt-description"><span class="c006">Lexing.lexeme_char lexbuf </span><span class="c009">n</span></dt><dd class="dd-description">
Return the <span class="c009">n</span><sup><span class="c008">th</span></sup>
character in the matched string. The first character corresponds to <span class="c009">n</span> = 0.</dd><dt class="dt-description"><span class="c006">Lexing.lexeme_start lexbuf</span></dt><dd class="dd-description">
Return the absolute position in the input text of the beginning of the
matched string (i.e. the offset of the first character of the matched
string). The first character read from the input text has offset 0.</dd><dt class="dt-description"><span class="c006">Lexing.lexeme_end lexbuf</span></dt><dd class="dd-description">
Return the absolute position in the input text of the end of the
matched string (i.e. the offset of the first character after the
matched string). The first character read from the input text has
offset 0.</dd><dt class="dt-description"><span class="c013"><span class="c009">entrypoint</span> [<span class="c009">exp</span></span><sub>1</sub><span class="c013">&#X2026; <span class="c009">exp</span></span><sub><span class="c009">n</span></sub><span class="c013">] <span class="c003">lexbuf</span></span></dt><dd class="dd-description">
(Where <span class="c009">entrypoint</span> is the name of another entry point in the same
lexer definition.) Recursively call the lexer on the given entry point.
Notice that <span class="c003">lexbuf</span> is the last argument.
Useful for lexing nested comments, for example.</dd></dl>
<h3 class="subsection" id="sec322">13.2.6&#XA0;&#XA0;Variables in regular expressions</h3>
<p>
The <span class="c003">as</span> construct is similar to &#X201C;<em>groups</em>&#X201D; as provided by
numerous regular expression packages.
The type of these variables can be <span class="c003">string</span>, <span class="c003">char</span>, <span class="c003">string option</span>
or <span class="c003">char option</span>.</p><p>We first consider the case of linear patterns, that is the case when
all <span class="c003">as</span> bound variables are distinct.
In <a class="syntax" href="#regexp"><span class="c010">regexp</span></a> <span class="c004">as</span> &#XA0;<a class="syntax" href="lex.html#ident"><span class="c010">ident</span></a>, the type of <a class="syntax" href="lex.html#ident"><span class="c010">ident</span></a> normally is <span class="c003">string</span> (or
<span class="c003">string option</span>) except
when <a class="syntax" href="#regexp"><span class="c010">regexp</span></a> is a character constant, an underscore, a string
constant of length one, a character set specification, or an
alternation of those. Then, the type of <a class="syntax" href="lex.html#ident"><span class="c010">ident</span></a> is <span class="c003">char</span> (or <span class="c003">char option</span>).
Option types are introduced when overall rule matching does not
imply matching of the bound sub-pattern. This is in particular the
case of <span class="c004">(</span> <a class="syntax" href="#regexp"><span class="c010">regexp</span></a> <span class="c004">as</span> &#XA0;<a class="syntax" href="lex.html#ident"><span class="c010">ident</span></a> <span class="c002"><span class="c003">)</span> <span class="c003">?</span></span> and of
<a class="syntax" href="#regexp"><span class="c010">regexp</span></a><sub>1</sub> <span class="c002"><span class="c003">|</span> <span class="c003">(</span></span> &#XA0;<a class="syntax" href="#regexp"><span class="c010">regexp</span></a><sub>2</sub> <span class="c004">as</span> &#XA0;<a class="syntax" href="lex.html#ident"><span class="c010">ident</span></a> <span class="c004">)</span>.</p><p>There is no linearity restriction over <span class="c003">as</span> bound variables.
When a variable is bound more than once, the previous rules are to be
extended as follows:
</p><ul class="itemize"><li class="li-itemize">
A variable is a <span class="c003">char</span> variable when all its occurrences bind
<span class="c003">char</span> occurrences in the previous sense.
</li><li class="li-itemize">A variable is an <span class="c003">option</span> variable when the overall expression
can be matched without binding this variable.
</li></ul><p>
For instance, in
<span class="c003">('a' as x) | ( 'a' (_ as x) )</span> the variable <span class="c003">x</span> is of type
<span class="c003">char</span>, whereas in
<span class="c003">("ab" as x) | ( 'a' (_ as x) ? )</span> the variable <span class="c003">x</span> is of type
<span class="c003">string option</span>.</p><p>In some cases, a successful match may not yield a unique set of bindings.
For instance the matching of <code>aba</code> by the regular expression
<span class="c003">(('a'|"ab") as x) (("ba"|'a') as y)</span> may result in binding
either
<code>x</code> to <code>"ab"</code> and <code>y</code> to <code>"a"</code>, or
<code>x</code> to <code>"a"</code> and <code>y</code> to <code>"ba"</code>.
The automata produced <span class="c003">ocamllex</span> on such ambiguous regular
expressions will select one of the possible resulting sets of
bindings.
The selected set of bindings is purposely left unspecified.</p>
<h3 class="subsection" id="sec323">13.2.7&#XA0;&#XA0;Refill handlers</h3>
<p>
<a id="ss:refill-handlers"></a></p><p>By default, when ocamllex reaches the end of its lexing buffer, it
will silently call the <span class="c003">refill_buff</span> function of <span class="c003">lexbuf</span> structure
and continue lexing. It is sometimes useful to be able to take control
of refilling action; typically, if you use a library for asynchronous
computation, you may want to wrap the refilling action in a delaying
function to avoid blocking synchronous operations.</p><p>Since OCaml 4.02, it is possible to specify a <span class="c009">refill-handler</span>,
a function that will be called when refill happens. It is passed the
continuation of the lexing, on which it has total control. The OCaml
expression used as refill action should have a type that is an
instance of
</p><pre>   (Lexing.lexbuf -&gt; 'a) -&gt; Lexing.lexbuf -&gt; 'a
</pre><p>where the first argument is the continuation which captures the
processing ocamllex would usually perform (refilling the buffer, then
calling the lexing function again), and the result type that
instantiates [&#X2019;a] should unify with the result type of all lexing
rules.</p><p>As an example, consider the following lexer that is parametrized over
an arbitrary monad:
</p><pre>{
type token = EOL | INT of int | PLUS

module Make (M : sig
               type 'a t
               val return: 'a -&gt; 'a t
               val bind: 'a t -&gt; ('a -&gt; 'b t) -&gt; 'b t
               val fail : string -&gt; 'a t

               (* Set up lexbuf *)
               val on_refill : Lexing.lexbuf -&gt; unit t
             end)
= struct

let refill_handler k lexbuf =
    M.bind (M.on_refill lexbuf) (fun () -&gt; k lexbuf)

}

refill {refill_handler}

rule token = parse
| [' ' '\t']
    { token lexbuf }
| '\n'
    { M.return EOL }
| ['0'-'9']+ as i
    { M.return (INT (int_of_string i)) }
| '+'
    { M.return PLUS }
| _
    { M.fail "unexpected character" }
{
end
}
</pre>
<h3 class="subsection" id="sec324">13.2.8&#XA0;&#XA0;Reserved identifiers</h3>
<p>All identifiers starting with <span class="c003">__ocaml_lex</span> are reserved for use by
<span class="c003">ocamllex</span>; do not use any such identifier in your programs.</p>
<h2 class="section" id="sec325">13.3&#XA0;&#XA0;Overview of <span class="c003">ocamlyacc</span></h2>
<p>The <span class="c003">ocamlyacc</span> command produces a parser from a context-free grammar
specification with attached semantic actions, in the style of <span class="c003">yacc</span>.
Assuming the input file is <span class="c009">grammar</span><span class="c003">.mly</span>, executing
</p><pre>
        ocamlyacc <span class="c009">options grammar</span>.mly
</pre><p>
produces OCaml code for a parser in the file <span class="c009">grammar</span><span class="c003">.ml</span>,
and its interface in file <span class="c009">grammar</span><span class="c003">.mli</span>.</p><p>The generated module defines one parsing function per entry point in
the grammar. These functions have the same names as the entry points.
Parsing functions take as arguments a lexical analyzer (a function
from lexer buffers to tokens) and a lexer buffer, and return the
semantic attribute of the corresponding entry point. Lexical analyzer
functions are usually generated from a lexer specification by the
<span class="c003">ocamllex</span> program. Lexer buffers are an abstract data type
implemented in the standard library module <span class="c003">Lexing</span>. Tokens are values from
the concrete type <span class="c003">token</span>, defined in the interface file
<span class="c009">grammar</span><span class="c003">.mli</span> produced by <span class="c003">ocamlyacc</span>.</p>
<h2 class="section" id="sec326">13.4&#XA0;&#XA0;Syntax of grammar definitions</h2>
<p>Grammar definitions have the following format:
</p><pre>
%{
  <span class="c009">header</span>
%}
  <span class="c009">declarations</span>
%%
  <span class="c009">rules</span>
%%
  <span class="c009">trailer</span>
</pre><p>Comments are enclosed between <code>/*</code> and <code>*/</code> (as in C) in the
&#X201C;declarations&#X201D; and &#X201C;rules&#X201D; sections, and between <code>(*</code> and
<code>*)</code> (as in OCaml) in the &#X201C;header&#X201D; and &#X201C;trailer&#X201D; sections.</p>
<h3 class="subsection" id="sec327">13.4.1&#XA0;&#XA0;Header and trailer</h3>
<p>The header and the trailer sections are OCaml code that is copied
as is into file <span class="c009">grammar</span><span class="c003">.ml</span>. Both sections are optional. The header
goes at the beginning of the output file; it usually contains
<span class="c003">open</span> directives and auxiliary functions required by the semantic
actions of the rules. The trailer goes at the end of the output file.</p>
<h3 class="subsection" id="sec328">13.4.2&#XA0;&#XA0;Declarations</h3>
<p>Declarations are given one per line. They all start with a <code>%</code> sign.</p><dl class="description"><dt class="dt-description"><span class="c004">%token</span> <a class="syntax" href="names.html#constr"><span class="c010">constr</span></a> &#X2026; &#XA0;<a class="syntax" href="names.html#constr"><span class="c010">constr</span></a></dt><dd class="dd-description">
Declare the given symbols <a class="syntax" href="names.html#constr"><span class="c010">constr</span></a> &#X2026; &#XA0;<a class="syntax" href="names.html#constr"><span class="c010">constr</span></a>
as tokens (terminal symbols). These symbols
are added as constant constructors for the <span class="c003">token</span> concrete type.</dd><dt class="dt-description"><span class="c002"><span class="c003">%token</span> <span class="c003">&lt;</span></span> <a class="syntax" href="types.html#typexpr"><span class="c010">typexpr</span></a> <span class="c004">&gt;</span> &#XA0;<a class="syntax" href="names.html#constr"><span class="c010">constr</span></a> &#X2026; &#XA0;<a class="syntax" href="names.html#constr"><span class="c010">constr</span></a></dt><dd class="dd-description">
Declare the given symbols <a class="syntax" href="names.html#constr"><span class="c010">constr</span></a> &#X2026; &#XA0;<a class="syntax" href="names.html#constr"><span class="c010">constr</span></a> as tokens with an
attached attribute of the
given type. These symbols are added as constructors with arguments of
the given type for the <span class="c003">token</span> concrete type. The <a class="syntax" href="types.html#typexpr"><span class="c010">typexpr</span></a> part is
an arbitrary OCaml type expression, except that all type
constructor names must be fully qualified (e.g. <span class="c003">Modname.typename</span>)
for all types except standard built-in types, even if the proper
<code>open</code> directives (e.g. <code>open Modname</code>) were given in the
header section. That&#X2019;s because the header is copied only to the <span class="c003">.ml</span>
output file, but not to the <span class="c003">.mli</span> output file, while the <a class="syntax" href="types.html#typexpr"><span class="c010">typexpr</span></a> part
of a <code>%token</code> declaration is copied to both.</dd><dt class="dt-description"><span class="c004">%start</span> <span class="c010">symbol</span> &#X2026; &#XA0;<span class="c010">symbol</span></dt><dd class="dd-description">
Declare the given symbols as entry points for the grammar. For each
entry point, a parsing function with the same name is defined in the
output module. Non-terminals that are not declared as entry points
have no such parsing function. Start symbols must be given a type with
the <code>%type</code> directive below.</dd><dt class="dt-description"><span class="c002"><span class="c003">%type</span> <span class="c003">&lt;</span></span> <a class="syntax" href="types.html#typexpr"><span class="c010">typexpr</span></a> <span class="c004">&gt;</span> &#XA0;<span class="c010">symbol</span> &#X2026; &#XA0;<span class="c010">symbol</span></dt><dd class="dd-description">
Specify the type of the semantic attributes for the given symbols.
This is mandatory for start symbols only. Other nonterminal symbols
need not be given types by hand: these types will be inferred when
running the output files through the OCaml compiler (unless the
<code>-s</code> option is in effect). The <a class="syntax" href="types.html#typexpr"><span class="c010">typexpr</span></a> part is an arbitrary OCaml
type expression, except that all type constructor names must be
fully qualified, as explained above for <span class="c003">%token</span>.</dd><dt class="dt-description"><span class="c004">%left</span> <span class="c010">symbol</span> &#X2026; &#XA0;<span class="c010">symbol</span></dt><dd class="dd-description">
</dd><dt class="dt-description"><span class="c004">%right</span> <span class="c010">symbol</span> &#X2026; &#XA0;<span class="c010">symbol</span></dt><dd class="dd-description">
</dd><dt class="dt-description"><span class="c004">%nonassoc</span> <span class="c010">symbol</span> &#X2026; &#XA0;<span class="c010">symbol</span></dt><dd class="dd-description"><p>Associate precedences and associativities to the given symbols. All
symbols on the same line are given the same precedence. They have
higher precedence than symbols declared before in a <code>%left</code>,
<code>%right</code> or <code>%nonassoc</code> line. They have lower precedence
than symbols declared after in a <code>%left</code>, <code>%right</code> or
<code>%nonassoc</code> line. The symbols are declared to associate to the
left (<code>%left</code>), to the right (<code>%right</code>), or to be
non-associative (<code>%nonassoc</code>). The symbols are usually tokens.
They can also be dummy nonterminals, for use with the <code>%prec</code>
directive inside the rules.</p><p>The precedence declarations are used in the following way to
resolve reduce/reduce and shift/reduce conflicts:
</p><ul class="itemize"><li class="li-itemize">
Tokens and rules have precedences. By default, the precedence
of a rule is the precedence of its rightmost terminal. You
can override this default by using the <span class="c004">%prec</span> directive in the rule.
</li><li class="li-itemize">A reduce/reduce conflict
is resolved in favor of the first rule (in the order given by the
source file), and <span class="c003">ocamlyacc</span> outputs a warning.
</li><li class="li-itemize">A shift/reduce conflict
is resolved by comparing the precedence of the rule to be
reduced with the precedence of the token to be shifted. If the
precedence of the rule is higher, then the rule will be reduced;
if the precedence of the token is higher, then the token will
be shifted.
</li><li class="li-itemize">A shift/reduce conflict between a rule and a token with the
same precedence will be resolved using the associativity: if the
token is left-associative, then the parser will reduce; if the
token is right-associative, then the parser will shift. If the
token is non-associative, then the parser will declare a syntax
error.
</li><li class="li-itemize">When a shift/reduce conflict cannot be resolved using the above
method, then <span class="c003">ocamlyacc</span> will output a warning and the parser will
always shift.
</li></ul></dd></dl>
<h3 class="subsection" id="sec329">13.4.3&#XA0;&#XA0;Rules</h3>
<p>The syntax for rules is as usual:
</p><pre>
<span class="c009">nonterminal</span> :
    <span class="c009">symbol</span> &#X2026; <span class="c009">symbol</span> { <span class="c009">semantic-action</span> }
  | &#X2026;
  | <span class="c009">symbol</span> &#X2026; <span class="c009">symbol</span> { <span class="c009">semantic-action</span> }
;
</pre><p>
Rules can also contain the <code>%prec </code><span class="c009">symbol</span> directive in the
right-hand side part, to override the default precedence and
associativity of the rule with the precedence and associativity of the
given symbol.</p><p>Semantic actions are arbitrary OCaml expressions, that
are evaluated to produce the semantic attribute attached to
the defined nonterminal. The semantic actions can access the
semantic attributes of the symbols in the right-hand side of
the rule with the <code>$</code> notation: <code>$1</code> is the attribute for the
first (leftmost) symbol, <code>$2</code> is the attribute for the second
symbol, etc.</p><p>The rules may contain the special symbol <span class="c003">error</span> to indicate
resynchronization points, as in <span class="c003">yacc</span>.</p><p>Actions occurring in the middle of rules are not supported.</p><p>Nonterminal symbols are like regular OCaml symbols, except that they
cannot end with <span class="c003">'</span> (single quote).</p>
<h3 class="subsection" id="sec330">13.4.4&#XA0;&#XA0;Error handling</h3>
<p>Error recovery is supported as follows: when the parser reaches an
error state (no grammar rules can apply), it calls a function named
<span class="c003">parse_error</span> with the string <span class="c003">"syntax error"</span> as argument. The default
<span class="c003">parse_error</span> function does nothing and returns, thus initiating error
recovery (see below). The user can define a customized <span class="c003">parse_error</span>
function in the header section of the grammar file.</p><p>The parser also enters error recovery mode if one of the grammar
actions raises the <span class="c003">Parsing.Parse_error</span> exception.</p><p>In error recovery mode, the parser discards states from the
stack until it reaches a place where the error token can be shifted.
It then discards tokens from the input until it finds three successive
tokens that can be accepted, and starts processing with the first of
these. If no state can be uncovered where the error token can be
shifted, then the parser aborts by raising the <span class="c003">Parsing.Parse_error</span>
exception.</p><p>Refer to documentation on <span class="c003">yacc</span> for more details and guidance in how
to use error recovery.</p>
<h2 class="section" id="sec331">13.5&#XA0;&#XA0;Options</h2>
<p>The <span class="c003">ocamlyacc</span> command recognizes the following options:</p><dl class="description"><dt class="dt-description"><span class="c013"><span class="c003">-b</span><span class="c009">prefix</span></span></dt><dd class="dd-description">
Name the output files <span class="c009">prefix</span><span class="c003">.ml</span>, <span class="c009">prefix</span><span class="c003">.mli</span>,
<span class="c009">prefix</span><span class="c003">.output</span>, instead of the default naming convention.</dd><dt class="dt-description"><span class="c006">-q</span></dt><dd class="dd-description">
This option has no effect.</dd><dt class="dt-description"><span class="c006">-v</span></dt><dd class="dd-description">
Generate a description of the parsing tables and a report on conflicts
resulting from ambiguities in the grammar. The description is put in
file <span class="c009">grammar</span><span class="c003">.output</span>.</dd><dt class="dt-description"><span class="c006">-version</span></dt><dd class="dd-description">
Print version string and exit.</dd><dt class="dt-description"><span class="c006">-vnum</span></dt><dd class="dd-description">
Print short version number and exit.</dd><dt class="dt-description"><span class="c006">-</span></dt><dd class="dd-description">
Read the grammar specification from standard input. The default
output file names are <span class="c003">stdin.ml</span> and <span class="c003">stdin.mli</span>.</dd><dt class="dt-description"><span class="c013"><span class="c003">--</span> <span class="c009">file</span></span></dt><dd class="dd-description">
Process <span class="c009">file</span> as the grammar specification, even if its name
starts with a dash (-) character. This option must be the last on the
command line.</dd></dl><p>At run-time, the <span class="c003">ocamlyacc</span>-generated parser can be debugged by
setting the <span class="c003">p</span> option in the <span class="c003">OCAMLRUNPARAM</span> environment variable
(see section&#XA0;<a href="runtime.html#ocamlrun-options">11.2</a>). This causes the pushdown
automaton executing the parser to print a trace of its action (tokens
shifted, rules reduced, etc). The trace mentions rule numbers and
state numbers that can be interpreted by looking at the file
<span class="c009">grammar</span><span class="c003">.output</span> generated by <span class="c003">ocamlyacc -v</span>.</p>
<h2 class="section" id="sec332">13.6&#XA0;&#XA0;A complete example</h2>
<p>The all-time favorite: a desk calculator. This program reads
arithmetic expressions on standard input, one per line, and prints
their values. Here is the grammar definition:
</p><pre>        /* File parser.mly */
        %token &lt;int&gt; INT
        %token PLUS MINUS TIMES DIV
        %token LPAREN RPAREN
        %token EOL
        %left PLUS MINUS        /* lowest precedence */
        %left TIMES DIV         /* medium precedence */
        %nonassoc UMINUS        /* highest precedence */
        %start main             /* the entry point */
        %type &lt;int&gt; main
        %%
        main:
            expr EOL                { $1 }
        ;
        expr:
            INT                     { $1 }
          | LPAREN expr RPAREN      { $2 }
          | expr PLUS expr          { $1 + $3 }
          | expr MINUS expr         { $1 - $3 }
          | expr TIMES expr         { $1 * $3 }
          | expr DIV expr           { $1 / $3 }
          | MINUS expr %prec UMINUS { - $2 }
        ;
</pre><p>Here is the definition for the corresponding lexer:
</p><pre>        (* File lexer.mll *)
        {
        open Parser        (* The type token is defined in parser.mli *)
        exception Eof
        }
        rule token = parse
            [' ' '\t']     { token lexbuf }     (* skip blanks *)
          | ['\n' ]        { EOL }
          | ['0'-'9']+ as lxm { INT(int_of_string lxm) }
          | '+'            { PLUS }
          | '-'            { MINUS }
          | '*'            { TIMES }
          | '/'            { DIV }
          | '('            { LPAREN }
          | ')'            { RPAREN }
          | eof            { raise Eof }
</pre><p>Here is the main program, that combines the parser with the lexer:
</p><pre>        (* File calc.ml *)
        let _ =
          try
            let lexbuf = Lexing.from_channel stdin in
            while true do
              let result = Parser.main Lexer.token lexbuf in
                print_int result; print_newline(); flush stdout
            done
          with Lexer.Eof -&gt;
            exit 0
</pre><p>To compile everything, execute:
</p><pre>        ocamllex lexer.mll       # generates lexer.ml
        ocamlyacc parser.mly     # generates parser.ml and parser.mli
        ocamlc -c parser.mli
        ocamlc -c lexer.ml
        ocamlc -c parser.ml
        ocamlc -c calc.ml
        ocamlc -o calc lexer.cmo parser.cmo calc.cmo
</pre>
<h2 class="section" id="sec333">13.7&#XA0;&#XA0;Common errors</h2>
<dl class="description"><dt class="dt-description"><span class="c013">ocamllex: transition table overflow, automaton is too big</span></dt><dd class="dd-description"><p>The deterministic automata generated by <span class="c003">ocamllex</span> are limited to at
most 32767 transitions. The message above indicates that your lexer
definition is too complex and overflows this limit. This is commonly
caused by lexer definitions that have separate rules for each of the
alphabetic keywords of the language, as in the following example.
</p><pre>rule token = parse
  "keyword1"   { KWD1 }
| "keyword2"   { KWD2 }
| ...
| "keyword100" { KWD100 }
| ['A'-'Z' 'a'-'z'] ['A'-'Z' 'a'-'z' '0'-'9' '_'] * as id
               { IDENT id}
</pre><p>To keep the generated automata small, rewrite those definitions with
only one general &#X201C;identifier&#X201D; rule, followed by a hashtable lookup
to separate keywords from identifiers:
</p><pre>{ let keyword_table = Hashtbl.create 53
  let _ =
    List.iter (fun (kwd, tok) -&gt; Hashtbl.add keyword_table kwd tok)
              [ "keyword1", KWD1;
                "keyword2", KWD2; ...
                "keyword100", KWD100 ]
}
rule token = parse
  ['A'-'Z' 'a'-'z'] ['A'-'Z' 'a'-'z' '0'-'9' '_'] * as id
               { try
                   Hashtbl.find keyword_table id
                 with Not_found -&gt;
                   IDENT id }
</pre></dd><dt class="dt-description"><span class="c013">ocamllex: Position memory overflow, too many bindings</span></dt><dd class="dd-description">
The deterministic automata generated by <span class="c003">ocamllex</span> maintain a table of
positions inside the scanned lexer buffer. The size of this table is
limited to at most 255 cells. This error should not show up in normal
situations.</dd></dl>
<hr>
<a href="native.html"><img src="previous_motif.svg" alt="Previous"></a>
<a href="index.html"><img src="contents_motif.svg" alt="Up"></a>
<a href="depend.html"><img src="next_motif.svg" alt="Next"></a>
</body>
</html>