Sophie

Sophie

distrib > Mandriva > 9.1 > ppc > by-pkgid > b9ba69a436161613d8fb030c8c726a8e > files > 442

spirit-1.5.1-2mdk.noarch.rpm

<html>
<head>
<title>The Scanner and Parsing</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<link rel="stylesheet" href="theme/style.css" type="text/css">
</head>

<body>
<table width="100%" border="0" background="theme/bkd2.gif" cellspacing="2">
  <tr> 
    <td width="10"> 
    </td>
    <td width="85%"> 
      <font size="6" face="Verdana, Arial, Helvetica, sans-serif"><b>The Scanner and Parsing</b></font>
    </td>
    <td width="112"><a href="http://spirit.sf.net"><img src="theme/spirit.gif" width="112" height="48" align="right" border="0"></a></td>
  </tr>
</table>
<br>
<table border="0">
  <tr>
    <td width="10"></td>
    <td width="30"><a href="../index.html"><img src="theme/u_arr.gif" border="0"></a></td>
    <td width="30"><a href="directives.html"><img src="theme/l_arr.gif" border="0"></a></td>
    <td width="20"><a href="grammar.html"><img src="theme/r_arr.gif" border="0"></a></td>
   </tr>
</table>
<p>The <b>scanner</b>'s task is to feed the sequential input data stream to the 
  parser. The scanner extracts data from the input, parceling, potentially modifying 
  or filtering, and then finally relegating the result to individual parser elements 
  on demand until the input is exhausted. The scanner is composed of two STL conforming 
  forward iterators, first and last, where first is held by reference and last, 
  by value. The first iterator is held by reference to allow it to be re-positioned. 
  The following diagram illustrates what's happening:</p>
<table width="62%" border="0" align="center">
  <tr>
    <td><img src="theme/scanner1.png"></td>
  </tr>
</table>
<p>The scanner manages various aspects of the parsing process through a set of 
  policies. There are three sets of policies that govern:</p>
<blockquote>
  <p><img src="theme/bullet.gif" width="12" height="12"> Iteration and filtering<br>
    <img src="theme/bullet.gif" width="12" height="12"> Recognition and matching<br>
    <img src="theme/bullet.gif" width="12" height="12"> Handling semantic actions</p>
</blockquote>
<p>These policies are mostly hidden from view and users generally need not know 
  about them. Advanced users might however provide their own policies that override 
  the ones that are already in place in order to fine tune the parsing process 
  to fit her own needs. We shall see how this can be done. This will be covered 
  in further detail later.</p>
<p>The <tt>scanner</tt> is a template class expecting two parameters: <tt>IteratorT</tt>, 
  the iterator type and <tt>PoliciesT</tt>, its set of policies. <tt>IteratorT</tt> 
  defaults to <tt>char const*</tt> while <tt>PoliciesT</tt> defaults to <tt>scanner_policies&lt;&gt;</tt>, 
  a predefined set of scanner policies that we can use straight out of the box.</p>
<pre><code><font color="#000000"><span class=keyword>    template
    </span><span class=special>&lt;
        </span><span class=keyword>typename </span><span class=identifier>IteratorT  </span><span class=special>= </span><span class=keyword>char </span><span class=keyword>const</span><span class=special>*,
        </span><span class=keyword>typename </span><span class=identifier>PoliciesT  </span><span class=special>= </span><span class=identifier>scanner_policies</span><span class=special>&lt;&gt;
    </span><span class=special>&gt;
    </span><span class=keyword>class </span><span class=identifier>scanner</span><span class=special>;</span></font></code></pre>
<p>Spirit uses the same iterator concepts and interface formally defined by the 
  C++ Standard Template Library (STL). We can use iterators supplied by STL's 
  containers (e.g. <tt>list</tt>, <tt>vector</tt>, <tt>string</tt>, etc.) as is, 
  or perhaps write our own. Iterators can be as simple as a pointer (e.g. <tt>char 
  const<span class="operators">*</span></tt>). At the other end of the spectrum, 
  iterators can be quite complex; for instance, an iterator adapter that wraps 
  a lexer such as LEX.</p>
<table width="80%" border="0" align="center">
  <tr>
    <td class="note_box"><img src="theme/lens.gif" width="15" height="16"> <b>Forward 
      iterators</b><br>
      <br>
      In general, the <tt>scanner</tt> expects at least a standard conforming 
      forward iterator. Forward iterators are needed for backtracking where the 
      iterator needs to be saved and restored later. Generally speaking, Spirit 
      is a backtracking parser. The implication of this is that at some point, 
      the iterator position will have to be saved to allow the parser to backtrack 
      to a previous point. Thus, for backtracking to work, the framework requires 
      at least a forward iterator.<br>
      <br>
      Some parsers might require more specialized iterators (bi-directional or 
      even random access). Perhaps in the future, deterministic parsers when added 
      to the framework, will perform no backtracking and will need just a single 
      token lookahead, hence will require input iterators only.</td>
  </tr>
</table>
<h2>The Free Parse Functions</h2>
<p>The framework provides a couple of free functions to make parsing a snap. These 
  parser functions have two forms. The first form works on the <b>character level</b>. 
  The second works on the <b>phrase level</b> and asks for a <b>skip parser</b>.</p>
<p>The <b>skip parser</b> is just about any parser primitive or composite. Its 
  purpose is to move the scanner's <tt>first</tt> iterator to valid tokens by 
  skipping what are considered white spaces. In C for instance, the tab <tt class="quotes">'\t'</tt>, 
  the newline <tt class="quotes">'\n'</tt>, return <tt><span class="quotes">'\r'</span></tt>, 
  space <tt class="quotes">' '</tt> and characters inside comments <tt class="quotes">/*...*/</tt> 
  are considered as white spaces.</p>
<p><b>Character level parsing</b></p>
<pre><code><font color="#000000"><span class=special>    </span><span class=keyword>template </span><span class=special>&lt;</span><span class=keyword>typename </span><span class=identifier>IteratorT</span><span class=special>, </span><span class=keyword>typename </span><span class=identifier>DerivedT</span><span class=special>&gt;
    </span><span class=identifier>parse_info</span><span class=special>&lt;</span><span class=identifier>IteratorT</span><span class=special>&gt;
    </span><span class=identifier>parse
    </span><span class=special>(
        </span><span class=identifier>IteratorT </span><span class=keyword>const</span><span class=special>&        </span><span class=identifier>first</span><span class=special>,
        </span><span class=identifier>IteratorT </span><span class=keyword>const</span><span class=special>&        </span><span class=identifier>last</span><span class=special>,
        </span><span class=identifier>parser</span><span class=special>&lt;</span><span class=identifier>DerivedT</span><span class=special>&gt; </span><span class=keyword>const</span><span class=special>& </span><span class=identifier>p
    </span><span class=special>);</span></font></code></pre>
<pre><code><font color="#000000"><span class=special>    </span><span class=keyword>template </span><span class=special>&lt;</span><span class=keyword>typename </span><span class=identifier>CharT</span><span class=special>, </span><span class=keyword>typename </span><span class=identifier>DerivedT</span><span class=special>&gt;
    </span><span class=identifier>parse_info</span><span class=special>&lt;</span><span class=identifier>CharT </span><span class=keyword>const</span><span class=special>*&gt;
    </span><span class=identifier>parse
    </span><span class=special>(
        </span><span class=identifier>CharT </span><span class=keyword>const</span><span class=special>*            </span><span class=identifier>str</span><span class=special>,
        </span><span class=identifier>parser</span><span class=special>&lt;</span><span class=identifier>DerivedT</span><span class=special>&gt; </span><span class=keyword>const</span><span class=special>& </span><span class=identifier>p
    </span><span class=special>);</span></font></code></pre>
<p>There are two variants. The first variant accepts a <tt>first</tt>, <tt>last</tt> 
  iterator pair like you do STL algorithms. The second variant accepts a null 
  terminated string. The last argument is a parser <tt>p</tt> which will be used 
  to parse the input.</p>
<p><b>Phrase level parsing</b></p>
<pre><code><font color="#000000"><span class=special>    </span><span class=keyword>template </span><span class=special>&lt;</span><span class=keyword>typename </span><span class=identifier>IteratorT</span><span class=special>, </span><span class=keyword>typename </span><span class=identifier>ParserT</span><span class=special>, </span><span class=keyword>typename </span><span class=identifier>SkipT</span><span class=special>&gt;
    </span><span class=identifier>parse_info</span><span class=special>&lt;</span><span class=identifier>IteratorT</span><span class=special>&gt;
    </span><span class=identifier>parse
    </span><span class=special>(
        </span><span class=identifier>IteratorT </span><span class=keyword>const</span><span class=special>&        </span><span class=identifier>first</span><span class=special>,
        </span><span class=identifier>IteratorT </span><span class=keyword>const</span><span class=special>&        </span><span class=identifier>last</span><span class=special>,
        </span><span class=identifier>parser</span><span class=special>&lt;</span><span class=identifier>ParserT</span><span class=special>&gt; </span><span class=keyword>const</span><span class=special>&  </span><span class=identifier>p</span><span class=special>,
        </span><span class=identifier>parser</span><span class=special>&lt;</span><span class=identifier>SkipT</span><span class=special>&gt; </span><span class=keyword>const</span><span class=special>&    </span><span class=identifier>skip
    </span><span class=special>);</span></font></code></pre>
<pre><code><font color="#000000"><span class=special>    </span><span class=keyword>template </span><span class=special>&lt;</span><span class=keyword>typename </span><span class=identifier>CharT</span><span class=special>, </span><span class=keyword>typename </span><span class=identifier>ParserT</span><span class=special>, </span><span class=keyword>typename </span><span class=identifier>SkipT</span><span class=special>&gt;
    </span><span class=identifier>parse_info</span><span class=special>&lt;</span><span class=identifier>CharT </span><span class=keyword>const</span><span class=special>*&gt;
    </span><span class=identifier>parse
    </span><span class=special>(
        </span><span class=identifier>CharT </span><span class=keyword>const</span><span class=special>*            </span><span class=identifier>str</span><span class=special>,
        </span><span class=identifier>parser</span><span class=special>&lt;</span><span class=identifier>ParserT</span><span class=special>&gt; </span><span class=keyword>const</span><span class=special>&  </span><span class=identifier>p</span><span class=special>,
        </span><span class=identifier>parser</span><span class=special>&lt;</span><span class=identifier>SkipT</span><span class=special>&gt; </span><span class=keyword>const</span><span class=special>&    </span><span class=identifier>skip
    </span><span class=special>);</span></font></code></pre>
<p>Like above, there are two variants. The first variant accepts a <tt>first</tt>, 
  <tt>last</tt> iterator pair like you do STL algorithms. The second variant accepts 
  a null terminated string. The argument <tt>p</tt> is the parser which will be 
  used to parse the input. The last argument <tt>skip</tt> is the skip parser.</p>
<p><b>The parse_info structure</b></p>
<p>The functions above return a <tt>parse_info</tt> structure parameterized by 
  the iterator type passed in. The parse_info struct has these members:</p>
<table width="90%" border="0" align="center">
  <tr> 
    <td colspan="2" class="table_title"><b>parse_info</b></td>
  </tr>
  <tr> 
    <td width="14%" class="table_cells"><b>stop</b></td>
    <td width="86%" class="table_cells">Points to the final parse position (i.e 
      The parser recognized and processed the input up to this point)</td>
  </tr>
  <tr> 
    <td width="14%" class="table_cells"><b>hit</b></td>
    <td width="86%" class="table_cells">True if parsing is successful. This may 
      be full: the parser consumed all the input, or partial: the parser consumed 
      only a portion of the input.</td>
  </tr>
  <tr> 
    <td width="14%" class="table_cells"><b>full</b></td>
    <td width="86%" class="table_cells">True when we have a full match (i.e The 
      parser consumed all the input).</td>
  </tr>
  <tr> 
    <td width="14%" class="table_cells"><b>length</b></td>
    <td width="86%" class="table_cells">The number of characters consumed by the 
      parser. This is valid only if we have a successful match (either partial 
      or full). </td>
  </tr>
</table>
<h2><img src="theme/lens.gif" width="15" height="16"> Direct parsing with Iterators</h2>
<p>On some occassions, we may wish to go low-level and call the parser's parse 
  member function directly. Since we have the free parse functions to begin with, 
  the truth is, we might not even need to know about these scanner intricacies. 
  We might come across it, however, sometime in the future when we need to get 
  under the hood. So, it's nice that we know what we are dealing with when that 
  need comes. </p>
<p>If we wish to work on the <b>character level</b>, the procedure is quite simple:</p>
<pre><span class=identifier>    </span><span class=identifier>scanner</span><span class=special>&lt;</span><span class=identifier>IteratorT</span><span class=special>&gt; </span><span class=identifier>scan</span><span class=special>(</span><span class=identifier>first</span><span class=special>, </span><span class=identifier>last</span><span class=special>);

    </span><span class=keyword>if </span><span class=special>(</span><span class=identifier>p</span><span class=special>.</span><span class=identifier>parse</span><span class=special>(</span><span class=identifier>scan</span><span class=special>))
    </span><span class=special>{
        </span><span class=comment>//  Parsed successfully. If first == last, then we have
        //  a full parse, the parser recognized the input in whole.
    </span><span class=special>}
    </span><span class=keyword>else
    </span><span class=special>{
        </span><span class=comment>//  Parsing failure. The parser failed to recognize the input
    </span><span class=special>}</span></pre>
<p>Where <tt>p</tt> is the parser we want to use, and <tt>first</tt>/<tt>last</tt> 
  are the iterator pairs referring to the input. We just create a scanner given 
  the iterators. The scanner type we will use here uses the default <tt>scanner_policies&lt;&gt;</tt>.</p>
<p>The situation is a bit more complex when we wish to work on the <b>phrase level</b>:</p>
<pre><span class=special>    </span><span class=keyword>typedef </span><span class=identifier>skip_parser_iteration_policy</span><span class=special>&lt;</span><span class=identifier>SkipT</span><span class=special>&gt; </span><span class=identifier>iter_policy_t</span><span class=special>;
    </span><span class=keyword>typedef </span><span class=identifier>scanner_policies</span><span class=special>&lt;</span><span class=identifier>iter_policy_t</span><span class=special>&gt; </span><span class=identifier>scanner_policies_t</span><span class=special>;
    </span><span class=keyword>typedef </span><span class=identifier>scanner</span><span class=special>&lt;</span><span class=identifier>IteratorT</span><span class=special>, </span><span class=identifier>scanner_policies_t</span><span class=special>&gt; </span><span class=identifier>scanner_t</span><span class=special>;

</span><span class=special>    </span><span class=identifier>iter_policy_t </span><span class=identifier>iter_policy</span><span class=special>(</span><span class=identifier>skip</span><span class=special>);
    </span><span class=identifier>scanner_policies_t </span><span class=identifier>policies</span><span class=special>(</span><span class=identifier>iter_policy</span><span class=special>);
    </span><span class=identifier>scanner_t </span><span class=identifier>scan</span><span class=special>(</span><span class=identifier>first</span><span class=special>, </span><span class=identifier>last</span><span class=special>, </span><span class=identifier>policies</span><span class=special>);
</span>
    <span class=keyword>if </span><span class=special>(</span><span class=identifier>p</span><span class=special>.</span><span class=identifier>parse</span><span class=special>(</span><span class=identifier>scan</span><span class=special>))
    </span><span class=special>{
        </span><span class=comment>//  Parsed successfully. If first == last, then we have
        //  a full parse, the parser recognized the input in whole.
    </span><span class=special>}
    </span><span class=keyword>else
    </span><span class=special>{
        </span><span class=comment>//  Parsing failure. The parser failed to recognize the input
    </span><span class=special>}</span></pre>
<p>Where <tt>SkipT</tt> is the type of the skip-parser, <tt>skip</tt>. Again, 
  <tt>p</tt> is the parser we want to use, and <tt>first</tt>/<tt>last</tt> are 
  the iterator pairs referring to the input. Given a skip-parser type <tt>SkipT</tt>, 
  <span class=identifier><tt>skip_parser_iteration_policy</tt></span> creates 
  a scanner iteration policy that skips over portions that are recognized by the 
  skip-parser. This may then be used to create a scanner. The <tt>scanner_policies</tt> 
  class wraps all scanner related policies including the iteration policies.<br>
</p>
<p><b><img src="theme/alert.gif"> <a name="business"></a> The Scanner Business</b></p>
<p>As much as possible, we do not want to complicate matters. However, dealing 
  with some nitty gritty details is sometimes unavoidable. While the free functions 
  above are indeed easy to use, sometimes you'll want to pass in a rule to one 
  of the functions. The problem is that the rule is a template class that is parameterized 
  by the scanner type. This is rather awkward but unavoidable: the rule is tied 
  to a scanner. What's not obvious is that this scanner must be compatible with 
  the scanner that is ultimately passed to the rule's parse member function. Otherwise, 
  the compiler will complain. Many times, the problem is subtle. For example:</p>
<pre><code><font color="#000000"><span class=special>    </span><span class=identifier>rule</span><span class=special>&lt;&gt; </span><span class=identifier>r </span><span class=special>= </span><span class=special>*</span><span class=identifier>anychar_p</span><span class=special>;
    </span><span class=identifier>parse</span><span class=special>(</span><span class=string>"hello world"</span><span class=special>, </span><span class=identifier>r</span><span class=special>);            </span><span class=comment>//  OK  [character level parsing]
    </span><span class=identifier>parse</span><span class=special>(</span><span class=string>"hello world"</span><span class=special>, </span><span class=identifier>r</span><span class=special>, </span><span class=identifier>space_p</span><span class=special>);   </span><span class=comment>//  BAD [attempts phrase level parsing]</span></font></code></pre>
<p>Why does the second call to parse not compile? Because of scanner incompatibility. 
  Behind the scenes, the free parse function creates a scanner from the iterators 
  passed in. In the first call to parse, the scanner created is a plain vanilla 
  <tt>scanner&lt;&gt;</tt>. This is compatible with the default scanner type of 
  <tt>rule&lt;&gt;</tt> [see default template parameters of <a href="rule.html">the 
  rule</a>]. The second call creates a scanner of type <tt>phrase_scanner_t</tt>:</p>
<pre><span class=special>    </span><span class=keyword>typedef </span><span class=identifier>skipper_iteration_policy</span><span class=special>&lt;&gt;                  </span><span class=identifier>iter_policy_t</span><span class=special>;
    </span><span class=keyword>typedef </span><span class=identifier>scanner_policies</span><span class=special>&lt;</span><span class=identifier>iter_policy_t</span><span class=special>&gt;             </span><span class=identifier>scanner_policies_t</span><span class=special>;
    </span><span class=keyword>typedef </span><span class=identifier>scanner</span><span class=special>&lt;</span><span class=keyword>char </span><span class=keyword>const</span><span class=special>*, </span><span class=identifier>scanner_policies_t</span><span class=special>&gt;    </span><span class=identifier>phrase_scanner_t</span><span class=special>;</span></pre>
<p>Thus, in order for the second call to succeed, the rule must be parameterized 
  as <tt>rule&lt;phrase_scanner_t&gt;</tt>:</p>
<pre><code><font color="#000000"><span class=comment>    </span><span class=identifier>rule</span><span class=special>&lt;</span><span class=identifier>phrase_scanner_t</span><span class=special>&gt; </span><span class=identifier>r </span><span class=special>= </span><span class=special>*</span><span class=identifier>anychar_p</span><span class=special>;
    </span><span class=identifier>parse</span><span class=special>(</span><span class=string>"hello world"</span><span class=special>, </span><span class=identifier>r</span><span class=special>, </span><span class=identifier>space_p</span><span class=special>);       </span><span class=comment>//  OK [phrase level parsing]</span></font></code></pre>
<p>Take note however that <tt>phrase_scanner_t</tt> is compatible only when you 
  are using <tt>char const*</tt> iterators and <tt>space_p</tt> as the skip parser. 
  Other than that, you'll have to find the right type of scanner. This is tedious 
  to do correctly. In light of this issue, it is best to avoid rules as arguments 
  to the parse functions. We shall see how this can be done in the next section. 
  Keep in mind that this happens only with rules. The rule is the only parser 
  that has to be tied to a particular scanner type. For instance:</p>
<pre><span class=comment>    </span><span class=identifier>parse</span><span class=special>(</span><span class=string>"hello world"</span><span class=special>, *</span><span class=identifier>anychar_p</span><span class=special>);           </span><span class=comment><code><font color="#000000"><span class=comment>//  OK  [character level parsing</span></font></code>]
    </span><span class=identifier>parse</span><span class=special>(</span><span class=string>"hello world"</span><span class=special>, *</span><span class=identifier>anychar_p</span><span class=special>, </span><span class=identifier>space_p</span><span class=special>);  </span><span class="comment">//  OK  [phrase level parsing]</span></pre>
<table border="0">
  <tr> 
    <td width="10"></td>
    <td width="30"><a href="../index.html"><img src="theme/u_arr.gif" border="0"></a></td>
    <td width="30"><a href="directives.html"><img src="theme/l_arr.gif" border="0"></a></td>
    <td width="20"><a href="grammar.html"><img src="theme/r_arr.gif" border="0"></a></td>
  </tr>
</table>
<br>
<hr size="1">
<p class="copyright">Copyright &copy; 1998-2002 Joel de Guzman<br>
  <br>
  <font size="2">Permission to copy, use, modify, sell and distribute this document 
  is granted provided this copyright notice appears in all copies. This document 
  is provided &quot;as is&quot; without express or implied warranty, and with 
  no claim as to its suitability for any purpose.</font></p>
<p>&nbsp;</p>
<p>&nbsp;</p>
</body>
</html>