<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<link rel="stylesheet" href="style.css" type="text/css">
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type">
<link rel="Start" href="index.html">
<link rel="previous" href="Intro_namespaces.html">
<link rel="next" href="Intro_resolution.html">
<link rel="Up" href="index.html">
<link title="Index of types" rel=Appendix href="index_types.html">
<link title="Index of exceptions" rel=Appendix href="index_exceptions.html">
<link title="Index of values" rel=Appendix href="index_values.html">
<link title="Index of class methods" rel=Appendix href="index_methods.html">
<link title="Index of classes" rel=Appendix href="index_classes.html">
<link title="Index of class types" rel=Appendix href="index_class_types.html">
<link title="Index of modules" rel=Appendix href="index_modules.html">
<link title="Index of module types" rel=Appendix href="index_module_types.html">
<link title="Pxp_types" rel="Chapter" href="Pxp_types.html">
<link title="Pxp_document" rel="Chapter" href="Pxp_document.html">
<link title="Pxp_dtd" rel="Chapter" href="Pxp_dtd.html">
<link title="Pxp_tree_parser" rel="Chapter" href="Pxp_tree_parser.html">
<link title="Pxp_core_types" rel="Chapter" href="Pxp_core_types.html">
<link title="Pxp_ev_parser" rel="Chapter" href="Pxp_ev_parser.html">
<link title="Pxp_event" rel="Chapter" href="Pxp_event.html">
<link title="Pxp_dtd_parser" rel="Chapter" href="Pxp_dtd_parser.html">
<link title="Pxp_codewriter" rel="Chapter" href="Pxp_codewriter.html">
<link title="Pxp_marshal" rel="Chapter" href="Pxp_marshal.html">
<link title="Pxp_yacc" rel="Chapter" href="Pxp_yacc.html">
<link title="Pxp_reader" rel="Chapter" href="Pxp_reader.html">
<link title="Intro_trees" rel="Chapter" href="Intro_trees.html">
<link title="Intro_extensions" rel="Chapter" href="Intro_extensions.html">
<link title="Intro_namespaces" rel="Chapter" href="Intro_namespaces.html">
<link title="Intro_events" rel="Chapter" href="Intro_events.html">
<link title="Intro_resolution" rel="Chapter" href="Intro_resolution.html">
<link title="Intro_getting_started" rel="Chapter" href="Intro_getting_started.html">
<link title="Intro_advanced" rel="Chapter" href="Intro_advanced.html">
<link title="Intro_preprocessor" rel="Chapter" href="Intro_preprocessor.html">
<link title="Example_readme" rel="Chapter" href="Example_readme.html"><link title="XML data as stream of events" rel="Section" href="#1_XMLdataasstreamofevents">
<link title="The structure of event streams" rel="Subsection" href="#structure">
<link title="Calling the parser in event mode" rel="Subsection" href="#calling">
<link title="Filters" rel="Subsection" href="#filters">
<link title="Events and namespaces" rel="Subsection" href="#namespaces">
<link title="Example: Print the events while parsing" rel="Subsection" href="#2_ExamplePrinttheeventswhileparsing">
<link title="Connect PXP with a recursive-descent parser" rel="Subsection" href="#recdesc">
<link title="Escape PXP parsing" rel="Subsection" href="#escape">
<title>PXP Reference : Intro_events</title>
</head>
<body>
<div class="navbar"><a href="Intro_namespaces.html">Previous</a>
&nbsp;<a href="index.html">Up</a>
&nbsp;<a href="Intro_resolution.html">Next</a>
</div>
<center><h1>Intro_events</h1></center>
<br>
<br>
<a name="1_XMLdataasstreamofevents"></a>
<h1>XML data as stream of events</h1>
<p>

In contrast to the tree mode (see <a href="Intro_trees.html"><code class="code"><span class="constructor">Intro_trees</span></code></a>), the parser in event mode does not
return the complete document at once, but emits it as a sequence
of so-called events. The parser makes a number of guarantees about
the structure of the emitted events; in particular, they
conform to the well-formedness constraints. For instance, start tags
and end tags are guaranteed to be properly nested. Nevertheless, it is
up to the caller to process and/or aggregate the events, which leaves
the caller a lot of freedom.
<p>

The event mode is especially well-suited for processing very large
documents. As PXP does not by itself represent the complete document
in memory, it usually does not need to maintain large data structures in
event mode. Of course, the caller should also try to avoid such data
structures. In many cases this makes it possible to process even
arbitrarily large documents. Note, however, that not all limits are
lifted. For example, to check well-formedness the
parser still needs to maintain a stack of start tags whose end
tags have not been seen yet. Because of this, it is not possible
to parse arbitrarily deeply nested documents with constant memory. On
32-bit platforms, the maximum string length is also still limited
to 16 MB.
<p>

Another application of event mode is the direct combination with
recursive-descent parsers for postprocessing the stream of events.
See below <a href="Intro_events.html#recdesc"><i>Connect PXP with a recursive-descent parser</i></a> for more.
<p>

The event mode also makes it feasible to enable the special escape
tokens <code class="code">{</code>, <code class="code">}</code>, <code class="code">{{</code>, and <code class="code">}}</code>. PXP can be configured such that
these tokens trigger a user-defined add-on parser that reads directly
from the character stream. See below <a href="Intro_events.html#escape"><i>Escape PXP parsing</i></a> for more.
<p>

We should also mention one basic limitation of event-oriented parsing:
It is fundamentally incompatible with validation, as the tree view is
required to validate a document.
<p>

<a name="links"></a>
<h3>Links to other documentation</h3>
<p>
<ul>
<li><a href="Pxp_types.html#TYPEevent"><code class="code"><span class="constructor">Pxp_types</span>.event</code></a> is the data type of events. Also explained below</li>
<li><a href="Pxp_ev_parser.html"><code class="code"><span class="constructor">Pxp_ev_parser</span></code></a> is the module with parsing functions in event mode</li>
<li><a href="Pxp_event.html"><code class="code"><span class="constructor">Pxp_event</span></code></a> is a module with helper functions for event mode, such as
  concatenation of event streams</li>
<li><a href="Pxp_document.html#VALliquefy"><code class="code"><span class="constructor">Pxp_document</span>.liquefy</code></a> allows one to convert a tree into an event stream</li>
<li><a href="Pxp_document.html#VALsolidify"><code class="code"><span class="constructor">Pxp_document</span>.solidify</code></a> allows one to convert an event stream into a tree</li>
<li><a href="Intro_preprocessor.html#events"><i>Generating events: pxp_evlist and pxp_evpull</i></a> explains how to use the preprocessor to
  construct event streams</li>
</ul>

<a name="compat"></a>
<h3>Compatibility</h3>
<p>

Event mode is compatible with:
<p>
<ul>
<li>Well-formedness parsing</li>
<li>Namespaces: Namespace processing works as outlined in <a href="Intro_namespaces.html"><code class="code"><span class="constructor">Intro_namespaces</span></code></a>,
  only that the user needs to interpret the namespace information contained
  in the events differently. See below <a href="Intro_events.html#namespaces"><i>Events and namespaces</i></a> for more.</li>
<li>Reading from arbitrary sources as described in <a href="Intro_resolution.html"><code class="code"><span class="constructor">Intro_resolution</span></code></a></li>
</ul>

Event mode is incompatible with:
<p>
<ul>
<li>Validation</li>
</ul>

<a name="structure"></a>
<h2>The structure of event streams</h2>
<p>

First we describe how well-formed XML fragments are represented in
stream format, i.e. XML text that is properly nested with respect to
start tags and end tags. For a real text, the parser will also emit
some wrapping. The parser distinguishes between documents and non-document
entities. A document is a formally closed text that consists of one
main entity (file) and optionally a number of referenced entities.
One can parse a file as document, and in this case the parser will add
a wrapping suited for documents. Alternatively, one can parse an
entity as a plain entity, and in this case the parser will add a
wrapping suited for non-documents. Note that the XML declaration
(<code class="code">&lt;?xml ... <span class="keywordsign">?&gt;</span></code>) for such non-document entities is slightly different,
and that no <code class="code"><span class="constructor">DOCTYPE</span></code> clause is permitted.
<p>

<a name="wf"></a>
<h3>The structure of well-formed XML fragments</h3>
<p>

The type of events is <a href="Pxp_types.html#TYPEevent"><code class="code"><span class="constructor">Pxp_types</span>.event</code></a>. The events do not strictly
correspond to syntactical elements of XML, but more to a logical 
interpretation.
<p>

The parser emits events for
<ul>
<li><code class="code"><span class="constructor">E_char_data</span>(text)</code>: Character data - The parser emits character
  data events for sequences of characters. It is unspecified how long
  these sequences are. This means it is up to the parser how a
  contiguous section of characters is split up into one or more
  character data events, i.e. <b>the parser may emit adjacent character
  data events.</b> Also, the parser makes no attempt to suppress whitespace
  of any kind. For example, the XML text
  <pre></pre><code class="code">&nbsp;<span class="constructor">Hello</span>&nbsp;world&nbsp;</code><pre></pre>
  might lead to the emission of
  <pre></pre><code class="code">&nbsp;[<span class="constructor">E_char_data</span>&nbsp;<span class="string">"Hello&nbsp;"</span>;&nbsp;<span class="constructor">E_char_data</span>&nbsp;<span class="string">"world"</span>]&nbsp;</code><pre></pre>
  but also to any other split into events.
<p>

  </li>
<li><code class="code"><span class="constructor">E_start_tag</span>(name,atts,scope_opt,entid)</code>: Start tags of elements - 
  Includes everything within the angle brackets, i.e. <code class="code">name</code> and
  attribute list <code class="code">atts</code> (as name/value pairs). The event also
  includes the namespace scope <code class="code">scope_opt</code> if namespace processing is 
  enabled (or <code class="code"><span class="constructor">None</span></code>), and it includes a reference <code class="code">entid</code> to the entity
  the tag occurs in. Note that the tag name and the attribute names
  are subject to prefix normalization if namespace processing is
  enabled.
<p>

  </li>
<li><code class="code"><span class="constructor">E_end_tag</span>(name,entid)</code>: End tags of elements - The event
  mentions the <code class="code">name</code>, and the entity <code class="code">entid</code> the tag occurs in.
  Both <code class="code">name</code> and <code class="code">entid</code> are always identical to the values
  attached to the corresponding start tag.
<p>

  Note that the short form of empty elements, <code class="code">&lt;tag/&gt;</code>, is emitted as
  a start tag followed by an end tag.
  </li>
<li><code class="code"><span class="constructor">E_pinstr</span>(name,value,entid)</code>: Processing instructions 
  (PI's) - In tree mode, PI's can be represented in two ways: Either by
  attaching them to the surrounding elements, or by including them
  into the tree exactly where they occurred in the text. For symmetry,
  the same two ways of handling PI's are also present in the event
  stream representation (event streams and trees should be convertible
  into each other without data loss). Although there is only one
  event (<code class="code"><span class="constructor">E_pinstr</span></code>), it depends on the config option
  <code class="code">enable_pinstr_nodes</code> where this event is placed into the event
  stream. If the option is enabled, <code class="code"><span class="constructor">E_pinstr</span></code> is always emitted where
  the PI occurs in the XML text. If it is disabled, the emission of
  <code class="code"><span class="constructor">E_pinstr</span></code> may be delayed, but it is still guaranteed that this
  happens in the same context (surrounding element).
  It is not possible to turn the emission of PI events completely
  off. (See <a href="Intro_events.html#filters"><i>Filters</i></a> for an example of how to filter out
  PI events in a postprocessing step.)
<p>

  </li>
<li><code class="code"><span class="constructor">E_comment</span> text</code>: Comments - If enabled (by
  <code class="code">enable_comment_nodes</code> in <a href="Pxp_types.html#TYPEconfig"><code class="code"><span class="constructor">Pxp_types</span>.config</code></a>), the parser emits
   comment events.
<p>

  </li>
<li><code class="code"><span class="constructor">E_start_super</span></code> and <code class="code"><span class="constructor">E_end_super</span></code>: Super root nodes - If enabled
  (by <code class="code">enable_super_root_node</code> in <a href="Pxp_types.html#TYPEconfig"><code class="code"><span class="constructor">Pxp_types</span>.config</code></a>), the parser emits
   a start event for
  the super root node at the beginning of the stream, and an end event
  at the end of the stream. This is comparable to an element embracing
  the whole text.
<p>

  </li>
<li><code class="code"><span class="constructor">E_position</span>(e,l,p)</code>: Position events - If enabled (by
  <code class="code">store_element_positions</code> in <a href="Pxp_types.html#TYPEconfig"><code class="code"><span class="constructor">Pxp_types</span>.config</code></a>), the parser emits
  special position
  events. These events refer to the immediately following event, and
  say from where in the XML text the following event
  originates. Position events are emitted before <code class="code"><span class="constructor">E_start_tag</span></code>,
  <code class="code"><span class="constructor">E_pinstr</span></code>, and <code class="code"><span class="constructor">E_comment</span></code>. The argument <code class="code">e</code> is a textual
  description of the entity. <code class="code">l</code> is the line. <code class="code">p</code> is the byte position
  of the character.
  </li>
</ul>
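
<p>

Because the split of character data into events is unspecified, callers
that compare or collect text usually want to join adjacent character data
events first. The following is only a minimal sketch of that idea. It
assumes a simplified, locally defined <code class="code">event</code> type that omits the namespace
scope and entity ID arguments of the real <a href="Pxp_types.html#TYPEevent"><code class="code"><span class="constructor">Pxp_types</span>.event</code></a>, and the
name <code class="code">merge_char_data</code> is hypothetical, not part of PXP.
<p>

```ocaml
(* Simplified stand-in for Pxp_types.event (an assumption for this sketch):
   the real constructors carry more arguments, e.g. scope and entity ID. *)
type event =
  | E_start_tag of string * (string * string) list
  | E_end_tag of string
  | E_char_data of string
  | E_end_of_stream

(* Join every run of adjacent E_char_data events into one event.
   All other events are passed through unchanged. *)
let rec merge_char_data events =
  match events with
  | E_char_data a :: E_char_data b :: rest ->
      merge_char_data (E_char_data (a ^ b) :: rest)
  | ev :: rest -> ev :: merge_char_data rest
  | [] -> []

let merged =
  merge_char_data
    [ E_start_tag ("p", []);
      E_char_data "Hello ";
      E_char_data "world";
      E_end_tag "p";
      E_end_of_stream ]
```
<p>

The sketch works on a list for brevity. For very large documents the same
logic would have to be applied incrementally, so that the whole stream is
never held in memory.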

<p>

As in the tree mode, entities are fully resolved, and do not appear
in the parsed events. Also, syntactic elements like CDATA sections,
the XML declaration, the DOCTYPE clause, and all elements only
allowed in the DTD part are not represented.
<p>

Example for an event stream: The XML fragment
<p>

<pre></pre><code class="code">&nbsp;&nbsp;&lt;p&nbsp;a1=<span class="string">"one"</span>&gt;&lt;q&gt;data1&lt;/q&gt;&lt;r&gt;data2&lt;/r&gt;&lt;s&gt;&lt;/s&gt;&lt;t/&gt;&lt;/p&gt;<br>
</code><pre></pre>
<p>

could be represented as
<p>

<pre></pre><code class="code">&nbsp;&nbsp;[&nbsp;<span class="constructor">E_start_tag</span>(<span class="string">"p"</span>,[<span class="string">"a1"</span>,<span class="string">"one"</span>],<span class="constructor">None</span>,&lt;entid&gt;);<br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="constructor">E_start_tag</span>(<span class="string">"q"</span>,[],<span class="constructor">None</span>,&lt;entid&gt;);<br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="constructor">E_char_data</span>&nbsp;<span class="string">"data1"</span>;<br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="constructor">E_end_tag</span>(<span class="string">"q"</span>,&lt;entid&gt;);<br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="constructor">E_start_tag</span>(<span class="string">"r"</span>,[],<span class="constructor">None</span>,&lt;entid&gt;);<br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="constructor">E_char_data</span>&nbsp;<span class="string">"data2"</span>;<br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="constructor">E_end_tag</span>(<span class="string">"r"</span>,&lt;entid&gt;);<br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="constructor">E_start_tag</span>(<span class="string">"s"</span>,[],<span class="constructor">None</span>,&lt;entid&gt;);<br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="constructor">E_end_tag</span>(<span class="string">"s"</span>,&lt;entid&gt;);<br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="constructor">E_start_tag</span>(<span class="string">"t"</span>,[],<span class="constructor">None</span>,&lt;entid&gt;);<br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="constructor">E_end_tag</span>(<span class="string">"t"</span>,&lt;entid&gt;);<br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="constructor">E_end_tag</span>(<span class="string">"p"</span>,&lt;entid&gt;);<br>
&nbsp;&nbsp;]<br>
</code><pre></pre>
<p>

where <code class="code">&lt;entid&gt;</code> is the entity ID object.
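<p>

The nesting guarantee can be illustrated on this stream with a small stack
of open element names, similar in spirit to the stack the parser itself has
to keep. This is a sketch under assumptions: the <code class="code">event</code> type is a
simplified local stand-in without the scope and entity ID arguments, and
<code class="code">max_depth</code> is a hypothetical helper, not a PXP function.
<p>

```ocaml
(* Simplified stand-in for Pxp_types.event (an assumption for this sketch). *)
type event =
  | E_start_tag of string * (string * string) list
  | E_end_tag of string
  | E_char_data of string
  | E_end_of_stream

(* The example stream from the text, without the entity ID arguments. *)
let example = [
  E_start_tag ("p", [ ("a1", "one") ]);
  E_start_tag ("q", []); E_char_data "data1"; E_end_tag "q";
  E_start_tag ("r", []); E_char_data "data2"; E_end_tag "r";
  E_start_tag ("s", []); E_end_tag "s";
  E_start_tag ("t", []); E_end_tag "t";
  E_end_tag "p";
]

(* Return [Some d], where d is the deepest nesting level, if all start and
   end tags are properly nested; return [None] otherwise. *)
let max_depth events =
  let rec go open_tags deepest = function
    | [] -> if open_tags = [] then Some deepest else None
    | E_start_tag (name, _) :: rest ->
        let open_tags = name :: open_tags in
        go open_tags (max deepest (List.length open_tags)) rest
    | E_end_tag name :: rest ->
        (match open_tags with
         | top :: below when String.equal top name -> go below deepest rest
         | _ -> None)
    | (E_char_data _ | E_end_of_stream) :: rest -> go open_tags deepest rest
  in
  go [] 0 events

let depth_of_example = max_depth example
```
<p>

The memory needed by <code class="code">max_depth</code> grows with the nesting depth, not with
the length of the stream.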
<p>

<a name="nondocs"></a>
<h3>The wrapping for non-document entities</h3>
<p>

The XML specification demands that external XML entities (that are
referenced from a document entity or another external entity) comply
with this grammar (excerpt from the W3C definition):
<p>

<pre></pre><code class="code">extParsedEnt&nbsp;::=&nbsp;<span class="constructor">TextDecl</span>?&nbsp;content<br>
<span class="constructor">TextDecl</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;::=&nbsp;<span class="keywordsign">'</span>&lt;?xml'&nbsp;<span class="constructor">VersionInfo</span>?&nbsp;<span class="constructor">EncodingDecl</span>&nbsp;<span class="constructor">S</span>?&nbsp;<span class="keywordsign">'</span><span class="keywordsign">?&gt;</span><span class="keywordsign">'</span><br>
content&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;::=&nbsp;(element&nbsp;<span class="keywordsign">|</span>&nbsp;<span class="constructor">CharData</span>&nbsp;<span class="keywordsign">|</span>&nbsp;<span class="constructor">Reference</span>&nbsp;<span class="keywordsign">|</span>&nbsp;<span class="constructor">CDSect</span>&nbsp;<span class="keywordsign">|</span>&nbsp;<span class="constructor">PI</span>&nbsp;<span class="keywordsign">|</span>&nbsp;<span class="constructor">Comment</span>)*<br>
</code><pre></pre>
<p>

i.e. there can be an XML declaration at the beginning (always with
an <code class="code">encoding</code> declaration), but the declaration is optional.
It is followed by a sequence of elements, character data, processing
instructions and comments (which are reflected by the events emitted
by the parser), and by entity references and CDATA sections (which
are already resolved by the parser).
<p>

The emitted events are now:<ul>
<li>No event is emitted for the XML declaration</li>
<li>The stream consists of the events for the <code class="code">content</code> production</li>
<li>Finally, there is an <code class="code"><span class="constructor">E_end_of_stream</span></code> event.</li>
</ul>

When the parser detects an error, it stops the event stream, and
emits a last <code class="code"><span class="constructor">E_error</span></code> event instead.
<p>

<a name="docs"></a>
<h3>The wrapping for closed documents</h3>
<p>

Closed documents have to match this grammar (excerpt from the W3C
definition):
<p>

<pre></pre><code class="code">document&nbsp;::=&nbsp;prolog&nbsp;element&nbsp;<span class="constructor">Misc</span>*<br>
prolog&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;::=&nbsp;<span class="constructor">XMLDecl</span>?&nbsp;<span class="constructor">Misc</span>*&nbsp;(doctypedecl&nbsp;<span class="constructor">Misc</span>*)?<br>
<span class="constructor">XMLDecl</span>&nbsp;&nbsp;::=&nbsp;<span class="keywordsign">'</span>&lt;?xml'&nbsp;<span class="constructor">VersionInfo</span>&nbsp;<span class="constructor">EncodingDecl</span>?&nbsp;<span class="constructor">SDDecl</span>?&nbsp;<span class="constructor">S</span>?&nbsp;<span class="keywordsign">'</span><span class="keywordsign">?&gt;</span><span class="keywordsign">'</span><br>
</code><pre></pre>
<p>

That means there can be an XML declaration at the beginning (always
with a <code class="code"><span class="constructor">VersionInfo</span></code> declaration), but the declaration is optional.
There can be a <code class="code"><span class="constructor">DOCTYPE</span></code> declaration. Finally, there must be a single
element. The production <code class="code"><span class="constructor">Misc</span></code> stands for a comment, a processing
instruction, or whitespace.
<p>

The emitted events are now:<ul>
<li><code class="code"><span class="constructor">E_start_doc</span>(version,dtd)</code> is always emitted at the beginning.
  The <code class="code">version</code> string is from <code class="code"><span class="constructor">VersionInfo</span></code>, or "1.0" if the whole
  XML declaration is missing. The <code class="code">dtd</code> object may contain the 
  declaration of the parsed <code class="code"><span class="constructor">DOCTYPE</span></code> clause. However, by setting parsing
  parameters it is possible to control which declarations are
  added to the <code class="code">dtd</code> object.</li>
<li>If <code class="code">enable_super_root_node</code>: <code class="code"><span class="constructor">E_start_super</span></code></li>
<li>If there are comments or processing instructions before the
  topmost element, and the node type is enabled, these events
  are now emitted.</li>
<li>Now the events of the topmost <code class="code">element</code> follow.</li>
<li>If there are comments or processing instructions after the
  topmost element, and the node type is enabled, these events
  are now emitted.</li>
<li>If <code class="code">enable_super_root_node</code>: <code class="code"><span class="constructor">E_end_super</span></code></li>
<li><code class="code"><span class="constructor">E_end_doc</span> name</code>: ends the document. The <code class="code">name</code> is the literal
  name of the topmost element, without any prefix normalization
  even if namespace processing is enabled</li>
<li>Finally, there is an <code class="code"><span class="constructor">E_end_of_stream</span></code> event.</li>
</ul>

When the parser detects an error, it stops the event stream, and
emits a last <code class="code"><span class="constructor">E_error</span></code> event instead.
<p>

<a name="calling"></a>
<h2>Calling the parser in event mode</h2>
<p>

The parser returns the emitted events while it is parsing. There are
two models for that:
<p>
<ul>
<li>Push parsing: The caller passes a callback function to the parser,
  and whenever the parser emits an event, this function is invoked</li>
<li>Pull parsing: The parser runs as a coroutine together with the
  caller. The invocation of the parser returns the pull function.
  The caller now repeatedly invokes the pull function to get the 
  emitted events until the end of the stream is indicated.</li>
</ul>

Let's look at both models in detail by giving an example. There is
some code that is needed in both push and pull parsing.  This example
is similar to the examples given in <a href="Intro_getting_started.html"><code class="code"><span class="constructor">Intro_getting_started</span></code></a>. First we
need a <a href="Pxp_types.html#TYPEsource"><code class="code"><span class="constructor">Pxp_types</span>.source</code></a> that says from where the input to parse
comes. Second, we need an entity manager (of the opaque PXP type
<code class="code"><span class="constructor">Pxp_entity_manager</span>.entity_manager</code>). The entity manager is a device
that controls the source and switches between the entities to parse
(if such switches are necessary). The entity manager is visible to the
caller in event mode - in tree mode it is also needed but hidden in
the parser driver.
<p>

<pre></pre><code class="code"><span class="keyword">let</span>&nbsp;config&nbsp;=&nbsp;<span class="constructor">Pxp_types</span>.default_config<br>
<span class="keyword">let</span>&nbsp;source&nbsp;=&nbsp;<span class="constructor">Pxp_types</span>.from_file&nbsp;<span class="string">"filename.xml"</span><br>
<span class="keyword">let</span>&nbsp;entmng&nbsp;=&nbsp;<span class="constructor">Pxp_ev_parser</span>.create_entity_manager&nbsp;config&nbsp;source<br>
</code><pre></pre>
<p>

(See also: <a href="Pxp_ev_parser.html#VALcreate_entity_manager"><code class="code"><span class="constructor">Pxp_ev_parser</span>.create_entity_manager</code></a>.)
<p>

From here on, the required code differs between the two parsing modes.
<p>

<a name="push"></a>
<h3>Push parsing</h3>
<p>

The function <a href="Pxp_ev_parser.html#VALprocess_entity"><code class="code"><span class="constructor">Pxp_ev_parser</span>.process_entity</code></a> invokes the parser
in push mode:
<p>

<pre></pre><code class="code"><span class="keyword">let</span>&nbsp;()&nbsp;=&nbsp;<span class="constructor">Pxp_ev_parser</span>.process_entity&nbsp;config&nbsp;entry&nbsp;entmng&nbsp;(<span class="keyword">fun</span>&nbsp;ev&nbsp;<span class="keywordsign">-&gt;</span>&nbsp;...)<br>
</code><pre></pre>
<p>

The callback function is here shown as <code class="code">(<span class="keyword">fun</span> ev <span class="keywordsign">-&gt;</span> ...)</code>. It is called
back for every emitted event <code class="code">ev</code> (of type <a href="Pxp_types.html#TYPEevent"><code class="code"><span class="constructor">Pxp_types</span>.event</code></a>). It is
ensured that the last emitted event is either <code class="code"><span class="constructor">E_end_of_stream</span></code> or
<code class="code"><span class="constructor">E_error</span></code>. See the documentation of <a href="Pxp_ev_parser.html#VALprocess_entity"><code class="code"><span class="constructor">Pxp_ev_parser</span>.process_entity</code></a>
for details about error handling.
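<p>

A callback typically accumulates its results in mutable state, because it
is invoked once per event and returns nothing. The sketch below shows this
pattern without depending on PXP. It assumes a simplified local <code class="code">event</code>
type, and <code class="code">push_events</code> is a hypothetical stand-in for the parser driver
that merely feeds a prepared list to the callback.
<p>

```ocaml
(* Simplified stand-in for Pxp_types.event (an assumption for this sketch). *)
type event =
  | E_start_tag of string * (string * string) list
  | E_end_tag of string
  | E_char_data of string
  | E_error of exn
  | E_end_of_stream

(* Hypothetical stand-in for the push parser: calls the callback once for
   every event of a prepared list, in order. *)
let push_events events callback = List.iter callback events

(* State collected by the callback. *)
let element_count = ref 0
let text_length = ref 0
let finished = ref false
let failure : exn option ref = ref None

(* The callback: count elements, sum up text length, and remember how the
   stream ended. *)
let on_event = function
  | E_start_tag (_, _) -> incr element_count
  | E_char_data s -> text_length := !text_length + String.length s
  | E_end_tag _ -> ()
  | E_end_of_stream -> finished := true
  | E_error e -> failure := Some e

let () =
  push_events
    [ E_start_tag ("q", []); E_char_data "data1"; E_end_tag "q";
      E_end_of_stream ]
    on_event
```
<p>

With the real parser, the function <code class="code">on_event</code> would take the place of
<code class="code">(<span class="keyword">fun</span> ev <span class="keywordsign">-&gt;</span> ...)</code>, matching on the full constructors instead.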
<p>

The parameter <code class="code">entry</code> (of type <a href="Pxp_types.html#TYPEentry"><code class="code"><span class="constructor">Pxp_types</span>.entry</code></a>) determines the
entry point in the XML grammar.  Essentially, it says what kind of
thing to parse. Most users will want to pass <code class="code"><span class="keywordsign">`</span><span class="constructor">Entry_document</span></code> here to
parse a closed document. Note that the emitted event stream includes
the wrapping for documents as described in <a href="Intro_events.html#docs"><i>The wrapping for closed documents</i></a>.
<p>

The entry point <code class="code"><span class="keywordsign">`</span><span class="constructor">Entry_content</span></code> is for non-document external entities,
as described in <a href="Intro_events.html#nondocs"><i>The wrapping for non-document entities</i></a>. There is a similar entry 
point, <code class="code"><span class="keywordsign">`</span><span class="constructor">Entry_element_content</span></code>, which additionally enforces some
constraints on the node structure. In particular, there must be a single
top-level element so that the enforced node structure looks like a
document. We do not recommend using <code class="code"><span class="keywordsign">`</span><span class="constructor">Entry_element_content</span></code>; rather,
use <code class="code"><span class="keywordsign">`</span><span class="constructor">Entry_document</span></code> and remove the document wrapping in a postprocessing
step.
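<p>

Such a postprocessing step can be as simple as dropping the two document
events. The following sketch assumes a simplified local <code class="code">event</code> type in
which <code class="code"><span class="constructor">E_start_doc</span></code> carries only the version string (the real event also
carries the DTD object), and <code class="code">strip_document_wrapping</code> is a hypothetical
name, not a PXP function.
<p>

```ocaml
(* Simplified stand-in for Pxp_types.event (an assumption for this sketch). *)
type event =
  | E_start_doc of string   (* the real event also carries the DTD object *)
  | E_end_doc of string
  | E_start_tag of string * (string * string) list
  | E_end_tag of string
  | E_char_data of string
  | E_end_of_stream

(* True for the events that only exist because a document was parsed. *)
let is_document_wrapping = function
  | E_start_doc _ | E_end_doc _ -> true
  | E_start_tag _ | E_end_tag _ | E_char_data _ | E_end_of_stream -> false

(* Remove the document wrapping and keep everything else in order. *)
let strip_document_wrapping events =
  List.filter (fun ev -> not (is_document_wrapping ev)) events

let stripped =
  strip_document_wrapping
    [ E_start_doc "1.0";
      E_start_tag ("a", []); E_char_data "x"; E_end_tag "a";
      E_end_doc "a";
      E_end_of_stream ]
```
<p>

The result has the shape described for non-document entities: the content
events followed by <code class="code"><span class="constructor">E_end_of_stream</span></code>.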
<p>

The entry point <code class="code"><span class="keywordsign">`</span><span class="constructor">Entry_expr</span></code> reads a single node (see <a href="Pxp_types.html#TYPEentry"><code class="code"><span class="constructor">Pxp_types</span>.entry</code></a>
for details). It is recommended to use <a href="Pxp_ev_parser.html#VALprocess_expr"><code class="code"><span class="constructor">Pxp_ev_parser</span>.process_expr</code></a>
instead of <a href="Pxp_ev_parser.html#VALprocess_entity"><code class="code"><span class="constructor">Pxp_ev_parser</span>.process_entity</code></a> together with this entry
point, as this allows parsing to start and end within an entity, instead
of having to parse an entity as a whole. (This is intended for special
applications only.)
<p>

The entry point <code class="code"><span class="keywordsign">`</span><span class="constructor">Entry_declarations</span></code> is currently unused.
<p>

<b>Flags for <code class="code"><span class="keywordsign">`</span><span class="constructor">Entry_document</span></code>.</b> This entry point takes some flags as
arguments that determine some details. It is usually ok to just pass
the empty list of flags, i.e. <code class="code"><span class="keywordsign">`</span><span class="constructor">Entry_document</span> []</code>. The flags may
enable some validation checks, or at least configure that some data is
stored in the DTD object so that it is available for a later
validation pass. Remember that the event mode by itself can only do
well-formedness parsing. It can be reasonable, however, to enable
flags when the event stream is later validated by some other means
(e.g. by converting it into a tree and validating it).
<p>

<a name="pull"></a>
<h3>Pull parsing</h3>
<p>

The pull parser is created by <a href="Pxp_ev_parser.html#VALcreate_pull_parser"><code class="code"><span class="constructor">Pxp_ev_parser</span>.create_pull_parser</code></a> like:
<p>

<pre></pre><code class="code"><span class="keyword">let</span>&nbsp;pull&nbsp;=&nbsp;<span class="constructor">Pxp_ev_parser</span>.create_pull_parser&nbsp;config&nbsp;entry&nbsp;entmng<br>
</code><pre></pre>
<p>

The arguments <code class="code">config</code>, <code class="code">entry</code>, and <code class="code">entmng</code> have the same meaning as
for the push parser. In the case of the pull parser, however, no callback
function is passed by the user. Instead, the return value <code class="code">pull</code> is a
function one can call to "pull" the events out of the parser engine.
The <code class="code">pull</code> function returns <code class="code"><span class="constructor">Some</span> ev</code> where <code class="code">ev</code> is the event of type
<a href="Pxp_types.html#TYPEevent"><code class="code"><span class="constructor">Pxp_types</span>.event</code></a>. After the end of the stream is reached, the function
returns <code class="code"><span class="constructor">None</span></code>.
<p>

Essentially, the parser works like an engine that can be started and
stopped. When the <code class="code">pull</code> function is invoked, the parser engine is
"turned on", and runs for a while until (at least) the next event is
available. Then, the engine is stopped again, and the event is returned.
The engine keeps its state between invocations of <code class="code">pull</code> so that the
parser continues exactly at the point where it stopped the last time.
<p>

Note that files and other resources of the operating system are kept
open while parsing is in progress. The user is expected to
continue calling <code class="code">pull</code> until the end of the stream is reached (at
least until <code class="code"><span class="constructor">Some</span> <span class="constructor">E_end_of_stream</span></code>, <code class="code"><span class="constructor">Some</span> <span class="constructor">E_error</span></code>, or <code class="code"><span class="constructor">None</span></code> is
returned by <code class="code">pull</code>). See the description of
<a href="Pxp_ev_parser.html#VALclose_entities"><code class="code"><span class="constructor">Pxp_ev_parser</span>.close_entities</code></a> for a way of prematurely closing
the parser for the exceptional cases where parsing cannot go on until
the final parser state is reached.
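<p>

A typical consumer is a loop that calls the pull function until the stream
ends. The sketch below shows such a loop without depending on PXP. It
assumes a simplified local <code class="code">event</code> type, and <code class="code">pull_of_list</code> is a
hypothetical helper that turns a prepared list into a function of the same
shape as a pull function.
<p>

```ocaml
(* Simplified stand-in for Pxp_types.event (an assumption for this sketch). *)
type event =
  | E_start_tag of string * (string * string) list
  | E_end_tag of string
  | E_char_data of string
  | E_end_of_stream

(* Hypothetical helper: a function that returns the events of a list one
   by one as [Some ev], and [None] once the list is exhausted. *)
let pull_of_list events =
  let remaining = ref events in
  fun () ->
    match !remaining with
    | [] -> None
    | ev :: rest ->
        remaining := rest;
        Some ev

(* Pull until the end of the stream and concatenate all character data. *)
let rec collect_text pull acc =
  match pull () with
  | None | Some E_end_of_stream -> String.concat "" (List.rev acc)
  | Some (E_char_data s) -> collect_text pull (s :: acc)
  | Some (E_start_tag _ | E_end_tag _) -> collect_text pull acc

let text =
  collect_text
    (pull_of_list
       [ E_start_tag ("q", []); E_char_data "data1"; E_end_tag "q";
         E_start_tag ("r", []); E_char_data "data2"; E_end_tag "r";
         E_end_of_stream ])
    []
```
<p>

The loop only stops at the end of the stream, so it follows the rule above
that the pull function should be called until the stream is finished.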
<p>

<a name="3_Preprocessor"></a>
<h3>Preprocessor</h3>
<p>

The PXP preprocessor (see <a href="Intro_preprocessor.html"><code class="code"><span class="constructor">Intro_preprocessor</span></code></a>) allows one to create
event streams programmatically. One can get the events either as
list (type <a href="Pxp_types.html#TYPEevent"><code class="code"><span class="constructor">Pxp_types</span>.event</code></a><code class="code"> list</code>), or in a form compatible with
pull parsing. For example,
<p>

<pre></pre><code class="code"><span class="keyword">let</span>&nbsp;book_list&nbsp;=&nbsp;<br>
&nbsp;&nbsp;&lt;:pxp_evlist&lt;&nbsp;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&lt;book&gt;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&nbsp;&lt;title&gt;[&nbsp;<span class="string">"The&nbsp;Lord&nbsp;of&nbsp;The&nbsp;Rings"</span>&nbsp;]<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;author&gt;[&nbsp;<span class="string">"J.R.R.&nbsp;Tolkien"</span>&nbsp;]<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;]<br>
&nbsp;&nbsp;&gt;&gt;<br>
</code><pre></pre>
<p>

returns the events as a <a href="Pxp_types.html#TYPEevent"><code class="code"><span class="constructor">Pxp_types</span>.event</code></a><code class="code"> list</code> whereas 
<p>

<pre></pre><code class="code"><span class="keyword">let</span>&nbsp;pull_book&nbsp;=&nbsp;<br>
&nbsp;&nbsp;&lt;:pxp_evpull&lt;&nbsp;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&lt;book&gt;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&nbsp;&lt;title&gt;[&nbsp;<span class="string">"The&nbsp;Lord&nbsp;of&nbsp;The&nbsp;Rings"</span>&nbsp;]<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;author&gt;[&nbsp;<span class="string">"J.R.R.&nbsp;Tolkien"</span>&nbsp;]<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;]<br>
&nbsp;&nbsp;&gt;&gt;<br>
</code><pre></pre>
<p>

defines <code class="code">pull_book</code> as an automaton from which one can pull the events
like from a pull parser, i.e. <code class="code">pull_book</code> is of type
<code class="code">unit<span class="keywordsign">-&gt;</span></code><a href="Pxp_types.html#TYPEevent"><code class="code"><span class="constructor">Pxp_types</span>.event</code></a><code class="code"> option</code>, and by calling it one can get the
events one after the other. <code class="code">pull_book</code> has the same type as the pull
function returned by the pull parser.
<p>

For a more complete discussion see <a href="Intro_preprocessor.html#events"><i>Generating events: pxp_evlist and pxp_evpull</i></a>.
<p>

Note that the preprocessor does not add any wrapping for documents or
non-documents to the event stream. See <a href="Intro_preprocessor.html#documents"><i>Documents</i></a>
for an example of how to add such a wrapping in a postprocessing step
in user code.
<p>

<a name="3_Pushorpull"></a>
<h3>Push or pull?</h3>
<p>

The question arises whether one should prefer the push or the pull
model.  Generally, it is easy to turn a pull parser into a push parser
by adding a loop that repeatedly invokes <code class="code">pull</code> to get the events, and
then calls the push function to deliver each event. There is no such
possibility the other way round, i.e. one cannot take a push parser
and make it look like a pull parser by wrapping it into some interface
adapter - at least not in a language like O'Caml that does not know
coroutines or continuations as language elements. Effectively, the
pull model is the more general one.
<p>

The function <a href="Pxp_event.html#VALiter"><code class="code"><span class="constructor">Pxp_event</span>.iter</code></a> can be used to turn a pull parser into
a push parser:
<p>

<pre></pre><code class="code"><span class="constructor">Pxp_event</span>.iter&nbsp;push&nbsp;pull<br>
</code><pre></pre>
<p>

The events <code class="code">pull</code>-ed out of the parser engine are delivered one by
one to the receiver by invoking <code class="code">push</code>.
<p>
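
What such an adapter does can be sketched in a few lines. This is only a
model under the assumption that a pull function has type
<code class="code">unit -> 'a option</code>. The names <code class="code">iter_pull</code> and <code class="code">pull_of_list</code> are hypothetical,
and the code is not the implementation of <a href="Pxp_event.html#VALiter"><code class="code"><span class="constructor">Pxp_event</span>.iter</code></a>:
<p>

```ocaml
(* Model of a pull-to-push adapter; an illustration only, not the
   implementation of Pxp_event.iter. *)
let rec iter_pull (push : 'a -> unit) (pull : unit -> 'a option) : unit =
  match pull () with
  | None -> ()                                (* end of the pull stream *)
  | Some ev -> push ev; iter_pull push pull   (* deliver, then continue *)

(* Hypothetical helper: a pull function that reads from a list. *)
let pull_of_list items =
  let rest = ref items in
  fun () ->
    match !rest with
    | [] -> None
    | x :: tl -> rest := tl; Some x

(* The "receiver" simply records what was pushed to it. *)
let collected = ref []

let () =
  iter_pull
    (fun ev -> collected := ev :: !collected)
    (pull_of_list [ 1; 2; 3 ])
```
<p>

The loop is the whole adapter: the consumer of the pull stream drives the
control flow, which is why the conversion in this direction is easy.
<p>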

In PXP, the pull model is preferred, and a number of helper functions
are only available for the pull model. If you need a push-stream
nevertheless, it is recommended to use the pull parser, and to do all
required transformations on it (like filtering, see below). Finally
use <a href="Pxp_event.html#VALiter"><code class="code"><span class="constructor">Pxp_event</span>.iter</code></a> to turn the pull stream into a push-compatible
stream.
<p>

<a name="filters"></a>
<h2>Filters</h2>
<p>

Filters are a way to transform event streams (as defined for pull parsers).
For example, one can remove the processing instruction events by
doing the following (<code class="code">pull</code> is the original parser, and <code class="code">pull'</code> is
the modified pull function for the transformed stream):
<p>

<pre></pre><code class="code"><span class="keyword">let</span>&nbsp;pull'&nbsp;=&nbsp;<span class="constructor">Pxp_event</span>.pfilter<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(<span class="keyword">function</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keywordsign">|</span>&nbsp;<span class="constructor">E_pinstr</span>(_,_,_)&nbsp;<span class="keywordsign">-&gt;</span>&nbsp;<span class="keyword">false</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keywordsign">|</span>&nbsp;_&nbsp;<span class="keywordsign">-&gt;</span>&nbsp;<span class="keyword">true</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;pull<br>
</code><pre></pre>
<p>

When events are read from <code class="code">pull'</code>, the events are also read from <code class="code">pull</code>,
but all processing instruction events are suppressed. <a href="Pxp_event.html#VALpfilter"><code class="code"><span class="constructor">Pxp_event</span>.pfilter</code></a>
works a lot like <code class="code"><span class="constructor">List</span>.filter</code> - it only keeps the events in the stream
for which a predicate function returns <code class="code"><span class="keyword">true</span></code>.
<p>
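
The idea of a predicate filter on a pull stream can be modelled without PXP.
The sketch below assumes pull functions of type <code class="code">unit -> 'a option</code>. The names
<code class="code">filter_pull</code>, <code class="code">pull_of_list</code> and <code class="code">list_of_pull</code> are hypothetical, and the code is
not the implementation of <a href="Pxp_event.html#VALpfilter"><code class="code"><span class="constructor">Pxp_event</span>.pfilter</code></a>:
<p>

```ocaml
(* Model of a predicate filter on a pull stream (illustration only). *)
let rec filter_pull pred pull () =
  match pull () with
  | Some ev when not (pred ev) -> filter_pull pred pull ()  (* skip it *)
  | result -> result            (* an accepted event, or None at the end *)

(* Hypothetical helpers to convert between lists and pull functions. *)
let pull_of_list items =
  let rest = ref items in
  fun () ->
    match !rest with
    | [] -> None
    | x :: tl -> rest := tl; Some x

let list_of_pull pull =
  let rec go acc =
    match pull () with
    | None -> List.rev acc
    | Some x -> go (x :: acc)
  in
  go []

(* Two filters chained: keep even numbers, then those greater than 2. *)
let kept =
  pull_of_list [ 1; 2; 3; 4; 5; 6 ]
  |> filter_pull (fun n -> n mod 2 = 0)
  |> filter_pull (fun n -> n > 2)
  |> list_of_pull
```
<p>

Because a filter takes a pull function and returns a pull function, filters
compose by plain function application.
<p>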

<a name="3_Normalizingcharacterdataevents"></a>
<h3>Normalizing character data events</h3>
<p>

<a href="Pxp_event.html#VALnorm_cdata_filter"><code class="code"><span class="constructor">Pxp_event</span>.norm_cdata_filter</code></a> is a special predefined filter that
transforms <code class="code"><span class="constructor">E_char_data</span></code> events so that<ul>
<li>empty <code class="code"><span class="constructor">E_char_data</span></code> events are removed</li>
<li>adjacent <code class="code"><span class="constructor">E_char_data</span></code> events are concatenated and replaced by a single
  <code class="code"><span class="constructor">E_char_data</span></code> event</li>
</ul>

The filter is simply called by
<p>

<pre></pre><code class="code"><span class="keyword">let</span>&nbsp;pull'&nbsp;=&nbsp;<span class="constructor">Pxp_event</span>.norm_cdata_filter&nbsp;pull<br>
</code><pre></pre>
<p>
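
The two rules can be illustrated with a stand-alone model. The sketch below
assumes a simplified local event type <code class="code">ev</code> in place of <a href="Pxp_types.html#TYPEevent"><code class="code"><span class="constructor">Pxp_types</span>.event</code></a>, and
the names <code class="code">norm_chars</code>, <code class="code">pull_of_list</code> and <code class="code">list_of_pull</code> are hypothetical. It is
not the library's implementation, only one way to get the described effect
with a single event of lookahead:
<p>

```ocaml
(* Simplified stand-in for Pxp_types.event (assumption for this sketch). *)
type ev = Start of string | End of string | Chars of string

let norm_chars (pull : unit -> ev option) : unit -> ev option =
  let lookahead = ref None in      (* event read too early, if any *)
  let finished = ref false in      (* never call pull after it said None *)
  let next () =
    match !lookahead with
    | Some _ as e -> lookahead := None; e
    | None ->
        if !finished then None
        else (match pull () with
              | None -> finished := true; None
              | e -> e)
  in
  (* Append all directly following character data to buf. *)
  let rec gather buf =
    match next () with
    | Some (Chars s) -> Buffer.add_string buf s; gather buf
    | other -> lookahead := other
  in
  let rec pull' () =
    match next () with
    | Some (Chars s) ->
        let buf = Buffer.create 16 in
        Buffer.add_string buf s;
        gather buf;
        if Buffer.length buf = 0 then pull' ()      (* drop empty data *)
        else Some (Chars (Buffer.contents buf))     (* merged event *)
    | other -> other
  in
  pull'

let pull_of_list items =
  let rest = ref items in
  fun () ->
    match !rest with
    | [] -> None
    | x :: tl -> rest := tl; Some x

let list_of_pull pull =
  let rec go acc =
    match pull () with
    | None -> List.rev acc
    | Some x -> go (x :: acc)
  in
  go []

let normalized =
  list_of_pull
    (norm_chars
       (pull_of_list
          [ Start "a"; Chars "4"; Chars ""; Chars "3"; End "a"; Chars "" ]))
```
<p>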

<a name="3_Removingignorablewhitespace"></a>
<h3>Removing ignorable whitespace</h3>
<p>

In validation mode, the DTD may specify ignorable whitespace. This is
whitespace that is known to exist only to make the XML tree more
readable (indentation etc.). In tree mode, ignorable whitespace is
removed by default (see <code class="code">drop_ignorable_whitespace</code> in
<a href="Pxp_types.html#TYPEconfig"><code class="code"><span class="constructor">Pxp_types</span>.config</code></a>).
<p>

It is possible to clean up the event stream in this way - although the
event mode is not capable of doing a full validation of the XML
document. It is required, however, that all declarations are added to
the DTD object. This is done by setting the flags <code class="code"><span class="keywordsign">`</span><span class="constructor">Extend_dtd_fully</span></code>
or <code class="code"><span class="keywordsign">`</span><span class="constructor">Val_mode_dtd</span></code> in the entry point, e.g. use
<p>

<pre></pre><code class="code"><span class="keyword">let</span>&nbsp;entry&nbsp;=&nbsp;<span class="keywordsign">`</span><span class="constructor">Entry_document</span>&nbsp;[<span class="keywordsign">`</span><span class="constructor">Extend_dtd_fully</span>]<br>
</code><pre></pre>
<p>

when you create the pull parser. The declarations of the XML elements
are needed to check whether whitespace can be dropped.
<p>

The filter function is <a href="Pxp_event.html#VALdrop_ignorable_whitespace_filter"><code class="code"><span class="constructor">Pxp_event</span>.drop_ignorable_whitespace_filter</code></a>.
Use it like
<p>

<pre></pre><code class="code"><span class="keyword">let</span>&nbsp;pull'&nbsp;=&nbsp;<span class="constructor">Pxp_event</span>.drop_ignorable_whitespace_filter&nbsp;pull<br>
</code><pre></pre>
<p>

This filter does two things:<ul>
<li>it checks whether non-whitespace is used in forbidden places, e.g.
  as children of an element that is declared with a regular expression
  content model</li>
<li>it removes <code class="code"><span class="constructor">E_char_data</span></code> events only consisting of whitespace when
  they are ignorable.</li>
</ul>

A stream that was already normalized remains normalized, i.e.
you can use this filter before or after <a href="Pxp_event.html#VALnorm_cdata_filter"><code class="code"><span class="constructor">Pxp_event</span>.norm_cdata_filter</code></a>.
<p>

<a name="3_Unwrappingdocuments"></a>
<h3>Unwrapping documents</h3>
<p>

Sometimes it is necessary to get rid of the document wrapping. The
filter <a href="Pxp_event.html#VALunwrap_document"><code class="code"><span class="constructor">Pxp_event</span>.unwrap_document</code></a> can do this. Call it like:
<p>

<pre></pre><code class="code"><span class="keyword">let</span>&nbsp;get_doc_details,&nbsp;pull'&nbsp;=&nbsp;<span class="constructor">Pxp_event</span>.unwrap_document&nbsp;pull<br>
</code><pre></pre>
<p>

The filter removes all <code class="code"><span class="constructor">E_start_doc</span></code>, <code class="code"><span class="constructor">E_end_doc</span></code>, <code class="code"><span class="constructor">E_start_super</span></code>,
<code class="code"><span class="constructor">E_end_super</span></code>, and <code class="code"><span class="constructor">E_end_of_stream</span></code> events. Also, when an <code class="code"><span class="constructor">E_error</span></code>
event is encountered, the attached exception is raised. The information
attached to the removed <code class="code"><span class="constructor">E_start_doc</span></code> event can be retrieved by
calling <code class="code">get_doc_details</code>:
<p>

<pre></pre><code class="code"><span class="keyword">let</span>&nbsp;xml_version,&nbsp;dtd&nbsp;=&nbsp;get_doc_details()<br>
</code><pre></pre>
<p>

Note that this call will fail if there is no <code class="code"><span class="constructor">E_start_doc</span></code>, and it can
fail if it is not at the expected position in the stream. If you parse
with the entry <code class="code"><span class="keywordsign">`</span><span class="constructor">Entry_document</span></code>, this cannot happen, though.
<p>

It is allowed to call <code class="code">get_doc_details</code> before using <code class="code">pull'</code>.
<p>

<a name="3_Chainingfilters"></a>
<h3>Chaining filters</h3>
<p>

It is allowed to chain filters, e.g.
<p>

<pre></pre><code class="code"><span class="keyword">let</span>&nbsp;pull1&nbsp;=&nbsp;<span class="constructor">Pxp_event</span>.drop_ignorable_whitespace_filter&nbsp;pull<br>
<span class="keyword">let</span>&nbsp;pull2&nbsp;=&nbsp;<span class="constructor">Pxp_event</span>.norm_cdata_filter&nbsp;pull1<br>
</code><pre></pre>
<p>

<a name="3_Otherhelperfunctions"></a>
<h3>Other helper functions</h3>
<p>

In <a href="Pxp_event.html"><code class="code"><span class="constructor">Pxp_event</span></code></a> there are also other helper functions besides filters.
These functions support:<ul>
<li>conversion of pull streams to and from lists</li>
<li>concatenation of pull streams</li>
<li>extraction of nodes from pull streams</li>
<li>printing of pull streams</li>
<li>splitting of namespace names</li>
</ul>

<a name="namespaces"></a>
<h2>Events and namespaces</h2>
<p>

Namespace processing can also be enabled in event mode. This means that
prefix normalization is applied to all names of elements and attributes.
For example, this piece of code parses a file in event mode with enabled
namespace processing:
<p>

<pre></pre><code class="code">&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;nsmng&nbsp;=&nbsp;<span class="constructor">Pxp_dtd</span>.create_namespace_manager()<br>
&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;config&nbsp;=&nbsp;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{&nbsp;<span class="constructor">Pxp_types</span>.default_config&nbsp;<span class="keyword">with</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;enable_namespace_processing&nbsp;=&nbsp;<span class="constructor">Some</span>&nbsp;nsmng<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br>
&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;source&nbsp;=&nbsp;...<br>
&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;entmng&nbsp;=&nbsp;<span class="constructor">Pxp_ev_parser</span>.create_entity_manager&nbsp;config&nbsp;source<br>
&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;pull&nbsp;=&nbsp;<span class="constructor">Pxp_ev_parser</span>.create_pull_parser&nbsp;config&nbsp;entry&nbsp;entmng<br>
</code><pre></pre>
<p>

The names returned in <code class="code"><span class="constructor">E_start_tag</span>(name,attlist,scope_opt,entid)</code> are
prefix-normalized, i.e. <code class="code">name</code> and the attribute names in <code class="code">attlist</code>.
The functions <a href="Pxp_event.html#VALnamespace_split"><code class="code"><span class="constructor">Pxp_event</span>.namespace_split</code></a> and <a href="Pxp_event.html#VALextract_prefix"><code class="code"><span class="constructor">Pxp_event</span>.extract_prefix</code></a>
can be useful to analyze the names. For example, to get the namespace
URI of an element name, do:
<p>

<pre></pre><code class="code">&nbsp;&nbsp;<span class="keyword">match</span>&nbsp;ev&nbsp;<span class="keyword">with</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="keywordsign">|</span>&nbsp;<span class="constructor">Pxp_types</span>.<span class="constructor">E_start_tag</span>(name,_,_,_)&nbsp;<span class="keywordsign">-&gt;</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;prefix&nbsp;=&nbsp;<span class="constructor">Pxp_event</span>.extract_prefix&nbsp;name&nbsp;<span class="keyword">in</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;uri&nbsp;=&nbsp;nsmng&nbsp;<span class="keywordsign">#</span>&nbsp;get_primary_uri&nbsp;prefix&nbsp;<span class="keyword">in</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;...<br>
</code><pre></pre>
<p>

Note that this may raise the exception <code class="code"><span class="constructor">Namespace_prefix_not_managed</span></code>
if the prefix is unknown or empty.
<p>
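
To make the structure of a prefix-normalized name concrete, here is a
stand-alone model of splitting a name at the first colon. The function
<code class="code">split_name</code> is hypothetical and only illustrates the idea. In particular,
returning an empty prefix for names without a colon is an assumption of this
sketch, so consult <a href="Pxp_event.html#VALnamespace_split"><code class="code"><span class="constructor">Pxp_event</span>.namespace_split</code></a> for the actual behaviour:
<p>

```ocaml
(* Model of splitting "prefix:local" into its two parts (illustration
   only). Names without a colon get the empty prefix, which is an
   assumption of this sketch. *)
let split_name (name : string) : string * string =
  match String.index_opt name ':' with
  | None -> ("", name)
  | Some i ->
      ( String.sub name 0 i,
        String.sub name (i + 1) (String.length name - i - 1) )

let () =
  let prefix, local = split_name "html:body" in
  Printf.printf "prefix=%s local=%s\n" prefix local
```
<p>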

When namespace processing is enabled, the namespace scopes are
included in the <code class="code"><span class="constructor">E_start_tag</span></code> events. This can be used to get the
display (original) prefix:
<p>

<pre></pre><code class="code">&nbsp;&nbsp;<span class="keyword">match</span>&nbsp;ev&nbsp;<span class="keyword">with</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="keywordsign">|</span>&nbsp;<span class="constructor">Pxp_types</span>.<span class="constructor">E_start_tag</span>(name,_,<span class="constructor">Some</span>&nbsp;scope,_)&nbsp;<span class="keywordsign">-&gt;</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;prefix&nbsp;=&nbsp;<span class="constructor">Pxp_event</span>.extract_prefix&nbsp;name&nbsp;<span class="keyword">in</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;dsp_prefix&nbsp;=&nbsp;scope&nbsp;<span class="keywordsign">#</span>&nbsp;display_prefix_of_normprefix&nbsp;prefix&nbsp;<span class="keyword">in</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;...<br>
</code><pre></pre>
<p>

Note that this may raise the exception <code class="code"><span class="constructor">Namespace_prefix_not_managed</span></code>
if the prefix is unknown or empty, or <code class="code"><span class="constructor">Namespace_not_in_scope</span></code> if the
prefix is not declared for this part of the XML text.
<p>

<a name="2_ExamplePrinttheeventswhileparsing"></a>
<h2>Example: Print the events while parsing</h2>
<p>

The following piece of code parses an XML file in event mode, and
prints the events. The reader is encouraged to modify the code by
e.g. adding filters, to see the effect.
<p>

<pre></pre><code class="code">&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;config&nbsp;=&nbsp;<span class="constructor">Pxp_types</span>.default_config<br>
&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;source&nbsp;=&nbsp;<span class="constructor">Pxp_types</span>.from_file&nbsp;<span class="string">"filename.xml"</span><br>
&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;entmng&nbsp;=&nbsp;<span class="constructor">Pxp_ev_parser</span>.create_entity_manager&nbsp;config&nbsp;source<br>
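&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;entry&nbsp;=&nbsp;<span class="keywordsign">`</span><span class="constructor">Entry_document</span>&nbsp;[]<br>
&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;pull&nbsp;=&nbsp;<span class="constructor">Pxp_ev_parser</span>.create_pull_parser&nbsp;config&nbsp;entry&nbsp;entmng<br>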
&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;()&nbsp;=&nbsp;<span class="constructor">Pxp_event</span>.iter<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(<span class="keyword">fun</span>&nbsp;ev&nbsp;<span class="keywordsign">-&gt;</span>&nbsp;print_endline&nbsp;(<span class="constructor">Pxp_event</span>.string_of_event&nbsp;ev))<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;pull<br>
</code><pre></pre>
<p>

<a name="recdesc"></a>
<h2>Connect PXP with a recursive-descent parser</h2>
<p>

We assume here that a list of integers like
<p>

<pre></pre><code class="code">&nbsp;&nbsp;&nbsp;43&nbsp;::&nbsp;44&nbsp;::&nbsp;[]<br>
</code><pre></pre>
<p>

is represented in XML as
<p>

<pre></pre><code class="code">&nbsp;&nbsp;&lt;list&gt;&lt;cons&gt;&lt;int&gt;43&lt;/int&gt;&lt;cons&gt;&lt;int&gt;44&lt;/int&gt;&lt;nil/&gt;&lt;/cons&gt;&lt;/cons&gt;&lt;/list&gt;<br>
</code><pre></pre>
<p>

i.e. we have<ul>
<li><code class="code">list</code> indicates that the single child is a list</li>
<li><code class="code">cons</code> has two children: the first is the head of the list, and the
  second the tail (think <code class="code">head :: tail</code> in O'Caml)</li>
<li><code class="code">nil</code> is the empty list</li>
<li><code class="code">int</code> is an integer member of the list</li>
</ul>

We want to parse such XML texts by using the event-oriented parser, and
combine it with a recursive-descent grammar. The XML parser delivers
events which are taken as the tokens of the second parser.
<p>

<pre></pre><code class="code"><span class="keyword">let</span>&nbsp;parse_list&nbsp;(s:string)&nbsp;=<br>
<br>
&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;<span class="keyword">rec</span>&nbsp;parse_whole_list&nbsp;stream&nbsp;=<br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="comment">(*&nbsp;Production:<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;whole_list&nbsp;::=&nbsp;"&lt;list&gt;"&nbsp;sub_list&nbsp;"&lt;/list&gt;"&nbsp;END<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*)</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">match</span>&nbsp;stream&nbsp;<span class="keyword">with</span>&nbsp;<span class="keyword">parser</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&lt;&nbsp;<span class="keywordsign">'</span><span class="constructor">E_start_tag</span>(<span class="string">"list"</span>,_,_,_);<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;l&nbsp;=&nbsp;parse_sub_list;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keywordsign">'</span><span class="constructor">E_end_tag</span>(<span class="string">"list"</span>,_);<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keywordsign">'</span><span class="constructor">E_end_of_stream</span>;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&gt;]&nbsp;<span class="keywordsign">-&gt;</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;l<br>
<br>
&nbsp;&nbsp;<span class="keyword">and</span>&nbsp;parse_sub_list&nbsp;stream&nbsp;=<br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="comment">(*&nbsp;Production:<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;sub_list&nbsp;::=&nbsp;"&lt;cons&gt;"&nbsp;object&nbsp;sub_list&nbsp;"&lt;/cons&gt;"<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|&nbsp;"&lt;nil&gt;"&nbsp;"&lt;/nil&gt;"<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*)</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">match</span>&nbsp;stream&nbsp;<span class="keyword">with</span>&nbsp;<span class="keyword">parser</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&lt;&nbsp;<span class="keywordsign">'</span><span class="constructor">E_start_tag</span>(<span class="string">"cons"</span>,_,_,_);&nbsp;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;head&nbsp;=&nbsp;parse_object;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;tail&nbsp;=&nbsp;parse_sub_list;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keywordsign">'</span><span class="constructor">E_end_tag</span>(<span class="string">"cons"</span>,_)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&gt;]&nbsp;<span class="keywordsign">-&gt;</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;head&nbsp;::&nbsp;tail<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keywordsign">|</span>&nbsp;[&lt;&nbsp;<span class="keywordsign">'</span><span class="constructor">E_start_tag</span>(<span class="string">"nil"</span>,_,_,_);&nbsp;<span class="keywordsign">'</span><span class="constructor">E_end_tag</span>(<span class="string">"nil"</span>,_)&nbsp;&gt;]&nbsp;<span class="keywordsign">-&gt;</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[]<br>
<br>
&nbsp;&nbsp;<span class="keyword">and</span>&nbsp;parse_object&nbsp;stream&nbsp;=<br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="comment">(*&nbsp;Production:<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;object&nbsp;::=&nbsp;"&lt;int&gt;"&nbsp;text&nbsp;"&lt;/int&gt;"<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;with&nbsp;constraint&nbsp;that&nbsp;text&nbsp;is&nbsp;an&nbsp;integer&nbsp;parsable&nbsp;by&nbsp;int_of_string<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*)</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">match</span>&nbsp;stream&nbsp;<span class="keyword">with</span>&nbsp;<span class="keyword">parser</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&lt;&nbsp;<span class="keywordsign">'</span><span class="constructor">E_start_tag</span>(<span class="string">"int"</span>,_,_,_);<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;number&nbsp;=&nbsp;parse_text;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keywordsign">'</span><span class="constructor">E_end_tag</span>(<span class="string">"int"</span>,_)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&gt;]&nbsp;<span class="keywordsign">-&gt;</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;int_of_string&nbsp;number<br>
<br>
&nbsp;&nbsp;<span class="keyword">and</span>&nbsp;parse_text&nbsp;stream&nbsp;=<br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="comment">(*&nbsp;Production:<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;text&nbsp;::=&nbsp;"any&nbsp;XML&nbsp;character&nbsp;data"<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*)</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">match</span>&nbsp;stream&nbsp;<span class="keyword">with</span>&nbsp;<span class="keyword">parser</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&lt;&nbsp;<span class="keywordsign">'</span><span class="constructor">E_char_data</span>&nbsp;data;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;rest&nbsp;=&nbsp;parse_text<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&gt;]&nbsp;<span class="keywordsign">-&gt;</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;data&nbsp;^&nbsp;rest<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keywordsign">|</span>&nbsp;[&lt;&nbsp;&gt;]&nbsp;<span class="keywordsign">-&gt;</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="string">""</span><br>
&nbsp;&nbsp;<span class="keyword">in</span><br>
<br>
&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;config&nbsp;=&nbsp;<br>
&nbsp;&nbsp;&nbsp;&nbsp;{&nbsp;<span class="constructor">Pxp_types</span>.default_config&nbsp;<span class="keyword">with</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;store_element_positions&nbsp;=&nbsp;<span class="keyword">false</span>;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="comment">(*&nbsp;don't&nbsp;produce&nbsp;E_position&nbsp;events&nbsp;*)</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;}<br>
&nbsp;&nbsp;<span class="keyword">in</span><br>
&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;mgr&nbsp;=&nbsp;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="constructor">Pxp_ev_parser</span>.create_entity_manager<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;config<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(<span class="constructor">Pxp_types</span>.from_string&nbsp;s)&nbsp;<span class="keyword">in</span><br>
&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;pull&nbsp;=&nbsp;<br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="constructor">Pxp_ev_parser</span>.create_pull_parser&nbsp;config&nbsp;(<span class="keywordsign">`</span><span class="constructor">Entry_content</span>[])&nbsp;mgr&nbsp;<span class="keyword">in</span><br>
&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;pull'&nbsp;=<br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="constructor">Pxp_event</span>.norm_cdata_filter&nbsp;pull&nbsp;<span class="keyword">in</span><br>
&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;next_event_or_error&nbsp;n&nbsp;=<br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;e&nbsp;=&nbsp;pull'&nbsp;n&nbsp;<span class="keyword">in</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">match</span>&nbsp;e&nbsp;<span class="keyword">with</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="constructor">Some</span>(<span class="constructor">E_error</span>&nbsp;exn)&nbsp;<span class="keywordsign">-&gt;</span>&nbsp;raise&nbsp;exn<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keywordsign">|</span>&nbsp;_&nbsp;<span class="keywordsign">-&gt;</span>&nbsp;e<br>
&nbsp;&nbsp;<span class="keyword">in</span><br>
&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;stream&nbsp;=<br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="constructor">Stream</span>.from&nbsp;next_event_or_error&nbsp;<span class="keyword">in</span><br>
&nbsp;&nbsp;parse_whole_list&nbsp;stream<br>
</code><pre></pre>
<p>

The trick is to use <code class="code"><span class="constructor">Stream</span>.from</code> to convert the "pull-style" event stream
into a <code class="code"><span class="constructor">Stream</span>.t</code>. This kind of stream can be parsed in a recursive-descent
way with the stream parser syntax of O'Caml.
<p>

Note that we normalize the character data nodes. The grammar can only
process a single <code class="code"><span class="constructor">E_char_data</span></code> event, and this normalization enforces
that adjacent <code class="code"><span class="constructor">E_char_data</span></code> events are merged.
<p>

Note that you have to enable camlp4 when compiling this example, because
the stream parsers are only available via camlp4.
<p>
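
If the camlp4 dependency is not wanted, the same grammar can also be written
as ordinary recursive functions that call the pull function directly. The
following is a minimal sketch under stated assumptions: the local type <code class="code">ev</code>
stands in for <a href="Pxp_types.html#TYPEevent"><code class="code"><span class="constructor">Pxp_types</span>.event</code></a> (with <code class="code"><span class="constructor">Eos</span></code> playing the role of
<code class="code"><span class="constructor">E_end_of_stream</span></code>), the helper <code class="code">pull_of_list</code> and the exception <code class="code"><span class="constructor">Parse_error</span></code>
are hypothetical, and the character data is assumed to be normalized already
so that each <code class="code">int</code> element contains exactly one character data event:
<p>

```ocaml
(* Simplified stand-in for Pxp_types.event (assumption for this sketch). *)
type ev = Start of string | End of string | Chars of string | Eos

exception Parse_error of string

let parse_list (pull : unit -> ev option) : int list =
  (* Consume one event and insist that it is the wanted one. *)
  let expect wanted =
    match pull () with
    | Some got when got = wanted -> ()
    | _ -> raise (Parse_error "unexpected event")
  in
  (* object ::= "<int>" text "</int>" *)
  let parse_object () =
    expect (Start "int");
    match pull () with
    | Some (Chars text) ->
        expect (End "int");
        int_of_string text
    | _ -> raise (Parse_error "expected character data")
  in
  (* sub_list ::= "<cons>" object sub_list "</cons>" | "<nil>" "</nil>" *)
  let rec parse_sub_list () =
    match pull () with
    | Some (Start "cons") ->
        let head = parse_object () in
        let tail = parse_sub_list () in
        expect (End "cons");
        head :: tail
    | Some (Start "nil") ->
        expect (End "nil");
        []
    | _ -> raise (Parse_error "expected <cons> or <nil>")
  in
  (* whole_list ::= "<list>" sub_list "</list>" END *)
  expect (Start "list");
  let l = parse_sub_list () in
  expect (End "list");
  expect Eos;
  l

(* Hypothetical helper: a pull function that reads from a list. *)
let pull_of_list items =
  let rest = ref items in
  fun () ->
    match !rest with
    | [] -> None
    | x :: tl -> rest := tl; Some x

let example =
  parse_list
    (pull_of_list
       [ Start "list";
         Start "cons"; Start "int"; Chars "43"; End "int";
         Start "cons"; Start "int"; Chars "44"; End "int";
         Start "nil"; End "nil";
         End "cons"; End "cons";
         End "list"; Eos ])
```
<p>

No lookahead is needed here because every alternative of the grammar is
selected by the event that has just been consumed.
<p>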

<a name="escape"></a>
<h2>Escape PXP parsing</h2>
<p>

<b>This feature is still considered experimental!</b>
<p>

It is possible to define two escaping functions in <a href="Pxp_types.html#TYPEconfig"><code class="code"><span class="constructor">Pxp_types</span>.config</code></a>:<ul>
<li><code class="code">escape_contents</code>: This function is called when one of the character sequences
  <code class="code">{</code>, <code class="code">}</code>, <code class="code">{{</code>, or <code class="code">}}</code> is found in character data context.</li>
<li><code class="code">escape_attributes</code>: This function is called when one of the 
  mentioned special characters is found in the value of an attribute.</li>
</ul>

Both escaping functions are allowed to operate directly on the
underlying lexical buffer PXP uses, and because of this these
functions can interpret the following characters in an arbitrary
special way. The escaping functions have to return a replacement text,
i.e.  a string that is to be taken as character data or as attribute
value (depending on context).
<p>

Why are the curly braces taken as escaping characters? This is
motivated by the XQuery language. Here, a single <code class="code">{</code> switches from
the XML object language to the XQuery meta language until another <code class="code">}</code>
terminates this mode. By doubling the brace character, it loses its
escaping function, and a single brace character is assumed.
<p>

A simple example makes this clearer. We allow here that a number
is written between curly braces in hexadecimal, octal or binary
notation using the conventions of O'Caml. The number is inserted into
the event stream in normalized decimal notation (i.e. no leading zeros).
For instance, one can write
<p>

<pre></pre><code class="code">&nbsp;&nbsp;&lt;foo&nbsp;x=<span class="string">"{0xff}"</span>&nbsp;y=<span class="string">"{{}}"</span>&gt;{0o76}&lt;/foo&gt;<br>
</code><pre></pre>
<p>

and the parser emits the events
<p>

<pre></pre><code class="code">&nbsp;&nbsp;&nbsp;&nbsp;<span class="constructor">E_start_tag</span>(<span class="string">"foo"</span>,&nbsp;[<span class="string">"x"</span>,&nbsp;<span class="string">"255"</span>;&nbsp;<span class="string">"y"</span>,&nbsp;<span class="string">"{}"</span>&nbsp;],&nbsp;_,&nbsp;_)<br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="constructor">E_char_data</span>(<span class="string">"62"</span>)<br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="constructor">E_end_tag</span>(<span class="string">"foo"</span>,_)<br>
</code><pre></pre>
<p>

Of course, this example is very trivial, and in this case, one could
also get the same effect by postprocessing the XML events. We want
to point out, however, that the escaping feature makes it possible to
combine PXP with a foreign language with its own lexing and parsing
functions.
<p>

First, we need a lexer - this is <code class="code">lex.mll</code>:
<p>

<pre></pre><code class="code">&nbsp;&nbsp;rule&nbsp;scan_number&nbsp;=&nbsp;parse<br>
&nbsp;&nbsp;&nbsp;<span class="keywordsign">|</span>&nbsp;[&nbsp;<span class="string">'0'</span>-<span class="string">'9'</span>&nbsp;]+&nbsp;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{&nbsp;<span class="keywordsign">`</span><span class="constructor">Int</span>&nbsp;(int_of_string&nbsp;(<span class="constructor">Lexing</span>.lexeme&nbsp;lexbuf))&nbsp;}<br>
&nbsp;&nbsp;&nbsp;<span class="keywordsign">|</span>&nbsp;(<span class="string">"0b"</span><span class="keywordsign">|</span><span class="string">"0B"</span>)&nbsp;[&nbsp;<span class="string">'0'</span>-<span class="string">'1'</span>&nbsp;]+&nbsp;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{&nbsp;<span class="keywordsign">`</span><span class="constructor">Int</span>&nbsp;(int_of_string&nbsp;(<span class="constructor">Lexing</span>.lexeme&nbsp;lexbuf))&nbsp;}<br>
&nbsp;&nbsp;&nbsp;<span class="keywordsign">|</span>&nbsp;(<span class="string">"0o"</span><span class="keywordsign">|</span><span class="string">"0O"</span>)&nbsp;[&nbsp;<span class="string">'0'</span>-<span class="string">'7'</span>&nbsp;]+&nbsp;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{&nbsp;<span class="keywordsign">`</span><span class="constructor">Int</span>&nbsp;(int_of_string&nbsp;(<span class="constructor">Lexing</span>.lexeme&nbsp;lexbuf))&nbsp;}<br>
&nbsp;&nbsp;&nbsp;<span class="keywordsign">|</span>&nbsp;(<span class="string">"0x"</span><span class="keywordsign">|</span><span class="string">"0X"</span>)&nbsp;[&nbsp;<span class="string">'0'</span>-<span class="string">'9'</span>&nbsp;<span class="string">'a'</span>-<span class="string">'f'</span>&nbsp;<span class="string">'A'</span>-<span class="string">'F'</span>&nbsp;]+&nbsp;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{&nbsp;<span class="keywordsign">`</span><span class="constructor">Int</span>&nbsp;(int_of_string&nbsp;(<span class="constructor">Lexing</span>.lexeme&nbsp;lexbuf))&nbsp;}<br>
&nbsp;&nbsp;&nbsp;<span class="keywordsign">|</span>&nbsp;<span class="string">"}"</span>&nbsp;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{&nbsp;<span class="keywordsign">`</span><span class="constructor">End</span>&nbsp;}<br>
&nbsp;&nbsp;&nbsp;<span class="keywordsign">|</span>&nbsp;_<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{&nbsp;<span class="keywordsign">`</span><span class="constructor">Bad</span>&nbsp;}<br>
&nbsp;&nbsp;&nbsp;<span class="keywordsign">|</span>&nbsp;eof<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{&nbsp;<span class="keywordsign">`</span><span class="constructor">Eof</span>&nbsp;}<br>
</code><pre></pre>
<p>

This lexer recognizes decimal, binary, octal, and hexadecimal numbers.
Conveniently, <code class="code">int_of_string</code> accepts all of these notations, so we
can use it to convert the lexemes to ints. The right
curly brace is also recognized. Any other character leads to a lexing
error (<code class="code"><span class="keywordsign">`</span><span class="constructor">Bad</span></code>). If the input ends, <code class="code"><span class="keywordsign">`</span><span class="constructor">Eof</span></code> is emitted.
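<p>

As a quick check of that claim, the following standalone snippet (not part of
the PXP example) feeds one lexeme per notation to <code class="code">int_of_string</code>. The sample
strings are chosen here purely for illustration; all of them denote 62.
<p>

<pre><code class="code">(* Sample lexemes, one per notation accepted by scan_number. *)
let samples = [ "62"; "0b111110"; "0o76"; "0x3e"; "0X3E" ]

(* int_of_string interprets the 0b/0o/0x prefixes itself. *)
let values = List.map int_of_string samples

(* Print each lexeme next to its integer value. *)
let () = List.iter2 (Printf.printf "%-10s = %d\n") samples values
</code></pre>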
<p>

Now the escape functions. <code class="code">escape_contents</code> looks at the passed token.
If it is a double curly brace, it immediately returns a single brace
as replacement. A single left brace is processed by <code class="code">parse_number</code>,
defined below (in a real source file it must precede <code class="code">escape_contents</code>).
A single right brace is forbidden. No other tokens
are ever passed to <code class="code">escape_contents</code>, hence the final <code class="code"><span class="keyword">assert</span> <span class="keyword">false</span></code>. <code class="code">escape_attributes</code> has
an additional argument, but we can ignore this for now. (This argument
is the position in the attribute value, for advanced post-processing.)
<p>

<pre></pre><code class="code">&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;escape_contents&nbsp;tok&nbsp;mng&nbsp;=<br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">match</span>&nbsp;tok&nbsp;<span class="keyword">with</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keywordsign">|</span>&nbsp;<span class="constructor">Lcurly</span>&nbsp;<span class="comment">(*&nbsp;"{"&nbsp;*)</span>&nbsp;<span class="keywordsign">-&gt;</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;parse_number&nbsp;mng<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keywordsign">|</span>&nbsp;<span class="constructor">LLcurly</span>&nbsp;<span class="comment">(*&nbsp;"{{"&nbsp;*)</span>&nbsp;<span class="keywordsign">-&gt;</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="string">"{"</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keywordsign">|</span>&nbsp;<span class="constructor">Rcurly</span>&nbsp;<span class="comment">(*&nbsp;"}"&nbsp;*)</span>&nbsp;<span class="keywordsign">-&gt;</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;failwith&nbsp;<span class="string">"Single&nbsp;}&nbsp;not&nbsp;allowed"</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keywordsign">|</span>&nbsp;<span class="constructor">RRcurly</span>&nbsp;<span class="comment">(*&nbsp;"}}"&nbsp;*)</span>&nbsp;<span class="keywordsign">-&gt;</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="string">"}"</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keywordsign">|</span>&nbsp;_&nbsp;<span class="keywordsign">-&gt;</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">assert</span>&nbsp;<span class="keyword">false</span><br>
<br>
&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;escape_attributes&nbsp;tok&nbsp;pos&nbsp;mng&nbsp;=<br>
&nbsp;&nbsp;&nbsp;&nbsp;escape_contents&nbsp;tok&nbsp;mng<br>
</code><pre></pre>
<p>

Now, <code class="code">parse_number</code> invokes our custom lexer <code class="code"><span class="constructor">Lex</span>.scan_number</code> with
the (otherwise) internal PXP lexbuf. The function returns the replacement
text.
<p>

It is part of the interface that, when the function returns, the next
token read from the lexbuf must start at the character following the
right curly brace.
<p>

<pre></pre><code class="code">&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;parse_number&nbsp;mng&nbsp;=<br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;lexbuf&nbsp;=&nbsp;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">match</span>&nbsp;mng&nbsp;<span class="keywordsign">#</span>&nbsp;current_lexer_obj&nbsp;<span class="keywordsign">#</span>&nbsp;lexbuf&nbsp;<span class="keyword">with</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keywordsign">|</span>&nbsp;<span class="keywordsign">`</span><span class="constructor">Ocamllex</span>&nbsp;lexbuf&nbsp;<span class="keywordsign">-&gt;</span>&nbsp;lexbuf<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keywordsign">|</span>&nbsp;<span class="keywordsign">`</span><span class="constructor">Netulex</span>&nbsp;_&nbsp;<span class="keywordsign">-&gt;</span>&nbsp;failwith&nbsp;<span class="string">"Netulex&nbsp;lexbufs&nbsp;not&nbsp;supported"</span>&nbsp;<span class="keyword">in</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">match</span>&nbsp;<span class="constructor">Lex</span>.scan_number&nbsp;lexbuf&nbsp;<span class="keyword">with</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keywordsign">|</span>&nbsp;<span class="keywordsign">`</span><span class="constructor">Int</span>&nbsp;n&nbsp;<span class="keywordsign">-&gt;</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;s&nbsp;=&nbsp;string_of_int&nbsp;n&nbsp;<span class="keyword">in</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(&nbsp;<span class="keyword">match</span>&nbsp;&nbsp;<span class="constructor">Lex</span>.scan_number&nbsp;lexbuf&nbsp;<span class="keyword">with</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keywordsign">|</span>&nbsp;<span class="keywordsign">`</span><span class="constructor">Int</span>&nbsp;_&nbsp;<span class="keywordsign">-&gt;</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;failwith&nbsp;<span class="string">"More&nbsp;than&nbsp;one&nbsp;number"</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keywordsign">|</span>&nbsp;<span class="keywordsign">`</span><span class="constructor">End</span>&nbsp;<span class="keywordsign">-&gt;</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;()<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keywordsign">|</span>&nbsp;<span class="keywordsign">`</span><span class="constructor">Bad</span>&nbsp;<span class="keywordsign">-&gt;</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;failwith&nbsp;<span class="string">"Bad&nbsp;character"</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keywordsign">|</span>&nbsp;<span class="keywordsign">`</span><span class="constructor">Eof</span>&nbsp;<span class="keywordsign">-&gt;</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;failwith&nbsp;<span class="string">"Unexpected&nbsp;EOF"</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;);<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;s<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keywordsign">|</span>&nbsp;<span class="keywordsign">`</span><span class="constructor">End</span>&nbsp;<span class="keywordsign">-&gt;</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;failwith&nbsp;<span class="string">"Empty&nbsp;curly&nbsp;braces"</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keywordsign">|</span>&nbsp;<span class="keywordsign">`</span><span class="constructor">Bad</span>&nbsp;<span class="keywordsign">-&gt;</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;failwith&nbsp;<span class="string">"Bad&nbsp;character"</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keywordsign">|</span>&nbsp;<span class="keywordsign">`</span><span class="constructor">Eof</span>&nbsp;<span class="keywordsign">-&gt;</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;failwith&nbsp;<span class="string">"Unexpected&nbsp;EOF"</span><br>
</code><pre></pre>
<p>

Due to the way PXP works internally, the method <code class="code">mng <span class="keywordsign">#</span> current_lexer_obj
<span class="keywordsign">#</span> lexbuf</code> can return two different kinds of lexical buffers. <code class="code"><span class="keywordsign">`</span><span class="constructor">Ocamllex</span></code>
means it is a <code class="code"><span class="constructor">Lexing</span>.lexbuf</code> buffer. This type of buffer is used for
all 8-bit encodings, and also when the special <code class="code">pxp-lex-utf8</code> lexer is used.
The lexer <code class="code">pxp-ulex-utf8</code>, however, will return a <code class="code"><span class="keywordsign">`</span><span class="constructor">Netulex</span></code> buffer.
<p>

Finally, we enable our escaping functions in the config record:
<p>

<pre></pre><code class="code"><span class="keyword">let</span>&nbsp;config&nbsp;=<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{&nbsp;<span class="constructor">Pxp_types</span>.default_config&nbsp;<span class="keyword">with</span><br>
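&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;escape_contents&nbsp;=&nbsp;<span class="constructor">Some</span>&nbsp;escape_contents;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;escape_attributes&nbsp;=&nbsp;<span class="constructor">Some</span>&nbsp;escape_attributes<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br>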
</code><pre></pre>
<p>

<a name="3_Howacomplexexamplecouldwork"></a>
<h3>How a complex example could work</h3>
<p>

The example above is simple because the return value is a
string. One can imagine, however, more complex scenarios where one wants to
insert custom events into the event stream. The PXP interface does not
allow this directly. As a workaround, we suggest the following.
<p>

The custom events are collected in special buffers. The buffers are
numbered by sequential integers (0, 1, ...). So <code class="code">escape_contents</code> would
allocate such a buffer and get a number:
<p>

<pre></pre><code class="code">&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;buffer,&nbsp;n&nbsp;=&nbsp;allocate_event_buffer()<br>
</code><pre></pre>
<p>

Here, <code class="code">buffer</code> could be an <code class="code">event <span class="constructor">Queue</span>.t</code>. The number
<code class="code">n</code> identifies the buffer. The buffers, once filled, can be looked up
by
<p>

<pre></pre><code class="code">&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;buffer&nbsp;=&nbsp;lookup_event_buffer&nbsp;n<br>
</code><pre></pre>
<p>

Ideally, <code class="code">escape_contents</code> would return the events collected in the
buffer, so that they are inserted into the event stream at the
position where the curly escape occurs. As this is not allowed, it
simply returns the buffer number instead, wrapped in a marker string,
so that the buffer can be identified later, e.g.
<p>

<pre></pre><code class="code">&nbsp;&nbsp;<span class="string">"{BUFFER&nbsp;"</span>&nbsp;^&nbsp;string_of_int&nbsp;n&nbsp;^&nbsp;<span class="string">"}"</span><br>
</code><pre></pre>
<p>

For the doubled braces that stand for literal curly braces, one would
return special tokens as well, e.g.
<code class="code"><span class="string">"{LCURLY}"</span></code> and <code class="code"><span class="string">"{RCURLY}"</span></code>.
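<p>

The bookkeeping described so far could be sketched as follows. This is only
an illustration under stated assumptions: the <code class="code">event</code> type is a reduced
stand-in for the real PXP event type, the table <code class="code">buffers</code>, the counter
<code class="code">next_number</code> and the helper <code class="code">buffer_marker</code> are hypothetical names, and
<code class="code"><span class="constructor">Hashtbl</span></code> plus <code class="code"><span class="constructor">Queue</span></code> are just one possible way to implement
<code class="code">allocate_event_buffer</code> and <code class="code">lookup_event_buffer</code>.
<p>

<pre><code class="code">(* Stand-in for the PXP event type, reduced to what the sketch needs. *)
type event =
  | E_start_tag of string
  | E_char_data of string
  | E_end_tag of string

(* Registry of event buffers, keyed by their sequence number. *)
let buffers : (int, event Queue.t) Hashtbl.t = Hashtbl.create 16

(* Number that the next allocated buffer will get (0, 1, ...). *)
let next_number = ref 0

(* Create an empty buffer and return it together with its number. *)
let allocate_event_buffer () =
  let n = !next_number in
  incr next_number;
  let buffer = Queue.create () in
  Hashtbl.add buffers n buffer;
  (buffer, n)

(* Find a buffer by number; raises Not_found for unknown numbers. *)
let lookup_event_buffer n = Hashtbl.find buffers n

(* Marker string that escape_contents would return for buffer n. *)
let buffer_marker n = "{BUFFER " ^ string_of_int n ^ "}"

(* Usage: allocate a buffer, fill it, and print its marker. *)
let () =
  let buffer, n = allocate_event_buffer () in
  Queue.add (E_char_data "62") buffer;
  print_endline (buffer_marker n)
</code></pre>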
<p>

Now, the parser, specially configured with <code class="code">escape_contents</code>, will
return event streams where <code class="code"><span class="constructor">E_char_data</span></code> events may include these
special pointers to buffers <code class="code">{<span class="constructor">BUFFER</span> </code>&lt;n&gt;<code class="code">}</code>, and the curly brace tokens
<code class="code">{<span class="constructor">LCURLY</span>}</code> and <code class="code">{<span class="constructor">RCURLY</span>}</code>. In a postprocessing step, all occurrences
of these tokens are located in the event stream, and<ul>
<li>for buffer tokens the buffer contents are looked up (<code class="code">lookup_event_buffer</code>),
  and the events found there are substituted</li>
<li>for <code class="code">{<span class="constructor">LCURLY</span>}</code> an <code class="code"><span class="constructor">E_char_data</span> <span class="string">"{"</span></code> event is substituted</li>
<li>for <code class="code">{<span class="constructor">RCURLY</span>}</code> an <code class="code"><span class="constructor">E_char_data</span> <span class="string">"}"</span></code> event is substituted</li>
</ul>

It can be assumed that the tokens to be located are still <code class="code"><span class="constructor">E_char_data</span></code>
events of their own, i.e. not merged with adjacent <code class="code"><span class="constructor">E_char_data</span></code>
events.
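<p>

A minimal sketch of such a postprocessing step is given below. It relies on
the assumption just stated (every token is an <code class="code"><span class="constructor">E_char_data</span></code> event of its own)
and works on plain lists rather than on a real PXP event stream. The
<code class="code">event</code> type is again a reduced stand-in, and <code class="code">buffer_number</code>, <code class="code">expand</code>
and the <code class="code">lookup</code> argument are hypothetical names introduced for this
illustration only.
<p>

<pre><code class="code">(* Stand-in for the PXP event type, reduced to what the sketch needs. *)
type event =
  | E_start_tag of string
  | E_char_data of string
  | E_end_tag of string

(* Recognize "{BUFFER n}" and return Some n; other strings give None. *)
let buffer_number s =
  try Scanf.sscanf s "{BUFFER %d}%!" (fun n -&gt; Some n)
  with Scanf.Scan_failure _ | Failure _ | End_of_file -&gt; None

(* Replace every marker event; lookup maps a buffer number to the
   list of events collected in that buffer. *)
let expand lookup events =
  let substitute = function
    | E_char_data "{LCURLY}" -&gt; [ E_char_data "{" ]
    | E_char_data "{RCURLY}" -&gt; [ E_char_data "}" ]
    | E_char_data s as ev -&gt;
        (match buffer_number s with
         | Some n -&gt; lookup n
         | None -&gt; [ ev ])
    | ev -&gt; [ ev ] in
  List.concat (List.map substitute events)

(* Usage: buffer 0 holds three events that replace the marker. *)
let () =
  let lookup = function
    | 0 -&gt; [ E_start_tag "b"; E_char_data "62"; E_end_tag "b" ]
    | _ -&gt; [] in
  let stream =
    [ E_start_tag "foo"; E_char_data "{BUFFER 0}";
      E_char_data "{LCURLY}"; E_end_tag "foo" ] in
  assert (expand lookup stream =
            [ E_start_tag "foo";
              E_start_tag "b"; E_char_data "62"; E_end_tag "b";
              E_char_data "{"; E_end_tag "foo" ])
</code></pre>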
<p>

Admittedly, this is a complicated workaround.
<p>

For attributes one can do basically the same. The postprocessing step
may be a lot more complicated, however.
<br>
</body></html>