<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> <head> <link rel="stylesheet" href="style.css" type="text/css"> <meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type"> <link rel="Start" href="index.html"> <link rel="previous" href="Intro_namespaces.html"> <link rel="next" href="Intro_resolution.html"> <link rel="Up" href="index.html"> <link title="Index of types" rel=Appendix href="index_types.html"> <link title="Index of exceptions" rel=Appendix href="index_exceptions.html"> <link title="Index of values" rel=Appendix href="index_values.html"> <link title="Index of class methods" rel=Appendix href="index_methods.html"> <link title="Index of classes" rel=Appendix href="index_classes.html"> <link title="Index of class types" rel=Appendix href="index_class_types.html"> <link title="Index of modules" rel=Appendix href="index_modules.html"> <link title="Index of module types" rel=Appendix href="index_module_types.html"> <link title="Pxp_types" rel="Chapter" href="Pxp_types.html"> <link title="Pxp_document" rel="Chapter" href="Pxp_document.html"> <link title="Pxp_dtd" rel="Chapter" href="Pxp_dtd.html"> <link title="Pxp_tree_parser" rel="Chapter" href="Pxp_tree_parser.html"> <link title="Pxp_core_types" rel="Chapter" href="Pxp_core_types.html"> <link title="Pxp_ev_parser" rel="Chapter" href="Pxp_ev_parser.html"> <link title="Pxp_event" rel="Chapter" href="Pxp_event.html"> <link title="Pxp_dtd_parser" rel="Chapter" href="Pxp_dtd_parser.html"> <link title="Pxp_codewriter" rel="Chapter" href="Pxp_codewriter.html"> <link title="Pxp_marshal" rel="Chapter" href="Pxp_marshal.html"> <link title="Pxp_yacc" rel="Chapter" href="Pxp_yacc.html"> <link title="Pxp_reader" rel="Chapter" href="Pxp_reader.html"> <link title="Intro_trees" rel="Chapter" href="Intro_trees.html"> <link title="Intro_extensions" rel="Chapter" href="Intro_extensions.html"> <link title="Intro_namespaces" rel="Chapter" href="Intro_namespaces.html"> <link title="Intro_events" 
rel="Chapter" href="Intro_events.html"> <link title="Intro_resolution" rel="Chapter" href="Intro_resolution.html"> <link title="Intro_getting_started" rel="Chapter" href="Intro_getting_started.html"> <link title="Intro_advanced" rel="Chapter" href="Intro_advanced.html"> <link title="Intro_preprocessor" rel="Chapter" href="Intro_preprocessor.html"> <link title="Example_readme" rel="Chapter" href="Example_readme.html"><link title="XML data as stream of events" rel="Section" href="#1_XMLdataasstreamofevents"> <link title="The structure of event streams" rel="Subsection" href="#structure"> <link title="Calling the parser in event mode" rel="Subsection" href="#calling"> <link title="Filters" rel="Subsection" href="#filters"> <link title="Events and namespaces" rel="Subsection" href="#namespaces"> <link title="Example: Print the events while parsing" rel="Subsection" href="#2_ExamplePrinttheeventswhileparsing"> <link title="Connect PXP with a recursive-descent parser" rel="Subsection" href="#recdesc"> <link title="Escape PXP parsing" rel="Subsection" href="#escape"> <title>PXP Reference : Intro_events</title> </head> <body> <div class="navbar"><a href="Intro_namespaces.html">Previous</a> <a href="index.html">Up</a> <a href="Intro_resolution.html">Next</a> </div> <center><h1>Intro_events</h1></center> <br> <br> <a name="1_XMLdataasstreamofevents"></a> <h1>XML data as stream of events</h1> <p> In contrast to the tree mode (see <a href="Intro_trees.html"><code class="code"><span class="constructor">Intro_trees</span></code></a>), the parser does not return the complete document at once in event mode, but as a sequence of so-called events. The parser makes a number of guarantees about the structure of the emitted events, especially it is ensured that they conform to the well-formedness constraints. For instance, it is ensured that start tags and end tags are properly nested. Nevertheless, it is up to the caller to process and/or aggregate the events. 
This leaves a lot of freedom for the caller. <p> The event mode is especially well-suited for processing very large documents. Since PXP does not itself hold the complete document in memory, it usually does not need to maintain large data structures in event mode. Of course, the caller should also try to avoid such data structures. In many cases this makes it possible to process even arbitrarily large documents. Note, however, that not all limits are lifted. For example, to check well-formedness the parser still needs to maintain a stack of start tags whose end tags have not been seen yet. Because of this, it is not possible to parse arbitrarily deeply nested documents with constant memory. On 32-bit platforms, the maximum string length of 16 MB also still applies. <p> Another application of event mode is the direct combination with recursive-descent parsers for postprocessing the stream of events. See below <a href="Intro_events.html#recdesc"><i>Connect PXP with a recursive-descent parser</i></a> for more. <p> The event mode also makes it feasible to enable the special escape tokens <code class="code">{</code>, <code class="code">}</code>, <code class="code">{{</code>, and <code class="code">}}</code>. PXP can be configured such that these tokens trigger a user-defined add-on parser that reads directly from the character stream. See below <a href="Intro_events.html#escape"><i>Escape PXP parsing</i></a> for more. <p> We should also mention one basic limitation of event-oriented parsing: It is fundamentally incompatible with validation, as the tree view is required to validate a document. <p> <a name="links"></a> <h3>Links to other documentation</h3> <p> <ul> <li><a href="Pxp_types.html#TYPEevent"><code class="code"><span class="constructor">Pxp_types</span>.event</code></a> is the data type of events. 
Also explained below</li> <li><a href="Pxp_ev_parser.html"><code class="code"><span class="constructor">Pxp_ev_parser</span></code></a> is the module with parsing functions in event mode</li> <li><a href="Pxp_event.html"><code class="code"><span class="constructor">Pxp_event</span></code></a> is a module with helper functions for event mode, such as concatenation of event streams</li> <li><a href="Pxp_document.html#VALliquefy"><code class="code"><span class="constructor">Pxp_document</span>.liquefy</code></a> allows one to convert a tree into an event stream</li> <li><a href="Pxp_document.html#VALsolidify"><code class="code"><span class="constructor">Pxp_document</span>.solidify</code></a> allows one to convert an event stream into a tree</li> <li><a href="Intro_preprocessor.html#events"><i>Generating events: pxp_evlist and pxp_evpull</i></a> explains how to use the preprocessor to construct event streams</li> </ul> <a name="compat"></a> <h3>Compatibility</h3> <p> Event mode is compatible with: <p> <ul> <li>Well-formedness parsing</li> <li>Namespaces: Namespace processing works as outlined in <a href="Intro_namespaces.html"><code class="code"><span class="constructor">Intro_namespaces</span></code></a>, only that the user needs to interpret the namespace information contained in the events differently. See below <a href="Intro_events.html#namespaces"><i>Events and namespaces</i></a> for more.</li> <li>Reading from arbitrary sources as described in <a href="Intro_resolution.html"><code class="code"><span class="constructor">Intro_resolution</span></code></a></li> </ul> Event mode is incompatible with: <p> <ul> <li>Validation</li> </ul> <a name="structure"></a> <h2>The structure of event streams</h2> <p> First we describe how well-formed XML fragments are represented in stream format, i.e. XML text that is properly nested with respect to start tags and end tags. For a real text, the parser will also emit some wrapping. 
PXP distinguishes between documents and non-document entities. A document is a formally closed text that consists of one main entity (file) and optionally a number of referenced entities. One can parse a file as a document, and in this case the parser will add a wrapping suited for documents. Alternatively, one can parse an entity as a plain entity, and in this case the parser will add a wrapping suited for non-documents. Note that the XML declaration (<code class="code">&lt;?xml ... <span class="keywordsign">?&gt;</span></code>) for such non-document entities is slightly different, and that no <code class="code"><span class="constructor">DOCTYPE</span></code> clause is permitted. <p> <a name="wf"></a> <h3>The structure of well-formed XML fragments</h3> <p> The type of events is <a href="Pxp_types.html#TYPEevent"><code class="code"><span class="constructor">Pxp_types</span>.event</code></a>. The events do not strictly correspond to syntactical elements of XML, but more to a logical interpretation. <p> The parser emits events for <ul> <li><code class="code"><span class="constructor">E_char_data</span>(text)</code>: Character data - The parser emits character data events for sequences of characters. It is unspecified how long these sequences are. This means it is up to the parser how a contiguous section of characters is split up into one or more character data events, i.e. <b>adjacent character data events may be emitted by the parser.</b> Also, the parser makes no attempt to suppress whitespace of any kind. For example, the XML text <pre></pre><code class="code"> <span class="constructor">Hello</span> world </code><pre></pre> might lead to the emission of <pre></pre><code class="code"> [<span class="constructor">E_char_data</span> <span class="string">"Hello "</span>; <span class="constructor">E_char_data</span> <span class="string">"world"</span>] </code><pre></pre> but also to any other split into events. 
<p> </li> <li><code class="code"><span class="constructor">E_start_tag</span>(name,atts,scope_opt,entid)</code>: Start tags of elements - Includes everything within the angle brackets, i.e. <code class="code">name</code> and attribute list <code class="code">atts</code> (as name/value pairs). The event also includes the namespace scope <code class="code">scope_opt</code> if namespace processing is enabled (or <code class="code"><span class="constructor">None</span></code> otherwise), and it includes a reference <code class="code">entid</code> to the entity the tag occurs in. Note that the tag name and the attribute names are subject to prefix normalization if namespace processing is enabled. <p> </li> <li><code class="code"><span class="constructor">E_end_tag</span>(name,entid)</code>: End tags of elements - The event mentions the <code class="code">name</code>, and the entity <code class="code">entid</code> the tag occurs in. Both <code class="code">name</code> and <code class="code">entid</code> are always identical to the values attached to the corresponding start tag. <p> Note that the short form of an empty element, <code class="code">&lt;tag/&gt;</code>, is emitted as a start tag followed by an end tag. </li> <li><code class="code"><span class="constructor">E_pinstr</span>(name,value,entid)</code>: Processing instructions (PIs) - In tree mode, PIs can be represented in two ways: Either by attaching them to the surrounding elements, or by including them into the tree exactly where they occurred in the text. For symmetry, the same two ways of handling PIs are also present in the event stream representation (event streams and trees should be convertible into each other without data loss). Although there is only one event (<code class="code"><span class="constructor">E_pinstr</span></code>), the config option <code class="code">enable_pinstr_nodes</code> determines where this event is placed in the event stream. 
If the option is enabled, <code class="code"><span class="constructor">E_pinstr</span></code> is always emitted where the PI occurs in the XML text. If it is disabled, the emission of <code class="code"><span class="constructor">E_pinstr</span></code> may be delayed, but it is still guaranteed that this happens in the same context (surrounding element). It is not possible to turn the emission of PI events completely off. (See <a href="Intro_events.html#filters"><i>Filters</i></a> for an example how to filter out PI events in a postprocessing step.) <p> </li> <li><code class="code"><span class="constructor">E_comment</span> text</code>: Comments - If enabled (by <code class="code">enable_comment_nodes</code> in <a href="Pxp_types.html#TYPEconfig"><code class="code"><span class="constructor">Pxp_types</span>.config</code></a>), the parser emits comment events. <p> </li> <li><code class="code"><span class="constructor">E_start_super</span></code> and <code class="code"><span class="constructor">E_end_super</span></code>: Super root nodes - If enabled (by <code class="code">enable_super_root_node</code> in <a href="Pxp_types.html#TYPEconfig"><code class="code"><span class="constructor">Pxp_types</span>.config</code></a>), the parser emits a start event for the super root node at the beginning of the stream, and an end event at the end of the stream. This is comparable to an element embracing the whole text. <p> </li> <li><code class="code"><span class="constructor">E_position</span>(e,l,p)</code>: Position events - If enabled (by <code class="code">store_element_positions</code> in <a href="Pxp_types.html#TYPEconfig"><code class="code"><span class="constructor">Pxp_types</span>.config</code></a>), the parser emits special position events. These events refer to the immediately following event, and say from where in the XML text the following event originates. 
Position events are emitted before <code class="code"><span class="constructor">E_start_tag</span></code>, <code class="code"><span class="constructor">E_pinstr</span></code>, and <code class="code"><span class="constructor">E_comment</span></code>. The argument <code class="code">e</code> is a textual description of the entity. <code class="code">l</code> is the line. <code class="code">p</code> is the byte position of the character. </li> </ul> <p> As in the tree mode, entities are fully resolved, and do not appear in the parsed events. Also, syntactic elements like CDATA sections, the XML declaration, the DOCTYPE clause, and all elements only allowed in the DTD part are not represented. <p> Example for an event stream: The XML fragment <p> <pre></pre><code class="code"> &lt;p a1=<span class="string">"one"</span>&gt;&lt;q&gt;data1&lt;/q&gt;&lt;r&gt;data2&lt;/r&gt;&lt;s&gt;&lt;/s&gt;&lt;t/&gt;&lt;/p&gt;<br> </code><pre></pre> <p> could be represented as <p> <pre></pre><code class="code"> [ <span class="constructor">E_start_tag</span>(<span class="string">"p"</span>,[<span class="string">"a1"</span>,<span class="string">"one"</span>],<span class="constructor">None</span>,&lt;entid&gt;);<br> <span class="constructor">E_start_tag</span>(<span class="string">"q"</span>,[],<span class="constructor">None</span>,&lt;entid&gt;);<br> <span class="constructor">E_char_data</span> <span class="string">"data1"</span>;<br> <span class="constructor">E_end_tag</span>(<span class="string">"q"</span>,&lt;entid&gt;);<br> <span class="constructor">E_start_tag</span>(<span class="string">"r"</span>,[],<span class="constructor">None</span>,&lt;entid&gt;);<br> <span class="constructor">E_char_data</span> <span class="string">"data2"</span>;<br> <span class="constructor">E_end_tag</span>(<span class="string">"r"</span>,&lt;entid&gt;);<br> <span class="constructor">E_start_tag</span>(<span class="string">"s"</span>,[],<span class="constructor">None</span>,&lt;entid&gt;);<br> <span class="constructor">E_end_tag</span>(<span class="string">"s"</span>,&lt;entid&gt;);<br> <span 
class="constructor">E_start_tag</span>(<span class="string">"t"</span>,[],<span class="constructor">None</span>,&lt;entid&gt;);<br> <span class="constructor">E_end_tag</span>(<span class="string">"t"</span>,&lt;entid&gt;);<br> <span class="constructor">E_end_tag</span>(<span class="string">"p"</span>,&lt;entid&gt;);<br> ]<br> </code><pre></pre> <p> where <code class="code">&lt;entid&gt;</code> is the entity ID object. <p> <a name="nondocs"></a> <h3>The wrapping for non-document entities</h3> <p> The XML specification demands that external XML entities (that are referenced from a document entity or another external entity) comply with this grammar (excerpt from the W3C definition): <p> <pre></pre><code class="code">extParsedEnt ::= <span class="constructor">TextDecl</span>? content<br> <span class="constructor">TextDecl</span> ::= <span class="keywordsign">'</span>&lt;?xml' <span class="constructor">VersionInfo</span>? <span class="constructor">EncodingDecl</span> <span class="constructor">S</span>? <span class="keywordsign">'</span><span class="keywordsign">?&gt;</span><span class="keywordsign">'</span><br> content ::= (element <span class="keywordsign">|</span> <span class="constructor">CharData</span> <span class="keywordsign">|</span> <span class="constructor">Reference</span> <span class="keywordsign">|</span> <span class="constructor">CDSect</span> <span class="keywordsign">|</span> <span class="constructor">PI</span> <span class="keywordsign">|</span> <span class="constructor">Comment</span>)*<br> </code><pre></pre> <p> i.e. there can be an XML declaration at the beginning (always with an <code class="code">encoding</code> declaration), but the declaration is optional. It is followed by a sequence of elements, character data, processing instructions and comments (which are reflected by the events emitted by the parser), and by entity references and CDATA sections (which are already resolved by the parser). 
<p> The emitted events are now:<ul> <li>No event is emitted for the XML declaration</li> <li>The stream consists of the events for the <code class="code">content</code> production</li> <li>Finally, there is an <code class="code"><span class="constructor">E_end_of_stream</span></code> event.</li> </ul> When the parser detects an error, it stops the event stream, and emits a last <code class="code"><span class="constructor">E_error</span></code> event instead. <p> <a name="docs"></a> <h3>The wrapping for closed documents</h3> <p> Closed documents have to match this grammar (excerpt from the W3C definition): <p> <pre></pre><code class="code">document ::= prolog element <span class="constructor">Misc</span>*<br> prolog ::= <span class="constructor">XMLDecl</span>? <span class="constructor">Misc</span>* (doctypedecl <span class="constructor">Misc</span>*)?<br> <span class="constructor">XMLDecl</span> ::= <span class="keywordsign">'</span><?xml' <span class="constructor">VersionInfo</span> <span class="constructor">EncodingDecl</span>? <span class="constructor">SDDecl</span>? <span class="constructor">S</span>? <span class="keywordsign">'</span><span class="keywordsign">?></span><span class="keywordsign">'</span><br> </code><pre></pre> <p> That means there can be an XML declaration at the beginning (always with a <code class="code"><span class="constructor">VersionInfo</span></code> declaration), but the declaration is optional. There can be a <code class="code"><span class="constructor">DOCTYPE</span></code> declaration. Finally, there must be a single element. The production <code class="code"><span class="constructor">Misc</span></code> stands for a comment, a processing instruction, or whitespace. <p> The emitted events are now:<ul> <li><code class="code"><span class="constructor">E_start_doc</span>(version,dtd)</code> is always emitted at the beginning. 
The <code class="code">version</code> string is from <code class="code"><span class="constructor">VersionInfo</span></code>, or "1.0" if the whole XML declaration is missing. The <code class="code">dtd</code> object may contain the declarations of the parsed <code class="code"><span class="constructor">DOCTYPE</span></code> clause. However, by setting parsing parameters it is possible to control which declarations are added to the <code class="code">dtd</code> object.</li> <li>If <code class="code">enable_super_root_node</code>: <code class="code"><span class="constructor">E_start_super</span></code></li> <li>If there are comments or processing instructions before the topmost element, and the node type is enabled, these events are now emitted.</li> <li>Now the events of the topmost <code class="code">element</code> follow.</li> <li>If there are comments or processing instructions after the topmost element, and the node type is enabled, these events are now emitted.</li> <li>If <code class="code">enable_super_root_node</code>: <code class="code"><span class="constructor">E_end_super</span></code></li> <li><code class="code"><span class="constructor">E_end_doc</span> name</code>: ends the document. The <code class="code">name</code> is the literal name of the topmost element, without any prefix normalization even if namespace processing is enabled</li> <li>Finally, there is an <code class="code"><span class="constructor">E_end_of_stream</span></code> event.</li> </ul> When the parser detects an error, it stops the event stream, and emits a last <code class="code"><span class="constructor">E_error</span></code> event instead. <p> <a name="calling"></a> <h2>Calling the parser in event mode</h2> <p> The parser returns the emitted events while it is parsing. 
There are two models for that: <p> <ul> <li>Push parsing: The caller passes a callback function to the parser, and whenever the parser emits an event, this function is invoked</li> <li>Pull parsing: The parser runs as a coroutine together with the caller. The invocation of the parser returns the pull function. The caller now repeatedly invokes the pull function to get the emitted events until the end of the stream is indicated.</li> </ul> Let's look at both models in detail by giving an example. There is some code that is needed in both push and pull parsing. This example is similar to the examples given in <a href="Intro_getting_started.html"><code class="code"><span class="constructor">Intro_getting_started</span></code></a>. First we need a <a href="Pxp_types.html#TYPEsource"><code class="code"><span class="constructor">Pxp_types</span>.source</code></a> that says from where the input to parse comes. Second, we need an entity manager (of the opaque PXP type <code class="code"><span class="constructor">Pxp_entity_manager</span>.entity_manager</code>). The entity manager is a device that controls the source and switches between the entities to parse (if such switches are necessary). The entity manager is visible to the caller in event mode - in tree mode it is also needed but hidden in the parser driver. <p> <pre></pre><code class="code"><span class="keyword">let</span> config = <span class="constructor">Pxp_types</span>.default_config<br> <span class="keyword">let</span> source = <span class="constructor">Pxp_types</span>.from_file <span class="string">"filename.xml"</span><br> <span class="keyword">let</span> entmng = <span class="constructor">Pxp_ev_parser</span>.create_entity_manager config source<br> </code><pre></pre> <p> (See also: <a href="Pxp_ev_parser.html#VALcreate_entity_manager"><code class="code"><span class="constructor">Pxp_ev_parser</span>.create_entity_manager</code></a>.) <p> From here on, the required code differs in both parsing modes. 
<p> <a name="push"></a> <h3>Push parsing</h3> <p> The function <a href="Pxp_ev_parser.html#VALprocess_entity"><code class="code"><span class="constructor">Pxp_ev_parser</span>.process_entity</code></a> invokes the parser in push mode: <p> <pre></pre><code class="code"><span class="keyword">let</span> () = <span class="constructor">Pxp_ev_parser</span>.process_entity config entry entmng (<span class="keyword">fun</span> ev <span class="keywordsign">-></span> ...)<br> </code><pre></pre> <p> The callback function is here shown as <code class="code">(<span class="keyword">fun</span> ev <span class="keywordsign">-></span> ...)</code>. It is called back for every emitted event <code class="code">ev</code> (of type <a href="Pxp_types.html#TYPEevent"><code class="code"><span class="constructor">Pxp_types</span>.event</code></a>). It is ensured that the last emitted event is either <code class="code"><span class="constructor">E_end_of_stream</span></code> or <code class="code"><span class="constructor">E_error</span></code>. See the documentation of <a href="Pxp_ev_parser.html#VALprocess_entity"><code class="code"><span class="constructor">Pxp_ev_parser</span>.process_entity</code></a> for details about error handling. <p> The parameter <code class="code">entry</code> (of type <a href="Pxp_types.html#TYPEentry"><code class="code"><span class="constructor">Pxp_types</span>.entry</code></a>) determines the entry point in the XML grammar. Essentially, it says what kind of thing to parse. Most users will want to pass <code class="code"><span class="keywordsign">`</span><span class="constructor">Entry_document</span></code> here to parse a closed document. Note that the emitted event stream includes the wrapping for documents as described in <a href="Intro_events.html#docs"><i>The wrapping for closed documents</i></a>. 
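<p> As an illustration of what a push callback might do, here is a minimal sketch that counts elements and tracks the maximum nesting depth. To stay runnable without PXP installed, it uses a simplified stand-in for the event type (the scope and entity ID arguments are left out) and a hypothetical driver <code class="code">push_all</code> in place of the parser. With PXP, a callback of this shape would be passed as the last argument of <code class="code">process_entity</code>, with the patterns adapted to the real constructors.

```ocaml
(* Simplified stand-in for Pxp_types.event: only the constructors used here,
   without the namespace scope and entity ID arguments. This is an assumption
   made to keep the sketch self-contained. *)
type event =
  | E_start_tag of string * (string * string) list
  | E_end_tag of string
  | E_char_data of string
  | E_end_of_stream

(* State updated by the push callback. *)
let depth = ref 0
let max_depth = ref 0
let elements = ref 0

(* The callback: invoked once for every emitted event. *)
let on_event ev =
  match ev with
  | E_start_tag (_, _) ->
      incr elements;
      incr depth;
      if !depth > !max_depth then max_depth := !depth
  | E_end_tag _ -> decr depth
  | E_char_data _ | E_end_of_stream -> ()

(* Hypothetical driver standing in for the parser: it pushes a fixed list of
   events into the callback, one by one. *)
let push_all callback events = List.iter callback events

let () =
  push_all on_event
    [ E_start_tag ("p", [ ("a1", "one") ]);
      E_start_tag ("q", []);
      E_char_data "data1";
      E_end_tag "q";
      E_start_tag ("t", []);
      E_end_tag "t";
      E_end_tag "p";
      E_end_of_stream ];
  Printf.printf "elements=%d max_depth=%d\n" !elements !max_depth
```

<p> Because the parser guarantees properly nested tags, the depth counter is back at zero when the end of the stream is reached.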
<p> The entry point <code class="code"><span class="keywordsign">`</span><span class="constructor">Entry_content</span></code> is for non-document external entities, as described in <a href="Intro_events.html#nondocs"><i>The wrapping for non-document entities</i></a>. There is a similar entry point, <code class="code"><span class="keywordsign">`</span><span class="constructor">Entry_element_content</span></code>, which additionally enforces some constraints on the node structure. In particular, there must be a single top-level element so that the enforced node structure looks like a document. We do not recommend to use <code class="code"><span class="keywordsign">`</span><span class="constructor">Entry_element_content</span></code> - rather use <code class="code"><span class="keywordsign">`</span><span class="constructor">Entry_document</span></code>, and remove the document wrapping in a postprocessing step. <p> The entry point <code class="code"><span class="keywordsign">`</span><span class="constructor">Entry_expr</span></code> reads a single node (see <a href="Pxp_types.html#TYPEentry"><code class="code"><span class="constructor">Pxp_types</span>.entry</code></a> for details). It is recommended to use <a href="Pxp_ev_parser.html#VALprocess_expr"><code class="code"><span class="constructor">Pxp_ev_parser</span>.process_expr</code></a> instead of <a href="Pxp_ev_parser.html#VALprocess_entity"><code class="code"><span class="constructor">Pxp_ev_parser</span>.process_entity</code></a> together with this entry point, as this allows to start and end parsing within an entity, instead of having to parse an entity as a whole. (This is intended for special applications only.) <p> The entry point <code class="code"><span class="keywordsign">`</span><span class="constructor">Entry_declarations</span></code> is currently unused. 
<p> <b>Flags for <code class="code"><span class="keywordsign">`</span><span class="constructor">Entry_document</span></code>.</b> This entry point takes some flags as arguments that determine some details. It is usually ok to just pass the empty list of flags, i.e. <code class="code"><span class="keywordsign">`</span><span class="constructor">Entry_document</span> []</code>. The flags may enable some validation checks, or at least configure that some data is stored in the DTD object so that it is available for a later validation pass. Remember that the event mode by itself can only do well-formedness parsing. It can be reasonable, however, to enable flags when the event stream is later validated by some other means (e.g. by converting it into a tree and validating it). <p> <a name="pull"></a> <h3>Pull parsing</h3> <p> The pull parser is created by <a href="Pxp_ev_parser.html#VALcreate_pull_parser"><code class="code"><span class="constructor">Pxp_ev_parser</span>.create_pull_parser</code></a> like: <p> <pre></pre><code class="code"><span class="keyword">let</span> pull = create_pull_parser config entry entmng<br> </code><pre></pre> <p> The arguments <code class="code">config</code>, <code class="code">entry</code>, and <code class="code">entmng</code> have the same meaning as for the push parser. In the case of the pull parser, however, no callback function is passed by the user. Instead, the return value <code class="code">pull</code> is a function one can call to "pull" the events out of the parser engine. The <code class="code">pull</code> function returns <code class="code"><span class="constructor">Some</span> ev</code> where <code class="code">ev</code> is the event of type <a href="Pxp_types.html#TYPEevent"><code class="code"><span class="constructor">Pxp_types</span>.event</code></a>. After the end of the stream is reached, the function returns <code class="code"><span class="constructor">None</span></code>. 
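<p> The typical way to consume a pull function is a loop that calls it until <code class="code"><span class="constructor">None</span></code> is returned. The following sketch shows such a loop as a fold. It is written against the generic type <code class="code">unit -> 'a option</code>, so it does not depend on PXP. The helpers <code class="code">pull_of_list</code>, <code class="code">fold_pull</code> and <code class="code">count_events</code> are hypothetical names introduced for this example, and the list-based pull function merely stands in for the one returned by the pull parser.

```ocaml
(* Hypothetical helper: turn a list into a pull function of type
   unit -> 'a option. It stands in for the function returned by the
   pull parser. *)
let pull_of_list items =
  let rest = ref items in
  fun () ->
    match !rest with
    | [] -> None
    | x :: tl -> rest := tl; Some x

(* Call pull until it returns None, folding every event into an
   accumulator. *)
let fold_pull f init pull =
  let rec loop acc =
    match pull () with
    | None -> acc
    | Some ev -> loop (f acc ev)
  in
  loop init

(* Example consumer: count the events in a stream. *)
let count_events pull = fold_pull (fun n _ -> n + 1) 0 pull

let () =
  let pull = pull_of_list [ "start p"; "data"; "end p" ] in
  let n = fold_pull (fun n ev -> print_endline ev; n + 1) 0 pull in
  Printf.printf "%d events\n" n
```

<p> Since the loop only stops at <code class="code"><span class="constructor">None</span></code>, it drains the stream completely, which is what the parser expects from its caller.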
<p> Essentially, the parser works like an engine that can be started and stopped. When the <code class="code">pull</code> function is invoked, the parser engine is "turned on", and runs for a while until (at least) the next event is available. Then, the engine is stopped again, and the event is returned. The engine keeps its state between invocations of <code class="code">pull</code> so that the parser continues exactly at the point where it stopped the last time. <p> Note that files and other resources of the operating system are kept open while parsing is in progress. The user is expected to continue calling <code class="code">pull</code> until the end of the stream is reached (at least until <code class="code"><span class="constructor">Some</span> <span class="constructor">E_end_of_stream</span></code>, <code class="code"><span class="constructor">Some</span> <span class="constructor">E_error</span></code>, or <code class="code"><span class="constructor">None</span></code> is returned by <code class="code">pull</code>). See the description of <a href="Pxp_ev_parser.html#VALclose_entities"><code class="code"><span class="constructor">Pxp_ev_parser</span>.close_entities</code></a> for a way of prematurely closing the parser for the exceptional cases where parsing cannot go on until the final parser state is reached. <p> <a name="3_Preprocessor"></a> <h3>Preprocessor</h3> <p> The PXP preprocessor (see <a href="Intro_preprocessor.html"><code class="code"><span class="constructor">Intro_preprocessor</span></code></a>) allows one to create event streams programmatically. One can get the events either as a list (type <a href="Pxp_types.html#TYPEevent"><code class="code"><span class="constructor">Pxp_types</span>.event</code></a><code class="code"> list</code>), or in a form compatible with pull parsing. 
For example, <p> <pre></pre><code class="code"><span class="keyword">let</span> book_list = <br> &lt;:pxp_evlist&lt; <br> &lt;book&gt;<br> [ &lt;title&gt;[ <span class="string">"The Lord of The Rings"</span> ]<br> &lt;author&gt;[ <span class="string">"J.R.R. Tolkien"</span> ]<br> ]<br> &gt;&gt;<br> </code><pre></pre> <p> returns the events as a <a href="Pxp_types.html#TYPEevent"><code class="code"><span class="constructor">Pxp_types</span>.event</code></a><code class="code"> list</code> whereas <p> <pre></pre><code class="code"><span class="keyword">let</span> pull_book = <br> &lt;:pxp_evpull&lt; <br> &lt;book&gt;<br> [ &lt;title&gt;[ <span class="string">"The Lord of The Rings"</span> ]<br> &lt;author&gt;[ <span class="string">"J.R.R. Tolkien"</span> ]<br> ]<br> &gt;&gt;<br> </code><pre></pre> <p> defines <code class="code">pull_book</code> as an automaton from which one can pull the events like from a pull parser, i.e. <code class="code">pull_book</code> is of type <code class="code">unit<span class="keywordsign">-&gt;</span></code><a href="Pxp_types.html#TYPEevent"><code class="code"><span class="constructor">Pxp_types</span>.event</code></a><code class="code"> option</code>, and by calling it one can get the events one after the other. <code class="code">pull_book</code> has the same type as the pull function returned by the pull parser. <p> For a more complete discussion see <a href="Intro_preprocessor.html#events"><i>Generating events: pxp_evlist and pxp_evpull</i></a>. <p> Note that the preprocessor does not add any wrapping for documents or non-documents to the event stream. See <a href="Intro_preprocessor.html#documents"><i>Documents</i></a> for an example of how to add such a wrapping in a postprocessing step in user code. <p> <a name="3_Pushorpull"></a> <h3>Push or pull?</h3> <p> The question arises whether one should prefer the push or the pull model. 
Generally, it is easy to turn a pull parser into a push parser by adding a loop that repeatedly invokes <code class="code">pull</code> to get the events, and then calls the push function to deliver each event. There is no such possibility the other way round, i.e. one cannot take a push parser and make it look like a pull parser by wrapping it into some interface adapter - at least not in a language like O'Caml that does not know coroutines or continuations as language elements. Effectively, the pull model is the more general one. <p> The function <a href="Pxp_event.html#VALiter"><code class="code"><span class="constructor">Pxp_event</span>.iter</code></a> can be used to turn a pull parser into a push parser: <p> <pre></pre><code class="code"><span class="constructor">Pxp_event</span>.iter push pull<br> </code><pre></pre> <p> The events <code class="code">pull</code>-ed out of the parser engine are delivered one by one to the receiver by invoking <code class="code">push</code>. <p> In PXP, the pull model is preferred, and a number of helper functions are only available for the pull model. If you need a push-stream nevertheless, it is recommended to use the pull parser, and to do all required transformations on it (like filtering, see below). Finally use <a href="Pxp_event.html#VALiter"><code class="code"><span class="constructor">Pxp_event</span>.iter</code></a> to turn the pull stream into a push-compatible stream. <p> <a name="filters"></a> <h2>Filters</h2> <p> Filters are a way to transform event streams (as defined for pull parsers). 
For example, one can remove the processing instruction events by doing (given that <code class="code">pull</code> is the original parser, and we define now a modified <code class="code">pull'</code> for the transformed stream): <p> <pre></pre><code class="code"><span class="keyword">let</span> pull' = <span class="constructor">Pxp_event</span>.pfilter<br> (<span class="keyword">function</span><br> <span class="keywordsign">|</span> <span class="constructor">E_pinstr</span>(_,_,_) <span class="keywordsign">-></span> <span class="keyword">false</span><br> <span class="keywordsign">|</span> _ <span class="keywordsign">-></span> <span class="keyword">true</span><br> )<br> pull<br> </code><pre></pre> <p> When events are read from <code class="code">pull'</code>, the events are also read from <code class="code">pull</code>, but all processing instruction events are suppressed. <a href="Pxp_event.html#VALpfilter"><code class="code"><span class="constructor">Pxp_event</span>.pfilter</code></a> works a lot like <code class="code"><span class="constructor">List</span>.filter</code> - it only keeps the events in the stream for which a predicate function returns <code class="code"><span class="keyword">true</span></code>. 
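<p> The lazy behaviour described above can be illustrated without PXP. The following is only a minimal sketch: <code class="code">ev</code> is a simplified stand-in for <code class="code"><span class="constructor">Pxp_types</span>.event</code>, and <code class="code">of_list</code> and <code class="code">my_pfilter</code> are hypothetical names that are not part of PXP. It shows a predicate filter that reads from the underlying pull function only when the caller asks for the next event.
<pre><code class="code">(* Simplified stand-in for Pxp_types.event; NOT the real PXP type. *)
type ev =
  | Start_tag of string
  | End_tag of string
  | Char_data of string
  | Pinstr of string
  | End_of_stream

(* Hypothetical helper: turn a list of events into a pull function. *)
let of_list (l : ev list) : unit -> ev option =
  let rest = ref l in
  fun () ->
    match !rest with
    | [] -> None
    | e :: tl -> rest := tl; Some e

(* Hypothetical predicate filter with the same shape as a pull filter:
   events are pulled from the source until one satisfies [p]. Nothing is
   read from the source before the caller asks for it. *)
let my_pfilter (p : ev -> bool) (pull : unit -> ev option) : unit -> ev option =
  let rec next () =
    match pull () with
    | Some e when not (p e) -> next ()   (* suppressed: try the next event *)
    | other -> other                     (* kept event, or None at the end *)
  in
  next

let pull =
  of_list [ Start_tag "a"; Pinstr "x"; Char_data "t"; End_tag "a"; End_of_stream ]

(* Drop the processing instruction, keep everything else. *)
let pull' = my_pfilter (function Pinstr _ -> false | _ -> true) pull
</code></pre>
<p> Calling <code class="code">pull' ()</code> repeatedly yields the start tag, the character data, the end tag and the end-of-stream marker; the <code class="code"><span class="constructor">Pinstr</span></code> event is skipped.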
<p> <a name="3_Normalizingcharacterdataevents"></a> <h3>Normalizing character data events</h3> <p> <a href="Pxp_event.html#VALnorm_cdata_filter"><code class="code"><span class="constructor">Pxp_event</span>.norm_cdata_filter</code></a> is a special predefined filter that transforms <code class="code"><span class="constructor">E_char_data</span></code> events so that<ul> <li>empty <code class="code"><span class="constructor">E_char_data</span></code> events are removed</li> <li>adjacent <code class="code"><span class="constructor">E_char_data</span></code> events are concatenated and replaced by a single <code class="code"><span class="constructor">E_char_data</span></code> event</li> </ul> The filter is simply called by <p> <pre></pre><code class="code"><span class="keyword">let</span> pull' = <span class="constructor">Pxp_event</span>.norm_cdata_filter pull<br> </code><pre></pre> <p> <a name="3_Removingignorablewhitespace"></a> <h3>Removing ignorable whitespace</h3> <p> In validation mode, the DTD may specify ignorable whitespace. This is whitespace that is known to exist only to make the XML tree more readable (indentation etc.). In tree mode, ignorable whitespace is removed by default (see <code class="code">drop_ignorable_whitespace</code> in <a href="Pxp_types.html#TYPEconfig"><code class="code"><span class="constructor">Pxp_types</span>.config</code></a>). <p> It is possible to clean up the event stream in this way - although the event mode is not capable of doing a full validation of the XML document. It is required, however, that all declarations are added to the DTD object. This is done by setting one of the flags <code class="code"><span class="keywordsign">`</span><span class="constructor">Extend_dtd_fully</span></code> or <code class="code"><span class="keywordsign">`</span><span class="constructor">Val_mode_dtd</span></code> in the entry point, e.g. 
use <p> <pre></pre><code class="code"><span class="keyword">let</span> entry = <span class="keywordsign">`</span><span class="constructor">Entry_document</span> [<span class="keywordsign">`</span><span class="constructor">Extend_dtd_fully</span>]<br> </code><pre></pre> <p> when you create the pull parser. The declarations of the XML elements are needed to check whether whitespace can be dropped. <p> The filter function is <a href="Pxp_event.html#VALdrop_ignorable_whitespace_filter"><code class="code"><span class="constructor">Pxp_event</span>.drop_ignorable_whitespace_filter</code></a>. Use it like <p> <pre></pre><code class="code"><span class="keyword">let</span> pull' = <span class="constructor">Pxp_event</span>.drop_ignorable_whitespace_filter pull<br> </code><pre></pre> <p> This filter does two things:<ul> <li>it checks whether non-whitespace character data is used in forbidden places, e.g. as children of an element that is declared with a regular expression content model</li> <li>it removes <code class="code"><span class="constructor">E_char_data</span></code> events only consisting of whitespace when they are ignorable.</li> </ul> The stream remains normalized if it was already normalized, i.e. you can use this filter before or after <a href="Pxp_event.html#VALnorm_cdata_filter"><code class="code"><span class="constructor">Pxp_event</span>.norm_cdata_filter</code></a>. <p> <a name="3_Unwrappingdocuments"></a> <h3>Unwrapping documents</h3> <p> Sometimes it is necessary to get rid of the document wrapping. The filter <a href="Pxp_event.html#VALunwrap_document"><code class="code"><span class="constructor">Pxp_event</span>.unwrap_document</code></a> can do this. 
Call it like: <p> <pre></pre><code class="code"><span class="keyword">let</span> get_doc_details, pull' = <span class="constructor">Pxp_event</span>.unwrap_document pull<br> </code><pre></pre> <p> The filter removes all <code class="code"><span class="constructor">E_start_doc</span></code>, <code class="code"><span class="constructor">E_end_doc</span></code>, <code class="code"><span class="constructor">E_start_super</span></code>, <code class="code"><span class="constructor">E_end_super</span></code>, and <code class="code"><span class="constructor">E_end_of_stream</span></code> events. Also, when an <code class="code"><span class="constructor">E_error</span></code> event is encountered, the attached exception is raised. The information attached to the removed <code class="code"><span class="constructor">E_start_doc</span></code> event can be retrieved by calling <code class="code">get_doc_details</code>: <p> <pre></pre><code class="code"><span class="keyword">let</span> xml_version, dtd = get_doc_details()<br> </code><pre></pre> <p> Note that this call will fail if there is no <code class="code"><span class="constructor">E_start_doc</span></code>, and it can fail if it is not at the expected position in the stream. If you parse with the entry <code class="code"><span class="keywordsign">`</span><span class="constructor">Entry_document</span></code>, this cannot happen, though. <p> It is allowed to call <code class="code">get_doc_details</code> before using <code class="code">pull'</code>. <p> <a name="3_Chainingfilters"></a> <h3>Chaining filters</h3> <p> It is allowed to chain filters, e.g. 
<p> <pre></pre><code class="code"><span class="keyword">let</span> pull1 = <span class="constructor">Pxp_event</span>.drop_ignorable_whitespace_filter pull<br> <span class="keyword">let</span> pull2 = <span class="constructor">Pxp_event</span>.norm_cdata_filter pull1<br> </code><pre></pre> <p> <a name="3_Otherhelperfunctions"></a> <h3>Other helper functions</h3> <p> In <a href="Pxp_event.html"><code class="code"><span class="constructor">Pxp_event</span></code></a> there are also other helper functions besides filters. These functions cover:<ul> <li>conversion of pull streams to and from lists</li> <li>concatenation of pull streams</li> <li>extraction of nodes from pull streams</li> <li>printing of pull streams</li> <li>splitting of namespace names</li> </ul> <a name="namespaces"></a> <h2>Events and namespaces</h2> <p> Namespace processing can also be enabled in event mode. This means that prefix normalization is applied to all names of elements and attributes. For example, this piece of code parses a file in event mode with enabled namespace processing: <p> <pre></pre><code class="code"> <span class="keyword">let</span> nsmng = <span class="constructor">Pxp_dtd</span>.create_namespace_manager()<br> <span class="keyword">let</span> config = <br> { <span class="constructor">Pxp_types</span>.default_config <span class="keyword">with</span><br> enable_namespace_processing = <span class="constructor">Some</span> nsmng<br> }<br> <span class="keyword">let</span> source = ...<br> <span class="keyword">let</span> entmng = <span class="constructor">Pxp_ev_parser</span>.create_entity_manager config source<br> <span class="keyword">let</span> pull = <span class="constructor">Pxp_ev_parser</span>.create_pull_parser config entry entmng<br> </code><pre></pre> <p> The names returned in <code class="code"><span class="constructor">E_start_tag</span>(name,attlist,scope_opt,entid)</code> are prefix-normalized, i.e. <code class="code">name</code> and the attribute names in <code class="code">attlist</code>. 
The functions <a href="Pxp_event.html#VALnamespace_split"><code class="code"><span class="constructor">Pxp_event</span>.namespace_split</code></a> and <a href="Pxp_event.html#VALextract_prefix"><code class="code"><span class="constructor">Pxp_event</span>.extract_prefix</code></a> can be useful to analyze the names. For example, to get the namespace URI of an element name, do: <p> <pre></pre><code class="code"> <span class="keyword">match</span> ev <span class="keyword">with</span><br> <span class="keywordsign">|</span> <span class="constructor">Pxp_types</span>.<span class="constructor">E_start_tag</span>(name,_,_,_) <span class="keywordsign">-></span><br> <span class="keyword">let</span> prefix = <span class="constructor">Pxp_event</span>.extract_prefix name <span class="keyword">in</span><br> <span class="keyword">let</span> uri = nsmng <span class="keywordsign">#</span> get_primary_uri prefix <span class="keyword">in</span><br> ...<br> </code><pre></pre> <p> Note that this may raise the exception <code class="code"><span class="constructor">Namespace_prefix_not_managed</span></code> if the prefix is unknown or empty. <p> When namespace processing is enabled, the namespace scopes are included in the <code class="code"><span class="constructor">E_start_tag</span></code> events. 
This can be used to get the display (original) prefix: <p> <pre></pre><code class="code"> <span class="keyword">match</span> ev <span class="keyword">with</span><br> <span class="keywordsign">|</span> <span class="constructor">Pxp_types</span>.<span class="constructor">E_start_tag</span>(name,_,<span class="constructor">Some</span> scope,_) <span class="keywordsign">-></span><br> <span class="keyword">let</span> prefix = <span class="constructor">Pxp_event</span>.extract_prefix name <span class="keyword">in</span><br> <span class="keyword">let</span> dsp_prefix = scope <span class="keywordsign">#</span> display_prefix_of_normprefix prefix <span class="keyword">in</span><br> ...<br> </code><pre></pre> <p> Note that this may raise the exception <code class="code"><span class="constructor">Namespace_prefix_not_managed</span></code> if the prefix is unknown or empty, or <code class="code"><span class="constructor">Namespace_not_in_scope</span></code> if the prefix is not declared for this part of the XML text. <p> <a name="2_ExamplePrinttheeventswhileparsing"></a> <h2>Example: Print the events while parsing</h2> <p> The following piece of code parses an XML file in event mode, and prints the events. The reader is encouraged to modify the code by e.g. adding filters, to see the effect. 
<p> <pre></pre><code class="code"> <span class="keyword">let</span> config = <span class="constructor">Pxp_types</span>.default_config<br> <span class="keyword">let</span> entry = <span class="keywordsign">`</span><span class="constructor">Entry_document</span> []<br> <span class="keyword">let</span> source = <span class="constructor">Pxp_types</span>.from_file <span class="string">"filename.xml"</span><br> <span class="keyword">let</span> entmng = <span class="constructor">Pxp_ev_parser</span>.create_entity_manager config source<br> <span class="keyword">let</span> pull = <span class="constructor">Pxp_ev_parser</span>.create_pull_parser config entry entmng<br> <span class="keyword">let</span> () = <span class="constructor">Pxp_event</span>.iter<br> (<span class="keyword">fun</span> ev <span class="keywordsign">-></span> print_endline (<span class="constructor">Pxp_event</span>.string_of_event ev))<br> pull<br> </code><pre></pre> <p> <a name="recdesc"></a> <h2>Connect PXP with a recursive-descent parser</h2> <p> We assume here that a list of integers like <p> <pre></pre><code class="code"> 43 :: 44 :: []<br> </code><pre></pre> <p> is represented in XML as <p> <pre></pre><code class="code"> <list><cons><int>43</int><cons><int>44</int><nil/></cons></cons></list><br> </code><pre></pre> <p> i.e. we have<ul> <li><code class="code">list</code> indicates that the single child is a list</li> <li><code class="code">cons</code> has two children: the first is the head of the list, and the second the tail (think <code class="code">head :: tail</code> in O'Caml)</li> <li><code class="code">nil</code> is the empty list</li> <li><code class="code">int</code> is an integer member of the list</li> </ul> We want to parse such XML texts by using the event-oriented parser, and combine it with a recursive-descent grammar. The XML parser delivers events which are taken as the tokens of the second parser. 
<p> <pre></pre><code class="code"><span class="keyword">let</span> parse_list (s:string) =<br> <br> <span class="keyword">let</span> <span class="keyword">rec</span> parse_whole_list stream =<br> <span class="comment">(* Production:<br> whole_list ::= "<list>" sub_list "</list>" END<br> *)</span><br> <span class="keyword">match</span> stream <span class="keyword">with</span> <span class="keyword">parser</span><br> [< <span class="keywordsign">'</span><span class="constructor">E_start_tag</span>(<span class="string">"list"</span>,_,_,_);<br> l = parse_sub_list;<br> <span class="keywordsign">'</span><span class="constructor">E_end_tag</span>(<span class="string">"list"</span>,_);<br> <span class="keywordsign">'</span><span class="constructor">E_end_of_stream</span>;<br> >] <span class="keywordsign">-></span><br> l<br> <br> <span class="keyword">and</span> parse_sub_list stream =<br> <span class="comment">(* Production:<br> sub_list ::= "<cons>" object sub_list "</cons>"<br> | "<nil>" "</nil>"<br> *)</span><br> <span class="keyword">match</span> stream <span class="keyword">with</span> <span class="keyword">parser</span><br> [< <span class="keywordsign">'</span><span class="constructor">E_start_tag</span>(<span class="string">"cons"</span>,_,_,_); <br> head = parse_object;<br> tail = parse_sub_list;<br> <span class="keywordsign">'</span><span class="constructor">E_end_tag</span>(<span class="string">"cons"</span>,_)<br> >] <span class="keywordsign">-></span><br> head :: tail<br> <br> <span class="keywordsign">|</span> [< <span class="keywordsign">'</span><span class="constructor">E_start_tag</span>(<span class="string">"nil"</span>,_,_,_); <span class="keywordsign">'</span><span class="constructor">E_end_tag</span>(<span class="string">"nil"</span>,_) >] <span class="keywordsign">-></span><br> []<br> <br> <span class="keyword">and</span> parse_object stream =<br> <span class="comment">(* Production:<br> object ::= "<int>" text "</int>"<br> with constraint that text is 
an integer parsable by int_of_string<br> *)</span><br> <span class="keyword">match</span> stream <span class="keyword">with</span> <span class="keyword">parser</span><br> [< <span class="keywordsign">'</span><span class="constructor">E_start_tag</span>(<span class="string">"int"</span>,_,_,_);<br> number = parse_text;<br> <span class="keywordsign">'</span><span class="constructor">E_end_tag</span>(<span class="string">"int"</span>,_)<br> >] <span class="keywordsign">-></span><br> int_of_string number<br> <br> <span class="keyword">and</span> parse_text stream =<br> <span class="comment">(* Production.<br> text ::= "any XML character data"<br> *)</span><br> <span class="keyword">match</span> stream <span class="keyword">with</span> <span class="keyword">parser</span><br> [< <span class="keywordsign">'</span><span class="constructor">E_char_data</span> data;<br> rest = parse_text<br> >] <span class="keywordsign">-></span><br> data ^ rest<br> <span class="keywordsign">|</span> [< >] <span class="keywordsign">-></span><br> <span class="string">""</span><br> <span class="keyword">in</span><br> <br> <span class="keyword">let</span> config = <br> { <span class="constructor">Pxp_types</span>.default_config <span class="keyword">with</span><br> store_element_positions = <span class="keyword">false</span>;<br> <span class="comment">(* don't produce E_position events *)</span><br> }<br> <span class="keyword">in</span><br> <span class="keyword">let</span> mgr = <br> <span class="constructor">Pxp_ev_parser</span>.create_entity_manager<br> config<br> (<span class="constructor">Pxp_types</span>.from_string s) <span class="keyword">in</span><br> <span class="keyword">let</span> pull = <br> <span class="constructor">Pxp_ev_parser</span>.create_pull_parser config (<span class="keywordsign">`</span><span class="constructor">Entry_content</span>[]) mgr <span class="keyword">in</span><br> <span class="keyword">let</span> pull' =<br> <span 
class="constructor">Pxp_event</span>.norm_cdata_filter pull <span class="keyword">in</span><br> <span class="keyword">let</span> next_event_or_error _ =<br> <span class="comment">(* Stream.from passes a counter; it is ignored here *)</span><br> <span class="keyword">let</span> e = pull' () <span class="keyword">in</span><br> <span class="keyword">match</span> e <span class="keyword">with</span><br> <span class="constructor">Some</span>(<span class="constructor">E_error</span> exn) <span class="keywordsign">-></span> raise exn<br> <span class="keywordsign">|</span> _ <span class="keywordsign">-></span> e<br> <span class="keyword">in</span><br> <span class="keyword">let</span> stream =<br> <span class="constructor">Stream</span>.from next_event_or_error <span class="keyword">in</span><br> parse_whole_list stream<br> </code><pre></pre> <p> The trick is to use <code class="code"><span class="constructor">Stream</span>.from</code> to convert the "pull-style" event stream into a <code class="code"><span class="constructor">Stream</span>.t</code>. This kind of stream can be parsed in a recursive-descent way by using the stream parser capability built into O'Caml. <p> Note that we normalize the character data nodes. The grammar can only process a single <code class="code"><span class="constructor">E_char_data</span></code> event, and this normalization ensures that adjacent <code class="code"><span class="constructor">E_char_data</span></code> events are merged. <p> Note that you have to enable camlp4 when compiling this example, because the stream parsers are only available via camlp4. 
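<p> If camlp4 is not available, the same grammar could also be written by hand with one event of lookahead on top of a pull function. The following is only a sketch under stated assumptions: <code class="code">ev</code> is a simplified stand-in for <code class="code"><span class="constructor">Pxp_types</span>.event</code>, and <code class="code">pull_of_list</code>, <code class="code">peek</code>, <code class="code">junk</code> and <code class="code">expect</code> are hypothetical helpers, not PXP functions. With real PXP events the patterns would have to match the actual constructors and their arguments.
<pre><code class="code">(* Simplified stand-in for the PXP events used by the grammar. *)
type ev =
  | Start_tag of string
  | End_tag of string
  | Char_data of string
  | End_of_stream

exception Parse_error of string

(* Hypothetical helper: a pull function over a fixed list of events. *)
let pull_of_list (l : ev list) : unit -> ev option =
  let rest = ref l in
  fun () ->
    match !rest with
    | [] -> None
    | e :: tl -> rest := tl; Some e

let parse_list (pull : unit -> ev option) : int list =
  (* One event of lookahead on top of the pull function. *)
  let buf = ref None in
  let peek () =
    (match !buf with
     | None -> buf := pull ()
     | Some _ -> ());
    !buf in
  let junk () = buf := None in
  let expect e =
    if peek () = Some e then junk ()
    else raise (Parse_error "unexpected event") in
  (* text ::= one character data event (assumed normalized), or nothing *)
  let parse_text () =
    match peek () with
    | Some (Char_data s) -> junk (); s
    | _ -> "" in
  (* object ::= START int, text, END int *)
  let parse_object () =
    expect (Start_tag "int");
    let n = int_of_string (parse_text ()) in
    expect (End_tag "int");
    n in
  (* sub_list ::= START cons, object, sub_list, END cons
                | START nil, END nil *)
  let rec parse_sub_list () =
    match peek () with
    | Some (Start_tag "cons") ->
        junk ();
        let head = parse_object () in
        let tail = parse_sub_list () in
        expect (End_tag "cons");
        head :: tail
    | Some (Start_tag "nil") ->
        junk ();
        expect (End_tag "nil");
        []
    | _ -> raise (Parse_error "cons or nil expected") in
  (* whole_list ::= START list, sub_list, END list, end of stream *)
  expect (Start_tag "list");
  let l = parse_sub_list () in
  expect (End_tag "list");
  expect End_of_stream;
  l
</code></pre>
<p> The design choice is the same as in the camlp4 version: each production becomes one function, and the pull function plays the role of the token source. The lookahead buffer is needed because a bare pull function cannot return an event once it has been read.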
<p> <a name="escape"></a> <h2>Escape PXP parsing</h2> <p> <b>This feature is still considered experimental!</b> <p> It is possible to define two escaping functions in <a href="Pxp_types.html#TYPEconfig"><code class="code"><span class="constructor">Pxp_types</span>.config</code></a>:<ul> <li><code class="code">escape_contents</code>: This function is called when one of the tokens <code class="code">{</code>, <code class="code">}</code>, <code class="code">{{</code>, or <code class="code">}}</code> is found in character data context.</li> <li><code class="code">escape_attributes</code>: This function is called when one of these tokens is found in the value of an attribute.</li> </ul> Both escaping functions are allowed to operate directly on the underlying lexical buffer PXP uses, and because of this these functions can interpret the following characters in an arbitrary special way. The escaping functions have to return a replacement text, i.e. a string that is to be taken as character data or as attribute value (depending on context). <p> Why are the curly braces taken as escaping characters? This is motivated by the XQuery language. There, a single <code class="code">{</code> switches from the XML object language to the XQuery meta language until a <code class="code">}</code> terminates this mode. By doubling the brace character, it loses its escaping function, and a single brace character is assumed. <p> A simple example makes this clearer. We allow here that a number is written between curly braces in hexadecimal, octal or binary notation using the conventions of O'Caml. The number is inserted into the event stream in normalized decimal notation (i.e. no leading zeros). 
For instance, one can write <p> <pre></pre><code class="code"> <foo x=<span class="string">"{0xff}"</span> y=<span class="string">"{{}}"</span>>{0o76}</foo><br> </code><pre></pre> <p> and the parser emits the events <p> <pre></pre><code class="code"> <span class="constructor">E_start_tag</span>(<span class="string">"foo"</span>, [<span class="string">"x"</span>, <span class="string">"255"</span>; <span class="string">"y"</span>, <span class="string">"{}"</span> ], _, _)<br> <span class="constructor">E_char_data</span>(<span class="string">"62"</span>)<br> <span class="constructor">E_end_tag</span>(<span class="string">"foo"</span>,_)<br> </code><pre></pre> <p> Of course, this example is very trivial, and in this case, one could also get the same effect by postprocessing the XML events. We want to point out, however, that the escaping feature makes it possible to combine PXP with a foreign language with its own lexing and parsing functions. <p> First, we need a lexer - this is <code class="code">lex.mll</code>: <p> <pre></pre><code class="code"> rule scan_number = parse<br> <span class="keywordsign">|</span> [ <span class="string">'0'</span>-<span class="string">'9'</span> ]+ <br> { <span class="keywordsign">`</span><span class="constructor">Int</span> (int_of_string (<span class="constructor">Lexing</span>.lexeme lexbuf)) }<br> <span class="keywordsign">|</span> (<span class="string">"0b"</span><span class="keywordsign">|</span><span class="string">"0B"</span>) [ <span class="string">'0'</span>-<span class="string">'1'</span> ]+ <br> { <span class="keywordsign">`</span><span class="constructor">Int</span> (int_of_string (<span class="constructor">Lexing</span>.lexeme lexbuf)) }<br> <span class="keywordsign">|</span> (<span class="string">"0o"</span><span class="keywordsign">|</span><span class="string">"0O"</span>) [ <span class="string">'0'</span>-<span class="string">'7'</span> ]+ <br> { <span class="keywordsign">`</span><span class="constructor">Int</span> 
(int_of_string (<span class="constructor">Lexing</span>.lexeme lexbuf)) }<br> <span class="keywordsign">|</span> (<span class="string">"0x"</span><span class="keywordsign">|</span><span class="string">"0X"</span>) [ <span class="string">'0'</span>-<span class="string">'9'</span> <span class="string">'a'</span>-<span class="string">'f'</span> <span class="string">'A'</span>-<span class="string">'F'</span> ]+ <br> { <span class="keywordsign">`</span><span class="constructor">Int</span> (int_of_string (<span class="constructor">Lexing</span>.lexeme lexbuf)) }<br> <span class="keywordsign">|</span> <span class="string">"}"</span> <br> { <span class="keywordsign">`</span><span class="constructor">End</span> }<br> <span class="keywordsign">|</span> _<br> { <span class="keywordsign">`</span><span class="constructor">Bad</span> }<br> <span class="keywordsign">|</span> eof<br> { <span class="keywordsign">`</span><span class="constructor">Eof</span> }<br> </code><pre></pre> <p> This lexer parses the various forms of numbers. We are lucky that we can use <code class="code">int_of_string</code> to convert these forms to ints. The right curly brace is also recognized. Any other character leads to a lexing error (<code class="code"><span class="keywordsign">`</span><span class="constructor">Bad</span></code>). If the XML file ends prematurely, <code class="code"><span class="keywordsign">`</span><span class="constructor">Eof</span></code> is emitted. <p> Now the escape functions. <code class="code">escape_contents</code> looks at the passed token. If it is a double curly brace, it immediately returns a single brace as replacement. A single left brace is processed by <code class="code">parse_number</code>, defined below. A single right brace is forbidden. No other tokens can be passed to <code class="code">escape_contents</code>. <code class="code">escape_attributes</code> has an additional argument, but we can ignore this for now. 
(This argument is the position in the attribute value, for advanced post-processing.) <p> <pre></pre><code class="code"> <span class="keyword">let</span> escape_contents tok mng =<br> <span class="keyword">match</span> tok <span class="keyword">with</span><br> <span class="keywordsign">|</span> <span class="constructor">Lcurly</span> <span class="comment">(* "{" *)</span> <span class="keywordsign">-></span><br> parse_number mng<br> <span class="keywordsign">|</span> <span class="constructor">LLcurly</span> <span class="comment">(* "{{" *)</span> <span class="keywordsign">-></span><br> <span class="string">"{"</span><br> <span class="keywordsign">|</span> <span class="constructor">Rcurly</span> <span class="comment">(* "}" *)</span> <span class="keywordsign">-></span><br> failwith <span class="string">"Single } not allowed"</span><br> <span class="keywordsign">|</span> <span class="constructor">RRcurly</span> <span class="comment">(* "}}" *)</span> <span class="keywordsign">-></span><br> <span class="string">"}"</span><br> <span class="keywordsign">|</span> _ <span class="keywordsign">-></span><br> <span class="keyword">assert</span> <span class="keyword">false</span><br> <br> <span class="keyword">let</span> escape_attributes tok pos mng =<br> escape_contents tok mng<br> </code><pre></pre> <p> Now, <code class="code">parse_number</code> invokes our custom lexer <code class="code"><span class="constructor">Lex</span>.scan_number</code> with the (otherwise) internal PXP lexbuf. The function returns the replacement text. <p> It is part of the interface that the next token of the lexbuf must be the character following the right curly brace. 
<p> <pre></pre><code class="code"> <span class="keyword">let</span> parse_number mng =<br> <span class="keyword">let</span> lexbuf = <br> <span class="keyword">match</span> mng <span class="keywordsign">#</span> current_lexer_obj <span class="keywordsign">#</span> lexbuf <span class="keyword">with</span><br> <span class="keywordsign">|</span> <span class="keywordsign">`</span><span class="constructor">Ocamllex</span> lexbuf <span class="keywordsign">-></span> lexbuf<br> <span class="keywordsign">|</span> <span class="keywordsign">`</span><span class="constructor">Netulex</span> _ <span class="keywordsign">-></span> failwith <span class="string">"Netulex lexbufs not supported"</span> <span class="keyword">in</span><br> <span class="keyword">match</span> <span class="constructor">Lex</span>.scan_number lexbuf <span class="keyword">with</span><br> <span class="keywordsign">|</span> <span class="keywordsign">`</span><span class="constructor">Int</span> n <span class="keywordsign">-></span><br> <span class="keyword">let</span> s = string_of_int n <span class="keyword">in</span><br> ( <span class="keyword">match</span> <span class="constructor">Lex</span>.scan_number lexbuf <span class="keyword">with</span><br> <span class="keywordsign">|</span> <span class="keywordsign">`</span><span class="constructor">Int</span> _ <span class="keywordsign">-></span><br> failwith <span class="string">"More than one number"</span><br> <span class="keywordsign">|</span> <span class="keywordsign">`</span><span class="constructor">End</span> <span class="keywordsign">-></span><br> ()<br> <span class="keywordsign">|</span> <span class="keywordsign">`</span><span class="constructor">Bad</span> <span class="keywordsign">-></span><br> failwith <span class="string">"Bad character"</span><br> <span class="keywordsign">|</span> <span class="keywordsign">`</span><span class="constructor">Eof</span> <span class="keywordsign">-></span><br> failwith <span class="string">"Unexpected EOF"</span><br> 
);<br> s<br> <span class="keywordsign">|</span> <span class="keywordsign">`</span><span class="constructor">End</span> <span class="keywordsign">-></span><br> failwith <span class="string">"Empty curly braces"</span><br> <span class="keywordsign">|</span> <span class="keywordsign">`</span><span class="constructor">Bad</span> <span class="keywordsign">-></span><br> failwith <span class="string">"Bad character"</span><br> <span class="keywordsign">|</span> <span class="keywordsign">`</span><span class="constructor">Eof</span> <span class="keywordsign">-></span><br> failwith <span class="string">"Unexpected EOF"</span><br> </code><pre></pre> <p> Due to the way PXP works internally, the method <code class="code">mng <span class="keywordsign">#</span> current_lexer_obj <span class="keywordsign">#</span> lexbuf</code> can return two different kinds of lexical buffers. <code class="code"><span class="keywordsign">`</span><span class="constructor">Ocamllex</span></code> means it is a <code class="code"><span class="constructor">Lexing</span>.lexbuf</code> buffer. This type of buffer is used for all 8-bit encodings, and if the special <code class="code">pxp-lex-utf8</code> lexer is used. The lexer <code class="code">pxp-ulex-utf8</code>, however, will return a <code class="code"><span class="constructor">Netulex</span></code>-style buffer. <p> Finally, we enable our escaping functions in the config record: <p> <pre></pre><code class="code"><span class="keyword">let</span> config =<br> { <span class="constructor">Pxp_types</span>.default_config <span class="keyword">with</span><br> escape_contents = <span class="constructor">Some</span> escape_contents;<br> escape_attributes = <span class="constructor">Some</span> escape_attributes<br> }<br> </code><pre></pre> <p> <a name="3_Howacomplexexamplecouldwork"></a> <h3>How a complex example could work</h3> <p> The example above is simple because the return value is a string. One can imagine, however, complex scenarios where one wants to insert custom events into the event stream. 
The PXP interface does not allow this directly. As a workaround we suggest the following. <p> The custom events are collected in special buffers. The buffers are numbered by sequential integers (0, 1, ...). So <code class="code">escape_contents</code> would allocate such a buffer and get a number: <p> <pre></pre><code class="code"> <span class="keyword">let</span> buffer, n = allocate_event_buffer()<br> </code><pre></pre> <p> Here, <code class="code">buffer</code> could be an <code class="code">event <span class="constructor">Queue</span>.t</code>. The number <code class="code">n</code> identifies the buffer. The buffers, once filled, can be looked up by <p> <pre></pre><code class="code"> <span class="keyword">let</span> buffer = lookup_event_buffer n<br> </code><pre></pre> <p> Ideally, <code class="code">escape_contents</code> would return the events collected in the buffer, so that these are inserted into the event stream at the position where the curly escape occurs. As this is not allowed, it simply returns the buffer number instead so that it can be identified later, e.g. <p> <pre></pre><code class="code"> <span class="string">"{BUFFER "</span> ^ string_of_int n ^ <span class="string">"}"</span><br> </code><pre></pre> <p> For unescaping curly braces one would insert special tokens, e.g. <code class="code"><span class="string">"{LCURLY}"</span></code> and <code class="code"><span class="string">"{RCURLY}"</span></code>. <p> Now, the parser, specially configured with <code class="code">escape_contents</code>, will return event streams where <code class="code"><span class="constructor">E_char_data</span></code> events may include these special pointers to buffers <code class="code">{<span class="constructor">BUFFER</span> </code><n><code class="code">}</code>, and the curly brace tokens <code class="code">{<span class="constructor">LCURLY</span>}</code> and <code class="code">{<span class="constructor">RCURLY</span>}</code>. 
In a postprocessing step, all occurrences of these tokens are located in the event stream, and<ul> <li>for buffer tokens the buffer contents are looked up (<code class="code">lookup_event_buffer</code>), and the events found there are substituted</li> <li>for <code class="code">{<span class="constructor">LCURLY</span>}</code> an <code class="code"><span class="constructor">E_char_data</span> <span class="string">"{"</span></code> event is substituted</li> <li>for <code class="code">{<span class="constructor">RCURLY</span>}</code> an <code class="code"><span class="constructor">E_char_data</span> <span class="string">"}"</span></code> event is substituted</li> </ul> It can be assumed that the tokens to locate are still <code class="code"><span class="constructor">E_char_data</span></code> events of their own, i.e. not merged with adjacent <code class="code"><span class="constructor">E_char_data</span></code> events. <p> Admittedly, this is a complicated workaround. <p> For attributes one can do basically the same. The postprocessing step may be a lot more complicated, however. <br> </body></html>