<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<link rel="stylesheet" href="style.css" type="text/css">
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type">
<link rel="Start" href="index.html">
<link rel="previous" href="Pxp_reader.html">
<link rel="next" href="Intro_extensions.html">
<link rel="Up" href="index.html">
<link title="Index of types" rel=Appendix href="index_types.html">
<link title="Index of exceptions" rel=Appendix href="index_exceptions.html">
<link title="Index of values" rel=Appendix href="index_values.html">
<link title="Index of class methods" rel=Appendix href="index_methods.html">
<link title="Index of classes" rel=Appendix href="index_classes.html">
<link title="Index of class types" rel=Appendix href="index_class_types.html">
<link title="Index of modules" rel=Appendix href="index_modules.html">
<link title="Index of module types" rel=Appendix href="index_module_types.html">
<link title="Pxp_types" rel="Chapter" href="Pxp_types.html">
<link title="Pxp_document" rel="Chapter" href="Pxp_document.html">
<link title="Pxp_dtd" rel="Chapter" href="Pxp_dtd.html">
<link title="Pxp_tree_parser" rel="Chapter" href="Pxp_tree_parser.html">
<link title="Pxp_core_types" rel="Chapter" href="Pxp_core_types.html">
<link title="Pxp_ev_parser" rel="Chapter" href="Pxp_ev_parser.html">
<link title="Pxp_event" rel="Chapter" href="Pxp_event.html">
<link title="Pxp_dtd_parser" rel="Chapter" href="Pxp_dtd_parser.html">
<link title="Pxp_codewriter" rel="Chapter" href="Pxp_codewriter.html">
<link title="Pxp_marshal" rel="Chapter" href="Pxp_marshal.html">
<link title="Pxp_yacc" rel="Chapter" href="Pxp_yacc.html">
<link title="Pxp_reader" rel="Chapter" href="Pxp_reader.html">
<link title="Intro_trees" rel="Chapter" href="Intro_trees.html">
<link title="Intro_extensions" rel="Chapter" href="Intro_extensions.html">
<link title="Intro_namespaces" rel="Chapter" href="Intro_namespaces.html">
<link title="Intro_events" rel="Chapter" href="Intro_events.html">
<link title="Intro_resolution" rel="Chapter" href="Intro_resolution.html">
<link title="Intro_getting_started" rel="Chapter" href="Intro_getting_started.html">
<link title="Intro_advanced" rel="Chapter" href="Intro_advanced.html">
<link title="Intro_preprocessor" rel="Chapter" href="Intro_preprocessor.html">
<link title="Example_readme" rel="Chapter" href="Example_readme.html"><link title="The structure of document trees" rel="Section" href="#1_Thestructureofdocumenttrees">
<link title="Access methods" rel="Subsection" href="#access">
<link title="Mutation methods" rel="Subsection" href="#2_Mutationmethods">
<link title="Links between nodes" rel="Subsection" href="#2_Linksbetweennodes">
<link title="Optional features of document trees" rel="Subsection" href="#2_Optionalfeaturesofdocumenttrees">
<link title="Optional features of nodes" rel="Subsection" href="#2_Optionalfeaturesofnodes">
<link title="Creating nodes and trees" rel="Subsection" href="#2_Creatingnodesandtrees">
<link title="Extended nodes" rel="Subsection" href="#2_Extendednodes">
<link title="Namespaces" rel="Subsection" href="#2_Namespaces">
<link title="Details of the mapping from XML text to the tree representation" rel="Subsection" href="#2_DetailsofthemappingfromXMLtexttothetreerepresentation">
<title>PXP Reference : Intro_trees</title>
</head>
<body>
<div class="navbar"><a href="Pxp_reader.html">Previous</a>
&nbsp;<a href="index.html">Up</a>
&nbsp;<a href="Intro_extensions.html">Next</a>
</div>
<center><h1>Intro_trees</h1></center>
<br>
<br>
<b>Editorial note:</b> The PXP tree API is fairly complex, and the
definition in <a href="Pxp_document.html"><code class="code"><span class="constructor">Pxp_document</span></code></a> is hard to read without some background
information. This introduction provides that background.
<p>

Note that there is also a stream representation of XML. See
<a href="Intro_events.html"><code class="code"><span class="constructor">Intro_events</span></code></a> for more about this.
<p>

<a name="1_Thestructureofdocumenttrees"></a>
<h1>The structure of document trees</h1>
<p>

In a document parsed with the default parser settings every node
represents either an element or a character data section. There are
two classes implementing the two aspects of nodes:
<a href="Pxp_document.element_impl.html"><code class="code"><span class="constructor">Pxp_document</span>.element_impl</code></a>, and
<a href="Pxp_document.data_impl.html"><code class="code"><span class="constructor">Pxp_document</span>.data_impl</code></a>. There are configurations which
allow more node types to be created, in particular processing
instruction nodes, comment nodes, and super root nodes, but these are
discussed later.  Note that you can always add these extra node types
yourself to the node tree no matter what the parser configuration
specifies.
<p>

The following figure 
shows an example of how
a tree is constructed from element and data nodes. The circular areas 
represent element nodes whereas the ovals denote data nodes. Only elements
may have subnodes; data nodes are always leaves of the tree. The subnodes
of an element can be either element or data nodes; in both cases the OCaml
objects storing the nodes have the class type <a href="Pxp_document.node.html"><code class="code"><span class="constructor">Pxp_document</span>.node</code></a>.
<p>

<div class="picture"><div class="picture-caption">A tree with element nodes, data nodes, and attributes</div><img src="../pic/node_term.gif"></div>
<p>

Attributes (the clouds in the picture) do not appear as nodes of the
tree, and one must use special access methods to get them.
<p>

You would get such a tree by parsing with
<p>

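<pre></pre><code class="code">&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;config&nbsp;=&nbsp;<span class="constructor">Pxp_types</span>.default_config<br>
&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;source&nbsp;=&nbsp;<span class="constructor">Pxp_types</span>.from_string&nbsp;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="string">"&lt;a&nbsp;att=\"apple\"&gt;&lt;b&gt;&lt;a&nbsp;att=\"orange\"&gt;An&nbsp;orange&lt;/a&gt;Cherries&lt;/b&gt;&lt;c/&gt;&lt;/a&gt;"</span><br>
&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;spec&nbsp;=&nbsp;<span class="constructor">Pxp_tree_parser</span>.default_spec<br>
&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;doc&nbsp;=&nbsp;<span class="constructor">Pxp_tree_parser</span>.parse_document_entity&nbsp;config&nbsp;source&nbsp;spec<br>
&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;root&nbsp;=&nbsp;doc<span class="keywordsign">#</span>root<br>
</code><pre></pre>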
<p>

The <code class="code">config</code> record sets a lot of parsing options. A number of these
options are explained below. The <code class="code">source</code> argument specifies where
the parsed text comes from. For the mysterious <code class="code">spec</code> parameter, see below.
<p>

The parser returns <code class="code">doc</code>, which is a <a href="Pxp_document.document.html"><code class="code"><span class="constructor">Pxp_document</span>.document</code></a>. You
have to call its <code class="code">root</code> method to get the root of the tree. Note that
there are other parsing functions that return nodes directly; these
are intended for parsing XML fragments, however. For ordinary closed
XML documents, use a function that returns a document.
<p>

The <code class="code">root</code> is, like the other nodes of the tree, an object of the
<a href="Pxp_document.node.html"><code class="code"><span class="constructor">Pxp_document</span>.node</code></a> class type.
<p>
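
As a minimal sketch of how such a tree can be inspected, the following
program walks the example tree and prints one line per node. It relies
on the <code class="code">node_type</code>, <code class="code">sub_nodes</code>, and <code class="code">data</code> methods of
<code class="code"><span class="constructor">Pxp_document</span>.node</code>. Two things are assumptions of this sketch:
<code class="code">print_tree</code> is a hypothetical helper that is not part of PXP, and
the text is parsed with <code class="code"><span class="constructor">Pxp_tree_parser</span>.parse_wfdocument_entity</code>
so that no DTD declarations are needed for the sample text.
<p>

<pre><code class="code">(* print_tree is a hypothetical helper; it is not part of PXP. *)
let rec print_tree indent (n : 'ext Pxp_document.node) =
  let pad = String.make indent ' ' in
  match n#node_type with
  | Pxp_document.T_element name -&gt;
      Printf.printf "%selement %s\n" pad name;
      (* only elements have subnodes *)
      List.iter (print_tree (indent + 2)) n#sub_nodes
  | Pxp_document.T_data -&gt;
      Printf.printf "%sdata %S\n" pad n#data
  | _ -&gt;
      (* other node types only occur with non-default settings *)
      Printf.printf "%s(other node)\n" pad

let () =
  let config = Pxp_types.default_config in
  let source =
    Pxp_types.from_string
      "&lt;a att=\"apple\"&gt;&lt;b&gt;&lt;a att=\"orange\"&gt;An orange&lt;/a&gt;Cherries&lt;/b&gt;&lt;c/&gt;&lt;/a&gt;" in
  let spec = Pxp_tree_parser.default_spec in
  (* Assumption: well-formedness mode, so that no DTD is required. *)
  let doc = Pxp_tree_parser.parse_wfdocument_entity config source spec in
  print_tree 0 doc#root
</code></pre>
<p>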

What about other things that can occur in XML text? As mentioned,
by default only elements and data nodes appear in the tree, but it is
possible to enable more node types by setting appropriate 
<a href="Pxp_types.html#TYPEconfig"><code class="code"><span class="constructor">Pxp_types</span>.config</code></a> options:
<p>
<ul>
<li>Comments are ignored by default. By setting the config option
<code class="code">enable_comment_nodes</code>, however, comments are added to the tree. There is a
special node type for comments.</li>
<li>Processing instructions (enclosed in <code class="code">&lt;? ... <span class="keywordsign">?&gt;</span></code> delimiters) are
not ignored, but normally no nodes are created for them. The
instructions are only gathered up, and attached to the surrounding
node, so one can check for their
presence but not for their exact location. By setting the
config option <code class="code">enable_pinstr_nodes</code>, however, processing instructions
are added to the tree as normal nodes. There is also a special node
type for them.</li>
<li>Usually, the topmost element is the root of the tree. There is, however,
the difficulty that the XML syntax allows one to surround
the topmost element
by comments and processing instructions. For an exact 
representation of this, it is possible to put an artificial root
node at the top of the tree, so that the topmost element is one of
the children, and the other surrounding material appears as the other
children. This mode is enabled by setting <code class="code">enable_super_root_node</code>.
The node is called super root node, and is also a special type of node.</li>
<li>It is possible to also get attributes and even namespaces as
node objects, but they are never put into the regular tree. To get
these very special nodes, one has to use special access methods.</li>
<li>CDATA sections (like <code class="code">&lt;![<span class="constructor">CDATA</span>[some text]]&gt;</code>) are simply added to the
surrounding data node,
so they do not appear as nodes of their own.</li>
<li>Entity references (like <code class="code"><span class="keywordsign">&amp;</span>amp;</code>) are automatically resolved, and
the resolution is added to the surrounding node.</li>
</ul>

The parser collapses as much data material into one
data node as possible such that there are normally never two adjacent data
nodes. This invariant is enforced even if data material is included by entity
references or CDATA sections, or if a data sequence is interrupted by
comments. So 
<p>

<pre></pre><code class="code">&nbsp;a&nbsp;<span class="keywordsign">&amp;</span>amp;&nbsp;b&nbsp;&lt;!--&nbsp;comment&nbsp;--&gt;&nbsp;c&nbsp;&lt;![<span class="constructor">CDATA</span>[&lt;&gt;&nbsp;d]]&gt;&nbsp;</code><pre></pre>
<p>

is represented by a single data node (in the default case where no
comment nodes are created).
Of course, you can manually create document trees that break this
invariant; it only describes how the parser forms the tree.
<p>
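
The invariant can be checked with a short experiment. The following
sketch wraps the text from above in a hypothetical element <code class="code">x</code>,
parses it in well-formedness mode (an assumption of this example), and
reports how many subnodes the element has. According to the description
above, the expectation is a single data node.
<p>

<pre><code class="code">let () =
  (* The element name x is made up for this example. *)
  let text = "&lt;x&gt; a &amp;amp; b &lt;!-- comment --&gt; c &lt;![CDATA[&lt;&gt; d]]&gt; &lt;/x&gt;" in
  let doc =
    Pxp_tree_parser.parse_wfdocument_entity
      Pxp_types.default_config
      (Pxp_types.from_string text)
      Pxp_tree_parser.default_spec in
  let children = doc#root#sub_nodes in
  Printf.printf "%d subnode(s)\n" (List.length children);
  (* print the character data collected in each subnode *)
  List.iter (fun n -&gt; Printf.printf "data: %S\n" n#data) children
</code></pre>
<p>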

All types of nodes are represented by OCaml objects of the same
class type <a href="Pxp_document.node.html"><code class="code"><span class="constructor">Pxp_document</span>.node</code></a>. The method 
<a href="Pxp_document.node.html#METHODnode_type"><code class="code"><span class="constructor">Pxp_document</span>.node.node_type</code></a> indicates
which type of node the object is. See the type
<a href="Pxp_document.html#TYPEnode_type"><code class="code"><span class="constructor">Pxp_document</span>.node_type</code></a> for details of how the node types mentioned above are
reflected by this method. For instance, for elements this method
returns <code class="code"><span class="constructor">T_element</span> n</code> where <code class="code">n</code> is the name of the element.
<p>

Note that this means formally that all access methods are implemented
for all node types. For example, you can get the attributes of 
data nodes by calling the <a href="Pxp_document.node.html#METHODattributes"><code class="code"><span class="constructor">Pxp_document</span>.node.attributes</code></a> method, 
although this does not
make sense. This problem is resolved on a case-by-case basis by
either returning an "empty value" or by raising appropriate
exceptions (e.g. <a href="Pxp_types.html#EXCEPTIONMethod_not_applicable"><code class="code"><span class="constructor">Pxp_types</span>.<span class="constructor">Method_not_applicable</span></code></a>).
For the chosen typing it is not possible to define slimmer class types
that better fit the various node types.
<p>

Attributes are usually represented as pairs
<code class="code">string * att_value</code> of names and values. Here,
<a href="Pxp_types.html#TYPEatt_value"><code class="code"><span class="constructor">Pxp_types</span>.att_value</code></a> is a conventional variant type. There are lots of
access methods for attributes, see below. It is optionally possible
to wrap the attributes as nodes (method
<a href="Pxp_document.node.html#METHODattributes_as_nodes"><code class="code"><span class="constructor">Pxp_document</span>.node.attributes_as_nodes</code></a>), but even in this case the attributes
are outside the regular document tree.
<p>
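
A small sketch of reading attributes as name/value pairs follows. It
assumes that <code class="code">att_value</code> has the three constructors <code class="code"><span class="constructor">Value</span></code>,
<code class="code"><span class="constructor">Valuelist</span></code>, and <code class="code"><span class="constructor">Implied_value</span></code>, as documented for
<code class="code"><span class="constructor">Pxp_types</span>.att_value</code>. The helper <code class="code">string_of_att</code> and the
attribute <code class="code">color</code> are made up for this example.
<p>

<pre><code class="code">(* string_of_att is a hypothetical helper; it is not part of PXP. *)
let string_of_att (v : Pxp_types.att_value) : string =
  match v with
  | Pxp_types.Value s -&gt; s
  | Pxp_types.Valuelist l -&gt; String.concat " " l
  | Pxp_types.Implied_value -&gt; "(implied)"

let () =
  let doc =
    Pxp_tree_parser.parse_wfdocument_entity
      Pxp_types.default_config
      (Pxp_types.from_string "&lt;a att=\"apple\" color=\"red\"/&gt;")
      Pxp_tree_parser.default_spec in
  (* attributes returns a list of (name, value) pairs *)
  List.iter
    (fun (name, value) -&gt;
       Printf.printf "%s = %s\n" name (string_of_att value))
    doc#root#attributes
</code></pre>
<p>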

Normally, processing instructions are not included
in the document tree either. They are considered an extra property of the
element to which they are attached, and can be retrieved by the
<a href="Pxp_document.node.html#METHODpinstr"><code class="code"><span class="constructor">Pxp_document</span>.node.pinstr</code></a>
method of the element node. If this way of handling processing instructions
is not exact enough, the parser can optionally create processing instruction
nodes that are regular members of the document tree.
<p>

<a name="access"></a>
<h2>Access methods</h2>
<p>

An overview of some relevant access methods:
<p>

<ul>
<li><b>General:</b>
    <ul>
<li><code class="code">dtd</code> (<a href="Pxp_document.node.html#METHODdtd"><code class="code"><span class="constructor">Pxp_document</span>.node.dtd</code></a>): returns the DTD object.
         All nodes have such an object, even in well-formed mode.</li>
<li><code class="code">encoding</code> (<a href="Pxp_document.node.html#METHODencoding"><code class="code"><span class="constructor">Pxp_document</span>.node.encoding</code></a>): returns the
         encoding of the in-memory document representation.</li>
</ul>
</li>
<li><b>Navigation:</b>
    <ul>
<li><code class="code">parent</code> (<a href="Pxp_document.node.html#METHODparent"><code class="code"><span class="constructor">Pxp_document</span>.node.parent</code></a>): returns the parent object
         of the node it is called on</li>
<li><code class="code">root</code> (<a href="Pxp_document.node.html#METHODroot"><code class="code"><span class="constructor">Pxp_document</span>.node.root</code></a>): returns the root of the tree
         the node is a member of</li>
<li><code class="code">sub_nodes</code> (<a href="Pxp_document.node.html#METHODsub_nodes"><code class="code"><span class="constructor">Pxp_document</span>.node.sub_nodes</code></a>): returns the children
         of the node </li>
<li><code class="code">previous_node</code> (<a href="Pxp_document.node.html#METHODprevious_node"><code class="code"><span class="constructor">Pxp_document</span>.node.previous_node</code></a>): returns the
         left sibling</li>
<li><code class="code">next_node</code> (<a href="Pxp_document.node.html#METHODnext_node"><code class="code"><span class="constructor">Pxp_document</span>.node.next_node</code></a>): returns the
         right sibling</li>
</ul>
</li>
<li><b>Information:</b>
    <ul>
<li><code class="code">node_position</code> (<a href="Pxp_document.node.html#METHODnode_position"><code class="code"><span class="constructor">Pxp_document</span>.node.node_position</code></a>): returns the
         ordinal position of this node as child of the parent</li>
<li><code class="code">node_path</code> (<a href="Pxp_document.node.html#METHODnode_path"><code class="code"><span class="constructor">Pxp_document</span>.node.node_path</code></a>): returns the positional
         path of this node in the whole tree</li>
<li><code class="code">node_type</code> (<a href="Pxp_document.node.html#METHODnode_type"><code class="code"><span class="constructor">Pxp_document</span>.node.node_type</code></a>): returns the type of
         the node</li>
<li><code class="code">position</code> (<a href="Pxp_document.node.html#METHODposition"><code class="code"><span class="constructor">Pxp_document</span>.node.position</code></a>): returns the position
         of the node in the parsed XML text</li>
</ul>
</li>
<li><b>Content:</b>
    <ul>
<li><code class="code">data</code> (<a href="Pxp_document.node.html#METHODdata"><code class="code"><span class="constructor">Pxp_document</span>.node.data</code></a>): returns the data contents of
         data nodes</li>
<li><code class="code">attributes</code> (<a href="Pxp_document.node.html#METHODattributes"><code class="code"><span class="constructor">Pxp_document</span>.node.attributes</code></a>): returns the attributes
         of elements</li>
<li><code class="code">attributes_as_nodes</code> (<a href="Pxp_document.node.html#METHODattributes_as_nodes"><code class="code"><span class="constructor">Pxp_document</span>.node.attributes_as_nodes</code></a>):
         also returns the attributes, but represented as a list of nodes
         residing outside the tree </li>
<li><code class="code">comment</code> (<a href="Pxp_document.node.html#METHODcomment"><code class="code"><span class="constructor">Pxp_document</span>.node.comment</code></a>): returns the text of the
         XML comment </li>
</ul>
</li>
<li><b>Validation:</b>
    <ul>
<li><code class="code">validate</code> (<a href="Pxp_document.node.html#METHODvalidate"><code class="code"><span class="constructor">Pxp_document</span>.node.validate</code></a>): validates the element
         locally </li>
</ul>
</li>
<li><b>Namespace:</b> (Only if namespaces are enabled, and the namespace-aware
        node implementation is used)
    <ul>
<li><code class="code">localname</code> (<a href="Pxp_document.node.html#METHODlocalname"><code class="code"><span class="constructor">Pxp_document</span>.node.localname</code></a>): returns the local name
         of the element in the namespace </li>
<li><code class="code">namespace_uri</code> (<a href="Pxp_document.node.html#METHODnamespace_uri"><code class="code"><span class="constructor">Pxp_document</span>.node.namespace_uri</code></a>): returns the
         namespace URI of the node </li>
<li><code class="code">namespace_scope</code> (<a href="Pxp_document.node.html#METHODnamespace_scope"><code class="code"><span class="constructor">Pxp_document</span>.node.namespace_scope</code></a>): returns
         the scope object with more namespace query methods </li>
<li><code class="code">namespaces_as_nodes</code> (<a href="Pxp_document.node.html#METHODnamespaces_as_nodes"><code class="code"><span class="constructor">Pxp_document</span>.node.namespaces_as_nodes</code></a>):
         returns the namespaces this node is a member of, represented
         as a list of nodes</li>
</ul>
</li>
</ul>

<p>
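
The navigation and information methods can be tried out on the example
tree from the beginning of this chapter. The following is only a sketch:
<code class="code">name_of</code> is a hypothetical helper, the text is parsed in
well-formedness mode, and it is assumed that <code class="code">next_node</code> raises
<code class="code"><span class="constructor">Not_found</span></code> when there is no right sibling.
<p>

<pre><code class="code">(* name_of is a hypothetical helper; it is not part of PXP. *)
let name_of (n : 'ext Pxp_document.node) : string =
  match n#node_type with
  | Pxp_document.T_element name -&gt; name
  | Pxp_document.T_data -&gt; "(data)"
  | _ -&gt; "(other)"

let () =
  let doc =
    Pxp_tree_parser.parse_wfdocument_entity
      Pxp_types.default_config
      (Pxp_types.from_string
         "&lt;a att=\"apple\"&gt;&lt;b&gt;&lt;a att=\"orange\"&gt;An orange&lt;/a&gt;Cherries&lt;/b&gt;&lt;c/&gt;&lt;/a&gt;")
      Pxp_tree_parser.default_spec in
  (* Descend to the inner a element: the first child of b, which is
     the first child of the root. *)
  let b = List.hd doc#root#sub_nodes in
  let inner = List.hd b#sub_nodes in
  Printf.printf "parent: %s\n" (name_of inner#parent);
  Printf.printf "root: %s\n" (name_of inner#root);
  Printf.printf "position among siblings: %d\n" inner#node_position;
  Printf.printf "path from the root: %s\n"
    (String.concat "/" (List.map string_of_int inner#node_path));
  (* Assumption: Not_found signals a missing sibling. *)
  (try Printf.printf "right sibling: %s\n" (name_of inner#next_node)
   with Not_found -&gt; print_endline "no right sibling")
</code></pre>
<p>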

<a name="2_Mutationmethods"></a>
<h2>Mutation methods</h2>
<p>

Trees are mutable, and nodes are mutable. Note that the tree is not 
automatically (re-)validated when it is changed. You have to explicitly
call validation methods, or the <a href="Pxp_document.html#VALvalidate"><code class="code"><span class="constructor">Pxp_document</span>.validate</code></a> function for the
whole tree.
<p>

<ul>
<li><b>Building trees, changing the structure of trees:</b>
    <ul>
<li><code class="code">append_node</code> (<a href="Pxp_document.node.html#METHODappend_node"><code class="code"><span class="constructor">Pxp_document</span>.node.append_node</code></a>): appends a
        node as new child to this node </li>
<li><code class="code">remove</code> (<a href="Pxp_document.node.html#METHODremove"><code class="code"><span class="constructor">Pxp_document</span>.node.remove</code></a>): removes this node from
        the tree </li>
</ul>
</li>
<li><b>Changing the content of nodes:</b>
    <ul>
<li><code class="code">set_data</code> (<a href="Pxp_document.node.html#METHODset_data"><code class="code"><span class="constructor">Pxp_document</span>.node.set_data</code></a>): changes the contents
        of data nodes </li>
<li><code class="code">set_attribute</code> (<a href="Pxp_document.node.html#METHODset_attribute"><code class="code"><span class="constructor">Pxp_document</span>.node.set_attribute</code></a>): adds or
        changes an attribute </li>
<li><code class="code">set_comment</code> (<a href="Pxp_document.node.html#METHODset_comment"><code class="code"><span class="constructor">Pxp_document</span>.node.set_comment</code></a>): changes the
        contents of comment nodes </li>
</ul>
</li>
<li><b>Creating nodes:</b>
    <ul>
<li><code class="code">create_element</code> (<a href="Pxp_document.node.html#METHODcreate_element"><code class="code"><span class="constructor">Pxp_document</span>.node.create_element</code></a>):
         called on an element node, this method creates a new tree only
         consisting of an element, and the only node of the tree is an
         object of the same class as this node</li>
<li><code class="code">create_data</code> (<a href="Pxp_document.node.html#METHODcreate_data"><code class="code"><span class="constructor">Pxp_document</span>.node.create_data</code></a>): same for data
         nodes </li>
<li><code class="code">orphaned_clone</code> (<a href="Pxp_document.node.html#METHODorphaned_clone"><code class="code"><span class="constructor">Pxp_document</span>.node.orphaned_clone</code></a>):
         creates a copy of the subtree starting at this node </li>
</ul>
</li>
</ul>

<p>
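
The content-changing methods might be used as in the following sketch.
It assumes the signatures documented in <code class="code"><span class="constructor">Pxp_document</span>.node</code>:
<code class="code">set_data</code> takes the new string, <code class="code">set_attribute</code> takes the
attribute name and an <code class="code">att_value</code>, <code class="code">validate</code> takes <code class="code">()</code>, and
<code class="code">write</code> takes an output stream and an encoding. The sample text is
made up for this example.
<p>

<pre><code class="code">let () =
  let doc =
    Pxp_tree_parser.parse_wfdocument_entity
      Pxp_types.default_config
      (Pxp_types.from_string "&lt;a att=\"apple\"&gt;old text&lt;/a&gt;")
      Pxp_tree_parser.default_spec in
  let root = doc#root in
  (* replace the contents of the only data node *)
  (match root#sub_nodes with
   | [d] -&gt; d#set_data "new text"
   | _ -&gt; ());
  (* change the value of an attribute that already exists *)
  root#set_attribute "att" (Pxp_types.Value "pear");
  (* nothing is re-validated automatically, so ask for it explicitly *)
  root#validate ();
  root#write (`Out_channel stdout) `Enc_utf8;
  print_newline ()
</code></pre>
<p>

In well-formedness mode the explicit <code class="code">validate</code> call has nothing to
check; it is shown here only to mark the place where a validating
program would re-check the modified node.
<p>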

<a name="2_Linksbetweennodes"></a>
<h2>Links between nodes</h2>
<p>

The node tree has links in both directions: every node has a link to its
parent (if any), and it has links to its subnodes (see the following
picture). This doubly-linked structure simplifies navigation in the tree,
but it also has some consequences for the operations that are possible
on trees.
<p>

<div class="picture"><div class="picture-caption">Nodes are doubly linked trees</div><img src="../pic/node_general.gif"></div>
<p>

(Definitions: <a href="Pxp_document.node.html#METHODparent"><code class="code"><span class="constructor">Pxp_document</span>.node.parent</code></a>, <a href="Pxp_document.node.html#METHODsub_nodes"><code class="code"><span class="constructor">Pxp_document</span>.node.sub_nodes</code></a>.)
<p>

Because every node must have at most <b>one</b> parent node,
operations are illegal if they violate this condition. The following figure
shows on the left side
that node <code class="code">y</code> is added to <code class="code">x</code> as new subnode
which is allowed because <code class="code">y</code> does not have a parent yet. The
right side of the picture illustrates what would happen if <code class="code">y</code>
had a parent node; this is illegal because <code class="code">y</code> would have two
parents after the operation.
<p>

<div class="picture"><div class="picture-caption">A node can only be added if it is a root</div><img src="../pic/node_add.gif"></div>
<p>

(Definition: <a href="Pxp_document.node.html#METHODappend_node"><code class="code"><span class="constructor">Pxp_document</span>.node.append_node</code></a>.)
<p>

The <code class="code">remove</code> operation simply removes the links between two nodes. In the
following picture the node
<code class="code">x</code> is deleted from the list of subnodes of
<code class="code">y</code>. After that, <code class="code">x</code> becomes the root of the
subtree starting at this node.
<p>

<div class="picture"><div class="picture-caption">A removed node becomes the root of the subtree</div><img src="../pic/node_delete.gif"></div>
<p>

(Definition: <a href="Pxp_document.node.html#METHODremove"><code class="code"><span class="constructor">Pxp_document</span>.node.remove</code></a>.)
<p>

It is also possible to make a clone of a subtree, as illustrated in the
next picture. The
clone is a copy of the original subtree except that it is no longer a
subnode. Because cloning never keeps the connection to the parent, the clones
are called <b>orphaned</b>.
<p>

<div class="picture"><div class="picture-caption">The clone of a subtree</div><img src="../pic/node_clone.gif"></div>
<p>

(Definition: <a href="Pxp_document.node.html#METHODorphaned_clone"><code class="code"><span class="constructor">Pxp_document</span>.node.orphaned_clone</code></a>.)
<p>
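
The three operations can be combined in a short experiment. This is a
sketch under stated assumptions: the sample text with the elements
<code class="code">r</code>, <code class="code">x</code>, and <code class="code">y</code> is made up, and the kind of exception raised
by an illegal <code class="code">append_node</code> is deliberately not assumed, so the
exception is only caught and printed.
<p>

<pre><code class="code">let () =
  let doc =
    Pxp_tree_parser.parse_wfdocument_entity
      Pxp_types.default_config
      (Pxp_types.from_string "&lt;r&gt;&lt;x/&gt;&lt;y/&gt;&lt;/r&gt;")
      Pxp_tree_parser.default_spec in
  let r = doc#root in
  match r#sub_nodes with
  | [x; y] -&gt;
      (* Illegal: y is still a child of r and would get two parents. *)
      (try x#append_node y
       with e -&gt; Printf.printf "rejected: %s\n" (Printexc.to_string e));
      (* remove cuts the link to r, so y becomes a root of its own *)
      y#remove ();
      (* legal now, because y has no parent *)
      x#append_node y;
      (* the clone copies the subtree at x but has no parent link *)
      let copy = x#orphaned_clone in
      Printf.printf "r has %d child(ren), the clone of x has %d\n"
        (List.length r#sub_nodes) (List.length copy#sub_nodes)
  | _ -&gt; prerr_endline "unexpected tree shape"
</code></pre>
<p>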

<a name="2_Optionalfeaturesofdocumenttrees"></a>
<h2>Optional features of document trees</h2>
<p>

As already pointed out, the parser creates only element and data nodes by
default. The parser is configured through the 
<a href="Pxp_types.html#TYPEconfig"><code class="code"><span class="constructor">Pxp_types</span>.config</code></a> record. Note that this configuration only
controls the parser: if you create trees of your own, you can add any of
the additional node types to the tree without enabling these features.
<p>

The following optional features change the way the parser constructs
document trees:
<p>

<ul>
<li>When <code class="code">enable_super_root_node</code> is set, the extra super root node
is generated at the top of the tree. This node has type <code class="code"><span class="constructor">T_super_root</span></code>.</li>
<li>The option <code class="code">enable_comment_nodes</code> lets the
parser add comment nodes when it parses comments. These nodes have
type <code class="code"><span class="constructor">T_comment</span></code>.</li>
<li>The option <code class="code">enable_pinstr_nodes</code> changes the
way processing instructions are added to the document. Instead of attaching
such instructions to their containing elements as additional properties, this
mode forces the parser to create real nodes of type <code class="code"><span class="constructor">T_pinstr</span></code> for them.</li>
<li>The option <code class="code">drop_ignorable_whitespace</code> (enabled by default) can
be turned off. It controls whether the parser skips over so-called ignorable
whitespace. The XML standard allows elements to contain whitespace 
characters even if the DTD declares that they do not contain character data. 
Because of this, the parser considers such whitespace as ignorable detail 
of the XML instance, and drops the characters silently. You can change
this by setting <code class="code">drop_ignorable_whitespace</code> to <code class="code"><span class="keyword">false</span></code>. In
this case, every character of the XML instance will be accepted by the
parser and will be added to a data node of the document tree.</li>
<li>By default, the parser creates elements with an annotation
about the location in the XML source file. You can query this location by
calling the method <code class="code">position</code>. As this requires a lot of
memory, it is possible to turn this off by setting
<code class="code">store_element_positions</code> to <code class="code"><span class="keyword">false</span></code>.</li>
</ul>

<p>
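
As a sketch of how the three structure-changing options are switched on,
the following program derives a configuration from
<code class="code"><span class="constructor">Pxp_types</span>.default_config</code> and prints the node types found in a
small text. The helper <code class="code">show</code>, the sample text, and the use of
well-formedness mode are assumptions of this example. It also assumes
that <code class="code"><span class="constructor">T_pinstr</span></code> carries the target name of the instruction.
<p>

<pre><code class="code">let config =
  { Pxp_types.default_config with
    Pxp_types.enable_super_root_node = true;
    Pxp_types.enable_comment_nodes = true;
    Pxp_types.enable_pinstr_nodes = true;
  }

(* show is a hypothetical helper; it is not part of PXP. *)
let rec show indent (n : 'ext Pxp_document.node) =
  let pad = String.make indent ' ' in
  let children () = List.iter (show (indent + 2)) n#sub_nodes in
  match n#node_type with
  | Pxp_document.T_super_root -&gt;
      Printf.printf "%ssuper root\n" pad; children ()
  | Pxp_document.T_element name -&gt;
      Printf.printf "%selement %s\n" pad name; children ()
  | Pxp_document.T_data -&gt;
      Printf.printf "%sdata\n" pad
  | Pxp_document.T_comment -&gt;
      Printf.printf "%scomment\n" pad
  | Pxp_document.T_pinstr target -&gt;
      Printf.printf "%sprocessing instruction %s\n" pad target
  | _ -&gt;
      Printf.printf "%s(other node)\n" pad

let () =
  (* a comment and an instruction surround the topmost element *)
  let text = "&lt;!-- head --&gt;&lt;?note top?&gt;&lt;a&gt;text&lt;!-- inner --&gt;&lt;/a&gt;" in
  let doc =
    Pxp_tree_parser.parse_wfdocument_entity
      config (Pxp_types.from_string text) Pxp_tree_parser.default_spec in
  show 0 doc#root
</code></pre>
<p>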

There are a number of further configuration options; however,
these options do not change the structure of the document tree. 
<p>

<a name="2_Optionalfeaturesofnodes"></a>
<h2>Optional features of nodes</h2>
<p>

The following features exist per node, and are simply invoked by
using the methods dealing with them.
<p>

<ul>
<li>Attribute nodes: These are useful 
if you want to have data structures that contain
attributes together with other types of nodes. The method
<code class="code">attributes_as_nodes</code> returns the attributes wrapped into node
objects. Note that these nodes are read-only.</li>
<li>Validation: The document
nodes contain the routines validating the document body. Of course, the
validation checks depend on what is stored in the DTD object. 
(There is always a DTD object, even in well-formedness mode; in that case
it is mostly empty, and validation is a no-op.)
<p>

The DTD object contains the declarations of
elements, attribute lists, entities, and notations. Furthermore, the 
DTD knows
whether the document is flagged as "standalone". As a PXP extension to
classic XML processing, the DTD may specify a mixed mode between
"validating mode" and "well-formedness mode". It is possible to allow
non-declared elements in the document, but to check declared elements 
against their declaration at
the same time. Moreover, there is a similar feature for attribute lists; 
you can allow non-declared attributes and check declared attributes. 
(Strictly speaking, the
parser always works in this mixed mode, and
the "validating mode" and the "well-formedness mode" are only its two
extremes.)</li>
</ul>

<p>

<a name="2_Creatingnodesandtrees"></a>
<h2>Creating nodes and trees</h2>
<p>

Often, the parser creates the trees, but on occasion it is useful to
create trees manually. Only the basic mechanism is explained here. There
is also a camlp4 syntax extension called pxp-pp (see <a href="Intro_preprocessor.html"><code class="code"><span class="constructor">Intro_preprocessor</span></code></a>) that allows
a much more convenient notation in programs.
<p>

The most basic way of creating new nodes is to call the <code class="code">create_element</code>,
<code class="code">create_data</code>, and <code class="code">create_other</code> methods of nodes. Using them directly
is not recommended, however, as they are very primitive.
<p>

In the <a href="Pxp_document.html"><code class="code"><span class="constructor">Pxp_document</span></code></a> module there are a number of functions creating
individual nodes (without children), the node constructors:
<p>
<ul>
<li><a href="Pxp_document.html#VALcreate_element_node"><code class="code"><span class="constructor">Pxp_document</span>.create_element_node</code></a>: creates an element node</li>
<li><a href="Pxp_document.html#VALcreate_data_node"><code class="code"><span class="constructor">Pxp_document</span>.create_data_node</code></a>: creates a data node</li>
<li><a href="Pxp_document.html#VALcreate_comment_node"><code class="code"><span class="constructor">Pxp_document</span>.create_comment_node</code></a>: creates a comment node</li>
<li><a href="Pxp_document.html#VALcreate_pinstr_node"><code class="code"><span class="constructor">Pxp_document</span>.create_pinstr_node</code></a>: creates a processing instruction node</li>
<li><a href="Pxp_document.html#VALcreate_super_root_node"><code class="code"><span class="constructor">Pxp_document</span>.create_super_root_node</code></a>: creates a super root node</li>
</ul>

There are no functions to create attribute and namespace nodes - these
are always created automatically by their containing nodes, so the user
does not need to do anything for creating them.
<p>

The node constructors must be equipped with all required data to
create the requested type of node. This includes the data that would
have been available in the textual XML representation, and some of the
meta data passed to the parsers, and meta data implicitly generated by
the parsers. For an element, this is at minimum:
<p>
<ul>
<li>The name of the element (e.g. the "foo" in <code class="code">&lt;foo&gt;</code>)</li>
<li>The attributes of the element</li>
<li>The DTD object to use (a rudimentary DTD object is even required if only 
  well-formedness checks will be applied but no validation)</li>
<li>The specification which classes are instantiated to create the nodes</li>
</ul>

For the latter two, see below. Optionally one can provide:
<p>
<ul>
<li>The position of the element in the XML text</li>
<li>Whether the attribute list is to be validated at creation time</li>
<li>Whether name pools are to be used for the attributes</li>
</ul>

Regarding validation, the default is to validate local data such as 
attributes, but to omit any checks of the position the node has in the
tree. The tree is still a singleton, and consists only of one node
after creation, so non-local checks do not make sense.
<p>

After some nodes have been created, they can be combined into more complex
trees by mutation methods (e.g. <a href="Pxp_document.node.html#METHODappend_node"><code class="code"><span class="constructor">Pxp_document</span>.node.append_node</code></a>).
<p>

As mentioned, a node must always be connected with a DTD object, even
if no validation checks will be done. It is possible to create DTD objects
that do not impose restrictions on the document:
<p>

<pre></pre><code class="code">&nbsp;<br>
&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;dtd&nbsp;=&nbsp;<span class="constructor">Pxp_dtd_parser</span>.create_empty_dtd&nbsp;config<br>
&nbsp;&nbsp;<span class="keyword">let</span>&nbsp;()&nbsp;=&nbsp;dtd&nbsp;<span class="keywordsign">#</span>&nbsp;allow_arbitrary<br>
</code><pre></pre>
<p>

Even such a DTD object can contain entity definitions, and can demand
a certain way of dealing with namespaces. Also, the character encoding
of the nodes is taken from the DTD. See <a href="Pxp_dtd.dtd.html"><code class="code"><span class="constructor">Pxp_dtd</span>.dtd</code></a> for
DTD methods, and <a href="Pxp_dtd_parser.html"><code class="code"><span class="constructor">Pxp_dtd_parser</span></code></a> for convenient ways to create DTD
objects. Note that all nodes of a tree must be connected to the same
DTD object.
<p>

PXP is not restricted to using built-in classes for nodes. When the
parser is invoked and a tree is built, it consults a so-called
<b>document model specification</b> to determine how the new objects are to be
created (type <a href="Pxp_document.html#TYPEspec"><code class="code"><span class="constructor">Pxp_document</span>.spec</code></a>). Basically, it is a list of sample
objects to use (which are called <b>exemplars</b>), and these objects are
cloned when a node is actually created.
<p>

When calling the node constructors directly (bypassing the parser),
the document model specification must also be passed to them as an
argument. It is used in the same way as the parser uses it.
<p>

For getting the built-in classes without any modification, just use
<a href="Pxp_tree_parser.html#VALdefault_spec"><code class="code"><span class="constructor">Pxp_tree_parser</span>.default_spec</code></a>. For the variant with enabled namespaces,
prefer <a href="Pxp_tree_parser.html#VALdefault_namespace_spec"><code class="code"><span class="constructor">Pxp_tree_parser</span>.default_namespace_spec</code></a>.
<p>

<a name="2_Extendednodes"></a>
<h2>Extended nodes</h2>
<p>

Every node in a tree has a so-called extension. By default, the
extension is practically empty and only present for formal uniformity.
However, one can also define custom extension classes, and effectively
add new methods to the node classes.
<p>

Node extensions are explained in detail here: <a href="Intro_extensions.html"><code class="code"><span class="constructor">Intro_extensions</span></code></a>
<p>

<a name="2_Namespaces"></a>
<h2>Namespaces</h2>
<p>

As an option, PXP processes namespace declarations in XML text.
See this separate introduction for details: <a href="Intro_namespaces.html"><code class="code"><span class="constructor">Intro_namespaces</span></code></a>.
<p>

<a name="2_DetailsofthemappingfromXMLtexttothetreerepresentation"></a>
<h2>Details of the mapping from XML text to the tree representation</h2>
<p>

If an element declaration does not allow the element to 
contain character data, the following rules apply.
<p>

If the element must be empty, i.e. it is declared with the
keyword <code class="code"><span class="constructor">EMPTY</span></code>, the element instance must be effectively
empty (it must not even contain whitespace characters). The parser guarantees
that a declared <code class="code"><span class="constructor">EMPTY</span></code> element never contains a data
node, even if the data node represents the empty string.
<p>

If the element declaration only permits other elements to occur
within that element but not character data, it is still possible to insert
whitespace characters between the subelements. The parser ignores these
characters, too, and does not create data nodes for them.
<p>

<b>Example.</b> Consider the following element types:
<p>

<pre></pre><code class="code">&lt;!<span class="constructor">ELEMENT</span>&nbsp;x&nbsp;(&nbsp;<span class="keywordsign">#</span><span class="constructor">PCDATA</span>&nbsp;<span class="keywordsign">|</span>&nbsp;z&nbsp;)*&nbsp;&gt;<br>
&lt;!<span class="constructor">ELEMENT</span>&nbsp;y&nbsp;(&nbsp;z&nbsp;)*&nbsp;&gt;<br>
&lt;!<span class="constructor">ELEMENT</span>&nbsp;z&nbsp;<span class="constructor">EMPTY</span>&gt;<br>
</code><pre></pre>
<p>

Only <code class="code">x</code> may contain character data; the keyword
<code class="code"><span class="keywordsign">#</span><span class="constructor">PCDATA</span></code> indicates this. The other types are character-free. 
<p>

The XML term
<p>

<pre></pre><code class="code">&lt;x&gt;&lt;z/&gt;&nbsp;&lt;z/&gt;&lt;/x&gt;<br>
</code><pre></pre>
<p>

will be internally represented by an element node for <code class="code">x</code> 
with three subnodes: the first <code class="code">z</code> element, a data node
containing the space character, and the second <code class="code">z</code> element. 
In contrast to this, the term
<p>

<pre></pre><code class="code">&lt;y&gt;&lt;z/&gt;&nbsp;&lt;z/&gt;&lt;/y&gt;<br>
</code><pre></pre>
<p>

is represented by an element node for <code class="code">y</code> with only
<b>two</b> subnodes, the two <code class="code">z</code> elements. There
is no data node for the space character because spaces are ignored in the
character-free element <code class="code">y</code>.
<p>

<b>Parser option:</b>
By setting the parser option <code class="code">drop_ignorable_whitespace</code> to
<code class="code"><span class="keyword">false</span></code>, the behaviour of the parser is changed such that
even ignorable whitespace characters are represented by data nodes.
<p>
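The whitespace rule can be illustrated with a toy model. The following is only a sketch and does not use PXP's node classes: the type <code class="code">node</code>, the function <code class="code">prune</code>, and the predicate <code class="code">character_free</code> (which says whether an element type is declared without <code class="code">#PCDATA</code>) are assumptions made for this illustration. The flag <code class="code">drop</code> plays the role of <code class="code">drop_ignorable_whitespace</code>.
<p>

```ocaml
(* Toy tree type for illustration; these are not PXP's node classes. *)
type node =
  | Element of string * node list
  | Data of string

(* True if s consists only of XML whitespace (space, tab, CR, LF). *)
let is_xml_whitespace s =
  let ok = ref true in
  String.iter
    (fun c ->
      if not (c = ' ' || c = '\t' || c = '\r' || c = '\n') then ok := false)
    s;
  !ok

(* Remove whitespace-only data nodes below character-free elements.
   character_free is a hypothetical predicate on element names;
   drop models the parser option drop_ignorable_whitespace. *)
let rec prune ~drop ~character_free node =
  match node with
  | Data _ -> node
  | Element (name, children) ->
      let children =
        List.map (fun c -> prune ~drop ~character_free c) children in
      let keep = function
        | Data s -> not (drop && character_free name && is_xml_whitespace s)
        | Element _ -> true
      in
      Element (name, List.filter keep children)
```
<p>

With <code class="code">y</code> and <code class="code">z</code> treated as character-free, pruning the model of <code class="code">&lt;y&gt;&lt;z/&gt; &lt;z/&gt;&lt;/y&gt;</code> leaves two subnodes, while the model of <code class="code">&lt;x&gt;&lt;z/&gt; &lt;z/&gt;&lt;/x&gt;</code> keeps all three.
<p>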

<a name="3_Therepresentationofcharacterdata"></a>
<h3>The representation of character data</h3>
<p>

The XML specification allows all Unicode characters in XML
texts. This parser can be configured such that UTF-8 is used to represent the
characters internally; however, the default character encoding is
ISO-8859-1. (Currently, no other encodings are possible for the internal string
representation; the type <a href="Pxp_types.html#TYPErep_encoding"><code class="code"><span class="constructor">Pxp_types</span>.rep_encoding</code></a> enumerates
the possible encodings. In principle, the parser could use any encoding that is
ASCII-compatible, but there are currently only lexical analyzers for UTF-8 and
ISO-8859-1. Using UTF-16 or UCS-4 as internal encodings (or other multibyte
encodings that are not ASCII-compatible) would require rewriting major parts
of the parser, which is unlikely to happen.)
<p>

The internal encoding may be different from the external encoding (specified
in the XML declaration <code class="code">&lt;?xml ... encoding=<span class="string">"..."</span><span class="keywordsign">?&gt;</span></code>); in
this case the strings are automatically converted to the internal encoding.
<p>

If the internal encoding is ISO-8859-1, it is possible that there are
characters that cannot be represented. In this case, the parser ignores such
characters and prints a warning (to the <code class="code">collect_warning</code>
object that must be passed when the parser is called).
<p>
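As a rough illustration of this behaviour, the sketch below encodes a list of Unicode code points as an ISO-8859-1 string and drops everything above 0xFF. It is not PXP code: the function <code class="code">to_latin1</code> and its <code class="code">warn</code> callback (standing in for the warning collector) are names invented for this example.
<p>

```ocaml
(* Encode Unicode code points as an ISO-8859-1 string. Code points
   outside 0..0xFF have no representation; they are dropped and
   reported through warn, a hypothetical callback that stands in
   for the parser's warning collector. *)
let to_latin1 ~warn code_points =
  let buf = Buffer.create (List.length code_points) in
  List.iter
    (fun cp ->
      if cp >= 0 && cp <= 0xFF then Buffer.add_char buf (Char.chr cp)
      else warn cp)
    code_points;
  Buffer.contents buf
```
<p>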

The XML specification allows lines to be separated by single LF
characters, by CR LF character sequences, or by single CR
characters. Internally, these separators are always converted to single LF
characters.
<p>
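The line separator rule amounts to the following transformation. This is a minimal sketch using only the standard library; the function name <code class="code">normalize_line_ends</code> is an assumption for illustration, and the code is not taken from the parser.
<p>

```ocaml
(* Convert CR LF pairs and single CR characters to a single LF.
   Existing LF characters are copied unchanged. *)
let normalize_line_ends s =
  let n = String.length s in
  let buf = Buffer.create n in
  let i = ref 0 in
  while !i < n do
    (match s.[!i] with
     | '\r' ->
         Buffer.add_char buf '\n';
         (* Skip the LF of a CR LF pair so the pair yields one LF. *)
         if !i + 1 < n && s.[!i + 1] = '\n' then incr i
     | c -> Buffer.add_char buf c);
    incr i
  done;
  Buffer.contents buf
```
<p>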

The parser guarantees that there are never two adjacent data
nodes; if necessary, data material that would otherwise be represented by
several nodes is collapsed into one node. Note that you can still create node
trees with adjacent data nodes; however, the parser does not return such trees.
<p>

Note that CDATA sections are not represented specially; such
sections are added to the current data material that is being collected for the
next data node.
<p>
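The guarantee about adjacent data nodes can be pictured on a toy tree. The sketch below is an assumption-laden illustration, not PXP's implementation: the type <code class="code">node</code> and the function <code class="code">merge_data</code> are invented here. In this model, the text of a CDATA section would simply be one more <code class="code">Data</code> item that gets merged with its neighbours.
<p>

```ocaml
(* Toy tree type for illustration; these are not PXP's node classes. *)
type node =
  | Element of string * node list
  | Data of string

(* Merge every run of adjacent Data siblings into a single Data node,
   descending recursively into elements. *)
let rec merge_data nodes =
  match nodes with
  | Data a :: Data b :: rest -> merge_data (Data (a ^ b) :: rest)
  | Element (name, children) :: rest ->
      Element (name, merge_data children) :: merge_data rest
  | (Data _ as d) :: rest -> d :: merge_data rest
  | [] -> []
```
<p>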

<a name="3_Therepresentationofentitieswithindocuments"></a>
<h3>The representation of entities within documents</h3>
<p>

<b>Entities are not represented within
documents!</b> If the parser finds an entity reference in the document
content, the reference is immediately expanded, and the parser reads the
expansion text instead of the reference.
<p>

<a name="3_Therepresentationofattributes"></a>
<h3>The representation of attributes</h3>
<p>

Because attribute values are also composed of Unicode characters, the same
character-encoding problems arise as for character data. Attribute values are
likewise converted to the internal encoding; characters that
cannot be represented are dropped, and a warning is printed.
<p>

Attribute values are normalized before they are returned by
methods like <code class="code">attribute</code>. First, any remaining entity
references are expanded; if necessary, expansion is performed recursively.
Second, newline characters (any of LF, CR LF, or CR) are converted
to single space characters. Note that the latter step in particular is
prescribed by the XML standard (but <code class="code"><span class="keywordsign">&amp;</span><span class="keywordsign">#</span>10;</code> is not converted,
so it is still possible to include line feeds in attributes).
<p>
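The difference between a literal line break and the character reference can be sketched as follows. This is a simplified illustration, not the parser's code: the function <code class="code">normalize_attribute</code> is a made-up name, it recognizes only the single reference <code class="code">&amp;#10;</code>, and general entity expansion is left out.
<p>

```ocaml
(* Literal line breaks (LF, CR LF, CR) become one space each, while
   the character reference &#10; is turned into a real line feed.
   Only this one reference is recognized; real entity expansion is
   omitted from the sketch. *)
let normalize_attribute raw =
  let n = String.length raw in
  let buf = Buffer.create n in
  let i = ref 0 in
  while !i < n do
    if !i + 5 <= n && String.sub raw !i 5 = "&#10;" then begin
      Buffer.add_char buf '\n';
      i := !i + 5
    end else begin
      (match raw.[!i] with
       | '\r' ->
           Buffer.add_char buf ' ';
           (* A CR LF pair yields a single space. *)
           if !i + 1 < n && raw.[!i + 1] = '\n' then incr i
       | '\n' -> Buffer.add_char buf ' '
       | c -> Buffer.add_char buf c);
      incr i
    end
  done;
  Buffer.contents buf
```
<p>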

<a name="3_Therepresentationofprocessinginstructions"></a>
<h3>The representation of processing instructions</h3>
<p>

Processing instructions are parsed to some extent: The first word of the
PI is called the target, and it is stored separately from the rest of the PI:
<p>

<pre></pre><code class="code">&lt;?target&nbsp;rest<span class="keywordsign">?&gt;</span><br>
</code><pre></pre>
<p>
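The split into target and rest can be sketched like this. The function <code class="code">split_pi</code> is a hypothetical helper written for this illustration; it assumes that the whitespace separating the target from the rest belongs to neither part, and it takes the PI content without the surrounding <code class="code">&lt;?</code> and <code class="code">?&gt;</code>.
<p>

```ocaml
(* Split the content of a processing instruction into (target, rest).
   The target ends at the first whitespace character; the whitespace
   that follows it is skipped and is not part of the rest. *)
let split_pi content =
  let is_space c = c = ' ' || c = '\t' || c = '\r' || c = '\n' in
  let n = String.length content in
  (* Index just past the target. *)
  let rec target_end i =
    if i < n && not (is_space content.[i]) then target_end (i + 1) else i
  in
  (* Index of the first non-whitespace character at or after i. *)
  let rec skip i =
    if i < n && is_space content.[i] then skip (i + 1) else i
  in
  let t = target_end 0 in
  let r = skip t in
  (String.sub content 0 t, String.sub content r (n - r))
```
<p>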

The exact location where a PI occurs is not represented (by
default). The parser attaches the PI to the object that represents the
embracing construct (an element, a DTD, or the whole document); that
means you can find out which PIs occur in a certain element, in the
DTD, or in the whole document, but you cannot look up the exact
position within the construct.
<p>

<b>Parser option:</b>
If you require the exact location of PIs, it is possible to
create regular nodes for them instead of attaching them to the surrounding
node as a property. This mode is controlled by the option
<code class="code">enable_pinstr_nodes</code>. The additional nodes have the node type
<code class="code"><span class="constructor">T_pinstr</span> target</code>, and are created
from special exemplars contained in the <code class="code">spec</code> (see
<a href="Pxp_document.html#TYPEspec"><code class="code"><span class="constructor">Pxp_document</span>.spec</code></a>). 
<p>

<a name="3_Therepresentationofcomments"></a>
<h3>The representation of comments</h3>
<p>

Normally, comments are not represented; they are dropped by
default.
<p>

<b>Parser option:</b>
However, if you require comments in the document tree, it is possible to create
<code class="code"><span class="constructor">T_comment</span></code> nodes for them. This mode can be specified by the
option <code class="code">enable_comment_nodes</code>. Comment nodes are created from
special exemplars contained in the <code class="code">spec</code> (see
<a href="Pxp_document.html#TYPEspec"><code class="code"><span class="constructor">Pxp_document</span>.spec</code></a>). You can access the contents of comments through the 
method <code class="code">comment</code>.
<p>

<a name="3_Theattributesxmllangandxmlspace"></a>
<h3>The attributes <code class="code">xml:lang</code> and <code class="code">xml:space</code> </h3>
<p>

These attributes are not supported specially; they are handled
like any other attribute.
<p>

Note that the utility function
<a href="Pxp_document.html#VALstrip_whitespace"><code class="code"><span class="constructor">Pxp_document</span>.strip_whitespace</code></a> respects <code class="code">xml:space</code>.
<br>
</body></html>