<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> <head> <link rel="stylesheet" href="style.css" type="text/css"> <meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type"> <link rel="Start" href="index.html"> <link rel="previous" href="Pxp_reader.html"> <link rel="next" href="Intro_extensions.html"> <link rel="Up" href="index.html"> <link title="Index of types" rel=Appendix href="index_types.html"> <link title="Index of exceptions" rel=Appendix href="index_exceptions.html"> <link title="Index of values" rel=Appendix href="index_values.html"> <link title="Index of class methods" rel=Appendix href="index_methods.html"> <link title="Index of classes" rel=Appendix href="index_classes.html"> <link title="Index of class types" rel=Appendix href="index_class_types.html"> <link title="Index of modules" rel=Appendix href="index_modules.html"> <link title="Index of module types" rel=Appendix href="index_module_types.html"> <link title="Pxp_types" rel="Chapter" href="Pxp_types.html"> <link title="Pxp_document" rel="Chapter" href="Pxp_document.html"> <link title="Pxp_dtd" rel="Chapter" href="Pxp_dtd.html"> <link title="Pxp_tree_parser" rel="Chapter" href="Pxp_tree_parser.html"> <link title="Pxp_core_types" rel="Chapter" href="Pxp_core_types.html"> <link title="Pxp_ev_parser" rel="Chapter" href="Pxp_ev_parser.html"> <link title="Pxp_event" rel="Chapter" href="Pxp_event.html"> <link title="Pxp_dtd_parser" rel="Chapter" href="Pxp_dtd_parser.html"> <link title="Pxp_codewriter" rel="Chapter" href="Pxp_codewriter.html"> <link title="Pxp_marshal" rel="Chapter" href="Pxp_marshal.html"> <link title="Pxp_yacc" rel="Chapter" href="Pxp_yacc.html"> <link title="Pxp_reader" rel="Chapter" href="Pxp_reader.html"> <link title="Intro_trees" rel="Chapter" href="Intro_trees.html"> <link title="Intro_extensions" rel="Chapter" href="Intro_extensions.html"> <link title="Intro_namespaces" rel="Chapter" href="Intro_namespaces.html"> <link title="Intro_events" 
rel="Chapter" href="Intro_events.html"> <link title="Intro_resolution" rel="Chapter" href="Intro_resolution.html"> <link title="Intro_getting_started" rel="Chapter" href="Intro_getting_started.html"> <link title="Intro_advanced" rel="Chapter" href="Intro_advanced.html"> <link title="Intro_preprocessor" rel="Chapter" href="Intro_preprocessor.html"> <link title="Example_readme" rel="Chapter" href="Example_readme.html"><link title="The structure of document trees" rel="Section" href="#1_Thestructureofdocumenttrees"> <link title="Access methods" rel="Subsection" href="#access"> <link title="Mutation methods" rel="Subsection" href="#2_Mutationmethods"> <link title="Links between nodes" rel="Subsection" href="#2_Linksbetweennodes"> <link title="Optional features of document trees" rel="Subsection" href="#2_Optionalfeaturesofdocumenttrees"> <link title="Optional features of nodes" rel="Subsection" href="#2_Optionalfeaturesofnodes"> <link title="Creating nodes and trees" rel="Subsection" href="#2_Creatingnodesandtrees"> <link title="Extended nodes" rel="Subsection" href="#2_Extendednodes"> <link title="Namespaces" rel="Subsection" href="#2_Namespaces"> <link title="Details of the mapping from XML text to the tree representation" rel="Subsection" href="#2_DetailsofthemappingfromXMLtexttothetreerepresentation"> <title>PXP Reference : Intro_trees</title> </head> <body> <div class="navbar"><a href="Pxp_reader.html">Previous</a> <a href="index.html">Up</a> <a href="Intro_extensions.html">Next</a> </div> <center><h1>Intro_trees</h1></center> <br> <br> <b>Editorial note:</b> The PXP tree API has some complexity. The definition in <a href="Pxp_document.html"><code class="code"><span class="constructor">Pxp_document</span></code></a> is hard to read without some background information. This introduction is supposed to provide this information. <p> Also note that there is also a stream representation of XML. 
See <a href="Intro_events.html"><code class="code"><span class="constructor">Intro_events</span></code></a> for more about this. <p> <a name="1_Thestructureofdocumenttrees"></a> <h1>The structure of document trees</h1> <p> In a document parsed with the default parser settings every node represents either an element or a character data section. Two classes implement these two kinds of nodes: <a href="Pxp_document.element_impl.html"><code class="code"><span class="constructor">Pxp_document</span>.element_impl</code></a>, and <a href="Pxp_document.data_impl.html"><code class="code"><span class="constructor">Pxp_document</span>.data_impl</code></a>. Some configurations allow more node types to be created, in particular processing instruction nodes, comment nodes, and super root nodes; these are discussed later. Note that you can always add these extra node types to the node tree yourself, no matter what the parser configuration specifies. <p> The following figure shows an example of how a tree is constructed from element and data nodes. The circular areas represent element nodes whereas the ovals denote data nodes. Only elements may have subnodes; data nodes are always leaves of the tree. The subnodes of an element can be either element or data nodes; in both cases the OCaml objects storing the nodes have the class type <a href="Pxp_document.node.html"><code class="code"><span class="constructor">Pxp_document</span>.node</code></a>. <p> <div class="picture"><div class="picture-caption">A tree with element nodes, data nodes, and attributes</div><img src="../pic/node_term.gif"></div> <p> Attributes (the clouds in the picture) do not appear as nodes of the tree, and one must use special access methods to get them. 
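<p> The structure described above can be explored with a short recursive function. The following is only a sketch: the helper names <code class="code">string_of_att</code> and <code class="code">outline</code> are made up for this illustration, and the code assumes the constructors <code class="code">Value</code>, <code class="code">Valuelist</code>, and <code class="code">Implied_value</code> of <code class="code">Pxp_types.att_value</code>. It descends only into elements, treats data nodes as leaves, and reads attributes through the <code class="code">attributes</code> method rather than looking for them among the subnodes. <p> <pre><code class="code">open Pxp_types
open Pxp_document

(* Hypothetical helper: render an attribute value as a string. *)
let string_of_att = function
  | Value s -&gt; s
  | Valuelist l -&gt; String.concat " " l
  | Implied_value -&gt; "(implied)"

(* Hypothetical helper: return the lines of an indented outline of the
   tree below [n]. Elements are listed with their attributes, data nodes
   with their quoted text; any other node type is skipped. *)
let rec outline indent (n : 'ext node) =
  match n # node_type with
  | T_element name -&gt;
      let atts =
        List.map
          (fun (k, v) -&gt; " " ^ k ^ "=" ^ string_of_att v)
          (n # attributes) in
      (indent ^ name ^ String.concat "" atts)
      :: List.concat (List.map (outline (indent ^ "  ")) (n # sub_nodes))
  | T_data -&gt;
      [ indent ^ "\"" ^ n # data ^ "\"" ]
  | _ -&gt;
      []
</code></pre> <p> Applied to the root node of a parsed document, <code class="code">String.concat "\n" (outline "" root)</code> yields one line per element or data node, indented by depth.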
<p> You would get such a tree by parsing with <p> <pre></pre><code class="code"> let config = Pxp_types.default_config<br> let source = Pxp_types.from_string "&lt;a att=\"apple\"&gt;&lt;b&gt;&lt;a att=\"orange\"&gt;An orange&lt;/a&gt;Cherries&lt;/b&gt;&lt;c/&gt;&lt;/a&gt;"<br> let spec = Pxp_tree_parser.default_spec<br> let doc = Pxp_tree_parser.parse_document_entity config source spec<br> let root = doc#root<br> </code><pre></pre> <p> The <code class="code">config</code> record sets many parsing options. A number of these options are explained below. The <code class="code">source</code> argument specifies where the text to parse comes from. The <code class="code">spec</code> parameter is explained below. <p> The parser returns <code class="code">doc</code>, which is a <a href="Pxp_document.document.html"><code class="code"><span class="constructor">Pxp_document</span>.document</code></a>. You have to call its <code class="code">root</code> method to get the root of the tree. Note that there are other parsing functions that return nodes directly; these are intended for parsing XML fragments. For the usual closed XML documents, use a function that returns a document. <p> The <code class="code">root</code> is, like the other nodes of the tree, an object of the <a href="Pxp_document.node.html"><code class="code"><span class="constructor">Pxp_document</span>.node</code></a> class type. <p> What about other things that can occur in XML text? As mentioned, by default only elements and data nodes appear in the tree, but it is possible to enable more node types by setting appropriate <a href="Pxp_types.html#TYPEconfig"><code class="code"><span class="constructor">Pxp_types</span>.config</code></a> options: <p> <ul> <li>Comments are ignored by default. By setting the config option <code class="code">enable_comment_nodes</code>, however, comments are added to the tree. There is a special node type for comments.</li> <li>Processing instructions (denoted by <code class="code">&lt;? ... 
<span class="keywordsign">?></span></code> delimiters) are not ignored, but normally no nodes are created for them. The instructions are only gathered up and attached to the surrounding node, so one can check for their presence but not for their exact location. By setting the config option <code class="code">enable_pinstr_nodes</code>, however, processing instructions are added to the tree as normal nodes. There is also a special node type for them.</li> <li>Usually, the topmost element is the root of the tree. The XML syntax, however, allows the topmost element to be surrounded by comments and processing instructions. To represent this exactly, it is possible to put an artificial root node at the top of the tree, so that the topmost element is one of its children, and the surrounding material appears as the other children. This mode is enabled by setting <code class="code">enable_super_root_node</code>. The node is called the super root node, and is also a special type of node.</li> <li>It is also possible to get attributes and even namespaces as node objects, but they are never put into the regular tree. To get these very special nodes, one has to use special access methods.</li> <li>CDATA sections (like <code class="code">&lt;![<span class="constructor">CDATA</span>[some text]]&gt;</code>) are simply added to the surrounding data node, so they do not appear as nodes of their own.</li> <li>Entity references (like <code class="code"><span class="keywordsign">&amp;</span>amp;</code>) are automatically resolved, and the replacement text is added to the surrounding node.</li> </ul> The parser collapses as much data material into one data node as possible, so that there are normally never two adjacent data nodes. This invariant is enforced even if data material is included by entity references or CDATA sections, or if a data sequence is interrupted by comments. 
For instance, <p> <pre></pre><code class="code"> a <span class="keywordsign">&amp;</span>amp; b &lt;!-- comment --&gt; c &lt;![<span class="constructor">CDATA</span>[&lt;&gt; d]]&gt; </code><pre></pre> <p> is represented by a single data node (in the default case where no comment nodes are created). Of course, you can manually create document trees that break this invariant; it only describes the way the parser forms the tree. <p> All types of nodes are represented by the same OCaml objects of class type <a href="Pxp_document.node.html"><code class="code"><span class="constructor">Pxp_document</span>.node</code></a>. The method <a href="Pxp_document.node.html#METHODnode_type"><code class="code"><span class="constructor">Pxp_document</span>.node.node_type</code></a> indicates which type of node the object is. See the type <a href="Pxp_document.html#TYPEnode_type"><code class="code"><span class="constructor">Pxp_document</span>.node_type</code></a> for details of how the mentioned node types are reflected by this method. For instance, for elements this method returns <code class="code"><span class="constructor">T_element</span> n</code> where <code class="code">n</code> is the name of the element. <p> Formally, this means that all access methods are implemented for all node types. For example, you can call the <a href="Pxp_document.node.html#METHODattributes"><code class="code"><span class="constructor">Pxp_document</span>.node.attributes</code></a> method on a data node, although this does not make sense. This problem is resolved on a case-by-case basis by either returning an "empty value" or by raising an appropriate exception (e.g. <a href="Pxp_types.html#EXCEPTIONMethod_not_applicable"><code class="code"><span class="constructor">Pxp_types</span>.<span class="constructor">Method_not_applicable</span></code></a>). With the chosen typing it is not possible to define slimmer class types that better fit the various node types. 
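<p> Because every method exists on every node, code usually inspects <code class="code">node_type</code> first and only then calls the methods that are meaningful for that kind of node. The following sketch illustrates this pattern; the function name <code class="code">collect_text</code> is hypothetical and not part of PXP. <p> <pre><code class="code">open Pxp_document

(* Hypothetical helper: concatenate the text of all data nodes below [n].
   The node type is checked first, so no method is called on a node
   for which it would not be applicable. *)
let rec collect_text (n : 'ext node) =
  match n # node_type with
  | T_data -&gt;
      n # data
  | T_element _ | T_super_root -&gt;
      String.concat "" (List.map collect_text (n # sub_nodes))
  | _ -&gt;
      ""   (* comments, processing instructions, and other node types *)
</code></pre> <p> The final wildcard case keeps the function total even when the parser has been configured to create additional node types.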
<p> Attributes are usually represented as pairs <code class="code">string * att_value</code> of names and values. Here, <a href="Pxp_types.html#TYPEatt_value"><code class="code"><span class="constructor">Pxp_types</span>.att_value</code></a> is a conventional variant type. There are lots of access methods for attributes, see below. It is optionally possible to wrap the attributes as nodes (method <a href="Pxp_document.node.html#METHODattributes_as_nodes"><code class="code"><span class="constructor">Pxp_document</span>.node.attributes_as_nodes</code></a>), but even in this case the attributes are outside the regular document tree. <p> Normally, the processing instructions are also not included into the document tree. They are considered as an extra property of the element to which they are attached, and can be retrieved by the <a href="Pxp_document.node.html#METHODpinstr"><code class="code"><span class="constructor">Pxp_document</span>.node.pinstr</code></a> method of the element node. If this way of handling processing instructions is not exact enough, the parser can optionally create processing instruction nodes that are regular members of the document tree. <p> <a name="access"></a> <h2>Access methods</h2> <p> An overview over some relevant access methods: <p> <ul> <li><b>General:</b> <ul> <li><code class="code">dtd</code> (<a href="Pxp_document.node.html#METHODdtd"><code class="code"><span class="constructor">Pxp_document</span>.node.dtd</code></a>): returns the DTD object. 
All nodes have such an object, even in well-formed mode.</li> <li><code class="code">encoding</code> (<a href="Pxp_document.node.html#METHODencoding"><code class="code"><span class="constructor">Pxp_document</span>.node.encoding</code></a>): returns the encoding of the in-memory document representation.</li> </ul> </li> <li><b>Navigation:</b> <ul> <li><code class="code">parent</code> (<a href="Pxp_document.node.html#METHODparent"><code class="code"><span class="constructor">Pxp_document</span>.node.parent</code></a>): returns the parent object of the node it is called on</li> <li><code class="code">root</code> (<a href="Pxp_document.node.html#METHODroot"><code class="code"><span class="constructor">Pxp_document</span>.node.root</code></a>): returns the root of the tree the node is member of</li> <li><code class="code">sub_nodes</code> (<a href="Pxp_document.node.html#METHODsub_nodes"><code class="code"><span class="constructor">Pxp_document</span>.node.sub_nodes</code></a>): returns the children of the node </li> <li><code class="code">previous_node</code> (<a href="Pxp_document.node.html#METHODprevious_node"><code class="code"><span class="constructor">Pxp_document</span>.node.previous_node</code></a>): returns the left sibling</li> <li><code class="code">next_node</code> (<a href="Pxp_document.node.html#METHODnext_node"><code class="code"><span class="constructor">Pxp_document</span>.node.next_node</code></a>): returns the right sibling</li> </ul> </li> <li><b>Information:</b> <ul> <li><code class="code">node_position</code> (<a href="Pxp_document.node.html#METHODnode_position"><code class="code"><span class="constructor">Pxp_document</span>.node.node_position</code></a>): returns the ordinal position of this node as child of the parent</li> <li><code class="code">node_path</code> (<a href="Pxp_document.node.html#METHODnode_path"><code class="code"><span class="constructor">Pxp_document</span>.node.node_path</code></a>): returns the positional path of this node 
in the whole tree</li> <li><code class="code">node_type</code> (<a href="Pxp_document.node.html#METHODnode_type"><code class="code"><span class="constructor">Pxp_document</span>.node.node_type</code></a>): returns the type of the node</li> <li><code class="code">position</code> (<a href="Pxp_document.node.html#METHODposition"><code class="code"><span class="constructor">Pxp_document</span>.node.position</code></a>): returns the position of the node in the parsed XML text</li> </ul> </li> <li><b>Content:</b> <ul> <li><code class="code">data</code> (<a href="Pxp_document.node.html#METHODdata"><code class="code"><span class="constructor">Pxp_document</span>.node.data</code></a>): returns the data contents of data nodes</li> <li><code class="code">attributes</code> (<a href="Pxp_document.node.html#METHODattributes"><code class="code"><span class="constructor">Pxp_document</span>.node.attributes</code></a>): returns the attributes of elements</li> <li><code class="code">attributes_as_nodes</code> (<a href="Pxp_document.node.html#METHODattributes_as_nodes"><code class="code"><span class="constructor">Pxp_document</span>.node.attributes_as_nodes</code></a>): also returns the attributes, but represented as a list of nodes residing outside the tree </li> <li><code class="code">comment</code> (<a href="Pxp_document.node.html#METHODcomment"><code class="code"><span class="constructor">Pxp_document</span>.node.comment</code></a>): returns the text of the XML comment </li> </ul> </li> <li><b>Validation:</b> <ul> <li><code class="code">validate</code> (<a href="Pxp_document.node.html#METHODvalidate"><code class="code"><span class="constructor">Pxp_document</span>.node.validate</code></a>): validates the element locally </li> </ul> </li> <li><b>Namespace:</b> (Only if namespaces are enabled, and the namespace-aware node implementation is used) <ul> <li><code class="code">localname</code> (<a href="Pxp_document.node.html#METHODlocalname"><code class="code"><span 
class="constructor">Pxp_document</span>.node.localname</code></a>): returns the local name of the element in the namespace </li> <li><code class="code">namespace_uri</code> (<a href="Pxp_document.node.html#METHODnamespace_uri"><code class="code"><span class="constructor">Pxp_document</span>.node.namespace_uri</code></a>): returns the namespace URI of the node </li> <li><code class="code">namespace_scope</code> (<a href="Pxp_document.node.html#METHODnamespace_scope"><code class="code"><span class="constructor">Pxp_document</span>.node.namespace_scope</code></a>): returns the scope object with more namespace query methods </li> <li><code class="code">namespaces_as_nodes</code> (<a href="Pxp_document.node.html#METHODnamespaces_as_nodes"><code class="code"><span class="constructor">Pxp_document</span>.node.namespaces_as_nodes</code></a>): returns the namespaces this node is member of, and the namespaces are represented as list of nodes</li> </ul> </li> </ul> <p> <a name="2_Mutationmethods"></a> <h2>Mutation methods</h2> <p> Trees are mutable, and nodes are mutable. Note that the tree is not automatically (re-)validated when it is changed. You have to explicitly call validation methods, or the <a href="Pxp_document.html#VALvalidate"><code class="code"><span class="constructor">Pxp_document</span>.validate</code></a> function for the whole tree. 
<p> <ul> <li><b>Building trees, changing the structure of trees:</b> <ul> <li><code class="code">append_node</code> (<a href="Pxp_document.node.html#METHODappend_node"><code class="code"><span class="constructor">Pxp_document</span>.node.append_node</code></a>): appends a node as new child to this node </li> <li><code class="code">remove</code> (<a href="Pxp_document.node.html#METHODremove"><code class="code"><span class="constructor">Pxp_document</span>.node.remove</code></a>): removes this node from the tree </li> </ul> </li> <li><b>Changing the content of nodes:</b> <ul> <li><code class="code">set_data</code> (<a href="Pxp_document.node.html#METHODset_data"><code class="code"><span class="constructor">Pxp_document</span>.node.set_data</code></a>): changes the contents of data nodes </li> <li><code class="code">set_attribute</code> (<a href="Pxp_document.node.html#METHODset_attribute"><code class="code"><span class="constructor">Pxp_document</span>.node.set_attribute</code></a>): adds or changes an attribute </li> <li><code class="code">set_comment</code> (<a href="Pxp_document.node.html#METHODset_comment"><code class="code"><span class="constructor">Pxp_document</span>.node.set_comment</code></a>): changes the contents of comment nodes </li> </ul> </li> <li><b>Creating nodes:</b> <ul> <li><code class="code">create_element</code> (<a href="Pxp_document.node.html#METHODcreate_element"><code class="code"><span class="constructor">Pxp_document</span>.node.create_element</code></a>): called on an element node, this method creates a new tree only consisting of an element, and the only node of the tree is an object of the same class as this node</li> <li><code class="code">create_data</code> (<a href="Pxp_document.node.html#METHODcreate_data"><code class="code"><span class="constructor">Pxp_document</span>.node.create_data</code></a>): same for data nodes </li> <li><code class="code">orphaned_clone</code> (<a href="Pxp_document.node.html#METHODorphaned_clone"><code 
class="code"><span class="constructor">Pxp_document</span>.node.orphaned_clone</code></a>): creates a copy of the subtree starting at this node </li> </ul> </li> </ul> <p> <a name="2_Linksbetweennodes"></a> <h2>Links between nodes</h2> <p> The node tree has links in both directions: Every node has a link to its parent (if any), and it has links to the subnodes (see the following picture). Obviously, this doubly-linked structure simplifies the navigation in the tree; but has also some consequences for the possible operations on trees. <p> <div class="picture"><div class="picture-caption">Nodes are doubly linked trees</div><img src="../pic/node_general.gif"></div> <p> (Definitions: <a href="Pxp_document.node.html#METHODparent"><code class="code"><span class="constructor">Pxp_document</span>.node.parent</code></a>, <a href="Pxp_document.node.html#METHODsub_nodes"><code class="code"><span class="constructor">Pxp_document</span>.node.sub_nodes</code></a>.) <p> Because every node must have at most <b>one</b> parent node, operations are illegal if they violate this condition. The following figure shows on the left side that node <code class="code">y</code> is added to <code class="code">x</code> as new subnode which is allowed because <code class="code">y</code> does not have a parent yet. The right side of the picture illustrates what would happen if <code class="code">y</code> had a parent node; this is illegal because <code class="code">y</code> would have two parents after the operation. <p> <div class="picture"><div class="picture-caption">A node can only be added if it is a root</div><img src="../pic/node_add.gif"></div> <p> (Definition: <a href="Pxp_document.node.html#METHODappend_node"><code class="code"><span class="constructor">Pxp_document</span>.node.append_node</code></a>.) <p> The <code class="code">remove</code> operation simply removes the links between two nodes. 
In the following picture the node <code class="code">x</code> is deleted from the list of subnodes of <code class="code">y</code>. After that, <code class="code">x</code> becomes the root of the subtree starting at this node. <p> <div class="picture"><div class="picture-caption">A removed node becomes the root of the subtree</div><img src="../pic/node_delete.gif"></div> <p> (Definition: <a href="Pxp_document.node.html#METHODremove"><code class="code"><span class="constructor">Pxp_document</span>.node.remove</code></a>.) <p> It is also possible to make a clone of a subtree; illustrated in the next picture. In this case, the clone is a copy of the original subtree except that it is no longer a subnode. Because cloning never keeps the connection to the parent, the clones are called <b>orphaned</b>. <p> <div class="picture"><div class="picture-caption">The clone of a subtree</div><img src="../pic/node_clone.gif"></div> <p> (Definition: <a href="Pxp_document.node.html#METHODorphaned_clone"><code class="code"><span class="constructor">Pxp_document</span>.node.orphaned_clone</code></a>.) <p> <a name="2_Optionalfeaturesofdocumenttrees"></a> <h2>Optional features of document trees</h2> <p> As already pointed out, the parser does only create element and data nodes by default. The configuration of the parser can be controlled by the <a href="Pxp_types.html#TYPEconfig"><code class="code"><span class="constructor">Pxp_types</span>.config</code></a> record. There are a number of optional features that change the way the document trees are constructed by the parser: <p> Note that the parser configuration only controls the parser. If you create trees of your own, you can simply add all the additional node types to the tree without needing to enable these features. <p> <ul> <li>When <code class="code">enable_super_root_node</code> is set, the extra super root node is generated at the top of the tree. 
This node has type <code class="code"><span class="constructor">T_super_root</span></code>.</li> <li>The option <code class="code">enable_comment_nodes</code> lets the parser add comment nodes when it parses comments. These nodes have type <code class="code"><span class="constructor">T_comment</span></code>.</li> <li>The option <code class="code">enable_pinstr_nodes</code> changes the way processing instructions are added to the document. Instead of attaching such instructions to their containing elements as additional properties, this mode forces the parser to create real nodes of type <code class="code"><span class="constructor">T_pinstr</span></code> for them.</li> <li>The option <code class="code">drop_ignorable_whitespace</code> (enabled by default) can be turned off. It controls whether the parser skips over so-called ignorable whitespace. The XML standard allows that elements contain whitespace characters even if they are declared in the DTD not to contain character data. Because of this, the parser considers such whitespace as ignorable detail of the XML instance, and drops the characters silently. You can change this by setting <code class="code">drop_ignorable_whitespace</code> to <code class="code"><span class="keyword">false</span></code>. In this case, every character of the XML instance will be accepted by the parser and will be added to a data node of the document tree.</li> <li>By default, the parser creates elements with an annotation about the location in the XML source file. You can query this location by calling the method <code class="code">position</code>. As this requires a lot of memory, it is possible to turn this off by setting <code class="code">store_element_positions</code> to <code class="code"><span class="keyword">false</span></code>.</li> </ul> <p> There are a number of further configuration options; however, these options do not change the structure of the document tree. 
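<p> As an illustration of the options listed above, a configuration can be derived from the default one with OCaml's record update syntax. This is a sketch under assumptions: it uses <code class="code">Pxp_tree_parser.parse_wfdocument_entity</code> (the well-formedness-only counterpart of the parsing function shown earlier) so that no DTD is needed, and the names <code class="code">full_config</code>, <code class="code">doc</code>, and <code class="code">top</code> are chosen only for this example. <p> <pre><code class="code">open Pxp_types

(* Enable all optional node types, keep ignorable whitespace,
   and do not store element positions. *)
let full_config =
  { default_config with
      enable_super_root_node = true;
      enable_comment_nodes = true;
      enable_pinstr_nodes = true;
      drop_ignorable_whitespace = false;
      store_element_positions = false;
  }

let doc =
  Pxp_tree_parser.parse_wfdocument_entity
    full_config
    (from_string "&lt;!-- intro --&gt;&lt;a&gt;&lt;?pi data?&gt;&lt;/a&gt;")
    Pxp_tree_parser.default_spec

(* With enable_super_root_node set, the root of the tree is the
   super root node, and the topmost element is one of its children. *)
let top = doc # root
</code></pre> <p> Here <code class="code">top # node_type</code> is expected to be <code class="code"><span class="constructor">T_super_root</span></code>, with the element <code class="code">a</code> among <code class="code">top # sub_nodes</code>.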
<p> <a name="2_Optionalfeaturesofnodes"></a> <h2>Optional features of nodes</h2> <p> The following features exist per node, and are simply invoked by using the methods dealing with them. <p> <ul> <li>Attribute nodes: These are useful if you want to have data structures that contain attributes together with other types of nodes. The method <code class="code">attributes_as_nodes</code> returns the attributes wrapped into node objects. Note that these nodes are read-only.</li> <li>Validation: The document nodes contain the routines validating the document body. Of course, the validation checks depend on what is stored in the DTD object. (There is always a DTD object - even in well-formedness mode, only that it is mostly empty then, and validation is a no-op.) <p> The DTD object contains the declarations of elements, attribute lists, entities, and notations. Furthermore, the DTD knows whether the document is flagged as "standalone". As a PXP extension to classic XML processing, the DTD may specify a mixed mode between "validating mode" and "well-formedness mode". It is possible to allow non-declared elements in the document, but to check declared elements against their declaration at the same time. Moreover, there is a similar feature for attribute lists; you can allow non-declared attributes and check declared attributes. (Well, the whole truth is that the parser always works in this mix mode, and that the "validating mode" and the "well-formedness mode" are only the extremes of the mix mode.)</li> </ul> <p> <a name="2_Creatingnodesandtrees"></a> <h2>Creating nodes and trees</h2> <p> Often, the parser creates the trees, but on occasion it is useful to create trees manually. We explain here only the basic mechanism. There is a nice camlp4 syntax extension called pxp-pp (XXX: LINK) allowing for a much better notation in programs. 
<p> The most basic way of creating new nodes are the <code class="code">create_element</code>, <code class="code">create_data</code>, and <code class="code">create_other</code> methods of nodes. It is not recommended to use them directly, however, as they are very primitive. <p> In the <a href="Pxp_document.html"><code class="code"><span class="constructor">Pxp_document</span></code></a> module there are a number of functions creating individual nodes (without children), the node constructors: <p> <ul> <li><a href="Pxp_document.html#VALcreate_element_node"><code class="code"><span class="constructor">Pxp_document</span>.create_element_node</code></a>: creates an element node</li> <li><a href="Pxp_document.html#VALcreate_data_node"><code class="code"><span class="constructor">Pxp_document</span>.create_data_node</code></a>: creates a data node</li> <li><a href="Pxp_document.html#VALcreate_comment_node"><code class="code"><span class="constructor">Pxp_document</span>.create_comment_node</code></a>: creates a comment node</li> <li><a href="Pxp_document.html#VALcreate_pinstr_node"><code class="code"><span class="constructor">Pxp_document</span>.create_pinstr_node</code></a>: creates a processing instruction node</li> <li><a href="Pxp_document.html#VALcreate_super_root_node"><code class="code"><span class="constructor">Pxp_document</span>.create_super_root_node</code></a>: creates a super root node</li> </ul> There are no functions to create attribute and namespace nodes - these are always created automatically by their containing nodes, so the user does not need to do anything for creating them. <p> The node constructors must be equipped with all required data to create the requested type of node. This includes the data that would have been available in the textual XML representation, and some of the meta data passed to the parsers, and meta data implicitly generated by the parsers. For an element, this is at minimum: <p> <ul> <li>The name of the element (e.g. 
the "foo" in <code class="code">&lt;foo&gt;</code>)</li> <li>The attributes of the element</li> <li>The DTD object to use (a rudimentary DTD object is required even if only well-formedness checks will be applied but no validation)</li> <li>The specification of which classes are instantiated to create the nodes</li> </ul> For the latter two, see below. Optionally one can provide: <p> <ul> <li>The position of the element in the XML text</li> <li>Whether the attribute list is to be validated at creation time</li> <li>Whether name pools are to be used for the attributes</li> </ul> Regarding validation, the default is to validate local data such as attributes, but to omit any checks of the position the node has in the tree. Right after creation the tree consists of only this one node, so non-local checks do not make sense. <p> After some nodes have been created, they can be combined into more complex trees by mutation methods (e.g. <a href="Pxp_document.node.html#METHODappend_node"><code class="code"><span class="constructor">Pxp_document</span>.node.append_node</code></a>). <p> As mentioned, a node must always be connected with a DTD object, even if no validation checks will be done. It is possible to create DTD objects that do not impose restrictions on the document: <p> <pre></pre><code class="code"> <br> <span class="keyword">let</span> dtd = <span class="constructor">Pxp_dtd_parser</span>.create_empty_dtd config<br> <span class="keyword">let</span> () = dtd <span class="keywordsign">#</span> allow_arbitrary<br> </code><pre></pre> <p> Even such a DTD object can contain entity definitions, and can demand a certain way of dealing with namespaces. Also, the character encoding of the nodes is taken from the DTD. See <a href="Pxp_dtd.dtd.html"><code class="code"><span class="constructor">Pxp_dtd</span>.dtd</code></a> for DTD methods, and <a href="Pxp_dtd_parser.html"><code class="code"><span class="constructor">Pxp_dtd_parser</span></code></a> for convenient ways to create DTD objects. 
Note that all nodes of a tree must be connected to the same DTD object. <p> PXP is not restricted to using built-in classes for nodes. When the parser is invoked and builds a tree, it consults a so-called <b>document model specification</b> to find out how the new objects have to be created (type <a href="Pxp_document.html#TYPEspec"><code class="code"><span class="constructor">Pxp_document</span>.spec</code></a>). Basically, it is a list of sample objects to use (which are called <b>exemplars</b>), and these objects are cloned when a node is actually created. <p> When calling the node constructors directly (bypassing the parser), the document model specification must also be passed to them as an argument. It is used in the same way as the parser uses it. <p> To get the built-in classes without any modification, just use <a href="Pxp_tree_parser.html#VALdefault_spec"><code class="code"><span class="constructor">Pxp_tree_parser</span>.default_spec</code></a>. For the variant with namespaces enabled, prefer <a href="Pxp_tree_parser.html#VALdefault_namespace_spec"><code class="code"><span class="constructor">Pxp_tree_parser</span>.default_namespace_spec</code></a>. <p> <a name="2_Extendednodes"></a> <h2>Extended nodes</h2> <p> Every node in a tree has a so-called extension. By default, the extension is practically empty and only present for formal uniformity. However, one can also define custom extension classes, and effectively add new methods to the node classes. <p> Node extensions are explained in detail here: <a href="Intro_extensions.html"><code class="code"><span class="constructor">Intro_extensions</span></code></a> <p> <a name="2_Namespaces"></a> <h2>Namespaces</h2> <p> As an option, PXP processes namespace declarations in XML text. See this separate introduction for details: <a href="Intro_namespaces.html"><code class="code"><span class="constructor">Intro_namespaces</span></code></a>. 
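<p> Returning to the manual construction described under "Creating nodes and trees", the following sketch builds a two-node tree without the parser. It assumes that <code class="code">Pxp_document.create_element_node</code> takes the specification, the DTD, the element name, and a list of attribute name/value string pairs, and that <code class="code">Pxp_document.create_data_node</code> takes the specification, the DTD, and the text; check the signatures in <a href="Pxp_document.html"><code class="code"><span class="constructor">Pxp_document</span></code></a> before relying on this. The value names are chosen only for this example. <p> <pre><code class="code">let config = Pxp_types.default_config
let spec = Pxp_tree_parser.default_spec

(* A DTD object that imposes no restrictions on the document. *)
let dtd = Pxp_dtd_parser.create_empty_dtd config
let () = dtd # allow_arbitrary

(* Two singleton trees, both connected to the same DTD object. *)
let root = Pxp_document.create_element_node spec dtd "a" [ "att", "apple" ]
let text = Pxp_document.create_data_node spec dtd "Cherries"

(* [text] has no parent yet, so it may be appended to [root]. *)
let () = root # append_node text
</code></pre> <p> Both constructors receive the same <code class="code">spec</code> and <code class="code">dtd</code>, which keeps all nodes of the resulting tree connected to one DTD object.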
<p> <a name="2_DetailsofthemappingfromXMLtexttothetreerepresentation"></a> <h2>Details of the mapping from XML text to the tree representation</h2> <p> If an element declaration does not allow the element to contain character data, the following rules apply. <p> If the element must be empty, i.e. it is declared with the keyword <code class="code"><span class="constructor">EMPTY</span></code>, the element instance must be effectively empty (it must not even contain whitespace characters). The parser guarantees that a declared <code class="code"><span class="constructor">EMPTY</span></code> element never contains a data node, even if the data node represents the empty string. <p> If the element declaration only permits other elements to occur within that element but not character data, it is still possible to insert whitespace characters between the subelements. The parser ignores these characters, too, and does not create data nodes for them. <p> <b>Example.</b> Consider the following element types: <p> <pre></pre><code class="code">&lt;!<span class="constructor">ELEMENT</span> x ( <span class="keywordsign">#</span><span class="constructor">PCDATA</span> <span class="keywordsign">|</span> z )* &gt;<br> &lt;!<span class="constructor">ELEMENT</span> y ( z )* &gt;<br> &lt;!<span class="constructor">ELEMENT</span> z <span class="constructor">EMPTY</span>&gt;<br> </code><pre></pre> <p> Only <code class="code">x</code> may contain character data; the keyword <code class="code"><span class="keywordsign">#</span><span class="constructor">PCDATA</span></code> indicates this. The other types are character-free. <p> The XML term <p> <pre></pre><code class="code">&lt;x&gt;&lt;z/&gt; &lt;z/&gt;&lt;/x&gt;<br> </code><pre></pre> <p> will be internally represented by an element node for <code class="code">x</code> with three subnodes: the first <code class="code">z</code> element, a data node containing the space character, and the second <code class="code">z</code> element. 
In contrast to this, the term <p> <pre></pre><code class="code">&lt;y&gt;&lt;z/&gt; &lt;z/&gt;&lt;/y&gt;<br> </code><pre></pre> <p> is represented by an element node for <code class="code">y</code> with only <b>two</b> subnodes, the two <code class="code">z</code> elements. There is no data node for the space character because spaces are ignored in the character-free element <code class="code">y</code>. <p> <b>Parser option:</b> By setting the parser option <code class="code">drop_ignorable_whitespace</code> to <code class="code"><span class="keyword">false</span></code>, the behaviour of the parser is changed such that even ignorable whitespace characters are represented by data nodes. <p> <a name="3_Therepresentationofcharacterdata"></a> <h3>The representation of character data</h3> <p> The XML specification allows all Unicode characters in XML texts. This parser can be configured such that UTF-8 is used to represent the characters internally; however, the default character encoding is ISO-8859-1. (Currently, no other encodings are possible for the internal string representation; the type <a href="Pxp_types.html#TYPErep_encoding"><code class="code"><span class="constructor">Pxp_types</span>.rep_encoding</code></a> enumerates the possible encodings. In principle, the parser could use any encoding that is ASCII-compatible, but there are currently only lexical analyzers for UTF-8 and ISO-8859-1. It is currently impossible to use UTF-16 or UCS-4 as internal encodings (or other multibyte encodings which are not ASCII-compatible) unless major parts of the parser are rewritten, which is unlikely.) <p> The internal encoding may be different from the external encoding (specified in the XML declaration <code class="code">&lt;?xml ... encoding=<span class="string">"..."</span><span class="keywordsign">?&gt;</span></code>); in this case the strings are automatically converted to the internal encoding. 
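<p> To make the conversion step more concrete, the following is a minimal sketch in plain OCaml of what re-encoding an ISO-8859-1 string as UTF-8 involves at the byte level. The function <code class="code">latin1_to_utf8</code> is a hypothetical helper written only for illustration; it is not part of PXP, and it is not meant to show how PXP performs the conversion internally.

```ocaml
(* Hypothetical helper, not part of PXP: re-encode an ISO-8859-1 string
   as UTF-8. Every ISO-8859-1 byte is the Unicode code point of the same
   value, so bytes from 0x80 upwards become a two-byte UTF-8 sequence
   and all other bytes (plain ASCII) are copied unchanged. *)
let latin1_to_utf8 (s : string) : string =
  let buf = Buffer.create (String.length s) in
  String.iter
    (fun c ->
       let code = Char.code c in
       if code >= 0x80 then begin
         (* Lead byte carries the top two bits, continuation byte the low six. *)
         Buffer.add_char buf (Char.chr (0xC0 lor (code lsr 6)));
         Buffer.add_char buf (Char.chr (0x80 lor (code land 0x3F)))
       end else
         Buffer.add_char buf c)
    s;
  Buffer.contents buf
```

This direction never loses information, because every ISO-8859-1 character has a UTF-8 representation.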
<p> If the internal encoding is ISO-8859-1, it is possible that there are characters that cannot be represented. In this case, the parser ignores such characters and prints a warning (to the <code class="code">collect_warning</code> object that must be passed when the parser is called). <p> The XML specification allows lines to be separated by single LF characters, by CR LF character sequences, or by single CR characters. Internally, these separators are always converted to single LF characters. <p> The parser guarantees that there are never two adjacent data nodes; if necessary, data material that would otherwise be represented by several nodes is collapsed into one node. Note that you can still create node trees with adjacent data nodes; however, the parser does not return such trees. <p> Note that CDATA sections are not represented specially; such sections are added to the current data material that is being collected for the next data node. <p> <a name="3_Therepresentationofentitieswithindocuments"></a> <h3>The representation of entities within documents</h3> <p> <b>Entities are not represented within documents!</b> If the parser finds an entity reference in the document content, the reference is immediately expanded, and the parser reads the expansion text instead of the reference. <p> <a name="3_Therepresentationofattributes"></a> <h3>The representation of attributes</h3> <p> As attribute values are composed of Unicode characters, too, the same problems with the character encoding arise as for character material. Attribute values are converted to the internal encoding, too; and if there are characters that cannot be represented, these are dropped, and a warning is printed. <p> Attribute values are normalized before they are returned by methods like <code class="code">attribute</code>. First, any remaining entity references are expanded; if necessary, expansion is performed recursively. 
Second, newline characters (any of LF, CR LF, or CR characters) are converted to single space characters. Note that especially the latter action is prescribed by the XML standard (but <code class="code"><span class="keywordsign">&amp;</span><span class="keywordsign">#</span>10;</code> is not converted such that it is still possible to include line feeds into attributes). <p> <a name="3_Therepresentationofprocessinginstructions"></a> <h3>The representation of processing instructions</h3> <p> Processing instructions are parsed to some extent: The first word of the PI is called the target, and it is stored separately from the rest of the PI: <p> <pre></pre><code class="code">&lt;?target rest<span class="keywordsign">?&gt;</span><br> </code><pre></pre> <p> The exact location where a PI occurs is not represented (by default). The parser attaches the PI to the object that represents the enclosing construct (an element, a DTD, or the whole document); that means you can find out which PIs occur in a certain element, in the DTD, or in the whole document, but you cannot look up the exact position within the construct. <p> <b>Parser option:</b> If you require the exact location of PIs, it is possible to create regular nodes for them instead of attaching them to the surrounding node as a property. This mode is controlled by the option <code class="code">enable_pinstr_nodes</code>. The additional nodes have the node type <code class="code"><span class="constructor">T_pinstr</span> target</code>, and are created from special exemplars contained in the <code class="code">spec</code> (see <a href="Pxp_document.html#TYPEspec"><code class="code"><span class="constructor">Pxp_document</span>.spec</code></a>). <p> <a name="3_Therepresentationofcomments"></a> <h3>The representation of comments</h3> <p> Normally, comments are not represented; they are dropped by default. 
<p> <b>Parser option:</b> However, if you require comments in the document tree, it is possible to create <code class="code"><span class="constructor">T_comment</span></code> nodes for them. This mode can be specified by the option <code class="code">enable_comment_nodes</code>. Comment nodes are created from special exemplars contained in the <code class="code">spec</code> (see <a href="Pxp_document.html#TYPEspec"><code class="code"><span class="constructor">Pxp_document</span>.spec</code></a>). You can access the contents of comments through the method <code class="code">comment</code>. <p> <a name="3_Theattributesxmllangandxmlspace"></a> <h3>The attributes <code class="code">xml:lang</code> and <code class="code">xml:space</code> </h3> <p> These attributes are not supported specially; they are handled like any other attribute. <p> Note that the utility function <a href="Pxp_document.html#VALstrip_whitespace"><code class="code"><span class="constructor">Pxp_document</span>.strip_whitespace</code></a> respects <code class="code">xml:space</code>. <br> </body></html>