<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html> <head profile="http://internetalchemy.org/2003/02/profile"> <link rel="foaf" type="application/rdf+xml" title="FOAF" href="http://www.openlinksw.com/dataspace/uda/about.rdf" /> <link rel="schema.dc" href="http://purl.org/dc/elements/1.1/" /> <meta name="dc.title" content="13. XML Support" /> <meta name="dc.subject" content="13. XML Support" /> <meta name="dc.creator" content="OpenLink Software Documentation Team ; " /> <meta name="dc.copyright" content="OpenLink Software, 1999 - 2009" /> <link rel="top" href="index.html" title="OpenLink Virtuoso Universal Server: Documentation" /> <link rel="search" href="/doc/adv_search.vspx" title="Search OpenLink Virtuoso Universal Server: Documentation" /> <link rel="parent" href="webandxml.html" title="Chapter Contents" /> <link rel="prev" href="xmlservices.html" title="Virtuoso XML Services" /> <link rel="next" href="updategrams.html" title="Using UpdateGrams to Modify Data" /> <link rel="shortcut icon" href="../images/misc/favicon.ico" type="image/x-icon" /> <link rel="stylesheet" type="text/css" href="doc.css" /> <link rel="stylesheet" type="text/css" href="/doc/translation.css" /> <title>13. XML Support</title> <meta http-equiv="Content-Type" content="text/xhtml; charset=UTF-8" /> <meta name="author" content="OpenLink Software Documentation Team ; " /> <meta name="copyright" content="OpenLink Software, 1999 - 2009" /> <meta name="keywords" content="" /> <meta name="GENERATOR" content="OpenLink XSLT Team" /> </head> <body> <div id="header"> <a name="queryingxmldata" /> <img src="../images/misc/logo.jpg" alt="" /> <h1>13. XML Support</h1> </div> <div id="navbartop"> <div> <a class="link" href="webandxml.html">Chapter Contents</a> | <a class="link" href="xmlservices.html" title="Virtuoso XML Services">Prev</a> | <a class="link" href="updategrams.html" title="Using UpdateGrams to Modify Data">Next</a> </div> </div> <div id="currenttoc"> <form method="post" action="/doc/adv_search.vspx"> <div class="search">Keyword Search: <br /> <input type="text" name="q" /> <input type="submit" name="go" value="Go" /> </div> </form> <div> <a href="http://www.openlinksw.com/">www.openlinksw.com</a> </div> <div> <a href="http://docs.openlinksw.com/">docs.openlinksw.com</a> </div> <br /> <div> <a href="index.html">Book Home</a> </div> <br /> <div> <a href="contents.html">Contents</a> </div> <div> <a href="preface.html">Preface</a> </div> <br /> <div class="selected"> <a href="webandxml.html">XML Support</a> </div> <br /> <div> <a href="forxmlforsql.html">Rendering SQL Queries as XML (FOR XML Clause)</a> </div> <div> <a href="composingxmlinsql.html">XML Composing Functions in SQL Statements (SQLX)</a> </div> <div> <a href="xmlservices.html">Virtuoso XML Services</a> </div> <div class="selected"> <a href="queryingxmldata.html">Querying Stored XML Data</a> <div> <a href="#xpathcontainsSQLPred" title="XPATH_CONTAINS SQL Predicate">XPATH_CONTAINS SQL Predicate</a> <a href="#qryusingxpath_eval" title="Using xpath_eval()">Using xpath_eval()</a> <a href="#wxmlextentrefinxml" title="External Entity References in Stored XML">External Entity References in Stored XML</a> <a href="#wamlschmdtdfuncs" title="XML Schema & DTD Functions">XML Schema & DTD Functions</a> <a href="#usingxmlfreetext" title="Using XML and Free Text">Using XML and Free Text</a> <a href="#xcontainspredicate" title="XCONTAINS predicate">XCONTAINS predicate</a> <a href="#textcontainsxpath" title="text-contains XPath Predicate">text-contains XPath Predicate</a> <a href="#xmlfreetextrules" title="XML Free Text Indexing Rules">XML Free Text Indexing Rules</a> <a href="#xmlencoding" title="XML Processing & Free Text Encoding Issues">XML Processing & Free Text Encoding Issues</a> </div> </div> <div> <a href="updategrams.html">Using UpdateGrams to Modify Data</a> </div> <div> <a href="xmltemplates.html">XML Templates</a> </div> <div> <a href="xmlschema.html">XML DTD and XML Schemas</a> </div> <div> <a href="xq.html">XQuery 1.0 Support</a> </div> <div> <a href="xslttrans.html">XSLT Transformation</a> </div> <div> <a href="xmltype.html">XMLType</a> </div> <div> <a href="xmldom.html">Changing XML entities in DOM style</a> </div> <br /> </div> <div id="text"> <a name="queryingxmldata" /> <h2>13.4. Querying Stored XML Data</h2> <a name="xpathcontainsSQLPred" /> <h3>13.4.1. XPATH_CONTAINS SQL Predicate</h3> <p> XPath expressions can be used in SQL statements to decompose and match XML data stored in columns. The <span class="computeroutput">xpath_contains</span> SQL predicate can be used either to test for an XML value matching a path expression or to extract one or more entities from the XML value. These values can then be used later in the query as contexts for other XPath expressions. </p> <div> <pre class="programlisting"> xpath_contains (xml_column, xp_expression[, query_variable]); </pre> </div> <p> The first argument, <span class="computeroutput">xml_column</span> is the name of the column on which to perform the XPath search. The second argument, <span class="computeroutput">xp_expression</span>, takes an XPath expression. </p> <p> The third argument is an optional query variable that gets bound to each result entity value of the xpath expression. If this variable is omitted the xpath_contains predicate will qualify the query by returning true for matches. In this case the result will only return one row per match. If the variable is present, the result set could contain multiple rows per result set row of the base table, one row for each match. </p> <p> Consider the example: </p> <div> <pre class="programlisting"> select xt_file, t from xml_text where xpath_contains (xt_text, '//chapter/title[position () = 1]', t); </pre> </div> <p> This SQL statement will select the first title child of any chapter entities in the XML documents in the xt_text column of the table <span class="computeroutput">xml_text</span>. There can be several matching entities per row of xml_text. The result set will contain a row for each matching entity. </p> <p> In XPath terms the path expression of <span class="computeroutput">xpath_contains</span> is evaluated with the context node set to the root node of the XML tree represented by the value of the column that is the first argument of xpath_contains. This node is the only element of the context node set. </p> <div class="note"> <div class="notetitle">Note:</div> <p> The 't' variable in the above example gets bound to XML entities, not to their string values or other representations. One can thus use these values as context nodes for other expressions. </p> </div> <p> The XPATH expression can have a list of options in the beginning. The list of options is surrounded by square brackets. Options in the list are delimited by spaces. The most popular option is <span class="computeroutput">__quiet</span> that allows to process a set of rows if not all stored documents are valid XMLs; if an error is signalled by the XML parser when it prepares a content document for the XPATH in question and the XPATH contains <span class="computeroutput">__quiet</span> then the error is suppressed and the row is silently ignored as if XPATH found nothing. One can configure the DTD validator of the parser by placing its <a href="xmlschema.html#dtd_config">configuration parameters</a> in the list of XPATH options. </p> <p> The following example is almost identical to the previous one but it works even if not all values of <span class="computeroutput">xt_text</span> are valid XMLs, and the resulting values of the 't' variable are standalone entities even if source documents in xt_text contain external generic entities. </p> <div> <pre class="programlisting"> select xt_file, t from xml_text where xpath_contains (xt_text, '[__quiet BuildStandalone=ENABLE]//chapter/title[position () = 1]', t); </pre> </div> <br /> <a name="qryusingxpath_eval" /> <h3>13.4.2. Using xpath_eval()</h3> <p>The <span class="computeroutput">xpath_eval()</span> function is used to filter out parts of an XML fragment that match a given XPATH expression. It can be used to retrieve multiple-node answers to queries, as it is often the case that more than one node-set matches. Consider the following statements that create a table with XML stored inside.</p> <div> <pre class="programlisting"> CREATE TABLE t_articles ( article_id int NOT NULL, article_title varchar(255) NOT NULL, article_xml long varchar ); insert into t_articles (article_id, article_title) values (1, 'a'); insert into t_articles (article_id, article_title) values (2, 'b'); UPDATE t_articles SET article_xml = ' <beatles id = "b1"> <beatle instrument = "guitar" alive = "no">john lennon</beatle> <beatle instrument = "guitar" alive = "no">george harrison</beatle> </beatles>' WHERE article_id = 1; UPDATE t_articles SET article_xml = ' <beatles id = "b2"> <beatle instrument = "bass" alive = "yes">paul mccartney</beatle> <beatle instrument = "drums" alive = "yes">ringo starr</beatle> </beatles>' WHERE article_id = 2; </pre> </div> <p>Now we make a query that will return a vector of results, each vector element corresponding to a node-set of the result.</p> <div> <pre class="programlisting"> SELECT xpath_eval('//beatle/@instrument', xml_tree_doc (article_xml), 0) AS beatle_instrument FROM t_articles WHERE article_id = 2; </pre> </div> <p>The repeating nodes are returned as part of a vector, the third argument to <span class="computeroutput">xpath_eval()</span> is set to 0, which means that it is to return all nodes.</p> <p>Otherwise, we can select the first node-set by supplying 1 as the third parameter to <span class="computeroutput">xpath_eval()</span>: </p> <div> <pre class="programlisting"> SELECT xpath_eval('//beatle/@instrument', xml_tree_doc (article_xml), 1) AS beatle_instrument FROM t_articles WHERE article_id = 2; </pre> </div> <div class="tip"> <div class="tiptitle">See Also:</div> <p> <a href="fn_xpath_eval.html">xpath_eval()</a> </p> <p> <a href="fn_xquery_eval.html">xquery_eval()</a> </p> <p> <a href="">xmlupdate()</a> </p> </div> <br /> <a name="wxmlextentrefinxml" /> <h3>13.4.3. External Entity References in Stored XML</h3> <p> When an XML document is stored as either text or in persistent XML format it can contain references to external parsed entities with the <!entity ...> declaration and the &xx; syntax. These are stored as references and not expanded at storage time if the entity is external. Such references are transparently followed by XPath and XSLT. A run-time error occurs if the referenced resource cannot be accessed when needed. The reference is only followed if the actual subtree is selected by XPath or XSLT. The resource is retrieved at most once for each XPath or XSLT operation referencing it, regardless of the number of times the link is traversed. This is transparent, so that the document node of the referenced entity appears as if it were in the place of the reference. </p> <p> External entity references have an associated URI, which is either absolute, with protocol identifier and full path, or relative. Virtuoso resolves relative references with respect to the base URI of the referencing document. If the document is stored as a column value in a table it does not have a natural base URI; therefore, the application must supply one if relative references are to be supported. This is done by specifying an extra column of the same table to contain a path, in the form of collections delimited by slashes, like the path of a DAV resource or a Unix file system path. This base URI is associated with an XML column with the IDENTIFIED BY declaration: </p> <div> <pre class="programlisting"> create table XML_TEXT ( XT_ID integer, XT_FILE varchar, XT_TEXT long varchar identified by xt_file, primary key (XT_ID) ); create index XT_FILE on XML_TEXT (XT_FILE); </pre> </div> <p> Thus, each time the value of <span class="computeroutput">xt_text</span> is retrieved for XML processing by <span class="computeroutput">xpath_contains()</span> or <span class="computeroutput">xcontains()</span> the base URI is taken from <span class="computeroutput">xt_file</span>. The complete URI for the <span class="computeroutput">xt_text</span> of a column of the sample table would be: </p> <div> <pre class="programlisting"> virt://<qualified table name>.<uri column>.<text column>:<uri column value> </pre> </div> <p> An example would be: </p> <div> <pre class="programlisting"> "virt://DB.DBA.XML_TEXT.XT_FILE.XT_TEXT:sqlreference.xml" </pre> </div> <p> The '..' and '.' in relative paths are treated like file names when combining relative references to base URIs. A relative reference without a path just replaces the last part of the path in the base URI. </p> <div class="tip"> <div class="tiptitle">See Also:</div> <p> <a href="fn_xml_uri_get.html">xml_uri_get() and xml_uri_merge()</a> for more details. </p> </div> <br /> <a name="wamlschmdtdfuncs" /> <h3>13.4.4. XML Schema & DTD Functions</h3> <p>The following functions can be used to generate XML Schema or DTD information about a given SQL query:</p> <ul> <li> <a href="fn_xml_auto_schema.html">xml_auto_schema()</a> </li> <li> <a href="fn_xml_auto_dtd.html">xml_auto_dtd()</a> </li> </ul> <a name="ex_webandxmlautoschdtd" /> <div class="example"> <div class="exampletitle">Generating XML Schema and DTD Data</div> <p>This example shows trivial use of the two functions <span class="computeroutput">xml_auto_schema()</span> and <span class="computeroutput">xml_auto_dtd()</span>. </p> <div> <pre class="programlisting"> SQL> select xml_auto_schema('select u_name from sys_users', 'root'); callret VARCHAR _______________________________________________________________________________ <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"> <xsd:annotation> <xsd:documentation> Schema for output of the following SQL statement: <![CDATA[select u_name from sys_users]]> </xsd:documentation> </xsd:annotation> <xsd:element name="root" type="root__Type"/> <xsd:complexType name="root__Type"> <xsd:sequence> <xsd:element name="SYS_USERS" type="SYS_USERS_Type" minOccurs="0" maxOccurs="unbounded"/> </xsd:sequence> </xsd:complexType> <xsd:complexType name="SYS_USERS_Type"> <xsd:attribute name="U_NAME" type="xsd:string"/> </xsd:complexType> </xsd:schema> 1 Rows. -- 1843 msec. SQL> select xml_auto_dtd('select u_name from sys_users', 'root'); callret VARCHAR _______________________________________________________________________________ <!-- dtd for output of the following SQL statement: select u_name from sys_users --> <!ELEMENT root (#PCDATA | SYS_USERS)* > <!ELEMENT SYS_USERS (#PCDATA)* > <!ATTLIST SYS_USERS U_NAME CDATA #IMPLIED > 1 Rows. -- 411 msec. </pre> </div> </div> <br /> <a name="usingxmlfreetext" /> <h3>13.4.5. Using XML and Free Text</h3> <p> Virtuoso integrates classic free text retrieval and XML semi-structured query features to offer a smart, scalable XML repository. When a column is declared as indexed XML with the CREATE TEXT XML INDEX statement the text is checked for well-formedness at time of storage. The specific XML structure of the text is also considered when making the free text index entries. This XML-aware free text index is used for processing XPath queries in the <span class="computeroutput">xcontains</span> SQL predicate. This predicate is only applicable to columns for which there is an XML free text index. </p> <p> Arbitrary free text criteria can appear inside the XPath expression of <span class="computeroutput">xcontains</span>. These are introduced by the XPath extension function <span class="computeroutput">text-contains()</span>, which may only be used within <span class="computeroutput">xcontains</span> as it relies on the underlying free text index. </p> <div class="note"> <div class="notetitle">Note</div> <p> <span class="computeroutput">xpath_contains()</span> does not require the existence of a free text index and can thus apply to any well-formed XML content.</p> </div> <br /> <a name="xcontainspredicate" /> <h3>13.4.6. XCONTAINS predicate</h3> <p>This predicate is used in a SQL statement, it returns "true" if a free text indexed column with XML content matches an XPATH expression. Optionally produces the matching node set as a result set.</p> <p> Syntax </p> <div> <pre class="programlisting"> xcontains_pred: xcontains (column, expr [, result_var [, opt_or_value ...]]) opt_or_value: DESCENDING | START_ID ',' scalar_exp | END_ID ',' scalar_exp | SCORE_LIMIT ',' scalar_exp | OFFBAND column result_var: IDENTIFIER | NULL </pre> </div> <p> The <strong>column</strong> must refer to a column for which there exists a free text index. </p> <p> The <strong>expr</strong> must be a narrow or wide string expression whose syntax matches the rules in 'XPATH Query Syntax'. </p> <p> The <strong>result_var</strong> variable is a query variable which, if present, will be successively bound to each element of the node set selected by the XPATH expression. if the value is not a node set and is true, the variable will be once bound to this value. The scope of the variable is the containing select and its value is a scalar or an XML entity. The <strong>result_var</strong> can be not an identifier but a NULL keyword to explicitly indicate that no query variable is required. </p> <p> The <strong>START_ID</strong> is the first allowed document ID to be selected by the expression in its traversal order, e.g. least or equal for ascending and greatest or equal for descending. </p> <p> <strong>END_ID</strong> is the last allowed id in the traversal order. For descending order the START_ID must be >= END_ID for hits to be able to exist. For ascending order the START_ID must be <= END_ID for hits to be able to exist. </p> <p> <strong>DESCENDING</strong> specifies that the search will produce the hit with the greatest ID first, as defined by integer or composite collation. This has nothing to do with a possible ORDER BY of the enclosing statement. Even if there is an ORDER BY in the enclosing statement the DESCENDING keyword of xcontains has an effect in the interpretation of the STRT_ID and END_ID xcontains options. </p> <p> <strong>RANGES</strong> specifies that the query variable following the RANGES keyword will be bound to the word position ranges of the hits of the expression inside the document. The variable is in scope inside the enclosing SELECT statement. </p> <p> <strong>SCORE_LIMIT</strong> specifies a minimum score that hits must have or exceed to be considered matches of the predicate. </p> <p> <strong>OFFBAND</strong> specifies that the following column will be retrieved from the free text index instead of the actual table. For this to be possible the column must have been declared as offband with the CLUSTERED WITH option of the <a href="creatingtxtidxs.html#createtxtidxstmt">CREATE TEXT INDEX</a> statement. </p> <p> If the select statement containing the xcontains predicate does not specify an exact match of the primary key of the table having the xcontains predicate, then the contains predicate will be the 'driving' condition, meaning that rows come in ascending or descending order of the free text document ID. If there is a full equality match of the primary key of the table, this will be the driving predicate and xcontains will only be used to check if the text expression matches the single row identified by the full match of the primary key. </p> <p> The xcontains predicate may not appear outside of a select statement and may only reference a column for which a free text index has been declared. The first argument must be a column for which there is such an index. The text expression may be variable and computed, although it must be constant during the evaluation of the select statement containing it.</p> <p> The xcontains predicate must be a part of the top level AND of the WHERE clause of the containing select. It may not for example be a term of an OR predicate in the select but can be AND'ed with an OR expression. </p> <a name="" /> <div class="example"> <div class="exampletitle">Selecting Title Elements Called 'Key'</div> <div> <pre class="programlisting"> select xt_file from xml_text2 where xcontains (xt_text, '//title = "Key"'); </pre> </div> <p> The query retrieves the <span class="computeroutput">xt_file</span> for rows whose <span class="computeroutput">xt_text</span> is an XML document containing 'Key' as the text value of a title element. </p> <p>If not all values in <span class="computeroutput">xt_text</span> are valid XMLs then '__quiet' option can be useful to disable error signalling. It is unusual to get an incorrect XML stored in a column that has free text XML index because both on insert and on update the text is parsed by an free text indexing routine, but the error is possible if e.g. a non-standalone document is stored and an important external entity was available at indexing time but disappeared later. Thus a modified example might be better for a column with non-standalone documents; <span class="computeroutput"> select xt_file from xml_text2 where xcontains (xt_text, '[__quiet] //title = "Key"'); </span> </p> <div class="exampletitle">Selecting Title Element that Contains a Specified Text</div> <div> <pre class="programlisting"> select n from xml_text2 where xcontains(xt_text, '//title[. = "AS Declaration - Column Aliasing"]',0,n); </pre> </div> <p> The query retrieves each title element from each row of <span class="computeroutput">xml_text2</span> where the <span class="computeroutput">xt_text</span> contains title elements with the text value "AS Declaration - Column Aliasing." </p> <div class="note"> <div class="notetitle">Note</div> <p>The equality test is case- and whitespace-sensitive, as normal in XPath. The free text index is used for the search but the final test is done according to XPath rules.</p> </div> </div> <div class="tip"> <div class="tiptitle">See Also:</div> <p>The <a href="queryingftcols.html#containspredicate">CONTAINS</a> Predicate.</p> </div> <br /> <a name="textcontainsxpath" /> <h3>13.4.7. text-contains XPath Predicate</h3> <div> <pre class="programlisting"> text-contains (node-set, text-expression) </pre> </div> <p> This XPath predicate is true if any of the nodes in <span class="computeroutput">node-set</span> have text values matching the <span class="computeroutput">text-expression</span>. The <span class="computeroutput">text-expression</span> should be a constant string whose syntax corresponds to the top production of the free text syntax for patterns in <span class="computeroutput">contains()</span>. The string also may not consist exclusively of spaces or noise words. </p> <div class="tip"> <div class="tiptitle">See Also:</div> <p>"Noise Words" in the <a href="htmlconductorbar.html#freetext">Free Text Search chapter</a>.</p> </div> <a name="" /> <div class="example"> <div class="exampletitle">Selecting All Titles About Aliasing</div> <div> <pre class="programlisting"> select n from xml_text2 where xcontains (xt_text, '//title[text-contains (., "Aliasing")]', 0, n); </pre> </div> </div> <p> This selects all title elements that contain the word "Aliasing" using free text match rules: case insensitive and whole word. </p> <a name="" /> <div class="example"> <div class="exampletitle">Select All Trees with Elements Containing "sql reference"</div> <div> <pre class="programlisting"> select n from xml_text2 where xcontains (xt_text, '//*[text-contains (., ''"sql reference"'')]', 0, n); </pre> </div> </div> <p> This selects all elements whose text value contains the phrase "sql reference". Free text matching rules apply. This produces all nodes in document order for all documents which contains the phrase, starting with the document node and following downward including all paths to the innermost element(s) whose text contains the phrase. </p> <br /> <a name="xmlfreetextrules" /> <h3>13.4.8. XML Free Text Indexing Rules</h3> <p> XML documents are inserted into the free text index as follows: </p> <ul> <li>The process works on the parsed XML tree; therefore character and local entity references are expanded.</li> <li>Whole words of text content, bounded by delimiters used for free text, are each assigned an ordinal number. Noise words defined in the noise.txt file used by free text indexing are not counted.</li> <li>Attribute names and values are not indexed.</li> <li>Element start and end tags are indexed using the expanded names - that is, prefixed with the namespace URI + ':'.</li> <li>An element start tag's ordinal number is one less than the ordinal number of the first whole word in the text value.</li> <li>A close tag's ordinal number is one greater than that of the last word in the text value.</li> </ul> <p> From these rules follows that: </p> <div> <pre class="programlisting"> <html> <body> <title>Title of Document</title> <p>Some <b>bold</b> text </p> </body> </html> </pre> </div> <p> will be indexed as follows: </p> <div> <pre class="programlisting"> <html> 0 <body> 0 <title> 0 Title 1 of - no number, noise word Document 2 </title> 3 <p> 3 Some 4 <b> 4 bold 5 </b> 6 text 6 </p> 6 </body> 6 </html> 6 </pre> </div> <p> As a result, the phrase "some bold text" is the string value of the <p> tag and will match the free text expression "some bold text" even though there is mark-up in it. Conversely, the phrase "Document some bold" does not match. Words will not considered adjacent if there is a mix of opening and closing tags. They will only be considered adjacent if there are solely one or more either opening or closing tags between them. This can be circumvented by using the <span class="computeroutput">NEAR</span> connective instead of the phrase construct. </p> <p> A free text condition will only be true of an element if all the words needed to satisfy the condition are part of the element's string value. This string value includes text children of descendants.</p> <br /> <a name="xmlencoding" /> <h3>13.4.9. XML Processing & Free Text Encoding Issues</h3> <p> XML document may be written in a variety of encodings, and it may cause errors if an incorrect encoding is used for reading a document. Most common errors can easily be eliminated by writing proper XML prologs in documents, but this is not always possible, e.g. if documents are composed by third-party applications. Virtuoso provides various tools to support different types of encodings and to specify encodings to use if a given document has no XML prolog. </p> <a name="encodingsvscharsets" /> <h4>13.4.9.1. Encodings: The Difference Between Encodings & Character Sets</h4> <p> Not all documents may be converted to Unicode by using simple character sets. Some of them are stored in so-called "multibyte" encodings. It means that every letter (or ideograph) is represented as a sequence of one or more bytes, not by exactly one byte. The conversion from such representation to Unicode and back is usually significantly slower than simple transformation via character sets, so these representations are supported by data import operations only, but not by internal RDBMS routines.</p> <p> The Virtuoso Server "knows" some number of built-in encodings, such as UTF-8, UTF-16BE and UTF-16LE. It can load additional encoding descriptions from a "UCM" file, and can automatically create a new encoding from a known charset with the same name. See the <a href="queryingxmldata.html#ucmencodings">UCM Encodings</a> section for more details.</p> <p> An encoding may be used in the following places: </p> <ul> <li>The XML/HTML parser to convert source text to Unicode.</li> <li>The free-text indexing engine to convert plain-text or XML documents to Unicode during the indexing.</li> <li>It may be used by the compiler of free-text search expressions to convert string constants of the expression to Unicode.</li> <li>It may be used to convert string constants of XPath/XQuery expressions.</li> </ul> <p>You can only use character sets, not encodings as an ODBC connection character set, as a character set attribute of a column of a database table, as an output encoding of the built-in XSLT processor (it is for future versions). UTF-8 is an exception, it is supported in many places where other encodings are not.</p> <div class="tip"> <div class="tiptitle">Security Note:</div> <p>Two strings converted to Unicode may be identical, but this does not guarantee that their source strings were equal byte-by-byte due to the nature of some encodings. For this reason you should avoid processing authorization data that are neither in Unicode nor in one of the standard character sets (single-byte encodings). Multibyte encodings and user-defined character sets may be unsafe for such purposes. </p> </div> <a name="ucmencodings" /> <h5>13.4.9.1.1. UCM Encodings</h5> <p> The description of a multibyte encodings is much longer than the description of a character set. It is inconvenient to keep such amounts of data inside the executable. Virtuoso can load descriptions of required encodings from external files in UCM format. Every UCM file describes one encoding.</p> <p> Virtuoso loads UCM files at system initialization. The list of UCM files is kept in the <a href="databaseadmsrv.html#VIRTINI">Virtuoso INI file</a> under a section called [Ucms]. This section should contain a UcmPath parameter and one or more parameters with names Ucm1, Ucm2, Ucm3 and so on (up to Ucm99).</p> <p> The UcmPath parameter specifies the directory where UCM files are located, and every UcmNN parameter specifies the name of a UCM file to load and a list of names that the encoding can be identified by the <?xml ... encoding="..." ?> XML preamble. A vertical bar character is used to delimit names in the list.</p> <a name="ucmdefininifile" /> <div class="example"> <div class="exampletitle">Sample [Ucms] Section</div> <div> <pre class="programlisting"> [Ucms] UcmPath = /usr/local/javalib/ucm Ucm1 = java-Cp933-1.3-P.ucm,Cp933 Ucm2 = java-Cp949-1.3-P.ucm,Cp949|Korean </pre> </div> <p>This section describes two UCM files located in /usr/local/javalib/ucm directory: data from java-Cp933-1.3-P.ucm will be used for documents in the 'Cp933' encoding; data from java-Cp949-1.3-P.ucm will be used for documents in the 'Cp949' encoding and for documents in the 'Korean' encoding (because these two names refers to the same encoding). </p> </div> <div class="note"> <div class="notetitle">Note:</div> <p>The encoding name specified inside the UCM file itself is not used.</p> </div> <p> The Virtuoso server will log the results of processing each UCM file specified in the Virtuoso INI file. If a UCM file specified is not found or contains syntax errors, the error is logged, otherwise only the type and name(s) of the encoding are logged.</p> <div class="note"> <div class="notetitle">Note:</div> <p>If the virtuoso.ini contains a misspelled name of a parameter or section, the parameter (or a whole section) is ignored without being reported as an error. It is always wise to verify that the log contains a record about the encoding(s) you load.</p> </div> <div class="tip"> <div class="tiptitle">See Also:</div> <p>UCM files can be found freely from various sites concerning the "International Components for Unicode" project, such as: <a href="http://www-124.ibm.com/icu">IBM ICU Homepage</a> or the <a href="http://oss.software.ibm.com/cvs/icu/charset/data/ucm/">IBM UCM files directory</a>.</p> <p>The <a href="cinterface.html">C Interface</a> chapter contains further information regarding user customizable support for new encodings and languages. For almost all tasks, it is enough to define a new charset or to load an additional UCM file, but some special tasks may require writing additional C code.</p> </div> <br /> <br /> <a name="encodingattr" /> <h4>13.4.9.2. The Encoding Attribute</h4> <p>If an XML document contains the <span class="computeroutput">encoding</span> parameter in its</p> <div> <pre class="programlisting"><?xml ... ?></pre> </div> <p>prolog declaration, it will be properly decoded and converted into UTF-8, so the application code is free from encoding problems. If the value of this attribute is the name of a pre-set or user-defined character set, that character set will be used. Virtuoso will recognize names such as <span class="computeroutput">UTF-8</span> and <span class="computeroutput">UTF8</span> as multi-character or special encodings. Virtuoso recognizes both official names and aliases.</p> <p>If an encoding is not specified in an XML prolog, or if the document contains no prolog, the default encoding will be used to read the document. If a built-in SQL function invokes the XML parser, it will have an optional argument <span class="computeroutput">parser_mode</span> to specify whether source text should be parsed as strict XML or as HTML. If the source text is 8-bit, then UTF-8 will be used as the default encoding for "XML mode", and ISO-8859-1 (Latin-1) will be the default for "HTML mode". If the source text is of some wide-character type, Unicode is the default. To make another encoding the default, you may specify its official or alias name as the <span class="computeroutput">content_encoding</span> argument of a built-in function you call.</p> <br /> <a name="encodingxpathexp" /> <h4>13.4.9.3. Encoding in XPath Expressions</h4> <p> Sometimes applications should perform XPath queries using the encoding specified by a client. For example, a search engine may ask a user to specify a pattern to search and use the browser's current encoding as a hint to parse the pattern properly. In such cases you may wish to use the <span class="computeroutput">__enc</span> XPath option to specify the encoding used for the rest of XPath string:</p> <a name="" /> <div class="example"> <div class="exampletitle">Specifying Search Encodings in XPath</div> <p>Create a sample table and store an XML with non-Latin-1 characters</p> <div> <pre class="programlisting"> create table ENC_XML_SAMPLE ( ID integer, XPER long varchar, primary key (ID) ); insert into ENC_XML_SAMPLE (ID, XPER) values ( 1, xml_persistent ('<?xml version="1.0" encoding="WINDOWS-1251" ?> <book><cit>Îí äîáàâèë êàðòîøêè, ïîñîëèë è ïîñòàâèë àêâàðèóì íà îãîíü (Ì.Æâàíåöêèé )</cit></book>' ) ); ... </pre> </div> <p>Find the IDs of all XML documents whose texts contain a specified phrase. Note that there are pairs of single quotes (not double quotes) around <span class="computeroutput">KOI8-R</span>. The encoding name should be in single quotes, but because it is inside a string constant the quotes must be duplicated. </p> <div> <pre class="programlisting"> select ID from ENC_XML_SAMPLE where xcontains (XPER, '[__enc ''KOI8-R''] //cit[text-contains(., "''ÐÏÓÔÁ×ÉÌ ÁË×ÁÒÉÕÍ ÎÁ ÏÇÏÎØ''")]'); </pre> </div> </div> <br /> <a name="encodinginfttsp" /> <h4>13.4.9.4. Encoding in Free Text Search Indexes & Patterns</h4> <p>Like XML applications, free text searching may have encoding problems, and Virtuoso offers a similar solution for them.</p> <p>Both the CREATE TEXT INDEX statement and vt_create_text_index() Virtuoso/PL procedure have an optional argument to specify the encoding of the indexed data. The specified encoding will be applied to all source text documents (if the TEXT INDEX was created), or to all XML documents that have no encoding attribute of the sort <?xml ... encoding="..." ?> (if the TEXT XML INDEX was created).</p> <p> The option <span class="computeroutput">__enc</span> may be specified at the beginning of free text search pattern, even if the pattern is inside an XPath statement:</p> <a name="" /> <div class="example"> <div class="exampletitle">Specifying an Encoding for Free Text Searching</div> <p> Create a sample table and store a sample of text with non-Latin-1 characters (assuming that client encoding is Windows-1251) </p> <div> <pre class="programlisting"> create table ENC_TEXT_SAMPLE ( ID integer, TEXT long nvarchar, primary key (ID) ); insert into ENC_TEXT_SAMPLE (ID, XPER) values ( 1, '<?xml version="1.0" encoding="WINDOWS-1251" ?> Îí äîáàâèë êàðòîøêè, ïîñîëèë è ïîñòàâèë àêâàðèóì íà îãîíü (Ì.Æâàíåöêèé') ); ... </pre> </div> <p>Find the IDs of all text documents whose texts contain a specified phrase. </p> <div> <pre class="programlisting"> select ID from ENC_SAMPLE where contains (TEXT, '[__enc ''KOI8-R''] "ÐÏÓÔÁ×ÉÌ ÁË×ÁÒÉÕÍ ÎÁ ÏÇÏÎØ"' ); </pre> </div> <p>Encoding may be applied locally to an argument of the text-search predicate. It may be used if the document contains citations in different encodings or if the XML document contains non-ASCII characters in names of tags or attributes, or if the encoding affects character codes of ASCII symbols such as '/' or '['. </p> <div> <pre class="programlisting"> select ID from ENC_XML_SAMPLE where xcontains (XPER, '//cit[text-contains(., "[__enc ''KOI8-R''] ''ÐÏÓÔÁ×ÉÌ ÁË×ÁÒÉÕÍ ÎÁ ÏÇÏÎØ''")]' ); </pre> </div> </div> <div class="note"> <div class="notetitle">Note:</div> <p> You may have free-text a expression written as a literal constant: e.g. if the argument of text-contains XPath function is a literal constant. Be careful to not declare the __enc twice, once in the beginning of the whole XPath expression and then again in the beginning of the free-text expression constant, because words of the text expression will thus be converted twice.</p> </div> <br /> <br /> <table border="0" width="90%" id="navbarbottom"> <tr> <td align="left" width="33%"> <a href="xmlservices.html" title="Virtuoso XML Services">Previous</a> <br />Virtuoso XML Services</td> <td align="center" width="34%"> <a href="webandxml.html">Chapter Contents</a> </td> <td align="right" width="33%"> <a href="updategrams.html" title="Using UpdateGrams to Modify Data">Next</a> <br />Using UpdateGrams to Modify Data</td> </tr> </table> </div> <div id="footer"> <div>Copyright© 1999 - 2009 OpenLink Software All rights reserved.</div> <div id="validation"> <a href="http://validator.w3.org/check/referer"> <img src="http://www.w3.org/Icons/valid-xhtml10" alt="Valid XHTML 1.0!" height="31" width="88" /> </a> <a href="http://jigsaw.w3.org/css-validator/"> <img src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!" height="31" width="88" /> </a> </div> </div> </body> </html>