<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html> <head profile="http://internetalchemy.org/2003/02/profile"> <link rel="foaf" type="application/rdf+xml" title="FOAF" href="http://www.openlinksw.com/dataspace/uda/about.rdf" /> <link rel="schema.dc" href="http://purl.org/dc/elements/1.1/" /> <meta name="dc.title" content="14. RDF Data Access and Data Management" /> <meta name="dc.subject" content="14. RDF Data Access and Data Management" /> <meta name="dc.creator" content="OpenLink Software Documentation Team ; " /> <meta name="dc.copyright" content="OpenLink Software, 1999 - 2009" /> <link rel="top" href="index.html" title="OpenLink Virtuoso Universal Server: Documentation" /> <link rel="search" href="/doc/adv_search.vspx" title="Search OpenLink Virtuoso Universal Server: Documentation" /> <link rel="parent" href="rdfandsparql.html" title="Chapter Contents" /> <link rel="prev" href="rdfsparqlgeospat.html" title="RDF and Geometry" /> <link rel="next" href="rdfnativestorageproviders.html" title="RDF Data Access Providers (Drivers)" /> <link rel="shortcut icon" href="../images/misc/favicon.ico" type="image/x-icon" /> <link rel="stylesheet" type="text/css" href="doc.css" /> <link rel="stylesheet" type="text/css" href="/doc/translation.css" /> <title>14. RDF Data Access and Data Management</title> <meta http-equiv="Content-Type" content="text/xhtml; charset=UTF-8" /> <meta name="author" content="OpenLink Software Documentation Team ; " /> <meta name="copyright" content="OpenLink Software, 1999 - 2009" /> <meta name="keywords" content="" /> <meta name="GENERATOR" content="OpenLink XSLT Team" /> </head> <body> <div id="header"> <a name="rdfperformancetuning" /> <img src="../images/misc/logo.jpg" alt="" /> <h1>14. 
RDF Data Access and Data Management</h1> </div> <div id="navbartop"> <div> <a class="link" href="rdfandsparql.html">Chapter Contents</a> | <a class="link" href="rdfsparqlgeospat.html" title="RDF and Geometry">Prev</a> | <a class="link" href="rdfnativestorageproviders.html" title="RDF Data Access Providers (Drivers)">Next</a> </div> </div> <div id="currenttoc"> <form method="post" action="/doc/adv_search.vspx"> <div class="search">Keyword Search: <br /> <input type="text" name="q" /> <input type="submit" name="go" value="Go" /> </div> </form> <div> <a href="http://www.openlinksw.com/">www.openlinksw.com</a> </div> <div> <a href="http://docs.openlinksw.com/">docs.openlinksw.com</a> </div> <br /> <div> <a href="index.html">Book Home</a> </div> <br /> <div> <a href="contents.html">Contents</a> </div> <div> <a href="preface.html">Preface</a> </div> <br /> <div class="selected"> <a href="rdfandsparql.html">RDF Data Access and Data Management</a> </div> <br /> <div> <a href="rdfdatarepresentation.html">Data Representation</a> </div> <div> <a href="rdfsparql.html">SPARQL</a> </div> <div> <a href="sparqlextensions.html">Extensions</a> </div> <div> <a href="rdfgraphsecurity.html">RDF Graphs Security</a> </div> <div> <a href="rdfviews.html">Linked Data Views over RDBMS Data Source</a> </div> <div> <a href="rdfrdfviewgnr.html">Automated Generation of RDF Views over Relational Data Sources</a> </div> <div> <a href="rdfviewsenterpr.html">Examples of Linked Data Views</a> </div> <div> <a href="rdfinsertmethods.html">RDF Insert Methods in Virtuoso</a> </div> <div> <a href="virtuososponger.html">RDFizer Middleware (Sponger)</a> </div> <div> <a href="virtuosospongerfacetinstall.html">Virtuoso Faceted Browser Installation and configuration</a> </div> <div> <a href="virtuosospongerfacent.html">Virtuoso Faceted Web Service</a> </div> <div> <a href="rdfiridereferencing.html">Linked Data</a> </div> <div> <a href="rdfsparqlrule.html">Inference Rules & Reasoning</a> </div> <div> <a 
href="rdfsparqlgeospat.html">RDF and Geometry</a> </div> <div class="selected"> <a href="rdfperformancetuning.html">RDF Performance Tuning</a> <div> <a href="#rdfperfgeneral" title="General">General</a> <a href="#rdfperfrdfscheme" title="RDF Index Scheme">RDF Index Scheme</a> <a href="#rdfperfindexes" title="Index Scheme Selection">Index Scheme Selection</a> <a href="#rdfperfcost" title="Erroneous Cost Estimates and Explicit Join Order">Erroneous Cost Estimates and Explicit Join Order</a> <a href="#rdfperfgetallgraphs" title="Get All Graphs">Get All Graphs</a> <a href="#rdfperfdumpandreloadgraphs" title="Dump and Reload Graphs">Dump and Reload Graphs</a> <a href="#rdfperfdumpandreloadgraphsn3" title="Dump RDF View Graph to n3">Dump RDF View Graph to n3</a> <a href="#rdfperfloading" title="Loading RDF">Loading RDF</a> <a href="#rdfperfsparul" title="Using SPARUL">Using SPARUL</a> <a href="#rdfperfgeneraldbpedia" title="DBpedia Benchmark">DBpedia Benchmark</a> <a href="#rdfstorebenchmarks" title="RDF Store Benchmarks">RDF Store Benchmarks</a> <a href="#fastapproxdiffandpatch" title="Fast Approximate RDF Graph Diff and Patch">Fast Approximate RDF Graph Diff and Patch</a> <a href="#rdb2rdftriggers" title="RDB2RDF Triggers">RDB2RDF Triggers</a> </div> </div> <div> <a href="rdfnativestorageproviders.html">RDF Data Access Providers (Drivers)</a> </div> <div> <a href="rdfgraphreplication.html">RDF Graph Replication</a> </div> <br /> </div> <div id="text"> <a name="rdfperformancetuning" /> <h2>14.15. RDF Performance Tuning</h2> <p>For RDF query performance, we have the following possible questions:</p> <ul> <li>Is the Virtuoso process properly configured to handle big data sets?</li> <li>Is the graph always specified?</li> <li>Are public web service endpoints protected against bad queries?</li> <li>Are there patterns where only a predicate is given?</li> <li>Is there a bad query plan because of cost model error?</li> </ul> <a name="rdfperfgeneral" /> <h3>14.15.1. 
General</h3> <p>When running with large data sets, one should configure the Virtuoso process to use between 3/5 and 2/3 of system RAM and to stripe storage on all available disks. See the <a href="databaseadmsrv.html#VIRTINI">NumberOfBuffers</a>, <a href="databaseadmsrv.html#VIRTINI">MaxDirtyBuffers</a>, and <a href="databaseadmsrv.html#VIRTINI">Striping</a> INI file parameters. </p> <div> <pre class="programlisting">
; default installation
NumberOfBuffers = 2000
MaxDirtyBuffers = 1200
</pre> </div> <p>Typical sizes for the <strong>NumberOfBuffers</strong> and <strong>MaxDirtyBuffers</strong> (3/4 of NumberOfBuffers) parameters in the Virtuoso configuration file (virtuoso.ini) for various memory sizes are as follows, with each buffer consisting of 8K bytes: </p> <table class="data"> <caption>Table: 14.15.1.1. Recommended NumberOfBuffers and MaxDirtyBuffers</caption> <tr> <th class="data">System RAM</th> <th class="data">NumberOfBuffers</th> <th class="data">MaxDirtyBuffers</th> </tr> <tr> <td class="data">2 GB</td> <td class="data">170000</td> <td class="data">130000</td> </tr> <tr> <td class="data">4 GB</td> <td class="data">340000</td> <td class="data">250000</td> </tr> <tr> <td class="data">8 GB</td> <td class="data">680000</td> <td class="data">500000</td> </tr> <tr> <td class="data">16 GB</td> <td class="data">1360000</td> <td class="data">1000000</td> </tr> <tr> <td class="data">32 GB</td> <td class="data">2720000</td> <td class="data">2000000</td> </tr> <tr> <td class="data">48 GB</td> <td class="data">4000000</td> <td class="data">3000000</td> </tr> <tr> <td class="data">64 GB</td> <td class="data">5450000</td> <td class="data">4000000</td> </tr> </table> <br /> <br /> <a name="rdfperfrdfscheme" /> <h3>14.15.2. RDF Index Scheme</h3> <p> Starting with version 6.00.3126, the default RDF index scheme consists of 2 full indices over RDF quads plus 3 partial indices.
This index scheme is well suited to all kinds of workloads, regardless of whether queries generally specify a graph. </p> <p> Alternate indexing schemes are possible but are generally not needed. For upgrading old databases with a different index scheme, see the corresponding documentation. </p> <p> The index scheme consists of the following indices: </p> <ul> <li> <strong>PSOG</strong> - primary key.</li> <li> <strong>POGS</strong> - bitmap index for lookups on object value.</li> <li> <strong>SP</strong> - partial index for cases where only S is specified.</li> <li> <strong>OP</strong> - partial index for cases where only O is specified.</li> <li> <strong>GS</strong> - partial index for cases where only G is specified.</li> </ul> <p> This index scheme is created by the following statements: </p> <div> <pre class="programlisting">
create table DB.DBA.RDF_QUAD (
  G IRI_ID_8, S IRI_ID_8, P IRI_ID_8, O any,
  primary key (P, S, O, G) )
alter index RDF_QUAD on DB.DBA.RDF_QUAD partition (S int (0hexffff00));

create distinct no primary key ref bitmap index RDF_QUAD_SP on RDF_QUAD (S, P) partition (S int (0hexffff00));
create bitmap index RDF_QUAD_POGS on RDF_QUAD (P, O, G, S) partition (O varchar (-1, 0hexffff));
create distinct no primary key ref bitmap index RDF_QUAD_GS on RDF_QUAD (G, S) partition (S int (0hexffff00));
create distinct no primary key ref index RDF_QUAD_OP on RDF_QUAD (O, P) partition (O varchar (-1, 0hexffff));
</pre> </div> <p> The idea is to favor queries where the predicate is specified in triple patterns. The entire quad can be efficiently accessed when P and either or both of S and O are known. This has the advantage of clustering data by predicate, which improves working set: a page read from disk will only have entries pertaining to the same predicate, so the chances of accessing other entries of the page are higher than if the page held values for arbitrary predicates.
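</p> <p>As an illustration (the graph and IRIs below are hypothetical), a pattern that binds P and O is answered from the POGS bitmap index, while a pattern that binds S and P goes straight to the PSOG primary key:</p> <div> <pre class="programlisting">
-- P and O given: POGS yields the matching S (and G)
SPARQL
PREFIX ex: <http://example.org/>
SELECT ?s FROM <http://example.org/g>
WHERE { ?s rdf:type ex:Person };

-- S and P given: PSOG yields the O
SPARQL
PREFIX ex: <http://example.org/>
SELECT ?o FROM <http://example.org/g>
WHERE { ex:person1 ex:knows ?o };
</pre> </div> <p>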
For less frequent cases where only S is known, as in DESCRIBE, the distinct P's of the S are found in the SP index. These SP pairs are then used for accessing the PSOG index to get the O and G. For cases where only the G is known, as in dropping a graph, the distinct S's of the G are found in the GS index. The P's of each S are then found in the SP index, after which the whole quad is found in the PSOG index. </p> <p> The SP, OP and GS indices do not store duplicates. If an S has many values of a P, there is only one entry. Entries are not deleted from SP, OP or GS. This does not lead to erroneous results, since a full index, that is, either POGS or PSOG, is always consulted in order to know whether a quad actually exists. When updating data, most often a graph is entirely dropped and a substantially similar graph is inserted in its place; the SP, OP and GS indices then stay relatively unaffected. </p> <p> Still, over time, especially if there are frequent updates and values do not repeat between consecutive states, the SP, OP and GS indices will accumulate stale entries, which may affect performance. Dropping and recreating the index will then remedy the situation. </p> <p> For cases where this is not practical, the index scheme should have only full indices, i.e. each key holds all columns of the primary key of the quad. This will be the case if the distinct no primary key ref options are not specified in the create index statements. In such cases all indices remain in strict sync across deletes. </p> <p> Many RDF workloads have bulk load and read intensive access patterns with few deletes, and the default index scheme is optimized for these. In such situations this scheme offers significant space savings, resulting in a better working set. Typically this layout takes 60-70% of the space of a layout with 4 full indices. </p> <br /> <a name="rdfperfindexes" /> <h3>14.15.3.
Index Scheme Selection</h3> <p>The indexes in place on the RDF_QUAD table can greatly affect the performance of SPARQL queries, as can be determined by running the STATISTICS command on the table as follows: </p> <div> <pre class="programlisting"> SQL> STATISTICS DB.DBA.RDF_QUAD; Showing SQLStatistics of table(s) 'DB.DBA.RDF_QUAD' TABLE_QUALIFIER TABLE_OWNER TABLE_NAME NON_UNIQUE INDEX_QUALIFIER INDEX_NAME TYPE SEQ_IN_INDEX COLUMN_NAME COLLATION CARDINALITY PAGES FILTER_CONDITION VARCHAR VARCHAR VARCHAR SMALLINT VARCHAR VARCHAR SMALLINT SMALLINT VARCHAR VARCHAR INTEGER INTEGER VARCHAR _______________________________________________________________________________ DB DBA RDF_QUAD NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL DB DBA RDF_QUAD 0 DB RDF_QUAD 3 1 P NULL NULL NULL NULL DB DBA RDF_QUAD 0 DB RDF_QUAD 3 2 S NULL NULL NULL NULL DB DBA RDF_QUAD 0 DB RDF_QUAD 3 3 O NULL NULL NULL NULL DB DBA RDF_QUAD 0 DB RDF_QUAD 3 4 G NULL NULL NULL NULL DB DBA RDF_QUAD 1 DB RDF_QUAD_GS 3 1 G NULL NULL NULL NULL DB DBA RDF_QUAD 1 DB RDF_QUAD_GS 3 2 S NULL NULL NULL NULL DB DBA RDF_QUAD 1 DB RDF_QUAD_OP 3 1 O NULL NULL NULL NULL DB DBA RDF_QUAD 1 DB RDF_QUAD_OP 3 2 P NULL NULL NULL NULL DB DBA RDF_QUAD 1 DB RDF_QUAD_POGS 3 1 P NULL NULL NULL NULL DB DBA RDF_QUAD 1 DB RDF_QUAD_POGS 3 2 O NULL NULL NULL NULL DB DBA RDF_QUAD 1 DB RDF_QUAD_POGS 3 3 G NULL NULL NULL NULL DB DBA RDF_QUAD 1 DB RDF_QUAD_POGS 3 4 S NULL NULL NULL NULL DB DBA RDF_QUAD 1 DB RDF_QUAD_SP 3 1 S NULL NULL NULL NULL DB DBA RDF_QUAD 1 DB RDF_QUAD_SP 3 2 P NULL NULL NULL NULL 15 Rows. -- 24 msec. SQL> </pre> </div> <p>With only one index (OGPS) created by default, if the graph is always given, as with one or more <strong>FROM</strong> or <strong>FROM NAMED</strong> clauses, and there are no patterns where only graph and predicate are given, then the default indices should be sufficient. 
If predicate and graph are given but subject is not, then it is sometimes useful to add:</p> <div> <pre class="programlisting">
CREATE BITMAP INDEX RDF_QUAD_PGOS ON DB.DBA.RDF_QUAD (G, P, O, S)
  PARTITION (O VARCHAR (-1, 0hexffff));
</pre> </div> <div class="note"> <div class="notetitle">Note:</div> <p>If the server version is pre-5.0.7, leave out the partitioning clause.</p> </div> <p>Making the PGOS index can help in some cases even if it is not readily apparent from the queries that one is needed. This is so, for example, if the predicate by itself is selective, i.e. there is a predicate that occurs in only a few triples.</p> <p>If the graph itself is not given in the queries, then the default index scheme will be unworkable. For this, the appropriate scheme is:</p> <div> <pre class="programlisting">
CREATE TABLE RDF_QUAD (
  G IRI_ID_8, S IRI_ID_8, P IRI_ID_8, O ANY,
  PRIMARY KEY (S, P, O, G) )
ALTER INDEX RDF_QUAD ON RDF_QUAD PARTITION (S INT (0hexffff00));

CREATE BITMAP INDEX RDF_QUAD_OPGS ON DB.DBA.RDF_QUAD (O, P, G, S) PARTITION (O VARCHAR (-1, 0hexffff));
CREATE BITMAP INDEX RDF_QUAD_POGS ON DB.DBA.RDF_QUAD (P, O, G, S) PARTITION (O VARCHAR (-1, 0hexffff));
CREATE BITMAP INDEX RDF_QUAD_GPOS ON DB.DBA.RDF_QUAD (G, P, O, S) PARTITION (O VARCHAR (-1, 0hexffff));
</pre> </div> <div class="note"> <div class="notetitle">Note:</div> <p>If the server version is pre-5.0.7, leave out the <strong>PARTITION</strong> clauses and the <strong>ALTER INDEX</strong> statement.</p> </div> <p>Public web service endpoints have proven to be sources of especially bad queries. While local application developers can obtain instructions from the database administrator and use ISQL access to the database in order to tune execution plans, "external" clients do not know the details of the configuration and often lack the appropriate skills.
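</p> <p>One basic safeguard, independent of query tuning, is to cap what endpoint requests may cost. The [SPARQL] section of virtuoso.ini provides limits for this; the values below are illustrative only and should be adjusted to the deployment:</p> <div> <pre class="programlisting">
[SPARQL]
ResultSetMaxRows           = 10000
MaxQueryCostEstimationTime = 400   ; seconds
MaxQueryExecutionTime      = 60    ; seconds
</pre> </div> <p>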
The most common problem is that public endpoints usually get requests that do not mention the required graph, because those queries were initially written for use with other triple stores. If the web service provides access to a single graph (or to a short list of graphs), then it is strongly recommended to configure it by adding a row to <strong>DB.DBA.SYS_SPARQL_HOST</strong>:</p> <div> <pre class="programlisting">
CREATE TABLE "DB"."DBA"."SYS_SPARQL_HOST" (
  "SH_HOST" VARCHAR,         -- host pattern
  "SH_GRAPH_URI" VARCHAR,    -- default graph
  "SH_USER_URI" VARCHAR,     -- reserved for any use in applications
  "SH_DEFINES" LONG VARCHAR, -- list of 'define xx:yy "zz" ' defines as a string
  PRIMARY KEY ("SH_HOST") );
</pre> </div> <p>You can find detailed descriptions of the table columns <a href="rdfsparql.html#rdfdefaultgraph">here</a>.</p> <p>The idea is that a client who specifies a default graph in the request, or who uses named graphs and group graph patterns, probably knows what they are doing and will provide meaningful queries. If no graph names are specified, then the query will benefit from a preset default graph, because this gives the compiler more indexes to choose from -- indexes that begin with G.</p> <p>Sometimes a web service endpoint is used to access the data of only one application, not all data in the system. In that case one may wish to declare a separate storage that consists only of the RDF Views made by that application and define <strong>input:storage</strong> in the appropriate row of <strong>DB.DBA.SYS_SPARQL_HOST</strong>.</p> <br /> <a name="rdfperfcost" /> <h3>14.15.4. Erroneous Cost Estimates and Explicit Join Order</h3> <p>The selectivity of triple patterns is determined at query compile time by sampling the data. It is possible for this sampling to produce misleading estimates.
To see if the cardinality guesses are generally valid, look at the query plan with <a href="fn_explain.html">explain ()</a>.</p> <p>Below is a sample from the LUBM qualification data set in the Virtuoso distribution. After running <strong>make test</strong> in <strong>binsrc/test/lubm</strong>, there is a loaded database with the data. Start a server in the same directory to see the data.</p> <div> <pre class="programlisting"> SQL> EXPLAIN ('SPARQL PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#> SELECT * FROM <lubm> WHERE { ?x rdf:type ub:GraduateStudent } '); REPORT VARCHAR _______________________________________________________________________________ { Precode: 0: $25 "callret" := Call __BOX_FLAGS_TWEAK (<constant (lubm)>, <constant (1)>) 5: $26 "lubm" := Call DB.DBA.RDF_MAKE_IID_OF_QNAME_SAFE ($25 "callret") 12: $27 "callret" := Call __BOX_FLAGS_TWEAK (<constant (http://www.w3.org/1999/02/22-rdf-syntax-ns#type)>, <constant (1)>) 17: $28 "-ns#type" := Call DB.DBA.RDF_MAKE_IID_OF_QNAME_SAFE ($27 "callret") 24: $29 "callret" := Call __BOX_FLAGS_TWEAK (<constant (http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#GraduateStudent)>, <constant (1)>) 29: $30 "owl#GraduateStudent" := Call DB.DBA.RDF_MAKE_IID_OF_QNAME_SAFE ($29 "callret") 36: BReturn 0 from DB.DBA.RDF_QUAD by RDF_QUAD_OGPS 1.9e+03 rows Key RDF_QUAD_OGPS ASC ($32 "s-3-1-t0.S") <col=415 O = $30 "owl#GraduateStudent"> , <col=412 G = $26 "lubm"> , <col=414 P = $28 "-ns#type"> row specs: <col=415 O LIKE <constant (T)>> Current of: <$34 "<DB.DBA.RDF_QUAD s-3-1-t0>" spec 5> After code: 0: $35 "x" := Call ID_TO_IRI ($32 "s-3-1-t0.S") 5: BReturn 0 Select ($35 "x", <$34 "<DB.DBA.RDF_QUAD s-3-1-t0>" spec 5>) } 22 Rows. -- 1 msec. </pre> </div> <p>This finds the graduate student instances in the LUBM graph. First the query converts the IRI literals to IDs. Then, using a match of OG on OGPS, it finds the IRIs of the graduate students. 
Then it converts the IRI ID back to its string form.</p> <p>The cardinality estimate of 1.9e+03 rows is on the FROM line.</p> <p>Doing an EXPLAIN () on the queries will show the cardinality estimates. To drill down further, one can split the query into smaller chunks and see the estimates for these, down to the level of a single triple pattern. To indicate a variable that is bound but whose value is not a literal known at compile time, one can use the parameter marker <strong>??</strong>.</p> <div> <pre class="programlisting">
SQL> EXPLAIN ('
SPARQL DEFINE sql:table-option "order"
PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
SELECT * FROM <lubm>
WHERE { ?x rdf:type ?? }
');
</pre> </div> <p>This will not know the type but will know that a type will be provided. So instead of guessing 1900 matches, this will guess a smaller number, which is obviously less precise. Thus literals are generally better.</p> <p>In some cases, generally to work around an optimization error, one can specify an explicit <strong>JOIN</strong> order. This is done with the <strong>sql:select-option "order"</strong> clause in the SPARQL query prefix.</p> <div> <pre class="programlisting">
SQL> SELECT SPARQL_to_sql_text ('
DEFINE sql:select-option "order"
PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
SELECT * FROM <lubm>
WHERE { ?x rdf:type ub:GraduateStudent .
        ?x ub:takesCourse <http://www.Department0.University0.edu/GraduateCourse0> }
');
</pre> </div> <p>shows the SQL text with the order option at the end.</p> <p>If an estimate is radically wrong, then this should be reported as a bug.</p> <p>If there is a FROM with a KEY on the next line and no column specs, then this is a full table scan. The more columns are specified, the fewer rows will be passed to the next operation in the chain.
In the example above, there are three columns whose values are known before reading the table, and these columns are the leading columns of the index in use, so the column specs are:</p> <div> <pre class="programlisting">
<col=415 O = $30 "owl#GraduateStudent"> , <col=412 G = $26 "lubm"> , <col=414 P = $28 "-ns#type">
</pre> </div> <div class="note"> <div class="notetitle">Note:</div> <p>A KEY with only a row spec is a full table scan with the row spec applied as a filter. This is usually not good unless it is specifically intended.</p> </div> <p>If queries are compiled to make full table scans when this is not specifically intended, this should be reported as a bug. The explain () output and the query text should be included in the report.</p> <p>Consider:</p> <div> <pre class="programlisting">
SQL> EXPLAIN ('
SPARQL DEFINE sql:select-option "order, loop"
PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
SELECT * FROM <lubm>
WHERE { ?x ub:takesCourse ?c .
        ?x rdf:type ub:GraduateStudent }
');
</pre> </div> <p>One will see in the output that the first table access retrieves everything in the LUBM graph that takes some course, and only later checks whether each such subject is a graduate student. This is obviously not the preferred order, but <strong>sql:select-option "order"</strong> forces the optimizer to join from left to right.</p> <p>It is very easy to end up with completely unworkable query plans in this manner, but if the optimizer really is in error, then this is the only way of overriding its preferences. The effect of <strong>sql:select-option</strong> is pervasive, extending inside unions, optionals, subqueries, etc., within the statement.</p> <p>We note that if, in the above query, both the course taken by the student and the type of the student are given, the query compilation will be, at least for all non-cluster cases, an index intersection.
This is not overridden by the sql:select-option clause, since an index intersection is always a safe guess, regardless of the correctness of the cardinality guesses of the patterns involved.</p> <br /> <a name="rdfperfgetallgraphs" /> <h3>14.15.5. Get All Graphs</h3> <p>In order to get the URIs of all graphs, one might use the Virtuoso <a href="fn_sparql_select_known_graphs.html">DB.DBA.SPARQL_SELECT_KNOWN_GRAPHS()</a> built-in function.</p> <br /> <a name="rdfperfdumpandreloadgraphs" /> <h3>14.15.6. Dump and Reload Graphs</h3> <p>To dump all graphs, optionally extending the script to exclude graphs of a certain type (e.g. by testing each graph with a SPARQL ASK query), you may use the following script sequence:</p> <div> <pre class="programlisting">
CREATE PROCEDURE dump_graphs
  ( IN dir VARCHAR := 'dumps'
  , IN file_length_limit INTEGER := 1000000000
  )
{
  DECLARE inx INT;
  inx := 1;
  SET ISOLATION = 'uncommitted';
  FOR ( SELECT * FROM ( SPARQL DEFINE input:storage ""
                        SELECT DISTINCT ?g { GRAPH ?g { ?s ?p ?o } .
                                             FILTER ( ?g != virtrdf: ) } ) AS sub OPTION ( LOOP ))
  DO
    {
      dump_one_graph ( "g", sprintf ('%s/graph%06d_', dir, inx), file_length_limit );
      inx := inx + 1;
    }
}
;

CREATE PROCEDURE dump_one_graph
  ( IN srcgraph VARCHAR
  , IN out_file VARCHAR
  , IN file_length_limit INTEGER := 1000000000
  )
{
  DECLARE file_name varchar;
  DECLARE env, ses any;
  DECLARE ses_len, max_ses_len, file_len, file_idx integer;
  SET ISOLATION = 'uncommitted';
  max_ses_len := 10000000;
  file_len := 0;
  file_idx := 1;
  file_name := sprintf ('%s%06d.ttl', out_file, file_idx);
  string_to_file ( file_name || '.graph', srcgraph, -2 );
  string_to_file ( file_name,
                   sprintf ( '# Dump of graph <%s>, as of %s\n', srcgraph, CAST (NOW() AS VARCHAR) ),
                   -2 );
  env := vector (dict_new (16000), 0, '', '', '', 0, 0, 0, 0);
  ses := string_output ();
  FOR (SELECT * FROM ( SPARQL DEFINE input:storage ""
                       SELECT ?s ?p ?o { GRAPH `iri(?:srcgraph)` { ?s ?p ?o } } ) AS sub OPTION (LOOP))
  DO
    {
      http_ttl_triple (env, "s", "p", "o", ses);
      ses_len := length (ses);
      IF (ses_len >
max_ses_len)
        {
          file_len := file_len + ses_len;
          IF (file_len > file_length_limit)
            {
              http (' .\n', ses);
              string_to_file (file_name, ses, -1);
              file_len := 0;
              file_idx := file_idx + 1;
              file_name := sprintf ('%s%06d.ttl', out_file, file_idx);
              string_to_file ( file_name,
                               sprintf ( '# Dump of graph <%s>, as of %s (part %d)\n', srcgraph, CAST (NOW() AS VARCHAR), file_idx),
                               -2 );
              env := vector (dict_new (16000), 0, '', '', '', 0, 0, 0, 0);
            }
          ELSE
            string_to_file (file_name, ses, -1);
          ses := string_output ();
        }
    }
  IF (LENGTH (ses))
    {
      http (' .\n', ses);
      string_to_file (file_name, ses, -1);
    }
}
;

create procedure load_graphs (in dir varchar := 'dumps/')
{
  declare arr any;
  declare g varchar;
  arr := sys_dirlist (dir, 1);
  log_enable (2, 1);
  foreach (varchar f in arr) do
    {
      if (f like '*.ttl')
        {
          declare continue handler for sqlstate '*'
            {
              log_message (sprintf ('Error in %s', f));
            };
          g := file_to_string (dir || '/' || f || '.graph');
          DB.DBA.TTLP_MT (file_open (dir || '/' || f), g, g, 255);
        }
    }
  exec ('checkpoint');
}
;
</pre> </div> <a name="rdfperfdumpandreloadgraphsexamples" /> <h4>14.15.6.1. Examples for Dump and Reload Graphs</h4> <div> <pre class="programlisting">
SQL> dump_one_graph ('http://daas.openlinksw.com/data#', './bbc_data_', 1000000000);
SQL> dump_one_graph ('http://daas.openlinksw.com/data#', './bbc_data_');
</pre> </div> <br /> <br /> <a name="rdfperfdumpandreloadgraphsn3" /> <h3>14.15.7. Dump RDF View Graph to n3</h3> <p>The RDF_QM_TREE_DUMP procedure and its associated procedures below are used for dumping one or more RDF View graphs in a Virtuoso server to a set of Turtle (.ttl) dataset files in the specified dump directory. The dump generation is made as fast as possible by grouping mappings by underlying tables, so that many properties from neighboring database columns can be extracted in one table scan. The size of the generated files is limited to 5MB. The dump process creates internal stored procedures; their texts are saved in the file .dump_procedures.sql in the dump directory for debugging purposes.
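</p> <p>For example, once the procedures below are loaded, typical invocations look as follows (the directories and graph IRI are illustrative; compare the test calls at the end of the procedure code):</p> <div> <pre class="programlisting">
-- dump every RDF View of the default quad storage
DB.DBA.RDF_QM_TREE_DUMP ('dump/demo', NULL, NULL, NULL);

-- dump only the quad maps that produce one particular graph
DB.DBA.RDF_QM_TREE_DUMP ('dump/tpch', 'http://localhost:8600/tpch', NULL, NULL);
</pre> </div> <p>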
</p> <p>Note that the dump directory must be included in the DirsAllowed parameter of the Virtuoso configuration file (e.g., virtuoso.ini), or the server will not be allowed to create or access the dataset file(s). </p> <p> The <a href="http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtBulkRDFLoader">Virtuoso RDF bulk loader</a> scripts can then be used to load the dumped datasets for the RDF View graphs directly into a Virtuoso RDF QUAD store. </p> <a name="rdfperfdumpandreloadgraphsn3params" /> <h4>14.15.7.1. Parameters</h4> <ul> <li>in <strong>dest_dir</strong> VARCHAR - dump directory.</li> <li>in <strong>graph_iri</strong> VARCHAR - IRI of the graph to be dumped; triples from other graphs will be excluded. If NULL, then there is no restriction by graph.</li> <li>in <strong>storage</strong> VARCHAR - IRI of the quad map storage to use. NULL means use the default storage.</li> <li>in <strong>root</strong> VARCHAR - IRI of the quad map to use, e.g., an IRI of an RDF View (or its part). NULL means use all RDF Views of the storage (and the default mapping as well).</li> </ul> <br /> <a name="rdfperfdumpandreloadgraphsn3code" /> <h4>14.15.7.2.
Procedure Code</h4> <div> <pre class="programlisting"> CREATE PROCEDURE DB.DBA.RDF_QM_TREE_DUMP ( in dest_dir VARCHAR, in graph_iri VARCHAR := NULL, in storage VARCHAR := NULL, in root VARCHAR := NULL ) { DECLARE all_qms, grouped_qmvs, launcher_text ANY; DECLARE grp_ctr, qm_ctr, qm_count INTEGER; DECLARE sql_file, launcher_name VARCHAR; IF (NOT (dest_dir LIKE '%/')) dest_dir := dest_dir || '/'; sql_file := dest_dir || '.dump_procedures.sql'; IF (storage IS NULL) storage := 'http://www.openlinksw.com/schemas/virtrdf#DefaultQuadStorage'; string_to_file ( sql_file, '-- This file contains procedure created by DB.DBA.RDF_QM_TREE_DUMP() for storage ' || COALESCE (storage, 'NULL') || ' and root quad map ' || COALESCE (root, 'NULL') || '\n\n', -2); all_qms := dict_list_keys (DB.DBA.RDF_QM_CONTENT_OF_QM_TREE (graph_iri, storage, root), 2); grouped_qmvs := DB.DBA.RDF_QM_GROUP_BY_SOURCE_TABLES (all_qms); launcher_name := 'RDF_QM_TREE_DUMP_BATCH_' || md5 (serialize (graph_iri) || storage || serialize (root)); launcher_text := string_output (); http ('CREATE PROCEDURE DB.DBA."' || launcher_name || '" (in dest_dir VARCHAR)\n{\n', launcher_text); FOR (grp_ctr := length (grouped_qmvs); grp_ctr > 0; grp_ctr := grp_ctr-2) { DECLARE tables, qms, proc_text ANY; DECLARE group_key, proc_name, dump_prefix, cmt VARCHAR; tables := grouped_qmvs [grp_ctr-2]; qms := grouped_qmvs [grp_ctr-1]; qm_count := length (qms); group_key := md5 (serialize (graph_iri) || storage || serialize (root) || serialize (tables)); proc_name := 'RDF_QM_TREE_DUMP_GRP_' || group_key; proc_text := string_output (); cmt := sprintf ('%d quad maps on join of', qm_count); FOREACH (VARCHAR t IN tables) DO cmt := cmt || ' ' || t; http (' -- ' || cmt || '\n', launcher_text); http (' DB.DBA."' || proc_name || '" (dest_dir);\n', launcher_text); http ('CREATE PROCEDURE DB.DBA."' || proc_name || '" (in dest_dir VARCHAR)\n', proc_text); http ('{\n', proc_text); http (' -- ' || cmt || '\n', proc_text); http (' DECLARE ses, env 
ANY;\n', proc_text); http (' DECLARE file_ctr, cmt_len INTEGER;\n', proc_text); http (' file_ctr := 0;\n', proc_text); http (' dbg_obj_princ (' || WS.WS.STR_SQL_APOS (cmt) || ', '', file '', file_ctr);\n', proc_text); http (' ses := string_output ();\n', proc_text); http (' http (' || WS.WS.STR_SQL_APOS ('#' || cmt || '\n') || ', ses);\n', proc_text); http (' env := VECTOR (dict_new (16000), 0, '''', '''', '''', 0, 0, 0, 0);\n', proc_text); http (' cmt_len := LENGTH (ses);\n', proc_text); http (' FOR (SPARQL DEFINE input:storage <' || storage || '>\n', proc_text); http (' SELECT ?s1, ?p1, ?o1\n', proc_text); IF (graph_iri IS NOT NULL) { http (' WHERE { GRAPH <', proc_text); http_escape (graph_iri, 12, proc_text, 1, 1); http ('> {\n', proc_text); } ELSE http (' WHERE { GRAPH ?g1 {\n', proc_text); FOR (qm_ctr := 0; qm_ctr < qm_count; qm_ctr := qm_ctr + 1) { IF (qm_ctr > 0) http (' UNION\n', proc_text); http (' { quad map <' || qms[qm_ctr] || '> { ?s1 ?p1 ?o1 } }\n', proc_text); } http (' } } ) DO {\n', proc_text); http (' http_ttl_triple (env, "s1", "p1", "o1", ses);\n', proc_text); http (' IF (LENGTH (ses) > 5000000)\n', proc_text); http (' {\n', proc_text); http (' http ('' .\\n'', ses);\n', proc_text); http (' string_to_file (sprintf (''%s' || group_key || '_%05d.ttl'', dest_dir, file_ctr), ses, -2);\n', proc_text); http (' file_ctr := file_ctr + 1;\n', proc_text); http (' dbg_obj_princ (' || WS.WS.STR_SQL_APOS (cmt) || ', '', file '', file_ctr);\n', proc_text); http (' ses := string_output ();\n', proc_text); http (' http (' || WS.WS.STR_SQL_APOS ('#' || cmt || '\n') || ', ses);\n', proc_text); http (' env := VECTOR (dict_new (16000), 0, '''', '''', '''', 0, 0, 0, 0);\n', proc_text); http (' }\n', proc_text); http (' }\n', proc_text); http (' IF (LENGTH (ses) > cmt_len)\n', proc_text); http (' {\n', proc_text); http (' http ('' .\\n'', ses);\n', proc_text); http (' string_to_file (sprintf (''%s' || group_key || '_%05d.ttl'', dest_dir, file_ctr), ses, -2);\n', 
proc_text); http (' }\n', proc_text); http ('}\n', proc_text); proc_text := string_output_string (proc_text); string_to_file (sql_file, proc_text || ';\n\n' , -1); EXEC (proc_text); } http ('}\n', launcher_text); launcher_text := string_output_string (launcher_text); string_to_file (sql_file, launcher_text || ';\n\n' , -1); EXEC (launcher_text); CALL ('DB.DBA.' || launcher_name)(dest_dir); } ; CREATE FUNCTION DB.DBA.RDF_QM_CONTENT_OF_QM_TREE ( in graph_iri VARCHAR := NULL, in storage VARCHAR := NULL, in root VARCHAR := NULL, in dict ANY := NULL ) returns ANY { DECLARE res, subqms any; DECLARE graphiri varchar; graphiri := DB.DBA.JSO_SYS_GRAPH(); IF (storage IS NULL) storage := 'http://www.openlinksw.com/schemas/virtrdf#DefaultQuadStorage'; DB.DBA.RDF_QM_ASSERT_STORAGE_FLAG (storage, 0); IF (dict IS NULL) dict := dict_new (); IF (root IS NULL) { subqms := ((SELECT DB.DBA.VECTOR_AGG (sub."qmiri") FROM ( SPARQL DEFINE input:storage "" SELECT DISTINCT (str(?qm)) AS ?qmiri WHERE { GRAPH `iri(?:graphiri)` { { `iri(?:storage)` virtrdf:qsUserMaps ?lst . ?lst ?p ?qm . FILTER (0 = bif:strstr (str(?p), str(rdf:_))) } UNION { `iri(?:storage)` virtrdf:qsDefaultMap ?qm . } } } ) AS sub ) ); FOREACH (varchar qmid IN subqms) DO DB.DBA.RDF_QM_CONTENT_OF_QM_TREE (graph_iri, storage, qmid, dict); RETURN dict; } DB.DBA.RDF_QM_ASSERT_JSO_TYPE (root, 'http://www.openlinksw.com/schemas/virtrdf#QuadMap'); IF (graph_iri IS NOT NULL AND EXISTS ((SPARQL DEFINE input:storage "" SELECT (1) WHERE { GRAPH `iri(?:graphiri)` { `iri(?:root)` virtrdf:qmGraphRange-rvrFixedValue ?g . FILTER (str (?g) != str(?:graph_iri)) } } ) ) ) RETURN dict; IF (NOT EXISTS ((SPARQL DEFINE input:storage "" SELECT (1) WHERE { GRAPH `iri(?:graphiri)` { `iri(?:root)` virtrdf:qmMatchingFlags virtrdf:SPART_QM_EMPTY . 
} } ) ) ) dict_put (dict, root, 1); subqms := ((SELECT DB.DBA.VECTOR_AGG (sub."qmiri") FROM ( SPARQL DEFINE input:storage "" SELECT DISTINCT (str(?qm)) as ?qmiri WHERE { GRAPH `iri(?:graphiri)` { `iri(?:root)` virtrdf:qmUserSubMaps ?lst . ?lst ?p ?qm . FILTER (0 = bif:strstr (str(?p), str(rdf:_))) } } ) AS sub ) ); FOREACH (VARCHAR qmid IN subqms) DO DB.DBA.RDF_QM_CONTENT_OF_QM_TREE (graph_iri, storage, qmid, dict); RETURN dict; } ; CREATE FUNCTION DB.DBA.RDF_QM_GROUP_BY_SOURCE_TABLES (in qms ANY) returns ANY { DECLARE res ANY; DECLARE ctr INTEGER; DECLARE graphiri VARCHAR; graphiri := DB.DBA.JSO_SYS_GRAPH(); res := dict_new (LENGTH (qms) / 20); FOREACH (VARCHAR qmiri IN qms) DO { DECLARE tbls, acc ANY; tbls := ((SELECT DB.DBA.VECTOR_AGG (sub."tbl") FROM (SELECT subsub."tbl" FROM ( SPARQL DEFINE input:storage "" SELECT DISTINCT ?tbl WHERE { GRAPH `iri(?:graphiri)` { { `iri(?:qmiri)` virtrdf:qmTableName ?tbl . } UNION { `iri(?:qmiri)` virtrdf:qmATables ?atbls . ?atbls ?p ?atbl . ?atbl virtrdf:qmvaTableName ?tbl } UNION { `iri(?:qmiri)` ?fldmap ?qmv . ?qmv virtrdf:qmvATables ?atbls . ?atbls ?p ?atbl . ?atbl virtrdf:qmvaTableName ?tbl . } } } ) subsub ORDER BY 1 ) AS sub ) ); acc := dict_get (res, tbls); IF (acc IS NULL) vectorbld_init (acc); vectorbld_acc (acc, qmiri); dict_put (res, tbls, acc); } res := dict_to_vector (res, 2); FOR (ctr := LENGTH (res); ctr > 0; ctr := ctr-2) { DECLARE acc ANY; acc := aref_set_0 (res, ctr-1); vectorbld_final (acc); aset_zap_arg (res, ctr-1, acc); } RETURN res; } ; --test dbg_obj_princ (DB.DBA.RDF_QM_GROUP_BY_SOURCE_TABLES (dict_list_keys (DB.DBA.RDF_QM_CONTENT_OF_QM_TREE (null), 2))); --test dbg_obj_princ (dict_list_keys (DB.DBA.RDF_QM_CONTENT_OF_QM_TREE (null), 2)); --test DB.DBA.RDF_QM_TREE_DUMP ('dump/demo', null, null, null); --test DB.DBA.RDF_QM_TREE_DUMP ('dump/tpch', 'http://localhost:8600/tpch', null, null); </pre> </div> <br /> <br /> <a name="rdfperfloading" /> <h3>14.15.8. 
Loading RDF</h3> <p>There are many functions for loading RDF text, in both RDF/XML and Turtle.</p> <p>For loading RDF/XML, the best way is to split the data to be loaded into multiple streams and load these in parallel using <a href="fn_rdf_load_rdfxml.html">RDF_LOAD_RDFXML ()</a>. To avoid running out of rollback space with large files, and so that multiple concurrent loads do not interfere with each other, row-by-row autocommit mode should be enabled.</p> <p>For example: </p> <div> <pre class="programlisting">
log_enable (2);  -- switch row-by-row autocommit on and logging off for this session
DB.DBA.RDF_LOAD_RDFXML (file_to_string_output ('file.xml'), 'base_uri', 'target_graph');
-- more files here ...
checkpoint;
</pre> </div> <p>Executing a script like the above with isql loads the data. Because transaction logging is off, make a manual checkpoint at the end: there is no roll-forward log, so without the checkpoint the data would not be persisted across a server restart.</p> <p>If large amounts of data are to be loaded, run several such streams in parallel, for example 6 streams for 4 cores. Then, even if up to two threads are waiting for disk, there is still work for all cores.</p> <p>Having substantially more threads than processors or disks is not particularly useful.</p> <p>There are multithreaded load functions which load one file on multiple threads: <a href="rdfsparql.html#rdfapidataimportttlpmt">the DB.DBA.TTLP_MT() function</a> and <a href="rdfsparql.html#rdfapidataimportxmlttlpmt">the DB.DBA.RDF_LOAD_RDFXML_MT() function</a>. Experience shows, however, that loading multiple files with one thread per file is better.</p> <p>For loading Turtle, note that some platforms may have a non-reentrant Turtle parser, in which case only one load may run at a time. One can test for this by calling <a href="rdfsparql.html#rdfapidataimport">ttlp ()</a> from two sessions at the same time. 
If these do not execute concurrently, the best approach may be to try <a href="rdfsparql.html#rdfapidataimport">ttlp_mt</a> and see whether it runs faster than a single-threaded ttlp call.</p> <a name="rdfperfloadingutility" /> <h4>14.15.8.1. RDF Bulk Load Utility</h4> <p>The RDF loader utility facilitates parallel bulk loading of multiple RDF files. It maintains a database table containing the list of files to load and the status of each file: not loaded, loaded, or loaded with error. The table also records load start and end times.</p> <p>Using the utility requires a dba group login, and the file access control list in the virtuoso.ini file must be set up so that the Virtuoso server can open the files to be loaded.</p> <p>Files are added to the load list with the function <a href="fn_ld_dir.html">ld_dir</a>:</p> <div> <pre class="programlisting">
ld_dir (in dir_path varchar, in file_mask varchar, in target_graph varchar);
</pre> </div> <p>The file mask is a SQL LIKE pattern to match against the files in the directory. For example:</p> <div> <pre class="programlisting">
ld_dir ('/data8/2848260', '%.gz', 'http://bsbm.org');
</pre> </div> <p>registers for loading the RDF in all files ending in .gz in the directory given as the first parameter. 
The RDF would be loaded into the http://bsbm.org graph.</p> <p>If NULL is given for the graph, each file may go to a different graph, specified in a separate file with the name of the RDF source file plus the extension .graph.</p> <p>A .graph file contains the target graph URI without any other content or whitespace.</p> <p>The layout of the load_list table is as follows:</p> <div> <pre class="programlisting">
create table DB.DBA.LOAD_LIST (
  ll_file varchar,
  ll_graph varchar,
  ll_state int default 0,  -- 0 not started, 1 going, 2 done
  ll_started datetime,
  ll_done datetime,
  ll_host int,
  ll_work_time integer,
  ll_error varchar,
  primary key (ll_file))
alter index LOAD_LIST on DB.DBA.LOAD_LIST partition (ll_file varchar)
create index LL_STATE on DB.DBA.LOAD_LIST (ll_state, ll_file, ll_graph) partition (ll_state int)
;
</pre> </div> <p>This table may be checked at any time during a bulk load for the progress of the load. ll_state is 1 for files being loaded and 2 for files whose loading has finished. ll_error is NULL if the load finished without error; otherwise it holds the error message.</p> <p>In order to load the data from the files in load_list, run as dba:</p> <div> <pre class="programlisting">
DB.DBA.rdf_loader_run ();
</pre> </div> <p>One may run several of these commands on parallel sessions for better throughput.</p> <p>On a cluster one can do:</p> <div> <pre class="programlisting">
cl_exec ('rdf_ld_srv ()');
</pre> </div> <p>This will start one rdf_loader_run on each node of the cluster. 
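</p> <p>While a bulk load is in progress, the load_list table described above can be queried for a quick status summary. The following is a sketch using only the columns defined above:</p> <div> <pre class="programlisting">
-- Count files by state: 0 = not started, 1 = being loaded, 2 = done
SELECT ll_state, COUNT (*) FROM DB.DBA.LOAD_LIST GROUP BY ll_state;

-- List any files whose load finished with an error
SELECT ll_file, ll_error FROM DB.DBA.LOAD_LIST WHERE ll_error IS NOT NULL;
</pre> </div> <p>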
Note that in such a setting all the server processes must see the same files at the same path.</p> <p>In an isql session one may execute rdf_loader_run () & several times, forking a new isql for each such command, much as a Unix shell runs background jobs.</p> <p>Because this load is non-transactional and non-logged, one must do an explicit checkpoint after the load to guarantee a persistent state.</p> <p>On a single server do:</p> <div> <pre class="programlisting">
checkpoint;
</pre> </div> <p>On a cluster do:</p> <div> <pre class="programlisting">
cl_exec ('checkpoint');
</pre> </div> <p>The server(s) remain online and can process queries and transactions while a bulk load is in progress. Periodic checkpoints may occur during the load, but the state is guaranteed to be consistent only after a checkpoint run once all the bulk load threads have finished.</p> <p>A bulk load should not be forcibly stopped. To make a controlled stop, run:</p> <div> <pre class="programlisting">
rdf_load_stop ();
</pre> </div> <p>This will let the files being loaded at the time finish loading, but no new loads will start until explicitly started with <span class="computeroutput">rdf_loader_run ()</span>.</p> <p>Note especially that on a cluster the database will be inconsistent if one server process makes a checkpoint and another does not; a checkpoint must therefore be guaranteed on all processes. 
This is easily done with an isql script with the following content:</p> <div> <pre class="programlisting">
ld_dir ('/data8/2848260', '%.gz', 'http://bsbm.org');

-- Record CPU time
select getrusage ()[0] + getrusage ()[1];

rdf_loader_run () &
rdf_loader_run () &
rdf_loader_run () &
rdf_loader_run () &
rdf_loader_run () &
rdf_loader_run () &
rdf_loader_run () &
rdf_loader_run () &
wait_for_children;
checkpoint;

-- Record CPU time
select getrusage ()[0] + getrusage ()[1];
</pre> </div> <p>For a cluster, the equivalent is:</p> <div> <pre class="programlisting">
ld_dir ('/data8/2848260', '%.gz', 'http://bsbm.org');
cl_exec ('DB.DBA.RDF_LD_SRV (2)');
cl_exec ('checkpoint');
</pre> </div> <p> <span class="computeroutput">rdf_loader_run ()</span> recognizes several file types, including .ttl, .nt, .xml, .rdf, .owl, .nq, .n4, and others. Internally the function uses <a href="fn_ttlp.html">DB.DBA.ttlp()</a> or <a href="fn_rdf_load_rdfxml.html">DB.DBA.rdf_load_rdfxml</a>, as appropriate.</p> <p>See <a href="rdfperformancetuning.html#rdfperfloadinglod">the next section</a> for a detailed description of the <span class="computeroutput">rdf_loader_run ()</span> function.</p> <br /> <a name="rdfperfloadinglod" /> <h4>14.15.8.2. Loading LOD RDF data</h4> <p>To load RDF data into a LOD instance, perform the following steps: </p> <ul> <li>Configure and start the cluster</li> <li>Execute the file: <div> <pre class="programlisting">
--
--  $Id: rdfandsparql.xml,v 1.95.2.47 2011/10/26 00:21:16 source Exp $
--
--  Alternate RDF index scheme for cases where G unspecified
--
--  This file is part of the OpenLink Software Virtuoso Open-Source (VOS)
--  project.
--
--  Copyright (C) 1998-2009 OpenLink Software
--
--  This project is free software; you can redistribute it and/or modify it
--  under the terms of the GNU General Public License as published by the
--  Free Software Foundation; only version 2 of the License, dated June 1991.
--
--  This program is distributed in the hope that it will be useful, but
--  WITHOUT ANY WARRANTY; without even the implied warranty of
--  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
--  General Public License for more details.
--
--  You should have received a copy of the GNU General Public License along
--  with this program; if not, write to the Free Software Foundation, Inc.,
--  51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
--

drop index RDF_QUAD_OGPS;
checkpoint;

create table R2 (G iri_id_8, S iri_id_8, P iri_id_8, O any, primary key (S, P, O, G))
alter index R2 on R2 partition (S int (0hexffff00));

log_enable (2);
insert into R2 (G, S, P, O) SELECT G, S, P, O from rdf_quad;
drop table RDF_QUAD;
alter table r2 rename RDF_QUAD;
checkpoint;

create bitmap index RDF_QUAD_OPGS on RDF_QUAD (O, P, G, S) partition (O varchar (-1, 0hexffff));
create bitmap index RDF_QUAD_POGS on RDF_QUAD (P, O, G, S) partition (O varchar (-1, 0hexffff));
create bitmap index RDF_QUAD_GPOS on RDF_QUAD (G, P, O, S) partition (O varchar (-1, 0hexffff));
checkpoint;
</pre> </div> </li> <li>Execute: <div> <pre class="programlisting">
SQL>cl_exec ('checkpoint');
</pre> </div> </li> <li>Execute ld_dir ('directory' , 'mask' , 'graph'), for example: <div> <pre class="programlisting">
SQL>ld_dir ('/dbs/data', '*.gz', 'http://dbpedia.org');
</pre> </div> </li> <li>Execute on every node, each with a separate client: <div> <pre class="programlisting">
SQL>rdf_loader_run();
</pre> </div> </li> </ul> <br /> <a name="rdfperfloadingunitpro" /> <h4>14.15.8.3. 
Loading UniProt RDF data</h4> <p>To load the UniProt data, create a function such as:</p> <div> <pre class="programlisting">
create function DB.DBA.UNIPROT_LOAD (in log_mode integer := 1)
{
  DB.DBA.RDF_LOAD_RDFXML_MT (file_to_string_output('filename1'), 'http://base_uri_1', 'destination_graph_1', log_mode, 3);
  DB.DBA.RDF_LOAD_RDFXML_MT (file_to_string_output('filename2'), 'http://base_uri_2', 'destination_graph_2', log_mode, 3);
  ...
  DB.DBA.RDF_LOAD_RDFXML_MT (file_to_string_output('filename9'), 'http://base_uri_9', 'destination_graph_9', log_mode, 3);
}
</pre> </div> <p>If you are starting from a blank database that can simply be dropped and re-created should an error be signaled, use it this way:</p> <div> <pre class="programlisting">
checkpoint;
checkpoint_interval (6000);
DB.DBA.UNIPROT_LOAD (0);
checkpoint;
checkpoint_interval (60);
</pre> </div> <p>If the database already contains important data and there is no way to stop it and take a backup before the load, then use:</p> <div> <pre class="programlisting">
checkpoint;
checkpoint_interval (6000);
DB.DBA.UNIPROT_LOAD ();
checkpoint;
checkpoint_interval (60);
</pre> </div> <p>Note that the 'number of threads' parameter of DB.DBA.RDF_LOAD_RDFXML_MT() counts only the threads used to process data from the file; an extra thread reads and parses the text, so for 4 CPU cores there is no need for a parameter value greater than 3. Three processing threads per parsing thread is usually a good ratio, because parsing is roughly three times faster than the rest of the loading, so the CPU load stays well balanced. If, for example, you are using 2 x Quad Xeon, you can choose between 8 single-threaded parsers or 2 parsers with 3 processing threads each. With 4 cores you may simply load file after file with 3 processing threads. 
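</p> <p>For instance, on an 8-core machine the "2 parsers with 3 processing threads each" option might look like the sketch below; the file names, base URIs, and graph names are placeholders:</p> <div> <pre class="programlisting">
-- Session 1: one parsing thread plus 3 processing threads
DB.DBA.RDF_LOAD_RDFXML_MT (file_to_string_output ('file_a.rdf'), 'http://base_uri_a', 'graph_a', 0, 3);

-- Session 2, run concurrently from a separate isql client:
DB.DBA.RDF_LOAD_RDFXML_MT (file_to_string_output ('file_b.rdf'), 'http://base_uri_b', 'graph_b', 0, 3);
</pre> </div> <p>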
The most important performance tuning is to set the [Parameters] section of the Virtuoso configuration file:</p> <div> <pre class="programlisting">
NumberOfBuffers    = 1000000
MaxDirtyBuffers    = 800000
MaxCheckpointRemap = 1000000
DefaultIsolation   = 2
</pre> </div> <p>Note: these numbers are reasonable for a Linux box with 16 GB RAM. When no operation as massive as loading a huge database is under way, you can usually set the values as:</p> <div> <pre class="programlisting">
NumberOfBuffers    = 1500000
MaxDirtyBuffers    = 1200000
MaxCheckpointRemap = 1500000
DefaultIsolation   = 2
</pre> </div> <div class="tip"> <div class="tiptitle">See Also:</div> <p> <a href="http://www.openlinksw.com/dataspace/dav/wiki/Main/VirtConfigScale#Configuration%20Options">Virtuoso Configuration Options</a> </p> </div> <div class="tip"> <div class="tiptitle">Tip:</div> <p>After loading all the data you may therefore wish to shut down the server, adjust these settings, and start it again. If you have an ext2 or ext3 filesystem, it is better to keep enough free space on disk that it is no more than 80% full. When the disk is almost full, the database file may be allocated badly, resulting in a measurable loss of disk access speed. This is not specific to Virtuoso, but a common hint for all database-like applications doing random access to big files.</p> </div> <p>Here is an example awk script for splitting a big file into smaller ones:</p> <div> <pre class="programlisting">
BEGIN {
  file_part=1000
  e_line = "</rdf:RDF>"
  cur=0
  cur_o=0
  file=0
  part=file_part
}
{
  res_file_i="res/"FILENAME
  line=$0
  s=$1
  res_file=res_file_i"_"file".rdf"
  if (index (s, "</rdf:Description>") == 1) {
    cur=cur+1
    part=part-1
  }
  if (part > 0) {
    print line >> res_file
  }
  if (part == 0) {
    # print "===================== " cur
    print line >> res_file
    print e_line >> res_file
    close (res_file)
    file=file+1
    part=file_part
    res_file=res_file_i"_"file".rdf"
    system ("cp beg.txt " res_file)
  }
}
END { }
</pre> </div> <br /> <a name="rdfperfloadingdbpedia" /> <h4>14.15.8.4. 
Loading DBPedia RDF data</h4> <p>You can use the following script as an example of loading DBPedia RDF data into Virtuoso:</p> <div> <pre class="programlisting">
#!/bin/sh
PORT=$1
USER=$2
PASS=$3
file=$4
g=$5
LOGF=`basename $0`.log
if [ -z "$PORT" -o -z "$USER" -o -z "$PASS" -o -z "$file" -o -z "$g" ]
then
    echo "Usage: `basename $0` [DSN] [user] [password] [ttl-file] [graph-iri]"
    exit
fi
if [ ! -f "$file" -a ! -d "$file" ]
then
    echo "$file does not exist"
    exit 1
fi
mkdir READY 2>/dev/null
rm -f $LOGF $LOGF.*
echo "Starting..."
echo "Logging into: $LOGF"

DOSQL ()
{
    isql $PORT $USER $PASS verbose=on banner=off prompt=off echo=ON errors=stdout exec="$1" > $LOGF
}

LOAD_FILE ()
{
    f=$1
    g=$2
    echo "Loading $f (`cat $f | wc -l` lines) `date \"+%H:%M:%S\"`" | tee -a $LOG
    DOSQL "ttlp_mt (file_to_string_output ('$f'), '', '$g', 17); checkpoint;" > $LOGF
    if [ $? != 0 ]
    then
        echo "An error occurred, please check $LOGF"
        exit 1
    fi
    line_no=`grep Error $LOGF | awk '{ match ($0, /line [0-9]+/, x) ; match (x[0], /[0-9]+/, y); print y[0] }'`
    newf=$f.part
    inx=1
    while [ ! -z "$line_no" ]
    do
        cat $f | awk "BEGIN { i = 1 } { if (i==$line_no) { print \$0; exit; } i = i + 1 }" >> bad.nt
        line_no=`expr $line_no + 1`
        echo "Retrying from line $line_no"
        echo "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> ." > tmp.nt
        cat $f | awk "BEGIN { i = 1 } { if (i>=$line_no) print \$0; i = i + 1 }" >> tmp.nt
        mv tmp.nt $newf
        f=$newf
        mv $LOGF $LOGF.$inx
        DOSQL "ttlp_mt (file_to_string_output ('$f'), '', '$g', 17); checkpoint;" > $LOGF
        if [ $? != 0 ]
        then
            echo "An error occurred, please check $LOGF"
            exit 1
        fi
        line_no=`grep Error $LOGF | awk '{ match ($0, /line [0-9]+/, x) ; match (x[0], /[0-9]+/, y); print y[0] }'`
        inx=`expr $inx + 1`
    done
    rm -f $newf 2>/dev/null
    echo "Loaded. "
}

echo "======================================="
echo "Loading started."
echo "======================================="
if [ -f "$file" ]
then
    LOAD_FILE $file $g
    mv $file READY 2>> /dev/null
elif [ -d "$file" ]
then
    for ff in `find $file -name '*.nt'`
    do
        LOAD_FILE $ff $g
        mv $ff READY 2>> /dev/null
    done
else
    echo "The input is not a file or directory"
fi
echo "======================================="
echo "Final checkpoint."
DOSQL "checkpoint;" > temp.res
echo "======================================="
echo "Check bad.nt file for skipped triples."
echo "======================================="
exit 0
</pre> </div> <br /> <a name="rdfperfloadingbio2rdf" /> <h4>14.15.8.5. Loading Bio2RDF data</h4> <p>The shell script below was used to import files in N3 notation into OpenLink Virtuoso RDF storage.</p> <p>When a syntax error is hit, the script cuts the offending line out (saving it to bad.nt) and retries from the next line. It was used on Ubuntu Linux to import Bio2RDF and Freebase dumps.</p> <p>Note that it uses gawk, which must be available on the system. Recovery also needs additional free disk space, up to the size of the original file.</p> <div> <pre class="programlisting">
#!/bin/bash
PASS=$1
f=$2
g=$3
# Usage
if [ -z "$PASS" -o -z "$f" -o -z "$g" ]
then
    echo "Usage: $0 [password] [ttl-file] [graph-iri]"
    exit
fi
if [ ! -f "$f" ]
then
    echo "$f does not exist"
    exit
fi
# Your port here
PORT=1111 #`inifile -f dbpedia.ini -s Parameters -k ServerPort`
if test -z "$PORT"
then
    echo "Cannot find INI and inifile command"
    exit
fi
# Initial run
isql $PORT dba $PASS verbose=on banner=off prompt=off echo=ON errors=stdout exec="ttlp_mt (file_to_string_output ('$f'), '', '$g'); checkpoint;" > $0.log
# If disconnect etc.
if [ $? != 0 ]
then
    echo "An error occurred, please check $0.log"
    exit
fi
# Check for error
line_no=`grep Error $0.log | awk '{ match ($0, /line [0-9]+/, x) ; match (x[0], /[0-9]+/, y); print y[0] }'`
newf=$f.part
inx=1
# Error recovery
while [ ! -z "$line_no" ]
do
    cat $f | awk "BEGIN { i = 0 } { if (i==$line_no) { print \$0; exit; } i = i + 1 }" >> bad.nt
    line_no=`expr $line_no + 1`
    echo "Retrying from line $line_no"
    cat $f | awk "BEGIN { i = 0 } { if (i>=$line_no) print \$0; i = i + 1 }" > tmp.nt
    mv tmp.nt $newf
    f=$newf
    mv $0.log $0.log.$inx
    # Run the recovered part
    isql $PORT dba $PASS verbose=on banner=off prompt=off echo=ON errors=stdout exec="ttlp_mt (file_to_string_output ('$f'), '', '$g'); checkpoint;" > $0.log
    if [ $? != 0 ]
    then
        echo "An error occurred, please check $0.log"
        exit
    fi
    line_no=`grep Error $0.log | awk '{ match ($0, /line [0-9]+/, x) ; match (x[0], /[0-9]+/, y); print y[0] }'`
    inx=`expr $inx + 1`
done
</pre> </div> <br /> <br /> <a name="rdfperfsparul" /> <h3>14.15.9. Using SPARUL</h3> <p>Since SPARUL updates are generally not meant to be transactional, it is best to run them in <a href="fn_log_enable.html">log_enable (2)</a> mode, which commits each operation as it is done. This prevents running out of rollback space. For bulk updates, transaction logging can also be turned off; if so, do a manual checkpoint after the operation to ensure persistence across a server restart, since there is no roll-forward log.</p> <p>To have a roll-forward log and row-by-row autocommit, one may use <a href="fn_log_enable.html">log_enable (3)</a>. This writes constantly to the log, which takes extra time. Having no logging and making a checkpoint when the whole work is finished is faster.</p> <p>Many SPARUL operations can be run in parallel in this way. If they are independent with respect to their input and output, they can run in parallel, and row-by-row autocommit ensures they do not end up waiting for each other's locks.</p> <br /> <a name="rdfperfgeneraldbpedia" /> <h3>14.15.10. DBpedia Benchmark</h3> <p>We ran the DBpedia benchmark queries again with different configurations of Virtuoso. Comparing numbers given by different parties is a constant problem. 
In the case reported here, we loaded the full DBpedia 3, all languages, about 198M triples, onto Virtuoso v5 and Virtuoso Cluster v6, all on the same 4-core 2GHz Xeon with 8G RAM. All databases were striped on 6 disks. The Cluster configuration had 4 processes in the same box. We ran the queries in two variants: </p> <ul> <li>With the graph specified in the SPARQL FROM clause, using the default indices.</li> <li>With no graph specified anywhere, using an alternate indexing scheme.</li> </ul> <p>The times below are for the sequence of 5 queries. One query in the set specifies no condition on S or O, only on P, and thus cannot be answered with the default indices on Virtuoso v5. With Virtuoso Cluster v6 it can, because v6 is more space efficient. So we added the index:</p> <div> <pre class="programlisting">
create bitmap index rdf_quad_pogs on rdf_quad (p, o, g, s);
</pre> </div> <table class="data"> <tr> <th class="data" /> <th class="data">Virtuoso v5 with gspo, ogps, pogs</th> <th class="data">Virtuoso Cluster v6 with gspo, ogps</th> <th class="data">Virtuoso Cluster v6 with gspo, ogps, pogs</th> </tr> <tr> <td class="data">cold</td> <td class="data">210 s</td> <td class="data">136 s</td> <td class="data">33.4 s</td> </tr> <tr> <td class="data">warm</td> <td class="data">0.600 s</td> <td class="data">4.01 s</td> <td class="data">0.628 s</td> </tr> </table> <br /> <p>Now let us do it without a graph being specified. Note that alter index is valid for v6 or higher. 
For all platforms, we drop any existing indices, and:</p> <div> <pre class="programlisting">
create table r2 (g iri_id_8, s iri_id_8, p iri_id_8, o any, primary key (s, p, o, g))
alter index R2 on R2 partition (s int (0hexffff00));

log_enable (2);
insert into r2 (g, s, p, o) SELECT g, s, p, o from rdf_quad;
drop table rdf_quad;
alter table r2 rename RDF_QUAD;

create bitmap index rdf_quad_opgs on rdf_quad (o, p, g, s) partition (o varchar (-1, 0hexffff));
create bitmap index rdf_quad_pogs on rdf_quad (p, o, g, s) partition (o varchar (-1, 0hexffff));
create bitmap index rdf_quad_gpos on rdf_quad (g, p, o, s) partition (o varchar (-1, 0hexffff));
</pre> </div> <p>The code is identical for v5 and v6, except that with v5 we use iri_id (32 bit) for the type, not iri_id_8 (64 bit). We note that we run out of IDs with v5 at around a few billion triples, so with v6 we have double the ID length and still manage to be vastly more space efficient.</p> <p>With the above 4 indices, we can query the data in pretty much any combination without hitting a full scan of any index. We note that all indices that do not begin with s end with s as a bitmap. For data such as DBpedia, this takes about 60% of the space of a non-bitmap index.</p> <p>If you intend to do completely arbitrary RDF queries in Virtuoso, chances are you are best off with the above index scheme.</p> <table class="data"> <tr> <th class="data" /> <th class="data">Virtuoso v5 with gspo, ogps, pogs</th> <th class="data">Virtuoso Cluster v6 with gspo, ogps, pogs</th> </tr> <tr> <td class="data">warm</td> <td class="data">0.595 s</td> <td class="data">0.617 s</td> </tr> </table> <br /> <p>The cold times were about the same as above, so they are not reproduced.</p> <p>It is in the SPARQL spirit to specify a graph, and for pretty much any application there are entirely sensible ways of keeping the data in graphs and specifying which ones are concerned by queries. 
This is why Virtuoso is set up for this by default.</p> <p>On the other hand, in the open web scenario, dealing with an unknown and large number of graphs, enumerating graphs is not possible, and questions like "which graph from which source asserts x" become relevant. These are two distinct use cases which warrant different setups of the database, simple as that.</p> <p>The latter use case is not really within the SPARQL spec, so implementations may or may not support it.</p> <p>Once the indices are right, there is no difference between specifying a graph and not specifying a graph for the queries considered. With more complex queries, specifying a graph or set of graphs does allow some optimizations that cannot be done with no graph specified. For example, bitmap intersections are possible only when all leading key parts are given.</p> <p>The best warm cache time is with v5; the five queries run in under 600 ms after the first go. This is noted to show that all-in-memory with a single thread of execution is hard to beat.</p> <p>Cluster v6 performs the same queries in 623 ms. What is gained in parallelism is lost in latency if all operations complete in microseconds. On the other hand, Cluster v6 leaves v5 in the dust in any situation with less than a 100% hit rate. This is because parallelism brings real benefit once operations take longer than a few microseconds, as with disk reads. Cluster v6 also has substantially better data layout on disk, as well as fewer pages to load for the same content.</p> <p>This makes it possible to run the queries without the pogs index on Cluster v6 even when v5 takes prohibitively long.</p> <p>The point is to have a lot of RAM and a space-efficient data representation.</p> <p>For reference, the query texts specifying the graph are below. To run without specifying the graph, just drop the FROM <http://dbpedia.org> from each query. 
The returned row counts are indicated below each query's text.</p> <div> <pre class="programlisting">
SQL>SPARQL SELECT ?p ?o FROM <http://dbpedia.org> WHERE { <http://dbpedia.org/resource/Metropolitan_Museum_of_Art> ?p ?o . };
p                                                o
VARCHAR                                          VARCHAR
_______________________________________________________________________________

http://www.w3.org/1999/02/22-rdf-syntax-ns#type  http://umbel.org/umbel/ac/Artifact
http://www.w3.org/1999/02/22-rdf-syntax-ns#type  http://dbpedia.org/class/yago/MuseumsInNewYorkCity
http://www.w3.org/1999/02/22-rdf-syntax-ns#type  http://dbpedia.org/class/yago/ArtMuseumsAndGalleriesInTheUnitedStates
http://www.w3.org/1999/02/22-rdf-syntax-ns#type  http://dbpedia.org/class/yago/Museum103800563
..
-- 335 rows

SQL>SPARQL PREFIX p: <http://dbpedia.org/property/>
SELECT ?film1 ?actor1 ?film2 ?actor2 FROM <http://dbpedia.org> WHERE {
  ?film1 p:starring <http://dbpedia.org/resource/Kevin_Bacon> .
  ?film1 p:starring ?actor1 .
  ?film2 p:starring ?actor1 .
  ?film2 p:starring ?actor2 .
};
film1                                       actor1                                   film2                                       actor2
VARCHAR                                     VARCHAR                                  VARCHAR                                     VARCHAR
_______________________________________________________________________________

http://dbpedia.org/resource/The_River_Wild  http://dbpedia.org/resource/Kevin_Bacon  http://dbpedia.org/resource/The_River_Wild  http://dbpedia.org/resource/Kevin_Bacon
http://dbpedia.org/resource/The_River_Wild  http://dbpedia.org/resource/Kevin_Bacon  http://dbpedia.org/resource/The_River_Wild  http://dbpedia.org/resource/Meryl_Streep
http://dbpedia.org/resource/The_River_Wild  http://dbpedia.org/resource/Kevin_Bacon  http://dbpedia.org/resource/The_River_Wild  http://dbpedia.org/resource/Joseph_Mazzello
http://dbpedia.org/resource/The_River_Wild  http://dbpedia.org/resource/Kevin_Bacon  http://dbpedia.org/resource/The_River_Wild  http://dbpedia.org/resource/David_Strathairn
http://dbpedia.org/resource/The_River_Wild  http://dbpedia.org/resource/Kevin_Bacon  http://dbpedia.org/resource/The_River_Wild  http://dbpedia.org/resource/John_C._Reilly
...
-- 23910 rows

SQL>SPARQL PREFIX p: <http://dbpedia.org/property/>
SELECT ?artist ?artwork ?museum ?director FROM <http://dbpedia.org> WHERE {
  ?artwork p:artist ?artist .
  ?artwork p:museum ?museum .
  ?museum p:director ?director
};
artist                                         artwork                                           museum                                                director
VARCHAR                                        VARCHAR                                           VARCHAR                                               VARCHAR
_______________________________________________

http://dbpedia.org/resource/Paul_C%C3%A9zanne  http://dbpedia.org/resource/The_Basket_of_Apples  http://dbpedia.org/resource/Art_Institute_of_Chicago  James Cuno
http://dbpedia.org/resource/Paul_Signac  http://dbpedia.org/resource/Neo-impressionism  http://dbpedia.org/resource/Art_Institute_of_Chicago  James Cuno
http://dbpedia.org/resource/Georges_Seurat  http://dbpedia.org/resource/Neo-impressionism  http://dbpedia.org/resource/Art_Institute_of_Chicago  James Cuno
http://dbpedia.org/resource/Edward_Hopper  http://dbpedia.org/resource/Nighthawks  http://dbpedia.org/resource/Art_Institute_of_Chicago  James Cuno
http://dbpedia.org/resource/Mary_Cassatt  http://dbpedia.org/resource/The_Child%27s_Bath  http://dbpedia.org/resource/Art_Institute_of_Chicago  James Cuno
..
-- 303 rows

SQL>SPARQL PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?s ?homepage FROM <http://dbpedia.org> WHERE {
  <http://dbpedia.org/resource/Berlin> geo:lat ?berlinLat .
  <http://dbpedia.org/resource/Berlin> geo:long ?berlinLong .
  ?s geo:lat ?lat .
  ?s geo:long ?long .
  ?s foaf:homepage ?homepage .
  FILTER ( ?lat <= ?berlinLat + 0.03190235436 &&
           ?long >= ?berlinLong - 0.08679199218 &&
           ?lat >= ?berlinLat - 0.03190235436 &&
           ?long <= ?berlinLong + 0.08679199218)
};
s                                                  homepage
VARCHAR                                            VARCHAR
_______________________________________________________________________________

http://dbpedia.org/resource/Berlin_University_of_the_Arts  http://www.udk-berlin.de/
http://dbpedia.org/resource/Berlin_University_of_the_Arts  http://www.udk-berlin.de/
http://dbpedia.org/resource/Berlin_Zoological_Garden  http://www.zoo-berlin.de/en.html
http://dbpedia.org/resource/Federal_Ministry_of_the_Interior_%28Germany%29  http://www.bmi.bund.de
http://dbpedia.org/resource/Neues_Schauspielhaus  http://www.goya-berlin.com/
http://dbpedia.org/resource/Bauhaus_Archive  http://www.bauhaus.de/english/index.htm
http://dbpedia.org/resource/Canisius-Kolleg_Berlin  http://www.canisius-kolleg.de
http://dbpedia.org/resource/Franz%C3%B6sisches_Gymnasium_Berlin  http://www.fg-berlin.cidsnet.de
..
-- 48 rows

SQL>SPARQL PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX p: <http://dbpedia.org/property/>
SELECT ?s ?a ?homepage FROM <http://dbpedia.org> WHERE {
  <http://dbpedia.org/resource/New_York_City> geo:lat ?nyLat .
  <http://dbpedia.org/resource/New_York_City> geo:long ?nyLong .
  ?s geo:lat ?lat .
  ?s geo:long ?long .
  ?s p:architect ?a .
  ?a foaf:homepage ?homepage .
  FILTER ( ?lat <= ?nyLat + 0.3190235436 &&
           ?long >= ?nyLong - 0.8679199218 &&
           ?lat >= ?nyLat - 0.3190235436 &&
           ?long <= ?nyLong + 0.8679199218)
};
s                                        a                                        homepage
VARCHAR                                  VARCHAR                                  VARCHAR
_______________________________________________________________________________

http://dbpedia.org/resource/GE_Building  http://dbpedia.org/resource/Associated_Architects  http://www.associated-architects.co.uk
http://dbpedia.org/resource/Giants_Stadium  http://dbpedia.org/resource/HNTB  http://www.hntb.com/
http://dbpedia.org/resource/Fort_Tryon_Park_and_the_Cloisters  http://dbpedia.org/resource/Frederick_Law_Olmsted  http://www.asla.org/land/061305/olmsted.html
http://dbpedia.org/resource/Central_Park  http://dbpedia.org/resource/Frederick_Law_Olmsted  http://www.asla.org/land/061305/olmsted.html
http://dbpedia.org/resource/Prospect_Park_%28Brooklyn%29  http://dbpedia.org/resource/Frederick_Law_Olmsted  http://www.asla.org/land/061305/olmsted.html
http://dbpedia.org/resource/Meadowlands_Stadium  http://dbpedia.org/resource/360_Architecture  http://oakland.athletics.mlb.com/oak/ballpark/new/faq.jsp
http://dbpedia.org/resource/Citi_Field  http://dbpedia.org/resource/HOK_Sport_Venue_Event  http://www.hoksve.com/
http://dbpedia.org/resource/Citigroup_Center  http://dbpedia.org/resource/Hugh_Stubbins_Jr.  http://www.klingstubbins.com
http://dbpedia.org/resource/150_Greenwich_Street  http://dbpedia.org/resource/Fumihiko_Maki  http://www.pritzkerprize.com/maki2.htm
http://dbpedia.org/resource/Freedom_Tower  http://dbpedia.org/resource/David_Childs  http://www.som.com/content.cfm/www_david_m_childs
http://dbpedia.org/resource/7_World_Trade_Center  http://dbpedia.org/resource/David_Childs  http://www.som.com/content.cfm/www_david_m_childs
http://dbpedia.org/resource/The_New_York_Times_Building  http://dbpedia.org/resource/Renzo_Piano  http://www.rpbw.com/
http://dbpedia.org/resource/Trump_World_Tower  http://dbpedia.org/resource/Costas_Kondylis  http://www.kondylis.com

13 Rows. -- 2183 msec. 
</pre> </div> <br /> <a name="rdfstorebenchmarks" /> <h3>14.15.11. RDF Store Benchmarks</h3> <a name="rdfstorebenchmarksintroduction" /> <h4>14.15.11.1. Introduction</h4> <p>In any particular RDF store benchmark, it makes a difference whether the queries are executed against one specified graph or against several graphs. As Virtuoso is a quad store, not a triple store with many tables, it runs queries inefficiently when no graph is specified and the only indexes are the pre-set GSPO and OGPS. Proper use of the FROM clause, or adding indexes that include the graph column, will contribute to better results. </p> <br /> <a name="rdfstorebenchmarksindexusage" /> <h4>14.15.11.2. Using bitmap indexes</h4> <p>If it is known in advance that some users will not indicate specific graphs in their queries, then one of the following should be done: </p> <ul> <li>either create indexes with the graph column in last position,</li> <li>or load everything into a single graph and specify that graph somewhere in the querying application.</li> </ul> <p>Neither method requires any changes to query texts.</p> <ul> <li>For users of Virtuoso 5, the use of additional bitmap indexes is strongly recommended: <div> <pre class="programlisting">
SQL> create bitmap index RDF_QUAD_POGS on DB.DBA.RDF_QUAD (P,O,G,S);
SQL> create bitmap index RDF_QUAD_PSOG on DB.DBA.RDF_QUAD (P,S,O,G);
</pre> </div> </li> <li>For users of Virtuoso 6 or higher, see the new layout <a href="rdfperformancetuning.html#rdfperfrdfscheme">here</a>.</li> </ul> <p>You can create other indexes as well. Bitmap indexes are preferable, but if O is the last column then the index cannot be bitmap. So, for example, this index is possible:</p> <div> <pre class="programlisting">
create index RDF_QUAD_PSGO on DB.DBA.RDF_QUAD (P, S, G, O);
</pre> </div> <p>but this is not:</p> <div> <pre class="programlisting">
create bitmap index RDF_QUAD_PSGO on DB.DBA.RDF_QUAD (P, S, G, O);
</pre> </div> <br /> <br /> <a name="fastapproxdiffandpatch" /> <h3>14.15.12. 
Fast Approximate RDF Graph Diff and Patch</h3> <p>The two algorithms described below resemble "unified diff" and "patch by unified diff", but they work on RDF graphs, not on plain texts. </p> <p>They work reasonably well for graphs composed from CBDs (concise bounded descriptions) of some subjects, if these subjects are either "named" IRIs or can be identified by the values of their inverse functional properties. </p> <p>Many sorts of commonly used graphs match these restrictions, including all graphs without blank nodes, most FOAF files, graphs that can be "pretty-printed" in JSON, most dumps of relational databases, etc. </p> <p>The basic idea is as simple as a zipper: </p> <ul> <li>Place one graph at the left and the other at the right.</li> <li>Find a retainer box at the right and a matching pin at the left.</li> <li>Join them.</li> <li>Pull the slider as far as possible.</li> <li>Repeat this while there are pins and boxes that can be matched and sliders that can be moved.</li> </ul> <p>An IRI in the left graph (say, G1) matches the same IRI in the right graph (G2) as a pin matches a box. The same is true for literals. </p> <p>Functional and inverse functional properties are the teeth that form chains; the algorithm "moves sliders" along these chains, incrementally connecting more and more nodes. </p> <p>If there is a match of this sort (O1 in G1 matches O2 in G2) and the matched nodes are values of the same inverse functional property P (there are { S1 P O1 } in G1 and { S2 P O2 } in G2), then we guess that S1 matches S2. </p> <p>If S1 in G1 matches S2 in G2 and the matched nodes are subjects of the same functional property P (there are { S1 P N1 } in G1 and { S2 P N2 } in G2), then we guess that N1 matches N2. Now it is possible to try the same interaction on triples where N1 and N2 are in subject position; that is how sliders move. A typical example of a long zipper is a closed list with matched heads. </p> <a name="fastapproxdiffandpatchhow" /> <h4>14.15.12.1. 
Make a Diff And Use It</h4> <ul> <li>Using <a href="fn_rdf_graph_diff.html">DB.DBA.RDF_GRAPH_DIFF</a> </li> <li>Using <a href="fn_rdf_suo_diff_ttl.html">DB.DBA.RDF_SUO_DIFF_TTL</a> </li> <li>Using <a href="fn_rdf_suo_apply_patch.html">DB.DBA.RDF_SUO_APPLY_PATCH</a> </li> </ul> <br /> <a name="fastapproxdiffandpatchclt" /> <h4>14.15.12.2. Collect Functional And Inverse Functional Properties</h4> <p>A list of functional properties can be retrieved from an ontology graph by a query like:</p> <div> <pre class="programlisting">
SPARQL define output:valmode "LONG"
SELECT (<LONG::sql:VECTOR_AGG(?s)>)
FROM <my-ontology-graph>
WHERE { ?s a owl:FunctionalProperty }
</pre> </div> <p>Inverse functional properties can be retrieved by a similar query, but unfortunately the ontology may mention so-called NULL values, values that can be property values for many subjects. The current implementation of diff and patch does not recognize NULL values, so they can cause the patch to fail with "false alarm" errors. The workaround is to retrieve only properties that have no NULL values declared: </p> <div> <pre class="programlisting">
SPARQL define output:valmode "LONG"
SELECT (<LONG::sql:VECTOR_AGG(?s)>)
FROM <my-ontology-graph>
WHERE { ?s a owl:InverseFunctionalProperty .
    OPTIONAL { ?s owl:nullIFPValue ?v }
    FILTER (!bound(?v)) }
</pre> </div> <p>If no ontology is available, then appropriate predicates can be obtained from sample graphs using <a href="fn_rdf_graph_collect_fp_list.html">DB.DBA.RDF_GRAPH_COLLECT_FP_LIST</a>. </p> <br /> <a name="fastapproxdiffandpatchimpl" /> <h4>14.15.12.3. Implementation-Specific Extensions of GUO Ontology</h4> <p> <strong>Note</strong>: This section contains implementation details that are needed only if you want to write your own patch or diff procedure; you do not have to worry about internals if you use the existing procedures. </p> <p>The basic GUO ontology is not expressive enough to work with blank nodes, so some custom extensions are needed. 
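</p> <p>For orientation, here is what a plain GUO update instruction looks like before any extensions come into play. This is a hypothetical hand-written fragment (the graph, subject, and property IRIs are invented for illustration): </p> <div> <pre class="programlisting">
@prefix guo: <http://webr3.org/owl/guo#> .

[] a guo:UpdateInstruction ;
   guo:target_graph <http://example.com/g> ;
   guo:target_subject <http://example.com/s> ;
   guo:delete [ <http://example.com/p> "old value" ] ;
   guo:insert [ <http://example.com/p> "new value" ] .
</pre> </div> <p>Such an instruction can only name its target subject by IRI, which is exactly what fails for blank nodes; hence the extensions described below. 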
</p> <p>In the rest of the description, the prefix declaration:</p> <div> <pre class="programlisting">
@prefix guo: <http://webr3.org/owl/guo#>
</pre> </div> <p>is assumed.</p> <p>The diff contains one node of rdf:type guo:diff. </p> <p>For debugging purposes it has properties guo:graph1 and guo:graph2 that correspond to the gfrom and gto arguments of <a href="fn_rdf_suo_diff_ttl.html">DB.DBA.RDF_SUO_DIFF_TTL</a>. </p> <p>The diff also contains zero or more nodes of rdf:type guo:UpdateInstruction. These nodes are as described in the basic GUO ontology, but guo:target_graph is now optional, guo:target_subject can be a blank node, and the objects of predicates "inside" the values of guo:insert and guo:delete can also be blank nodes. These blank nodes are "placeholders" for values calculated according to the most important GUO extension: rule nodes. </p> <p>There are eight sorts of rule nodes: four for the gfrom side of the diff and four similar ones for the gto side. Of the four sorts related to one side, two are for functional properties and two similar ones are for inverse functional properties. Thus the rdf:type-s of these nodes are: </p> <div> <pre class="programlisting">
guo:from-rule-FP0 , guo:from-rule-FP1 , guo:from-rule-IFP0 , guo:from-rule-IFP1
</pre> </div> <p>and </p> <div> <pre class="programlisting">
guo:to-rule-FP0 , guo:to-rule-FP1 , guo:to-rule-IFP0 , guo:to-rule-IFP1
</pre> </div> <p>Each rule node has a property guo:order whose value is a non-negative integer. </p> <p>These integers enumerate all guo:from-rule-... nodes, starting from zero. </p> <p>When the patch procedure runs, the rules are applied in this order; the result of each rule is a blank node that either already exists in the graph or is newly created. </p> <p>All results are remembered for use in the rest of the patch procedure. </p> <p>Similarly, another sequence of integers enumerates all guo:to-rule-... nodes, also starting from zero. </p> <p>Below we consider only a sequence of guo:from-rule-... nodes, because guo:to-rule-... nodes have identical properties. 
</p> <p>A rule node can have zero or more values of the guo:dep property; each value is a blank node denoting a rule node that should be calculated before the current one. </p> <p>Every rule has exactly one guo:path predicate whose value is a blank node. Each property of this blank node describes one possible "move of the slider": the predicate to follow is in the predicate position and the node to start from is in the object position. An IRI or a literal in the object position is used as is; a blank node in the object position should be of type guo:from-rule-... and have a smaller guo:order, so that it refers to the already calculated result of some preceding rule. </p> <p>A rule of the form: </p> <div> <pre class="programlisting">
R a guo:from-rule-IFP1 ;
  guo:path [ P1 O1 ; P2 O2 ; ... ; Pn On ] .
</pre> </div> <p>searches for a unique blank node _:Rres that is the common subject of the triples: </p> <div> <pre class="programlisting">
_:Rres P1 O1
_:Rres P2 O2
. . .
_:Rres Pn On
</pre> </div> <p>in the gfrom graph. </p> <p>If the subjects differ in these triples, or some triples are not found, or the subject is not a blank node, then an appropriate error is logged and the rule fails; otherwise _:Rres is remembered as the result of the rule. </p> <p>Similarly, a rule of the form: </p> <div> <pre class="programlisting">
R a guo:from-rule-FP1 ;
  guo:path [ P1 O1 ; P2 O2 ; ... ; Pn On ] .
</pre> </div> <p>searches for a unique blank node _:Rres that is the common object of the triples: </p> <div> <pre class="programlisting">
O1 P1 _:Rres
O2 P2 _:Rres
. . .
On Pn _:Rres
</pre> </div> <p>in the gfrom graph. </p> <p>A rule of the form: </p> <div> <pre class="programlisting">
R a guo:from-rule-IFP0 ;
  guo:path [ P1 O1 ; P2 O2 ; ... ; Pn On ] .
</pre> </div> <p>ensures that the gfrom graph does not contain any triple like: </p> <div> <pre class="programlisting">
_:Rres P1 O1
_:Rres P2 O2
</pre> </div> <p>or </p> <div> <pre class="programlisting">
_:Rres Pn On
</pre> </div> <p>It is an error if any such triple exists. 
If nothing is found, then the result of the rule is a newly created unique blank node. That is how the patch procedure creates new blank nodes when it inserts "totally new" data. </p> <p>Similarly, a rule of the form: </p> <div> <pre class="programlisting">
R a guo:from-rule-FP0 ;
  guo:path [ P1 O1 ; P2 O2 ; ... ; Pn On ] .
</pre> </div> <p>ensures that the gfrom graph does not contain any triple like: </p> <div> <pre class="programlisting">
O1 P1 _:Rres
O2 P2 _:Rres
</pre> </div> <p>or </p> <div> <pre class="programlisting">
On Pn _:Rres
</pre> </div> <p>The current version of the patch procedure does not use the guo:to-rule-... rules; however, they can be used by custom procedures in a few ways. First, these rules can be used to produce a "reversed diff". Next, they can be used to validate the result of the patch: if the patch cannot be reverted, then the result is "suspicious". </p> <br /> <br /> <a name="rdb2rdftriggers" /> <h3>14.15.13. RDB2RDF Triggers</h3> <p>RDF Views have many advantages compared to static dumps of the database as RDF triples. However, they do not solve a few problems. First, inference is supported only for physically stored triples, so one has to choose between the convenience of inference and the convenience of RDF Views. Next, the algorithms that select triples with non-constant graphs and predicates tend to produce enormous SQL query texts if the RDF Views are complicated enough. Finally, there may be a need to export a big and fresh static RDF dump, but preparing this dump would take too much time via both RDF Views and traditional methods. </p> <p>The solution is a set of triggers on the source tables of an RDF View that edit parts of a physical dump on each change of the source data. Unlike RDF Views, which cost nothing while not queried, these triggers continuously add a significant overhead to any data manipulation on the sources. To compensate for this, the dump should be in intensive use and not replaceable by RDF Views. In other cases, do not add these triggers. 
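</p> <p>To make the scale of the problem concrete, even a trivial single-table view calls for a trigger along the following lines. This is a hypothetical hand-written sketch, not generated code: the table, graph IRI and property IRIs are invented; DB.DBA.RDF_QUAD_URI and DB.DBA.RDF_QUAD_URI_L insert one quad with an IRI object or a literal object, respectively. </p> <div> <pre class="programlisting">
-- Hypothetical sketch: materialize each newly inserted row of Demo.demo.Products
-- as physical triples in the graph <http://example.com/products#>.
create trigger PRODUCTS_RDB2RDF_I after insert on Demo.demo.Products
{
  declare s_iri varchar;
  s_iri := sprintf ('http://example.com/products#%d', ProductID);
  DB.DBA.RDF_QUAD_URI ('http://example.com/products#', s_iri,
    'http://www.w3.org/1999/02/22-rdf-syntax-ns#type',
    'http://example.com/schema#Product');
  DB.DBA.RDF_QUAD_URI_L ('http://example.com/products#', s_iri,
    'http://example.com/schema#productName', ProductName);
}
;
</pre> </div> <p>A real generated trigger must additionally handle updates and deletes, multiple quad maps producing the same triple, and views excluded via OPTION(EXCLUSIVE), which is why generating the triggers from RDF View metadata is the practical route. 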
</p> <p>It is next to impossible to write such triggers by hand, so a small API is provided to generate their SQL texts from the metadata of RDF Views. </p> <p>First of all, the views in an RDF storage do not work in full isolation from each other. Some of them may partially disable others due to OPTION(EXCLUSIVE), and some may produce one triple in different ways. As a result, triggers are not made on a per-view basis. Instead, a special RDF storage is introduced, namely virtrdf:SyncToQuads; all required triples are added to it, and triggers are created for the whole storage. Typically an RDF View is created in some other storage, e.g., virtrdf:DefaultQuadStorage, and then added to virtrdf:SyncToQuads via: </p> <div> <pre class="programlisting">
sparql
alter quad storage virtrdf:SyncToQuads
  { create <my_rdf_view> using storage virtrdf:DefaultQuadStorage };
</pre> </div> <p>The following example procedure copies all user-defined RDF Views from the default quad storage to virtrdf:SyncToQuads: </p> <div> <pre class="programlisting">
create procedure DB.DBA.RDB2RDF_COPY_ALL_RDF_VIEWS_TO_SYNC ()
{
  for (sparql define input:storage ""
      select (bif:aref (bif:sprintf_inverse (str (?idx), bif:concat (str (rdf:_), "%d"), 0), 0)) ?qm
      from virtrdf:
      where { virtrdf:DefaultQuadStorage-UserMaps ?idx ?qm .
              ?qm a virtrdf:QuadMap }
      order by asc (bif:sprintf_inverse (str (?idx), bif:concat (str (rdf:_), "%d"), 1)) ) do
    exec (sprintf ('sparql alter quad storage virtrdf:SyncToQuads { create <%s> using storage virtrdf:DefaultQuadStorage }', "qm"));
}
;
</pre> </div> <p>When the virtrdf:SyncToQuads storage is fully prepared, two API functions can be used: </p> <ul> <li> <a href="fn_sparql_rdb2rdf_list_tables.html">DB.DBA.SPARQL_RDB2RDF_LIST_TABLES</a>: The function returns a vector of the names of tables that are used as sources for RDF Views. The application developer should decide what to do with each of them: create triggers, or apply some application-specific workaround. 
<p>Note that if some SQL views are used as sources for RDF Views and these views do not have INSTEAD triggers, then workarounds become mandatory for them, not just an option, because BEFORE or AFTER triggers on a view are not allowed unless there is an appropriate INSTEAD trigger. The mode argument should be zero in the current version. </p> </li> <li> <a href="fn_sparql_rdb2rdf_codegen.html">DB.DBA.SPARQL_RDB2RDF_CODEGEN</a>: The function creates an SQL text for a given table and an operation specified by an opcode. </li> </ul> <p>In some cases, RDF Views are complicated enough that BEFORE UPDATE and AFTER DELETE triggers are required in addition to the minimal set. In this case, sparql_rdb2rdf_codegen calls will return a vector of two string sessions rather than a single string session, and both sessions are SQL texts to inspect or execute. In this case, the BEFORE trigger will not delete obsolete quads from the RDF_QUAD table; instead it will create records in a special table, RDF_QUAD_DELETE_QUEUE, as guesses about what can be deleted. The AFTER trigger will re-check these guesses, delete the related quads if needed, and shorten RDF_QUAD_DELETE_QUEUE. </p> <p>The extra activity of triggers on RDF_QUAD, RDF_OBJ, RDF_QUAD_DELETE_QUEUE and other tables and indexes of the storage of "physical" triples may cause deadlocks, so the application should be carefully checked for proper deadlock handling if deadlocks were rare before turning RDB2RDF triggers on. In some cases, the whole RDB2RDF processing can be moved to a separate server connected to the main workhorse server via replication. </p> <p>The following example functions create the texts of all triggers, save them to files for further study, and try to load them. That is probably a bad scenario for a production database, because it is better to read procedures before loading them, especially if they are triggers, and especially if some of them may contain errors. 
</p> <div> <pre class="programlisting">
-- This creates one or two files with one or two triggers or other texts,
-- and tries to load the generated SQL texts.
create procedure DB.DBA.RDB2RDF_EXEC_CODEGEN1_FOR_TABLE (
  in dump_prefix varchar, in tbl varchar, in dump_id any, in txt any )
{
  declare fname varchar;
  declare stat, msg varchar;
  if (isinteger (dump_id))
    dump_id := cast (dump_id as varchar);
  if (__tag of vector = __tag (txt))
    {
      DB.DBA.RDB2RDF_EXEC_CODEGEN1_FOR_TABLE (dump_prefix, tbl, dump_id, txt[0]);
      DB.DBA.RDB2RDF_EXEC_CODEGEN1_FOR_TABLE (dump_prefix, tbl, dump_id || 'p', txt[1]);
      return;
    }
  if (__tag of varchar <> __tag (txt))
    txt := string_output_string (txt);
  fname := sprintf ('%s_Rdb2Rdf.%s.%s.sql', dump_prefix, tbl, dump_id);
  string_to_file (fname, txt || '\n;\n', -2);
  if ('0' = dump_id)
    return;
  stat := '00000'; msg := '';
  exec (txt, stat, msg);
  if ('00000' <> stat)
    {
      string_to_file (fname, '\n\n- - - - - 8< - - - - -\n\nError ' || stat || ' ' || msg, -1);
      if (not (subseq (msg, 0, 5) in ('SQ091')))
        signal (stat, msg);
    }
}
;

-- This creates and loads all triggers, the init procedure and the debug dump
-- related to one table.
create procedure DB.DBA.RDB2RDF_PREPARE_TABLE (in dump_prefix varchar, in tbl varchar)
{
  declare ctr integer;
  for (ctr := 0; ctr <= 4; ctr := ctr + 1)
    DB.DBA.RDB2RDF_EXEC_CODEGEN1_FOR_TABLE (dump_prefix, tbl, ctr,
      sparql_rdb2rdf_codegen (tbl, ctr));
}
;

-- This creates and loads all triggers, the init procedure and the debug dump
-- related to all tables used by an RDF View.
create procedure DB.DBA.RDB2RDF_PREPARE_ALL_TABLES (in dump_prefix varchar)
{
  declare tbl_list any;
  tbl_list := sparql_rdb2rdf_list_tables (0);
  foreach (varchar tbl in tbl_list) do
    {
      DB.DBA.RDB2RDF_PREPARE_TABLE (dump_prefix, tbl);
    }
}
;
</pre> </div> <p>The following combination of calls prepares all triggers for all RDF Views of the default storage: </p> <div> <pre class="programlisting">
DB.DBA.RDB2RDF_COPY_ALL_RDF_VIEWS_TO_SYNC ();
DB.DBA.RDB2RDF_PREPARE_ALL_TABLES (cast (now() as varchar));
</pre> </div> <p>This does not copy the initial state of the RDB2RDF graphs to the physical storage, because that can be dangerous for existing RDF data; even if all procedures work as expected, they may produce huge amounts of RDF data, exceed transaction log limits, and thus require application-specific precautions. It is also possible to do the initial loading with a SPARUL statement like: </p> <div> <pre class="programlisting">
SPARQL INSERT IN <snapshot-graph> { ?s ?p ?o }
FROM <snapshot-graph>
WHERE { quad map <id-of-rdf-view> { ?s ?p ?o } };
</pre> </div> <br /> <table border="0" width="90%" id="navbarbottom"> <tr> <td align="left" width="33%"> <a href="rdfsparqlgeospat.html" title="RDF and Geometry">Previous</a> <br />RDF and Geometry</td> <td align="center" width="34%"> <a href="rdfandsparql.html">Chapter Contents</a> </td> <td align="right" width="33%"> <a href="rdfnativestorageproviders.html" title="RDF Data Access Providers (Drivers)">Next</a> <br />RDF Data Access Providers (Drivers)</td> </tr> </table> </div> <div id="footer"> <div>Copyright© 1999 - 2009 OpenLink Software All rights reserved.</div> <div id="validation"> <a href="http://validator.w3.org/check/referer"> <img src="http://www.w3.org/Icons/valid-xhtml10" alt="Valid XHTML 1.0!" height="31" width="88" /> </a> <a href="http://jigsaw.w3.org/css-validator/"> <img src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!" height="31" width="88" /> </a> </div> </div> </body> </html>