Sophie

Sophie

distrib > Fedora > 14 > x86_64 > media > updates > by-pkgid > 71d40963b505df4524269198e237b3e3 > files > 77

virtuoso-opensource-doc-6.1.4-2.fc14.noarch.rpm

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
 <head profile="http://internetalchemy.org/2003/02/profile">
  <link rel="foaf" type="application/rdf+xml" title="FOAF" href="http://www.openlinksw.com/dataspace/uda/about.rdf" />
  <link rel="schema.dc" href="http://purl.org/dc/elements/1.1/" />
  <meta name="dc.title" content="18. Free Text Search" />
  <meta name="dc.subject" content="18. Free Text Search" />
  <meta name="dc.creator" content="OpenLink Software Documentation Team ;&#10;" />
  <meta name="dc.copyright" content="OpenLink Software, 1999 - 2009" />
  <link rel="top" href="index.html" title="OpenLink Virtuoso Universal Server: Documentation" />
  <link rel="search" href="/doc/adv_search.vspx" title="Search OpenLink Virtuoso Universal Server: Documentation" />
  <link rel="parent" href="freetext.html" title="Chapter Contents" />
  <link rel="prev" href="txtidxquickstart.html" title="Basic Concepts" />
  <link rel="next" href="queryingftcols.html" title="Querying Free Text Indexes" />
  <link rel="shortcut icon" href="../images/misc/favicon.ico" type="image/x-icon" />
  <link rel="stylesheet" type="text/css" href="doc.css" />
  <link rel="stylesheet" type="text/css" href="/doc/translation.css" />
  <title>18. Free Text Search</title>
  <meta http-equiv="Content-Type" content="text/xhtml; charset=UTF-8" />
  <meta name="author" content="OpenLink Software Documentation Team ;&#10;" />
  <meta name="copyright" content="OpenLink Software, 1999 - 2009" />
  <meta name="keywords" content="" />
  <meta name="GENERATOR" content="OpenLink XSLT Team" />
 </head>
 <body>
  <div id="header">
    <a name="creatingtxtidxs" />
    <img src="../images/misc/logo.jpg" alt="" />
    <h1>18. Free Text Search</h1>
  </div>
  <div id="navbartop">
   <div>
      <a class="link" href="freetext.html">Chapter Contents</a> | <a class="link" href="txtidxquickstart.html" title="Basic Concepts">Prev</a> | <a class="link" href="queryingftcols.html" title="Querying Free Text Indexes">Next</a>
   </div>
  </div>
  <div id="currenttoc">
   <form method="post" action="/doc/adv_search.vspx">
    <div class="search">Keyword Search: <br />
        <input type="text" name="q" /> <input type="submit" name="go" value="Go" />
    </div>
   </form>
   <div>
      <a href="http://www.openlinksw.com/">www.openlinksw.com</a>
   </div>
   <div>
      <a href="http://docs.openlinksw.com/">docs.openlinksw.com</a>
   </div>
    <br />
   <div>
      <a href="index.html">Book Home</a>
   </div>
    <br />
   <div>
      <a href="contents.html">Contents</a>
   </div>
   <div>
      <a href="preface.html">Preface</a>
   </div>
    <br />
   <div class="selected">
      <a href="freetext.html">Free Text Search</a>
   </div>
    <br />
   <div>
      <a href="txtidxquickstart.html">Basic Concepts</a>
   </div>
   <div class="selected">
      <a href="creatingtxtidxs.html">Creating Free Text Indexes</a>
    <div>
        <a href="#createtxtidxstmt" title="The CREATE TEXT INDEX statement">The CREATE TEXT INDEX statement</a>
        <a href="#appspecificdocid" title="Choosing An Application Specific Document ID">Choosing An Application Specific Document ID</a>
        <a href="#compositedatatype" title="The composite Data Type">The composite Data Type</a>
        <a href="#fttexamples" title="Free Text Index Examples">Free Text Index Examples</a>
        <a href="#preprocessingandext" title="Pre-processing and Extending the Content Being Indexed">Pre-processing and Extending the Content Being Indexed</a>
        <a href="#hitscores" title="Hit Scores">Hit Scores</a>
        <a href="#wordranges" title="Word Ranges">Word Ranges</a>
        <a href="#offbanddata" title="Using Offband Data for Faster Filtering">Using Offband Data for Faster Filtering</a>
        <a href="#orderofhits" title="Order of Hits">Order of Hits</a>
        <a href="#noisewords" title="Noise Words">Noise Words</a>
    </div>
   </div>
   <div>
      <a href="queryingftcols.html">Querying Free Text Indexes</a>
   </div>
   <div>
      <a href="txttrig.html">Text Triggers</a>
   </div>
   <div>
      <a href="tablesandinternals.html">Generated Tables and Internals</a>
   </div>
   <div>
      <a href="droptxtindex.html">Removing A Text Index</a>
   </div>
   <div>
      <a href="droptxttrig.html">Removing A Text Trigger</a>
   </div>
   <div>
      <a href="ftinternationalization.html">Internationalization &amp; Unicode</a>
   </div>
   <div>
      <a href="ftperformance.html">Performance</a>
   </div>
   <div>
      <a href="fttfuncs.html">Free Text Functions</a>
   </div>
    <br />
  </div>
  <div id="text">
    <a name="creatingtxtidxs" />
    <h2>18.2. Creating Free Text Indexes</h2>
  <a name="createtxtidxstmt" />
    <h3>18.2.1. The CREATE TEXT INDEX statement</h3>

	<p>
Define and optionally initialize a text index on a column.
</p>

	<div>
      <pre class="programlisting">
create_freetext_index
	: CREATE TEXT [XML] INDEX ON q_table_name &#39;(&#39; column &#39;)&#39;
	[WITH KEY column]
	[NOT INSERT]
	[CLUSTERED WITH &#39;(&#39; column_commalist &#39;)&#39; ]
	[USING FUNCTION]
	[LANGUAGE STRING]
  [ENCODING STRING]
	;
</pre>
    </div>
<p>
<strong>XML</strong> - The XML keyword specifies that the data is to be indexed as XML, hence
element names and attributes will be processed separately for use with the
XCONTAINS predicate.
</p>
	<p>
The <strong>q_table_name</strong> is a qualified table name on which the index is created.
The generated procedures and auxiliary tables will be with the owner and
qualifier of this table.
</p>
	<p>
The <strong>column</strong> must be a column of the above table, of the VARCHAR, NVARCHAR,
LONG VARCHAR, LONG NVARCHAR, LONG XML or XMLTYPE  data type.  The column may additionally have
the IDENTIFIED BY option if the content is XML.  This will be used to
provide a base URI for traversing relative references.  The XML option for the index has to be specified if the content is LONG XML or XMLTYPE.
</p>
	<p>
<strong>WITH KEY column</strong> - This allows optionally specifying a uniquely identifying
column which will be used as a foreign key for referencing the table from
the text index.  If this is not specified and there is a single part integer
primary key, this primary key is used as the key.
If there is no suitable primary key and the option is not present, an
integer  column is added and a sequence object is used to supply distinct
values.
</p>
	<p>
When specified, the column must be non-NULL and its run time value must
either be an integer or a composite (See composite data type).
The length of the values of this column is crucially important since it is
repeated  for each distinct word of each  row.
This column is called the free text document id column in the rest of this
documentation.
</p>
	<p>
<strong>NOT INSERT</strong> - If present, specifies that the index is not filled when
created.  The VT_INDEX_&lt;table&gt; function is still created but not run.
You can run this manually or scheduled at an appropriate time.  If batching is enabled
then the index will be filled up at that time.
</p>
	<p>
<strong>CLUSTERED WITH</strong> - The column list must consist of columns of the table.
Their values are stored in the text index so that the values can be more
efficiently located for filtering than if they had to be retrieved from the
table itself.  The combined length of the columns should be relatively
small, not much over 200 bytes for this to be effective.
There is no hard upper limit but long blobs are
not advisable.
</p>
	<p>
<strong>USING FUNCTION</strong> - This allows specifying a hook function for indexing an
unindexing a document.  The index hook is called before processing the words
of the column to be indexed.  This can be used to index extra text in
addition to the column value or to modify the text before indexing. If the
hook function returns 1 the column is not additionally processed. If hook
returns 0 the column is processed normally, in addition to the words the
hook may have inserted. See the section on hook functions. The hook
functions are always named &lt;table&gt;_&lt;column&gt;_INDEX_HOOK and
&lt;table&gt;_&lt;column&gt;_UNINDEX_HOOK, in the owner and qualifier of the table,
where &lt;table&gt; is the unqualified name of the table.</p>
	<p>
<strong>LANGUAGE</strong> - The string literal following this keyword may be a language name.
See <a href="ftinternationalization.html">Internationalization</a> for details.
</p>
	<p>
<strong>ENCODING</strong> - The string literal following this keyword may be an encoding name.
</p>
<p>
Creating the text index may take a long time.  All this time is in &#39;atomic&#39;
mode, so no other database activity is allowed during it and no log is
generated.  If there is an error, e.g. out of disk the created index is
dropped and the error is reported.  A checkpoint should be made after the
index is complete.  If the index should be created under a different
qualifier or the generated tables altered after the index is created the NOT
INSERT option should be used to delay filling the index.
</p>
	<p>
The <a href="creatingtxtidxs.html#createtxtidxstmt">CREATE TEXT INDEX</a> statement
will automatically make a unique index based
on the free text document id, if this is not the PK of the table.  If one
wishes to modify this index, it can be found and dropped with DROP INDEX and
reconstructed, but make sure that the reference in SYS_VT_INDEX matches and
that the new index has the same name as the previous version.
</p>
	<p>
A freshly created text index is in synchronous mode. This means that that
changes to the table are immediately reflected on the index as they occur.
This is done through a set of automatically generated triggers.
If large changes to data will be performed, the <a href="ftperformance.html">batch mode</a>
is far more efficient.
</p>
<div class="tip">
      <div class="tiptitle">See Also:</div>
<p>Although it is recommended to use the methods described above, the
<a href="fn_vt_create_text_index.html">vt_create_text_index</a> function
can also be used</p>
</div>

<br />

	
		<a name="appspecificdocid" />
    <h3>18.2.2. Choosing An Application Specific Document ID</h3>
		<p>
The free text index conceptually works by making an index entry for each
distinct word of each indexed column value which references back to the row
containing the data being indexed.
Therefore the table must have a unique ID that will be stored in conjunction
with each distinct word in the indexed column in the text index.  For space
efficiency this should be as short as possible.  If nothing else is
specified the <a href="creatingtxtidxs.html#createtxtidxstmt">CREATE TEXT INDEX</a> statement makes such a unique column and
fills it automatically from a sequence producing unique numbers.  If a single
part integer primary key exists then this key is used as the free text
index document ID.  Note However that the values &lt;= 0 are prohibited.
This is however not always optimal, hence the application may specify what
column is used to identify the row for text indexing.  Such a unique column
is referred to as the <strong>Free Text Document ID</strong>.
	</p>
		<p>
Suppose that a table contains news articles that should most frequently be
retrieved latest first, in descending order of a datetime field.  This can
be achieved by just selecting the
matching articles and sorting them with a SQL ORDER BY clause but this can
be very inefficient.  The reason for this is that all hits will first have to be found, then
sorted and only then can the first hit be returned to the user.  Further,
the sort key will have to be retrieved from the table, causing a random
access for each text hit.  The sorting can be totally avoided if the document ID that is used
to refer to the table from the index is itself ordered by date.  This has
several advantages:
	</p>
		<ul>
		<li>
To retrieve the n latest, one just takes the n first hits produced by the
contains search, no sorting required.
	</li>
		<li>
To get the next n hits, one repeats the search but now specifying that the
start ID is the ID of the last row of the previous set.  No sorting and no
scrollable cursors are required and the first hits can be returned before
generating all hits.  This is specially useful if the search criteria match
many articles.
	</li>
	</ul>
		<p>
This has a disadvantage in that a longer document ID will have to be
stored for each distinct word of each distinct article.  This may result in
a 60% increase in the index size but largely offsets the penalties of
sorting.  One should however exercise the utmost care in making this ID as
short as possible.  The maximum length of the ID is 30 bytes, but with this
length the storage is extremely wasteful, so an ID with fields adding up to
some 10 bytes is much better.
	</p>
		<p>
We will note that the document ID can be an aggregate of several scalars.
In the news article example, it could be a datetime, ID number pair.  This
is so because the datetime typically would not be unique and the ID is
required itself to be unique.
	</p>
		<p>
However, rather than storing the datetime and an integer article number, it
is advisable to compress the datetime into a number, e.g. a count of minutes
after a given date.  This maintains the temporal order to within a minute
and takes less than half the space taken by the datetime with all its
fractions, time zones etc.
	</p>
		<p>
For handling multi-part ID&#39;s like scalars there is a special data type,
composite.  Thus, if an application specific document ID is not an integer, it must be a
composite totaling less than 30 bytes of content divided among its members.
	</p>

<br />

	
		<a name="compositedatatype" />
    <h3>18.2.3. The composite Data Type</h3>
		<p>
A composite is like a heterogeneous array, except that it is limited in
length, may be stored as a column value for a column declared as ANY, and
may be a key part in a SQL index.  Thus, comparison is defined for
composites as follows:
	</p>
	<ul>
    <li>Composites are equal if all parts are equal.
      </li>
    <li>A composite is less than another if the first part of it which
is not equal to the corresponding part of the other composite is less than that part.
      </li>
    <li>If a composite has less parts than another and all of its parts
are equal to the corresponding parts of the longer composite, then the shorter
is considered less.
      </li>
      </ul>
  <p>
The collation of composites is just like that of strings, except that in the place of
characters, arbitrary run time typed scalars are compared.
	</p>
		<p>
When strings are compared inside composites, they are compared as binary,
without any specific collation.
Normal numeric coercion applies to comparison of composites.
Narrow and wide strings are compared with binary collation.
If two elements are of different types, e.g. a number and a string and are
compared, the data type will decide the outcome.  For example any integer is
always less than any string.
	</p>
		<p>
The composite SQL function makes a composite.  It takes a variable number of
arguments and returns a composite.  The composite_ref function takes a
composite and a zero based index and returns the value.  The serialized
length of a composite is limited to 255 characters.  If a composite is used
in a free text index it is limited to 30 characters.
		</p>
<ul>
  <li>
        <a href="fn_composite.html">composite()</a>
      </li>
  <li>
        <a href="fn_composite_ref.html">composite_ref()</a>
      </li>
</ul>

<div class="tip">
      <div class="tiptitle">See Also:</div>
<p>See the <a href="coredbengine.html#DataTypes">Data Types</a>
section for the storage requirement of each data type.
	</p>
    </div>

<br />


<a name="fttexamples" />
    <h3>18.2.4. Free Text Index Examples</h3>
<div>
      <pre class="programlisting">
composite (1, 2) = composite (1, 2)  is true
composite (1, 2) &lt; composite (1, 3) is true.
composite (1, 0) &gt; composite (1) is true.
</pre>
    </div>
<div>
      <pre class="programlisting">
composite_ref (composite (1, 2), 1) = 2
</pre>
    </div>
<p>
Example of a Composite Application Specific Document ID
</p>
<p>
The below code creates a table for news articles and defines a text index
with a composite document id.
</p>

<div>
      <pre class="programlisting">
create table article(author_name varchar(255),
		     description long varchar,
		     pub_date datetime,
		     id integer,
		     title varchar(255),
		     dtid any not null,
		     primary key(id));
</pre>
    </div>
<div>
      <pre class="programlisting">
create text index on article (description)
	with key dtid clustered with (pub_date, author_name);
</pre>
    </div>

<div class="note">
      <div class="notetitle">Note:</div>
<p>The dtid must be unique, non null and must be a composite  since
it is used as a custom text index id.
</p>
</div>
		<p>
Next we define a mapping between dates and integers.  These will be minutes
consecutive of 1990-1-1 0:00.  With 525600 minutes per 365 day year we will
not run out of values for a long time, the positive integer limit being
2**31.
	</p>
		<div>
      <pre class="programlisting">
create procedure date2short (in dt datetime)
{
  return (1440 * datediff (&#39;day&#39;, {d&#39;1990-1-1&#39;}, dt)
	+ hour (dt) * 60 + minute (dt));
}
	</pre>
    </div>
		<div>
      <pre class="programlisting">
create procedure short2date (in n integer)
{
  return (dateadd (&#39;minute&#39;, mod (n, 1440),
	dateadd (&#39;day&#39;, n / 1440, {d&#39;1990-1-1&#39;})));
}
	</pre>
    </div>
		<p>
Now we can insert an article:
	</p>
		<div>
      <pre class="programlisting">
insert into article (id, drtid, description, pub_date)
	values (1, composite (date2short ({dt &#39;2001-1-15 12:44&#39;}), 1),
		&#39;sample news article&#39;, {dt &#39;2001-1-15 12:44&#39;});
</pre>
    </div>
<div class="note">
      <div class="notetitle">Note:</div>
<p>The composite is the date2short of the datetime and the id.  The
dtid must be specified and
cannot be generated by a trigger, since the free text index related triggers
must have access to the value.
</p>
</div>
		<p>
The text index will be in synchronous mode by default so we can now query
the data:
	</p>
		<div>
      <pre class="programlisting">
select id from article where contains (description, &#39;sample&#39;);
select id from article where contains (description, &#39;sample&#39;, descending);
	</pre>
    </div>
		<p>
The first query will return the oldest hits first, the second the newest
first.  Note that inserting in ascending order of the document ID is
incomparably more efficient than in descending order.  There is no great
speed difference between reading in ascending or descending order.
	</p>
		<p>
If no application specific ID were specified the order would reflect the
insertion order. Note that in this example articles do not have to be
received in publication order, although insertions will naturally tend to
follow this.
	</p>
		<p>
Now since the ID has an application semantic, we can use it for filtering
based on date:
	</p>
		<p>
Consider:
	</p>
<div>
      <pre class="programlisting">
select id from article where contains (description, &#39;sample&#39;, descending,
	start_id, composite (date2short ({dt&#39;2001-1-5&#39;})));
</pre>
    </div>

		<p>
Since the search goes in descending order of id and starts at an id
beginning with the numeric value corresponding to 2001-1-5 0:00, we
may only get hits where the id date component is less than this, newest
first.
	</p>
		<p>
There is no time penalty for the start_id option.  This is therefore
incomparably faster than the query:
	</p>

		<div>
      <pre class="programlisting">
select id from article where contains (description, &#39;sample&#39;, descending)
and pub_date &lt; {dt &#39;2001-1-5&#39;};
	</pre>
    </div>

<div class="tip">
      <div class="tiptitle">See Also:</div>
<p>The reference section for <a href="queryingftcols.html#containspredicate">contains</a>
 for a definition of these options.</p>
    </div>
<br />

	
		<a name="preprocessingandext" />
    <h3>18.2.5. Pre-processing and Extending the Content Being Indexed</h3>
		<p>
Let us consider the news application.
Assume now a many to many relationship between articles and numbered news
channels.
	</p>
		<div>
      <pre class="programlisting">
create table article_channel (
	a_id integer references article,
	c_id integer,
	primary key (a_id, c_id);
	</pre>
    </div>
		<p>
Assume further that free text search criteria be combined to channel
membership tests.
	</p>
		<p>
This could be expressed as follows
	</p>
		<div>
      <pre class="programlisting">
select * from article
	where contains (description, &#39;sample&#39;)
	and exists
		(select 1 from article_channel
		where a_id = id and c_id = ?);
	</pre>
    </div>
  <p>We have a random access per each hit to a table with at least the
population of the article table for each hit. The situation is yet worse if
there is an OR of multiple channel id&#39;s to which the article may belong.
</p>
		<p>
To optimize this, we may choose to add an extra word for each channel in
which the article appears.  Likewise, we may add the text of the title of the
article to the text being index.
	</p>
		<p>
This can be done with the index hook feature.
	</p>
		<div>
      <pre class="programlisting">
create text index on article (description)
	with key dtid clustered with (pub_date, author_name)
	using function ;
	</pre>
    </div>
		<p>
We then define the hook functions
	</p>
		<div>
      <pre class="programlisting">
create procedure
	article_description_index_hook (inout vtb any, inout d_id any)
{
  for (select c_id from article_channel
	where a_id = composite_ref (d_id, 0))
  do
    {
      vt_batch_feed (vtb, sprintf (&#39;ch%d&#39;, c_id), 0);
    }
  vt_batch_feed (vtb, coalesce ((select title from article
	where dtid = d_id), &#39;&#39;), 0);
  return 0;
}
	</pre>
    </div>
		<p>
This function gets all channel id&#39;s where the article appears and adds the
word ch&lt;nnnn&gt; where &lt;nnnnn&gt; is the channel id.  Thus to look for &#39;xx&#39; on
channel 1 or 10 one can use the text expression xx and (ch1 or ch10).
Additionally, the text of the title is added to the text being indexed.
Note that the d_id supplied is the free text document id and that the second
part of it is the article id.
	</p>
		<p>
To reverse the effect, the unindex function works as follows:
	</p>
	<div>
      <pre class="programlisting">
create procedure
	article_description_unindex_hook (inout vtb any, inout d_id any)
{
  for (select c_id from article_channel
	where a_id = composite_ref (d_id, 0))
  do
    {
      vt_batch_feed (vtb, sprintf (&#39;ch%d&#39;, c_id), 1);
    }
  vt_batch_feed (vtb, coalesce ((select title
	from article where dtid = d_id), &#39;&#39;), 1);
  return 0;
}
	</pre>
    </div>

		<p>
These hooks accept 2 inout parameters, the so called vt batch and the free
text document ID of the row at hand.
	</p>
		<p>
The function returns 1 to indicate that it has processed all words of the
row to be indexed and 0 to indicate that it expects the default text to be
processed by the caller as normally.  Returning 1 is useful for example if
extra word normalization is applied by the hook.
	</p>
If the resulting index is used by xcontains() special
predicate then hook functions should not alter the indexing of XML documents.
xcontains() reads both free-text index and the 
actual document in order to locate particular fragments and it may miss
search hits or get false hits if free-text index of a column does not match to the
actual content of the column.
It is still safe to call vt_batch_feed more than once
during a single call of a hook function: first call for an
unmodified XML document in the column plus calls for additional data.
		<p>
	</p>
		<p>
If offband columns are declared then any call of a hook function should
either return 0 or call <span class="computeroutput">vt_batch_feed_offband()</span>
before returning a non-zero value. If this condition is violated for a document
then NULL is returned instead of correct offband value for the document without
signalling any error.
	</p>
		<p>
The vt batch is an opaque data structure that accumulates  words that will
be added to the text index entry for a given row.  The <span class="computeroutput">vt_batch_feed</span>
function adds words to the batch, the first argument is the vt batch, the
second is the text and the third is a flag 0 for insert and 1 for delete.
The text to be associated to the d_id in the index is the concatenation of
all the text supplied by successive calls to <span class="computeroutput">vt_batch_feed</span>. Word proximity
is defined as if all text were a single string in the order of calling
<span class="computeroutput">vt_batch_feed</span>.
	</p>
		<p>
It is possible to partially alter the rtext associated with an
existing document.  This could be done when adding channels to an article
which already exists.  This could be done with the procedure:
</p>

		<div>
      <pre class="programlisting">
create procedure
	ch_add_article (in cid integer, in aid integer)
{
  declare vtb, _dtid, cname any;
  if (exists (select 1 from article_channel
	where c_id = cid and a_id = aid))
    return;

  insert into article_channel (c_id, a_id) values (cid, aid);
  select dtid into _dtid from article where id = aid;

  vtb := vt_batch (1);
  vt_batch_d_id (vtb, _dtid);
  vt_batch_feed (vtb, sprintf (&#39;ch%d&#39;, cid), 0);
  vt_batch_process_db_dba_article (vtb);
}
	</pre>
    </div>
		<p>
This first checks if the article is already on the channel, and if not, it
makes a vt batch, gets the free text document id of the article, associates
it to the vt batch and then adds a single word, ch&lt;nnnn&gt;.  The vt batch is
applied by calling the generated procedure vt_batch_process_&lt;table&gt;_&lt;column&gt;
with the vt batch as only argument.
	</p>
		<p>
If multiple documents should be processed in one batch, it is possible to
call vt_batch_d_id multiple times to feed data about multiple documents.
In this case the successive document ids must be given in ascending order.
The batch can be processed (applied to the words table) by calling
T_BATCH_PROCESS_&lt;table&gt;_&lt;column&gt;, generated by the index creation.
	</p>
	<p>
The sensitive columns of the UPDATE trigger generated are the free text
document id, the CLUSTERED WITH columns and the main text column.  If more
columns are needed for hook functions etc., the triggers should be manually
edited.
</p>
	<p>
If an explicit integer document ID column is specified, its value may not be
0 or negative.
</p>
<div class="tip">
      <div class="tiptitle">See Also</div>
<p>
        <a href="fn_vt_batch.html">vt_batch</a>,
<a href="fn_vt_batch_feed.html">vt_batch_feed</a>,
<a href="fn_vt_batch_feed_offband.html">vt_batch_feed_offband</a>,
<a href="fn_vt_batch_d_id.html">vt_batch_d_id</a>.
</p>
</div>
<br />


    <a name="hitscores" />
    <h3>18.2.6. Hit Scores</h3>
      <p>
When a document satisfies a text search expression a score is computed to
reflect the closeness of the match. This is a positive integer, with a higher
value meaning a closer match.  The scores are only comparable between results
of the same query against the same database. Moreover, the scoring rules are
different for different versions of Virtuoso server, due to the progress in
information retrieval theory. Thus the only really safe thing to do with scores
is to sort the list of hits by descending
score to produce more relevant hits first.
In addition, the CONTAINS predicate allows specifying
a lower limit to the score of produced hits so a smart application can reduce
a number of retrieved documents by filtering out the less relevant documents.
	</p>
			<p>
While the real scoring rules of the server are too sophisticated to be listed here,
the basic concept is simple: hits for restrictive terms are most important
than hits for generic terms, frequent hits in same document are more important
than occasional, hits that are probably concentrated in same sentence are more
important than hits here and there in a long text:
	</p>
<ul>
      <li>
The score of a single word term is proportional to the frequency of the word in
the document.
	</li>
      <li>
The score of a phrase is proportional to both the frequency of the phrase in
the document and the number of words in the phrase.
	</li>
      <li>
The score of a proximity term &#39;A NEAR B&#39; depends on the distance
between A and B in words - greater distances result in smaller scores.
E.g. the score is 100 for zero distance, 80 for 5 word distance,
50 for distance 10, 20 for distance 20; then it slowly decreases to 1 for distance 100.
If the distance is 100 words or greater then the condition A NEAR B is considered not to be satisfied.
If the A-B pair occurs several times throughout the document the scores of each pair are added.
Virtuoso searches for shortest possible pairs, so if there is more than one word A around B then
the nearest A is selected for scoring.
	</li>
      <li>
Score of a term like A AND B AND C... AND Z is a minimum of scores of A, B,... Z plus some
bonuses. Longer list of AND&#39;ed terms get higher score. A special bonus for
term A AND B is added if hits of subterm A are close to the hits of subterm B.
	</li>
      <li>
Score of a term like A OR B OR C... OR Z is a maximum of scores of A, B,... Z, with some
additional corrections. Longer list of OR&#39;ed terms get lower score.
A special bonus for term A OR B is added if hits of subterm A are close to the hits of subterm B.
	</li>
      <li>
Score of a term like A AND NOT B is equal to the score of subterm A. The subterm B is
used for filtering out redundant hits but it does not affect the scoring.
	</li>
    </ul>
			<p>
The XCONTAINS predicate can also return scores. These scores are somewhat similar to scores
made by CONTAINS but rules for them are too complicated to be explained here.
	</p>
<br />
	
		<a name="wordranges" />
    <h3>18.2.7. Word Ranges</h3>
		<p>
This feature allows returning the positions of matches of a query inside the
indexed text.
This is done by creating a virtual column which gets as its value an array
of arrays, one array for each top level term of the text query expression.
The component arrays in turn contain word positions, expressed as the
ordinal number of the first and last word of each match of the term in
question.  The ordinal numbers are counted from 0, skipping noise words.
	</p>

		<a name="" />
    <div class="example">
      <div class="exampletitle">Word Ranges</div>

		<div>
        <pre class="programlisting">
select dbg_obj_print (r), * from ftt
	where contains (dt, &#39;foo&#39;, ranges, r);
	</pre>
      </div>

		<p>
Prints the array
	</p>
		<div>
        <pre class="screen">
(L(0 0 ) )
	</pre>
      </div>
		<p>
to the server standard output, indicating that the match starts and ends at
the 0&#39;th word, inclusive.
	</p>
	</div>

		<p>
This feature can be used to show specific portions of matching documents in
applications.  This is internally used as part of the xcontains predicate
for XML text.
Also see the function search_excerpt ().
	</p>
<br />


	
		<a name="offbanddata" />
    <h3>18.2.8. Using Offband Data for Faster Filtering</h3>
		<p>
When evaluating a select where there is a contains predicate and filtering
conditions on columns of the table on which the text index is defined
it is useful to store the most frequently used columns in the free text
index instead of the table itself.
	</p>
		<p>
The rationale is that in order to access the filtering data the engine will
do a merge join with the text index table instead of a nested loop join with
the actual table.  Further note that if the columns to be accessed are not
in the index that begins with the free text document id actually 2 random
accesses will be needed: 1. to get the primary key based on the document id
and 2. to get the filtering criterion based on the primary key.  It is
vastly more efficient to do a merge join in the text index to get frequently
needed non-text filtering or sorting keys.
	</p>
		<p>
If the text index is maintained in background mode the offband data will
also be maintained with a delay.  This should not be a problem however since
this is no more delayed than the text data itself.
	</p>
		<p>
If a select with a contains does not reference any columns from the indexed
table besides the document id, then no access to the actual table will be
generated in the compiled query.  Likewise, if only columns found in the
index used to link the document id to the table are referenced, only that
index will be accessed. A special case of the latter situation is where the
document id is the primary key itself.  This will speed up
retrieving the row for free text hits.
	</p>
		<p>
Let us consider a query for getting articles where the author name is at a
specific value:
	</p>
		<div>
      <pre class="programlisting">
select id from article where contains (description, &#39;sample&#39;)
	and author_name = &#39;John Pumpkin&#39;;
	</pre>
    </div>
		<p>
This can be alternately written as
	</p>
		<div>
      <pre class="programlisting">
select id from article
	where contains (description, &#39;sample&#39;, offband, author_name)
	and author_name = &#39;John Pumpkin&#39;;
	</pre>
    </div>
		<p>
In the latter case the author_name will be retrieved from the text index,
saving 2 random accesses, one to the index on dtid and the other to the
table itself per each free text hit.
	</p>
		<p>
The notation is different because the semantic is slightly different.  The
author_name in the latter case is the name at the time of indexing the
article and in the latter case it is the name at the time of evaluating the
query. There can be a difference if the index is maintained with a delay.
For most applications this is however irrelevant.  offband data should not
be used for often changing, transactional data.
	</p>
		<p>
Now consider
	</p>
		<div>
      <pre class="programlisting">
select * from (select top 10  dtid from article
	where contains (description, &#39;sample&#39;, offband, author_name)
	order by author_name) a, article b where a.dtid = b.dtid;
	</pre>
    </div>
		<p>
The derived table select the 10 first articles matching the text condition
in order of author_name.  This does not itself access the article table at
all.  The outer select will then select the full row for these 10 articles.
This is possible since the inner select only references dtid, which is the
free text index document id and author_name which is declared an offband
column in the text index.
	</p>
		<p>
This does less disk access than
	</p>
		<div>
      <pre class="programlisting">
select top 10 * from article
	where contains (description, sample&#39;, offband, author_name)
	order by author_name;
	</pre>
    </div>
		<p>
In this case, all matches are fetched, including the row in the article
table and all are sorted and the top 10 are returned.
This differs from the first by the fact that this accesses the article table
for each of the text hits, not only the top 10.  This can easily make a 3x
speed difference when running in memory and much greater when I/O is
involved, not to mention the adverse impact of more I/O on the working set.
	</p>
	<br />


	<a name="orderofhits" />
    <h3>18.2.9. Order of Hits</h3>
<p>
Rows from a select where there is a contains predicate and no exact match of the primary key of the
table are produced in the order of the document id of the free text index if there is no ORDER BY.
If there is an ORDER BY matching an index, the query is evaluated in the order of that index and each
consecutive row in the order of the index is compared against the free text expression.  This is
practically never desirable.
</p>
<p>
If results are desired in the order of the free text document id, there MUST BE NO ORDER BY.  The
DESCENDING option of contains should be used to produce the reverse order, see contains reference section.
If another sorting order is desired, an ORDER BY can be used but to avoid using an index the ordering
columns should be expressions or ordinal numbers of result columns.
</p>
  <p>
Therefore:
</p>
  <div>
      <pre class="programlisting">
select * from article where contains (description, &#39;sample&#39;)
  order by id;
</pre>
    </div>
  <p>
Will have the effect of traversing the table in the order of id and
checking each row for free text match. This is practically NEVER good.
</p>
  <p>
To produce the results in order of id instead of dtid it is better to write
</p>
  <div>
      <pre class="programlisting">
select id, * from article where contains (description, &#39;sample&#39;)
  order by 1;
</pre>
    </div>
  <p>or</p>
  <div>
      <pre class="programlisting">
select * from article where contains (description, &#39;sample&#39;)
  order by id + 0;
</pre>
    </div>
<br />
<a name="noisewords" />
    <h3>18.2.10. Noise Words</h3>
	<p>
Noise words are often occurring words which can be skipped to save space in
the indexing, such as &#39;the&#39;, &#39;of&#39;, and&#39; etc.  These are ignored when they
occur in queries or documents to be indexed.  The set of noise words is
configurable and is read from the noise.txt file, in the server&#39;s working
directory, at server start up.
Words mentioned in that file will be ignored for both indexing and
querying.
</p>
    <p>
The file noise.txt consists of control lines and text lines.
A text line is just a string of one or more words to be declared as noise.
Please keep them shorter than 1000 characters.
</p>
    <p>
Control lines are those starting with  &quot;Language:&quot; or &quot;Encoding:&quot; (case is important)
&quot;Language: lang-id&quot; tells the system to use rules for language &quot;lang-id&quot; for subsequent  text
lines, until either another &quot;Language:&quot; control line or end of file.
Similarly, &quot;Encoding: enc-id&quot; tells the system to use rules for encoding &quot;enc-id&quot;.
Control lines are always in plain ASCII, no matter which encoding is active
for text lines.
By default, the server default language and &quot;UTF-8&quot; encoding will be used.
</p>
    <p>
The simplest way of composing noise.txt is to place every word on a separate line
and save the file in UTF-8 encoding; this will work fine for most  European languages.
</p>
<a name="" />
    <div class="example">
      <div class="exampletitle">Example</div>
	<div>
        <pre class="programlisting">
AND
OF
THIS
THE
</pre>
      </div>
</div>
<p>
Noise words seem to be case-insensitive, but this is not so. If you
enter a word on a text line, up to four noise words will be registered:</p>
<ul>
      <li>the word exactly as it was entered;</li>
      <li>an uppercased form of this word, if it is defined for active language;</li>
      <li>an lowercased form of this word, if it is defined for active language;</li>
      <li>a capitalized form, with one (or more) first chars in upper case and the rest in lower case.</li>
    </ul>
<p>
  An error is signalled for a free text query consisting exclusively of noise words.
</p>
<p>
It is important to understand that changes in noise.txt may invalidate
free text indexes that were filled with the previous version of noise.txt.
For example, if a text contains a phrase &#39;A B C&#39; and after indexing the word
&#39;B&#39; is added to the noise.txt then &#39;contains&#39; predicate will properly search for
words &#39;A&#39; and &#39;C&#39; but will fail to find the phrase &#39;A B C&#39;
or &#39;A C&#39; due to differences in counting of word positions.
The &#39;xcontains&#39; predicate is even more sensitive to changes in word positions,
because any change in word counting will corrupt the index for element names.
In addition, &quot;persistent XML&quot; documents may contain pre-calculated
word positions for all elements and these positions may become out of sync with positions in
free text index, so it is best not to change noise.txt if the database
contains any free text indexes on persistent XMLs.
</p>
<br />

<table border="0" width="90%" id="navbarbottom">
    <tr>
        <td align="left" width="33%">
          <a href="txtidxquickstart.html" title="Basic Concepts">Previous</a>
          <br />Basic Concepts</td>
     <td align="center" width="34%">
          <a href="freetext.html">Chapter Contents</a>
     </td>
        <td align="right" width="33%">
          <a href="queryingftcols.html" title="Querying Free Text Indexes">Next</a>
          <br />Querying Free Text Indexes</td>
    </tr>
    </table>
  </div>
  <div id="footer">
    <div>Copyright© 1999 - 2009 OpenLink Software All rights reserved.</div>
   <div id="validation">
    <a href="http://validator.w3.org/check/referer">
        <img src="http://www.w3.org/Icons/valid-xhtml10" alt="Valid XHTML 1.0!" height="31" width="88" />
    </a>
    <a href="http://jigsaw.w3.org/css-validator/">
        <img src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!" height="31" width="88" />
    </a>
   </div>
  </div>
 </body>
</html>