Sophie

Sophie

distrib > Mandriva > 2008.1 > x86_64 > media > main-testing > by-pkgid > bab02a23fa9f3df8d66a9a3231b50245 > files > 135

postgresql8.3-docs-8.3.6-2mdv2008.1.x86_64.rpm

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML
><HEAD
><TITLE
>Text Search Types</TITLE
><META
NAME="GENERATOR"
CONTENT="Modular DocBook HTML Stylesheet Version 1.79"><LINK
REV="MADE"
HREF="mailto:pgsql-docs@postgresql.org"><LINK
REL="HOME"
TITLE="PostgreSQL 8.3.6 Documentation"
HREF="index.html"><LINK
REL="UP"
TITLE="Data Types"
HREF="datatype.html"><LINK
REL="PREVIOUS"
TITLE="Bit String Types"
HREF="datatype-bit.html"><LINK
REL="NEXT"
TITLE="UUID Type"
HREF="datatype-uuid.html"><LINK
REL="STYLESHEET"
TYPE="text/css"
HREF="stylesheet.css"><META
HTTP-EQUIV="Content-Type"
CONTENT="text/html; charset=ISO-8859-1"><META
NAME="creation"
CONTENT="2009-02-03T04:34:16"></HEAD
><BODY
CLASS="SECT1"
><DIV
CLASS="NAVHEADER"
><TABLE
SUMMARY="Header navigation table"
WIDTH="100%"
BORDER="0"
CELLPADDING="0"
CELLSPACING="0"
><TR
><TH
COLSPAN="5"
ALIGN="center"
VALIGN="bottom"
>PostgreSQL 8.3.6 Documentation</TH
></TR
><TR
><TD
WIDTH="10%"
ALIGN="left"
VALIGN="top"
><A
HREF="datatype-bit.html"
ACCESSKEY="P"
>Prev</A
></TD
><TD
WIDTH="10%"
ALIGN="left"
VALIGN="top"
><A
HREF="datatype.html"
>Fast Backward</A
></TD
><TD
WIDTH="60%"
ALIGN="center"
VALIGN="bottom"
>Chapter 8. Data Types</TD
><TD
WIDTH="10%"
ALIGN="right"
VALIGN="top"
><A
HREF="datatype.html"
>Fast Forward</A
></TD
><TD
WIDTH="10%"
ALIGN="right"
VALIGN="top"
><A
HREF="datatype-uuid.html"
ACCESSKEY="N"
>Next</A
></TD
></TR
></TABLE
><HR
ALIGN="LEFT"
WIDTH="100%"></DIV
><DIV
CLASS="SECT1"
><H1
CLASS="SECT1"
><A
NAME="DATATYPE-TEXTSEARCH"
>8.11. Text Search Types</A
></H1
><A
NAME="AEN5853"
></A
><A
NAME="AEN5856"
></A
><P
>    <SPAN
CLASS="PRODUCTNAME"
>PostgreSQL</SPAN
> provides two data types that
    are designed to support full text search, which is the activity of
    searching through a collection of natural-language <I
CLASS="FIRSTTERM"
>documents</I
>
    to locate those that best match a <I
CLASS="FIRSTTERM"
>query</I
>.
    The <TT
CLASS="TYPE"
>tsvector</TT
> type represents a document in a form suited
    for text search, while the <TT
CLASS="TYPE"
>tsquery</TT
> type similarly represents
    a query.
    <A
HREF="textsearch.html"
>Chapter 12</A
> provides a detailed explanation of this
    facility, and <A
HREF="functions-textsearch.html"
>Section 9.13</A
> summarizes the
    related functions and operators.
   </P
><DIV
CLASS="SECT2"
><H2
CLASS="SECT2"
><A
NAME="DATATYPE-TSVECTOR"
>8.11.1. <TT
CLASS="TYPE"
>tsvector</TT
></A
></H2
><A
NAME="AEN5870"
></A
><P
>     A <TT
CLASS="TYPE"
>tsvector</TT
> value is a sorted list of distinct
     <I
CLASS="FIRSTTERM"
>lexemes</I
>, which are words that have been
     <I
CLASS="FIRSTTERM"
>normalized</I
> to make different variants of the same word look
     alike (see <A
HREF="textsearch.html"
>Chapter 12</A
> for details).  Sorting and
     duplicate-elimination are done automatically during input, as shown in
     this example:

</P><PRE
CLASS="PROGRAMLISTING"
>SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
                      tsvector
----------------------------------------------------
 'a' 'on' 'and' 'ate' 'cat' 'fat' 'mat' 'rat' 'sat'</PRE
><P>

     (As the example shows, the sorting is first by length and then
     alphabetically, but that detail is seldom important.)  To represent
     lexemes containing whitespace or punctuation, surround them with quotes:

</P><PRE
CLASS="PROGRAMLISTING"
>SELECT $$the lexeme '    ' contains spaces$$::tsvector;
                 tsvector                  
-------------------------------------------
 'the' '    ' 'lexeme' 'spaces' 'contains'</PRE
><P>

     (We use dollar-quoted string literals in this example and the next one,
     to avoid confusing matters by having to double quote marks within the
     literals.)  Embedded quotes and backslashes must be doubled:

</P><PRE
CLASS="PROGRAMLISTING"
>SELECT $$the lexeme 'Joe''s' contains a quote$$::tsvector;
                    tsvector                    
------------------------------------------------
 'a' 'the' 'Joe''s' 'quote' 'lexeme' 'contains'</PRE
><P>

     Optionally, integer <I
CLASS="FIRSTTERM"
>position(s)</I
>
     can be attached to any or all of the lexemes:

</P><PRE
CLASS="PROGRAMLISTING"
>SELECT 'a:1 fat:2 cat:3 sat:4 on:5 a:6 mat:7 and:8 ate:9 a:10 fat:11 rat:12'::tsvector;
                                  tsvector
-------------------------------------------------------------------------------
 'a':1,6,10 'on':5 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4</PRE
><P>

     A position normally indicates the source word's location in the
     document.  Positional information can be used for
     <I
CLASS="FIRSTTERM"
>proximity ranking</I
>.  Position values can
     range from 1 to 16383; larger numbers are silently clamped to 16383.
     Duplicate positions for the same lexeme are discarded.
    </P
><P
>     Lexemes that have positions can further be labeled with a
     <I
CLASS="FIRSTTERM"
>weight</I
>, which can be <TT
CLASS="LITERAL"
>A</TT
>,
     <TT
CLASS="LITERAL"
>B</TT
>, <TT
CLASS="LITERAL"
>C</TT
>, or <TT
CLASS="LITERAL"
>D</TT
>.
     <TT
CLASS="LITERAL"
>D</TT
> is the default and hence is not shown on output:

</P><PRE
CLASS="PROGRAMLISTING"
>SELECT 'a:1A fat:2B,4C cat:5D'::tsvector;
          tsvector          
----------------------------
 'a':1A 'cat':5 'fat':2B,4C</PRE
><P>

     Weights are typically used to reflect document structure, for example
     by marking title words differently from body words.  Text search
     ranking functions can assign different priorities to the different
     weight markers.
    </P
><P
>     It is important to understand that the
     <TT
CLASS="TYPE"
>tsvector</TT
> type itself does not perform any normalization;
     it assumes that the words it is given are normalized appropriately
     for the application.  For example,

</P><PRE
CLASS="PROGRAMLISTING"
>select 'The Fat Rats'::tsvector;
      tsvector      
--------------------
 'Fat' 'The' 'Rats'</PRE
><P>

     For most English-text-searching applications the above words would
     be considered non-normalized, but <TT
CLASS="TYPE"
>tsvector</TT
> doesn't care.
     Raw document text should usually be passed through
     <CODE
CLASS="FUNCTION"
>to_tsvector</CODE
> to normalize the words appropriately
     for searching:

</P><PRE
CLASS="PROGRAMLISTING"
>SELECT to_tsvector('english', 'The Fat Rats');         
   to_tsvector   
-----------------
 'fat':2 'rat':3</PRE
><P>

     Again, see <A
HREF="textsearch.html"
>Chapter 12</A
> for more detail.
    </P
></DIV
><DIV
CLASS="SECT2"
><H2
CLASS="SECT2"
><A
NAME="DATATYPE-TSQUERY"
>8.11.2. <TT
CLASS="TYPE"
>tsquery</TT
></A
></H2
><A
NAME="AEN5901"
></A
><P
>     A <TT
CLASS="TYPE"
>tsquery</TT
> value stores lexemes that are to be
     searched for, and combines them using the boolean operators
     <TT
CLASS="LITERAL"
>&amp;</TT
> (AND), <TT
CLASS="LITERAL"
>|</TT
> (OR), and
     <TT
CLASS="LITERAL"
>!</TT
> (NOT).  Parentheses can be used to enforce grouping
     of the operators:

</P><PRE
CLASS="PROGRAMLISTING"
> SELECT 'fat &amp; rat'::tsquery;
    tsquery    
---------------
 'fat' &amp; 'rat'

SELECT 'fat &amp; (rat | cat)'::tsquery;
          tsquery          
---------------------------
 'fat' &amp; ( 'rat' | 'cat' )

SELECT 'fat &amp; rat &amp; ! cat'::tsquery;
        tsquery         
------------------------
 'fat' &amp; 'rat' &amp; !'cat'</PRE
><P>

     In the absence of parentheses, <TT
CLASS="LITERAL"
>!</TT
> (NOT) binds most tightly,
     and <TT
CLASS="LITERAL"
>&amp;</TT
> (AND) binds more tightly than
     <TT
CLASS="LITERAL"
>|</TT
> (OR).
    </P
><P
>     Optionally, lexemes in a <TT
CLASS="TYPE"
>tsquery</TT
> can be labeled with
     one or more weight letters, which restricts them to match only
     <TT
CLASS="TYPE"
>tsvector</TT
> lexemes with one of those weights:

</P><PRE
CLASS="PROGRAMLISTING"
>SELECT 'fat:ab &amp; cat'::tsquery;
    tsquery
------------------
 'fat':AB &amp; 'cat'</PRE
><P>
    </P
><P
>     Quoting rules for lexemes are the same as described above for
     lexemes in <TT
CLASS="TYPE"
>tsvector</TT
>; and, as with <TT
CLASS="TYPE"
>tsvector</TT
>,
     any required normalization of words must be done before putting
     them into the <TT
CLASS="TYPE"
>tsquery</TT
> type.  The <CODE
CLASS="FUNCTION"
>to_tsquery</CODE
>
     function is convenient for performing such normalization:

</P><PRE
CLASS="PROGRAMLISTING"
>SELECT to_tsquery('Fat:ab &amp; Cats');
    to_tsquery    
------------------
 'fat':AB &amp; 'cat'</PRE
><P>
    </P
></DIV
></DIV
><DIV
CLASS="NAVFOOTER"
><HR
ALIGN="LEFT"
WIDTH="100%"><TABLE
SUMMARY="Footer navigation table"
WIDTH="100%"
BORDER="0"
CELLPADDING="0"
CELLSPACING="0"
><TR
><TD
WIDTH="33%"
ALIGN="left"
VALIGN="top"
><A
HREF="datatype-bit.html"
ACCESSKEY="P"
>Prev</A
></TD
><TD
WIDTH="34%"
ALIGN="center"
VALIGN="top"
><A
HREF="index.html"
ACCESSKEY="H"
>Home</A
></TD
><TD
WIDTH="33%"
ALIGN="right"
VALIGN="top"
><A
HREF="datatype-uuid.html"
ACCESSKEY="N"
>Next</A
></TD
></TR
><TR
><TD
WIDTH="33%"
ALIGN="left"
VALIGN="top"
>Bit String Types</TD
><TD
WIDTH="34%"
ALIGN="center"
VALIGN="top"
><A
HREF="datatype.html"
ACCESSKEY="U"
>Up</A
></TD
><TD
WIDTH="33%"
ALIGN="right"
VALIGN="top"
><ACRONYM
CLASS="ACRONYM"
>UUID</ACRONYM
> Type</TD
></TR
></TABLE
></DIV
></BODY
></HTML
>