<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> <HTML ><HEAD ><TITLE >Text Search Types</TITLE ><META NAME="GENERATOR" CONTENT="Modular DocBook HTML Stylesheet Version 1.79"><LINK REV="MADE" HREF="mailto:pgsql-docs@postgresql.org"><LINK REL="HOME" TITLE="PostgreSQL 8.3.6 Documentation" HREF="index.html"><LINK REL="UP" TITLE="Data Types" HREF="datatype.html"><LINK REL="PREVIOUS" TITLE="Bit String Types" HREF="datatype-bit.html"><LINK REL="NEXT" TITLE="UUID Type" HREF="datatype-uuid.html"><LINK REL="STYLESHEET" TYPE="text/css" HREF="stylesheet.css"><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1"><META NAME="creation" CONTENT="2009-02-03T04:34:16"></HEAD ><BODY CLASS="SECT1" ><DIV CLASS="NAVHEADER" ><TABLE SUMMARY="Header navigation table" WIDTH="100%" BORDER="0" CELLPADDING="0" CELLSPACING="0" ><TR ><TH COLSPAN="5" ALIGN="center" VALIGN="bottom" >PostgreSQL 8.3.6 Documentation</TH ></TR ><TR ><TD WIDTH="10%" ALIGN="left" VALIGN="top" ><A HREF="datatype-bit.html" ACCESSKEY="P" >Prev</A ></TD ><TD WIDTH="10%" ALIGN="left" VALIGN="top" ><A HREF="datatype.html" >Fast Backward</A ></TD ><TD WIDTH="60%" ALIGN="center" VALIGN="bottom" >Chapter 8. Data Types</TD ><TD WIDTH="10%" ALIGN="right" VALIGN="top" ><A HREF="datatype.html" >Fast Forward</A ></TD ><TD WIDTH="10%" ALIGN="right" VALIGN="top" ><A HREF="datatype-uuid.html" ACCESSKEY="N" >Next</A ></TD ></TR ></TABLE ><HR ALIGN="LEFT" WIDTH="100%"></DIV ><DIV CLASS="SECT1" ><H1 CLASS="SECT1" ><A NAME="DATATYPE-TEXTSEARCH" >8.11. Text Search Types</A ></H1 ><A NAME="AEN5853" ></A ><A NAME="AEN5856" ></A ><P > <SPAN CLASS="PRODUCTNAME" >PostgreSQL</SPAN > provides two data types that are designed to support full text search, which is the activity of searching through a collection of natural-language <I CLASS="FIRSTTERM" >documents</I > to locate those that best match a <I CLASS="FIRSTTERM" >query</I >. The <TT CLASS="TYPE" >tsvector</TT > type represents a document in a form suited for text search, while the <TT CLASS="TYPE" >tsquery</TT > type similarly represents a query. <A HREF="textsearch.html" >Chapter 12</A > provides a detailed explanation of this facility, and <A HREF="functions-textsearch.html" >Section 9.13</A > summarizes the related functions and operators. </P ><DIV CLASS="SECT2" ><H2 CLASS="SECT2" ><A NAME="DATATYPE-TSVECTOR" >8.11.1. <TT CLASS="TYPE" >tsvector</TT ></A ></H2 ><A NAME="AEN5870" ></A ><P > A <TT CLASS="TYPE" >tsvector</TT > value is a sorted list of distinct <I CLASS="FIRSTTERM" >lexemes</I >, which are words that have been <I CLASS="FIRSTTERM" >normalized</I > to make different variants of the same word look alike (see <A HREF="textsearch.html" >Chapter 12</A > for details). Sorting and duplicate-elimination are done automatically during input, as shown in this example: </P><PRE CLASS="PROGRAMLISTING" >SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector; tsvector ---------------------------------------------------- 'a' 'on' 'and' 'ate' 'cat' 'fat' 'mat' 'rat' 'sat'</PRE ><P> (As the example shows, the sorting is first by length and then alphabetically, but that detail is seldom important.) To represent lexemes containing whitespace or punctuation, surround them with quotes: </P><PRE CLASS="PROGRAMLISTING" >SELECT $$the lexeme ' ' contains spaces$$::tsvector; tsvector ------------------------------------------- 'the' ' ' 'lexeme' 'spaces' 'contains'</PRE ><P> (We use dollar-quoted string literals in this example and the next one, to avoid confusing matters by having to double quote marks within the literals.) Embedded quotes and backslashes must be doubled: </P><PRE CLASS="PROGRAMLISTING" >SELECT $$the lexeme 'Joe''s' contains a quote$$::tsvector; tsvector ------------------------------------------------ 'a' 'the' 'Joe''s' 'quote' 'lexeme' 'contains'</PRE ><P> Optionally, integer <I CLASS="FIRSTTERM" >position(s)</I > can be attached to any or all of the lexemes: </P><PRE CLASS="PROGRAMLISTING" >SELECT 'a:1 fat:2 cat:3 sat:4 on:5 a:6 mat:7 and:8 ate:9 a:10 fat:11 rat:12'::tsvector; tsvector ------------------------------------------------------------------------------- 'a':1,6,10 'on':5 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4</PRE ><P> A position normally indicates the source word's location in the document. Positional information can be used for <I CLASS="FIRSTTERM" >proximity ranking</I >. Position values can range from 1 to 16383; larger numbers are silently clamped to 16383. Duplicate positions for the same lexeme are discarded. </P ><P > Lexemes that have positions can further be labeled with a <I CLASS="FIRSTTERM" >weight</I >, which can be <TT CLASS="LITERAL" >A</TT >, <TT CLASS="LITERAL" >B</TT >, <TT CLASS="LITERAL" >C</TT >, or <TT CLASS="LITERAL" >D</TT >. <TT CLASS="LITERAL" >D</TT > is the default and hence is not shown on output: </P><PRE CLASS="PROGRAMLISTING" >SELECT 'a:1A fat:2B,4C cat:5D'::tsvector; tsvector ---------------------------- 'a':1A 'cat':5 'fat':2B,4C</PRE ><P> Weights are typically used to reflect document structure, for example by marking title words differently from body words. Text search ranking functions can assign different priorities to the different weight markers. </P ><P > It is important to understand that the <TT CLASS="TYPE" >tsvector</TT > type itself does not perform any normalization; it assumes that the words it is given are normalized appropriately for the application. For example, </P><PRE CLASS="PROGRAMLISTING" >select 'The Fat Rats'::tsvector; tsvector -------------------- 'Fat' 'The' 'Rats'</PRE ><P> For most English-text-searching applications the above words would be considered non-normalized, but <TT CLASS="TYPE" >tsvector</TT > doesn't care. Raw document text should usually be passed through <CODE CLASS="FUNCTION" >to_tsvector</CODE > to normalize the words appropriately for searching: </P><PRE CLASS="PROGRAMLISTING" >SELECT to_tsvector('english', 'The Fat Rats'); to_tsvector ----------------- 'fat':2 'rat':3</PRE ><P> Again, see <A HREF="textsearch.html" >Chapter 12</A > for more detail. </P ></DIV ><DIV CLASS="SECT2" ><H2 CLASS="SECT2" ><A NAME="DATATYPE-TSQUERY" >8.11.2. <TT CLASS="TYPE" >tsquery</TT ></A ></H2 ><A NAME="AEN5901" ></A ><P > A <TT CLASS="TYPE" >tsquery</TT > value stores lexemes that are to be searched for, and combines them using the boolean operators <TT CLASS="LITERAL" >&</TT > (AND), <TT CLASS="LITERAL" >|</TT > (OR), and <TT CLASS="LITERAL" >!</TT > (NOT). Parentheses can be used to enforce grouping of the operators: </P><PRE CLASS="PROGRAMLISTING" > SELECT 'fat & rat'::tsquery; tsquery --------------- 'fat' & 'rat' SELECT 'fat & (rat | cat)'::tsquery; tsquery --------------------------- 'fat' & ( 'rat' | 'cat' ) SELECT 'fat & rat & ! cat'::tsquery; tsquery ------------------------ 'fat' & 'rat' & !'cat'</PRE ><P> In the absence of parentheses, <TT CLASS="LITERAL" >!</TT > (NOT) binds most tightly, and <TT CLASS="LITERAL" >&</TT > (AND) binds more tightly than <TT CLASS="LITERAL" >|</TT > (OR). </P ><P > Optionally, lexemes in a <TT CLASS="TYPE" >tsquery</TT > can be labeled with one or more weight letters, which restricts them to match only <TT CLASS="TYPE" >tsvector</TT > lexemes with one of those weights: </P><PRE CLASS="PROGRAMLISTING" >SELECT 'fat:ab & cat'::tsquery; tsquery ------------------ 'fat':AB & 'cat'</PRE ><P> </P ><P > Quoting rules for lexemes are the same as described above for lexemes in <TT CLASS="TYPE" >tsvector</TT >; and, as with <TT CLASS="TYPE" >tsvector</TT >, any required normalization of words must be done before putting them into the <TT CLASS="TYPE" >tsquery</TT > type. The <CODE CLASS="FUNCTION" >to_tsquery</CODE > function is convenient for performing such normalization: </P><PRE CLASS="PROGRAMLISTING" >SELECT to_tsquery('Fat:ab & Cats'); to_tsquery ------------------ 'fat':AB & 'cat'</PRE ><P> </P ></DIV ></DIV ><DIV CLASS="NAVFOOTER" ><HR ALIGN="LEFT" WIDTH="100%"><TABLE SUMMARY="Footer navigation table" WIDTH="100%" BORDER="0" CELLPADDING="0" CELLSPACING="0" ><TR ><TD WIDTH="33%" ALIGN="left" VALIGN="top" ><A HREF="datatype-bit.html" ACCESSKEY="P" >Prev</A ></TD ><TD WIDTH="34%" ALIGN="center" VALIGN="top" ><A HREF="index.html" ACCESSKEY="H" >Home</A ></TD ><TD WIDTH="33%" ALIGN="right" VALIGN="top" ><A HREF="datatype-uuid.html" ACCESSKEY="N" >Next</A ></TD ></TR ><TR ><TD WIDTH="33%" ALIGN="left" VALIGN="top" >Bit String Types</TD ><TD WIDTH="34%" ALIGN="center" VALIGN="top" ><A HREF="datatype.html" ACCESSKEY="U" >Up</A ></TD ><TD WIDTH="33%" ALIGN="right" VALIGN="top" ><ACRONYM CLASS="ACRONYM" >UUID</ACRONYM > Type</TD ></TR ></TABLE ></DIV ></BODY ></HTML >