<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> <HTML ><HEAD ><TITLE >Extended indexing features</TITLE ><META NAME="GENERATOR" CONTENT="Modular DocBook HTML Stylesheet Version 1.79"><LINK REL="HOME" TITLE="mnoGoSearch 3.3.9 reference manual" HREF="index.html"><LINK REL="PREVIOUS" TITLE="Cached copies " HREF="msearch-stored.html"><LINK REL="NEXT" TITLE=" mnoGoSearch HTML parser " HREF="msearch-htmlparser.html"><LINK REL="STYLESHEET" TYPE="text/css" HREF="mnogo.css"><META NAME="Description" CONTENT="mnoGoSearch - Full Featured Web site Open Source Search Engine Software over the Internet and Intranet Web Sites Based on SQL Database. It is a Free search software covered by GNU license."><META NAME="Keywords" CONTENT="shareware, freeware, download, internet, unix, utilities, search engine, text retrieval, knowledge retrieval, text search, information retrieval, database search, mining, intranet, webserver, index, spider, filesearch, meta, free, open source, full-text, udmsearch, website, find, opensource, search, searching, software, udmsearch, engine, indexing, system, web, ftp, http, cgi, php, SQL, MySQL, database, php3, FreeBSD, Linux, Unix, mnoGoSearch, MacOS X, Mac OS X, Windows, 2000, NT, 95, 98, GNU, GPL, url, grabbing"></HEAD ><BODY CLASS="chapter" BGCOLOR="#EEEEEE" TEXT="#000000" LINK="#000080" VLINK="#800080" ALINK="#FF0000" ><!--#include virtual="body-before.html"--><DIV CLASS="NAVHEADER" ><TABLE SUMMARY="Header navigation table" WIDTH="100%" BORDER="0" CELLPADDING="0" CELLSPACING="0" ><TR ><TH COLSPAN="3" ALIGN="center" ><SPAN CLASS="application" >mnoGoSearch</SPAN > 3.3.9 reference manual: Full-featured search engine software</TH ></TR ><TR ><TD WIDTH="10%" ALIGN="left" VALIGN="bottom" ><A HREF="msearch-stored.html" ACCESSKEY="P" >Prev</A ></TD ><TD WIDTH="80%" ALIGN="center" VALIGN="bottom" ></TD ><TD WIDTH="10%" ALIGN="right" VALIGN="bottom" ><A HREF="msearch-htmlparser.html" ACCESSKEY="N" >Next</A 
></TD ></TR ></TABLE ><HR ALIGN="LEFT" WIDTH="100%"></DIV ><DIV CLASS="chapter" ><H1 ><A NAME="extended-indexing" ></A >Chapter 4. Extended indexing features</H1 ><DIV CLASS="TOC" ><DL ><DT ><B >Table of Contents</B ></DT ><DT ><A HREF="msearch-extended-indexing.html#news" >News extensions <A NAME="AEN2161" ></A ></A ></DT ><DT ><A HREF="msearch-extended-indexing.html#mp3" >Creating an MP3 search engine <A NAME="AEN2165" ></A ></A ></DT ><DT ><A HREF="msearch-extended-indexing.html#htdb" >Indexing <ACRONYM CLASS="acronym" >SQL</ACRONYM > tables (<TT CLASS="literal" >htdb:/</TT > virtual <ACRONYM CLASS="acronym" >URL</ACRONYM > scheme) <A NAME="AEN2227" ></A ></A ></DT ><DT ><A HREF="msearch-extended-indexing.html#exec" >Indexing a program output (<TT CLASS="literal" >exec:/</TT > and <TT CLASS="literal" >cgi:/</TT > virtual URL schemes) <A NAME="AEN2568" ></A ></A ></DT ><DT ><A HREF="msearch-extended-indexing.html#mirror" >Mirroring <A NAME="AEN2650" ></A ></A ></DT ><DT ><A HREF="msearch-extended-indexing.html#dump-restore" >Dumping and restoring the search database <A NAME="AEN2735" ></A ></A ></DT ></DL ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="news" >News extensions <A NAME="AEN2161" ></A ></A ></H2 ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="mp3" >Creating an MP3 search engine <A NAME="AEN2165" ></A ></A ></H2 ><P > <SPAN CLASS="application" >mnoGoSearch</SPAN > has a built-in parser for MP3 files. It can extract the <TT CLASS="literal" >Album</TT >, the <TT CLASS="literal" >Artist</TT >, the <TT CLASS="literal" >Song</TT > as well as the <TT CLASS="literal" >Year</TT > <SPAN CLASS="emphasis" ><I CLASS="emphasis" >MP3 tags</I ></SPAN > from an MP3 file. You can create a full-featured MP3 search engine using <SPAN CLASS="application" >mnoGoSearch</SPAN >. 
</P ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="mp3-indexer" >MP3 <TT CLASS="filename" >indexer.conf</TT > commands</A ></H3 ><A NAME="AEN2178" ></A ><P > To activate indexing of MP3 tags, use the <B CLASS="command" ><A HREF="msearch-cmdref-checkmp3.html" >CheckMP3</A ></B > and <B CLASS="command" ><A HREF="msearch-cmdref-checkmp3only.html" >CheckMP3Only</A ></B > commands in <TT CLASS="filename" >indexer.conf</TT >, and enable processing of the MP3 sections (they are disabled by default). This is an example of an <TT CLASS="filename" >indexer.conf</TT > file with MP3-related commands: <PRE CLASS="programlisting" >
Section MP3.Song 21 128
Section MP3.Album 22 128
Section MP3.Artist 23 128
Section MP3.Year 24 128
CheckMP3 *.mp3
Hrefonly *
</PRE > With the above configuration, <SPAN CLASS="application" >indexer</SPAN > will check all <TT CLASS="filename" >*.mp3</TT > files for MP3 tags, and will collect new links from other file types without indexing them. </P ><P > <A NAME="AEN2192" ></A > When you use the <B CLASS="command" ><A HREF="msearch-cmdref-checkmp3.html" >CheckMP3</A ></B > command, <SPAN CLASS="application" >indexer</SPAN > downloads only <TT CLASS="literal" >128</TT > bytes from the files with the given extension(s) to detect and parse MP3 tags. <DIV CLASS="note" ><BLOCKQUOTE CLASS="note" ><P ><B >Note: </B > <SPAN CLASS="application" >indexer</SPAN > downloads MP3 files efficiently from FTP servers, as well as from HTTP servers that support the HTTP/1.1 protocol with the <TT CLASS="literal" >Range</TT > request header, which is used to request partial content. Old HTTP servers that do not support the <TT CLASS="literal" >Range</TT > HTTP header may not work well together with <SPAN CLASS="application" >mnoGoSearch</SPAN >. 
</P ></BLOCKQUOTE ></DIV > </P ></DIV ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="mp3-search" >Restricting search to a certain MP3 section</A ></H3 ><P > If you want to restrict searches by <TT CLASS="literal" >Author</TT >, <TT CLASS="literal" >Album</TT >, <TT CLASS="literal" >Song</TT > or <TT CLASS="literal" >Year</TT >, you can use the standard <SPAN CLASS="application" >mnoGoSearch</SPAN > ways to restrict searches described in <A HREF="msearch-doingsearch.html#search-changeweight" >the Section called <I >Changing weights of the different document parts at search time</I > in Chapter 10</A > and <A HREF="msearch-doingsearch.html#search-secnoref" >the Section called <I >Restricting search words to a section <A NAME="AEN5056" ></A ></I > in Chapter 10</A >. For example, if you want to restrict search by song and author name, you use the standard <SPAN CLASS="application" >mnoGoSearch</SPAN > way to specify sections: <TT CLASS="literal" >Song: help Author:Beatles</TT >. </P ><P > With the default sections given in <TT CLASS="filename" >indexer.conf-dist</TT >, you may find it useful to add this <ACRONYM CLASS="acronym" >HTML</ACRONYM > form element to <TT CLASS="filename" >search.htm</TT > to restrict the search area: <PRE CLASS="programlisting" >
Search in: &lt;SELECT NAME="wf"&gt;
&lt;OPTION VALUE="111100000000000000000000" SELECTED="$(wf)"&gt;All MP3 sections&lt;/OPTION&gt;
&lt;OPTION VALUE="000100000000000000000000" SELECTED="$(wf)"&gt;MP3 Song name&lt;/OPTION&gt;
&lt;OPTION VALUE="001000000000000000000000" SELECTED="$(wf)"&gt;MP3 Album&lt;/OPTION&gt;
&lt;OPTION VALUE="010000000000000000000000" SELECTED="$(wf)"&gt;MP3 Artist&lt;/OPTION&gt;
&lt;OPTION VALUE="100000000000000000000000" SELECTED="$(wf)"&gt;MP3 Year&lt;/OPTION&gt;
&lt;/SELECT&gt;
</PRE > </P ></DIV ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="htdb" >Indexing <ACRONYM CLASS="acronym" >SQL</ACRONYM > tables (<TT CLASS="literal" >htdb:/</TT > virtual <ACRONYM CLASS="acronym" >URL</ACRONYM > scheme) <A NAME="AEN2227" ></A ></A ></H2 ><P > <SPAN 
CLASS="application" >mnoGoSearch</SPAN > can index <ACRONYM CLASS="acronym" >SQL</ACRONYM > tables with long text columns with the help of the so-called <TT CLASS="literal" >htdb:/</TT > virtual <ACRONYM CLASS="acronym" >URL</ACRONYM > scheme. </P ><P >Using the <TT CLASS="literal" >htdb:/</TT > virtual scheme, you can build a full-text index for your <ACRONYM CLASS="acronym" >SQL</ACRONYM > tables as well as index your database-driven Web servers. </P ><DIV CLASS="note" ><BLOCKQUOTE CLASS="note" ><P ><B >Note: </B >The table you want to index with <ACRONYM CLASS="acronym" >HTDB</ACRONYM > must have a <TT CLASS="literal" >PRIMARY KEY</TT > or a <TT CLASS="literal" >UNIQUE INDEX</TT >. </P ></BLOCKQUOTE ></DIV ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="htdb-indexer" ><ACRONYM CLASS="acronym" >HTDB</ACRONYM > <TT CLASS="filename" >indexer.conf</TT > commands</A ></H3 ><P > <ACRONYM CLASS="acronym" >HTDB</ACRONYM > is implemented using the following <TT CLASS="filename" >indexer.conf</TT > commands: <A HREF="msearch-cmdref-htdbaddr.html" >HTDBAddr</A >, <A HREF="msearch-cmdref-htdblist.html" >HTDBList</A >, <A HREF="msearch-cmdref-htdblimit.html" >HTDBLimit</A >, <A HREF="msearch-cmdref-htdbdoc.html" >HTDBDoc</A >. </P ><P >The purpose of the <A HREF="msearch-cmdref-htdbaddr.html" >HTDBAddr</A > command is to specify a database connection string. It uses the same syntax as <A HREF="msearch-cmdref-dbaddr.html" >DBAddr</A >. If no <A HREF="msearch-cmdref-htdbaddr.html" >HTDBAddr</A > command is specified, the data will be fetched using the same connection specified in <A HREF="msearch-cmdref-dbaddr.html" >DBAddr</A >. 
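As a minimal sketch, assuming the search database and the source tables live in two separate MySQL databases (the user names, passwords and database names are placeholders), the two commands might be combined like this: <PRE CLASS="programlisting" >
# Connection used for the mnoGoSearch tables themselves
DBAddr mysql://foo:bar@localhost/mnogosearch/
# Connection used to run the HTDBList and HTDBDoc queries
HTDBAddr mysql://foofoo:barbar@localhost/database/
</PRE > Leaving out the <A HREF="msearch-cmdref-htdbaddr.html" >HTDBAddr</A > line in this sketch would make the <ACRONYM CLASS="acronym" >HTDB</ACRONYM > queries run over the <A HREF="msearch-cmdref-dbaddr.html" >DBAddr</A > connection instead. 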
</P ><P > <A NAME="AEN2261" ></A > The <A HREF="msearch-cmdref-htdblist.html" >HTDBList</A > command is used to specify an <ACRONYM CLASS="acronym" >SQL</ACRONYM > query which generates a list of documents using either absolute or relative <ACRONYM CLASS="acronym" >URL</ACRONYM > notation, for example: <PRE CLASS="programlisting" >
HTDBList "SELECT CONCAT('htdb:/',id) FROM messages"
</PRE > or <PRE CLASS="programlisting" >
HTDBList "SELECT id FROM messages"
</PRE > <DIV CLASS="note" ><BLOCKQUOTE CLASS="note" ><P ><B >Note: </B > <A HREF="msearch-cmdref-htdblist.html" >HTDBList</A > can return non-<ACRONYM CLASS="acronym" >htdb</ACRONYM > <ACRONYM CLASS="acronym" >URL</ACRONYM >s as well. This gives another way to use <ACRONYM CLASS="acronym" >HTDB</ACRONYM >: you can store the list of "<SPAN CLASS="emphasis" ><I CLASS="emphasis" >real URLs</I ></SPAN >" (e.g. <ACRONYM CLASS="acronym" >HTTP</ACRONYM >-style <ACRONYM CLASS="acronym" >URL</ACRONYM >s) in the database and fetch them with the help of <ACRONYM CLASS="acronym" >HTDB</ACRONYM >. <PRE CLASS="programlisting" >
HTDBList "SELECT url FROM mytable"
Server urllist htdb:/
Realm page *
</PRE > </P ></BLOCKQUOTE ></DIV > </P ><P >The <ACRONYM CLASS="acronym" >SQL</ACRONYM > query given in <A HREF="msearch-cmdref-htdblist.html" >HTDBList</A > is used for all documents whose <ACRONYM CLASS="acronym" >URL</ACRONYM > ends with the '<TT CLASS="literal" >/</TT >' sign. This query is analogous to a <SPAN CLASS="emphasis" ><I CLASS="emphasis" >file system directory listing</I ></SPAN >. </P ><P ><A NAME="AEN2287" ></A > The <A HREF="msearch-cmdref-htdblimit.html" >HTDBLimit</A > command is used to specify the maximum number of records fetched by a single <B CLASS="command" >SELECT</B > query given in the <A HREF="msearch-cmdref-htdblist.html" >HTDBList</A > command. 
<A HREF="msearch-cmdref-htdblimit.html" >HTDBLimit</A > helps to reduce memory consumption when indexing large <ACRONYM CLASS="acronym" >SQL</ACRONYM > tables. For example: <PRE CLASS="programlisting" > HTDBLimit 512 </PRE > </P ><P > <A NAME="AEN2297" ></A > The <A HREF="msearch-cmdref-htdbdoc.html" >HTDBDoc</A > command specifies an <ACRONYM CLASS="acronym" >SQL</ACRONYM > query to get a single document from the database using its <TT CLASS="literal" >PRIMARY KEY</TT > value. The <A HREF="msearch-cmdref-htdbdoc.html" >HTDBDoc</A > query is executed for all <ACRONYM CLASS="acronym" >HTDB</ACRONYM > documents not having the '<TT CLASS="literal" >/</TT >' in the end of their <ACRONYM CLASS="acronym" >URL</ACRONYM >. </P ><P >An <ACRONYM CLASS="acronym" >SQL</ACRONYM > query given in the <A HREF="msearch-cmdref-htdbdoc.html" >HTDBDoc</A > command must return a <SPAN CLASS="emphasis" ><I CLASS="emphasis" >single row</I ></SPAN > result. If the <A HREF="msearch-cmdref-htdbdoc.html" >HTDBDoc</A > query returns an empty set or multiple records, the <ACRONYM CLASS="acronym" >HTDB</ACRONYM > retrieval system generates a <TT CLASS="literal" >HTTP 404 Not Found</TT > response. This can happen at re-indexing time if the record was deleted from the table since last re-indexing. You can use <A HREF="msearch-cmdref-holdbadhrefs.html" ><B CLASS="command" >HoldBadHrefs 0</B ></A > to remove the deleted records from the <SPAN CLASS="application" >mnoGoSearch</SPAN > tables as well. </P ><P ><SPAN CLASS="application" >mnoGoSearch</SPAN > understands three types of <A HREF="msearch-cmdref-htdbdoc.html" >HTDBDoc</A > <ACRONYM CLASS="acronym" >SQL</ACRONYM > queries. <P ></P ><UL ><LI ><P > A single-column result with a fully formatted <ACRONYM CLASS="acronym" >HTTP</ACRONYM > response, including standard <ACRONYM CLASS="acronym" >HTTP</ACRONYM > response status line. 
See <A HREF="msearch-http-codes.html" >the Section called <I >HTTP response codes <SPAN CLASS="application" >mnoGoSearch</SPAN > understands</I > in Chapter 3</A > to learn how <SPAN CLASS="application" >indexer</SPAN > handles various <ACRONYM CLASS="acronym" >HTTP</ACRONYM > status codes. An <A HREF="msearch-cmdref-htdbdoc.html" >HTDBDoc</A > <ACRONYM CLASS="acronym" >SQL</ACRONYM > query can also optionally include <ACRONYM CLASS="acronym" >HTTP</ACRONYM > headers understood by <SPAN CLASS="application" >indexer</SPAN >, such as <TT CLASS="literal" >Content-Type</TT >, <TT CLASS="literal" >Last-Modified</TT >, <TT CLASS="literal" >Content-Encoding</TT > and other headers. So you can build a very flexible indexing system by returning different <ACRONYM CLASS="acronym" >HTTP</ACRONYM > status codes and headers. </P ><P ><P ><B >Example:</B ></P > <PRE CLASS="programlisting" >
HTDBDoc "SELECT CONCAT(\
'HTTP/1.0 200 OK\\r\\n',\
'Content-type: text/plain\\r\\n',\
'\\r\\n',\
msg) \
FROM messages WHERE id='$1'"
</PRE > </P ></LI ><LI ><P > A multiple-column result, with the status line starting with the "<TT CLASS="literal" >HTTP/</TT >" substring at the beginning of the first column. All columns are concatenated using the <TT CLASS="literal" >Carriage-Return + New-Line</TT > (<TT CLASS="literal" >\r\n</TT >) delimiters to generate an <ACRONYM CLASS="acronym" >HTTP</ACRONYM >-like response. The first column returning an empty string is treated as the delimiter between the headers and the content part of the <ACRONYM CLASS="acronym" >HTTP</ACRONYM > response, and is replaced with "<TT CLASS="literal" >\r\n\r\n</TT >". This type of query is a simpler form of the previous type. It avoids concatenation operators and functions, as well as explicit "<TT CLASS="literal" >\r\n</TT >" header delimiters. 
</P ><P ><P ><B >Example:</B ></P > <PRE CLASS="programlisting" >
HTDBDoc "SELECT 'HTTP/1.0 200 OK','Content-type: text/plain','',msg \
FROM messages WHERE id='$1'"
</PRE > </P ></LI ><LI ><P > A single- or multiple-column result without the "<TT CLASS="literal" >HTTP/</TT >" header. This is the simplest <A HREF="msearch-cmdref-htdbdoc.html" >HTDBDoc</A > response type. The <ACRONYM CLASS="acronym" >SQL</ACRONYM > column names returned by the query are associated with the <A HREF="msearch-cmdref-section.html" >Section</A > names configured in <TT CLASS="filename" >indexer.conf</TT >. </P ><P ><P ><B >Example:</B ></P > <PRE CLASS="programlisting" >
Section body 1 256
Section title 2 256
HTDBDoc "SELECT title, body FROM messages WHERE id='$1'"
</PRE > </P ><P > In this example, the values of the columns <CODE CLASS="varname" >title</CODE > and <CODE CLASS="varname" >body</CODE > are associated with the sections <TT CLASS="literal" >title</TT > and <TT CLASS="literal" >body</TT > respectively. </P ><P > The columns with the names <CODE CLASS="varname" >status</CODE > and <CODE CLASS="varname" >last_mod_time</CODE > have a special meaning: the <ACRONYM CLASS="acronym" >HTTP</ACRONYM > status code and the document modification time respectively. The <CODE CLASS="varname" >status</CODE > value should be an integer code according to <ACRONYM CLASS="acronym" >HTTP</ACRONYM > notation, and the modification time should be in Unix timestamp format, that is the number of seconds since <TT CLASS="literal" >January 1, 1970</TT >. </P ><P ><P ><B >Example:</B ></P > <PRE CLASS="programlisting" >
HTDBDoc "SELECT title, body, \
CASE WHEN messages.deleted THEN 404 ELSE 200 END as status,\
timestamp as last_mod_time FROM messages WHERE id='$1'"
</PRE > </P ><P > The above example demonstrates how to use the special columns. 
The <ACRONYM CLASS="acronym" >SQL</ACRONYM > query will return status "<TT CLASS="literal" >404 Not found</TT >" for all documents marked as deleted, which will make <SPAN CLASS="application" >indexer</SPAN > remove these documents from the search database when re-indexing the data. Also, this query makes <SPAN CLASS="application" >indexer</SPAN > use the column <CODE CLASS="varname" >timestamp</CODE > as the document modification time. </P ><P > If a column contains data in <ACRONYM CLASS="acronym" >HTML</ACRONYM > format, you can specify the <TT CLASS="literal" >html</TT > keyword in the corresponding <A HREF="msearch-cmdref-section.html" >Section</A > command, which will make <SPAN CLASS="application" >indexer</SPAN > apply the <ACRONYM CLASS="acronym" >HTML</ACRONYM > parser to this column and therefore remove all <ACRONYM CLASS="acronym" >HTML</ACRONYM > tags and comments: </P ><P ><P ><B >Example:</B ></P > <PRE CLASS="programlisting" > Section title 1 256 Section wiki_text 2 16000 html HTDBDoc "SELECT title, wiki_text FROM messages WHERE id='$1'" </PRE > </P ></LI ></UL > </P ></DIV ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="htdb-var" ><ACRONYM CLASS="acronym" >HTDB</ACRONYM > variables <A NAME="AEN2396" ></A ></A ></H3 ><P >The <SPAN CLASS="emphasis" ><I CLASS="emphasis" >path</I ></SPAN > parts of an <ACRONYM CLASS="acronym" >URL</ACRONYM > can be passed as parameters to the <A HREF="msearch-cmdref-htdblist.html" >HTDBList</A > and <A HREF="msearch-cmdref-htdbdoc.html" >HTDBDoc</A > <ACRONYM CLASS="acronym" >SQL</ACRONYM > queries. All parts are to be used as <CODE CLASS="varname" >$1</CODE >, <CODE CLASS="varname" >$2</CODE >, ... 
<CODE CLASS="varname" >$N</CODE >, where the number represents the <SPAN CLASS="emphasis" ><I CLASS="emphasis" >N-th path part</I ></SPAN >, that is the part of the <ACRONYM CLASS="acronym" >URL</ACRONYM > after the <TT CLASS="literal" >N-th</TT > slash sign: <PRE CLASS="programlisting" >
htdb:/part1/part2/part3/part4/part5
       $1    $2    $3    $4    $5
</PRE > </P ><P >For example, suppose you have this <TT CLASS="filename" >indexer.conf</TT > command: <PRE CLASS="programlisting" >
HTDBList "SELECT id FROM catalog WHERE category='$1'"
</PRE > </P ><P >When <SPAN CLASS="application" >mnoGoSearch</SPAN > prepares to fetch a document with the <ACRONYM CLASS="acronym" >URL</ACRONYM > <TT CLASS="literal" >htdb:/cars/</TT >, <CODE CLASS="varname" >$1</CODE > will be replaced with "<TT CLASS="literal" >cars</TT >": <PRE CLASS="programlisting" >
SELECT id FROM catalog WHERE category='cars'
</PRE > </P ><P > You can use long <ACRONYM CLASS="acronym" >URLs</ACRONYM > to pass multiple parameters into both <A HREF="msearch-cmdref-htdblist.html" >HTDBList</A > and <A HREF="msearch-cmdref-htdbdoc.html" >HTDBDoc</A > queries. For example: <PRE CLASS="programlisting" >
HTDBList "SELECT column4 FROM table WHERE column1='$1' AND column2='$2' AND column3='$3'"
HTDBDoc "SELECT title, body FROM table WHERE column1='$1' AND column2='$2' AND column3='$3' AND column4='$4'"
Server htdb:/path1/path2/path3/
</PRE > Using multiple parameters makes it possible to refer to a certain record using parts of a compound <TT CLASS="literal" >PRIMARY KEY</TT > or <TT CLASS="literal" >UNIQUE INDEX</TT >. 
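A minimal sketch of that idea, assuming a hypothetical table <CODE CLASS="varname" >articles</CODE > with a compound key on the columns <CODE CLASS="varname" >section</CODE > and <CODE CLASS="varname" >id</CODE > (none of these names come from the distribution, and <B CLASS="command" >CONCAT</B > assumes a MySQL-style database), might look like this: <PRE CLASS="programlisting" >
Section body 1 256
Section title 2 256
# Hypothetical table: each row yields a relative link such as news/42
HTDBList "SELECT CONCAT(section,'/',id) FROM articles"
# For the URL htdb:/news/42 the variables are $1=news and $2=42
HTDBDoc "SELECT title, body FROM articles WHERE section='$1' AND id='$2'"
Server htdb:/
</PRE > In this sketch both key parts travel in the <ACRONYM CLASS="acronym" >URL</ACRONYM > path, so the <A HREF="msearch-cmdref-htdbdoc.html" >HTDBDoc</A > query should match a single row as long as the pair is unique. 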
</P ></DIV ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="htdb-many" >Using multiple <ACRONYM CLASS="acronym" >HTDB</ACRONYM > sources <A NAME="AEN2432" ></A ></A ></H3 ><P >It's possible to index multiple <ACRONYM CLASS="acronym" >HTDB</ACRONYM > sources using multiple <A HREF="msearch-cmdref-htdblist.html" >HTDBList</A >, <A HREF="msearch-cmdref-htdbdoc.html" >HTDBDoc</A > and <A HREF="msearch-cmdref-server.html" >Server</A > commands in the same <TT CLASS="filename" >indexer.conf</TT >. </P ><P > <PRE CLASS="programlisting" >
Section body 1 256
Section title 2 256

HTDBList "SELECT id FROM t1"
HTDBDoc "SELECT title, body FROM t1 WHERE id=$2"
Server htdb:/t1/

HTDBList "SELECT id FROM t2"
HTDBDoc "SELECT title, body FROM t2 WHERE id=$2"
Server htdb:/t2/

HTDBList "SELECT id FROM t3"
HTDBDoc "SELECT title, body FROM t3 WHERE id=$2"
Server htdb:/t3/
</PRE > </P ></DIV ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="htdb-fulltext" >Using <SPAN CLASS="application" >mnoGoSearch</SPAN > as an external <ACRONYM CLASS="acronym" >SQL</ACRONYM > full-text engine</A ></H3 ><P >With the help of the <TT CLASS="literal" >htdb:/</TT > scheme you can quickly create a full-text index and use it further in your <ACRONYM CLASS="acronym" >SQL</ACRONYM > application. Imagine you have a large <ACRONYM CLASS="acronym" >SQL</ACRONYM > table which stores Web board messages in plain text format, and you want to add search functionality to your Web board. Say the messages are stored in the table <CODE CLASS="varname" >messages</CODE > with two columns, <CODE CLASS="varname" >id</CODE > and <CODE CLASS="varname" >msg</CODE >, where <CODE CLASS="varname" >id</CODE > is an integer <TT CLASS="literal" >PRIMARY KEY</TT > and <CODE CLASS="varname" >msg</CODE > is a long text column containing messages. 
An ordinary <ACRONYM CLASS="acronym" >SQL</ACRONYM > <B CLASS="command" >LIKE</B > search may take a very long time to return a result: <PRE CLASS="programlisting" >
SELECT id, msg FROM messages WHERE msg LIKE '%someword%'
</PRE > </P ><P >With the help of the <TT CLASS="literal" >htdb:/</TT > scheme provided by <SPAN CLASS="application" >mnoGoSearch</SPAN > you can create a full-text index on the table <CODE CLASS="varname" >messages</CODE >. To do so, you can edit your <TT CLASS="filename" >indexer.conf</TT > as follows: <PRE CLASS="programlisting" >
DBAddr mysql://foo:bar@localhost/mnogosearch/?dbmode=single

Section msg 1 256

HTDBAddr mysql://foofoo:barbar@localhost/database/
HTDBList "SELECT id FROM messages"
HTDBDoc "SELECT msg FROM messages WHERE id='$1'"
Server htdb:/
</PRE > </P ><P >When started, <SPAN CLASS="application" >indexer</SPAN > will insert the <ACRONYM CLASS="acronym" >URL</ACRONYM > <TT CLASS="literal" >htdb:/</TT > into the database and will execute the <ACRONYM CLASS="acronym" >SQL</ACRONYM > query given in <A HREF="msearch-cmdref-htdblist.html" >HTDBList</A >, which will produce the values <TT CLASS="literal" >1</TT >, <TT CLASS="literal" >2</TT >, <TT CLASS="literal" >3</TT >, ..., <TT CLASS="literal" >N</TT > in the result. The values will be interpreted as links relative to <TT CLASS="literal" >htdb:/</TT >. A list of new <ACRONYM CLASS="acronym" >URLs</ACRONYM > in the form <TT CLASS="literal" >htdb:/1</TT >, <TT CLASS="literal" >htdb:/2</TT >, ..., <TT CLASS="literal" >htdb:/N</TT > will be added into the database. Then the <A HREF="msearch-cmdref-htdbdoc.html" >HTDBDoc</A > <ACRONYM CLASS="acronym" >SQL</ACRONYM > query will be executed for every added <ACRONYM CLASS="acronym" >URL</ACRONYM >. <A HREF="msearch-cmdref-htdbdoc.html" >HTDBDoc</A > will return the column <CODE CLASS="varname" >msg</CODE > as the document content, which will be associated with the section <CODE CLASS="varname" >msg</CODE > and parsed. 
Word information will be stored in the table <CODE CLASS="varname" >dict</CODE > (assuming the <TT CLASS="literal" >single</TT > storage mode). </P ><P >After indexing is done, you can use the <SPAN CLASS="application" >mnoGoSearch</SPAN > tables to perform a search: <PRE CLASS="programlisting" >
SELECT url.url
FROM url,dict
WHERE dict.url_id=url.rec_id
AND dict.word='someword';
</PRE > </P ><P >The table <CODE CLASS="varname" >dict</CODE > has an index on the column <CODE CLASS="varname" >word</CODE >, so the above query will be executed much faster than the queries using the <B CLASS="command" >LIKE</B > operator on the table <CODE CLASS="varname" >messages</CODE >. </P ><P >You can also search for multiple words: <PRE CLASS="programlisting" >
SELECT url.url, count(*) as c
FROM url,dict
WHERE dict.url_id=url.rec_id
AND dict.word IN ('some','word')
GROUP BY url.url
ORDER BY c DESC;
</PRE > </P ><P >Both queries will return <TT CLASS="literal" >htdb:/XXX</TT > values from the <CODE CLASS="varname" >url.url</CODE > field. Your application can then strip the "<TT CLASS="literal" >htdb:/</TT >" prefix from the returned values to get the <TT CLASS="literal" >PRIMARY KEY</TT > values from the table <CODE CLASS="varname" >messages</CODE >. </P ></DIV ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="htdb-web" >Indexing a database driven Web server</A ></H3 ><P >You can also use <ACRONYM CLASS="acronym" >HTDB</ACRONYM > to index your database-driven Web server. This lets you index your documents without invoking the Web server at indexing time, which should require fewer <ACRONYM CLASS="acronym" >CPU</ACRONYM > resources than direct <ACRONYM CLASS="acronym" >HTTP</ACRONYM > indexing and should therefore reduce the load on the Web server machine. </P ><P >The main idea of indexing a database-driven Web server is to map <ACRONYM CLASS="acronym" >HTTP</ACRONYM > requests into <ACRONYM CLASS="acronym" >HTDB</ACRONYM > requests at indexing time. 
So <SPAN CLASS="application" >indexer</SPAN > will fetch the source data directly from the <ACRONYM CLASS="acronym" >SQL</ACRONYM > database, meanwhile <SPAN CLASS="application" >search.cgi</SPAN > will return real <ACRONYM CLASS="acronym" >URLs</ACRONYM > in usual <ACRONYM CLASS="acronym" >HTTP</ACRONYM > notation. This can be achieved using the aliasing mechanisms provided by <SPAN CLASS="application" >mnoGoSearch</SPAN >. </P ><P >Take a look at a sample file <TT CLASS="filename" >doc/samples/htdb.conf</TT >, which is included into <SPAN CLASS="application" >mnoGoSearch</SPAN > source distribution. It is the <TT CLASS="filename" >indexer.conf</TT > file used to index the Web board at the <A HREF="http://www.mnogosearch.org/" TARGET="_top" > <SPAN CLASS="application" >mnoGoSearch</SPAN > site </A >. </P ><P >The <A HREF="msearch-cmdref-htdblist.html" >HTDBList</A > command generates <ACRONYM CLASS="acronym" >URLs</ACRONYM > in the form: <PRE CLASS="programlisting" > http://www.mnogosearch.org/board/message.php?id=XXX </PRE > </P ><P >where <TT CLASS="literal" >XXX</TT > is a <TT CLASS="literal" >PRIMARY KEY</TT > value from the table <CODE CLASS="varname" >messages</CODE >. 
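The sample file is not reproduced here, but as a rough sketch, assuming the MySQL <B CLASS="command" >CONCAT</B > function and a key column named <CODE CLASS="varname" >id</CODE > (both are assumptions rather than quotes from <TT CLASS="filename" >doc/samples/htdb.conf</TT >), an <A HREF="msearch-cmdref-htdblist.html" >HTDBList</A > command producing such absolute <ACRONYM CLASS="acronym" >URLs</ACRONYM > might read: <PRE CLASS="programlisting" >
# Hypothetical query: returns one absolute HTTP URL per message
HTDBList "SELECT CONCAT('http://www.mnogosearch.org/board/message.php?id=',id) FROM messages"
</PRE > 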
</P ><P > For every <TT CLASS="literal" >PRIMARY KEY</TT > value a fully formatted <ACRONYM CLASS="acronym" >HTTP</ACRONYM > response is generated, containing a <TT CLASS="literal" >text/html</TT > document with headers and this content: <PRE CLASS="programlisting" >
&lt;HTML&gt;
&lt;HEAD&gt;
&lt;TITLE&gt;<SPAN CLASS="emphasis" ><I CLASS="emphasis" >Subject goes here</I ></SPAN >&lt;/TITLE&gt;
&lt;META NAME="Description" Content="<SPAN CLASS="emphasis" ><I CLASS="emphasis" >Author name goes here</I ></SPAN >"&gt;
&lt;/HEAD&gt;
&lt;BODY&gt;
<SPAN CLASS="emphasis" ><I CLASS="emphasis" >Message text goes here</I ></SPAN >
&lt;/BODY&gt;
</PRE > </P ><P >At the end of <TT CLASS="filename" >doc/samples/htdb.conf</TT > you can find these commands: <PRE CLASS="programlisting" >
Server htdb:/
Realm http://www.mnogosearch.org/board/message.php?id=*
Alias http://www.mnogosearch.org/board/message.php?id= htdb:/
</PRE > </P ><P > The first command tells <SPAN CLASS="application" >indexer</SPAN > to execute the <A HREF="msearch-cmdref-htdblist.html" >HTDBList</A > query, which generates a list of messages in the form: <PRE CLASS="programlisting" >
http://www.mnogosearch.org/board/message.php?id=XXX
</PRE > </P ><P > The second command tells <SPAN CLASS="application" >indexer</SPAN > to allow messages matching the given <ACRONYM CLASS="acronym" >URL</ACRONYM > pattern using string match with the '<TT CLASS="literal" >*</TT >' wildcard at the end. </P ><P >The third command replaces the substring <TT CLASS="literal" >http://www.mnogosearch.org/board/message.php?id=</TT > in the <ACRONYM CLASS="acronym" >URL</ACRONYM > with <TT CLASS="literal" >htdb:/</TT > before a message is downloaded, which forces <SPAN CLASS="application" >indexer</SPAN > to use the <ACRONYM CLASS="acronym" >SQL</ACRONYM > table as the data source for a document instead of sending an <ACRONYM CLASS="acronym" >HTTP</ACRONYM > request to the Web server. 
</P ><P > After indexing is done, <SPAN CLASS="application" >search.cgi</SPAN > will display search results using the usual <ACRONYM CLASS="acronym" >HTTP</ACRONYM > notation, for example: <TT CLASS="literal" >http://www.mnogosearch.org/board/message.php?id=1000</TT > </P ></DIV ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="exec" >Indexing a program output (<TT CLASS="literal" >exec:/</TT > and <TT CLASS="literal" >cgi:/</TT > virtual URL schemes) <A NAME="AEN2568" ></A ></A ></H2 ><P ><SPAN CLASS="application" >mnoGoSearch</SPAN > offers the special virtual URL methods <TT CLASS="literal" >exec:/</TT > and <TT CLASS="literal" >cgi:/</TT >. These methods allow the output of an external program to be used as a source for indexing. <SPAN CLASS="application" >mnoGoSearch</SPAN > can work with any executable program that writes its results to <TT CLASS="filename" >STDOUT</TT >. The output must conform to the <ACRONYM CLASS="acronym" >HTTP</ACRONYM > standard and contain full <ACRONYM CLASS="acronym" >HTTP</ACRONYM > response headers (including the <ACRONYM CLASS="acronym" >HTTP</ACRONYM > status line and at least the <TT CLASS="literal" >Content-Type</TT > <ACRONYM CLASS="acronym" >HTTP</ACRONYM > response header) followed by the document content. </P ><P >For example, when indexing either <TT CLASS="literal" >cgi:/usr/local/bin/myprog</TT > or <TT CLASS="literal" >exec:/usr/local/bin/myprog</TT >, <SPAN CLASS="application" >indexer</SPAN > will execute the <TT CLASS="filename" >/usr/local/bin/myprog</TT > program. </P ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="exec-cgi" >Passing parameters to the <TT CLASS="literal" >cgi:/</TT > virtual scheme</A ></H3 ><P >When executing a program given in a <TT CLASS="literal" >cgi:/</TT > URL, <SPAN CLASS="application" >indexer</SPAN > emulates the environment this program would run in when executed under an <ACRONYM CLASS="acronym" >HTTP</ACRONYM > server. 
It creates the <CODE CLASS="varname" >REQUEST_METHOD=GET</CODE > environment variable, and the <CODE CLASS="varname" >QUERY_STRING</CODE > variable according to the HTTP standards. For example, if <TT CLASS="literal" >cgi:/usr/local/apache/cgi-bin/test-cgi?a=b&d=e</TT > is being indexed, <SPAN CLASS="application" >indexer</SPAN > creates <TT CLASS="literal" >QUERY_STRING</TT > with the value <TT CLASS="literal" >a=b&d=e</TT >. The <TT CLASS="literal" >cgi:/</TT > virtual URL scheme lets you index your site without invoking a web server, even if you want to index CGI scripts. For example, suppose you have a web site with static documents under <TT CLASS="filename" >/usr/local/apache/htdocs/</TT > and with CGI scripts under <TT CLASS="filename" >/usr/local/apache/cgi-bin/</TT >. You can use the following configuration: <PRE CLASS="programlisting" >
Server http://localhost/
Alias http://localhost/cgi-bin/ cgi:/usr/local/apache/cgi-bin/
Alias http://localhost/ file:///usr/local/apache/htdocs/
</PRE > </P ></DIV ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="exec-exec" >Passing parameters to the <TT CLASS="literal" >exec:/</TT > virtual scheme</A ></H3 ><P > In the case of an <TT CLASS="literal" >exec:/</TT > URL, <SPAN CLASS="application" >indexer</SPAN > does not create the <TT CLASS="literal" >QUERY_STRING</TT > variable; instead, it passes all parameters on the command line. For example, when indexing <TT CLASS="literal" >exec:/usr/local/bin/myprog?a=b&d=e</TT >, this command will be executed: <PRE CLASS="programlisting" >
/usr/local/bin/myprog "a=b&d=e"
</PRE > </P ></DIV ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="exec-ext" >Using the <TT CLASS="literal" >exec:/</TT > virtual scheme as an external retrieval system</A ></H3 ><P >The <TT CLASS="literal" >exec:/</TT > virtual scheme can be used as an external retrieval system. It allows using protocols which are not supported natively by <SPAN CLASS="application" >mnoGoSearch</SPAN >. 
For example, you can use the <SPAN CLASS="application" >curl</SPAN > program, available from <A HREF="http://curl.haxx.se/" TARGET="_top" >http://curl.haxx.se/</A >, to index HTTPS sites when <SPAN CLASS="application" >mnoGoSearch</SPAN > is compiled without built-in HTTPS support. </P ><P >Put this short script into <TT CLASS="literal" >/usr/local/mnogosearch/etc/</TT > under the name <TT CLASS="filename" >curl.sh</TT >. <PRE CLASS="programlisting" >
#!/bin/sh
# Quote the URL so that characters such as ? and & reach curl unchanged
/usr/local/bin/curl -i "$1" 2>/dev/null
</PRE > </P ><P >This script takes a URL given as a command line parameter and executes <SPAN CLASS="application" >curl</SPAN > to download the given URL. The <TT CLASS="literal" >-i</TT > argument tells <SPAN CLASS="application" >curl</SPAN > to output the result together with the <ACRONYM CLASS="acronym" >HTTP</ACRONYM > response headers. </P ><P >Add these commands to <TT CLASS="filename" >indexer.conf</TT >: <PRE CLASS="programlisting" >
Server https://some.https.site/
Alias  https://  exec:/usr/local/mnogosearch/etc/curl.sh?https://
</PRE > </P ><P >When indexing <TT CLASS="filename" >https://some.https.site/path/to/page.html</TT >, <SPAN CLASS="application" >indexer</SPAN > will translate this URL to <PRE CLASS="programlisting" >
exec:/usr/local/mnogosearch/etc/curl.sh?https://some.https.site/path/to/page.html
</PRE > </P ><P >then execute the <TT CLASS="filename" >curl.sh</TT > script: <PRE CLASS="programlisting" >
/usr/local/mnogosearch/etc/curl.sh "https://some.https.site/path/to/page.html"
</PRE > </P ><P >and load its output for indexing. <DIV CLASS="note" ><BLOCKQUOTE CLASS="note" ><P ><B >Note: </B > <SPAN CLASS="application" >indexer</SPAN > loads up to <B CLASS="command" ><A HREF="msearch-cmdref-maxdocsize.html" >MaxDocSize</A ></B > bytes when executing an <TT CLASS="literal" >exec:/</TT > or <TT CLASS="literal" >cgi:/</TT > URL. 
</P ></BLOCKQUOTE ></DIV > </P ></DIV ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="mirror" >Mirroring <A NAME="AEN2650" ></A ></A ></H2 ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="mirror-creating" >Creating a mirror</A ></H3 ><P > <A NAME="AEN2655" ></A > <SPAN CLASS="application" >mnoGoSearch</SPAN > supports some mirroring functionality. To enable mirroring, use the <A HREF="msearch-cmdref-mirrorroot.html" >MirrorRoot</A > command to specify the path where <SPAN CLASS="application" >indexer</SPAN > will create the mirrors of your sites. For example: <PRE CLASS="programlisting" >
MirrorRoot /path/to/mirror
</PRE > </P ><P > <A NAME="AEN2663" ></A > You can also configure <SPAN CLASS="application" >indexer</SPAN > to store <ACRONYM CLASS="acronym" >HTTP</ACRONYM > headers on disk. This can be helpful if you want to use the local mirror for quick reindexing of the remote site. Use the <A HREF="msearch-cmdref-mirrorheadersroot.html" >MirrorHeadersRoot</A > command to activate storing the <ACRONYM CLASS="acronym" >HTTP</ACRONYM > headers. For example: <PRE CLASS="programlisting" >
MirrorHeadersRoot /path/to/headers
</PRE > </P ><DIV CLASS="note" ><BLOCKQUOTE CLASS="note" ><P ><B >Note: </B > <A HREF="msearch-cmdref-mirrorroot.html" >MirrorRoot</A > and <A HREF="msearch-cmdref-mirrorheadersroot.html" >MirrorHeadersRoot</A > can point to the same directory. </P ></BLOCKQUOTE ></DIV ><DIV CLASS="note" ><BLOCKQUOTE CLASS="note" ><P ><B >Note: </B ><SPAN CLASS="application" >indexer</SPAN > does not download more than <A HREF="msearch-cmdref-maxdocsize.html" >MaxDocSize</A > bytes from any document. If a document is larger, it will be only partially downloaded. Make sure that <A HREF="msearch-cmdref-maxdocsize.html" >MaxDocSize</A > is large enough if you want to use the mirror created by <SPAN CLASS="application" >indexer</SPAN > as a real site mirror. 
</P ></BLOCKQUOTE ></DIV ></DIV ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="mirror-as-cache" >Using a mirror as a crawler cache <A NAME="AEN2683" ></A ></A ></H3 ><P > <A NAME="AEN2686" ></A > <SPAN CLASS="application" >mnoGoSearch</SPAN > can use a previously created mirror as a crawler cache. This is useful when you experiment with <SPAN CLASS="application" >mnoGoSearch</SPAN > to find the best configuration: you modify your <TT CLASS="filename" >indexer.conf</TT >, then clear the database and index the same sites again. To reduce Internet traffic, you can activate loading documents from the mirror using the <A HREF="msearch-cmdref-mirrorperiod.html" >MirrorPeriod</A > command. For example: <PRE CLASS="programlisting" >
MirrorPeriod 2h
</PRE > </P ><P > <A HREF="msearch-cmdref-mirrorperiod.html" >MirrorPeriod</A > specifies the period of time during which <SPAN CLASS="application" >indexer</SPAN > considers the local mirrored copy of a document valid. If <SPAN CLASS="application" >indexer</SPAN > finds that the local mirrored copy is fresh enough, it will not download the same document again and will use the local copy instead. If the local copy is older than <A HREF="msearch-cmdref-mirrorperiod.html" >MirrorPeriod</A > allows, then <SPAN CLASS="application" >indexer</SPAN > will download the document from its original location again and update the locally mirrored copy. </P ><P > If <A HREF="msearch-cmdref-mirrorheadersroot.html" >MirrorHeadersRoot</A > is not specified and therefore the original <ACRONYM CLASS="acronym" >HTTP</ACRONYM > headers are not available, then <SPAN CLASS="application" >indexer</SPAN > will detect the <TT CLASS="literal" >Content-Type</TT > of a document using the <A HREF="msearch-cmdref-addtype.html" >AddType</A > commands. 
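As a minimal sketch of such a fallback, assuming the mirror holds only <ACRONYM CLASS="acronym" >HTML</ACRONYM > and plain-text files, the following commands could be used. The file patterns are illustrative and are not taken from this manual, so adapt them to the documents actually stored in your mirror: <PRE CLASS="programlisting" >
# Illustrative patterns only (assumption): adjust to the files in your mirror
AddType text/html  *.html *.htm
AddType text/plain *.txt
</PRE >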
</P ><P >The <A HREF="msearch-cmdref-mirrorperiod.html" >MirrorPeriod</A > parameter should have the form <TT CLASS="literal" >xxxA[yyyB[zzzC]]</TT >, where <TT CLASS="literal" >xxx</TT >, <TT CLASS="literal" >yyy</TT >, <TT CLASS="literal" >zzz</TT > are numbers (which can be negative). Spaces are allowed between <TT CLASS="literal" >xxx</TT > and <TT CLASS="literal" >A</TT >, between <TT CLASS="literal" >A</TT > and <TT CLASS="literal" >yyy</TT >, and so on. <TT CLASS="literal" >A</TT >, <TT CLASS="literal" >B</TT >, <TT CLASS="literal" >C</TT > can each be one of the following: <PRE CLASS="programlisting" >
s - second
M - minute
h - hour
d - day
m - month
y - year
</PRE > </P ><DIV CLASS="note" ><BLOCKQUOTE CLASS="note" ><P ><B >Note: </B >The letters are similar to the descriptors understood by the <CODE CLASS="function" >strptime()</CODE > and <CODE CLASS="function" >strftime()</CODE > C functions. </P ></BLOCKQUOTE ></DIV ><P >Examples: <PRE CLASS="programlisting" >
15s       - 15 seconds
4h30M     - 4 hours and 30 minutes
1y6m-15d  - 1 year and 6 months minus 15 days
1h-10M+1s - 1 hour minus 10 minutes plus 1 second
</PRE > </P ><P >If you specify only a number without any letters, the time is assumed to be given in seconds. </P ><DIV CLASS="note" ><BLOCKQUOTE CLASS="note" ><P ><B >Note: </B >If you start mirroring in an already existing database, <SPAN CLASS="application" >indexer</SPAN > will refuse to create the mirror immediately because of the traffic optimization method described in <A HREF="msearch-indexing.html#general-crawling-optimization" >the Section called <I >Crawling time optimization</I > in Chapter 3</A >. You can run <KBD CLASS="userinput" >indexer -am</KBD > once to turn off the optimization, or clear the database using <KBD CLASS="userinput" >indexer -C</KBD > and then run <SPAN CLASS="application" >indexer</SPAN > without any arguments. 
</P ></BLOCKQUOTE ></DIV ></DIV ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="dump-restore" >Dumping and restoring the search database <A NAME="AEN2735" ></A ></A ></H2 ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="dump" >Dumping the search database</A ></H3 ><P > It is possible to dump and restore a <SPAN CLASS="application" >mnoGoSearch</SPAN > <ACRONYM CLASS="acronym" >SQL</ACRONYM > database using the standard tools supplied with the database software, such as <SPAN CLASS="application" >mysqldump</SPAN > or <SPAN CLASS="application" >pg_dump</SPAN >. This approach works well when a single <ACRONYM CLASS="acronym" >SQL</ACRONYM > database is used. </P ><P > However, if you use multiple <ACRONYM CLASS="acronym" >SQL</ACRONYM > databases to store <SPAN CLASS="application" >mnoGoSearch</SPAN > data, or use the <A HREF="msearch-cluster.html" ><SPAN CLASS="application" >mnoGoSearch</SPAN > cluster</A > solution and want to redistribute data across a larger number of <ACRONYM CLASS="acronym" >SQL</ACRONYM > databases (say, when adding a new machine to the cluster), or want to reduce the number of separate <ACRONYM CLASS="acronym" >SQL</ACRONYM > databases (say, when removing a machine from the cluster), the standard method of dumping and restoring <ACRONYM CLASS="acronym" >SQL</ACRONYM > data will not work because of conflicts in auto-generated values (<TT CLASS="literal" >auto_increment</TT > values, <TT CLASS="literal" >SEQUENCE</TT > values, <TT CLASS="literal" >IDENTITY</TT > values and so on). </P ><P > Starting with version <TT CLASS="literal" >3.3.9</TT >, <SPAN CLASS="application" >mnoGoSearch</SPAN > includes dump and restore tools that work around this problem. <DIV CLASS="note" ><BLOCKQUOTE CLASS="note" ><P ><B >Note: </B > As of version <TT CLASS="literal" >3.3.9</TT >, the <SPAN CLASS="application" >mnoGoSearch</SPAN > dump and restore tools work only with <SPAN CLASS="application" >MySQL</SPAN >. 
Support for other databases will be added in future releases. </P ></BLOCKQUOTE ></DIV > To create a dump of your <SPAN CLASS="application" >mnoGoSearch</SPAN > database, you can run: <PRE CLASS="programlisting" >
indexer -Edumpdata > dumpfile.sql
</PRE > or, to reduce the dump size, pipe the data through <SPAN CLASS="application" >gzip</SPAN >: <PRE CLASS="programlisting" >
indexer -Edumpdata | gzip > dumpfile.sql.gz
</PRE > </P ><P > The dump file created by <KBD CLASS="userinput" >indexer -Edump</KBD > is an ordinary <ACRONYM CLASS="acronym" >SQL</ACRONYM > dump file, except that it does not include auto-generated values. A fragment of a dump file for a <SPAN CLASS="application" >MySQL</SPAN > database looks like this: <PRE CLASS="programlisting" >
--seed=39
INSERT INTO url (...all columns except rec_id...) VALUES (...);
INSERT INTO urlinfo (url_id,sname,sval) VALUES(last_insert_id(),'body','Modules Directives FAQ...');
INSERT INTO urlinfo (url_id,sname,sval) VALUES(last_insert_id(),'CachedCopy','eNrtWc1v2zgWv+ev...');
INSERT INTO urlinfo (url_id,sname,sval) VALUES(last_insert_id(),'Charset','utf-8');
INSERT INTO urlinfo (url_id,sname,sval) VALUES(last_insert_id(),'Content-Language','en');
INSERT INTO urlinfo (url_id,sname,sval) VALUES(last_insert_id(),'Content-Type','text/html');
INSERT INTO urlinfo (url_id,sname,sval) VALUES(last_insert_id(),'title','Apache HTTP Server Ver...');
INSERT INTO bdicti VALUES(last_insert_id(),1,0x6B6F00011EC296170000726577726974696E6700017E4D,0...');
</PRE > The dump file consists of chunks of <TT CLASS="literal" >INSERT</TT > instructions, one chunk per document. At restore time, the structure of the dump file forces <SPAN CLASS="application" >MySQL</SPAN > to assign a new auto-increment value to the column <CODE CLASS="varname" >url.rec_id</CODE > and to use this value when inserting data into the child tables <CODE CLASS="varname" >urlinfo</CODE > and <CODE CLASS="varname" >bdicti</CODE >. 
</P ><P > Additionally, every chunk includes the comment <TT CLASS="literal" >--seed=xxx</TT >, which is used to distribute data properly between multiple databases at restore time. </P ><P > By default, <KBD CLASS="userinput" >indexer -Edump</KBD > dumps data from all databases specified in the <TT CLASS="filename" >indexer.conf</TT > file. You can use the <CODE CLASS="option" >-D</CODE > command line argument to dump data from one particular database only. For example: <PRE CLASS="programlisting" >
indexer -Edump -D2
</PRE > will dump data from the database described by the second <B CLASS="command" ><A HREF="msearch-cmdref-dbaddr.html" >DBAddr</A ></B > command in <TT CLASS="filename" >indexer.conf</TT >. </P ></DIV ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="restore" >Restoring the search database</A ></H3 ><P > To restore a search database from a dump file, use: <PRE CLASS="programlisting" >
indexer -Esql -v2 < dumpfile.sql
</PRE > or, for a <TT CLASS="filename" >.gz</TT > file: <PRE CLASS="programlisting" >
zcat dumpfile.sql.gz | indexer -Esql -v2
</PRE > <SPAN CLASS="application" >indexer</SPAN > will load the data back into the <ACRONYM CLASS="acronym" >SQL</ACRONYM > database. If you have two or more <B CLASS="command" ><A HREF="msearch-cmdref-dbaddr.html" >DBAddr</A ></B > commands in the current <TT CLASS="filename" >indexer.conf</TT > file, <SPAN CLASS="application" >indexer</SPAN > will also properly distribute the data between the corresponding <ACRONYM CLASS="acronym" >SQL</ACRONYM > databases. 
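As a rough illustration, an <TT CLASS="filename" >indexer.conf</TT > that spreads the restored data over two databases might contain the commands sketched below. The user name, password, host names and database name are placeholders, and <TT CLASS="literal" >dbmode=blob</TT > is an assumption based on the <CODE CLASS="varname" >bdicti</CODE > table that appears in the dump fragment; substitute the values from your own setup: <PRE CLASS="programlisting" >
# Placeholder connection settings (assumptions, not values from this manual)
DBAddr mysql://user:password@host1/mnogosearch/?dbmode=blob
DBAddr mysql://user:password@host2/mnogosearch/?dbmode=blob
</PRE > With a configuration of this shape, restoring a dump file would be expected to distribute the documents between the databases on <TT CLASS="literal" >host1</TT > and <TT CLASS="literal" >host2</TT >.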
</P ></DIV ></DIV ></DIV ><DIV CLASS="NAVFOOTER" ><HR ALIGN="LEFT" WIDTH="100%"><TABLE SUMMARY="Footer navigation table" WIDTH="100%" BORDER="0" CELLPADDING="0" CELLSPACING="0" ><TR ><TD WIDTH="33%" ALIGN="left" VALIGN="top" ><A HREF="msearch-stored.html" ACCESSKEY="P" >Prev</A ></TD ><TD WIDTH="34%" ALIGN="center" VALIGN="top" ><A HREF="index.html" ACCESSKEY="H" >Home</A ></TD ><TD WIDTH="33%" ALIGN="right" VALIGN="top" ><A HREF="msearch-htmlparser.html" ACCESSKEY="N" >Next</A ></TD ></TR ><TR ><TD WIDTH="33%" ALIGN="left" VALIGN="top" >Cached copies <A NAME="AEN2074" ></A ></TD ><TD WIDTH="34%" ALIGN="center" VALIGN="top" > </TD ><TD WIDTH="33%" ALIGN="right" VALIGN="top" >mnoGoSearch <ACRONYM CLASS="acronym" >HTML</ACRONYM > parser</TD ></TR ></TABLE ></DIV ><!--#include virtual="body-after.html"--></BODY ></HTML >