<HTML ><HEAD ><TITLE >Indexing</TITLE ><META NAME="GENERATOR" CONTENT="Modular DocBook HTML Stylesheet Version 1.73 "><LINK REL="HOME" TITLE="mnoGoSearch 3.2 reference manual" HREF="index.html"><LINK REL="PREVIOUS" TITLE="Installation registration" HREF="msearch-register.html"><LINK REL="NEXT" TITLE="Supported HTTP response codes" HREF="msearch-http-codes.html"><LINK REL="STYLESHEET" TYPE="text/css" HREF="mnogo.css"><META NAME="Description" CONTENT="mnoGoSearch - Full Featured Web site Open Source Search Engine Software over the Internet and Intranet Web Sites Based on SQL Database. It is a Free search software covered by GNU license."><META NAME="Keywords" CONTENT="shareware, freeware, download, internet, unix, utilities, search engine, text retrieval, knowledge retrieval, text search, information retrieval, database search, mining, intranet, webserver, index, spider, filesearch, meta, free, open source, full-text, udmsearch, website, find, opensource, search, searching, software, udmsearch, engine, indexing, system, web, ftp, http, cgi, php, SQL, MySQL, database, php3, FreeBSD, Linux, Unix, mnoGoSearch, MacOS X, Mac OS X, Windows, 2000, NT, 95, 98, GNU, GPL, url, grabbing"></HEAD ><BODY CLASS="chapter" BGCOLOR="#EEEEEE" TEXT="#000000" LINK="#000080" VLINK="#800080" ALINK="#FF0000" ><DIV CLASS="NAVHEADER" ><TABLE SUMMARY="Header navigation table" WIDTH="100%" BORDER="0" CELLPADDING="0" CELLSPACING="0" ><TR ><TH COLSPAN="3" ALIGN="center" >mnoGoSearch 3.2 reference manual: Full-featured search engine software</TH ></TR ><TR ><TD WIDTH="10%" ALIGN="left" VALIGN="bottom" ><A HREF="msearch-register.html" ACCESSKEY="P" >Prev</A ></TD ><TD WIDTH="80%" ALIGN="center" VALIGN="bottom" ></TD ><TD WIDTH="10%" ALIGN="right" VALIGN="bottom" ><A HREF="msearch-http-codes.html" ACCESSKEY="N" >Next</A ></TD ></TR ></TABLE ><HR ALIGN="LEFT" WIDTH="100%"></DIV ><DIV CLASS="chapter" ><H1 ><A NAME="indexing" >Chapter 3. 
Indexing</A ></H1 ><DIV CLASS="TOC" ><DL ><DT ><B >Table of Contents</B ></DT ><DT ><A HREF="msearch-indexing.html#general" >Indexing in general</A ></DT ><DT ><A HREF="msearch-http-codes.html" >Supported HTTP response codes</A ></DT ><DT ><A HREF="msearch-content-enc.html" >Content-Encoding support <A NAME="AEN927" ></A ></A ></DT ><DT ><A HREF="msearch-indexer-configuration.html" >indexer configuration</A ></DT ><DT ><A HREF="msearch-extended-indexing.html" >Extended indexing features</A ></DT ><DT ><A HREF="msearch-syslog.html" >Using syslog <A NAME="AEN1665" ></A ></A ></DT ><DT ><A HREF="msearch-stored.html" >Storing compressed document copies <A NAME="AEN1713" ></A ></A ></DT ></DL ></DIV ><DIV CLASS="sect1" ><H1 CLASS="sect1" ><A NAME="general" >Indexing in general</A ></H1 ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="general-conf" >Configuration</A ></H2 ><P >First, configure mnoGoSearch. Indexer configuration is documented mostly in the <TT CLASS="filename" >indexer.conf-dist</TT > file, which you can find in the <TT CLASS="literal" >etc</TT > directory of the mnoGoSearch distribution. You may also look at the other *.conf samples in the <TT CLASS="literal" >doc/samples</TT > directory. </P ><P >To set up the <TT CLASS="filename" >indexer.conf</TT > file, cd to the <TT CLASS="literal" >/etc</TT > directory of the mnoGoSearch installation, copy <TT CLASS="filename" >indexer.conf-dist</TT > to <TT CLASS="filename" >indexer.conf</TT >, and edit it.</P ><P >To configure the search front-ends (<TT CLASS="filename" >search.cgi</TT > and/or <TT CLASS="filename" >search.php3</TT >, or another), copy the <TT CLASS="filename" >search.htm-dist</TT > file in the /etc directory of the mnoGoSearch installation to <TT CLASS="filename" >search.htm</TT > and edit it. 
See <A HREF="msearch-templates.html" >the Section called <I >How to write search result templates <A NAME="AEN3083" ></A ></I > in Chapter 8</A > for a detailed description.</P ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="general-run" >Running <B CLASS="command" >indexer</B ></A ></H2 ><P >Run indexer periodically (once a week, a day, an hour, ...) to pick up the latest modifications on your web sites. You may also add indexer to your <TT CLASS="literal" >crontab</TT >.</P ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="general-builtin" >Built-in database notes</A ></H2 ><P > <TT CLASS="literal" >indexer</TT > with built-in database support does not support reindexing: it indexes the whole site every time it is started.</P ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="general-sql" >SQL back-end notes</A ></H2 ><A NAME="AEN761" ></A ><P >By default, when called without any command line arguments, indexer reindexes only expired documents. You can change the expiration period with the <B CLASS="command" >Period</B > <TT CLASS="filename" >indexer.conf</TT > command. To reindex all documents regardless of whether they have expired, use the <TT CLASS="option" >-a</TT > option; indexer will then mark all documents as expired at startup. </P ><P >When retrieving documents, indexer sends the <TT CLASS="literal" >If-Modified-Since</TT > HTTP header for documents that are already stored in the database. When indexer gets a document, it calculates the document's checksum. If the checksum matches the old checksum stored in the database, the document is not parsed again. The <TT CLASS="option" >-m</TT > command line option prevents indexer from sending <TT CLASS="literal" >If-Modified-Since</TT > headers and makes it parse each document even if its checksum is unchanged. 
This is useful, for example, when you have changed your <B CLASS="command" >Allow/Disallow</B > rules in <TT CLASS="filename" >indexer.conf</TT > and need to add pages that were disallowed earlier.</P ><P >If mnoGoSearch retrieves a URL that returns an HTTP redirect status (301, 302 or 303), it indexes the URL given in the <TT CLASS="literal" >Location:</TT > field of the HTTP header instead.</P ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="general-subsect" >Subsection control with SQL back-end</A ></H2 ><P >indexer has the -t, -u and -s options to limit its action to a part of the database. -t limits by 'Tag', -u limits by URL substring (SQL LIKE wildcards are supported), and -s limits by HTTP status. Limit options in the same group are ORed together; options in different groups are ANDed. mnoGoSearch with the built-in database does not support subsection control. </P ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="general-cleardb" >How to clear database (SQL only) <A NAME="AEN781" ></A ></A ></H2 ><P >To clear the whole database, use 'indexer -C'. You may also delete only a part of the database by using the -t, -u and -s subsection control options.</P ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="general-dbstat" >Database Statistics with SQL back-end <A NAME="AEN786" ></A ></A ></H2 ><P >If you run <TT CLASS="literal" >indexer -S</TT >, it shows database statistics, including the total and expired document counts for each status. 
The -t, -u and -s filters can be used in this mode too.</P ><P >The status values have the following meaning:</P ><P ></P ><UL ><LI ><P >0 - new (not yet indexed) URL</P ></LI ></UL ><P >If the status is not 0, it is an HTTP response code. Some of the HTTP codes are:</P ><P ></P ><UL ><LI ><P > <TT CLASS="literal" >200</TT > - "OK" (URL was successfully indexed)</P ></LI ><LI ><P > <TT CLASS="literal" >301</TT > - "Moved Permanently" (redirect to another URL)</P ></LI ><LI ><P > <TT CLASS="literal" >302</TT > - "Moved Temporarily" (redirect to another URL)</P ></LI ><LI ><P > <TT CLASS="literal" >303</TT > - "See Other" (redirect to another URL)</P ></LI ><LI ><P > <TT CLASS="literal" >304</TT > - "Not modified" (URL has not been modified since the last indexing)</P ></LI ><LI ><P > <TT CLASS="literal" >401</TT > - "Authorization required" (use login/password for the given URL)</P ></LI ><LI ><P > <TT CLASS="literal" >403</TT > - "Forbidden" (you have no access to this URL)</P ></LI ><LI ><P > <TT CLASS="literal" >404</TT > - "Not found" (there were references to URLs that do not exist)</P ></LI ><LI ><P > <TT CLASS="literal" >500</TT > - "Internal Server Error" (error in CGI, etc.)</P ></LI ><LI ><P > <TT CLASS="literal" >503</TT > - "Service Unavailable" (host is down, connection timed out)</P ></LI ><LI ><P > <TT CLASS="literal" >504</TT > - "Gateway Timeout" (read timeout when retrieving the document)</P ></LI ></UL ><P > <A NAME="AEN830" ></A > <TT CLASS="literal" >HTTP 401</TT > means that this URL is password protected. 
You can use the <B CLASS="command" >AuthBasic</B > command in <TT CLASS="filename" >indexer.conf</TT > to set <TT CLASS="literal" >login:password</TT > for such URLs.</P ><P > <TT CLASS="literal" >HTTP 404</TT > means that one of your documents contains an incorrect reference (a reference to a resource that does not exist).</P ><P >See the <A HREF="http://www.w3.org/Protocols/" TARGET="_top" >HTTP-specific documentation</A > for further explanation of the different HTTP status codes.</P ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="general-linkval" >Link validation (SQL only) <A NAME="AEN843" ></A ></A ></H2 ><P >When started with the -I command line argument, indexer displays pairs of URLs and their referrers. This is very useful for finding bad links on your site. Don't forget to use the <B CLASS="command" >DeleteBad no</B > <TT CLASS="filename" >indexer.conf</TT > command in this mode. You may use the subsection control options -t, -u and -s in this mode. For example, <TT CLASS="literal" >indexer -I -s 404</TT > displays all 'Not found' URLs together with the referrers where links to those bad documents were found. By setting the relevant <TT CLASS="filename" >indexer.conf</TT > commands and command line options, you may use mnoGoSearch specifically for site validation.</P ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="general-parallel" >Parallel indexing (SQL only) <A NAME="AEN852" ></A ></A ></H2 ><P >MySQL and PostgreSQL users may run several indexers simultaneously with the same indexer.conf file. We have successfully tested 30 simultaneous indexers with a MySQL database. indexer uses the MySQL and PostgreSQL locking mechanisms to avoid double indexing of the same URL by different indexer copies. Parallel indexing in the same database is not yet implemented for other back-ends. You may, however, use the multi-threaded version of indexer with any SQL back-end that supports several simultaneous connections. 
The multi-threaded indexer version uses its own locking mechanism.</P ><P >Using the same database with different <TT CLASS="filename" >indexer.conf</TT > files is not recommended! One process could add something that another then deletes, and this may never stop.</P ><P >On the other hand, you may run several indexer processes against different databases with ANY supported SQL back-end.</P ></DIV ></DIV ></DIV ><DIV CLASS="NAVFOOTER" ><HR ALIGN="LEFT" WIDTH="100%"><TABLE SUMMARY="Footer navigation table" WIDTH="100%" BORDER="0" CELLPADDING="0" CELLSPACING="0" ><TR ><TD WIDTH="33%" ALIGN="left" VALIGN="top" ><A HREF="msearch-register.html" ACCESSKEY="P" >Prev</A ></TD ><TD WIDTH="34%" ALIGN="center" VALIGN="top" ><A HREF="index.html" ACCESSKEY="H" >Home</A ></TD ><TD WIDTH="33%" ALIGN="right" VALIGN="top" ><A HREF="msearch-http-codes.html" ACCESSKEY="N" >Next</A ></TD ></TR ><TR ><TD WIDTH="33%" ALIGN="left" VALIGN="top" >Installation registration</TD ><TD WIDTH="34%" ALIGN="center" VALIGN="top" > </TD ><TD WIDTH="33%" ALIGN="right" VALIGN="top" >Supported HTTP response codes</TD ></TR ></TABLE ></DIV ></BODY ></HTML >