<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> <HTML ><HEAD ><TITLE >indexer configuration</TITLE ><META NAME="GENERATOR" CONTENT="Modular DocBook HTML Stylesheet Version 1.79"><LINK REL="HOME" TITLE="mnoGoSearch 3.3.9 reference manual" HREF="index.html"><LINK REL="UP" TITLE="Indexing" HREF="msearch-indexing.html"><LINK REL="PREVIOUS" TITLE="Content-Encoding support " HREF="msearch-content-enc.html"><LINK REL="NEXT" TITLE="Using syslog " HREF="msearch-syslog.html"><LINK REL="STYLESHEET" TYPE="text/css" HREF="mnogo.css"><META NAME="Description" CONTENT="mnoGoSearch - Full Featured Web site Open Source Search Engine Software over the Internet and Intranet Web Sites Based on SQL Database. It is a Free search software covered by GNU license."><META NAME="Keywords" CONTENT="shareware, freeware, download, internet, unix, utilities, search engine, text retrieval, knowledge retrieval, text search, information retrieval, database search, mining, intranet, webserver, index, spider, filesearch, meta, free, open source, full-text, udmsearch, website, find, opensource, search, searching, software, udmsearch, engine, indexing, system, web, ftp, http, cgi, php, SQL, MySQL, database, php3, FreeBSD, Linux, Unix, mnoGoSearch, MacOS X, Mac OS X, Windows, 2000, NT, 95, 98, GNU, GPL, url, grabbing"></HEAD ><BODY CLASS="sect1" BGCOLOR="#EEEEEE" TEXT="#000000" LINK="#000080" VLINK="#800080" ALINK="#FF0000" ><!--#include virtual="body-before.html"--><DIV CLASS="NAVHEADER" ><TABLE SUMMARY="Header navigation table" WIDTH="100%" BORDER="0" CELLPADDING="0" CELLSPACING="0" ><TR ><TH COLSPAN="3" ALIGN="center" ><SPAN CLASS="application" >mnoGoSearch</SPAN > 3.3.9 reference manual: Full-featured search engine software</TH ></TR ><TR ><TD WIDTH="10%" ALIGN="left" VALIGN="bottom" ><A HREF="msearch-content-enc.html" ACCESSKEY="P" >Prev</A ></TD ><TD WIDTH="80%" ALIGN="center" VALIGN="bottom" >Chapter 3. Indexing</TD ><TD WIDTH="10%" ALIGN="right" VALIGN="bottom" ><A HREF="msearch-syslog.html" ACCESSKEY="N" >Next</A ></TD ></TR ></TABLE ><HR ALIGN="LEFT" WIDTH="100%"></DIV ><DIV CLASS="sect1" ><H1 CLASS="sect1" ><A NAME="indexer-configuration" >indexer configuration</A ></H1 ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="follow" >Specifying the Web space for indexing</A ></H2 ><P > When <SPAN CLASS="application" >indexer</SPAN > found a new URL during crawling, <SPAN CLASS="application" >indexer</SPAN > checks whether the URL has a corresponding Web space definition command <B CLASS="command" ><A HREF="msearch-cmdref-server.html" >Server</A ></B >, <B CLASS="command" ><A HREF="msearch-cmdref-realm.html" >Realm</A ></B > or <B CLASS="command" ><A HREF="msearch-cmdref-subnet.html" >Subnet</A ></B > given in <TT CLASS="filename" >indexer.conf</TT >. URLs which do not have a corresponding Web space definition command are not added into the database. Also, the URLs which already present in the search database and appear not to have corresponding Web space definition commands are deleted from the database. This can happen after removing of some of the definition commands from <TT CLASS="filename" >indexer.conf</TT >. </P ><P >The Web definiton commands have the following format: <PRE CLASS="programlisting" > Server [Method] [SubSection] <pattern> [alias] Realm [Method] [CaseType] [MatchType] [CmpType] <pattern> [alias] Subnet [Method] [MatchType] <pattern> </PRE > </P ><P >The mandatory parameter <CODE CLASS="option" >pattern</CODE > specifies an URL, or its part, or a pattern. </P ><P >The optional parameter <CODE CLASS="option" >method</CODE > specifies the action for this command. It can take one of the following values: <TT CLASS="literal" >Allow</TT >, <TT CLASS="literal" >Disallow</TT >, <TT CLASS="literal" >HrefOnly</TT >, <TT CLASS="literal" >CheckOnly</TT >, <TT CLASS="literal" >Skip</TT >, <TT CLASS="literal" >CheckMP3</TT >, <TT CLASS="literal" >CheckMP3Only</TT >. By default, the value <TT CLASS="literal" >Allow</TT > is used. <P ></P ><OL TYPE="1" ><LI ><P > <B CLASS="command" >Allow</B > <A NAME="AEN1493" ></A > </P ><P > <TT CLASS="literal" >Allow</TT > specifies that all corresponding documents will be indexed and scanned for new links. Depending on <TT CLASS="literal" >Content-Type</TT >, an external parser can be executed if needed. </P ></LI ><LI ><P > <B CLASS="command" >Disallow</B > <A NAME="AEN1503" ></A > </P ><P > <TT CLASS="literal" >Disallow</TT > specifies that all corresponding documents will be ignored and deleted from the database. </P ></LI ><LI ><P > <B CLASS="command" >HrefOnly</B > <A NAME="AEN1512" ></A > </P ><P > <TT CLASS="literal" >HrefOnly</TT > specifies that all corresponding documents will only be scanned for new links, but their content won't be indexed. This is useful, for example, when indexing mail archives, when the index pages are only scanned for links to new messages. <PRE CLASS="programlisting" > Server HrefOnly Page http://www.mail-archive.com/general%40mnogosearch.org/ Server Allow Path http://www.mail-archive.com/general%40mnogosearch.org/ </PRE > </P ></LI ><LI ><P > <B CLASS="command" >CheckOnly</B > <A NAME="AEN1522" ></A > </P ><P > <TT CLASS="literal" >CheckOnly</TT > specifies that all corresponding documents will be requested using the <TT CLASS="literal" >HEAD</TT > HTTP method instead of the default <TT CLASS="literal" >GET</TT > method. When using <TT CLASS="literal" >CheckOnly</TT >, only brief information about the documents (such as size, last modification time, content type) will be fetched. This method can be helpful to detect broken links on your site. For example: <PRE CLASS="programlisting" > Server HrefOnly http://www.mnogosearch.org/ Realm CheckOnly * </PRE > </P ><P >These commands instruct <B CLASS="command" >indexer</B > to scan all documents on the site <TT CLASS="literal" >www.mnogosearch.org</TT > and collect all outgoing links. Brief info about every document outside <TT CLASS="literal" >www.mnogosearch.org</TT > will be requested using the <TT CLASS="literal" >HEAD</TT > method. After indexing is done, use <B CLASS="command" >indexer -S</B > command to see if there are any pages with status <TT CLASS="literal" >404 Not found</TT >. </P ></LI ><LI ><P > <B CLASS="command" >Skip</B > <A NAME="AEN1542" ></A > </P ><P > <TT CLASS="literal" >Skip</TT > specifies that all corresponding documents will be skipped while indexing. This is useful when you need to disable temporarily reindexing of some sites, but to keep them available for search with their previous content. <SPAN CLASS="application" >indexer</SPAN > will mark these documents as "fresh" and put in the end of its queue. </P ></LI ><LI ><P > <B CLASS="command" >CheckMP3</B > <A NAME="AEN1552" ></A > </P ><P > <TT CLASS="literal" >CheckMP3</TT > specifies that the corresponding documents will be checked for MP3 tags even if the <TT CLASS="literal" >Content-Type</TT > is not equal to <TT CLASS="literal" >audio/mpeg</TT >. This is useful if the remote server sends <TT CLASS="literal" >application/octet-stream</TT > as <TT CLASS="literal" >Content-Type</TT > for MP3 files. In case when MP3 tags are found in some document, they will be indexed, otherwise the document will be further processed according to the <TT CLASS="literal" >Content-Type</TT >. </P ></LI ><LI ><P > <B CLASS="command" >CheckMP3Only</B > <A NAME="AEN1566" ></A > </P ><P > This method is very similar to <TT CLASS="literal" >CheckMP3</TT >, but in case when MP3 tags are not found in a document, the document is not further processed. </P ></LI ></OL > </P ><P > The optional <CODE CLASS="option" >SubSection</CODE > parameter specifies the pattern match method, which can be one of the following values: <TT CLASS="literal" >page</TT >, <TT CLASS="literal" >path</TT >, <TT CLASS="literal" >site</TT >, <TT CLASS="literal" >world</TT >, with <TT CLASS="literal" >path</TT > being the default. <P ></P ><OL TYPE="1" ><LI ><P ><TT CLASS="literal" >Server path</TT ></P ><P > All URLs from the same directory match. For example, if: <TT CLASS="literal" >Server path http://localhost/path/to/index.html</TT > is given, all URLs starting with <TT CLASS="literal" >http://localhost/path/to/</TT > will match this command. </P ><P >The following commands have the same effect when searching for a matching Web space definition command: </P ><P > <PRE CLASS="programlisting" > Server path http://localhost/path/to/index.html Server path http://localhost/path/to/index Server path http://localhost/path/to/index.cgi?q=bla Server path http://localhost/path/to/index?q=bla </PRE > </P ></LI ><LI ><P ><TT CLASS="literal" >Server site</TT ></P ><P > All URLs from the same host match. For example, <TT CLASS="literal" >Server site http://localhost/path/to/a.html</TT > will allow to index the entire site <TT CLASS="literal" >http://localhost/</TT >. </P ></LI ><LI ><P > <TT CLASS="literal" >Server world</TT ></P ><P >If <TT CLASS="literal" >world</TT > subsection is specified, then absolutely any URL will correspond to this definiton command. See the explanation below. </P ></LI ><LI ><P > <TT CLASS="literal" >Server page</TT ></P ><P >Means exact match, only the given URL will match this command. </P ></LI ><LI ><P >subsection in <TT CLASS="literal" >news://</TT > schema</P ><P >Subsection is always considered as <TT CLASS="literal" >site</TT > for the <TT CLASS="literal" >news://</TT > URL schema. This is because unlike <TT CLASS="literal" >ftp://</TT > or <TT CLASS="literal" >http://</TT >, the <TT CLASS="literal" >news://</TT > schema has no recursive paths. Use <TT CLASS="literal" >Server news://news.server.com/</TT > to index the whole news server or, for example, <TT CLASS="literal" >Server news://news.server.com/udm</TT > to index all messages from the <TT CLASS="literal" >/udm</TT > hierarchy.</P ></LI ></OL > </P ><P > The optional parameter <TT CLASS="literal" >CaseType</TT > specifies case sensitivity for string comparison, it can take one of the following values: <TT CLASS="literal" >case</TT > - case insensitive comparison, or <TT CLASS="literal" >nocase</TT > - case sensitive comparison. </P ><P >The optional parameter <TT CLASS="literal" >CmpType</TT > specifies comparison type and can take two values: <TT CLASS="literal" >Regex</TT > and <TT CLASS="literal" >String</TT >. <TT CLASS="literal" >String</TT > wildcards are the default match type. You can use ? and * signs in the <TT CLASS="literal" >patter</TT >, they mean "one character" and "any number of characters" respectively. For example, if you want to index all HTTP sites in the <TT CLASS="literal" >.ru</TT > domain, you can use this command: <PRE CLASS="programlisting" > Realm http://*.ru/* </PRE > </P ><P >The <TT CLASS="literal" >regex</TT > comparison type says that the pattern is a regular expression. For example, you can describe everything in the <TT CLASS="literal" >.ru</TT > domain using the <TT CLASS="literal" >regex</TT > comparison type: <PRE CLASS="programlisting" > Realm Regex ^http://.*\.ru/ </PRE > </P ><P >The optional parameter <TT CLASS="literal" >MatchType</TT > can be <TT CLASS="literal" >Match</TT > or <TT CLASS="literal" >NoMatch</TT >, with <TT CLASS="literal" >Match</TT > as default. <TT CLASS="literal" >Realm NoMatch</TT > has reverse effect. It means that URLs not matching the given <CODE CLASS="option" >pattern</CODE > will correspond to this <B CLASS="command" >Realm</B > command. For example, use this command to index everything but the <TT CLASS="literal" >.com</TT > domain: <PRE CLASS="programlisting" > Realm NoMatch http://*.com/* </PRE > </P ><P >The optional <CODE CLASS="option" >alias</CODE > argument provides URL rewrite rules, described in details in <A HREF="msearch-indexer-configuration.html#aliases" >the Section called <I >Aliases</I ></A >. </P ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="follow-difparam" >Using different parameters for a server and its subsections</A ></H3 ><P ><SPAN CLASS="application" >indexer</SPAN > examines the Web space definition command in order of their appearance in <TT CLASS="filename" >indexer.conf</TT >. Thus, if you want to give different parameters to a site and its subsections, you can add the command describing a subsection before the command describing the entire site. Imagine that you have a subdirectory which contains news articles and want those articles to be reindexed more often than the rest of the site. The following combination can be useful in this cases:</P ><P > <PRE CLASS="programlisting" > # Add subsection Period 200000 Server http://servername/news/ # Add server Period 600000 Server http://servername/ </PRE > </P ><P >These commands give different reindexing periods for the <TT CLASS="filename" >/news/</TT > subdirectory and the rest of the site. <SPAN CLASS="application" >indexer</SPAN > will choose the first command for the URL <TT CLASS="filename" >http://servername/news/page1.html</TT >. </P ></DIV ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="follow-default" >The default indexer behavior</A ></H3 ><P >The default behavior of indexer is to follow through the links found having correspondent Web space definition commands given in the <TT CLASS="filename" >indexer.conf</TT > file. <SPAN CLASS="application" >indexer</SPAN > jumps between sites if both of them have a corresponding Web definition command. For example, there are two commands:</P ><P > <PRE CLASS="programlisting" > Server http://www/ Server http://web/ </PRE > </P ><P >When indexing <TT CLASS="filename" >http://www/page1.html</TT > indexer WILL follow the link <TT CLASS="filename" >http://web/page2.html</TT >. Note that although these pages are on different sites, BOTH of them have a correspondent Web space definition command. </P ><P >If we delete one of the commands, <SPAN CLASS="application" >indexer</SPAN > will remove all expired URLs from this server during the next crawling sessions. </P ></DIV ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="aliases" >Aliases</A ></H2 ><P ><SPAN CLASS="application" >mnoGoSearch</SPAN > offers a flexible technique of aliases and reverse aliases, making it possible to index sites by downloading documents from another location. For example, if you index your local web server, it is possible to load pages directly from the hard disk without involving your web server in the crawling process. Another example is building of a search engine for the primary site using its mirror to download the documents. </P ><P >Different ways of using aliases are described in the next sections. </P ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="alias-conf" >The <B CLASS="command" ><A HREF="msearch-cmdref-alias.html" >Alias</A ></B > <TT CLASS="filename" >indexer.conf</TT > command <A NAME="AEN1679" ></A ></A ></H3 ><P >The <B CLASS="command" ><A HREF="msearch-cmdref-alias.html" >Alias</A ></B > <TT CLASS="filename" >indexer.conf</TT > command uses this format: <PRE CLASS="programlisting" > Alias <masterURL> <mirrorURL> </PRE > </P ><P >For example, if you wish to index <TT CLASS="literal" >http://www.mnogosearch.ru/</TT > using the nearest German mirror <TT CLASS="literal" >http://www.gstammw.de/mirrors/mnoGoSearch/</TT >, you can add these lines into your <TT CLASS="filename" >indexer.conf</TT >: <PRE CLASS="programlisting" > Server http://www.mnogosearch.ru/ Alias http://www.mnogosearch.ru/ http://www.gstammw.de/mirrors/mnoGoSearch/ </PRE > </P ><P > When crawling, <TT CLASS="filename" >indexer</TT > will download the documents from the mirror site <TT CLASS="literal" >http://www.gstammw.de/mirrors/mnoGoSearch/</TT >. At search time <SPAN CLASS="application" >search.cgi</SPAN > will display URLs from the master site <TT CLASS="literal" >http://www.mnogosearch.ru/</TT >. </P ><P >Another example: You want to index all sites from the domain <TT CLASS="literal" >udm.net</TT >. Suppose one of the servers (e.g. <TT CLASS="literal" >http://home.udm.net/</TT >) is stored on the local machine in the directory <TT CLASS="filename" >/home/httpd/htdocs/</TT >. These commands will be useful: <PRE CLASS="programlisting" > Realm http://*.udm.net/ Alias http://home.udm.net/ file:///home/httpd/htdocs/ </PRE > </P ><P ><TT CLASS="filename" >Indexer</TT > will load documents form the site <TT CLASS="literal" >home.udm.net</TT > using the local disk, and will use HTTP for the other sites. </P ></DIV ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="aliases-diff" >Using different aliases for server parts</A ></H3 ><P >Aliases are searched in the order of their appearance in <TT CLASS="filename" >indexer.conf</TT >. So, you can create different aliases for a server and its parts: <PRE CLASS="programlisting" > # First, create alias for example for /stat/ directory which # is not under common location: Alias http://home.udm.net/stat/ file:///usr/local/stat/htdocs/ # Then create alias for the rest of the server: Alias http://home.udm.net/ file:///usr/local/apache/htdocs/ </PRE > </P ><DIV CLASS="note" ><BLOCKQUOTE CLASS="note" ><P ><B >Note: </B >If you change the order of these commands, the alias for the directory <TT CLASS="filename" >/stat/</TT > will never be found.</P ></BLOCKQUOTE ></DIV ></DIV ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="alias-server" >Using aliases in the <B CLASS="command" ><A HREF="msearch-cmdref-server.html" >Server</A ></B > command</A ></H3 ><P >You can specify the location used by indexer as an optional argument in a <B CLASS="command" ><A HREF="msearch-cmdref-server.html" >Server</A ></B > command: <PRE CLASS="programlisting" > Server http://home.udm.net/ file:///home/httpd/htdocs/ </PRE > </P ></DIV ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="alias-realm" >Using aliases in the <B CLASS="command" ><A HREF="msearch-cmdref-realm.html" >Realm</A ></B > command</A ></H3 ><P >Aliases in the <B CLASS="command" ><A HREF="msearch-cmdref-realm.html" >Realm</A ></B > command are based on regular expressions. The implementation of this feature reminds PHP's <CODE CLASS="function" >preg_replace()</CODE > function. Aliases in the <B CLASS="command" ><A HREF="msearch-cmdref-realm.html" >Realm</A ></B > command work only if the <TT CLASS="literal" >regex</TT > match type is used, and do not work in case of the <TT CLASS="literal" >string</TT > match type. </P ><P >Use this syntax for Realm aliases: <PRE CLASS="programlisting" > Realm regex <URL_pattern> <alias_pattern> </PRE > </P ><P >When <TT CLASS="filename" >indexer</TT > finds a URL matching to <CODE CLASS="varname" >URL_pattern</CODE >, it builds an alias using <CODE CLASS="varname" >alias_pattern</CODE >. <CODE CLASS="varname" >alias_pattern</CODE > can contain references of the form $n, where n is a number in the range of 0-9. Every reference is replaced to text captured by the <CODE CLASS="varname" >n</CODE >-th parenthesized sub-pattern. <TT CLASS="literal" >$0</TT > refers to text matched by the whole pattern. Opening parentheses are counted from left to right (starting from 1) to obtain the number of the capturing sub-pattern. </P ><P >Example: your company hosts a few hundred users with their own domains in the form of <TT CLASS="literal" >www.username.yourname.com</TT >. All user sites are stored on the disk in the subdirectory <TT CLASS="filename" >/htdocs</TT > under their home directories: <TT CLASS="literal" >/home/username/htdocs/</TT >. </P ><P >You can write this command into <TT CLASS="filename" >indexer.conf</TT > (note that the dot '.' character has a special meaning in regular expressions and should be escaped with a '\' sign when dot is used in its literal meaning): <PRE CLASS="programlisting" > Realm regex (http://www\.)(.*)(\.yourname\.com/)(.*) file:///home/$2/htdocs/$4 </PRE > </P ><P >Imagine that <TT CLASS="filename" >indexer</TT > processes a document located at <TT CLASS="filename" >http://www.john.yourname.com/news/index.html</TT >. These patterns will be captured: </P ><P CLASS="literallayout" ><br> $0 = <TT CLASS="literal" >http://www.john.yourname.com/news/index.htm</TT > (the whole pattern match)<br> $1 = <TT CLASS="literal" >http://www.</TT > - subpattern matching <TT CLASS="literal" >(http://www\.)</TT ><br> $2 = <TT CLASS="literal" >john</TT > - subpattern matching <TT CLASS="literal" >(.*)</TT ><br> $3 = <TT CLASS="literal" >.yourname.com/</TT > - subpattern matching <TT CLASS="literal" >(\.yourname\.com/)</TT ><br> $4 = <TT CLASS="literal" >/news/index.html</TT > - subpattern matching <TT CLASS="literal" >(.*)</TT ><br> </P ><P >After the matches are found, the subpatterns <TT CLASS="literal" >$2</TT > and <TT CLASS="literal" >$4</TT > are substituted to <CODE CLASS="varname" >alias_pattern</CODE >, which will result into this alias: <PRE CLASS="programlisting" > file:///home/john/htdocs/news/index.html </PRE > </P ></DIV ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="alias-prog" >The <B CLASS="command" ><A HREF="msearch-cmdref-aliasprog.html" >AliasProg</A ></B > command <A NAME="AEN1771" ></A ></A ></H3 ><P ><B CLASS="command" ><A HREF="msearch-cmdref-aliasprog.html" >AliasProg</A ></B > can be useful for a web hosting company indexing its customer web sites by loading documents directly from the disk without having to involve the HTTP server into crawling process (to offload the server). Document layout can be very complex to describe it using the <B CLASS="command" ><A HREF="msearch-cmdref-server.html" >Server</A ></B > or <B CLASS="command" ><A HREF="msearch-cmdref-realm.html" >Realm</A ></B > commands. <B CLASS="command" ><A HREF="msearch-cmdref-aliasprog.html" >AliasProg</A ></B > defines an external program that can be executed with an URL in the command line argument and return the corresponding alias to <TT CLASS="filename" >STDOUT</TT >. Use <TT CLASS="literal" >$1</TT > to pass URLs to the command line. </P ><P > The command in this example uses the <SPAN CLASS="application" >replace</SPAN > program from <SPAN CLASS="application" >MySQL</SPAN > distribution and replaces URL substring <TT CLASS="literal" >http://www.apache.org/</TT > to <TT CLASS="literal" >file:///usr/local/apache/htdocs/</TT >: <PRE CLASS="programlisting" > AliasProg "echo $1 | /usr/local/mysql/bin/mysql/replace http://www.apache.org/ file:///usr/local/apache/htdocs/" </PRE > </P ><P >You can write your own complex program for converting URLs int their aliases using any preferred programming language.</P ></DIV ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="alias-reverse" >The <B CLASS="command" ><A HREF="msearch-cmdref-reversealias.html" >ReverseAlias</A ></B > command <A NAME="AEN1796" ></A ></A ></H3 ><P >The <B CLASS="command" ><A HREF="msearch-cmdref-reversealias.html" >ReverseAlias</A ></B > <TT CLASS="filename" >indexer.conf</TT > command allows mapping of URLs before a URL is inserted into the database. Unlike the <B CLASS="command" ><A HREF="msearch-cmdref-alias.html" >Alias</A ></B > command (which performs mapping right before a document is downloaded), the <B CLASS="command" ><A HREF="msearch-cmdref-reversealias.html" >ReverseAlias</A ></B > command performs mapping immediately after a new link is found. <PRE CLASS="programlisting" > ReverseAlias http://name2/ http://name2.yourname.com/ Server http://name2.yourname.com/ </PRE > </P ><P >In the above example, all links with the short server name will be converted to links with the full server and will be put into the database after converting. </P ><P >Another possible use of the <B CLASS="command" ><A HREF="msearch-cmdref-reversealias.html" >ReverseAlias</A ></B > is stripping off various undesired query string parameters like <TT CLASS="literal" >PHPSESSID=XXXX</TT >. </P ><P >The following example will strip off the <TT CLASS="literal" >PHPSESSID=XXXX</TT > part from the URLs like <TT CLASS="literal" >http://www/a.php?PHPSESSID=XXX</TT >, when there are no any other query string parameters other than <TT CLASS="literal" >PHPSESSID</TT >. The question mark is deleted as well: <PRE CLASS="programlisting" > ReverseAlias regex (http://[^?]*)[?]PHPSESSID=[^&]*$ $1$2 </PRE > </P ><P >Stripping the <TT CLASS="literal" >PHPSESSID=XXXX</TT > from the URL like <TT CLASS="literal" >w/a.php?PHPSESSID=xxx&..</TT >, that is when <TT CLASS="literal" >PHPSESSID=XXXX</TT > is the very first query string parameter followed by a number of other parameters. The ampersand sign <TT CLASS="literal" >&</TT > after the <TT CLASS="literal" >PHPSESSID=XXXX</TT > part is deleted as well. The question mark <TT CLASS="literal" >?</TT > is not deleted: <PRE CLASS="programlisting" > ReverseAlias regex (http://[^?]*[?])PHPSESSID=[^&]*&(.*) $1$2 </PRE > </P ><P >Stripping the <TT CLASS="literal" >PHPSESSID=XXXX</TT > part from the URLs like <TT CLASS="literal" >http://www/a.php?a=b&PHPSESSID=xxx</TT > or <TT CLASS="literal" >http://www/a.php?a=b&PHPSESSID=xxx&c=d</TT >, where <TT CLASS="literal" >PHPSESSID=XXXX</TT > is not the first parameter. The ampersand sign <TT CLASS="literal" >&</TT > before <TT CLASS="literal" >PHPSESSID=XXXX</TT > is deleted: <PRE CLASS="programlisting" > ReverseAlias regex (http://.*)&PHPSESSION=[^&]*(.*) $1$2 </PRE > </P ></DIV ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="alias-search" >Search-time aliases in <TT CLASS="filename" >search.htm</TT > <A NAME="AEN1837" ></A ></A ></H3 ><P >It is also possible to define aliases in the search template (<TT CLASS="filename" >search.htm</TT >). The <B CLASS="command" ><A HREF="msearch-cmdref-alias.html" >Alias</A ></B > command in <TT CLASS="filename" >search.htm</TT > is identical to the one in <TT CLASS="filename" >indexer.conf</TT >, but is applied at search time rather than during crawling.</P ><P >The syntax of the <B CLASS="command" ><A HREF="msearch-cmdref-alias.html" >Alias</A ></B > command in <TT CLASS="filename" >search.htm</TT > is similar to <TT CLASS="filename" >indexer.conf</TT >: <PRE CLASS="programlisting" > Alias <find-prefix> <replace-prefix> </PRE > </P ><P >Suppose your <TT CLASS="filename" >search.htm</TT > has the following command: <PRE CLASS="programlisting" > Alias http://localhost/ http://www.mnogo.ru/ </PRE > </P ><P >When <SPAN CLASS="application" >search.cgi</SPAN > returns a page with the URL <TT CLASS="literal" >http://localhost/news/article10.html</TT >, it will be replaced to <TT CLASS="literal" >http://www.mnogo.ru/news/article10.html</TT >. </P ><DIV CLASS="note" ><BLOCKQUOTE CLASS="note" ><P ><B >Note: </B >When you need aliases, you can put aliases either into <TT CLASS="filename" >indexer.conf</TT > (to convert the remote notation to the local notation during crawling time) or into <TT CLASS="filename" >search.htm</TT > (to convert the local notation to the remote notation during search time). Use the approach which looks more convenient for you. </P ></BLOCKQUOTE ></DIV ></DIV ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="srvtable" >ServerTable <A NAME="AEN1865" ></A ></A ></H2 ><P >The quick way to specify URLs to be indexed by <SPAN CLASS="application" >mnoGoSearch</SPAN > is just to specify them using the <B CLASS="command" ><A HREF="msearch-cmdref-server.html" >Server</A ></B > or <B CLASS="command" ><A HREF="msearch-cmdref-realm.html" >Realm</A ></B > directives in the <TT CLASS="filename" >indexer.conf</TT > file. However, in some cases users might already have URLs saved in a <ACRONYM CLASS="acronym" >SQL</ACRONYM > database, it would be much simpler to have <SPAN CLASS="application" >mnoGoSearch</SPAN > use this information. This can be done using the <B CLASS="command" ><A HREF="msearch-cmdref-servertable.html" >ServerTable</A ></B > command, which is available in <SPAN CLASS="application" >mnoGoSearch</SPAN > starting from the version 3.3.7. </P ><P >When <B CLASS="command" >ServerTable mysql://user:pass@host/dbname/my_server?srvinfo=my_srvinfo</B > is specified, <SPAN CLASS="application" >indexer</SPAN > loads server information from the given <ACRONYM CLASS="acronym" >SQL</ACRONYM > table <TT CLASS="literal" >my_server</TT > and loads the server parameters from the table <TT CLASS="literal" >my_srvinfo</TT >. </P ><P > The following sections provide step-by-step instructions how to create, populate and load Server tables. </P ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="srvtable-create" >Step 1: creating Server table</A ></H3 ><P > The tables <TT CLASS="literal" >server</TT > and <TT CLASS="literal" >srvinfo</TT > that are already present in <SPAN CLASS="application" >mnoGoSearch</SPAN > are used internally. One should not try to use these tables to insert your own URLs. Instead, you must create your own tables with similar structures. For example, with <SPAN CLASS="application" >MySQL</SPAN > you can do: <PRE CLASS="programlisting" > CREATE TABLE my_server LIKE server; CREATE TABLE my_srvinfo LIKE srvinfo; </PRE > </P ><DIV CLASS="note" ><BLOCKQUOTE CLASS="note" ><P ><B >Note: </B > You may find useful to do some modifications in the column types, for example, add <TT CLASS="literal" >AUTOINCREMENT</TT > flag to <TT CLASS="literal" >rec_id</TT >. However, don't change the column names - <SPAN CLASS="application" >mnoGoSearch</SPAN > looks up the columns by their names. </P ></BLOCKQUOTE ></DIV ></DIV ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="srvtable-insert" >Step 2: populating Server table</A ></H3 ><P >Now that you have your custom tables, you can load data: <PRE CLASS="programlisting" > INSERT INTO my_server (rec_id, enabled, command, url) VALUES (1, 1, 'S', 'http://server1/'); INSERT INTO my_srvinfo (srv_id, sname, sval) VALUES ('Period', '30d'); INSERT INTO my_server (rec_id, enabled, command, url) VALUES (2, 1, 'S', 'http://server2/'); INSERT INTO my_srvinfo (srv_id, sname, sval) VALUES ('MaxHops', '3'); </PRE > </P ><P > The columns <TT CLASS="literal" >rec_id</TT >, <TT CLASS="literal" >enabled</TT > and <TT CLASS="literal" >url</TT > must be specified in the <TT CLASS="literal" >INSERT INTO my_server</TT > statements. </P ><P > The columns <TT CLASS="literal" >parent</TT > and <TT CLASS="literal" >pop_weight</TT > <SPAN CLASS="emphasis" ><I CLASS="emphasis" >should NOT</I ></SPAN > be specified, as these columns used by <SPAN CLASS="application" >mnoGoSearch</SPAN > internally. </P ><P > The columns <TT CLASS="literal" >tag</TT >, <TT CLASS="literal" >category</TT >, <TT CLASS="literal" >ordre</TT >, <TT CLASS="literal" >weight</TT > can be specified optionally. </P ><P > <TT CLASS="literal" >my_srvinfo</TT > is a child table of <TT CLASS="literal" >my_server</TT >. These tables are joint using the condition <TT CLASS="literal" >my_server.rec_id = my_srvinfo.srv_id</TT >. </P ><P > <TT CLASS="literal" >sname</TT > in the table <TT CLASS="literal" >my_srvinfo</TT > is the name of a directive that might be specified for the particular URL in <TT CLASS="filename" >indexer.conf</TT >. For example, you might want to specify <TT CLASS="literal" >Period</TT > of "<TT CLASS="literal" >30d</TT >" for the respective URL, so you insert a record with <TT CLASS="literal" >sname="Period"</TT > and <TT CLASS="literal" >sval="30d"</TT >, or set <TT CLASS="literal" >MaxHops</TT > to "<TT CLASS="literal" >3</TT >", so you insert a record with <TT CLASS="literal" >sname="MaxHops"</TT > and <TT CLASS="literal" >sval="3"</TT >. </P ><P > The meaning of various columns is explained in <A HREF="msearch-dbschema.html" >the Section called <I >Database schema</I > in Chapter 12</A >. </P ><DIV CLASS="note" ><BLOCKQUOTE CLASS="note" ><P ><B >Note: </B > Look at the table <TT CLASS="literal" >srvinfo</TT > data to get examples about how it is used. </P ></BLOCKQUOTE ></DIV ></DIV ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="srvtable-load" >Step 3: loading Server table</A ></H3 ><P > Now that you have data in your custom Server tables, you need to specify the new tables in <TT CLASS="filename" >indexer.conf</TT >. Just add the following line: <PRE CLASS="programlisting" > ServerTable mysql://user:pass@host/dbname/my_server?srvinfo=my_srvinfo </PRE > </P ><DIV CLASS="note" ><BLOCKQUOTE CLASS="note" ><P ><B >Note: </B > If the <TT CLASS="literal" >srvinfo</TT > parameter is omitted, parameters are loaded from the table with name <TT CLASS="literal" >srvinfo</TT > by default. </P ></BLOCKQUOTE ></DIV ><P > A quick way to test if your Server table works fine is to insert one or two URLs into the <TT CLASS="literal" >my_server</TT > table that do not already exist in your <TT CLASS="filename" >indexer.conf</TT >, then run <SPAN CLASS="application" >indexer</SPAN > and specify that only the given URLs are to be indexed, e.g.: <PRE CLASS="programlisting" > ./indexer -a -u http://server1/ ./indexer -a -u http://server2/ </PRE > If it is working properly, you should see the test URLs being indexed. </P ></DIV ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="srvtable-notes" >Important notes on using Server table</A ></H3 ><P >1) You can create as many custom server/srvinfo tables as you like, and then specify each pair in the <TT CLASS="filename" >indexer.conf</TT > file using a different <B CLASS="command" ><A HREF="msearch-cmdref-servertable.html" >ServerTable</A ></B > directive with the appropriate values. </P ><P >2) Using your own Server table does not stop other URLs that are specified in your <TT CLASS="filename" >indexer.conf</TT > from being indexed. <SPAN CLASS="application" >indexer</SPAN > will do both. So you can define some non-changing URLs in the <TT CLASS="filename" >indexer.conf</TT > file, and put the URLs that tend to come and go into your custom Server table. You can also write some scripts that copy URLs from your own database into your custom Server table used by <SPAN CLASS="application" >mnoGoSearch</SPAN >. </P ></DIV ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="srvtable-structure" >Server table structure</A ></H3 ><P > See <A HREF="msearch-dbschema.html" >the Section called <I >Database schema</I > in Chapter 12</A > for the meaning of various columns in the Server tables. </P ></DIV ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="flushsrvtable" >FlushServerTable <A NAME="AEN1971" ></A ></A ></H2 ><P > Flush server sets active field to inactive for all ServerTable records. Use this command to deactivate all commands in ServerTable before loading new commands from <TT CLASS="filename" >indexer.conf</TT > or from other ServerTable. </P ></DIV ></DIV ><DIV CLASS="NAVFOOTER" ><HR ALIGN="LEFT" WIDTH="100%"><TABLE SUMMARY="Footer navigation table" WIDTH="100%" BORDER="0" CELLPADDING="0" CELLSPACING="0" ><TR ><TD WIDTH="33%" ALIGN="left" VALIGN="top" ><A HREF="msearch-content-enc.html" ACCESSKEY="P" >Prev</A ></TD ><TD WIDTH="34%" ALIGN="center" VALIGN="top" ><A HREF="index.html" ACCESSKEY="H" >Home</A ></TD ><TD WIDTH="33%" ALIGN="right" VALIGN="top" ><A HREF="msearch-syslog.html" ACCESSKEY="N" >Next</A ></TD ></TR ><TR ><TD WIDTH="33%" ALIGN="left" VALIGN="top" >Content-Encoding support <A NAME="AEN1434" ></A ></TD ><TD WIDTH="34%" ALIGN="center" VALIGN="top" ><A HREF="msearch-indexing.html" ACCESSKEY="U" >Up</A ></TD ><TD WIDTH="33%" ALIGN="right" VALIGN="top" >Using syslog <A NAME="AEN1978" ></A ></TD ></TR ></TABLE ></DIV ><!--#include virtual="body-after.html"--></BODY ></HTML >