<HTML ><HEAD ><TITLE >Extended indexing features</TITLE ><META NAME="GENERATOR" CONTENT="Modular DocBook HTML Stylesheet Version 1.73 "><LINK REL="HOME" TITLE="mnoGoSearch 3.2 reference manual" HREF="index.html"><LINK REL="UP" TITLE="Indexing" HREF="msearch-indexing.html"><LINK REL="PREVIOUS" TITLE="indexer configuration" HREF="msearch-indexer-configuration.html"><LINK REL="NEXT" TITLE="Using syslog " HREF="msearch-syslog.html"><LINK REL="STYLESHEET" TYPE="text/css" HREF="mnogo.css"><META NAME="Description" CONTENT="mnoGoSearch - Full Featured Web site Open Source Search Engine Software over the Internet and Intranet Web Sites Based on SQL Database. It is a Free search software covered by GNU license."><META NAME="Keywords" CONTENT="shareware, freeware, download, internet, unix, utilities, search engine, text retrieval, knowledge retrieval, text search, information retrieval, database search, mining, intranet, webserver, index, spider, filesearch, meta, free, open source, full-text, udmsearch, website, find, opensource, search, searching, software, udmsearch, engine, indexing, system, web, ftp, http, cgi, php, SQL, MySQL, database, php3, FreeBSD, Linux, Unix, mnoGoSearch, MacOS X, Mac OS X, Windows, 2000, NT, 95, 98, GNU, GPL, url, grabbing"></HEAD ><BODY CLASS="sect1" BGCOLOR="#EEEEEE" TEXT="#000000" LINK="#000080" VLINK="#800080" ALINK="#FF0000" ><DIV CLASS="NAVHEADER" ><TABLE SUMMARY="Header navigation table" WIDTH="100%" BORDER="0" CELLPADDING="0" CELLSPACING="0" ><TR ><TH COLSPAN="3" ALIGN="center" >mnoGoSearch 3.2 reference manual: Full-featured search engine software</TH ></TR ><TR ><TD WIDTH="10%" ALIGN="left" VALIGN="bottom" ><A HREF="msearch-indexer-configuration.html" ACCESSKEY="P" >Prev</A ></TD ><TD WIDTH="80%" ALIGN="center" VALIGN="bottom" >Chapter 3. 
Indexing</TD ><TD WIDTH="10%" ALIGN="right" VALIGN="bottom" ><A HREF="msearch-syslog.html" ACCESSKEY="N" >Next</A ></TD ></TR ></TABLE ><HR ALIGN="LEFT" WIDTH="100%"></DIV ><DIV CLASS="sect1" ><H1 CLASS="sect1" ><A NAME="extended-indexing" >Extended indexing features</A ></H1 ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="news" >News extensions <A NAME="AEN1425" ></A ></A ></H2 ><P >By Heiko Stoermer <TT CLASS="email" >&lt;<A HREF="mailto:heiko.stoermer@innominate.de" >heiko.stoermer@innominate.de</A >&gt;</TT > </P ><P >mnoGoSearch comes with an integrated extension for archiving news servers (currently MySQL only; see <A HREF="msearch-extended-indexing.html#news-restr" >the Section called <I >Restrictions</I ></A >). This means that you can download all messages from a news server and save them completely in a database. </P ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="news-benefits" >Benefits</A ></H3 ><P ></P ><UL ><LI ><P >you can expire the messages on the news server to keep it slim and fast </P ></LI ><LI ><P >you can search the complete message base with all the features that regular mnoGoSearch offers </P ></LI ><LI ><P >you can still browse discussion threads over the complete archive </P ></LI ></UL ></DIV ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="news-restr" >Restrictions</A ></H3 ><P ></P ><UL ><LI ><P >currently MySQL only (I would have really liked to do this for PostgreSQL, but some really annoying restrictions concerning query size and field size in PostgreSQL finally made me switch to MySQL) </P ></LI ><LI ><P >Perl front-end only </P ></LI ><LI ><P >single dict only (because the mysql-perl front-end does not support multi-dict) </P ></LI ></UL ></DIV ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="news-todo" >To be implemented</A ></H3 ><P >No new features are planned for this extension. It works the way it is (at least as far as I can see) and does everything I wanted it to do. 
What I will do is make the code a bit more portable to other databases and fix the few very tiny bugs in the front-end. Of course, newly discovered bugs will be fixed. I'm maintaining it as well as I can. </P ></DIV ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="news-perf" >Performance</A ></H3 ><P >Of course, the important questions always are: how fast, how big, how long? </P ><P ></P ><UL ><LI ><P >Our local intranet installation of mnoGoSearch reports the following: <PRE CLASS="programlisting" >
          mnoGoSearch statistics

    Status    Expired      Total
   -----------------------------
       200      76132      76132 OK
       404        119        119 Not found
       503         17         17 Service Unavailable
       504        802        802 Gateway Timeout
   -----------------------------
     Total      77070      77070
</PRE > </P ><P >which means that roughly 77,000 messages are archived in the database.</P ></LI ><LI ><P >The current database size is 423 megabytes. </P ></LI ><LI ><P >The dict table has 6,076,462 entries. </P ></LI ><LI ><P >It runs on an AMD K6 400 with 64 MB of RAM (a very tiny thing). </P ></LI ><LI ><P > <SPAN CLASS="emphasis" ><I CLASS="emphasis" >Typical queries take between 2 and 10 seconds. </I ></SPAN > </P ></LI ></UL ></DIV ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="news-install" >Installation</A ></H3 ><P ></P ><OL TYPE="1" ><LI ><P >Compile:</P ><P >Unpack the mnoGoSearch distribution archive. Run the configure script with the <TT CLASS="literal" >--with-mysql</TT > option, then run <TT CLASS="literal" >make</TT > and <TT CLASS="literal" >make install</TT > as described in the regular installation instructions. </P ></LI ><LI ><P >Create Database:</P ><P >The news extension uses a slightly different database layout. 
The table creation files can be found in <TT CLASS="filename" >frontends/mysql-perl-news/create/</TT >. (Of course, you have to run <TT CLASS="literal" >mysqladmin create mnoGoSearch</TT > first and grant permissions to the account that the web front-end and indexer run as.) </P ></LI ><LI ><P >Install <TT CLASS="filename" >indexer.conf</TT >:</P ><P >An <TT CLASS="filename" >indexer.conf</TT > for incremental news archiving (messages hardly ever change) can be found in <TT CLASS="filename" >frontends/mysql-perl-news/etc/</TT > together with a sample cron shell script that can be run about once a day. See <TT CLASS="filename" >indexer.conf</TT > for a detailed description of the indexing process. </P ></LI ><LI ><P >Install the Perl front-end:</P ><P >Copy <TT CLASS="filename" >frontends/mysql-perl-news/*.pl</TT > and <TT CLASS="filename" >frontends/mysql-perl-news/*.htm*</TT > to your cgi-bin directory. </P ><P >Copy <TT CLASS="filename" >frontends/mysql-perl-news/*.pm</TT > to your site's Perl library directory (<TT CLASS="filename" >site_perl</TT > or similar) where the Perl scripts can find the modules. </P ><P >Edit <TT CLASS="filename" >search.htm</TT > and change the database login information it contains. The Perl front-end has additional features that allow you to browse message threads. </P ></LI ><LI ><P >Now you are set and can run indexer for the first time, following the instructions in <TT CLASS="filename" >indexer.conf</TT >.</P ></LI ></OL ><P >I hope this is a nice feature for you. If anyone is interested in porting this to other databases, multi-dict mode, or the PHP front-end, please do so! I would be pleased and will assist you. 
</P ></DIV ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="htdb" >Indexing SQL database tables (htdb: virtual URL scheme) <A NAME="AEN1506" ></A ></A ></H2 ><P >mnoGoSearch can index text fields of SQL database tables through the so-called htdb: virtual URL scheme.</P ><P >Using the htdb:/ virtual scheme you can build a full-text index of your SQL tables as well as index your database-driven WWW server.</P ><DIV CLASS="note" ><BLOCKQUOTE CLASS="note" ><P ><B >Note: </B >Currently mnoGoSearch can index only tables that are in the same database as the mnoGoSearch tables. MySQL users may specify the database in the query, though. The table you want to index must also have a PRIMARY key.</P ></BLOCKQUOTE ></DIV ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="htdb-indexer" >HTDB indexer.conf commands</A ></H3 ><P >Two <TT CLASS="filename" >indexer.conf</TT > commands implement HTDB: HTDBList and HTDBDoc. </P ><P > <B CLASS="command" >HTDBList <A NAME="AEN1519" ></A > </B > is an SQL query that generates the list of all URLs corresponding to records in the table, using the PRIMARY key field. You may use either absolute or relative URLs in the HTDBList command.</P ><P >For example: <PRE CLASS="programlisting" >
HTDBList SELECT concat('htdb:/',id) FROM messages
    or
HTDBList SELECT id FROM messages
</PRE > </P ><P > <B CLASS="command" >HTDBDoc <A NAME="AEN1526" ></A > </B > is a query that retrieves a single record from the database by its PRIMARY key value.</P ><P >The HTDBList SQL query is used for all URLs that end with a '/' character. For all other URLs the SQL query given in HTDBDoc is used.</P ><DIV CLASS="note" ><BLOCKQUOTE CLASS="note" ><P ><B >Note: </B >The HTDBDoc query must return a FULL HTTP response with headers. This lets you build a very flexible indexing system by returning different HTTP statuses from the query. 
Take a look at the HTTP response codes section of the documentation to understand how indexer behaves when it gets different HTTP statuses.</P ></BLOCKQUOTE ></DIV ><P >If the HTDBDoc query returns no result or returns several records, the HTDB retrieval system generates "HTTP 404 Not Found". This may happen at reindex time if a record has been deleted from your table since the last reindexing. You may use "DeleteBad yes" to delete such records from the mnoGoSearch tables as well.</P ><P >You may use several HTDBDoc/HTDBList commands in one <TT CLASS="filename" >indexer.conf</TT > with corresponding Server commands.</P ></DIV ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="htdb-var" >HTDB variables <A NAME="AEN1537" ></A ></A ></H3 ><P >You may use the PATH parts of the URL as parameters of both HTDBList and HTDBDoc SQL queries. The parts are referenced as $1, $2, ... $n, where the number is the position of the PATH part: <PRE CLASS="programlisting" >
htdb:/part1/part2/part3/part4/part5
       $1    $2    $3    $4    $5
</PRE > </P ><P >For example, suppose you have this <TT CLASS="filename" >indexer.conf</TT > command: <PRE CLASS="programlisting" >
HTDBList SELECT id FROM catalog WHERE category='$1'
</PRE > </P ><P >When the htdb:/cars/ URL is indexed, $1 will be replaced with 'cars': <PRE CLASS="programlisting" >
SELECT id FROM catalog WHERE category='cars'
</PRE > </P ><P >You may use long URLs to provide several parameters to both HTDBList and HTDBDoc queries. For example, <TT CLASS="literal" >htdb:/path1/path2/path3/path4/id</TT > with the query: <PRE CLASS="programlisting" >
HTDBList SELECT id FROM table WHERE field1='$1' AND field2='$2' AND field3='$3'
</PRE > </P ><P >This query will generate the following URLs: <PRE CLASS="programlisting" >
htdb:/path1/path2/path3/path4/id1
...
htdb:/path1/path2/path3/path4/idN
</PRE > </P ><P >for all values of the "id" field returned by the HTDBList query.</P ></DIV ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="htdb-fulltext" >Creating full text index</A ></H3 ><P >Using the htdb:/ scheme you can create a full-text index and use it in your application. Let's imagine you have a big SQL table which stores, for example, web board messages in plain text format, and you want to build an application with a message search facility. Let's say the messages are stored in the "messages" table with two fields, "id" and "msg": "id" is an integer primary key and "msg" is a big text field containing the messages themselves. An ordinary SQL LIKE search may take a long time to answer: <PRE CLASS="programlisting" >
SELECT id, msg FROM messages WHERE msg LIKE '%someword%'
</PRE > </P ><P >With the mnoGoSearch htdb: scheme you can create a full-text index on the "messages" table. Install mnoGoSearch in the usual way, then edit your <TT CLASS="filename" >indexer.conf</TT >: <PRE CLASS="programlisting" >
DBAddr mysql://foo:bar@localhost/database/
DBMode single

HTDBList SELECT id FROM messages

HTDBDoc SELECT concat(\
 'HTTP/1.0 200 OK\\r\\n',\
 'Content-type: text/plain\\r\\n',\
 '\\r\\n',\
 msg) \
 FROM messages WHERE id='$1'

Server htdb:/
</PRE > </P ><P >When it starts, indexer inserts the 'htdb:/' URL into the database and runs the SQL query given in HTDBList. The query produces the values 1, 2, 3, ..., N. These values are treated as links relative to the 'htdb:/' URL, so a list of new URLs of the form htdb:/1, htdb:/2, ... , htdb:/N is added to the database. The HTDBDoc SQL query is then executed for each new URL and produces an HTTP document for each of them in the form: <PRE CLASS="programlisting" >
HTTP/1.0 200 OK
Content-Type: text/plain

&lt;some text from 'msg' field here&gt;
</PRE > </P ><P >This document is used to create the full-text index from the words in the 'msg' field. 
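</P ><P >The sequence above can be sketched in a few lines of Python. This is an illustration only, not mnoGoSearch code: the helper names <TT CLASS="literal" >path_params</TT >, <TT CLASS="literal" >expand_query</TT > and <TT CLASS="literal" >http_document</TT > are hypothetical, the query text is simplified, and replacing a missing parameter with an empty string is an assumption rather than documented indexer behavior. <PRE CLASS="programlisting" >
```python
import re

HTDB_PREFIX = "htdb:/"


def path_params(url: str) -> list:
    """Split an htdb:/ URL into its PATH parts ($1, $2, ...)."""
    if not url.startswith(HTDB_PREFIX):
        raise ValueError(f"not an htdb:/ URL: {url!r}")
    return [part for part in url[len(HTDB_PREFIX):].split("/") if part]


def expand_query(template: str, url: str) -> str:
    """Replace $1..$n in a query template with the URL's PATH parts."""
    params = path_params(url)

    def substitute(match):
        index = int(match.group(1)) - 1  # $1 is the first PATH part
        # Assumption: a parameter with no matching PATH part becomes "".
        return params[index] if index in range(len(params)) else ""

    # Plain text substitution, for illustration only.
    return re.sub(r"\$(\d+)", substitute, template)


def http_document(body: str) -> str:
    """Build the kind of full HTTP response an HTDBDoc query must return."""
    return "HTTP/1.0 200 OK\r\nContent-Type: text/plain\r\n\r\n" + body


doc_query = "SELECT msg FROM messages WHERE id='$1'"
for key in (1, 2, 3):  # values an HTDBList query might return
    url = HTDB_PREFIX + str(key)
    print(url, "=>", expand_query(doc_query, url))
```
</PRE > </P ><P >The sketch pastes URL parts into the query text as-is, which is tolerable here only because the values come from the HTDBList output rather than from user input. </P ><P >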
Words are stored in the 'dict' table, assuming the 'single' storage mode is used.</P ><P >After indexing you can use the mnoGoSearch tables to perform a search: <PRE CLASS="programlisting" >
SELECT url.url
FROM url,dict
WHERE dict.url_id=url.rec_id
AND dict.word='someword';
</PRE > </P ><P >Because the mnoGoSearch 'dict' table has an index on the 'word' field, this query executes much faster than queries that use an SQL LIKE search on the 'messages' table.</P ><P >You can also search for several words: <PRE CLASS="programlisting" >
SELECT url.url, count(*) as c
FROM url,dict
WHERE dict.url_id=url.rec_id
AND dict.word IN ('some','word')
GROUP BY url.url
ORDER BY c DESC;
</PRE > </P ><P >Both queries return 'htdb:/XXX' values in the url.url field. Your application then has to cut the leading 'htdb:/' from those values to get the PRIMARY key values of your 'messages' table.</P ></DIV ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="htdb-web" >Indexing SQL database driven web server</A ></H3 ><P >You can also use the htdb:/ scheme to index your database-driven WWW server. It lets you create indexes without invoking your web server while indexing, so it is much faster and requires less CPU resources than indexing directly from the WWW server. </P ><P >The main idea of indexing a database-driven web server is to build the full-text index in the usual way. The only difference is that search must produce real URLs instead of URLs in 'htdb:/...' form. 
This can be achieved using the mnoGoSearch aliasing tools.</P ><P >Take a look at the sample <TT CLASS="filename" >indexer.conf</TT > in <TT CLASS="filename" >doc/samples/htdb.conf</TT >. It is the <TT CLASS="filename" >indexer.conf</TT > used to index <A HREF="http://mnogosearch.org/" TARGET="_top" >our webboard</A >.</P ><P >The HTDBList command generates URLs in the form: <PRE CLASS="programlisting" >
http://search.mnogo.ru/board/message.php?id=XXX
</PRE > </P ><P >where XXX is a primary key value of the "messages" table.</P ><P >For each primary key value the HTDBDoc command generates a text/html document with HTTP headers and content like this: <PRE CLASS="programlisting" >
&lt;HTML&gt;
&lt;HEAD&gt;
&lt;TITLE&gt; ... subject field here .... &lt;/TITLE&gt;
&lt;META NAME="Description" Content=" ... author here ..."&gt;
&lt;/HEAD&gt;
&lt;BODY&gt;
... message text here ...
&lt;/BODY&gt;
</PRE > </P ><P >At the end of <TT CLASS="filename" >doc/samples/htdb.conf</TT > we wrote three commands: <PRE CLASS="programlisting" >
Server htdb:/
Realm  http://search.mnogo.ru/board/message.php?id=*
Alias  http://search.mnogo.ru/board/message.php?id=  htdb:/
</PRE > </P ><P >The first command tells indexer to execute the HTDBList query, which generates a list of messages in the form: <PRE CLASS="programlisting" >
http://search.mnogo.ru/board/message.php?id=XXX
</PRE > </P ><P >The second command allows indexer to accept such message URLs, using a string match with the '*' wildcard at the end.</P ><P >The third command replaces the "http://search.mnogo.ru/board/message.php?id=" substring in the URL with "htdb:/" when indexer retrieves the message documents. 
It means that "http://search.mnogo.ru/board/message.php?id=xxx" URLs will be shown in search results, but the "htdb:/xxx" URL will be retrieved and indexed instead, where xxx is the PRIMARY key value, the ID of the record in the "messages" table.</P ></DIV ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="exec" >Indexing binaries output (exec: and cgi: virtual URL schemes) <A NAME="AEN1592" ></A ></A ></H2 ><P >mnoGoSearch supports the exec: and cgi: virtual URL schemes. They allow running an external program. The program must write its result to stdout. The result must follow the HTTP standard, i.e. an HTTP response header followed by the document's content.</P ><P >For example, when indexing either <TT CLASS="literal" >cgi:/usr/local/bin/myprog</TT > or <TT CLASS="literal" >exec:/usr/local/bin/myprog</TT >, indexer will execute the <TT CLASS="filename" >/usr/local/bin/myprog</TT > program.</P ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="exec-cgi" >Passing parameters to cgi: virtual scheme</A ></H3 ><P >When executing a program given in the cgi: virtual scheme, indexer emulates the environment of a program running under an HTTP server. It creates the REQUEST_METHOD environment variable with the value "GET" and the QUERY_STRING variable according to HTTP standards. For example, if <TT CLASS="literal" >cgi:/usr/local/apache/cgi-bin/test-cgi?a=b&d=e</TT > is being indexed, indexer creates QUERY_STRING with the value <TT CLASS="literal" >a=b&d=e</TT >. The cgi: virtual URL scheme allows indexing your site without invoking a web server, even if you want to index CGI scripts. For example, suppose you have a web site with static documents under <TT CLASS="filename" >/usr/local/apache/htdocs/</TT > and CGI scripts under <TT CLASS="filename" >/usr/local/apache/cgi-bin/</TT >. 
Use the following configuration: <PRE CLASS="programlisting" >
Server http://localhost/
Alias  http://localhost/cgi-bin/ cgi:/usr/local/apache/cgi-bin/
Alias  http://localhost/         file:/usr/local/apache/htdocs/
</PRE > </P ></DIV ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="exec-exec" >Passing parameters to exec: virtual scheme</A ></H3 ><P >For the exec: scheme indexer does not create a QUERY_STRING variable as it does for the cgi: scheme. Instead it builds a command line with the part of the URL after the ? sign as the argument. For example, when indexing <TT CLASS="literal" >exec:/usr/local/bin/myprog?a=b&d=e</TT >, this command will be executed: <PRE CLASS="programlisting" >
/usr/local/bin/myprog "a=b&d=e"
</PRE > </P ></DIV ><DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="exec-ext" >Using exec: virtual scheme as an external retrieval system</A ></H3 ><P >The exec: virtual scheme can be used as an external retrieval system. This allows using protocols that mnoGoSearch does not support natively. For example, you can use the curl program, available from <A HREF="http://curl.haxx.se/" TARGET="_top" >http://curl.haxx.se/</A >, to index HTTPS sites.</P ><P >Put this short script in <TT CLASS="literal" >/usr/local/mnogosearch/etc/</TT > under the name <TT CLASS="filename" >curl.sh</TT >. <PRE CLASS="programlisting" >
#!/bin/sh
/usr/local/bin/curl -i "$1" 2>/dev/null
</PRE > </P ><P >This script takes a URL given as a command-line argument and runs the curl program to download it. 
The -i argument tells curl to include the HTTP headers in its output.</P ><P >Now use these commands in your <TT CLASS="filename" >indexer.conf</TT >: <PRE CLASS="programlisting" >
Server https://some.https.site/
Alias  https:// exec:/usr/local/mnogosearch/etc/curl.sh?https://
</PRE > </P ><P >When indexing <TT CLASS="filename" >https://some.https.site/path/to/page.html</TT >, indexer will translate this URL to <PRE CLASS="programlisting" >
exec:/usr/local/mnogosearch/etc/curl.sh?https://some.https.site/path/to/page.html
</PRE > </P ><P >execute the <TT CLASS="filename" >curl.sh</TT > script: <PRE CLASS="programlisting" >
/usr/local/mnogosearch/etc/curl.sh "https://some.https.site/path/to/page.html"
</PRE > </P ><P >and take its output.</P ></DIV ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="mirror" >Mirroring <A NAME="AEN1633" ></A ></A ></H2 ><P > <A NAME="AEN1636" ></A > You may specify the path to a root directory to enable site mirroring: <PRE CLASS="programlisting" >
MirrorRoot /path/to/mirror
</PRE > </P ><P > <A NAME="AEN1641" ></A > You may also specify a root directory for the headers of mirrored documents; indexer will then store the HTTP headers on local disk too. <PRE CLASS="programlisting" >
MirrorHeadersRoot /path/to/headers
</PRE > </P ><P > <A NAME="AEN1646" ></A > You may specify a period during which previously mirrored files will be used while indexing instead of actually downloading the documents. <PRE CLASS="programlisting" >
MirrorPeriod &lt;time&gt;
</PRE > </P ><P >This is very useful when you experiment with mnoGoSearch by indexing the same hosts repeatedly and do not want much traffic to and from the Internet. If MirrorHeadersRoot is not specified and headers are not stored on local disk, the default Content-Types given in AddType commands will be used. 
The default value of MirrorPeriod is -1, which means <TT CLASS="literal" >do not use mirrored files</TT >.</P ><P >&lt;time&gt; has the form <TT CLASS="literal" >xxxA[yyyB[zzzC]]</TT > (spaces are allowed between xxx and A, between A and yyy, and so on), where xxx, yyy, zzz are numbers (which can be negative!) and A, B, C can each be one of the following: <PRE CLASS="programlisting" >
s - second
M - minute
h - hour
d - day
m - month
y - year
</PRE > </P ><P >(these letters are the same as in strptime/strftime functions)</P ><P >Examples: <PRE CLASS="programlisting" >
15s       - 15 seconds
4h30M     - 4 hours and 30 minutes
1y6m-15d  - 1 year and 6 months minus 15 days
1h-10M+1s - 1 hour minus 10 minutes plus 1 second
</PRE > </P ><P >If you specify only a number without any letter, the time is assumed to be given in seconds (this behavior is for compatibility with versions prior to 3.1.7).</P ><P >The following command forces the use of local copies for one day: <PRE CLASS="programlisting" >
MirrorPeriod 1d
</PRE > </P ><P >If your pages are already indexed, then when you re-index with -a indexer will check the headers and download only files that have been modified since the last indexing. Pages that have not been modified will therefore not be downloaded, and so will not be mirrored either. To create the mirror you need to either (a) start again with a clean database or (b) use the -m switch. </P ><P >You can actually use the created files as a full-featured mirror of your site. However, be careful: indexer will not download a document that is larger than MaxDocSize. If a document is larger, it will be only partially downloaded. 
If your site has no large documents, everything will be fine.</P ></DIV ></DIV ><DIV CLASS="NAVFOOTER" ><HR ALIGN="LEFT" WIDTH="100%"><TABLE SUMMARY="Footer navigation table" WIDTH="100%" BORDER="0" CELLPADDING="0" CELLSPACING="0" ><TR ><TD WIDTH="33%" ALIGN="left" VALIGN="top" ><A HREF="msearch-indexer-configuration.html" ACCESSKEY="P" >Prev</A ></TD ><TD WIDTH="34%" ALIGN="center" VALIGN="top" ><A HREF="index.html" ACCESSKEY="H" >Home</A ></TD ><TD WIDTH="33%" ALIGN="right" VALIGN="top" ><A HREF="msearch-syslog.html" ACCESSKEY="N" >Next</A ></TD ></TR ><TR ><TD WIDTH="33%" ALIGN="left" VALIGN="top" >indexer configuration</TD ><TD WIDTH="34%" ALIGN="center" VALIGN="top" ><A HREF="msearch-indexing.html" ACCESSKEY="U" >Up</A ></TD ><TD WIDTH="33%" ALIGN="right" VALIGN="top" >Using syslog <A NAME="AEN1665" ></A ></TD ></TR ></TABLE ></DIV ></BODY ></HTML >