<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> <HTML ><HEAD ><TITLE >Storing mnoGoSearch data </TITLE ><META NAME="GENERATOR" CONTENT="Modular DocBook HTML Stylesheet Version 1.79"><LINK REL="HOME" TITLE="mnoGoSearch 3.3.9 reference manual" HREF="index.html"><LINK REL="PREVIOUS" TITLE="External parsers" HREF="msearch-parsers.html"><LINK REL="NEXT" TITLE="Cache mode storage" HREF="msearch-cachemode.html"><LINK REL="STYLESHEET" TYPE="text/css" HREF="mnogo.css"><META NAME="Description" CONTENT="mnoGoSearch - Full Featured Web site Open Source Search Engine Software over the Internet and Intranet Web Sites Based on SQL Database. It is a Free search software covered by GNU license."><META NAME="Keywords" CONTENT="shareware, freeware, download, internet, unix, utilities, search engine, text retrieval, knowledge retrieval, text search, information retrieval, database search, mining, intranet, webserver, index, spider, filesearch, meta, free, open source, full-text, udmsearch, website, find, opensource, search, searching, software, udmsearch, engine, indexing, system, web, ftp, http, cgi, php, SQL, MySQL, database, php3, FreeBSD, Linux, Unix, mnoGoSearch, MacOS X, Mac OS X, Windows, 2000, NT, 95, 98, GNU, GPL, url, grabbing"></HEAD ><BODY CLASS="chapter" BGCOLOR="#EEEEEE" TEXT="#000000" LINK="#000080" VLINK="#800080" ALINK="#FF0000" ><!--#include virtual="body-before.html"--><DIV CLASS="NAVHEADER" ><TABLE SUMMARY="Header navigation table" WIDTH="100%" BORDER="0" CELLPADDING="0" CELLSPACING="0" ><TR ><TH COLSPAN="3" ALIGN="center" ><SPAN CLASS="application" >mnoGoSearch</SPAN > 3.3.9 reference manual: Full-featured search engine software</TH ></TR ><TR ><TD WIDTH="10%" ALIGN="left" VALIGN="bottom" ><A HREF="msearch-parsers.html" ACCESSKEY="P" >Prev</A ></TD ><TD WIDTH="80%" ALIGN="center" VALIGN="bottom" ></TD ><TD WIDTH="10%" ALIGN="right" VALIGN="bottom" ><A HREF="msearch-cachemode.html" ACCESSKEY="N" >Next</A ></TD ></TR ></TABLE ><HR ALIGN="LEFT" WIDTH="100%"></DIV ><DIV CLASS="chapter" ><H1 ><A NAME="howstore" ></A >Chapter 7. Storing <SPAN CLASS="application" >mnoGoSearch</SPAN > data </H1 ><DIV CLASS="TOC" ><DL ><DT ><B >Table of Contents</B ></DT ><DT ><A HREF="msearch-howstore.html#sql-stor" >Word modes with an <ACRONYM CLASS="acronym" >SQL</ACRONYM > database</A ></DT ><DT ><A HREF="msearch-cachemode.html" >Cache mode storage</A ></DT ><DT ><A HREF="msearch-perf.html" >mnoGoSearch performance issues <A NAME="AEN3527" ></A ></A ></DT ><DT ><A HREF="msearch-oracle.html" >Oracle notes <A NAME="AEN3556" ></A ></A ></DT ><DT ><A HREF="msearch-db2.html" >IBM DB2 notes <A NAME="AEN3637" ></A ></A ></DT ></DL ></DIV ><DIV CLASS="sect1" ><H1 CLASS="sect1" ><A NAME="sql-stor" >Word modes with an <ACRONYM CLASS="acronym" >SQL</ACRONYM > database</A ></H1 ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="sql-stor-modes" >Various modes used to store words</A ></H2 ><P > <SPAN CLASS="application" >mnoGoSearch</SPAN > can use a number of different formats (modes) to store word information in the database, suitable for different purposes. The available modes are: <TT CLASS="literal" >single</TT >, <TT CLASS="literal" >multi</TT > and <TT CLASS="literal" >blob</TT >. The default mode is <TT CLASS="literal" >blob</TT >. The mode can be selected using the <CODE CLASS="option" >DBMode</CODE > part of the <B CLASS="command" ><A HREF="msearch-cmdref-dbaddr.html" >DBAddr</A ></B > command in <TT CLASS="filename" >indexer.conf</TT > and <TT CLASS="filename" >search.htm</TT >. </P ><P > Examples: <PRE CLASS="programlisting" > DBAddr mysql://localhost/test/?DBMode=single DBAddr mysql://localhost/test/?DBMode=multi DBAddr mysql://localhost/test/?DBMode=blob </PRE > </P ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="sql-stor-single" >Storage mode - single <A NAME="AEN3264" ></A ></A ></H2 ><P > The <TT CLASS="literal" >single</TT > mode is suitable for a small site with the total number of documents up to <TT CLASS="literal" >5000</TT >. </P ><P > When the <TT CLASS="literal" >single</TT > mode is specified, all words are stored in a single table <TT CLASS="literal" >dict</TT > with three columns <TT CLASS="literal" >(url_id,word,coord)</TT >, where <CODE CLASS="varname" >url_id</CODE > is the <ACRONYM CLASS="acronym" >ID</ACRONYM > of the document which is referenced by <CODE CLASS="varname" >rec_id</CODE > field in the table <TT CLASS="literal" >url</TT >, and <CODE CLASS="varname" >coord</CODE > is a combination of the <TT CLASS="literal" >section ID</TT > and position of the words in the section. Word has the <TT CLASS="literal" >variable char(32)</TT > <ACRONYM CLASS="acronym" >SQL</ACRONYM > type. Every appearance of the same word in a document produces a separate record in the table. </P ><P > The advantage of the <TT CLASS="literal" >single</TT > mode is <SPAN CLASS="emphasis" ><I CLASS="emphasis" >live updates</I ></SPAN > support - a document updated by <SPAN CLASS="application" >indexer</SPAN > becomes immediately visible for searches with its new content. In other words <SPAN CLASS="emphasis" ><I CLASS="emphasis" >crawling</I ></SPAN > and <SPAN CLASS="emphasis" ><I CLASS="emphasis" >indexing</I ></SPAN > is done at the same time, for every document individually. </P ><P > Another advantage of the <TT CLASS="literal" >single</TT > mode is its simplicity and straightforward data format. You can use <SPAN CLASS="application" >mnoGoSearch</SPAN > as a fulltext solution for your database-driven Web application. For example, you may find useful to create a simple search page which will query the data collected by <SPAN CLASS="application" >indexer</SPAN > this way: <PRE CLASS="programlisting" > SELECT url.url, count(*) AS rank FROM dict, url WHERE url.rec_id=dict.url_id AND dict.word IN ('some','words') GROUP BY url.url ORDER BY rank DESC; </PRE > and display the results of this search query. </P ><DIV CLASS="note" ><BLOCKQUOTE CLASS="note" ><P ><B >Note: </B > The above query implements very simple ranking based on the count of the word hits. You can also integrate <SPAN CLASS="application" >mnoGoSearch</SPAN > with your own application using the <A HREF="msearch-cmdref-usercachequery.html" >UserCacheQuery</A > command, which supports full-featured ranking taking into account all factors described in <A HREF="msearch-rel.html#score-commands" >the Section called <I >Commands affecting document score <A NAME="AEN5884" ></A ></I > in Chapter 10</A >. </P ></BLOCKQUOTE ></DIV ><DIV CLASS="note" ><BLOCKQUOTE CLASS="note" ><P ><B >Note: </B > When you use <SPAN CLASS="application" >mnoGoSearch</SPAN > to index data stored in your <ACRONYM CLASS="acronym" >SQL</ACRONYM > tables (see <A HREF="msearch-extended-indexing.html#htdb" >the Section called <I >Indexing <ACRONYM CLASS="acronym" >SQL</ACRONYM > tables (<TT CLASS="literal" >htdb:/</TT > virtual <ACRONYM CLASS="acronym" >URL</ACRONYM > scheme) <A NAME="AEN2227" ></A ></I > in Chapter 4</A > for details), you may find useful to run queries joining the table <TT CLASS="literal" >dict</TT > with your own tables. </P ></BLOCKQUOTE ></DIV ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="sql-stor-multi" >Storage mode - multi <A NAME="AEN3306" ></A ></A ></H2 ><P > The <TT CLASS="literal" >multi</TT > mode is suitable for a medium size Web space with up to about <TT CLASS="literal" >50000</TT > documents. It can be useful if your documents are updated very often. </P ><P > If the <TT CLASS="literal" >multi</TT > mode is selected, word information is distributed into <TT CLASS="literal" >256</TT > separate tables <CODE CLASS="varname" >dict00</CODE >..<CODE CLASS="varname" >dictFF</CODE > using a hash function for distribution. The structure of these tables is close to the table <CODE CLASS="varname" >dict</CODE > used in the <TT CLASS="literal" >single</TT > mode: <TT CLASS="literal" >(url_id,secno,word,coords)</TT >. The difference is that all positions of the same word (<TT CLASS="literal" >hits</TT >) in a section of a document are grouped into a single binary array <CODE CLASS="varname" >coords</CODE >, instead of producing multiple records. Word information for different sections is stored in separate records. </P ><P > Similar to the <TT CLASS="literal" >single</TT > mode, the <TT CLASS="literal" >multi</TT > mode supports <SPAN CLASS="emphasis" ><I CLASS="emphasis" >live updates</I ></SPAN >. That is, crawling and indexing are done at the same time. A new document (or an updated document) becomes available for search very soon after <SPAN CLASS="application" >indexer</SPAN > has crawled it. </P ><P > When working in the <TT CLASS="literal" >multi</TT > mode, <SPAN CLASS="application" >indexer</SPAN > performs caching of the word information in memory for better crawling performance. The word cache is flushed to the database as soon as it grows up to the value given in <A HREF="msearch-cmdref-wordcachesize.html" >WordCacheSize</A >, with <TT CLASS="literal" >8Mb</TT > by default. You can change <A HREF="msearch-cmdref-wordcachesize.html" >WordCacheSize</A > to a bigger value for better crawling performance. </P ><DIV CLASS="note" ><BLOCKQUOTE CLASS="note" ><P ><B >Note: </B > The disadvantage of having a too big <A HREF="msearch-cmdref-wordcachesize.html" >WordCacheSize</A > value is that in case when <SPAN CLASS="application" >indexer</SPAN > crashes or dies for any other reasons, all cached information gets lost. </P ></BLOCKQUOTE ></DIV ><P > Grouping word hits into the same record and distribution between multiple tables make the <TT CLASS="literal" >multi</TT > mode much faster both for search and indexing comparing to the <TT CLASS="literal" >single</TT > mode. </P ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="sql-stor-blob" >Storage mode - blob <A NAME="AEN3342" ></A ></A ></H2 ><P >The <TT CLASS="literal" >blob</TT > mode is the fastest mode currently available in <SPAN CLASS="application" >mnoGoSearch</SPAN > for both purposes: indexing and searching. This mode can handle up to <TT CLASS="literal" >1,000,000</TT > - <TT CLASS="literal" >2,000,000</TT > million documents on a single machine. </P ><P > <TT CLASS="literal" >DBMode=blob</TT > is know to work fine with <SPAN CLASS="application" >DB2</SPAN >, <SPAN CLASS="application" >Mimer</SPAN >, <SPAN CLASS="application" >MS SQL</SPAN >, <SPAN CLASS="application" >MySQL</SPAN >, <SPAN CLASS="application" >PostgreSQL</SPAN >, <SPAN CLASS="application" >Oracle</SPAN >, <SPAN CLASS="application" >Sybase</SPAN >, <SPAN CLASS="application" >Firebird</SPAN >/<SPAN CLASS="application" >Interbase</SPAN >, <SPAN CLASS="application" >SQLite3</SPAN >. </P ><P > In the <TT CLASS="literal" >blob</TT > mode crawling and indexing are done separately. Crawling is done by starting <SPAN CLASS="application" >indexer</SPAN > without any command line arguments. At crawling time <SPAN CLASS="application" >indexer</SPAN > collects word information into the table <CODE CLASS="varname" >bdicti</CODE > with a structure optimized for crawling purposes, but not suitable for search purposes. </P ><P > After crawling is done, an extra step is required to create the <SPAN CLASS="emphasis" ><I CLASS="emphasis" >search index</I ></SPAN > by launching <KBD CLASS="userinput" >indexer -Eblob</KBD >. When creating the search index, <SPAN CLASS="application" >indexer</SPAN > loads information from the table <CODE CLASS="varname" >bdicti</CODE >, groups all hits of the same word in different documents together and writes the grouped data into the table <CODE CLASS="varname" >bdict</CODE > with a structure optimized for search purposes. The table <CODE CLASS="varname" >bdict</CODE > consists of three columns (<CODE CLASS="varname" >word</CODE >, <CODE CLASS="varname" >secno</CODE >, <CODE CLASS="varname" >intag</CODE >), where <CODE CLASS="varname" >intag</CODE > is a binary array which includes information about all documents this word appears in (using 32-bit <TT CLASS="literal" >ID</TT >s of the documents), as well as positions of the word in every document (for phrase search). The table <CODE CLASS="varname" >bdict</CODE > has an index on the column <CODE CLASS="varname" >word</CODE > for fast look-up at search time. Words from different sections (e.g. <TT CLASS="literal" >title</TT > and <TT CLASS="literal" >body</TT >) are written in separate records. <DIV CLASS="note" ><BLOCKQUOTE CLASS="note" ><P ><B >Note: </B > Separate records for different sections are needed to optimize searches with section limits, for example "<SPAN CLASS="emphasis" ><I CLASS="emphasis" >find only in title</I ></SPAN >". </P ></BLOCKQUOTE ></DIV > </P ><P > Also, additional arrays of data are written into the table <CODE CLASS="varname" >bdict</CODE >: <P ></P ><UL ><LI ><P > <TT CLASS="literal" >#rec_id</TT > - a list of 32-bit document <TT CLASS="literal" >ID</TT >s </P ></LI ><LI ><P > <TT CLASS="literal" >#last_mod_time</TT > - an array of 32-bit <TT CLASS="literal" >Last-Modified</TT > values (in Unix <TT CLASS="literal" >timestamp</TT > format) - for fast limiting searches by date. </P ></LI ><LI ><P > <TT CLASS="literal" >#pop_rank</TT > - an array of popularity rank values, each in the <SPAN CLASS="type" >32-bit float</SPAN > format. </P ></LI ><LI ><P > <TT CLASS="literal" >#site_id</TT > - an array of 32-bit site <TT CLASS="literal" >IDs</TT >, for <TT CLASS="literal" >GroupBySite</TT >. </P ></LI ><LI ><P > <TT CLASS="literal" >#limit#name</TT > - a list of document <TT CLASS="literal" >IDs</TT > covered by a user defined limit with name "<TT CLASS="literal" >name</TT >". A separate <TT CLASS="literal" >#limit#xxx</TT > record is created for every user defined <A HREF="msearch-cmdref-limit.html" >Limit</A > configured in <TT CLASS="filename" >indexer.conf</TT >. </P ></LI ><LI ><P > <TT CLASS="literal" >#ts</TT > - the timestamp indicating when <KBD CLASS="userinput" >indexer -Eblob</KBD > was executed last time, in textual representation, using the Unix timestamp format. This value is used for invalidating old queries stored in the search result cache, as well as for searches with <SPAN CLASS="emphasis" ><I CLASS="emphasis" >live updates</I ></SPAN >, described in <A HREF="msearch-howstore.html#sql-stor-blob-live" >the Section called <I >Live updates emulator with <TT CLASS="literal" >DBMode=blob</TT ></I ></A >. </P ></LI ><LI ><P > <TT CLASS="literal" >#version</TT > - a string representing the version <TT CLASS="literal" >ID</TT > of <SPAN CLASS="application" >indexer</SPAN > which created the search index. For example, <SPAN CLASS="application" >indexer</SPAN > from <SPAN CLASS="application" >mnoGoSearch 3.3.0</SPAN > writes the string "<TT CLASS="literal" >30300</TT >". This record is required for easier upgrade purposes, to make a newer version of <SPAN CLASS="application" >search.cgi</SPAN > recognize an older format. </P ></LI ></UL > </P ><P > Note, creating fast search index is also possible for the databases using <TT CLASS="literal" >DBMode=single</TT > and <TT CLASS="literal" >DBMode=multi</TT >. This is useful when you need to quickly switch to <TT CLASS="literal" >DBMode=blob</TT > when search performance with the other modes became bad - without even having to re-index your Web space. Later you can completely switch to <TT CLASS="literal" >DBMode=blob</TT > in both <TT CLASS="filename" >indexer.conf</TT > and <TT CLASS="filename" >search.htm</TT >, and run indexing from the very beginning. </P ><P > The disadvantage of <TT CLASS="literal" >DBMode=blob</TT > is that it does not support live updates directly. New or updated documents, crawled by <SPAN CLASS="application" >indexer</SPAN > are not visible for search until <KBD CLASS="userinput" >indexer -Eblob</KBD > is run again. Creating search index takes about <TT CLASS="literal" >6</TT > minutes on a collection with <TT CLASS="literal" >200000</TT > <ACRONYM CLASS="acronym" >HTML</ACRONYM > documents, with <TT CLASS="literal" >10Gb</TT > total size (on a Intel Core Duo 2.13GHz <ACRONYM CLASS="acronym" >CPU</ACRONYM >), which can be unacceptably long for some applications (for example, on a news site, or when using <SPAN CLASS="application" >mnoGoSearch</SPAN > as an external full-text engine for <ACRONYM CLASS="acronym" >SQL</ACRONYM > tables with help of <A HREF="msearch-extended-indexing.html#htdb" >HTDB</A >). </P ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="sql-stor-blob-live" >Live updates emulator with <TT CLASS="literal" >DBMode=blob</TT ></A ></H2 ><P > Starting from version <TT CLASS="literal" >3.3.1</TT >, <SPAN CLASS="application" >mnoGoSearch</SPAN > emulates <SPAN CLASS="emphasis" ><I CLASS="emphasis" >live updates</I ></SPAN > by reading word information for the new or updated documents directly from the crawler table <CODE CLASS="varname" >bdicti</CODE >. It allows to add or update up to about <TT CLASS="literal" >10,000</TT > documents without having to run <KBD CLASS="userinput" >indexer -Eblob</KBD >. To activate using live updates, please add <CODE CLASS="option" >LiveUpdates=yes</CODE > parameter to the <A HREF="msearch-cmdref-dbaddr.html" >DBAddr</A > command in <TT CLASS="filename" >search.htm</TT >. </P ><P > <P ><B >Example:</B ></P > <PRE CLASS="programlisting" > DBAddr mysql://root@localhost/test/?DBMode=blob&LiveUpdates=yes </PRE > </P ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="sql-stor-blob-ext" >Extended features with <CODE CLASS="option" >DBMode=blob</CODE ></A ></H2 ><P > Starting with the version <TT CLASS="literal" >3.3.0</TT >, <KBD CLASS="userinput" >indexer -Eblob</KBD > can be used in combination with <TT CLASS="literal" >URL</TT > and <A HREF="msearch-cmdref-tag.html" >Tag</A > limits, other limits described in <A HREF="msearch-indexing.html#general-subsect" >the Section called <I >Subsection control</I > in Chapter 3</A >, as well as in combination with a user defined limit described by a <A HREF="msearch-cmdref-limit.html" >Limit</A > command. The limits allow to generate a search index over a subset of the documents collected by <SPAN CLASS="application" >indexer</SPAN > at crawling time. </P ><P ><P ><B >Examples:</B ></P > <PRE CLASS="programlisting" > indexer -Eblob -u %/subdir/% indexer -Eblob -t tag indexer -Eblob --fl=limitname </PRE > </P ><P >Starting with the version <TT CLASS="literal" >3.2.36</TT > an additional command is available: <KBD CLASS="userinput" >indexer -Erewriteurl</KBD >. When <SPAN CLASS="application" >indexer</SPAN > is launched with this parameter it rewrites <ACRONYM CLASS="acronym" >URL</ACRONYM > data for <CODE CLASS="option" >DBMode=blob</CODE >. It can be useful to rewrite <ACRONYM CLASS="acronym" >URL</ACRONYM > data quickly without having to rebuild the entire search index, for example if you added the <CODE CLASS="option" >Deflate=yes</CODE > parameter to <A HREF="msearch-cmdref-dbaddr.html" >DBAddr</A >, or after running <KBD CLASS="userinput" >indexer -n0 -R</KBD > to update the <A HREF="msearch-rel.html#poprank" >Popularity Rank</A >. </P ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="sql-store-maxwords" >Maximum amount of words collected from a document</A ></H2 ><P > Starting from the version <TT CLASS="literal" >3.3.0</TT >, <SPAN CLASS="application" >mnoGoSearch</SPAN > enumerates words positions for every section separately and allows to store information about up to 2 million words per section. </P ><P > In the versions prior to <TT CLASS="literal" >3.3.0</TT > it was possible to store up to <TT CLASS="literal" >64K</TT > words from a single document. </P ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="sql-stor-noncrc" >Substring search notes</A ></H2 ><P > The <TT CLASS="literal" >single</TT >, <TT CLASS="literal" >multi</TT > and <TT CLASS="literal" >blob</TT > modes support substring search. An <ACRONYM CLASS="acronym" >SQL</ACRONYM > query containing a <B CLASS="command" >LIKE</B > predicate is executed internally in order to do substring search. Substring search is usually slower than searching for a full word, especially in case of very short substring. You can use the <A HREF="msearch-cmdref-substringmatchminwordlength.html" >SubstringMatchMinWordLength</A > command to limit the minimal word length allowed for substring search. </P ><DIV CLASS="note" ><BLOCKQUOTE CLASS="note" ><P ><B >Note: </B > When performing substring search in the <TT CLASS="literal" >multi</TT > mode, <SPAN CLASS="application" >search.cgi</SPAN > has to iterate search queries through all <TT CLASS="literal" >256</TT > tables <CODE CLASS="varname" >dict00</CODE >..<CODE CLASS="varname" >dictFF</CODE >, which makes substring search especially slow. Using substring search is not recommended with <CODE CLASS="option" >DBMode=multi</CODE >. </P ></BLOCKQUOTE ></DIV ></DIV ></DIV ></DIV ><DIV CLASS="NAVFOOTER" ><HR ALIGN="LEFT" WIDTH="100%"><TABLE SUMMARY="Footer navigation table" WIDTH="100%" BORDER="0" CELLPADDING="0" CELLSPACING="0" ><TR ><TD WIDTH="33%" ALIGN="left" VALIGN="top" ><A HREF="msearch-parsers.html" ACCESSKEY="P" >Prev</A ></TD ><TD WIDTH="34%" ALIGN="center" VALIGN="top" ><A HREF="index.html" ACCESSKEY="H" >Home</A ></TD ><TD WIDTH="33%" ALIGN="right" VALIGN="top" ><A HREF="msearch-cachemode.html" ACCESSKEY="N" >Next</A ></TD ></TR ><TR ><TD WIDTH="33%" ALIGN="left" VALIGN="top" >External parsers<A NAME="AEN2926" ></A ></TD ><TD WIDTH="34%" ALIGN="center" VALIGN="top" > </TD ><TD WIDTH="33%" ALIGN="right" VALIGN="top" >Cache mode storage</TD ></TR ></TABLE ></DIV ><!--#include virtual="body-after.html"--></BODY ></HTML >