<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> <HTML ><HEAD ><TITLE >mnoGoSearch cluster</TITLE ><META NAME="GENERATOR" CONTENT="Modular DocBook HTML Stylesheet Version 1.79"><LINK REL="HOME" TITLE="mnoGoSearch 3.3.10 reference manual" HREF="index.html"><LINK REL="PREVIOUS" TITLE="Fuzzy search" HREF="msearch-fuzzy.html"><LINK REL="NEXT" TITLE="Miscellaneous" HREF="msearch-misc.html"><LINK REL="STYLESHEET" TYPE="text/css" HREF="mnogo.css"><META NAME="Description" CONTENT="mnoGoSearch - Full Featured Web site Open Source Search Engine Software over the Internet and Intranet Web Sites Based on SQL Database. It is a Free search software covered by GNU license."><META NAME="Keywords" CONTENT="shareware, freeware, download, internet, unix, utilities, search engine, text retrieval, knowledge retrieval, text search, information retrieval, database search, mining, intranet, webserver, index, spider, filesearch, meta, free, open source, full-text, udmsearch, website, find, opensource, search, searching, software, udmsearch, engine, indexing, system, web, ftp, http, cgi, php, SQL, MySQL, database, php3, FreeBSD, Linux, Unix, mnoGoSearch, MacOS X, Mac OS X, Windows, 2000, NT, 95, 98, GNU, GPL, url, grabbing"></HEAD ><BODY CLASS="chapter" BGCOLOR="#EEEEEE" TEXT="#000000" LINK="#000080" VLINK="#800080" ALINK="#FF0000" ><!--#include virtual="body-before.html"--><DIV CLASS="NAVHEADER" ><TABLE SUMMARY="Header navigation table" WIDTH="100%" BORDER="0" CELLPADDING="0" CELLSPACING="0" ><TR ><TH COLSPAN="3" ALIGN="center" ><SPAN CLASS="application" >mnoGoSearch</SPAN > 3.3.10 reference manual: Full-featured search engine software</TH ></TR ><TR ><TD WIDTH="10%" ALIGN="left" VALIGN="bottom" ><A HREF="msearch-fuzzy.html" ACCESSKEY="P" >Prev</A ></TD ><TD WIDTH="80%" ALIGN="center" VALIGN="bottom" ></TD ><TD WIDTH="10%" ALIGN="right" VALIGN="bottom" ><A HREF="msearch-misc.html" ACCESSKEY="N" >Next</A ></TD ></TR ></TABLE ><HR ALIGN="LEFT" WIDTH="100%"></DIV ><DIV CLASS="chapter" ><H1 ><A NAME="cluster" ></A >Chapter 11. mnoGoSearch cluster</H1 ><DIV CLASS="TOC" ><DL ><DT ><B >Table of Contents</B ></DT ><DT ><A HREF="msearch-cluster.html#cluster-introduction" >Introduction</A ></DT ><DT ><A HREF="msearch-cluster.html#cluster-how-it-works" >How it works</A ></DT ><DT ><A HREF="msearch-cluster.html#cluster-database-machine-operations" >Operations done on the database machines</A ></DT ><DT ><A HREF="msearch-cluster.html#cluster-xml-response-example" >How a typical XML response looks like</A ></DT ><DT ><A HREF="msearch-cluster.html#cluster-frontend-machine-operations" >Operations done on the front-end machine</A ></DT ><DT ><A HREF="msearch-cluster.html#cluster-types" >Cluster types</A ></DT ><DT ><A HREF="msearch-cluster.html#cluster-configure-merge" >Installing and configuring a "merge" cluster</A ></DT ><DT ><A HREF="msearch-cluster.html#cluster-configure-distributed" >Installing and configuring a "distributed" cluster</A ></DT ><DT ><A HREF="msearch-cluster.html#cluster-addnode" >Using dump/restore tools to add cluster nodes</A ></DT ><DT ><A HREF="msearch-cluster.html#cluster-limitations" >Cluster limitations</A ></DT ></DL ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="cluster-introduction" >Introduction</A ></H2 ><P > Starting from the version 3.3.0, <SPAN CLASS="application" >mnoGoSearch</SPAN > provides a clustered solution, which allows to scale search on several computers, extending database size up to several dozens or even hundreds million documents. </P ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="cluster-how-it-works" >How it works</A ></H2 ><P > A typical cluster consists of several database machines (cluster nodes) and a single front-end machine. The front-end machine receives HTTP requests from a user's browser, forwards search queries to the database machines using HTTP protocol, receives back a limited number of the best top search results (using a simple XML format, based on OpenSearch specifications) from every database machine, then parses and merges the results ordering them by score, and displays the results applying HTML template. This approach distributes operations with high CPU and hard disk consumption between the database machines in parallel, leaving simple merge and HTML template processing functions to the the front-end machine. </P ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="cluster-database-machine-operations" >Operations done on the database machines</A ></H2 ><P > In a clustered <SPAN CLASS="application" >mnoGoSearch</SPAN > installation, all hard operations are done on the database machines. </P ><P > This is an approximate distribution chart of time spent on different search steps: </P ><P > <P ></P ><UL ><LI ><P >5% - fetching word information</P ></LI ><LI ><P >30% - grouping words by doc ID</P ></LI ><LI ><P >30% - calculating score for each document</P ></LI ><LI ><P >20% - sorting documents according to score</P ></LI ><LI ><P >10% - fetching cached copies and generating excerpts for the best 10 documents</P ></LI ><LI ><P >5% - generating XML or HTML results using template</P ></LI ></UL > </P ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="cluster-xml-response-example" >How a typical XML response looks like</A ></H2 ><P > <PRE CLASS="programlisting" > <rss> <channel> <!-- General search results information block --> <openSearch:totalResults>20</openSearch:totalResults> <openSearch:startIndex>1</openSearch:startIndex> <openSearch:itemsPerPage>2</openSearch:itemsPerPage> <!-- Word statistics block --> <mnoGoSearch:WordStatList> <mnoGoSearch:WordStatItem order="0" count="300" word="apache"/> <mnoGoSearch:WordStatItem order="1" count="103" word="web"/> <mnoGoSearch:WordStatItem order="2" count="250" word="server"/> </mnoGoSearch:WordStatList> <!-- document information block --> <item> <id>1</id> <score>70.25%</score> <title>Test page of the Apache HTTP Server</title> <link>http://hostname/index.html</link> <description>...to use the images below on Apache and Fedora Core powered HTTP servers. Thanks for using Apache ... </description> <updated>2006-12-13T18:30:02Z</updated> <content-length>3956</content-length> <content-type>text/html</content-type> </item> <!-- more items, typically 10 items total --> </channel> </rss> </PRE > </P ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="cluster-frontend-machine-operations" >Operations done on the front-end machine</A ></H2 ><P > The front-end machine receives XML responses from every database machine. On the first query, the front-end machine requests top 10 results from every database machine. An XML response for the top 10 results page is about 5Kb. Parsing of each XML response takes less than 1% of time. Thus, a cluster consisting of 50 machines is about 50% slower than a cluster consisting of a single machine, but allows to search through a 50-times bigger collection of documents. </P ><P > If the user is not satisfied with search results returned on the first page and navigates to higher pages, then the front-end machine requests ps*np results from each database machine, where ps is page size (10 by default), and np is page number. Thus, to display the 5th result page, the front-end machine requests 50 results from every database machine and has to do five-times more parsing job, which makes search on higher pages a little bit slower. But typically, users look through not more than 2-3 pages. </P ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="cluster-types" >Cluster types</A ></H2 ><P > <SPAN CLASS="application" >mnoGoSearch</SPAN > supports two cluster types. A "merge" cluster is to join results from multiple independent machines, each one created by its own <TT CLASS="filename" >indexer.conf</TT >. This type of cluster is recommended when it is possible to distribute your web space into separate databases evenly using URL patterns by means of the <A HREF="msearch-cmdref-server.html" >Server</A > or <A HREF="msearch-cmdref-realm.html" >Realm</A > commands. For example, if you need to index three sites <SPAN CLASS="emphasis" ><I CLASS="emphasis" >siteA</I ></SPAN >, <SPAN CLASS="emphasis" ><I CLASS="emphasis" >siteB</I ></SPAN > and <SPAN CLASS="emphasis" ><I CLASS="emphasis" >siteC</I ></SPAN > with an approximately equal number of documents on each site, and you have three cluster machines <SPAN CLASS="emphasis" ><I CLASS="emphasis" >nodeA</I ></SPAN >, <SPAN CLASS="emphasis" ><I CLASS="emphasis" >nodeB</I ></SPAN > and <SPAN CLASS="emphasis" ><I CLASS="emphasis" >nodeC</I ></SPAN >, you can put each site to a separate machine using a corresponding <A HREF="msearch-cmdref-server.html" >Server</A > command in the <TT CLASS="filename" >indexer.conf</TT > file on each cluster machine: <PRE CLASS="programlisting" > # indexer.conf on machine nodeA: DBAddr mysql://root@localhost/test/ Server http://siteA/ # indexer.conf on machine nodeB: DBAddr mysql://root@localhost/test/ Server http://siteB/ # indexer.conf on machine nodeC: DBAddr mysql://root@localhost/test/ Server http://siteC/ </PRE > </P ><P > A "distributed" cluster is created by a single <TT CLASS="filename" >indexer.conf</TT >, with "indexer" automatically distributing search data between database machines. This type of cluster is recommended when it is hard to distribute web space between cluster machines using URL patterns, for example when you don't know your site sizes or the site sizes are very different. </P ><P > <DIV CLASS="note" ><BLOCKQUOTE CLASS="note" ><P ><B >Note: </B > Even distribution of search data between cluster machines is important to achieve the best performance. Search front-end waits for the slowest cluster node. Thus, if cluster machines <SPAN CLASS="emphasis" ><I CLASS="emphasis" >nodeA</I ></SPAN > and <SPAN CLASS="emphasis" ><I CLASS="emphasis" >nodeB</I ></SPAN > return search results in 0.1 seconds and <SPAN CLASS="emphasis" ><I CLASS="emphasis" >nodeC</I ></SPAN > return results in 0.5 seconds, the overall cluster response time will be about 0.5 seconds. </P ></BLOCKQUOTE ></DIV > </P ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="cluster-configure-merge" >Installing and configuring a "merge" cluster</A ></H2 ><P > <DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="cluster-configure-merge-database-machines" >Configuring the database machines</A ></H3 ><P > On each database machine install <SPAN CLASS="application" >mnoGoSearch</SPAN > using usual procedure: </P ><P ></P ><UL ><LI ><P > Configure <TT CLASS="filename" >indexer.conf</TT >: Edit DBAddr - usually specifying a database installed on the local host, for example: <PRE CLASS="programlisting" > DBAddr mysql://localhost/dbname/?dbmode=blob </PRE > </P ></LI ><LI ><P > Add a "Server" command corresponding to a desired collection of documents - its own collection on every database machine. Index collections on every database machines by running "indexer" then "indexer -Eblob". </P ></LI ><LI ><P > Configure <TT CLASS="filename" >search.htm</TT > by copying the DBAddr command from <TT CLASS="filename" >indexer.conf</TT >. </P ></LI ><LI ><P > Make sure that search works in "usual" (non-clustered) mode by opening http://hostname/cgi-bin/search.cgi in your browser and typing some search query, for example the word "test", or some other word which present in the document collection. </P ></LI ></UL ></DIV > <DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="cluster-configure-xml-interface" >Configuring XML interface on the database machines</A ></H3 ><P > Additionally to the usual installation steps, it's also necessary to configure XML interface on every database machine. </P ><P > Go through the following steps: <P ></P ><UL ><LI ><P >cd /usr/local/mnogosearch/etc/</P ></LI ><LI ><P >cp node.xml-dist node.xml</P ></LI ><LI ><P >Edit node.xml by specifying the same DBAddr</P ></LI ><LI ><P > make sure XML search returns a well-formed response (according to the above format) by opening http://hostname/cgi-bin/search.cgi/node.xml?q=test </P ></LI ></UL > </P ><P > After these steps, you will have several separate document collections, every collection indexed into its own database, and configured XML interfaces on all database machine. </P ></DIV > <DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="cluster-configure-merge-frontend-machine" >Configuring the front-end machine</A ></H3 ><P > Install <SPAN CLASS="application" >mnoGoSearch</SPAN > using usual procedure, then do the following additional steps: <P ></P ><UL ><LI ><P >cd /usr/local/mnogosearch/etc/</P ></LI ><LI ><P >cp search.htm-dist search.htm</P ></LI ><LI ><P > Edit <TT CLASS="filename" >search.htm</TT > by specifying URLs of XML interfaces of all database machines, adding "?${NODE_QUERY_STRING}" after "node.xml": <PRE CLASS="programlisting" > DBAddr http://hostname1/cgi-bin/search.cgi/node.xml?${NODE_QUERY_STRING} DBAddr http://hostname2/cgi-bin/search.cgi/node.xml?${NODE_QUERY_STRING} DBAddr http://hostname3/cgi-bin/search.cgi/node.xml?${NODE_QUERY_STRING} </PRE > </P ></LI ></UL > </P ><P > You're done. Now open http://frontend-hostname/cgi-bin/search.cgi in your browser and test searches. </P ><P > Note: "DBAddr file:///path/to/response.xml" is also understood - to load an XML-formatted response from a static file. This is mostly for test purposes. </P ></DIV > </P ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="cluster-configure-distributed" >Installing and configuring a "distributed" cluster</A ></H2 ><P > <DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="cluster-configure-distributed-database-machines" >Configuring the database machines</A ></H3 ><P > Install <SPAN CLASS="application" >mnoGoSearch</SPAN > on a single database machine. Edit <TT CLASS="filename" >indexer.conf</TT > by specifying multiple DBAddr commands: <PRE CLASS="programlisting" > DBAddr mysql://hostname1/dbname/?dbmode=blob DBAddr mysql://hostname2/dbname/?dbmode=blob DBAddr mysql://hostname3/dbname/?dbmode=blob </PRE > and describing web space using Realm or Server commands. For example: <PRE CLASS="programlisting" > # # The entire top level domain .ru, # using http://www.ru/ as a start point # Server http://www.ru/ Realm http://*.ru/* </PRE > After that, install <SPAN CLASS="application" >mnoGoSearch</SPAN > on all other database machines and copy <TT CLASS="filename" >indexer.conf</TT > from the first database machine. Configuration of indexer is done. Now you can start "indexer" on any database machine, and then "indexer -Eblob" after it finishes. indexer will distribute data between the databases specified in the DBAddr commands. </P ></DIV > <DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="cluster-how-distribute-data" >How indexer distributes data between multiple databases</A ></H3 ><P > The number of the database a document is put into is calculated as a result of division of url.seed by the number of DBAddr commands specified in <TT CLASS="filename" >indexer.conf</TT >, where url.seed is calculated using hash(URL). </P ><P > Thus, for <TT CLASS="filename" >indexer.conf</TT > having three DBAddr command, distribution is done as follows: <P ></P ><UL ><LI ><P >URLs with seed 0..85 go to the first DBAddr</P ></LI ><LI ><P >URLs with seed 85..170 go to the second DBAddr</P ></LI ><LI ><P >URLs with seed 171..255 go to the third DBAddr</P ></LI ></UL > Prior to version 3.3.0, indexer could also distribute data between several databases, but the distribution was done using reminder of division url.seed by the number of DBAddr commands. </P ><P > The new distribution style, introduced in 3.3.0, simplifies manual redistribution of an existing clustered database when adding a new DBAddr (i.e. a new database machine). Future releases will likely provide automatic tools for redistribution data when adding or deleting machines in an existing cluster, as well as more configuration commands to control distribution. </P ></DIV > <DIV CLASS="sect3" ><H3 CLASS="sect3" ><A NAME="cluster-configure-distributed-frontend-machine" >Configuring the front-end machine</A ></H3 ><P > Follow the same configuration instructions with the "merge" cluster type. </P ></DIV > </P ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="cluster-addnode" >Using dump/restore tools to add cluster nodes</A ></H2 ><P > Starting from the version <TT CLASS="literal" >3.3.9</TT >, <SPAN CLASS="application" >mnoGoSearch</SPAN > allows to add new nodes into a cluster (or remove nodes) without having to re-crawl the documents once again. </P ><P > Suppose you have <TT CLASS="literal" >5</TT > cluster nodes and what extend the cluster to <TT CLASS="literal" >10</TT > nodes. Please go through the following steps: <P ></P ><OL TYPE="1" ><LI ><P > Stop all <SPAN CLASS="application" >indexer</SPAN > processes. </P ></LI ><LI ><P > Create all new <TT CLASS="literal" >10</TT > <ACRONYM CLASS="acronym" >SQL</ACRONYM > databases and create a new <TT CLASS="filename" >.conf</TT > file with <TT CLASS="literal" >10</TT > <B CLASS="command" ><A HREF="msearch-cmdref-dbaddr.html" >DBAddr</A ></B > commands. Note, the old and the new <ACRONYM CLASS="acronym" >SQL</ACRONYM > databases can NOT overlap. The new databases must be freshly created empty databases. </P ></LI ><LI ><P > Run <PRE CLASS="programlisting" >indexer -d /path/to/new.conf -Ecreate</PRE > to create table structure in all <TT CLASS="literal" >10</TT > new <ACRONYM CLASS="acronym" >SQL</ACRONYM > databases. </P ></LI ><LI ><P > Make sure you have enough disk space - you'll need about <TT CLASS="literal" >2</TT > times extra disk space of the all original <ACRONYM CLASS="acronym" >SQL</ACRONYM > databases size. </P ></LI ><LI ><P > Create a directory, say, <TT CLASS="filename" >/usr/local/mnogosearch/dump-restore/</TT >, where you will put the dump file, then go to this directory. </P ></LI ><LI ><P > Run <PRE CLASS="programlisting" >indexer -Edumpdata | gzip > dumpfile.sql.gz </PRE > It will create the dump file. </P ></LI ><LI ><P > Run <PRE CLASS="programlisting" >zcat dumpfile.sql.gz | indexer -d /path/to/new.conf -Esql -v2</PRE > It will load information from the dump file and put it into the new <ACRONYM CLASS="acronym" >SQL</ACRONYM > databases. Note, all document IDs will be automatically re-assigned. </P ></LI ><LI ><P > Check that restoring worked fine. These two commands should report the same statistics: <PRE CLASS="programlisting" > indexer -Estat indexer -Estat -d /path/to/new.conf </PRE > </P ></LI ><LI ><P > Run <PRE CLASS="programlisting" >indexer -d /path/to/new.conf -Eblob</PRE > to create inverted search index in the new <ACRONYM CLASS="acronym" >SQL</ACRONYM > databases; </P ></LI ><LI ><P > Configure a new search front-end to use the new <ACRONYM CLASS="acronym" >SQL</ACRONYM > databases and check that search bring the same results from the old and the new databases. </P ></LI ></OL > </P ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="cluster-limitations" >Cluster limitations</A ></H2 ><P > <P ></P ><UL ><LI ><P > As of version 3.3.0, <SPAN CLASS="application" >mnoGoSearch</SPAN > allows to join up to 256 database machines into a single cluster. </P ></LI ><LI ><P > Only "body" and "title" sections are currently supported. Support for other standard sections, like meta.keywords and meta.description, as well as for user-defined sections will be added in the future 3.3.x releases. </P ></LI ><LI ><P > Cluster of clusters is not fully functional yet. It will likely be implemented in one of the near future 3.3.x release. </P ></LI ><LI ><P > <SPAN CLASS="emphasis" ><I CLASS="emphasis" >Popularity rank</I ></SPAN > does not work well with cluster. Links between documents residing on different cluster nodes are not taken into account. This will be fixed in future releases. </P ></LI ></UL > </P ></DIV ></DIV ><DIV CLASS="NAVFOOTER" ><HR ALIGN="LEFT" WIDTH="100%"><TABLE SUMMARY="Footer navigation table" WIDTH="100%" BORDER="0" CELLPADDING="0" CELLSPACING="0" ><TR ><TD WIDTH="33%" ALIGN="left" VALIGN="top" ><A HREF="msearch-fuzzy.html" ACCESSKEY="P" >Prev</A ></TD ><TD WIDTH="34%" ALIGN="center" VALIGN="top" ><A HREF="index.html" ACCESSKEY="H" >Home</A ></TD ><TD WIDTH="33%" ALIGN="right" VALIGN="top" ><A HREF="msearch-misc.html" ACCESSKEY="N" >Next</A ></TD ></TR ><TR ><TD WIDTH="33%" ALIGN="left" VALIGN="top" >Fuzzy search</TD ><TD WIDTH="34%" ALIGN="center" VALIGN="top" > </TD ><TD WIDTH="33%" ALIGN="right" VALIGN="top" >Miscellaneous</TD ></TR ></TABLE ></DIV ><!--#include virtual="body-after.html"--></BODY ></HTML >