<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> <HTML ><HEAD ><TITLE > Segmenters for Chinese, Thai and Japanese languages </TITLE ><META NAME="GENERATOR" CONTENT="Modular DocBook HTML Stylesheet Version 1.79"><LINK REL="HOME" TITLE="mnoGoSearch 3.3.9 reference manual" HREF="index.html"><LINK REL="UP" TITLE="Multiple languages support" HREF="msearch-international.html"><LINK REL="PREVIOUS" TITLE="Search pages with multi-lingual interface " HREF="msearch-multilang.html"><LINK REL="NEXT" TITLE="Indexing multilingual servers" HREF="msearch-vary.html"><LINK REL="STYLESHEET" TYPE="text/css" HREF="mnogo.css"><META NAME="Description" CONTENT="mnoGoSearch - Full Featured Web site Open Source Search Engine Software over the Internet and Intranet Web Sites Based on SQL Database. It is a Free search software covered by GNU license."><META NAME="Keywords" CONTENT="shareware, freeware, download, internet, unix, utilities, search engine, text retrieval, knowledge retrieval, text search, information retrieval, database search, mining, intranet, webserver, index, spider, filesearch, meta, free, open source, full-text, udmsearch, website, find, opensource, search, searching, software, udmsearch, engine, indexing, system, web, ftp, http, cgi, php, SQL, MySQL, database, php3, FreeBSD, Linux, Unix, mnoGoSearch, MacOS X, Mac OS X, Windows, 2000, NT, 95, 98, GNU, GPL, url, grabbing"></HEAD ><BODY CLASS="sect1" BGCOLOR="#EEEEEE" TEXT="#000000" LINK="#000080" VLINK="#800080" ALINK="#FF0000" ><!--#include virtual="body-before.html"--><DIV CLASS="NAVHEADER" ><TABLE SUMMARY="Header navigation table" WIDTH="100%" BORDER="0" CELLPADDING="0" CELLSPACING="0" ><TR ><TH COLSPAN="3" ALIGN="center" ><SPAN CLASS="application" >mnoGoSearch</SPAN > 3.3.9 reference manual: Full-featured search engine software</TH ></TR ><TR ><TD WIDTH="10%" ALIGN="left" VALIGN="bottom" ><A HREF="msearch-multilang.html" ACCESSKEY="P" >Prev</A ></TD ><TD WIDTH="80%" 
ALIGN="center" VALIGN="bottom" >Chapter 9. Multiple languages support</TD ><TD WIDTH="10%" ALIGN="right" VALIGN="bottom" ><A HREF="msearch-vary.html" ACCESSKEY="N" >Next</A ></TD ></TR ></TABLE ><HR ALIGN="LEFT" WIDTH="100%"></DIV ><DIV CLASS="sect1" ><H1 CLASS="sect1" ><A NAME="cjk" >Segmenters for Chinese, Thai and Japanese languages</A ></H1 ><P > Unlike text in Western languages, text in the Asian languages Chinese, Thai and Japanese may have no spaces between the words in a phrase. Thus, when indexing documents in these languages, a search engine needs to know how to split phrases into separate words, and it also needs to know the word boundaries when running a search query. <SPAN CLASS="application" >mnoGoSearch</SPAN > can find Asian word boundaries with the help of so-called <TT CLASS="literal" >segmenters</TT >. </P ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="ja-segment" >Japanese phrase segmenter <A NAME="AEN4256" ></A ></A ></H2 ><P > <SPAN CLASS="application" >mnoGoSearch</SPAN > can use the <SPAN CLASS="application" > <A HREF="http://chasen.aist-nara.ac.jp/" TARGET="_top" >ChaSen</A > </SPAN > and <SPAN CLASS="application" > <A HREF="http://mecab.sourceforge.net/" TARGET="_top" >MeCab</A > </SPAN > Japanese morphological analyzers to break phrases into words. </P ><P >To build <SPAN CLASS="application" >mnoGoSearch</SPAN > with Japanese phrase segmenting, use either the <CODE CLASS="option" >--with-chasen</CODE > or the <CODE CLASS="option" >--with-mecab</CODE > command-line switch when running <SPAN CLASS="application" >configure</SPAN >. </P ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="zh-segment" >Chinese phrase segmenter <A NAME="AEN4272" ></A ></A ></H2 ><P > <SPAN CLASS="application" >mnoGoSearch</SPAN > uses frequency dictionaries for Chinese phrase segmenting. 
Segmenting is implemented with <SPAN CLASS="emphasis" ><I CLASS="emphasis" >dynamic programming</I ></SPAN >, choosing the split that maximizes the cumulative frequency of the separate words produced from a phrase. </P ><P > The <SPAN CLASS="application" >mnoGoSearch</SPAN > distribution includes two Chinese dictionaries: <TT CLASS="filename" >mandarin.freq</TT >, a <SPAN CLASS="emphasis" ><I CLASS="emphasis" >Simplified Chinese</I ></SPAN > dictionary, and <TT CLASS="filename" >TraditionalChinese.freq</TT >, a <SPAN CLASS="emphasis" ><I CLASS="emphasis" >Traditional Chinese</I ></SPAN > dictionary. </P ><DIV CLASS="note" ><BLOCKQUOTE CLASS="note" ><P ><B >Note: </B > When building <SPAN CLASS="application" >mnoGoSearch</SPAN > from source for use with the Chinese language, remember to add <CODE CLASS="option" >--with-extra-charsets=big5,gb2312</CODE > when running <SPAN CLASS="application" >configure</SPAN >. </P ></BLOCKQUOTE ></DIV ><P > Use the <B CLASS="command" ><A HREF="msearch-cmdref-loadchineselist.html" >LoadChineseList</A ></B > command to enable Chinese phrase segmenting, with this format: <PRE CLASS="programlisting" > LoadChineseList [charset filename] </PRE > You can optionally specify the character set name and the file name of a dictionary. <DIV CLASS="note" ><BLOCKQUOTE CLASS="note" ><P ><B >Note: </B > <A HREF="msearch-cmdref-loadchineselist.html" >LoadChineseList</A > loads the dictionary for <SPAN CLASS="emphasis" ><I CLASS="emphasis" >Simplified Chinese</I ></SPAN > by default, that is, it uses the <TT CLASS="literal" >GB2312</TT > character set and the file <TT CLASS="filename" >mandarin.freq</TT >. 
Nevertheless, you may find it convenient to specify the default values explicitly: <PRE CLASS="programlisting" > LoadChineseList gb2312 mandarin.freq </PRE > </P ></BLOCKQUOTE ></DIV > To enable <SPAN CLASS="emphasis" ><I CLASS="emphasis" >Traditional Chinese</I ></SPAN > segmenting, use this command: <PRE CLASS="programlisting" > LoadChineseList big5 TraditionalChinese.freq </PRE > </P ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="th-segment" >Thai phrase segmenter <A NAME="AEN4304" ></A ></A ></H2 ><P > Thai segmenting uses the same method as Chinese segmenting, with the help of a Thai frequency dictionary, <TT CLASS="filename" >thai.freq</TT >, which is included in the <SPAN CLASS="application" >mnoGoSearch</SPAN > distribution. </P ><P > Use the <B CLASS="command" ><A HREF="msearch-cmdref-loadthailist.html" >LoadThaiList</A ></B > command to enable Thai phrase segmenting, with this format: <PRE CLASS="programlisting" > LoadThaiList [charset dictionaryfilename] </PRE > <DIV CLASS="note" ><BLOCKQUOTE CLASS="note" ><P ><B >Note: </B > The <TT CLASS="literal" >TIS-620</TT > character set and the file <TT CLASS="filename" >thai.freq</TT > are used by default. That is, if you use <B CLASS="command" ><A HREF="msearch-cmdref-loadthailist.html" >LoadThaiList</A ></B > without any arguments, it is effectively the same as this command: <PRE CLASS="programlisting" > LoadThaiList tis-620 thai.freq </PRE > </P ></BLOCKQUOTE ></DIV > </P ></DIV ><DIV CLASS="sect2" ><H2 CLASS="sect2" ><A NAME="cjk-segment" >The <ACRONYM CLASS="acronym" >CJK</ACRONYM > phrase segmenter <A NAME="AEN4324" ></A ></A ></H2 ><P > Starting with version <TT CLASS="literal" >3.3.8</TT >, <SPAN CLASS="application" >mnoGoSearch</SPAN > also supports a special universal segmenter which is suitable for Japanese, Traditional Chinese and Simplified Chinese. The universal <ACRONYM CLASS="acronym" >CJK</ACRONYM > segmenter does not use dictionaries and does not require external libraries. 
</P ><P > You can enable the <ACRONYM CLASS="acronym" >CJK</ACRONYM > segmenter by adding this command to both <TT CLASS="filename" >indexer.conf</TT > and <TT CLASS="filename" >search.htm</TT >: <PRE CLASS="programlisting" > Segmenter cjk </PRE > </P ><P > The <ACRONYM CLASS="acronym" >CJK</ACRONYM > segmenter treats every ideogram character from the Unicode blocks <TT CLASS="literal" >CJK Ideographs Extension A (U+3400 - U+4DB5)</TT > and <TT CLASS="literal" >CJK Ideographs (U+4E00 - U+9FA5)</TT > as a separate word. When indexing a document using the <ACRONYM CLASS="acronym" >CJK</ACRONYM > segmenter, <SPAN CLASS="application" >mnoGoSearch</SPAN > stores information about every ideogram character separately. </P ><P > At search time, the search query you type is preprocessed by the <ACRONYM CLASS="acronym" >CJK</ACRONYM > segmenter, which inserts delimiters between the ideograms. </P ><P > If you pass the <TT CLASS="literal" >m=phrase</TT > query string parameter to <SPAN CLASS="application" >search.cgi</SPAN > (which means <TT CLASS="literal" >exact phrase search</TT >), the <ACRONYM CLASS="acronym" >CJK</ACRONYM > segmenter uses the dash character as the delimiter. Otherwise (that is, in the <TT CLASS="literal" >all words</TT > and <TT CLASS="literal" >any of the words</TT > search modes) it uses the space character. </P ><P > Imagine you type the query ``<KBD CLASS="userinput" >ABCD</KBD >'', where <TT CLASS="literal" >A</TT >, <TT CLASS="literal" >B</TT >, <TT CLASS="literal" >C</TT >, <TT CLASS="literal" >D</TT > are some ideographic characters. When the <TT CLASS="literal" >exact phrase search</TT > mode is <SPAN CLASS="emphasis" ><I CLASS="emphasis" >not</I ></SPAN > active, your query is preprocessed by the <ACRONYM CLASS="acronym" >CJK</ACRONYM > segmenter to ``<TT CLASS="literal" >A B C D</TT >'' and the four individual "words" are searched. 
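</P ><P > The query preprocessing described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not <SPAN CLASS="application" >mnoGoSearch</SPAN > source code: the names <TT CLASS="literal" >is_ideogram</TT > and <TT CLASS="literal" >preprocess_query</TT > are hypothetical, the two Unicode ranges are the ones listed above, and the sketch assumes that characters outside these ranges are copied through unchanged. </P ><PRE CLASS="programlisting" >
```python
# Hypothetical sketch; these names are illustrative and are not part of mnoGoSearch.

# Unicode ranges named in the text above (both bounds inclusive).
CJK_RANGES = (
    range(0x3400, 0x4DB5 + 1),  # CJK Ideographs Extension A
    range(0x4E00, 0x9FA5 + 1),  # CJK Ideographs
)


def is_ideogram(ch):
    """Return True if the character falls into one of the two ranges."""
    return any(ord(ch) in r for r in CJK_RANGES)


def preprocess_query(query, phrase_mode=False):
    """Insert a delimiter between adjacent ideograms.

    A dash is used in exact phrase mode and a space otherwise.
    Assumption: non-ideographic characters are copied unchanged.
    """
    delimiter = "-" if phrase_mode else " "
    out = []
    prev_is_ideogram = False
    for ch in query:
        cur_is_ideogram = is_ideogram(ch)
        # Only a pair of neighbouring ideograms gets a delimiter.
        if cur_is_ideogram and prev_is_ideogram:
            out.append(delimiter)
        out.append(ch)
        prev_is_ideogram = cur_is_ideogram
    return "".join(out)


if __name__ == "__main__":
    query = "\u4e2d\u6587\u641c\u7d22"  # four ideograms
    print(preprocess_query(query))
    print(preprocess_query(query, phrase_mode=True))
```
</PRE ><P > With this sketch, a query of four ideograms is joined with spaces by default and with dashes when <TT CLASS="literal" >phrase_mode=True</TT > is passed, which mirrors the two delimiter rules given above. </P ><P >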
Note that <SPAN CLASS="application" >mnoGoSearch</SPAN > ranks documents with a smaller distance between the query words higher than documents that have the same words in different parts of the document, so if some documents contain the exact phrase <TT CLASS="literal" >ABCD</TT >, it is very likely that they will be displayed in the top <TT CLASS="literal" >10</TT > results. </P ><DIV CLASS="note" ><BLOCKQUOTE CLASS="note" ><P ><B >Note: </B > You can try different values for the <A HREF="msearch-cmdref-worddistanceweight.html" >WordDistanceWeight</A > command to see how distances between the query words in the found documents affect their final score. </P ></BLOCKQUOTE ></DIV ><P > Now imagine you type the same query ``<KBD CLASS="userinput" >ABCD</KBD >'' with the <TT CLASS="literal" >exact phrase search</TT > mode <SPAN CLASS="emphasis" ><I CLASS="emphasis" >enabled</I ></SPAN >. The query will be preprocessed by the <ACRONYM CLASS="acronym" >CJK</ACRONYM > segmenter to ``<TT CLASS="literal" >A-B-C-D</TT >''. The dash character forces automatic phrase search (see <A HREF="msearch-doingsearch.html#search-phrase" >the Section called <I >Phrase search <A NAME="AEN5073" ></A ></I > in Chapter 10</A > for details on automatic phrase search), so only documents with an exact phrase match will be found. </P ><DIV CLASS="note" ><BLOCKQUOTE CLASS="note" ><P ><B >Note: </B > You can also use the ordinary <SPAN CLASS="application" >mnoGoSearch</SPAN > query syntax with quotes to enable phrase searches without having to pass the <TT CLASS="literal" >m=phrase</TT > query string variable (<TT CLASS="literal" >exact phrase search</TT > mode). 
For example, if you type ``<TT CLASS="literal" >"AB" "CD"</TT >'', then the documents having the ideogram <TT CLASS="literal" >A</TT > immediately followed by the ideogram <TT CLASS="literal" >B</TT >, and at the same time the ideogram <TT CLASS="literal" >C</TT > immediately followed by the ideogram <TT CLASS="literal" >D</TT >, will be found. The relative positions of the phrases <TT CLASS="literal" >AB</TT > and <TT CLASS="literal" >CD</TT > do not affect the result set; they affect only the result ordering. </P ></BLOCKQUOTE ></DIV ><P > Although the <ACRONYM CLASS="acronym" >CJK</ACRONYM > phrase segmenter is not aware of the real word boundaries, tests made by native speakers indicated that in many cases it works even better and more predictably than the <SPAN CLASS="application" >MeCab</SPAN >-based, <SPAN CLASS="application" >ChaSen</SPAN >-based, and frequency-based segmenters. </P ></DIV ></DIV ><DIV CLASS="NAVFOOTER" ><HR ALIGN="LEFT" WIDTH="100%"><TABLE SUMMARY="Footer navigation table" WIDTH="100%" BORDER="0" CELLPADDING="0" CELLSPACING="0" ><TR ><TD WIDTH="33%" ALIGN="left" VALIGN="top" ><A HREF="msearch-multilang.html" ACCESSKEY="P" >Prev</A ></TD ><TD WIDTH="34%" ALIGN="center" VALIGN="top" ><A HREF="index.html" ACCESSKEY="H" >Home</A ></TD ><TD WIDTH="33%" ALIGN="right" VALIGN="top" ><A HREF="msearch-vary.html" ACCESSKEY="N" >Next</A ></TD ></TR ><TR ><TD WIDTH="33%" ALIGN="left" VALIGN="top" >Search pages with multi-lingual interface <A NAME="AEN4113" ></A ></TD ><TD WIDTH="34%" ALIGN="center" VALIGN="top" ><A HREF="msearch-international.html" ACCESSKEY="U" >Up</A ></TD ><TD WIDTH="33%" ALIGN="right" VALIGN="top" >Indexing multilingual servers</TD ></TR ></TABLE ></DIV ><!--#include virtual="body-after.html"--></BODY ></HTML >