<HTML
><HEAD
><TITLE
>Language support</TITLE
><META
NAME="GENERATOR"
CONTENT="Modular DocBook HTML Stylesheet Version 1.73
"><LINK
REL="HOME"
TITLE="mnoGoSearch 3.2 reference manual"
HREF="index.html"><LINK
REL="PREVIOUS"
TITLE="Tags

"
HREF="msearch-tags.html"><LINK
REL="NEXT"
TITLE="Making multi-language search pages

"
HREF="msearch-multilang.html"><LINK
REL="STYLESHEET"
TYPE="text/css"
HREF="mnogo.css"><META
NAME="Description"
CONTENT="mnoGoSearch - Full Featured Web site Open Source Search Engine Software over the Internet and Intranet Web Sites Based on SQL Database. It is a Free search software covered by GNU license."><META
NAME="Keywords"
CONTENT="shareware, freeware, download, internet, unix, utilities, search engine, text retrieval, knowledge retrieval, text search, information retrieval, database search, mining, intranet, webserver, index, spider, filesearch, meta, free, open source, full-text, udmsearch, website, find, opensource, search, searching, software, udmsearch, engine, indexing, system, web, ftp, http, cgi, php, SQL, MySQL, database, php3, FreeBSD, Linux, Unix, mnoGoSearch, MacOS X, Mac OS X, Windows, 2000, NT, 95, 98, GNU, GPL, url, grabbing"></HEAD
><BODY
CLASS="chapter"
BGCOLOR="#EEEEEE"
TEXT="#000000"
LINK="#000080"
VLINK="#800080"
ALINK="#FF0000"
><DIV
CLASS="NAVHEADER"
><TABLE
SUMMARY="Header navigation table"
WIDTH="100%"
BORDER="0"
CELLPADDING="0"
CELLSPACING="0"
><TR
><TH
COLSPAN="3"
ALIGN="center"
>mnoGoSearch 3.2 reference manual: Full-featured search engine software</TH
></TR
><TR
><TD
WIDTH="10%"
ALIGN="left"
VALIGN="bottom"
><A
HREF="msearch-tags.html"
ACCESSKEY="P"
>Prev</A
></TD
><TD
WIDTH="80%"
ALIGN="center"
VALIGN="bottom"
></TD
><TD
WIDTH="10%"
ALIGN="right"
VALIGN="bottom"
><A
HREF="msearch-multilang.html"
ACCESSKEY="N"
>Next</A
></TD
></TR
></TABLE
><HR
ALIGN="LEFT"
WIDTH="100%"></DIV
><DIV
CLASS="chapter"
><H1
><A
NAME="international"
>Chapter 7. Language support</A
></H1
><DIV
CLASS="TOC"
><DL
><DT
><B
>Table of Contents</B
></DT
><DT
><A
HREF="msearch-international.html#charset"
>Character sets
<A
NAME="AEN2429"
></A
></A
></DT
><DT
><A
HREF="msearch-multilang.html"
>Making multi-language search pages
<A
NAME="AEN2757"
></A
></A
></DT
><DT
><A
HREF="msearch-cjk.html"
>Segmenters for Chinese and Japanese languages</A
></DT
></DL
></DIV
><DIV
CLASS="sect1"
><H1
CLASS="sect1"
><A
NAME="charset"
>Character sets
<A
NAME="AEN2429"
></A
></A
></H1
><DIV
CLASS="sect2"
><H2
CLASS="sect2"
><A
NAME="supcharsets"
>Supported character sets</A
></H2
><P
><SPAN
CLASS="application"
>mnoGoSearch</SPAN
> supports almost all known 8-bit
character sets as well as some multi-byte charsets, including Korean
euc-kr, Chinese big5 and gb2312, Japanese shift-jis, euc-jp and iso-2022-jp, and
utf8. Some multi-byte character sets are not
supported by default because their conversion tables are
rather large and would increase the size of the executable
files. See the <TT
CLASS="filename"
>configure</TT
> parameters to enable support
for these charsets.</P
><P
>mnoGoSearch also supports the following
Macintosh character sets: MacCE, MacCroatian, MacGreek, MacRoman,
MacTurkish, MacIceland, MacRomania, MacThai, MacArabic, MacHebrew,
MacCyrillic, MacGujarati.</P
></DIV
><DIV
CLASS="sect2"
><H2
CLASS="sect2"
><A
NAME="charset-onedb"
>Several languages in one database</A
></H2
><P
>It is often necessary to deal with several
languages simultaneously. The number of supported languages depends on
the character set that <SPAN
CLASS="application"
>mnoGoSearch</SPAN
>
will use to store data. The character set is specified with the
<B
CLASS="command"
>LocalCharset</B
> command.</P
></DIV
><DIV
CLASS="sect2"
><H2
CLASS="sect2"
><A
NAME="charset-utf8"
>UTF-8 mode</A
></H2
><P
>When <TT
CLASS="literal"
>UTF-8</TT
> is specified in the
<B
CLASS="command"
>LocalCharset</B
> command, you may work with any
language supported in <A
HREF="http://www.unicode.org/"
TARGET="_top"
>Unicode</A
>. That means you may use any of over 650 languages. However, UTF-8 may consume a large amount of disk space (up to twice as much for some languages), which slows down indexing and searching.</P
></DIV
><DIV
CLASS="sect2"
><H2
CLASS="sect2"
><A
NAME="charset-nonutf"
>Non-UTF-8 mode</A
></H2
><P
>Since every character set includes Latin
characters, any character set supports at least two languages:
English and one or more other languages. <TT
CLASS="literal"
>US-ASCII</TT
>
is an exception: it supports only Latin characters.</P
><DIV
CLASS="note"
><BLOCKQUOTE
CLASS="note"
><P
><B
>Note: </B
>When using
<SPAN
CLASS="application"
>mnoGoSearch</SPAN
> in standard (non-UTF-8) mode,
you may use as many languages as you like, provided they all belong to the same
language group.</P
></BLOCKQUOTE
></DIV
><DIV
CLASS="table"
><A
NAME="AEN2455"
></A
><P
><B
>Table 7-1. Language groups</B
></P
><TABLE
BORDER="1"
CLASS="CALSTABLE"
><TBODY
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Language group</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Languages</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Character sets</TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Group 1</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Western Europe:
Albanian, Catalan, Danish, Dutch, English, Faeroese, Finnish, French,
Galician, German, Icelandic, Italian, Norwegian, Portuguese, Spanish,
Swedish</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>ASCII 8, CP437,
CP850, CP860, CP1252, ISO 8859-1, ISO 8859-15, MacRoman,
MacIceland</TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Group 2</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Eastern Europe:
Croatian, Czech, Estonian, Hungarian, Latvian, Lithuanian, Polish,
Romanian, Slovak, Slovene</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>CP852, CP1250, ISO 8859-2, MacCentralEurope, MacRomania, MacCroatian</TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Group 4</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Baltic</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>CP1257, iso-8859-4, iso-8859-13</TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Group 5</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Cyrillic: Bulgarian, Byelorussian, Macedonian, Russian, Serbian, Ukrainian</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>CP855, CP866, CP1251, ISO 8859-5, Koi8-r, Koi8-u, MacCyrillic</TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Group 6</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Arabic</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>CP864, CP1256, ISO 8859-6, MacArabic</TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Group 7</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Greek</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>CP869, CP1253, ISO 8859-7, MacGreek</TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Group 8</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Hebrew</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>CP1255, ISO 8859-8, MacHebrew</TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Group 9</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Turkish</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>CP857, CP1254, ISO 8859-9, MacTurkish</TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Group 101</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Japanese</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Shift-JIS, EUC-JP, ISO-2022-JP</TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Group 102</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Simplified Chinese (PRC)</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>GB2312</TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Group 103</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Traditional Chinese (ROC)</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Big 5</TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Group 104</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Korean</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>EUC-KR</TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Group 105</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Thai</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>CP874, TIS 620, MacThai</TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Group 106</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Vietnamese</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>CP1258</TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Group 107</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Indian</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>MacGujarati</TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Group 108</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Georgian</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>geostd8</TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Unicode</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>Over 650 languages</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>UTF-8 (Unicode)</TD
></TR
></TBODY
></TABLE
></DIV
><P
>For example, if your search engine is configured to
use a <B
CLASS="command"
>LocalCharset</B
> from Group 5 (Cyrillic), you
may index servers containing documents in Bulgarian, Byelorussian,
Macedonian, Russian, Serbian and Ukrainian. Indexing a multi-language
document in UTF-8 is possible as well; however,
<TT
CLASS="literal"
>indexer</TT
> will extract and save only the Cyrillic content
from the page. To support over 650 languages, use the
<B
CLASS="command"
>LocalCharset utf-8</B
> command.</P
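><P
>As a minimal sketch of the two alternatives described above, a Cyrillic-only
index might use the first setting below, while a multi-language index would use
the second one instead. The choice of koi8-r is only an illustration; any other
charset from the same group in Table 7-1 could be substituted.</P
><PRE
CLASS="programlisting"
># Cyrillic-only index: store data in a Group 5 charset (illustrative choice)
LocalCharset koi8-r

# Multi-language index: use this line instead of the one above
# LocalCharset utf-8
</PRE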
></DIV
><DIV
CLASS="sect2"
><H2
CLASS="sect2"
><A
NAME="recoding"
>Recoding</A
></H2
><P
><TT
CLASS="literal"
>indexer</TT
> recodes all
documents to the character set specified by the
<B
CLASS="command"
>LocalCharset</B
>
<TT
CLASS="filename"
>indexer.conf</TT
>
command. Internally, recoding is implemented using Unicode. Please note
that some recodings may lose data. For example, recoding between
any Greek and Russian charsets loses all national characters. This
does not matter for single-language sites. If you want to build
a multilingual search engine, use the UTF-8 character set as the
<B
CLASS="command"
>LocalCharset</B
>.</P
></DIV
><DIV
CLASS="sect2"
><H2
CLASS="sect2"
><A
NAME="charset-searchdec"
>Recoding at search time</A
></H2
><P
>You may use the <B
CLASS="command"
>BrowserCharset</B
>
command to choose the charset used to display search
results. <B
CLASS="command"
>BrowserCharset</B
> may differ from
<B
CLASS="command"
>LocalCharset</B
>.</P
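><P
>For instance, an index stored in koi8-r could be presented to browsers in
windows-1251. The fragment below is only a sketch: both charset names are
illustrative, and which configuration file each command belongs in depends on
your installation.</P
><PRE
CLASS="programlisting"
># Charset used to store the index (illustrative value)
LocalCharset   koi8-r

# Charset used to display search results (illustrative value)
BrowserCharset windows-1251
</PRE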
></DIV
><DIV
CLASS="sect2"
><H2
CLASS="sect2"
><A
NAME="charsetsalias"
>Character set aliases</A
></H2
><P
>Each charset is recognized under a number of
aliases, because web servers can report the same charset in different
notations. For example, iso-8859-2, iso8859-2 and latin2 all name the same
charset. The search engine understands the following charset name
aliases:</P
><DIV
CLASS="table"
><A
NAME="AEN2551"
></A
><P
><B
>Table 7-2. Charset aliases</B
></P
><TABLE
BORDER="1"
CLASS="CALSTABLE"
><TBODY
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>iso-8859-1:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              CP819, CSISOLATIN, IBM819, ISO-8859-1, ISO-IR-100, ISO_8859-1, ISO_8859-1:1987, L1, LATIN1
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>iso-8859-10:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              CSISOLATIN6, ISO-8859-10, ISO-IR-157, ISO_8859-10, ISO_8859-10:1992, L6, LATIN6
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>iso-8859-11:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              ISO-8859-11, TIS-620, TIS620, TACTIS
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>iso-8859-13:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              ISO-8859-13, ISO-IR-179, ISO_8859-13, L7, LATIN7
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>iso-8859-14:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              ISO-8859-14, ISO-IR-199, ISO_8859-14, ISO_8859-14:1998, L8, LATIN8
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>iso-8859-15:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              ISO-8859-15, ISO-IR-203, ISO_8859-15, ISO_8859-15:1998
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>iso-8859-16:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              ISO-8859-16, ISO-IR-226, ISO_8859-16, ISO_8859-16:2000
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>iso-8859-2:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              CSISOLATIN2, ISO-8859-2, ISO-IR-101, ISO_8859-2, ISO_8859-2:1987, L2, LATIN2
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>iso-8859-3:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              CSISOLATIN3, ISO-8859-3, ISO-IR-109, ISO_8859-3, ISO_8859-3:1988, L3, LATIN3
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>iso-8859-4:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              CSISOLATIN4, ISO-8859-4, ISO-IR-110, ISO_8859-4, ISO_8859-4:1988, L4, LATIN4
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>iso-8859-5:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>CSISOLATINCYRILLIC, CYRILLIC, ISO-8859-5, ISO-IR-144, ISO_8859-5, ISO_8859-5:1988</TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>iso-8859-6:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              ARABIC, ASMO-708, CSISOLATINARABIC, ECMA-114, ISO-8859-6, ISO-IR-127, ISO_8859-6, ISO_8859-6:1987
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>iso-8859-7:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              CSISOLATINGREEK, ECMA-118, ELOT_928, GREEK, GREEK8, ISO-8859-7, ISO-IR-126, ISO_8859-7, ISO_8859-7:1987
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>iso-8859-8:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              CSISOLATINHEBREW, HEBREW, ISO-8859-8, ISO-IR-138, ISO_8859-8, ISO_8859-8:1988
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>iso-8859-9:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              CSISOLATIN5, ISO-8859-9, ISO-IR-148, ISO_8859-9, ISO_8859-9:1989, L5, LATIN5
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>armscii-8:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              ARMSCII-8
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>big5:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              BIG-5, BIG-FIVE, BIG5, BIGFIVE, CN-BIG5, CSBIG5
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>cp1250:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              CP1250, MS-EE, WINDOWS-1250
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>cp1251:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              CP1251, MS-CYRL, WINDOWS-1251
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>cp1252:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              CP1252, MS-ANSI, WINDOWS-1252
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>cp1253:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              CP1253, MS-GREEK, WINDOWS-1253
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>cp1254:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              CP1254, MS-TURK, WINDOWS-1254
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>cp1255:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              CP1255, MS-HEBR, WINDOWS-1255
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>cp1256:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              CP1256, MS-ARAB, WINDOWS-1256
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>cp1257:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              CP1257, WINBALTRIM, WINDOWS-1257
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>cp1258:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              CP1258, WINDOWS-1258
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>cp437:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              437, CP437, IBM437
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>cp850:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              850, CP850, CSPC850MULTILINGUAL, IBM850
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>cp852:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              852, CP852, IBM852
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>cp855:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              855, CP855, IBM855
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>cp857:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              857, CP857, IBM857
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>cp860:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              860, CP860, IBM860
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>cp861:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              861, CP861, IBM861
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>cp862:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              862, CP862, IBM862
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>cp863:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              863, CP863, IBM863
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>cp864:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              864, CP864, IBM864
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>cp865:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              865, CP865, IBM865
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>cp866:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              866, CP866, CSIBM866, IBM866
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>cp869:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
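>&#13;              869, CP869, IBM869
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>cp874:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              CP874, WINDOWS-874
            </TD
></TR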
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>euc-kr:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              CSEUCKR, EUC-KR, EUCKR
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>gb2312:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              CHINESE, CSGB2312, CSISO58GB231280, GB2312, GB_2312-80, ISO-IR-58
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>koi8-r:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              CSKOI8R, KOI8-R
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>koi8-u:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              KOI8-U
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>shift-jis:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              CSSHIFTJIS, MS_KANJI, S-JIS, SHIFT-JIS, SHIFT_JIS, SJIS
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>cp367:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              ANSI_X3.4-1968, ASCII, CP367, CSASCII, IBM367, ISO-IR-6, ISO646-US, ISO_646.IRV:1991, US, US-ASCII
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>utf8:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              UTF-8, UTF8
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>viscii:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              CSVISCII, VISCII, VISCII1.1-1
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>maccyrillic:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;              MACCYRILLIC, X-MAC-CYRILLIC
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>macroman:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;             MACROMAN, MACINTOSH, CSMACINTOSH,  MAC
            </TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>MacCentralEurope:</TD
><TD
ALIGN="LEFT"
VALIGN="MIDDLE"
>&#13;             MACCENTRALEUROPE, MACCE 
            </TD
></TR
></TBODY
></TABLE
></DIV
></DIV
><DIV
CLASS="sect2"
><H2
CLASS="sect2"
><A
NAME="charsetdetect"
>Document charset detection</A
></H2
><P
>indexer detects the document character set in this order:</P
><P
></P
><OL
TYPE="1"
><LI
><P
>The HTTP header "Content-type: text/html; charset=xxx"</P
></LI
><LI
><P
>&#60;META NAME="Content-Type" CONTENT="text/html; charset=xxx"&#62;</P
></LI
><LI
><P
>The default from the "Charset" field in Common Parameters</P
></LI
></OL
></DIV
><DIV
CLASS="sect2"
><H2
CLASS="sect2"
><A
NAME="charset-guesser"
>Automatic charset guesser</A
></H2
><P
>Since version 3.2.0, <SPAN
CLASS="application"
>mnoGoSearch</SPAN
> has had an automatic charset
and language guesser. It currently recognizes more than 100
charsets and languages. Charset and language detection is implemented
using the "N-Gram-Based Text Categorization" technique. There are a number
of so-called "language map" files, one for each language-charset
pair. They are installed under the
<TT
CLASS="filename"
>/usr/local/mnogosearch/etc/langmap/</TT
> directory by
default. Look there for the list of currently provided
charset-language pairs. The guesser works well for texts longer than 500
characters; shorter texts may not be guessed reliably.</P
><DIV
CLASS="sect3"
><H3
CLASS="sect3"
><A
NAME="mguesser"
>Build your own language maps</A
></H3
><P
>To build your own language map, use the <A
NAME="AEN2723"
></A
><TT
CLASS="literal"
>mguesser</TT
>
utility. You also need to collect a file of language samples in the desired charset. To create a new language map,
use the following command:
<PRE
CLASS="programlisting"
>&#13;        mguesser -p -c charset -l language &#60; FILENAME &#62; language.charset.lm
</PRE
>
</P
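><P
>For example, assuming a hypothetical sample file <TT
CLASS="filename"
>russian-sample.txt</TT
> that contains Russian text in koi8-r, the command might look like this.
The file names and the language code <TT
CLASS="literal"
>ru</TT
> are illustrative; the output name follows the language.charset.lm pattern shown above.</P
><PRE
CLASS="programlisting"
>        mguesser -p -c koi8-r -l ru &#60; russian-sample.txt &#62; ru.koi8-r.lm
</PRE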
><P
>You can also use the <TT
CLASS="literal"
>mguesser</TT
> utility to guess a document's language and charset using the existing
language maps. To do this, use the following command:
<PRE
CLASS="programlisting"
>&#13;        mguesser [-n maxhits] &#60; FILENAME
</PRE
>
</P
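><P
>As an illustration, the following invocation passes 3 as the <TT
CLASS="literal"
>maxhits</TT
> value for a hypothetical file <TT
CLASS="filename"
>page.txt</TT
>. Both the value and the file name are assumptions chosen for the example.</P
><PRE
CLASS="programlisting"
>        mguesser -n 3 &#60; page.txt
</PRE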
><P
>Some languages can be written in several different charsets. To convert text from one charset supported by
<SPAN
CLASS="application"
>mnoGoSearch</SPAN
> to another, use the <A
NAME="AEN2732"
></A
><TT
CLASS="literal"
>mconv</TT
>
utility.
<PRE
CLASS="programlisting"
>&#13;        mconv [OPTIONS] -f charset_from -t charset_to [configfile] &#60; infile &#62; outfile
</PRE
>
</P
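><P
>A sketch of a typical conversion, assuming a hypothetical koi8-r input file
that should be rewritten in cp1251. The file names are illustrative, and no
optional arguments are used.</P
><PRE
CLASS="programlisting"
>        mconv -f koi8-r -t cp1251 &#60; sample.koi8-r.txt &#62; sample.cp1251.txt
</PRE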
><P
>By default, both <TT
CLASS="literal"
>mguesser</TT
> and <TT
CLASS="literal"
>mconv</TT
> utilities are installed into the
<TT
CLASS="filename"
>/usr/local/mnogosearch/sbin/</TT
> directory.
</P
></DIV
></DIV
><DIV
CLASS="sect2"
><H2
CLASS="sect2"
><A
NAME="defcharset"
>Default charset
<A
NAME="AEN2742"
></A
></A
></H2
><P
>Use the Charset <TT
CLASS="filename"
>indexer.conf</TT
> command to choose the default charset of the indexed servers.</P
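><P
>A minimal sketch, assuming the command takes a charset name as its single
argument in the same way as LocalCharset. The value shown is illustrative.</P
><PRE
CLASS="programlisting"
># Default charset for documents whose charset is not otherwise declared
# (illustrative value; argument form is an assumption)
Charset windows-1251
</PRE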
></DIV
><DIV
CLASS="sect2"
><H2
CLASS="sect2"
><A
NAME="deflang"
>Default Language
<A
NAME="AEN2749"
></A
></A
></H2
><P
>You can set the default language for servers by using the <TT
CLASS="varname"
>DefaultLang</TT
>
<TT
CLASS="filename"
>indexer.conf</TT
> variable. This is useful when restricting search by URL language.</P
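><P
>A minimal sketch, assuming the variable takes a language code such as <TT
CLASS="literal"
>en</TT
> as its value. The value is illustrative.</P
><PRE
CLASS="programlisting"
># Default language for indexed servers (illustrative value)
DefaultLang en
</PRE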
></DIV
></DIV
></DIV
><DIV
CLASS="NAVFOOTER"
><HR
ALIGN="LEFT"
WIDTH="100%"><TABLE
SUMMARY="Footer navigation table"
WIDTH="100%"
BORDER="0"
CELLPADDING="0"
CELLSPACING="0"
><TR
><TD
WIDTH="33%"
ALIGN="left"
VALIGN="top"
><A
HREF="msearch-tags.html"
ACCESSKEY="P"
>Prev</A
></TD
><TD
WIDTH="34%"
ALIGN="center"
VALIGN="top"
><A
HREF="index.html"
ACCESSKEY="H"
>Home</A
></TD
><TD
WIDTH="33%"
ALIGN="right"
VALIGN="top"
><A
HREF="msearch-multilang.html"
ACCESSKEY="N"
>Next</A
></TD
></TR
><TR
><TD
WIDTH="33%"
ALIGN="left"
VALIGN="top"
>Tags
<A
NAME="AEN2401"
></A
></TD
><TD
WIDTH="34%"
ALIGN="center"
VALIGN="top"
>&nbsp;</TD
><TD
WIDTH="33%"
ALIGN="right"
VALIGN="top"
>Making multi-language search pages
<A
NAME="AEN2757"
></A
></TD
></TR
></TABLE
></DIV
></BODY
></HTML
>