# This is a historical document.
# Classes kaffe.io.ByteToCharEUC_JP and kaffe.io.CharToByteEUC_JP now use
# classes kaffe.io.ByteToCharIconv and kaffe.io.CharToByteIconv,
# respectively, and the tables ByteToCharEUC_JP.tbl and CharToByteEUC_JP.tbl
# are no longer used.  (Dec 12, 2003, Ito Kazumitsu <kaz@maczuka.gcd.org>)

Extended UNIX Code (EUC) Encoding Scheme

The EUC encoding scheme defines a set of encoding rules that can
support one to four character sets. The encoding rules are based on
the ISO 2022 definition for the encoding of 7-bit and 8-bit data. The
EUC encoding scheme uses control characters to identify some of the
character sets. The following table shows the basic structure of
all EUC encodings.

    CS0   0xxxxxxx

    CS1   1xxxxxxx
          1xxxxxxx 1xxxxxxx
          1xxxxxxx 1xxxxxxx 1xxxxxxx
          ...

    CS2   10001110 1xxxxxxx
          10001110 1xxxxxxx 1xxxxxxx
          10001110 1xxxxxxx 1xxxxxxx 1xxxxxxx
          ...

    CS3   10001111 1xxxxxxx
          10001111 1xxxxxxx 1xxxxxxx
          10001111 1xxxxxxx 1xxxxxxx 1xxxxxxx
          ...

The term EUC denotes these general encoding rules. A code set based on
EUC conforms to the EUC encoding rules and also identifies the
specific character sets assigned to each of the four sets. For
example, IBM-eucJP for Japanese refers to the encoding of the Japanese
Industrial Standard characters according to the EUC encoding rules.

The first set (CS0) always contains an ISO 646 character set. All of
the other sets must have the most significant bit (MSB) of every byte
set to 1 and can use any number of bytes to encode the characters. In
addition, all characters within a set must have:

   o The same number of bytes per character
   o The same column display width (number of columns on a fixed-width
     terminal)

All characters in the third set (CS2) are always preceded by the
control character SS2 (single-shift 2, 0x8e). Code sets that conform
to EUC do not use the SS2 control character other than to identify the
third set.

All characters in the fourth set (CS3) are always preceded by the
control character SS3 (single-shift 3, 0x8f). Code sets that conform
to EUC do not use the SS3 control character other than to identify the
fourth set.
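
The lead-byte rules above can be sketched in Java as follows. This is
only an illustration of the general EUC structure, not code taken from
Kaffe; the class name EucCodeSet and the method codeSetOf are
hypothetical. It follows the structure table literally, so any byte
with the MSB set that is not SS2 or SS3 is treated as the start of a
CS1 character.

```java
final class EucCodeSet {
    static final int SS2 = 0x8E; // single-shift 2, introduces CS2
    static final int SS3 = 0x8F; // single-shift 3, introduces CS3

    /**
     * Returns the code set number (0 to 3) selected by the first byte
     * of an EUC character. Hypothetical helper, for illustration only.
     */
    static int codeSetOf(byte lead) {
        int b = lead & 0xFF;      // treat the byte as unsigned
        if (b < 0x80) {
            return 0;             // MSB clear: CS0 (ISO 646)
        }
        if (b == SS2) {
            return 2;             // SS2 prefix: CS2
        }
        if (b == SS3) {
            return 3;             // SS3 prefix: CS3
        }
        return 1;                 // MSB set, no single shift: CS1
    }
}
```

For example, EucCodeSet.codeSetOf((byte) 0x8E) selects CS2, while a
plain ASCII letter such as 0x41 selects CS0.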


The following table shows the EUC packed format as used for
Japanese:

  EUC Code Sets                                 Encoding Range
  ^^^^^^^^^^^^^                                 ^^^^^^^^^^^^^^
  Code set 0 (ASCII or JIS X 0201-1976 Roman):  0x21-0x7E
  Code set 1 (JIS X 0208):                      0xA1A1-0xFEFE
  Code set 2 (half-width katakana):             0x8EA1-0x8EDF
  Code set 3 (JIS X 0212-1990):                 0x8FA1A1-0x8FFEFE
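
As a rough sketch, the ranges in this table are enough to split an
EUC-JP byte stream into characters. The class EucJpScanner and its
methods are hypothetical names, not part of Kaffe. The sketch reads a
range such as 0xA1A1-0xFEFE as "each byte lies in 0xA1-0xFE", and it
accepts every byte below 0x80 (including controls and space) as a
one-byte code set 0 character, which is wider than the printable range
listed in the table.

```java
final class EucJpScanner {
    /** True if buf[i] exists and its unsigned value is in [lo, hi]. */
    static boolean inRange(byte[] buf, int i, int lo, int hi) {
        if (i >= buf.length) {
            return false;
        }
        int b = buf[i] & 0xFF;
        return b >= lo && b <= hi;
    }

    /**
     * Length in bytes of the EUC-JP character starting at buf[pos],
     * or -1 if the bytes there do not form a character from the table.
     */
    static int charLength(byte[] buf, int pos) {
        int b0 = buf[pos] & 0xFF;
        if (b0 < 0x80) {
            return 1;                                   // code set 0
        }
        if (b0 == 0x8E) {                               // code set 2
            return inRange(buf, pos + 1, 0xA1, 0xDF) ? 2 : -1;
        }
        if (b0 == 0x8F) {                               // code set 3
            return inRange(buf, pos + 1, 0xA1, 0xFE)
                && inRange(buf, pos + 2, 0xA1, 0xFE) ? 3 : -1;
        }
        if (b0 >= 0xA1 && b0 <= 0xFE) {                 // code set 1
            return inRange(buf, pos + 1, 0xA1, 0xFE) ? 2 : -1;
        }
        return -1;                                      // not a valid lead byte
    }

    /** Number of characters in buf, or -1 if buf is malformed. */
    static int countChars(byte[] buf) {
        int count = 0;
        int pos = 0;
        while (pos < buf.length) {
            int len = charLength(buf, pos);
            if (len < 0) {
                return -1;
            }
            pos += len;
            count++;
        }
        return count;
    }
}
```

Returning -1 for malformed input keeps the sketch short; a real
converter would more likely substitute a replacement character or
report the offending offset.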


Classes kaffe.io.ByteToCharEUC_JP and kaffe.io.CharToByteEUC_JP use
external tables built by class EncodeEUC_JP in the developers directory.

(1) Get the files JIS*.TXT from
    http://www.unicode.org/Public/MAPPINGS/EASTASIA/JIS/

(2) Run 'kaffe EncodeEUC_JP'.

(3) Copy ByteToCharEUC_JP.tbl and CharToByteEUC_JP.tbl into the directory
    kaffe/io.
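
The JIS mapping files list characters by their 7-bit JIS codes (for
example 0x2121), whereas EUC-JP sets the MSB of each byte. The sketch
below shows that relationship only; it is not the actual EncodeEUC_JP
source, whose table format is not described here, and the class and
method names are hypothetical.

```java
final class JisToEuc {
    /**
     * Converts a two-byte JIS X 0208 code (0x2121 to 0x7E7E) to its
     * EUC-JP code set 1 value by setting the MSB of both bytes.
     */
    static int toCodeSet1(int jis0208) {
        return jis0208 | 0x8080;
    }

    /**
     * Converts a two-byte JIS X 0212 code to its EUC-JP code set 3
     * value: the SS3 prefix 0x8F followed by both bytes with MSB set.
     */
    static int toCodeSet3(int jis0212) {
        return 0x8F0000 | jis0212 | 0x8080;
    }
}
```

With these helpers, the first and last JIS X 0208 codes, 0x2121 and
0x7E7E, map onto the code set 1 bounds 0xA1A1 and 0xFEFE.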

By default, classes ByteToCharEUC_JP and CharToByteEUC_JP in package
kaffe.io use full US-ASCII.  If you want to use the exceptions defined
by JIS X 0201-1976 (that is, 0x5C is U+00A5 and 0x7E is U+203E), you
must change US_ASCII to false in both classes.
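
The effect of that switch on code set 0 can be sketched as below. Only
the flag name US_ASCII comes from the Kaffe classes; the class
Cs0Mapping, the method decodeCs0 and the usAscii parameter are
hypothetical and merely illustrate the two mappings.

```java
final class Cs0Mapping {
    static final char YEN_SIGN = (char) 0x00A5; // U+00A5
    static final char OVERLINE = (char) 0x203E; // U+203E

    /**
     * Maps a code set 0 byte (0x00 to 0x7F) to a Unicode character.
     * With usAscii set, every byte maps to the same code point; without
     * it, the two JIS X 0201-1976 Roman exceptions are applied.
     */
    static char decodeCs0(int b, boolean usAscii) {
        if (!usAscii) {
            if (b == 0x5C) {
                return YEN_SIGN;   // yen sign instead of backslash
            }
            if (b == 0x7E) {
                return OVERLINE;   // overline instead of tilde
            }
        }
        return (char) b;
    }
}
```

The reverse direction (characters to bytes) would need the matching
change, which is why the flag has to be flipped in both classes.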

Ito Kazumitsu <kaz@maczuka.gcd.org>
Edouard G. Parmelan <egp@free.fr>
Nov 20 2000