# This is a historical document.
# Classes kaffe.io.ByteToCharEUC_JP and kaffe.io.CharToByteEUC_JP use
# Classes kaffe.io.ByteToCharIconv and kaffe.io.CharToByteIconv
# respectively and the tables ByteToCharEUC_JP.tbl and CharToByteEUC_JP.tbl
# are no longer used. (Dec 12, 2003, Ito Kazumitsu <kaz@maczuka.gcd.org>)

Extended UNIX Code (EUC) Encoding Scheme

The EUC encoding scheme defines a set of encoding rules that can support
one to four character sets. The encoding rules are based on the ISO2022
definition for the encoding of 7-bit and 8-bit data. The EUC encoding
scheme uses control characters to identify some of the character sets.
The EUC encoding table shows the basic structure of all EUC encoding.

    CS0    0xxxxxxx

    CS1    1xxxxxxx
           1xxxxxxx 1xxxxxxx
           1xxxxxxx 1xxxxxxx 1xxxxxxx
           ...

    CS2    10001110 1xxxxxxx
           10001110 1xxxxxxx 1xxxxxxx
           10001110 1xxxxxxx 1xxxxxxx 1xxxxxxx
           ...

    CS3    10001111 1xxxxxxx
           10001111 1xxxxxxx 1xxxxxxx
           10001111 1xxxxxxx 1xxxxxxx 1xxxxxxx
           ...

The term EUC denotes these general encoding rules. A code set based on
EUC conforms to the EUC encoding rules but also identifies the specific
character sets associated with the specific instances. For example,
IBM-eucJP for Japanese refers to the encoding of the Japanese Industrial
Standard characters according to the EUC encoding rules.

The first set (CS0) always contains an ISO646 character set. All of the
other sets must have the most significant bit (MSB) set to 1 and can use
any number of bytes to encode the characters. In addition, all characters
within a set must have:

    o  Same number of bytes to encode all characters
    o  Same column display width (number of columns on a fixed-width
       terminal)

All characters in the third set (CS2) are always preceded with the
control character SS2 (single-shift 2, 0x8e). Code sets that conform to
EUC do not use the SS2 control character other than to identify the third
set.

All characters in the fourth set (CS3) are always preceded with the
control character SS3 (single-shift 3, 0x8f). Code sets that conform to
EUC do not use the SS3 control character other than to identify the
fourth set.

The following table illustrates the Japanese representation of EUC packed
format:

    EUC Code Sets                                   Encoding Range
    ^^^^^^^^^^^^^                                   ^^^^^^^^^^^^^^
    Code set 0 (ASCII or JIS X 0201-1976 Roman):    0x21-0x7E
    Code set 1 (JIS X 0208):                        0xA1A1-0xFEFE
    Code set 2 (half-width katakana):               0x8EA1-0x8EDF
    Code set 3 (JIS X 0212-1990):                   0x8FA1A1-0x8FFEFE

Classes kaffe.io.ByteToCharEUC_JP and kaffe.io.CharToByteEUC_JP use
external tables built by class EncodeEUC_JP in the developers directory.

    (1) Get the files JIS*.TXT from
        http://www.unicode.org/Public/MAPPINGS/EASTASIA/JIS/
    (2) Run 'kaffe EncodeEUC_JP'.
    (3) Copy ByteToCharEUC_JP.tbl and CharToByteEUC_JP.tbl into the
        directory kaffe/io.

By default, classes ByteToCharEUC_JP and CharToByteEUC_JP in package
kaffe.io use full US-ASCII. If you want to use the exceptions defined by
JIS X 0201-1976 (that is, 0x5C is U+00A5 and 0x7E is U+203E), you must
change US_ASCII to false in both classes.

Ito Kazumitsu <kaz@maczuka.gcd.org>
Edouard G. Parmelan <egp@free.fr>
Nov 20 2000
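
Example: the EUC-JP ranges in the table above can be sketched in Java as a
small classifier that walks a byte sequence and reports the code set of
each character. This is only an illustration under stated assumptions,
not the Kaffe converter code: the class EucJpCodeSets and its methods
codeSetOf, lengthOf and classify are hypothetical names, and any byte with
the MSB clear is treated as code set 0 (the table lists only the printable
range 0x21-0x7E).

```java
// Hypothetical illustration; not part of kaffe.io.
final class EucJpCodeSets {
    private static final int SS2 = 0x8E; // single-shift 2, introduces code set 2
    private static final int SS3 = 0x8F; // single-shift 3, introduces code set 3

    private EucJpCodeSets() {}

    /** Returns the code set (0-3) selected by a lead byte, or -1 if it is not a valid lead byte. */
    static int codeSetOf(int lead) {
        lead &= 0xFF;                                 // treat Java's signed byte as unsigned
        if (lead < 0x80) return 0;                    // MSB clear: code set 0 (assumption: includes controls)
        if (lead == SS2) return 2;
        if (lead == SS3) return 3;
        if (lead >= 0xA1 && lead <= 0xFE) return 1;   // JIS X 0208 lead byte
        return -1;
    }

    /** Number of bytes per character in EUC-JP for the given code set, or -1 if unknown. */
    static int lengthOf(int codeSet) {
        switch (codeSet) {
            case 0: return 1;   // 0x21-0x7E
            case 1: return 2;   // 0xA1A1-0xFEFE
            case 2: return 2;   // 0x8EA1-0x8EDF
            case 3: return 3;   // 0x8FA1A1-0x8FFEFE
            default: return -1;
        }
    }

    /** Returns the code set of each character in data, in order. */
    static int[] classify(byte[] data) {
        int[] sets = new int[data.length];
        int count = 0;
        int i = 0;
        while (i < data.length) {
            int cs = codeSetOf(data[i]);
            int len = lengthOf(cs);
            if (cs < 0 || i + len > data.length) {
                throw new IllegalArgumentException("malformed EUC-JP at offset " + i);
            }
            // Trailing bytes: 0xA1-0xDF for code set 2, 0xA1-0xFE otherwise.
            int max = (cs == 2) ? 0xDF : 0xFE;
            for (int k = 1; k < len; k++) {
                int b = data[i + k] & 0xFF;
                if (b < 0xA1 || b > max) {
                    throw new IllegalArgumentException("bad trailing byte at offset " + (i + k));
                }
            }
            sets[count++] = cs;
            i += len;
        }
        return java.util.Arrays.copyOf(sets, count);
    }
}
```

For instance, the bytes 0x41, 0xA4 0xA2, 0x8E 0xB1, 0x8F 0xB0 0xA1 are
classified as code sets 0, 1, 2 and 3 respectively. Mapping each character
to Unicode still requires the JIS tables mentioned above; the sketch only
shows how the code set is identified.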