Kaffe Unicode Database.

Kaffe uses a compressed form of the Unicode 2.1 database for
the java.lang.Character class.

The Unicode 2.1 database has several properties that make it easy to
compress.

The class java.lang.Character uses a subset of the Unicode properties:
. The category [getType()]
. The decimal digit value [digit()]
. The numeric value [getNumericValue()]
. The uppercase, lowercase and titlecase equivalents [toUpperCase(),
  toLowerCase(), toTitleCase()]

Unicode compression properties:
. few characters have a titlecase equivalent that differs from the
  uppercase one.
  [uppercase and titlecase of true titlecase characters, category "Lt"]
. few characters have both a numeric value and a case equivalent.
  [Roman numeral letters, category "Nl"]
. all decimal digits (category "Nd") have a decimal digit value equal to
  their numeric value.
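
The properties above can be spot-checked with a short script. This is only
a sketch: it uses Python's unicodedata module, which ships a much newer
Unicode version than 2.1, and the function names are made up for this
illustration. Python's str.upper() and str.title() also apply full case
mappings, so the second check can flag more characters than the
single-character mappings used by java.lang.Character.

```python
import unicodedata


def nd_digits_consistent(limit=0x10000):
    """Return True if every "Nd" character below `limit` has a decimal
    digit value equal to its numeric value."""
    for cp in range(limit):
        ch = chr(cp)
        if unicodedata.category(ch) == "Nd":
            if unicodedata.decimal(ch) != unicodedata.numeric(ch):
                return False
    return True


def has_distinct_titlecase(ch):
    """Return True if the titlecase form of `ch` differs from its
    uppercase form (as computed by Python, not by Kaffe)."""
    return ch.title() != ch.upper()


if __name__ == "__main__":
    print(nd_digits_consistent())
    # U+01C6 is the lowercase form of a digraph that has a true
    # titlecase character (category "Lt").
    print(has_distinct_titlecase("\u01C6"))
```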

Based on these properties, we define two character properties formats,
small and extended.

Small character properties format:
. the category
. which field (none, numeric value, uppercase, lowercase)
. the generic value

xFFCCCCC GGGGGGGG GGGGGGGG

F: field of the generic value
    00 none
    01 uppercase
    10 lowercase
    11 numerical
C: Java category (with 31 as ``no-Break space separator ("Zs" <noBreak>)'')
G: generic value
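
As an illustration, the small format can be packed and unpacked as below.
This is a minimal sketch, not Kaffe's code: it assumes the three bytes are
stored in the order drawn above, with the generic value as a big-endian
16-bit integer, and the names pack_small and unpack_small are hypothetical.

```python
import struct
from typing import NamedTuple

# Field selector values, as listed in the format description.
FIELD_NAMES = {0b00: "none", 0b01: "uppercase",
               0b10: "lowercase", 0b11: "numerical"}


class SmallEntry(NamedTuple):
    field: str     # which property the generic value holds
    category: int  # Java category number (5 bits)
    value: int     # generic value (16 bits)


def pack_small(field, category, value):
    """Build the 3-byte small entry: xFFCCCCC followed by 16 bits of G.
    Byte order is an assumption of this sketch."""
    first = ((field & 0b11) << 5) | (category & 0b11111)
    return struct.pack(">BH", first, value & 0xFFFF)


def unpack_small(data):
    """Decode a 3-byte small entry into its three parts."""
    first, value = struct.unpack(">BH", data)
    return SmallEntry(FIELD_NAMES[(first >> 5) & 0b11],
                      first & 0b11111, value)


if __name__ == "__main__":
    # 'A': Java category 1 (UPPERCASE_LETTER), lowercase equivalent U+0061.
    raw = pack_small(0b10, 1, 0x0061)
    print(raw.hex(), unpack_small(raw))
```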

Extended character properties format:
. the category
. the numeric value
. the uppercase equivalent
. the lowercase equivalent
. the titlecase equivalent

xxxCCCCC NNNNNNNN NNNNNNNN UUUUUUUU UUUUUUUU
LLLLLLLL LLLLLLLL TTTTTTTT TTTTTTTT

C: Java category (with 31 as ``no-Break space separator ("Zs" <noBreak>)'')
N: numerical
U: uppercase
L: lowercase
T: titlecase
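
The extended format can be handled the same way. The sketch below assumes
the nine bytes are stored in the order drawn above with big-endian 16-bit
fields. The function names are hypothetical, and the sample values are only
illustrative: how Kaffe stores a missing case equivalent is not stated
here.

```python
import struct

EXTENDED_SIZE = 9  # 1 category byte + four 16-bit fields


def pack_extended(category, numeric, upper, lower, title):
    """Build the 9-byte extended entry: xxxCCCCC, then N, U, L, T.
    Byte order is an assumption of this sketch."""
    return struct.pack(">B4H", category & 0b11111,
                       numeric, upper, lower, title)


def unpack_extended(data):
    """Decode a 9-byte extended entry into a dictionary."""
    first, numeric, upper, lower, title = struct.unpack(">B4H", data)
    return {
        "category": first & 0b11111,
        "numeric": numeric,
        "upper": upper,
        "lower": lower,
        "title": title,
    }


if __name__ == "__main__":
    # Illustrative values for U+2160 (Roman numeral one): Java category
    # 10 (LETTER_NUMBER), numeric value 1, lowercase U+2170.
    raw = pack_extended(10, 1, 0x2160, 0x2170, 0x2160)
    print(len(raw), unpack_extended(raw))
```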

Consecutive entries in the Unicode 2.1 database can be grouped according to
the following rules:
. not compressed:
  consecutive entries differ in category or field, or their generic values
  are neither equal nor increasing by one.
  [U+0028 - U+002D]
. compressed same value:
  consecutive entries have the same category, the same field and the
  same generic value.
  [U+0000 - U+001F, control, no field]
. compressed one increment:
  consecutive entries have the same category, the same field and the
  same increment (one) for the generic value.
  [U+0041 - U+005A (A-Z) uppercase letter, one increment for lowercase]
. not compressed, extended entry:
  consecutive entries that have more than one field.
  [U+2160 - U+216F (Roman numerals)]
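
The four rules can be sketched as a function that classifies a run of
consecutive entries. This is not how unicode.pl is written; the input
representation (a category number plus a dictionary of the fields that are
present) and the name classify_run are assumptions made for this example,
and a missing field is treated as generic value 0.

```python
def classify_run(entries):
    """Classify a run of consecutive code points.

    `entries` is a list of (category, fields) pairs, where `fields` maps
    a field name ("uppercase", "lowercase", "numeric") to its value.
    """
    # More than one field on any entry forces the extended format.
    if any(len(fields) > 1 for _, fields in entries):
        return "extended"

    # Reduce every entry to (category, field name, generic value).
    flat = []
    for category, fields in entries:
        if fields:
            (name, value), = fields.items()
        else:
            name, value = "none", 0
        flat.append((category, name, value))

    same_kind = all(e[:2] == flat[0][:2] for e in flat)
    if same_kind and all(e[2] == flat[0][2] for e in flat):
        return "same value"
    if same_kind and all(e[2] == flat[0][2] + i
                         for i, e in enumerate(flat)):
        return "one increment"
    return "not compressed"


if __name__ == "__main__":
    controls = [(15, {})] * 32                                  # U+0000 - U+001F
    letters = [(1, {"lowercase": 0x61 + i}) for i in range(26)]  # A-Z
    print(classify_run(controls), "/", classify_run(letters))
```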


To handle all these ranges, we create an index whose entries have the format:
SSSSSSSS SSSSSSSS EEEEEEEE EEEEEEEE xxMMOOOO OOOOOOOO OOOOOOOO

S: start unicode value for this range
E: end unicode value for this range
M: compression method
    00 not compressed
    01 compressed zero or one field, same value
    10 compressed zero or one field, one increment
    11 not compressed, extended entry
O: offset in the properties table.
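
A lookup through this index could be sketched as follows. The sketch
assumes the seven bytes are stored in the order drawn above (big-endian),
that the end value is inclusive, and that ranges are sorted by start value.
All function names are hypothetical. How the offset is scaled for the two
uncompressed methods is not specified here, so only the compressed cases
are resolved.

```python
import struct
from bisect import bisect_right

NOT_COMPRESSED, SAME_VALUE, ONE_INCREMENT, EXTENDED = 0b00, 0b01, 0b10, 0b11


def pack_index(start, end, method, offset):
    """Build a 7-byte index entry: S (16), E (16), xxMM + 20-bit offset."""
    high = ((method & 0b11) << 4) | ((offset >> 16) & 0x0F)
    return struct.pack(">HHBH", start, end, high, offset & 0xFFFF)


def unpack_index(data):
    """Decode a 7-byte index entry into (start, end, method, offset)."""
    start, end, high, low = struct.unpack(">HHBH", data)
    return start, end, (high >> 4) & 0b11, ((high & 0x0F) << 16) | low


def find_range(ranges, ch):
    """Binary-search sorted, non-overlapping ranges for code point `ch`."""
    starts = [r[0] for r in ranges]
    i = bisect_right(starts, ch) - 1
    if i >= 0 and ranges[i][0] <= ch <= ranges[i][1]:
        return ranges[i]
    return None


def compressed_value(rng, base, ch):
    """Generic value of `ch` in a compressed range, given the generic
    value `base` stored for the first character of the range."""
    start, _end, method, _offset = rng
    if method == SAME_VALUE:
        return base
    if method == ONE_INCREMENT:
        return base + (ch - start)
    raise ValueError("range is not compressed")


if __name__ == "__main__":
    ranges = [unpack_index(pack_index(0x0041, 0x005A, ONE_INCREMENT, 0))]
    rng = find_range(ranges, ord("C"))
    print(hex(compressed_value(rng, 0x0061, ord("C"))))
```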


The Perl script developers/unicode.pl creates two files in the current
directory from the Unicode Character Database file UnicodeData.txt:
. unicode.idx the ranges index
. unicode.tbl the properties tables

Usage:
$ perl unicode.pl UnicodeData.txt

The latest version of UnicodeData.txt can be downloaded from
    ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
Previous versions should match this URL pattern:
    ftp://ftp.unicode.org/Public/*-Update*/UnicodeData-*.txt
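
UnicodeData.txt holds one character per line with semicolon-separated
fields. For readers unfamiliar with the file, the sketch below extracts the
fields that java.lang.Character needs. It is not a port of unicode.pl, and
the function name and the returned dictionary layout are assumptions made
for this example.

```python
def parse_unicode_data_line(line):
    """Extract category, digit, numeric and case fields from one line
    of UnicodeData.txt."""
    fields = line.rstrip("\n").split(";")

    def hex_or_none(text):
        # Case mappings are hexadecimal code points, or empty if absent.
        return int(text, 16) if text else None

    return {
        "code": int(fields[0], 16),
        "category": fields[2],                            # e.g. "Lu", "Nd"
        "decimal": int(fields[6]) if fields[6] else None,
        "numeric": fields[8] or None,                     # may be a fraction such as "1/2"
        "upper": hex_or_none(fields[12]),
        "lower": hex_or_none(fields[13]),
        "title": hex_or_none(fields[14]),
    }


if __name__ == "__main__":
    sample = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
    print(parse_unicode_data_line(sample))
```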

The current unicode.idx and unicode.tbl (in libraries/javalib/kaffe/lang) are
built from UnicodeData-2.1.8.txt.

As of March 29, 1999, Mauve reports 156 failed tests on java.lang.Character.
These failures are explained in developers/README.unicode.


Edouard G. Parmelan <egp@free.fr>
March 27, 1999
Last edit May 2, 2000.