Sophie: pyrite-0.9.3-4mdk i586

Pure Python Doc Compression Code
--------------------------------

The code in doc_compress.py is not intended to be used unless
absolutely necessary; the C code is much faster and better.  However,
I wanted to write pure-Python Doc compression code so that I could
experiment with it more easily, so I figured I might as well put it in
Pyrite.  The App.Doc module tries to import the C code first; if that
cannot be loaded for any reason, it falls back to the pure-Python
version.
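
The fallback described above can be sketched roughly as follows.  This
is only an illustration, not the actual App.Doc code: import_first is a
hypothetical helper, and "_Doc" is a made-up stand-in for whatever the
C extension module is really called.

```python
import importlib


def import_first(*module_names):
    """Return the first module in module_names that imports cleanly."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # This candidate is unavailable; try the next one.
            continue
    raise ImportError("none of %r could be imported" % (module_names,))


# Hypothetical usage: prefer the C extension, else the pure-Python module.
# compressor = import_first("_Doc", "doc_compress")
```

Listing the fast implementation first means the slow one is only ever
loaded when the import of the fast one fails.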

The main difference between the Python and C compressors is speed: the
Python compressor is at least an order of magnitude slower than the
compiled one.  However, on a reasonably fast CPU it may still be
tolerable.  For example, my system is a Cyrix M2 at 207 MHz (83 MHz
bus) with 1 MB of cache, and it compresses 2-3 blocks per second using
the Python code.

I should note, however, that there is currently one other difference
between the Python and C code, and it affects the output rather than
the speed.  In the Doc compression scheme, characters with the high
bit set -- accented characters, non-ASCII symbols, and the like --
must be escaped when they are stored in the compressed output.  An
escape consists of a count byte in the range 0x01-0x08 followed by
that many (1-8) bytes of literal data.
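
As a small illustration of the escape form, the sketch below builds one
escape from 1-8 literal bytes.  The function name escape_literals is
hypothetical and is not taken from doc_compress.py.

```python
def escape_literals(chunk: bytes) -> bytes:
    """Encode 1-8 bytes as a count byte followed by the bytes themselves."""
    if not 1 <= len(chunk) <= 8:
        raise ValueError("an escape carries between 1 and 8 bytes")
    return bytes([len(chunk)]) + chunk


# A single high-bit character becomes a two-byte escape.
single = escape_literals(b"\x9f")                 # b"\x01\x9f"
# Four of them can share one count byte.
four = escape_literals(b"\x9f\x80\x8d\xea")       # b"\x04\x9f\x80\x8d\xea"
```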

As the compressor outputs bytes, it escapes every high-bit-set
character individually, even if there are several of them in a row.
The C compressor then makes a second pass over the data, collapsing
sequences of escapes.  For example, the main compression loop might
output:

    0x01 0x9f  0x01 0x80  0x01 0x8d  0x01 0xea

and the second pass would collapse this to:

    0x04 0x9f 0x80 0x8d 0xea
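
A collapsing pass of this kind might look like the sketch below.  It is
not a port of the C code.  The name collapse_escapes is hypothetical,
and the handling of bytes 0x80-0xBF as the first byte of a two-byte
back reference is an assumption based on the commonly described Doc
format, not something stated in this file.

```python
def collapse_escapes(data: bytes) -> bytes:
    """Merge adjacent escapes into runs of at most 8 literal bytes."""
    out = bytearray()
    run = bytearray()  # literal bytes from escapes seen back to back

    def flush():
        # Emit the pending literals as escapes of up to 8 bytes each.
        for start in range(0, len(run), 8):
            chunk = run[start:start + 8]
            out.append(len(chunk))
            out.extend(chunk)
        run.clear()

    i = 0
    while i < len(data):
        b = data[i]
        if 1 <= b <= 8:
            # An escape: remember its payload instead of copying it out.
            run.extend(data[i + 1:i + 1 + b])
            i += 1 + b
        else:
            flush()
            if 0x80 <= b <= 0xBF:
                # Assumed two-byte back reference; copy both bytes as-is.
                out.extend(data[i:i + 2])
                i += 2
            else:
                out.append(b)
                i += 1
    flush()
    return bytes(out)


first_pass = bytes([0x01, 0x9F, 0x01, 0x80, 0x01, 0x8D, 0x01, 0xEA])
second_pass = collapse_escapes(first_pass)  # bytes 04 9f 80 8d ea
```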

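The Python compressor does not perform this second pass.  Collapsing
or not collapsing sequences of escapes doesn't affect decompression at
all; however, it may make a compressed record slightly bigger *if
there are runs of more than one escaped character in a row*, because
each separate escape spends its own count byte.  In the example above,
that is eight bytes instead of five.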
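
To see why decompression is unaffected, here is a minimal sketch of the
escape-handling part of a decoder.  It is deliberately incomplete: it
only understands escapes and plain bytes, and ignores the other codes
a real Doc decompressor must handle.  The name expand_escapes is
hypothetical.

```python
def expand_escapes(data: bytes) -> bytes:
    """Expand count-prefixed escapes; pass every other byte through."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if 1 <= b <= 8:
            # Copy the next b bytes literally.
            out += data[i + 1:i + 1 + b]
            i += 1 + b
        else:
            out.append(b)
            i += 1
    return bytes(out)


separate = bytes([0x01, 0x9F, 0x01, 0x80, 0x01, 0x8D, 0x01, 0xEA])
collapsed = bytes([0x04, 0x9F, 0x80, 0x8D, 0xEA])

# Both encodings expand to the same four bytes.
assert expand_escapes(separate) == expand_escapes(collapsed)
print(len(separate), len(collapsed))  # 8 5
```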

Whether this matters in practice remains to be seen.  In ordinary
English text it makes almost no difference, because high-bit
characters are rare and unlikely to come in large clumps.  In
non-English text, however, high-bit characters are more likely to
occur, especially if the text is heavy in accents.  Such text will
produce slightly larger compressed output using the Python code.
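
For a rough feel of the cost on a given text, the sketch below counts
the extra bytes spent when every high-bit byte gets its own escape.
It assumes that every high-bit byte ends up as an escaped literal
(none are absorbed by other parts of the compression) and that the
text is in a single-byte encoding such as Latin-1; both are
simplifying assumptions, and uncollapsed_overhead is a hypothetical
name.

```python
def uncollapsed_overhead(raw: bytes) -> int:
    """Extra bytes used by one-escape-per-byte versus collapsed runs."""
    extra = 0
    run = 0
    for b in raw + b"\x00":  # the trailing sentinel closes a final run
        if b >= 0x80:
            run += 1
        elif run:
            separate = 2 * run               # count byte + data, per byte
            collapsed = run + -(-run // 8)   # data + one count per 8 bytes
            extra += separate - collapsed
            run = 0
    return extra


# Isolated accented letters cost nothing extra.
print(uncollapsed_overhead("naïve café".encode("latin-1")))    # 0
# A run of four high-bit bytes costs three extra bytes.
print(uncollapsed_overhead(bytes([0x9F, 0x80, 0x8D, 0xEA])))   # 3
```

Only runs of two or more high-bit bytes contribute, so text with
scattered single accents is barely affected.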

  -- Rob Tillotson <robt@debian.org>