Sophie: wwwoffle-2.6d-1mdk i586

wwwoffle-2.6d-1mdk.i586.rpm

            WWWOFFLE - World Wide Web Offline Explorer - Version 2.6c
            =========================================================


One feature that has often been requested in WWWOFFLE is compression.  This is
either for compression from servers on the internet or for compression of files
in the cache.  Since adding compression of any sort is a big step I have
implemented it for both requested cases and also from WWWOFFLE to the browser.

The compression options are selectable at compile time and the individual
options are chosen using the WWWOFFLE configuration file.  This means that if
you are not interested in compression, or the compression library is not
available then the rest of the WWWOFFLE functions are still available.  If the
program is compiled with the compression library then you are not forced to use
it.


zlib
----

The simplest way of adding the compression functionality to WWWOFFLE is by
compiling with zlib.  This will provide support for the deflate and gzip
compression methods.

The zlib README file describes the programs as:

    zlib 1.1.3 is a general purpose data compression library.  All the code
    is thread safe.  The data format used by the zlib library
    is described by RFCs (Request for Comments) 1950 to 1952 in the files 
    ftp://ds.internic.net/rfc/rfc1950.txt (zlib format), rfc1951.txt (deflate
    format) and rfc1952.txt (gzip format).

The zlib library is not GPL software (like WWWOFFLE is), but the copyright file
for it says:

    Copyright (C) 1995-1998 Jean-loup Gailly and Mark Adler

    This software is provided 'as-is', without any express or implied
    warranty.  In no event will the authors be held liable for any damages
    arising from the use of this software.


The zlib library adds different types of functions for the different compression
methods.

For deflate/inflate there are functions that will take a block of memory and
compress (deflate) or uncompress (inflate) the contents.  The block of memory is
considered as part of a large compressed block and therefore the output of the
compression function depends on the previous inputs.  There are also
miscellaneous functions that are used to initialise the compression functions,
to flush out the data at the end and to finish with the compression functions.

For gzip/gunzip there are functions that can open a compressed file and read
from it (uncompression) or write to it (compression).  All of the gzip/gunzip
compression/uncompression functions are based on file compression/uncompression
and not in-memory compression/uncompression.


The non-availablity of in-memory block-by-block gzip/gunzip functions is a
problem since WWWOFFLE needs to be able to compress data as it is flowing from
the server to the browser.  A temporary file could be used, but this is not a
good solution in general.  To work around this problem the gzip/gunzip source
code in the zlib library was examined and their operation (using the deflate and
inflate algorithms with some extra fiddling at start and end) was implemented as
in-memory block-by-block functions.


Compressed Files From Server To WWWOFFLE
----------------------------------------

HTTP/1.1 and Content Negotiation
- - - - - - - - - - - - - - - -

For the compression functions to be worth having on the link from the server to
the client there must be servers that support the function.

Fortunately the HTTP/1.1 standard defines a mechanism by which browsers can
indicate to servers that they will accept compressed data.  The servers reply
with compressed data and indicate the method by which the data has been
compressed.  This is a specific instance of the content negotiation functions of
HTTP/1.1.

Unfortunately the definition of HTTP/1.1 and content encoding leads to ambiguous
results.


Theory
- - -

The way that it works is the following (in theory) for gzip compression.

1) The client makes request with a header of 'Accept-Encoding: gzip'
   this means that it can handle a gzipped version of the URL data.

2) The server looks at the request and supplies a 'Content-Encoding: gzip'
   reply and a compressed version of the data for the requested URL.

3) The client receives the data, see the 'Content-Encoding: gzip'
   header and decompresses the data before using it.

An important remark to make at this point is that the HTTP standard defines the
'Content-Encoding' header to apply to the complete link from server to client
through any proxies.  It is not intended to apply separately for the server to
proxy and proxy to client links.  There is a 'Transfer-Encoding' header for
this, but it does not allow for compression.


Problem 1
- - - - -

The use of compression is fairly rare and there are problems with clients, even
without the use of WWWOFFLE.

For example Netscape version 4.76 will ask for gzip compressed data and will
display the HTML fine.  The problem is that if the images in the page are also
sent compressed then they are displayed as the 'broken image' icon.  If you view
any single image from the page then it is OK.  This indicates to me that the
browser knows how to handle gzipped data for the HTML page and for the images,
but not for images inside a page!

Mozilla version M18 works fine with the same page and same images, so it must be
a client problem.


Problem 2
- - - - -

When a request is sent for a URL that is naturally compressed then even if no
'Accept-Encoding' header is sent then the data comes back with a
'Content-Encoding' header.  So for example if a user requests the URL
http://www.foo/bar.tar.gz then the data comes back gzipped with a
'Content-Encoding: gzip' header.  If the user saves the file then he expects
that it is saved to a file called bar.tar.gz and that it contains a compressed
tar file.

The problem here is that WWWOFFLE adds in an 'Accept-Encoding' header on all
outgoing requests and decompresses the ones that come back with a
'Content-Encoding' header and removes the header.  This just breaks what I have
described above, the file bar.tar.gz that the browser writes out is actually a
tar file and not a compressed tar file.  WWWOFFLE has no way to know that the
data that was requested was compressed in its natural form or if the compression
was added as part of the content negotiation.


Problem 3
- - - - -

When WWWOFFLE is used and it performs the uncompression for the client then this
can also cause problems.

The Debian Linux package manager program 'apt-get' requests files called
Packages.gz from the server.  If WWWOFFLE uncompresses these and sends them to
the client uncompressed then apt-get fails because the file is not compressed
like it expects.


Solution
- - - -

The only solution that I can see is that WWWOFFLE does not decompress any files
that it thinks might be naturally compressed, e.g. *.gz files.

This means that the configuration file for WWWOFFLE needs to contain a list of
files that it does not request compressed and does not try to decompress.


Problem 4
- - - - -

Due to the browser problems quoted above (Problem 1) there are servers that will
only send compressed content to browsers that they know will accept it.  This
relies on the User-Agent header that the browser sends in the request.

The problem here is that when people hide the browser that they are using by
changing the User-Agent (either in the browser or using the WWWOFFLE
CensorHeader options) the compression may not be performed.

One server that does this is www.google.com who only send compressed data to
clients that can handle it.

Solution
- - - -

There are two solutions here, either the user has to choose a fake User-Agent
that will work (but there is no list of those and different servers may use
different ones) or the server needs to be modified.

For the www.google.com problem, the solution has been implemented at the server
end by accepting "WWWOFFLE" as a User-Agent string (specifically the start of
string) that performs compression correctly.  Obviously it is impossible to get
every server to change their method of choosing to use compression or not.


Compressed Cache
----------------

The problems described above with ambiguity in the meaning of the
'Content-Encoding' header also cause problems with the compressed cache.


Problem 1
- - - - -

If the file is stored in the cache with a 'Content-Encoding' header then
WWWOFFLE would need to decide if it should decompress it before sending it to
the brower.  This needs the same list of files not to compress that are
mentioned in the solution above.


Solution 1
- - - - -

Two solutions to this problem present themselves.

1) Make the 'Content-Encoding' something that WWWOFFLE will recognise as being
compressed by itself.  For example 'Content-Encoding: wwwoffle-deflate' could be
used to indicate files that WWWOFFLE compressed in the cache and that need to be
uncompressed when they are read out again.

2) Add another header into the cached file and use a standard content encoding.
There is a header called 'Pragma' that can be added to any HTTP header, its
meaning can be defined by the user, unrecognised headers should be ignored.

The first option is the simplest, but leaves the cache files in a non-standard
format.  The second option means that the file itself is still a valid HTTP
header followed by data.

The second option is the one that is implemented.


Problem 2
- - - - -

Another problem with the compressed cache format is that many files are already
compressed.  For example images will nearly always be compressed (GIF, JPEG and
PNG all include compression).  These files will not benefit from being compressed again


Solution 2
- - - - -

As for the solution listed for the server transfer problem a list of files not
to compress in the cache is needed.  In this case since the file exists in the
cache already it is possible to add a list of MIME types not to be compressed,
e.g. image/jpeg.


Compressed Files From Server To WWWOFFLE
----------------------------------------

Now that the problems have been examined for the previous two cases the problem
is solved.  The list of MIME types that are used for the cache compression are
used for deciding if it is worth compressing the file to send to the browser.



Andrew M. Bishop
12 April 2001