XMLPPM 0.95 README
James Cheney
11/30/2000

ABOUT XMLPPM

This directory contains version 0.95 of XMLPPM, an XML-specific
compressor. XMLPPM reads well-formed XML text from standard input,
compresses it, and sends the compressed bits to standard output. The
companion decompressor, XMLUNPPM, restores the text version of the XML
data from the compressed bits. (The restored version may differ
slightly from the original; for example, some whitespace may be
stripped.)

XMLPPM is *experimental*. I do *not* recommend that you use XMLPPM to
archive important files, as XMLPPM is not fully tested and future
versions of XMLPPM may not be compatible with this initial version.
This version is being made available for research purposes.

COPYRIGHT AND LICENSE TERMS

Portions of the XMLPPM source code are based on Alistair Moffat's
arithmetic coding sources and Bill Teahan's sources for the PPMD+ text
compressor, used with permission. Those files are copyright their
respective authors as described in the source files. The rest of the
source code is copyright James Cheney, November 2000. This code (or
whatever portions of it I speak for) is covered by the GNU General
Public License.

INSTALLATION

This is the XMLPPM source code distribution, so to use XMLPPM you need
to compile the sources. XMLPPM uses version 1.95 of the "expat" XML
parser, so you need to get and install the development version of that
parser before you can compile XMLPPM. In the future, if there is
demand, I may make statically linked binaries available for selected
platforms.

Expat and its installation instructions are available at:

    http://expat.sourceforge.net/

Once you have installed expat, go to src/xmlppm-0.95 (or wherever you
installed the XMLPPM sources) and run:

    make all

This should create two binary files, xmlppm and xmlunppm.
Because XMLPPM is still under development, I don't recommend
performing further installation steps such as putting xmlppm in
/usr/bin, because other users of your machine might then think it is a
"real" (i.e. fully tested) utility.

USING XMLPPM

XMLPPM and its companion decompressor XMLUNPPM are command-line driven
and interact only with stdin and stdout. XMLPPM only reads and
compresses XML text files. What counts as an XML text file depends on
the underlying XML parser, expat; if expat cannot parse a document,
XMLPPM will print expat's error message and quit. If XMLPPM won't
compress your document, it's most likely due to a problem in expat,
not in XMLPPM, so I may not be able to do anything about it.

Supposing you do have an XML file that expat accepts, compress it
with:

    ./xmlppm < doc.xml > doc.xppm

You can of course call the compressed file anything you like, but I'm
planning on making xppm the default extension (xpm already being
taken).

To expand the compressed document, run:

    ./xmlunppm < doc.xppm > doc.new.xml

(I don't recommend that you overwrite the original document.)

BUGS

As far as I know, XMLPPM works on all XML documents. I have tested it
on a wide variety of XML documents, and found and fixed many bugs. It
is likely that some bugs remain.

XMLPPM doesn't compress the XML text directly, but rather the SAX
events generated by expat as it parses. This makes XMLPPM slightly
lossy, because some information, such as exact whitespace, is not
reported in these events, in particular in internal DTDs.

XMLPPM also runs into problems with entities. Currently, XMLPPM
conservatively replaces all occurrences of reserved characters such as
&, ;, and < with their predefined entity references. This may change
your document in an essential way. Be warned!
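
The effect of working from parser events rather than raw text can be
illustrated with a small sketch. This is not XMLPPM's code: it assumes
Python's standard xml.parsers.expat binding (which is also built on
expat), and the helper names events and rebuild are hypothetical. It
records the events expat reports, rebuilds a document from them, and
checks that the rebuilt text differs from the input while still
producing the same events.

```python
import xml.parsers.expat
from xml.sax.saxutils import escape, quoteattr


def events(text):
    """Return the list of parse events expat reports for an XML string."""
    out = []
    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # coalesce adjacent character data
    parser.StartElementHandler = lambda name, attrs: out.append(("start", name, attrs))
    parser.EndElementHandler = lambda name: out.append(("end", name))
    parser.CharacterDataHandler = lambda data: out.append(("chars", data))
    parser.Parse(text, True)
    return out


def rebuild(evts):
    """Serialize events back to XML text (hypothetical helper, not XMLUNPPM)."""
    parts = []
    for ev in evts:
        if ev[0] == "start":
            attrs = "".join(" %s=%s" % (k, quoteattr(v)) for k, v in ev[2].items())
            parts.append("<%s%s>" % (ev[1], attrs))
        elif ev[0] == "end":
            parts.append("</%s>" % ev[1])
        else:
            parts.append(escape(ev[1]))  # re-escape &, <, >
    return "".join(parts)


doc = "<a  x = '1' ><b/>fish &amp; chips</a>"
restored = rebuild(events(doc))
print(restored)
print(restored != doc, events(restored) == events(doc))
```

In this sketch the extra whitespace inside the start tag, the single
quotes around the attribute value, and the empty-element form <b/> do
not survive the round trip, even though the event stream is unchanged.
That is the kind of loss described above.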

TO DO

* Fix the above entity bug so that the XML text is preserved as
  exactly as possible
* Factor the XMLUNPPM component into an "event decoder" that decodes
  the compressed event stream and calls SAX event handlers, and a
  "printing" event handler
* Port to other XML parsing libraries
* Add the capability to directly compress/decompress XML stored in
  memory as DOM trees
* Generally clean up the code, which is pretty messy and ugly

CONTACT

James Cheney, jcheney@cs.cornell.edu