<!DOCTYPE article PUBLIC "-//HaL and O'Reilly//DTD DocBook//EN" [ <!ENTITY % ISOpub PUBLIC "ISO 8879:1986//ENTITIES Publishing//EN"> %ISOpub; <!ENTITY % ISOnum PUBLIC "ISO 8879:1986//ENTITIES Numeric and Special Graphic//EN"> %ISOnum; <!ENTITY sgml "<ulink url='http://www.sil.org/sgml/sgml.html'><acronym>SGML</acronym></ulink>"> <!ENTITY esis "<acronym>ESIS</acronym>"> <!ENTITY sgmls.pm "<ulink url='../SGMLSpm/sgmlspm.html'><application>SGMLS.pm</application></ulink>"> <!ENTITY output.pm "<link linkend='output'><application>SGMLS::Output</application></link>"> <!ENTITY sgmlspl "<link linkend='sgmlspl'><application>sgmlspl</application></link>"> <!ENTITY perl5 "<ulink url='http://www.metronet.com/0/perlinfo/perl5/manual/perl.html'><application>perl5</application></ulink>"> <!ENTITY sgmls "<application>sgmls</application>"> <!ENTITY nsgmls "<ulink url='http://www.jclark.com/sp.html'><application>nsgmls</application></ulink>"> <!ENTITY LaTeX SDATA "[LaTeX]"> ]> <article id=sgmlspl> <artheader> <title>sgmlspl: a simple post-processor for SGMLS and NSGMLS (for use with &sgmls.pm; version 1.03)</title> <authorgroup> <author> <firstname>David</firstname> <surname>Megginson</surname> <affiliation> <orgname>University of Ottawa</orgname> <orgdiv>Department of English</orgdiv> <address><email>dmeggins@aix1.uottawa.ca</email></address> </affiliation> </author> </authorgroup> <artpagenums>[unpublished]</artpagenums> </artheader> <para>Welcome to &sgmlspl;, a simple sample &perl5; application which uses the &sgmls.pm; class library.</para> <important id=terms> <title>Terms</title> <para>This program, along with its documentation, is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.</para> <para>This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.</para> <para>You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.</para> </important> <sect1 id=definition> <title>What is &sgmlspl;?</title> <para>&sgmlspl; is a sample application distributed with the &sgmls.pm; &perl5; class library — you can use it to convert &sgml; documents to other formats by providing a <link linkend='specs'><glossterm>specification file</glossterm></link> detailing exactly how you want to handle each element, external data entity, subdocument entity, CDATA string, record end, SDATA string, and processing instruction. &sgmlspl; also uses the &output.pm; library (included in this distribution) to allow you to redirect or capture output.</para> <para>To use &sgmlspl;, you simply prepare a specification file containing regular &perl5; code. If your &sgml; document were named <filename>doc.sgml</filename>, your &sgmlspl; specification file were named, <filename>spec.pl</filename>, and the name of your new file were <filename>doc.latex</filename>, then you could use the following command in a Unix shell to convert your &sgml; document:</para> <programlisting> sgmls doc.sgml | sgmlspl spec.pl > doc.latex </programlisting> <para>&sgmlspl will pass any additional arguments on to the specification file, which can process them in the regular &perl5; fashion. The specification files used to convert this manual — <filename>tolatex.pl</filename> and <filename>tohtml.pl</filename> — are available with the &sgmls.pm; distribution.</para> </sect1> <sect1 id=installation> <title>How do I install &sgmlspl; on my system?</title> <para>To use &sgmlspl;, you need to install &sgmls.pm; on your system, by copying the &sgmls.pm; file to a directory searched by &perl5;. You also need to install &output.pm; in the same directory, and &sgmlspl; (with execute permission) somewhere on your <symbol>PATH</symbol>. The easiest way to do all of this on a Unix system is to change to the root directory of this distribution (<filename>SGMLSpm</filename>), edit the <filename>Makefile</filename> appropriately, and type</para> <programlisting> make install </programlisting> </sect1> <sect1 id=dsssl> <title>Is &sgmlspl; the best way to convert &sgml; documents?</title> <para>Not necessarily. While &sgmlspl; is fully functional, it is not always particularly intuitive or pleasant to use. There is a new proposed standard, <glossterm>Document Style Semantics and Specification Language</glossterm> (<acronym>DSSSL</acronym>), based on the <application>Scheme</application> programming language, and implementations should soon be available. To read more about the <application>DSSSL</application> standard, see <ulink url="http://www.jclark.com/dsssl"><filename>http://www.jclark.com/dsssl/</filename></ulink> on the Internet.</para> <para>That said, <acronym>DSSSL</acronym> is a declarative, side-effect-free programming language, while &sgmlspl; allows you to use any programming constructions available in &perl5;, including those with side-effects. This means that if you want to do more than simply format the document or convert it from one <glossterm>Document Type Definition</glossterm> (<acronym>DTD</acronym>) to another, &sgmlspl; might be a good choice.</para> </sect1> <sect1 id=specs> <title>How does the specification file tell &sgmlspl; what to do?</title> <para>&sgmlspl; uses an <glossterm>event model</glossterm> rather than a <glossterm>procedural model</glossterm> — instead of saying <quote>do A then B then C</quote> you say <quote>whenever X happens, do A; whenever Y happens, do B; whenever Z happens, do C</quote>. In other words, while you design the code, &sgmlspl; decides when and how often to run it.</para> <para>The specification file, which contains your instructions, is regular &perl5; code, and you can define packages and subroutines, display information, read files, create variables, etc. For processing the &sgml; document, however, &sgmlspl; exports a single subroutine, <command>sgml(<parameter>event</parameter>, <parameter>handler</parameter>)</command>, into the 'main' package — each time you call <command>sgml</command>, you declare a handler for a specific type of &sgmls; <ulink url='../SGMLSpm/events.html'>event</ulink>, and &sgmlspl; will then execute that handler every time the event occurs. You may use <command>sgml</command> to declare a <link linkend='handlers'>handler</link> for a <link linkend='generic'><glossterm>generic event</glossterm></link>, like <literal>'start_element'</literal>, or a <link linkend='specific'><glossterm>specific event</glossterm></link>, like <literal>'<DOC>'</literal> — a specific event will always take precedence over a generic event, so when the <symbol>DOC</symbol> element begins, &sgmlspl; will execute the <literal>'<DOC>'</literal> handler rather than the <literal>'start_element'</literal> handler.</para> </sect1> <sect1 id=handlers> <title>What about the <parameter>handler</parameter> argument?</title> <para>The second argument to the <command>sgml</command> subroutine is the actual code or data associated with each event. If it is a string, it will be printed literally using the <command>output</command> subroutine from the &output.pm; library; if it is a reference to a &perl5; subroutine, the subroutine will be called whenever the event occurs. The following three <command>sgml</command> commands will have identical results:</para> <programlisting> # Example 1 sgml('<DOC>', "\\begin{document}\n"); # Example 2 sgml('<DOC>', sub { output "\\begin{document}\n"; }); # Example 3 sub do_begin_document { output "\\begin{document}\n"; } sgml('<DOC>', \&do_begin_document); </programlisting> <para>For simply printing a string, of course, it does not make sense to use a subroutine; however, the subroutines can be useful when you need to check the value of an attribute, perform different actions in different contexts, or perform other types of relatively more complicated post-processing.</para> <para>If your handler is a subroutine, then it will receive two arguments: the &sgmls.pm; event's data, and the &sgmls.pm; event itself (see the &sgmls.pm; <ulink url='../SGMLSpm/events.html'>documentation</ulink> for a description of event and data types). The following example will print <literal>'\begin{enumerate}'</literal> if the value of the attribute <symbol>TYPE</symbol> is <literal>'ORDERED'</literal>, and <literal>'\begin{itemize}'</literal> if the value of the attribute <symbol>TYPE</symbol> is <literal>'UNORDERED'</literal>:</para> <programlisting> sgml('<LIST>', sub { my ($element,$event) = @_; my $type = $element->attribute('TYPE')->value; if ($type eq 'ORDERED') { output "\\begin{enumerate}\n"; } elsif ($type eq 'UNORDERED') { output "\\begin{itemize}\n"; } else { die "Bad TYPE '$type' for element LIST at line " . $event->line . " in " . $event->file . "\n"; } }); </programlisting> <para>You will not always need to use the <parameter>event</parameter> argument, but it can be useful if you want to report line numbers or file names for errors (presuming that you called &sgmls; or &nsgmls; with the <parameter>-l</parameter> option). If you have a new version of &nsgmls; which accepts the <parameter>-h</parameter> option, you can also use the <parameter>event</parameter> argument to look up arbitrary entities declared by the program. See the <ulink url="../SGMLSpm/sgmlsevent.html">SGMLS_Event</ulink> documentation for more information.</para> </sect1> <sect1 id=generic> <title>What are the generic events?</title> <para>&sgmlspl; recognises the twelve generic events listed in table <xref linkend=table.events.generic>. You may provide any one of these as the first argument to <command>sgml</command> to declare a handler (string or subroutine) for that event.</para> <table id=table.events.generic> <title>&sgmlspl; generic events</title> <tgroup cols=2> <thead> <row> <entry>Event</entry> <entry>Description</entry> </row> </thead> <tbody> <row> <entry><literal>'start'</literal></entry> <entry>Execute <parameter>handler</parameter> (with no arguments) at the beginning of the parse.</entry> </row> <row> <entry><literal>'end'</literal></entry> <entry>Execute <parameter>handler</parameter> (with no arguments) at the end of the parse.</entry> </row> <row> <entry><literal>'start_element'</literal></entry> <entry>Execute <parameter>handler</parameter> at the beginning of every element without a specific start handler.</entry> </row> <row> <entry><literal>'end_element'</literal></entry> <entry>Execute <parameter>handler</parameter> at the end of every element without a specific end handler.</entry> </row> <row> <entry><literal>'cdata'</literal></entry> <entry>Execute <parameter>handler</parameter> for every character-data string.</entry> </row> <row> <entry><literal>'sdata'</literal></entry> <entry>Execute <parameter>handler</parameter> for every special-data string without a specific handler.</entry> </row> <row> <entry><literal>'re'</literal></entry> <entry>Execute <parameter>handler</parameter> for every record end.</entry> </row> <row> <entry><literal>'pi'</literal></entry> <entry>Execute <parameter>handler</parameter> for every processing instruction.</entry> </row> <row> <entry><literal>'entity'</literal></entry> <entry>Execute <parameter>handler</parameter> for every external data entity without a specific handler.</entry> </row> <row> <entry><literal>'start_subdoc'</literal></entry> <entry>Execute <parameter>handler</parameter> at the beginning of every subdocument entity without a specific handler.</entry> </row> <row> <entry><literal>'end_subdoc'</literal></entry> <entry>Execute <parameter>handler</parameter> at the end of every subdocument entity without a specific handler.</entry> </row> <row> <entry><literal>'conforming'</literal></entry> <entry>Execute <parameter>handler</parameter> once, at the end of the document parse, if and only if the document was conforming.</entry> </row> </tbody> </tgroup> </table> <para>The handlers for all of these except the document events <literal>'start'</literal> and <literal>'end'</literal> will receive two arguments whenever they are called: the first will be the data associated with the event (if any), and the second will be the <classname>SGMLS_Event</classname> object itself (see the document for &sgmls.pm;). Note the following example, which allows processing instructions for including the date or the hostname in the document at parse time:</para> <programlisting> sgml('pi', sub { my ($instruction) = @_; if ($instruction eq 'date') { output `date`; } elsif ($instruction eq 'hostname') { output `hostname`; } else { print STDERR "Warning: unknown processing instruction: $instruction\n"; } }); </programlisting> <para>With this <link linkend='handlers'>handler</link>, any occurance of <literal><?date></literal> in the original &sgml; document would be replaced by the current date and time, and any occurance of <literal><?hostname></literal> would be replaced by the name of the host.</para> </sect1> <sect1 id=specific> <title>What are the specific events?</title> <para>In addition to the <link linkend=generic>generic events</link> listed in the previous section, &sgmlspl; allows special, specific handlers for the beginning and end of elements and subdocument entities, for SDATA strings, and for external data entities. Table <xref linkend=table.events.specific> lists the different specific event types available.</para> <table id=table.events.specific> <title>Specific event types</title> <tgroup cols=2> <thead> <row> <entry>Event</entry> <entry>Description</entry> </row> </thead> <tbody> <row> <entry><literal>'<GI>'</literal></entry> <entry>Execute <parameter>handler</parameter> at the beginning of every element named <literal>'GI'</literal>.</entry> </row> <row> <entry><literal>'</GI>'</literal></entry> <entry>Execute <parameter>handler</parameter> at the end of every element named <literal>'GI'</literal>.</entry> </row> <row> <entry><literal>'|SDATA|'</literal></entry> <entry>Execute <parameter>handler</parameter> for every special-data string <literal>'SDATA'</literal>.</entry> </row> <row> <entry><literal>'&ENTITY;'</literal></entry> <entry>Execute <parameter>handler</parameter> for every external data entity named <literal>'ENTITY'</literal>.</entry> </row> <row> <entry><literal>'{ENTITY}'</literal></entry> <entry>Execute <parameter>handler</parameter> at the beginning of every subdocument entity named <literal>'ENTITY'</literal>.</entry> </row> <row> <entry><literal>'{/ENTITY}'</literal></entry> <entry>Execute <parameter>handler</parameter> at the end of every subdocument entity named <literal>'ENTITY'</literal>.</entry> </row> </tbody> </tgroup> </table> <para>Note that these override the <link linkend=generic>generic-event</link> handlers. For example, if you were to type</para> <programlisting> sgml('&FOO;', sub { output "Found a \"foo\" entity!\n"; }); sgml('entity', sub { output "Found an entity!\n"; }); </programlisting> <para>And the external data entity <literal>&FOO;</literal> appeared in your &sgml; document, &sgmlspl; would call the first handler rather than the second.</para> <para>Note also that start and end handlers are entirely separate things: if an element has a specific start handler but no specific end handler, the generic end handler will still be called at the end of the element. To prevent this, declare a handler with an empty string:</para> <programlisting> sgml('</HACK>', ''); </programlisting> </sect1> <sect1 id=output> <title>Why does &sgmlspl; use <command>output</command> instead of <command>print</command>?</title> <para>&sgmlspl; uses a special &perl5; library &output.pm; for printing text. &output.pm; exports the subroutines <command>output(<parameter>string</parameter>…)</command>, <command>push_output(<parameter>type</parameter>[,<parameter>data</parameter>])</command>, and <command>pop_output</command>. The subroutine <command>output</command> works much like the regular &perl5; function <command>print</command>, except that you are not able to specify a file handle, and you may include multiple strings as arguments.</para> <para>When you want to write data to somewhere other than <symbol>STDOUT</symbol> (the default), then you use the subroutines <link linkend=pushoutput><command>push_output</command></link> and <link linkend=popoutput><command>pop_output</command></link> to set a new destination or to restore an old one.</para> <para>You can use the &output.pm; package in other programs by adding the following line:</para> <programlisting> use SGMLS::Output; </programlisting> </sect1> <sect1 id=pushoutput> <title>How do I use <command>push_output</command>?</title> <para>The subroutine <command>push_output(<parameter>type</parameter>[,<parameter>data</parameter>])</command> takes two arguments: the <parameter>type</parameter>, which is always required, and the <parameter>data</parameter>, which is needed for certain types of output. Table <xref linkend=table.output.push.output> lists the different types which you can push onto the output stack.</para> <table id=table.output.push.output> <title>Types for <command>push_output</command></title> <tgroup cols=3> <thead> <row> <entry>Type</entry> <entry>Data</entry> <entry>Description</entry> </row> </thead> <tbody> <row> <entry><literal>'handle'</literal></entry> <entry>a filehandle</entry> <entry>Send all output to the supplied filehandle.</entry> </row> <row> <entry><literal>'file'</literal></entry> <entry>a filename</entry> <entry>Open the supplied file for writing, erasing its current contents (if any), and send all output to it.</entry> </row> <row> <entry><literal>'append'</literal></entry> <entry>a filename</entry> <entry>Open the supplied file for writing and append all output to its current contents.</entry> </row> <row> <entry><literal>'pipe'</literal></entry> <entry>a shell command</entry> <entry>Pipe all output to the supplied shell command.</entry> </row> <row> <entry><literal>'string'</literal></entry> <entry>a string [optional]</entry> <entry>Append all output to the supplied string, which will be returned by <command>pop_output</command>.</entry> </row> <row> <entry><literal>'nul'</literal></entry> <entry>[none]</entry> <entry>Ignore all output.</entry> </row> </tbody> </tgroup> </table> <para>Because the output is stack-based, you do not lose the previous output destination when you push a new one. This is especially convenient for dealing with data in tree-structures, like &sgml; data — for example, you can capture the contents of sub-elements as strings, ignore certain types of elements, and split the output from one &sgml; parse into a series of sub-files. Here are some examples:</para> <programlisting> push_output('string'); # append output to an empty string push_output('file','/tmp/foo'); # send output to this file push_output('pipe','mail webmaster'); # mail output to 'webmaster' (!!) push_output('nul'); # just ignore all output </programlisting> </sect1> <sect1 id=popoutput> <title>How do I use <command>pop_output</command>?</title> <para>When you want to restore the previous output after using <link linkend=pushoutput><command>push_output</command></link>, simply call the subroutine <command>pop_output</command>. If the output type was a string, <command>pop_output</command> will return the string (containing all of the output); otherwise, the return value is not useful.</para> <para>Usually, you will want to use <command>push_output</command> in the start handler for an element or subdocument entity, and <command>pop_output</command> in the end handler.</para> </sect1> <sect1 id=outputex> <title>How about an example for <command>output</command>?</title> <para>Here is a simple example to demonstrate how <link linkend=output><command>output</command></link>, <link linkend="pushoutput"><command>push_output</command></link>, and <link linkend=popoutput><command>pop_output</command></link> work:</para> <programlisting> output "Hello, world!\n"; # (Written to STDOUT by default) push_output('nul'); # Push 'nul' ahead of STDOUT output "Hello, again!\n"; # (Discarded) push_output('file','foo.out'); # Push file 'foo.out' ahead of 'nul' output "Hello, again!\n"; # (Written to the file 'foo.out') pop_output; # Pop 'foo.out' and revert to 'nul' output "Hello, again!\n"; # (Discarded) push_output('string'); # Push 'string' ahead of 'nul' output "Hello, "; # (Written to the string) output "again!\n"; # (Also written to the string) # Pop the string "Hello, again!\n" $foo = pop_output; # and revert to 'nul' output "Hello, again!\n"; # (Discarded) pop_output; # Pop 'nul' and revert to STDOUT output "Hello, at last!\n"; # (Written to STDOUT) </programlisting> </sect1> <sect1 id=skel> <title>Is there an easier way to make specification files?</title> <para>Yes. The script <filename>skel.pl</filename>, included in this package, is an &sgmlspl; specification which writes a specification (!!!). To use it under Unix, try something like</para> <programlisting> sgmls foo.sgml | sgmlspl skel.pl > foo-spec.pl </programlisting> <para>(presuming that there is a copy of <filename>skel.pl</filename> in the current directory or in a directory searched by &perl5;) to generate a new, blank template named <filename>foo-spec.pl</filename>.</para> </sect1> <sect1 id=forward> <title>How should I handle forward references?</title> <para>Because &sgmlspl; processes the document as a linear data stream, from beginning to end, it is easy to refer <emphasis>back</emphasis> to information, but relatively difficult to refer <emphasis>forward</emphasis>, since you do not know what will be coming later in the parse. Here are a few suggestions.</para> <para>First, you could use <link linkend=pushoutput><command>push_output</command></link> and <link linkend=popoutput><command>pop_output</command></link> to save up output in a large string. When you have found the information which you need, you can make any necessary modifications to the string and print it then. This will work for relatively small chunks of a document, but you would not want to try it for anything larger.</para> <para>Next, you could use the <ulink url="../SGMLSpm/extend.html"><command>ext</command></ulink> method to add extra pointers, and build a parse tree of the whole document before processing any of it. This method will work well for small documents, but large documents will place some serious stress on your system's memory and/or swapping.</para> <para>A more sophisticated solution, however, involves the <application>Refs.pm</application> module, included in this distribution. In your &sgmlspl; script, include the line</para> <programlisting> use SGMLS::Refs.pm; </programlisting> <para>to activate the library. The library will create a database file to keep track of references between passes, and to tell you if any references have changed. For example, you might want to try something like this:</para> <programlisting> sgml('start', sub { my $Refs = new SGMLS::Refs('references.refs'); }); sgml('end', sub { $Refs->warn; destroy $Refs; }); </programlisting> <para>This code will create an object, $Refs, linked to a file of references called <filename>references.refs</filename>. The <classname>SGMLS::Refs</classname> class understands the methods listed in table <xref linkend=table.class.refs></para> <table id=table.class.refs> <title>The SGMLS::Refs class</title> <tgroup cols=3> <thead> <row> <entry>Method</entry> <entry>Return Type</entry> <entry>Description</entry> </row> </thead> <tbody> <row> <entry><command>new</command>(<parameter>filename</parameter>,[<parameter>logfile_handle</parameter>])</entry> <entry><classname>SGMLS::Refs</classname></entry> <entry>Create a new <classname>SGMLS::Refs</classname> object. Arguments are the name of the hashfile and (optionally) a writable filehandle for logging changes.</entry> </row> <row> <entry><command>get</command>(<parameter>key</parameter>)</entry> <entry>string</entry> <entry>Look up a reference key in the hash file and return its value.</entry> </row> <row> <entry><command>put</command>(<parameter>key</parameter>,<parameter>value</parameter>)</entry> <entry>[none]</entry> <entry>Set a new value for the key in the hashfile.</entry> </row> <row> <entry><command>count</command></entry> <entry>number</entry> <entry>Return the number of references whose values have changed (thus far).</entry> </row> <row> <entry><command>warn</command></entry> <entry>1 or 0</entry> <entry>Print a warning mentioning the number of references which have changed, and return 1 if a warning was printed.</entry> </row> </tbody> </tgroup> </table> </sect1> <sect1 id=bugs> <title>Are there any bugs?</title> <para>Any bugs in &sgmls.pm; will be here too, since &sgmlspl; relies heavily on that &perl5; library.</para> </sect1> </article> <!-- Keep this comment at the end of the file Local variables: mode: sgml sgml-declaration:"/usr/local/lib/sgml/sgmldecl/docbook.dcl" End: -->