<!DOCTYPE article PUBLIC "-//HaL and O'Reilly//DTD DocBook//EN" [ <!ENTITY % ISOpub PUBLIC "ISO 8879:1986//ENTITIES Publishing//EN"> %ISOpub; <!ENTITY % ISOnum PUBLIC "ISO 8879:1986//ENTITIES Numeric and Special Graphic//EN"> %ISOnum; <!ENTITY sample.program SYSTEM "sample.pl" CDATA linespecific> <!ENTITY sgml "<ulink url='http://www.sil.org/sgml/sgml.html'><acronym>SGML</acronym></ulink>"> <!ENTITY esis "<acronym>ESIS</acronym>"> <!ENTITY sgmls.pm "<link linkend=SGMLSpm><application>SGMLS.pm</application></link>"> <!ENTITY perl5 "<ulink url='http://www.metronet.com/0/perlinfo/perl5/manual/perl.html'><application>perl5</application></ulink>"> <!ENTITY perl5 "<application>perl5</application>"> <!ENTITY sgmls "<application>sgmls</application>"> <!ENTITY nsgmls "<ulink url='http://www.jclark.com/sp.html'><application>nsgmls</application></ulink>"> ]> <article id=SGMLSpm> <artheader> <title>SGMLS.pm: a perl5 class library for handling output from the SGMLS and NSGMLS parsers (version 1.03)</title> <authorgroup> <author> <firstname>David</firstname> <surname>Megginson</surname> <affiliation> <orgname>University of Ottawa</orgname> <orgdiv>Department of English</orgdiv> <address><email>dmeggins@aix1.uottawa.ca</email></address> </affiliation> </author> </authorgroup> <artpagenums>[unpublished]</artpagenums> </artheader> <para>Welcome to &sgmls.pm;, an extensible &perl5; class library for processing the output from the &sgmls; and &nsgmls; parsers. &sgmls.pm; is free, copyrighted software available by anonymous ftp in the directory <ulink url="ftp://aix1.uottawa.ca/pub/dmeggins/">ftp://aix1.uottawa.ca/pub/dmeggins/</ulink>. You might also want to look at the documentation for <ulink url="../sgmlspl/sgmlspl.html"><application>sgmlspl</application></ulink>, a simple sample script which uses &sgmls.pm; to convert documents from &sgml; to other formats.</para> <important id=terms> <title>Terms</title> <para>This program, along with its documentation, is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.</para> <para>This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.</para> <para>You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.</para> </important> <sect1 id=definition> <title>What is &sgmls.pm;?</title> <para>&sgmls.pm; is an <link linkend=extend>extensible</link> &perl5; class library for parsing the output from James Clark's popular &sgmls; and &nsgmls; parsers, available on the Internet at <ulink url="ftp://jclark.com/"><filename>ftp://jclark.com</filename></ulink>. This is <emphasis>not</emphasis> a complete system for translating documents written the the <glossterm>Standard Generalised Markup Language</glossterm> (&sgml;) into other formats, but it can easily form the basis of such a system (for a simple example, see the <ulink url="../sgmlspl/sgmlspl.html"><application>sgmlspl</application></ulink> program included in this package).</para> <para>The library recognises four basic types of &sgml; objects: the <link linkend=sgmlselement><glossterm>element</glossterm></link>, the <link linkend=sgmlsattribute><glossterm>attribute</glossterm></link>, the <link linkend=sgmlsnotation><glossterm>notation</glossterm></link>, and the <link linkend=sgmlsentity><glossterm>entity</glossterm></link>; each of these is a fully-developed class with methods for accessing important information.</para> </sect1> <sect1 id=sgml> <title>How do I produce &sgml; documents?</title> <para>I am presuming here that you are already experienced with &sgml; and the &sgmls; or &nsgmls; parser. For help with the parsers see the manual pages accompanying each one; for help with &sgml; see Robin Cover's SGML Web Page at <ulink url="http://www.sil.org/sgml/sgml.html"><filename>http://www.sil.org/sgml/sgml.html</filename></ulink> on the Internet.</para> </sect1> <sect1 id=perl5> <title>How do I program in &perl5;?</title> <para>If you have to ask this question, you probably should not be trying to use this library right now, since it is intended only for experienced &perl5; programmers. That said, however, you can find the &perl5; documentation with the &perl5; source distribution or on the World-Wide Web at <ulink url="http://www.metronet.com/0/perlinfo/perl5/manual/perl.html"><filename>http://www.metronet.com/0/perlinfo/perl5/manual/perl.html</filename></ulink>.</para> <para><emphasis>Please</emphasis> do not write to me for help on using &perl5;.</para> </sect1> <sect1 id=sgmls> <title>How do I use &sgmls.pm;?</title> <para>First, you need to copy the file &sgmls.pm; to a directory where perl can find it (on a Unix system, it might be <filename>/usr/lib/perl5</filename> or <filename>/usr/local/lib/perl5</filename>, or whatever the environment variable <symbol>PERL5LIB</symbol> is set to) and make certain that it is world-readable.</para> <para>Next, near the top of your &perl5; program, type the following line:</para> <programlisting> use SGMLS; </programlisting> <para>You must then open up a file handle from which &sgmls.pm; can read the data from an &sgmls; or &nsgmls; process, unless you are reading from a standard handle like <symbol>STDIN</symbol> — for example, if you are piping the output from &sgmls; to a &perl5; script, using something like</para> <programlisting> sgmls foo.sgml | perl myscript.pl </programlisting> <para>then the predefined filehandle <symbol>STDIN</symbol> will be sufficient. In DOS, you might want to dump the sgmls output to a file and use it as standard input (or open it explicitly in perl), and in Unix, you might actually want to open a pipe or socket for the input. &sgmls.pm; doesn't need to seek, so any input stream should work.</para> <para>To parse the &sgmls; or &nsgmls; output from the handle, create a new object instance of the <classname>SGMLS</classname> class with the handle as an argument, i.e.</para> <programlisting> $parse = new SGMLS(STDIN); </programlisting> <para>(You may create more than one <classname>SGMLS</classname> object at once, but each object <emphasis>must</emphasis> have a unique handle pointing to a unique stream, or <emphasis>chaos</emphasis> will result.) Now, you can retrieve and process events using the <command>next_event</command> method:</para> <programlisting> while ($event = $parse->next_event) { #do something with each event } </programlisting> </sect1> <sect1 id=sgmlsevent> <title>So what do I do with an event?</title> <para>The <command>next_event</command> method for the <link linkend=sgmls><classname>SGMLS</classname></link> class returns an object belonging to the class <classname>SGMLS_Event</classname>. This class has several methods available, as listed in table <xref linkend=table.class.sgmls>.</para> <table id=table.class.sgmls> <title>The <classname>SGMLS_Event</classname> class</title> <tgroup cols=3> <thead> <row> <entry>Method</entry> <entry>Return Type</entry> <entry>Description</entry> </row> </thead> <tbody> <row> <entry><command>type</command></entry> <entry>string</entry> <entry>Return the type of the event.</entry> </row> <row> <entry><command>data</command></entry> <entry>string, <classname>SGMLS_Element</classname>, or <classname>SGMLS_Entity</classname></entry> <entry>Return any data associated with the event.</entry> </row> <row> <entry><command>file</command></entry> <entry>string</entry> <entry>Return the name of the &sgml; source file which generated the event, if available.</entry> </row> <row> <entry><command>line</command></entry> <entry>string</entry> <entry>Return the line number of the &sgml; source file which generated the event, if available.</entry> </row> <row> <entry><command>element</command></entry> <entry><classname>SGMLS_Element</classname></entry> <entry>Return the element in force when the event was generated.</entry> </row> <row> <entry><command>parse</command></entry> <entry>Return the <classname>SGMLS</classname> object for the current parse.</entry> </row> <row> <entry><command>entity(<parameter>ename</parameter>)</command></entry> <entry>Look up an entity from those currently known to the parse. An alias for <literal>->parse->entity($ename)</literal></entry> </row> <row> <entry><command>notation(<parameter>nname</parameter>)</command></entry> <entry>Look up the notation from those currently known to the parse: an alias for <literal>->parse->notation($nname)</literal>.</entry> </row> </tbody> </tgroup> </table> <para>The <command>file</command> and <command>line</command> methods will return useful information only if you called &sgmls; or &nsgmls; with the <parameter>-l</parameter> flag to include file and line-number information.</para> </sect1> <sect1 id=events> <title>What are the different event types and data?</title> <para>Table <xref linkend=table.class.sgmls.event> lists the ten different event types returned by the <command>next_event</command> method of an <link linkend=sgmls><classname>SGMLS</classname></link> object and the different types of data associated with each of these (note that these do <emphasis>not</emphasis> correspond to the standard &esis; events).</para> <table id=table.class.sgmls.event> <title>The <classname>SGMLS_Event</classname> types</title> <tgroup cols=3> <thead> <row> <entry>Event Type</entry> <entry>Event Data</entry> <entry>Description</entry> </row> </thead> <tbody> <row> <entry><returnvalue>'start_element'</returnvalue></entry> <entry><classname>SGMLS_Element</classname></entry> <entry>The beginning of an element.</entry> </row> <row> <entry><returnvalue>'end_element'</returnvalue></entry> <entry><classname>SGMLS_Element</classname></entry> <entry>The end of an element.</entry> </row> <row> <entry><returnvalue>'cdata'</returnvalue></entry> <entry>string</entry> <entry>Regular character data.</entry> </row> <row> <entry><returnvalue>'sdata'</returnvalue></entry> <entry>string</entry> <entry>Special system data.</entry> </row> <row> <entry><returnvalue>'re'</returnvalue></entry> <entry>[none]</entry> <entry>A record-end (i.e., a newline).</entry> </row> <row> <entry><returnvalue>'pi'</returnvalue></entry> <entry>string</entry> <entry>A processing instruction</entry> </row> <row> <entry><returnvalue>'entity'</returnvalue></entry> <entry><classname>SGMLS_Entity</classname></entry> <entry>A non-SGML external entity.</entry> </row> <row> <entry><returnvalue>'start_subdoc'</returnvalue></entry> <entry><classname>SGMLS_Entity</classname></entry> <entry>The beginning of an SGML subdocument.</entry> </row> <row> <entry><returnvalue>'end_subdoc'</returnvalue></entry> <entry><classname>SGMLS_Entity</classname></entry> <entry>The end of an SGML subdocument.</entry> </row> <row> <entry><returnvalue>'conforming'</returnvalue></entry> <entry>[none]</entry> <entry>The document was valid.</entry> </row> </tbody> </tgroup> </table> <para>For example, if <literal>$event->type</literal> returns <returnvalue>'start_element'</returnvalue>, then <literal>$event->data</literal> will return an object belonging to the <link linkend=sgmlselement><classname>SGMLS_Element</classname></link> class (which will contain a list of attributes, etc. — see below), <literal>$event->file</literal> and <literal>$event->line</literal> will return the file and line-number in which the element appeared (if you called &sgmls; or &nsgmls; with the <parameter>-l</parameter> flag), and <literal>$event->element</literal> will return the element currently in force (in this case, the same as <literal>$event->data</literal>).</para> </sect1> <sect1 id=sgmlselement> <title>What do I do with an <classname>SGMLS_Element</classname>?</title> <para>Altogether, there are six classes in &sgmls.pm;, each with its own methods: in addition to <link linkend=sgmls><classname>SGMLS</classname></link> (for the parse) and <link linkend=sgmlsevent><classname>SGMLS_Event</classname></link> (for a specific event), the classes are <classname>SGMLS_Element</classname>, <link linkend=sgmlsattribute><classname>SGMLS_Attribute</classname></link>, <link linkend=sgmlsentity><classname>SGMLS_Entity</classname></link>, and <link linkend=sgmlsnotation><classname>SGMLS_Notation</classname></link>. Like all of these, <classname>SGMLS_Element</classname> has a number of methods available for obtaining different types of information. For example, if you were to use</para> <programlisting> my $element = $event->data </programlisting> <para>to retrieve the data for a <literal>'start_element'</literal> or <literal>'end_element'</literal> event, then you could use the methods listed in table <xref linkend=table.class.sgmls.element> to find more information about the element.</para> <table id=table.class.sgmls.element> <title>The <classname>SGMLS_Element</classname> class</title> <tgroup cols=3> <thead> <row> <entry>Method</entry> <entry>Return Type</entry> <entry>Description</entry> </row> </thead> <tbody> <row> <entry><command>name</command></entry> <entry>string</entry> <entry>The name (or GI), in upper-case.</entry> </row> <row> <entry><command>parent</command></entry> <entry><classname>SGMLS_Element</classname></entry> <entry>The parent element, or <literal>''</literal> if this is the top element.</entry> </row> <row> <entry><command>attributes</command></entry> <entry>HASH</entry> <entry>Return a reference to a hash table of <classname>SGMLS_Attribute</classname> objects, keyed by the attribute names (in upper-case).</entry> </row> <row> <entry><command>attribute_names</command></entry> <entry>ARRAY</entry> <entry>A list of all attribute names for the current element (in upper-case).</entry> </row> <row> <entry><command>attribute(<parameter>aname</parameter>)</command></entry> <entry><classname>SGMLS_Attribute</classname></entry> <entry>Return the attribute named ANAME.</entry> </row> <row> <entry><command>set_attribute(<parameter>attribute</parameter>)</command></entry> <entry>[none]</entry> <entry>The <parameter>attribute</parameter> argument should be an object belonging to the <ulink url=sgmlsattribute.html><classname>SGMLS_Attribute</classname></ulink> class. Add it to the element, replacing any previous attribute with the same name.</entry> </row> <row> <entry><command>in(<parameter>name</parameter>)</command></entry> <entry><classname>SGMLS_Element</classname></entry> <entry>If the current element's parent is named <parameter>name</parameter>, return the parent; otherwise, return <literal>''</literal>.</entry> </row> <row> <entry><command>within(<parameter>name</parameter>)</command></entry> <entry><classname>SGMLS_Element</classname></entry> <entry>If any ancestor of the current element is named <parameter>name</parameter>, return it; otherwise, return <literal>''</literal>.</entry> </row> </tbody> </tgroup> </table> </sect1> <sect1 id=sgmlsattribute> <title>What do I do with an <classname>SGMLS_Attribute</classname>?</title> <para>Note that objects of the <classname>SGMLS_Attribute</classname> class do not have events in their own right, and are available only through the <command>attributes</command> or <command>attribute(<parameter>aname</parameter>)</command> methods for <link linkend=sgmlselement><classname>SGMLS_Element</classname></link> objects. An object belonging to the <classname>SGMLS_Attribute</classname> class will recognise the methods listed in table <xref linkend=table.class.sgmls.attribute>.</para> <table id=table.class.sgmls.attribute> <title>The <classname>SGMLS_Attribute</classname> class</title> <tgroup cols=3> <thead> <row> <entry>Method</entry> <entry>Return Type</entry> <entry>Description</entry> </row> </thead> <tbody> <row> <entry><command>name</command></entry> <entry>string</entry> <entry>The name of the attribute (in upper-case).</entry> </row> <row> <entry><command>type</command></entry> <entry>string</entry> <entry>The type of the attribute: <literal>'IMPLIED'</literal>, <literal>'CDATA'</literal>, <literal>'NOTATION'</literal>, <literal>'ENTITY'</literal>, or <literal>'TOKEN'</literal>.</entry> </row> <row> <entry><command>value</command></entry> <entry>string, <classname>SGMLS_Entity</classname>, or <classname>SGMLS_Notation</classname>.</entry> <entry>The value of the attribute. If the type is <literal>'CDATA'</literal> or <literal>'TOKEN'</literal>, it will be a simple string; if it is <literal>'NOTATION'</literal> it will be an object belonging to the <classname>SGMLS_Notation</classname> class, and if it is <literal>'Entity'</literal> it will be an object belonging to the <classname>SGMLS_Entity</classname> class.</entry> </row> <row> <entry><command>is_implied</command></entry> <entry>boolean</entry> <entry>Return true if the value of the attribute is implied, or false if it has an explicit value.</entry> </row> <row> <entry><command>set_type(<parameter>type</parameter>)</command></entry> <entry>[none]</entry> <entry>Provide a new type for the current attribute -- no sanity checking will be performed, so be careful.</entry> </row> <row> <entry><command>set_value(<parameter>value</parameter>)</command></entry> <entry>[none]</entry> <entry>Provide a new value for the current attribute -- no sanity checking will be performed, so be careful.</entry> </row> </tbody> </tgroup> </table> <para>Note that the type <literal>'TOKEN'</literal> includes both individual tokens and lists of tokens (ie <literal>'TOKENS'</literal>, <literal>'IDS'</literal>, or <literal>'IDREFS'</literal> in the original &sgml; document), so you might need to use the perl function 'split' to break the value string into a list.</para> </sect1> <sect1 id=sgmlsentity> <title>What do I do with an <classname>SGMLS_Entity</classname>?</title> <para>An <classname>SGMLS_Entity</classname> object can come in an <literal>'entity'</literal> <link linkend=events>event</link> (in which case it is always external), in a <literal>'start_subdoc'</literal> or <literal>'end_subdoc'</literal> event (in which case it always has the type <literal>'SUBDOC'</literal>), or as the value of an attribute (in which case it may be internal or external). An object belonging to the <classname>SGMLS_Entity</classname> class may use the methods listed in table <xref linkend=table.class.sgmls.entity>.</para> <table id=table.class.sgmls.entity> <title>The <classname>SGMLS_Entity</classname> class</title> <tgroup cols=3> <thead> <row> <entry>Method</entry> <entry>Return Type</entry> <entry>Description</entry> </row> </thead> <tbody> <row> <entry><command>name</command></entry> <entry>string</entry> <entry>The entity name.</entry> </row> <row> <entry><command>type</command></entry> <entry>string</entry> <entry>The entity type: <literal>'CDATA'</literal>, <literal>'SDATA'</literal>, <literal>'NDATA'</literal>, or <literal>'SUBDOC'</literal>.</entry> </row> <row> <entry><command>value</command></entry> <entry>string</entry> <entry>The entity replacement text (internal entities only).</entry> </row> <row> <entry><command>sysid</command></entry> <entry>string</entry> <entry>The system identifier (external entities only).</entry> </row> <row> <entry><command>pubid</command></entry> <entry>string</entry> <entry>The public identifier (external entities only).</entry> </row> <row> <entry><command>filenames</command></entry> <entry>ARRAY</entry> <entry>A list of file names generated from the sysid and pubid (external entities only).</entry> </row> <row> <entry><command>notation</command></entry> <entry><classname>SGMLS_Notation</classname></entry> <entry>The associated notation (external data entities only).</entry> </row> </tbody> </tgroup> </table> <para>An entity of type <literal>'SUBDOC'</literal> will have a sysid and pubid, and external data entity will have a sysid, pubid, filenames, and a notation, and an internal data entity will have a value.</para> </sect1> <sect1 id=sgmlsnotation> <title>What do I do with an <classname>SGMLS_Notation</classname>?</title> <para>The fourth data class is the notation, which is available only as a return value from the <command>notation</command> method of an <link linkend=sgmlsentity><classname>SGMLS_Entity</classname></link> or the <command>value</command> method of an <link linkend=sgmlsattribute><classname>SGMLS_Attribute</classname></link> with type <literal>'NOTATION'</literal>. You can use the notation to decide how to treat non-SGML data (such as graphics). An object belonging to the <classname>SGMLS_Notation</classname> class will have access to the methods listed in table <xref linkend=table.class.sgmls.notation>.</para> <table id=table.class.sgmls.notation> <title>The <classname>SGMLS_Notation class</classname></title> <tgroup cols=3> <thead> <row> <entry>Method</entry> <entry>Return Type</entry> <entry>Description</entry> </row> </thead> <tbody> <row> <entry><command>name</command></entry> <entry>string</entry> <entry>The notation's name.</entry> </row> <row> <entry><command>sysid</command></entry> <entry>string</entry> <entry>The notation's system identifier.</entry> </row> <row> <entry><command>pubid</command></entry> <entry>string</entry> <entry>The notation's public identifier.</entry> </row> </tbody> </tgroup> </table> <para>What you do with this information is <emphasis>entirely</emphasis> up to you.</para> </sect1> <sect1 id=xtrainfo> <title>Is there any extra information available from the &sgml; document?</title> <para>The <link linkend=sgmls><classname>SGMLS</classname></link> object which you created at the beginning of the parse has several methods available in addition to <command>next_event</command> — you will find them all listed in table <xref linkend=table.class.sgmls.extra>. There should normally be no need to use the <command>notation</command> and <command>entity</command> methods, since &sgmls.pm; will look up entities and notations for you automatically as needed.</para> <table id=table.class.sgmls.extra> <title>Additional methods for the <classname>SGMLS</classname> class</title> <tgroup cols=3> <thead> <row role=label> <entry>Method</entry> <entry>Return Type</entry> <entry>Description</entry> </row> </thead> <tbody> <row> <entry><command>next_event</command></entry> <entry><classname>SGMLS_Event</classname></entry> <entry>Return the next event.</entry> </row> <row> <entry><command>appinfo</command></entry> <entry>string</entry> <entry>Return the APPINFO parameter from the &sgml; declaration, if any.</entry> </row> <row> <entry><command>notation(<parameter>nname</parameter>)</command></entry> <entry><classname>SGMLS_Notation</classname></entry> <entry>Look up a notation by name.</entry> </row> <row> <entry><command>entity(<parameter>ename</parameter>)</command></entry> <entry><classname>SGMLS_Entity</classname></entry> <entry>Look up an entity by name.</entry> </row> </tbody> </tgroup> </table> </sect1> <sect1 id=example> <title>How about a simple example?</title> <para>OK. The following script simply reports its events:</para> <programlisting> &sample.program; </programlisting> <para>To use it under Unix, try something like</para> <programlisting> sgmls document.sgml | perl sample.pl </programlisting> <para>and watch the output scroll down.</para> </sect1> <sect1 id=extend> <title>How do I design my <emphasis>own</emphasis> classes?</title> <para>In addition to the methods listed above, all of the classes used in &sgmls.pm; have an <command>ext</command> method which returns a reference to an initially-empty hash table. You are free to use this hash table to store <emphasis>anything</emphasis> you want — it should be especially useful if you are building your own, derived classes from the ones provided here. The following example builds a derived class <classname>My_Element</classname> from the <link linkend=sgmlselement><classname>SGMLS_Element</classname></link> class, adding methods to set and get the current font:</para> <programlisting> use SGMLS; package My_Element; @ISA = qw(SGMLS_Element); sub new { my ($class,$element,$font) = @_; $element->ext->{'font'} = $font; return bless $element; } sub get_font { my ($self) = @_; return $self->ext->{'font'}; } sub set_font { my ($self,$font) = @_; $self->ext->{'font'} = $font; } </programlisting> <para>Note that the derived class does not need to have any knowledge about the underlying structure of the <link linkend=sgmlselement><classname>SGMLS_Element</classname></link> class, and need only avoid shadowing any of the methods currently existing there.</para> <para>If you decide to create a derived class from the <link linkend=sgmls><classname>SGMLS</classname></link>, please note that in addition to the methods listed above, that class uses internal methods named <command>element</command>, <command>line</command>, and <command>file</command>, similar to the same methods in <link linkend=sgmlsevent><classname>SGMLS_Event</classname></link> — it is essential that you not shadow these method names.</para> </sect1> <sect1 id=bugs> <title>Are there any bugs?</title> <para>Of course! Right now, &sgmls.pm; silently ignores link attributes (&nsgmls; only) and data attributes, and there may be many other bugs which I have not yet found.</para> </sect1> </article> <!-- Keep this comment at the end of the file Local variables: mode: sgml sgml-declaration:"/usr/local/lib/sgml/sgmldecl/docbook.dcl" End: -->