<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
  <head>
    <title>RFC Date Parser Documentation</title>
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
    <link rel="stylesheet" href="../../../doc/theme/style.css" type="text/css">
  </head>
  <body>
    <div class="logo"><a href="http://spirit.sourceforge.net/"><img
          src="../../../doc/theme/spirit.gif" alt="[Home]"></a></div>
    <h1>RFC Date Parser Documentation</h1>
    <h3>Peter Simons <a href="mailto:simons@computer.org">&lt;simons@computer.org&gt;</a></h3>
    <hr>

    <h3>Introduction</h3>

    <p>The standard C library environment provides various routines that will format a time
from the internal binary representation to a textual representation. What it lacks, though, is
a routine that does the opposite: Parsing a textual date and time representation into a binary
representation. This is exactly what the parser provided in this example does.</p>

    <p>The <code>rfcdate_parser</code> class understands all date and time specifications
described in section&nbsp;5 of <a
href="http://rfc.fh-koeln.de/rfc/html_gz/rfc822.html.gz">RFC&nbsp;822</a>. This does not
include many other popular syntaxes, such as the ISO format, the ASN.1 format, etc., but using
this class as an example, it should be trivial to write appropriate parsers for these formats
as well.</p>

    <p>Aside from being outright useful, the <code>rfcdate_parser</code> class is intended to
serve as a (more complex) example of how to use the Spirit parser framework. Thus, this
document has been written more as a tutorial than as reference documentation, in the hope
that it will help new users understand how to apply Spirit to similar problems.</p>

    <h3>RFC Date and Time Specifications</h3>

    <p>The RFC format for date and time specifications was originally defined in <a
href="http://rfc.fh-koeln.de/rfc/html_gz/rfc822.html.gz">RFC&nbsp;822</a> and has since been
reused in many RFC formats and protocols. The exact specification in the RFC's <q>augmented
BNF</q> is as follows:</p>

    <pre>date-time   =  [ day "," ] date time

day         =  "Mon"  | "Tue" |  "Wed"  | "Thu"
            |  "Fri"  | "Sat" |  "Sun"

date        =  1*2DIGIT month 2DIGIT

month       =  "Jan"  |  "Feb" |  "Mar"  |  "Apr"
            |  "May"  |  "Jun" |  "Jul"  |  "Aug"
            |  "Sep"  |  "Oct" |  "Nov"  |  "Dec"

time        =  hour zone

hour        =  2DIGIT ":" 2DIGIT [":" 2DIGIT]

zone        =  "UT"  | "GMT"
            |  "EST" | "EDT"
            |  "CST" | "CDT"
            |  "MST" | "MDT"
            |  "PST" | "PDT"
            |  1ALPHA
            | ( ("+" | "-") 4DIGIT )</pre>

    <p>The syntax actually understood by the <code>rfcdate_parser</code> class differs from this
grammar in three respects:</p>

    <ol>
      <li>The year may consist of <em>at least</em> 2 digits, not exactly 2.</li>

      <li>The <code>time</code> rule is optional. If omitted, <q>00:00</q> is assumed.</li>

      <li>Within the <code>time</code> rule, the <code>zone</code> rule is optional. If
omitted, <q>UTC</q> is assumed.</li>
    </ol>
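
    <p>To illustrate the <code>zone</code> rule and the third deviation, here is a minimal
sketch in plain C++ that maps a zone string to an offset from UTC in minutes. The function name
<code>zone_offset_minutes</code> is hypothetical and not part of <code>rfcdate_parser</code>;
the offsets for the named zones are those given in RFC&nbsp;822, and the single-letter
(<code>1ALPHA</code>) zones are deliberately left out.</p>

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of rfcdate_parser: converts an RFC 822 zone
// string into an offset from UTC in minutes. Returns false for any zone this
// sketch does not handle.
bool zone_offset_minutes(const std::string& zone, int& minutes)
{
    // Deviation 3 in the list above: a missing zone is treated as UTC.
    if (zone.empty() || zone == "UT" || zone == "GMT") {
        minutes = 0;
        return true;
    }

    // The named North American zones from the grammar, as whole hours.
    static const struct { const char* name; int hours; } named[] = {
        {"EST", -5}, {"EDT", -4}, {"CST", -6}, {"CDT", -5},
        {"MST", -7}, {"MDT", -6}, {"PST", -8}, {"PDT", -7}
    };
    for (const auto& z : named) {
        if (zone == z.name) {
            minutes = z.hours * 60;
            return true;
        }
    }

    // Numeric form: ("+" | "-") 4DIGIT, read as a sign followed by HHMM.
    if (zone.size() == 5 && (zone[0] == '+' || zone[0] == '-')) {
        for (std::size_t i = 1; i < 5; ++i) {
            if (!std::isdigit(static_cast<unsigned char>(zone[i])))
                return false;
        }
        const int hh = (zone[1] - '0') * 10 + (zone[2] - '0');
        const int mm = (zone[3] - '0') * 10 + (zone[4] - '0');
        minutes = (zone[0] == '-' ? -1 : 1) * (hh * 60 + mm);
        return true;
    }

    // Single-letter military zones (1ALPHA) are not handled in this sketch.
    return false;
}
```

    <p>A caller would subtract the resulting offset from the parsed local time to obtain
UTC.</p>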

    <p>Concerning the specification of the date's year: A two-digit year <q><var>XY</var></q>
is interpreted as <q>19<var>XY</var></q>; everything else is taken literally. Hence, the parser
will understand a date such as <q>1 Jan 1312</q>, even though your system is probably not able
to handle that date correctly, because it cannot be expressed as a <code>time_t</code> (seconds
since 1 Jan 1970). So be careful to check for errors when dealing with such dates.</p>
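
    <p>The year rule just described can be sketched in a few lines of plain C++. The name
<code>expand_year</code>, the <code>-1</code> error value, and the nine-digit cap are
assumptions made for this illustration; they are not taken from the
<code>rfcdate_parser</code> source.</p>

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: turns the digits of the year field into a calendar
// year. Returns -1 if the field is malformed.
int expand_year(const std::string& digits)
{
    // At least two digits are required. The upper bound is an arbitrary cap
    // chosen for this sketch so that the result always fits into an int.
    if (digits.size() < 2 || digits.size() > 9)
        return -1;

    int year = 0;
    for (char c : digits) {
        if (!std::isdigit(static_cast<unsigned char>(c)))
            return -1;
        year = year * 10 + (c - '0');
    }

    // A two-digit year XY means 19XY; anything longer is taken literally.
    return digits.size() == 2 ? 1900 + year : year;
}
```

    <p>Note that a literal year such as 1312 passes through unchanged, so whatever converts
the result to a <code>time_t</code> still has to check for failure.</p>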

    <p>At first sight, the grammar doesn't look too unreasonable, but unfortunately, a few
sections earlier, the RFC states that any atom may be delimited by either white space (space or
tab), continued linear white space (carriage return + newline + white space), or comments
(pretty much anything in parentheses). Furthermore, comments may nest, any character may be
escaped, and so on and so forth. In effect, this means that the rather sane input</p>

    <pre>12 Jun 82</pre>

    <p>is identical to the rather insane input:</p>

    <pre>12 (\((
 This is a nested
    comment\), still), and still)
 Jun            (hehe)                       82</pre>

    <p>Of course, it is almost impossible to specify an EBNF that parses such a thing -- which
is exactly why the RFC does not attempt it and why most parsers do not, in fact, handle it.</p>

    <p>Using Spirit, though, parsing this beast is astonishingly easy; you just have to split
the functionality into an actual parser and a skipper. If you want to find out how, read
on&nbsp;...</p>

    <h3>The <code>rfc_skipper</code> class</h3>

    <p>The most complicated part of parsing <em>anything</em> that is based on the grammar
defined in RFC&nbsp;822 is the crazy comment and line continuation syntax. Once you have that out of
your way, the rest is rather simple. Fortunately, Spirit provides a great mechanism that solves
this problem altogether for us: The skipper. A skipper is basically a parser that will be
applied every time a token of the actual grammar has matched. If the skipper matches the input
following the token, all matching characters will be skipped. That is, the real parser will not
see them.</p>

    <p>Thus, if you want to parse a sequence of numbers separated by blanks, like this:</p>

    <pre>input = number ( " " number )*</pre>

    <p>you can either write the parser accordingly, expecting those blanks, or you can
say</p>

    <pre>input = number ( number )*</pre>

    <p>and combine it with a skipper that will match a blank, such as
<code>spirit::space_p</code>. (By the way: If you want to disable the skipper in certain parts
of the grammar that have to be parsed literally, you can wrap them in a
<code>spirit::lexeme_d</code> directive.)</p>
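
    <p>The division of labour between parser and skipper does not depend on Spirit. As a rough
illustration, the following plain C++ sketch implements <code>input = number ( number )*</code>
and takes the skipper as a separate function. All names here (<code>skipper_fn</code>,
<code>skip_blanks</code>, <code>parse_numbers</code>) are hypothetical, and the sketch simply
runs the skipper before the first token and after every token, which only approximates what
Spirit's scanner does.</p>

```cpp
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// A skipper takes the input and a position and returns the position that
// follows any characters the real parser should never see.
using skipper_fn = std::function<std::size_t(const std::string&, std::size_t)>;

// Simplest possible skipper: skips ordinary white space.
std::size_t skip_blanks(const std::string& s, std::size_t pos)
{
    while (pos < s.size() && std::isspace(static_cast<unsigned char>(s[pos])))
        ++pos;
    return pos;
}

// input = number ( number )*
// The grammar itself never mentions blanks; that is the skipper's job.
bool parse_numbers(const std::string& s, const skipper_fn& skip,
                   std::vector<unsigned long>& out)
{
    std::size_t pos = skip(s, 0);
    if (pos == s.size())
        return false;                       // at least one number is required

    while (pos < s.size()) {
        if (!std::isdigit(static_cast<unsigned char>(s[pos])))
            return false;                   // not a number token
        unsigned long value = 0;
        while (pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos]))) {
            value = value * 10 + static_cast<unsigned long>(s[pos] - '0');
            ++pos;
        }
        out.push_back(value);
        pos = skip(s, pos);                 // drop whatever separates the tokens
    }
    return true;
}
```

    <p>Swapping <code>skip_blanks</code> for a more capable skipper changes what may appear
between the numbers without touching <code>parse_numbers</code> at all.</p>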

    <p>Thus, once we have a skipper that skips all that comment-junk for us, parsing the actual
contents will be much easier. Here is the code:</p>

    <pre>struct rfc_skipper : public spirit::grammar&lt;rfc_skipper&gt;
    {
    rfc_skipper()
        {
        }
    template&lt;typename scannerT&gt;
    struct definition
        {
        definition(const rfc_skipper&amp; self)
            {
            using namespace spirit;

            first =
                (
                    junk    = lwsp | comment,

                    lwsp    = +(    !str_p("\r\n")
                                    &gt;&gt; chset_p(" \t")
                               ),

                    comment =  ch_p('(')
                               &gt;&gt;  *(   lwsp
                                    |   ctext
                                    |   qpair
                                    |   comment
                                    )
                               &gt;&gt; ')',

                    ctext   =  anychar_p - chset_p("()\\\r"),

                    qpair   =  ch_p('\\') &gt;&gt; anychar_p
                );
            }
        const spirit::rule&lt;scannerT&gt;&amp; start() const
            {
            return first;
            }
        spirit::subrule&lt;0&gt;     junk;
        spirit::subrule&lt;1&gt;     lwsp;
        spirit::subrule&lt;2&gt;     comment;
        spirit::subrule&lt;3&gt;     ctext;
        spirit::subrule&lt;4&gt;     qpair;
        spirit::rule&lt;scannerT&gt; first;
        };
    };
const rfc_skipper rfc_skipper_p;</pre>

    <p>As you can see, the skipper will match anything that is not an actual token according to
the RFC:</p>

    <dl>
      <dt>linear white space</dt>
      <dd>The name <q>linear white space</q> is somewhat misleading. The RFC defines this as an
end of line (<tt>\r\n</tt>) followed by at least one white space character (space or
<tt>\t</tt>). This is also known as a <q>continued line</q>.</dd>

      <dt>comments</dt>

      <dd>A comment is basically <em>any</em> text enclosed in parentheses. Also, comments may
nest. Thus, if you want to specify a literal <tt>(</tt> or <tt>)</tt> character in a comment,
you'll have to quote it with a backslash character (<tt>\</tt>). Obviously, a literal
<tt>\</tt> character will have to be quoted as well. Comments may not contain a single carriage
return (<tt>\r</tt>), but they <em>may</em> contain linear white space.</dd>

      <dt>white space</dt>

      <dd>Of course, ordinary white space such as the blank or the <tt>\t</tt> may also be used
to separate tokens.</dd>
    </dl>
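
    <p>For readers who want to see these three rules without any Spirit machinery, here is a
hand-written sketch in plain C++ that follows the same structure as the grammar: linear white
space, nested comments, and quoted pairs. The function names are hypothetical and the code is
an illustration of the rules, not a replacement for <code>rfc_skipper</code>. A comment that
is never closed is treated as not matching at all.</p>

```cpp
#include <cstddef>
#include <string>

// lwsp element: an optional CRLF followed by one space or tab.
// Returns the position after the match, or pos if nothing matched.
std::size_t skip_lwsp_char(const std::string& s, std::size_t pos)
{
    std::size_t p = pos;
    if (s.compare(p, 2, "\r\n") == 0)
        p += 2;
    if (p < s.size() && (s[p] == ' ' || s[p] == '\t'))
        return p + 1;
    return pos;
}

// comment: '(' *( lwsp | ctext | qpair | comment ) ')'
// Returns the position after the closing ')', or pos if there is no
// well-formed comment at pos.
std::size_t skip_comment(const std::string& s, std::size_t pos)
{
    if (pos >= s.size() || s[pos] != '(')
        return pos;
    std::size_t p = pos + 1;
    while (p < s.size()) {
        const char c = s[p];
        if (c == ')') {
            return p + 1;                       // end of this comment
        } else if (c == '(') {
            const std::size_t q = skip_comment(s, p);
            if (q == p) return pos;             // nested comment is broken
            p = q;
        } else if (c == '\\') {
            if (p + 1 >= s.size()) return pos;  // backslash quotes nothing
            p += 2;                             // qpair: backslash + any char
        } else if (c == '\r') {
            const std::size_t q = skip_lwsp_char(s, p);
            if (q == p) return pos;             // a bare CR is not allowed
            p = q;
        } else {
            ++p;                                // ctext
        }
    }
    return pos;                                 // comment was never closed
}

// junk: any run of linear white space and comments.
std::size_t skip_junk(const std::string& s, std::size_t pos)
{
    for (;;) {
        std::size_t q = skip_lwsp_char(s, pos);
        if (q == pos) q = skip_comment(s, pos);
        if (q == pos) return pos;
        pos = q;
    }
}

// Returns the input with everything the skipper matches removed.
std::string strip_junk(const std::string& s)
{
    std::string out;
    std::size_t pos = 0;
    while ((pos = skip_junk(s, pos)) < s.size())
        out += s[pos++];
    return out;
}
```

    <p>With line breaks written as <tt>\r\n</tt>, <code>strip_junk</code> reduces the
<q>rather insane</q> date shown earlier to <code>12Jun82</code>.</p>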

    <p>Having a generic RFC-style skipper available is -- by the way -- much more useful than
just for parsing dates! Consider the case where you want to know the actual address stated in
an e-mail or a news posting. Then you could use the mini-parser</p>

    <pre>char weird[] =                                                    \
    "From: (Some \r\n"                                            \
    "       comment) simons (stuff) \r\n"                         \
    "        @      computer (inserted) . (between) org(tokens)";

string output;
parse(weird, (    str_p("From:")
                  &gt;&gt; *( anychar_p [append(output)] )
             ),
      rfc_skipper_p);

cout &lt;&lt; "Stripped address is: '" &lt;&lt; output &lt;&lt; "'" &lt;&lt; endl;
assert(output == "simons@computer.org");</pre>

    <p>to get rid of the comments and white space -- the result being an address in its canonical
representation. (Of course, RFC&nbsp;822 address lines are much more complicated than this&nbsp;...
Consider this to be an example.)</p>

    <h3>The <code>rfcdate_parser</code> class (and helpers)</h3>

    <p></p>

  </body>
</html>

<!--
Local Variables:
mode: sgml
fill-column:95
End:
-->