<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
<title>RFC Date Parser Documentation</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<link rel="stylesheet" href="../../../doc/theme/style.css" type="text/css">
</head>
<body>
<div class="logo"><a href="http://spirit.sourceforge.net/"><img src="../../../doc/theme/spirit.gif" alt="[Home]"></a></div>
<h1>RFC Date Parser Documentation</h1>
<h3>Peter Simons <a href="mailto:simons@computer.org">&lt;simons@computer.org&gt;</a></h3>
<hr>
<h3>Introduction</h3>
<p>The standard C library environment provides various routines that will format a time from the internal binary representation to a textual representation. What it lacks, though, is a routine that does the opposite: parsing a textual date and time representation into a binary representation. This is exactly what the parser provided in this example does.</p>
<p>The <code>rfcdate_parser</code> class understands all date and time specifications described in section 5 of <a href="http://rfc.fh-koeln.de/rfc/html_gz/rfc822.html.gz">RFC 822</a>. This does not include many other popular syntaxes, such as the ISO format, the ASN.1 format, etc., but using this class as an example, it should be trivial to write appropriate parsers for these formats as well.</p>
<p>Aside from being outright useful, the <code>rfcdate_parser</code> class is intended to serve as a (more complex) example of how to use the Spirit parser framework. Thus, this document has been written more as a tutorial than as reference documentation, in the hope that it will help new users understand how to apply Spirit to similar problems.</p>
<h3>RFC Date and Time Specifications</h3>
<p>The RFC format for date and time specifications was originally defined in <a href="http://rfc.fh-koeln.de/rfc/html_gz/rfc822.html.gz">RFC 822</a> and has since been reused in many RFC formats and protocols.
The exact specification in the RFC's <q>augmented BNF</q> is as follows:</p>
<pre>date-time = [ day "," ] date time
day       = "Mon" | "Tue" | "Wed" | "Thu" | "Fri" | "Sat" | "Sun"
date      = 1*2DIGIT month 2DIGIT
month     = "Jan" | "Feb" | "Mar" | "Apr" | "May" | "Jun"
          | "Jul" | "Aug" | "Sep" | "Oct" | "Nov" | "Dec"
time      = hour zone
hour      = 2DIGIT ":" 2DIGIT [":" 2DIGIT]
zone      = "UT" | "GMT" | "EST" | "EDT" | "CST" | "CDT"
          | "MST" | "MDT" | "PST" | "PDT" | 1ALPHA
          | ( ("+" | "-") 4DIGIT )</pre>
<p>The syntax actually understood by the <code>rfcdate_parser</code> class differs from this grammar in three respects:</p>
<ol>
<li>The year may consist of <em>at least</em> 2 digits, not exactly 2.</li>
<li>The <code>time</code> rule is optional. If omitted, <q>00:00</q> is assumed.</li>
<li>Within the <code>time</code> rule, the <code>zone</code> rule is optional. If omitted, <q>UTC</q> is assumed.</li>
</ol>
<p>Concerning the specification of the date's year: a two-digit year <q><var>XY</var></q> is interpreted as <q>19<var>XY</var></q>; everything else is taken literally. Hence, the parser will understand a date such as <q>1 Jan 1312</q>, even though your system is probably not able to handle that date correctly, because it cannot be expressed as a <code>time_t</code> (seconds since 1 Jan 1970). Thus, be careful to check for errors when dealing with such dates.</p>
<p>At first sight, the grammar doesn't look too unreasonable, but unfortunately, a few sections earlier, the RFC states that any atom may be delimited by either white space (space or tab), continued linear white space (carriage return + newline + white space), or comments (pretty much anything in parentheses). Furthermore, comments may nest, any character may be escaped, and so on and so forth.
In effect, this means that the rather sane input</p>
<pre>12 Jun 82</pre>
<p>is identical to the rather insane input:</p>
<pre>12 (\(( This is a nested comment\), still), and still) Jun (hehe) 82</pre>
<p>Of course, it is almost impossible to specify an EBNF that parses such a thing, which is exactly why the RFC does not try to, and why most parsers do not, in fact, handle it.</p>
<p>Using Spirit, though, parsing this beast is astonishingly easy; you just have to split the functionality into an actual parser and a skipper. If you want to find out how, read on ...</p>
<h3>The <code>rfc_skipper</code> class</h3>
<p>The most complicated part of parsing <em>anything</em> that is based on the grammar defined in RFC 822 is the crazy comment and line continuation syntax. Once you have that out of the way, the rest is rather simple. Fortunately, Spirit provides a great mechanism that solves this problem altogether for us: the skipper. A skipper is basically a parser that will be applied every time a token of the actual grammar has matched. If the skipper matches the input following the token, all matching characters will be skipped. That is, the real parser will not see them.</p>
<p>Thus, if you want to parse a sequence of numbers separated by blanks, like this:</p>
<pre>input = number ( " " number )*</pre>
<p>you can either write the parser accordingly, expecting those blanks, or you can say</p>
<pre>input = number ( number )*</pre>
<p>and combine it with a skipper that will match a blank, such as <code>spirit::space_p</code>. (By the way: if you want to disable the skipper in certain parts of the grammar that have to be parsed literally, you can wrap them in a <code>spirit::lexeme_d</code> directive.)</p>
<p>Thus, once we have a skipper that skips all that comment-junk for us, parsing the actual contents will be much easier.
Here is the code:</p>
<pre>struct rfc_skipper : public spirit::grammar&lt;rfc_skipper&gt;
{
    rfc_skipper() { }

    template&lt;typename scannerT&gt;
    struct definition
    {
        definition(const rfc_skipper&amp; self)
        {
            using namespace spirit;
            first =
            (
                // Everything that may appear between two tokens.
                junk    = lwsp | comment,
                // White space, optionally preceded by a line break (continued line).
                lwsp    = +( !str_p("\r\n") &gt;&gt; chset_p(" \t") ),
                // A parenthesized comment; comments may nest.
                comment = ch_p('(') &gt;&gt; *( lwsp | ctext | qpair | comment ) &gt;&gt; ')',
                // Any character except parentheses, backslash, and carriage return.
                ctext   = anychar_p - chset_p("()\\\r"),
                // A character quoted with a backslash.
                qpair   = ch_p('\\') &gt;&gt; anychar_p
            );
        }

        const spirit::rule&lt;scannerT&gt;&amp; start() const { return first; }

        spirit::subrule&lt;0&gt; junk;
        spirit::subrule&lt;1&gt; lwsp;
        spirit::subrule&lt;2&gt; comment;
        spirit::subrule&lt;3&gt; ctext;
        spirit::subrule&lt;4&gt; qpair;
        spirit::rule&lt;scannerT&gt; first;
    };
};

const rfc_skipper rfc_skipper_p;</pre>
<p>As you can see, the skipper will match anything that is not an actual token according to the RFC:</p>
<dl>
<dt>linear white space</dt>
<dd>The name <q>linear white space</q> is somewhat misleading. The RFC defines this as an end of line (<tt>\r\n</tt>) followed by at least one white space character (space or <tt>\t</tt>). This is also known as a <q>continued line</q>.</dd>
<dt>comments</dt>
<dd>A comment is basically <em>any</em> text enclosed in parentheses. Comments may also nest. Thus, if you want to specify a literal <tt>(</tt> or <tt>)</tt> character in a comment, you'll have to quote it with a backslash character (<tt>\</tt>). Obviously, a literal <tt>\</tt> character will have to be quoted as well. Comments may not contain a single carriage return (<tt>\r</tt>), but they <em>may</em> contain linear white space.</dd>
<dt>white space</dt>
<dd>Of course, ordinary white space such as a blank or <tt>\t</tt> may also be used to separate tokens.</dd>
</dl>
<p>Having a generic RFC-style skipper available is, by the way, useful for much more than just parsing dates! Consider the case where you want to know the actual address stated in an e-mail or a news posting.
Then you could use the mini-parser</p>
<pre>char weird[] =
    "From: (Some \r\n"
    " comment) simons (stuff) \r\n"
    " @ computer (inserted) . (between) org(tokens)";
string output;
parse(weird, ( str_p("From:") &gt;&gt; *( anychar_p [append(output)] ) ), rfc_skipper_p);
cout &lt;&lt; "Stripped address is: '" &lt;&lt; output &lt;&lt; "'" &lt;&lt; endl;
assert(output == "simons@computer.org");</pre>
<p>to get rid of the comments and white space, the result being an address in its canonical representation. (Of course, RFC 822 address lines are much more complicated than this ... Consider this merely an example.)</p>
<h3>The <code>rfcdate_parser</code> class (and helpers)</h3>
<p></p>
</body>
</html>
<!-- Local Variables: mode: sgml fill-column:95 End: -->