<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">


<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    
    <title>Date Parsing &mdash; feedparser 5.1.3 documentation</title>
    
    <link rel="stylesheet" href="_static/default.css" type="text/css" />
    <link rel="stylesheet" href="_static/pygments.css" type="text/css" />
    <link rel="stylesheet" href="_static/feedparser.css" type="text/css" />
    
    <script type="text/javascript">
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    '',
        VERSION:     '5.1.3',
        COLLAPSE_INDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true
      };
    </script>
    <script type="text/javascript" src="_static/jquery.js"></script>
    <script type="text/javascript" src="_static/underscore.js"></script>
    <script type="text/javascript" src="_static/doctools.js"></script>
    <link rel="top" title="feedparser 5.1.3 documentation" href="index.html" />
    <link rel="up" title="Advanced Features" href="advanced.html" />
    <link rel="next" title="Sanitization" href="html-sanitization.html" />
    <link rel="prev" title="Advanced Features" href="advanced.html" /> 
  </head>
  <body>
    <div class="related">
      <h3>Navigation</h3>
      <ul>
        <li class="right" style="margin-right: 10px">
          <a href="genindex.html" title="General Index"
             accesskey="I">index</a></li>
        <li class="right" >
          <a href="html-sanitization.html" title="Sanitization"
             accesskey="N">next</a> |</li>
        <li class="right" >
          <a href="advanced.html" title="Advanced Features"
             accesskey="P">previous</a> |</li>
        <li><a href="index.html">feedparser 5.1.3 documentation</a> &raquo;</li>
          <li><a href="advanced.html" accesskey="U">Advanced Features</a> &raquo;</li> 
      </ul>
    </div>  

    <div class="document">
      <div class="documentwrapper">
        <div class="bodywrapper">
          <div class="body">
            
  <div class="section" id="date-parsing">
<span id="advanced-date"></span><h1>Date Parsing<a class="headerlink" href="#date-parsing" title="Permalink to this headline">¶</a></h1>
<p>Different feed types and versions use very different date formats.
<strong class="program">Universal Feed Parser</strong> attempts to auto-detect the date format
used in any date element and parses it into a standard <strong class="program">Python</strong>
9-tuple in UTC, as documented in <a class="reference external" href="http://docs.python.org/lib/module-time.html">the Python time module</a>.</p>
<p>The following elements are parsed as dates:</p>
<ul class="simple">
<li><a class="reference internal" href="reference-feed-updated.html#reference-feed-updated"><em>feed.updated</em></a> is parsed into <a class="reference internal" href="reference-feed-updated_parsed.html#reference-feed-updated-parsed"><em>feed.updated_parsed</em></a>.</li>
<li><a class="reference internal" href="reference-entry-published.html#reference-entry-published"><em>entries[i].published</em></a> is parsed into <a class="reference internal" href="reference-entry-published_parsed.html#reference-entry-published-parsed"><em>entries[i].published_parsed</em></a>.</li>
<li><a class="reference internal" href="reference-entry-updated.html#reference-entry-updated"><em>entries[i].updated</em></a> is parsed into <a class="reference internal" href="reference-entry-updated_parsed.html#reference-entry-updated-parsed"><em>entries[i].updated_parsed</em></a>.</li>
<li><a class="reference internal" href="reference-entry-created.html#reference-entry-created"><em>entries[i].created</em></a> is parsed into <a class="reference internal" href="reference-entry-created_parsed.html#reference-entry-created-parsed"><em>entries[i].created_parsed</em></a>.</li>
<li><a class="reference internal" href="reference-entry-expired.html#reference-entry-expired"><em>entries[i].expired</em></a> is parsed into <a class="reference internal" href="reference-entry-expired_parsed.html#reference-entry-expired-parsed"><em>entries[i].expired_parsed</em></a>.</li>
</ul>
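<p>As a minimal sketch of working with such a 9-tuple, the standard library alone can
turn it into an epoch timestamp or a timezone-aware <tt class="docutils literal"><span class="pre">datetime</span></tt>.  The tuple below is
typed in by hand for illustration (it is not produced by a real feed), and the variable
names are assumptions.</p>

```python
import calendar
import time
from datetime import datetime, timezone

# A 9-tuple in the shape this page describes: 2003-12-31 18:14:55 UTC.
parsed = time.struct_time((2003, 12, 31, 18, 14, 55, 2, 365, 0))

# calendar.timegm reads the tuple as UTC; time.mktime would assume local time.
epoch_seconds = calendar.timegm(parsed)

# The first six fields are enough to build a timezone-aware datetime.
as_datetime = datetime(*parsed[:6], tzinfo=timezone.utc)
iso_string = as_datetime.isoformat()
```

<p>Because the parsed values are in UTC, <tt class="docutils literal"><span class="pre">calendar.timegm</span></tt> is the safer choice for
conversion; <tt class="docutils literal"><span class="pre">time.mktime</span></tt> would shift the result by the local offset.</p>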
<div class="section" id="history-of-date-formats">
<h2>History of Date Formats<a class="headerlink" href="#history-of-date-formats" title="Permalink to this headline">¶</a></h2>
<p>Here is a brief history of feed date formats:</p>
<ul class="simple">
<li><abbr title="Channel Definition Format">CDF</abbr> states that all date values must
conform to ISO 8601:1988.  ISO 8601:1988 is not a freely
available specification, but a brief (non-normative) description of the date
formats it describes is available here: <a class="reference external" href="http://hydracen.com/dx/iso8601.htm">ISO 8601:1988 Date/Time Representations</a>.</li>
<li><abbr title="Rich Site Summary">RSS</abbr> 0.90 has no date elements.</li>
<li>Netscape <abbr title="Rich Site Summary">RSS</abbr> 0.91 does not specify a date format,
but examples within the specification show <abbr title="Request For Comments">RFC</abbr>
822-style dates with 4-digit years.</li>
<li>Userland <abbr title="Rich Site Summary">RSS</abbr> 0.91 states, &#8220;All date-times in
<abbr title="Rich Site Summary">RSS</abbr> conform to the Date and Time Specification of
<abbr title="Request For Comments">RFC</abbr> 822.&#8221; <a class="reference external" href="http://www.ietf.org/rfc/rfc822.txt">RFC 822</a>
mandates 2-digit years; it does not allow 4-digit years.</li>
<li><abbr title="Rich Site Summary">RSS</abbr> 1.0 states that all date elements must
conform to <a class="reference external" href="http://www.w3.org/TR/NOTE-datetime">W3CDTF</a>,
which is a profile of ISO 8601:1988.</li>
<li><abbr title="Rich Site Summary">RSS</abbr> 2.0 states, &#8220;All date-times in <abbr title="Rich Site Summary">RSS</abbr> conform to the Date and Time Specification of RFC 822, with the exception that the year may be expressed with two characters or four characters (four preferred).&#8221;</li>
<li>Atom 0.3 states that all date elements must conform to
<a class="reference external" href="http://www.w3.org/TR/NOTE-datetime">W3CDTF</a>.</li>
<li>Atom 1.0 states that all date elements &#8220;MUST conform to the date-time
production in <a class="reference external" href="http://www.ietf.org/rfc/rfc3339.txt">RFC 3339</a>.
In addition, an uppercase T character MUST be used to separate date and time,
and an uppercase Z character MUST be present in the absence of a numeric time
zone offset.&#8221;</li>
</ul>
</div>
<div class="section" id="recognized-date-formats">
<h2>Recognized Date Formats<a class="headerlink" href="#recognized-date-formats" title="Permalink to this headline">¶</a></h2>
<p>Here is a representative list of the formats that <strong class="program">Universal Feed
Parser</strong> can recognize in any date element:</p>
<table border="1" class="docutils">
<colgroup>
<col width="39%" />
<col width="29%" />
<col width="32%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Description</th>
<th class="head">Example</th>
<th class="head">Parsed Value</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td>valid RFC 822 (2-digit year)</td>
<td>Thu, 01 Jan 04 19:48:21 GMT</td>
<td>(2004, 1, 1, 19, 48, 21, 3, 1, 0)</td>
</tr>
<tr class="row-odd"><td>valid RFC 822 (4-digit year)</td>
<td>Thu, 01 Jan 2004 19:48:21 GMT</td>
<td>(2004, 1, 1, 19, 48, 21, 3, 1, 0)</td>
</tr>
<tr class="row-even"><td>invalid RFC 822 (no time)</td>
<td>01 Jan 2004</td>
<td>(2004, 1, 1, 0, 0, 0, 3, 1, 0)</td>
</tr>
<tr class="row-odd"><td>invalid RFC 822 (no seconds)</td>
<td>01 Jan 2004 00:00 GMT</td>
<td>(2004, 1, 1, 0, 0, 0, 3, 1, 0)</td>
</tr>
<tr class="row-even"><td>valid W3CDTF (numeric timezone)</td>
<td>2003-12-31T10:14:55-08:00</td>
<td>(2003, 12, 31, 18, 14, 55, 2, 365, 0)</td>
</tr>
<tr class="row-odd"><td>valid W3CDTF (UTC timezone)</td>
<td>2003-12-31T10:14:55Z</td>
<td>(2003, 12, 31, 10, 14, 55, 2, 365, 0)</td>
</tr>
<tr class="row-even"><td>valid W3CDTF (yyyy)</td>
<td>2003</td>
<td>(2003, 1, 1, 0, 0, 0, 2, 1, 0)</td>
</tr>
<tr class="row-odd"><td>valid W3CDTF (yyyy-mm)</td>
<td>2003-12</td>
<td>(2003, 12, 1, 0, 0, 0, 0, 335, 0)</td>
</tr>
<tr class="row-even"><td>valid W3CDTF (yyyy-mm-dd)</td>
<td>2003-12-31</td>
<td>(2003, 12, 31, 0, 0, 0, 2, 365, 0)</td>
</tr>
<tr class="row-odd"><td>valid ISO 8601 (yyyymmdd)</td>
<td>20031231</td>
<td>(2003, 12, 31, 0, 0, 0, 2, 365, 0)</td>
</tr>
<tr class="row-even"><td>valid ISO 8601 (-yy-mm)</td>
<td>-03-12</td>
<td>(2003, 12, 1, 0, 0, 0, 0, 335, 0)</td>
</tr>
<tr class="row-odd"><td>valid ISO 8601 (-yymm)</td>
<td>-0312</td>
<td>(2003, 12, 1, 0, 0, 0, 0, 335, 0)</td>
</tr>
<tr class="row-even"><td>valid ISO 8601 (-yy-mm-dd)</td>
<td>-03-12-31</td>
<td>(2003, 12, 31, 0, 0, 0, 2, 365, 0)</td>
</tr>
<tr class="row-odd"><td>valid ISO 8601 (yymmdd)</td>
<td>031231</td>
<td>(2003, 12, 31, 0, 0, 0, 2, 365, 0)</td>
</tr>
<tr class="row-even"><td>valid ISO 8601 (yyyy-o)</td>
<td>2003-335</td>
<td>(2003, 12, 1, 0, 0, 0, 0, 335, 0)</td>
</tr>
<tr class="row-odd"><td>valid ISO 8601 (yyo)</td>
<td>03335</td>
<td>(2003, 12, 1, 0, 0, 0, 0, 335, 0)</td>
</tr>
<tr class="row-even"><td>valid asctime</td>
<td>Sun Jan  4 16:29:06 PST 2004</td>
<td>(2004, 1, 5, 0, 29, 6, 0, 5, 0)</td>
</tr>
<tr class="row-odd"><td>bogus RFC 822 (invalid day/month)</td>
<td>Thu, 31 Jun 2004 19:48:21 GMT</td>
<td>(2004, 7, 1, 19, 48, 21, 3, 183, 0)</td>
</tr>
<tr class="row-even"><td>bogus RFC 822 (invalid month)</td>
<td>Mon, 26 January 2004 16:31:00 EST</td>
<td>(2004, 1, 26, 21, 31, 0, 0, 26, 0)</td>
</tr>
<tr class="row-odd"><td>bogus RFC 822 (invalid timezone)</td>
<td>Mon, 26 Jan 2004 16:31:00 ET</td>
<td>(2004, 1, 26, 21, 31, 0, 0, 26, 0)</td>
</tr>
<tr class="row-even"><td>bogus W3CDTF (invalid hour)</td>
<td>2003-12-31T25:14:55Z</td>
<td>(2004, 1, 1, 1, 14, 55, 3, 1, 0)</td>
</tr>
<tr class="row-odd"><td>bogus W3CDTF (invalid minute)</td>
<td>2003-12-31T10:61:55Z</td>
<td>(2003, 12, 31, 11, 1, 55, 2, 365, 0)</td>
</tr>
<tr class="row-even"><td>bogus W3CDTF (invalid second)</td>
<td>2003-12-31T10:14:61Z</td>
<td>(2003, 12, 31, 10, 15, 1, 2, 365, 0)</td>
</tr>
<tr class="row-odd"><td>bogus (MSSQL)</td>
<td>2004-07-08 23:56:58.0</td>
<td>(2004, 7, 8, 14, 56, 58, 3, 190, 0)</td>
</tr>
<tr class="row-even"><td>bogus (MSSQL-ish, without fractional second)</td>
<td>2004-07-08 23:56:58</td>
<td>(2004, 7, 8, 14, 56, 58, 3, 190, 0)</td>
</tr>
<tr class="row-odd"><td>bogus (Korean)</td>
<td>2004-05-25 오후 11:23:17</td>
<td>(2004, 5, 25, 14, 23, 17, 1, 146, 0)</td>
</tr>
<tr class="row-even"><td>bogus (Greek)</td>
<td>Κυρ, 11 Ιούλ 2004 12:00:00 EST</td>
<td>(2004, 7, 11, 17, 0, 0, 6, 193, 0)</td>
</tr>
<tr class="row-odd"><td>bogus (Hungarian)</td>
<td>2004-július-13T9:15-05:00</td>
<td>(2004, 7, 13, 14, 15, 0, 1, 195, 0)</td>
</tr>
</tbody>
</table>
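<p>The parsed values in the table are all normalized to UTC, and out-of-range fields roll
over into the next unit.  The following sketch reproduces three of the rows using only the
standard library.  It is not how <strong class="program">Universal Feed Parser</strong> is implemented; the helper name
<tt class="docutils literal"><span class="pre">to_utc_tuple</span></tt> is hypothetical.</p>

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

def to_utc_tuple(moment):
    """Return a plain 9-tuple in UTC for a datetime (naive values are taken as UTC)."""
    return tuple(moment.utctimetuple())

# W3CDTF with a numeric offset: 10:14:55 at -08:00 is 18:14:55 UTC.
w3c = to_utc_tuple(datetime.fromisoformat("2003-12-31T10:14:55-08:00"))

# RFC 822 with a 4-digit year and a named zone.
rfc822 = to_utc_tuple(parsedate_to_datetime("Thu, 01 Jan 2004 19:48:21 GMT"))

# An out-of-range hour (25) rolls over into the next day when added as a duration.
rolled = to_utc_tuple(
    datetime(2003, 12, 31, tzinfo=timezone.utc)
    + timedelta(hours=25, minutes=14, seconds=55)
)
```

<p>The three results correspond to the "valid W3CDTF (numeric timezone)", "valid RFC 822
(4-digit year)" and "bogus W3CDTF (invalid hour)" rows.</p>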
<p><strong class="program">Universal Feed Parser</strong> recognizes all character-based timezone
abbreviations defined in <abbr title="Request For Comments">RFC</abbr> 822.  In addition,
<strong class="program">Universal Feed Parser</strong> recognizes the following invalid timezones:</p>
<ul class="simple">
<li><tt class="docutils literal"><span class="pre">AT</span></tt> is treated as <tt class="docutils literal"><span class="pre">AST</span></tt></li>
<li><tt class="docutils literal"><span class="pre">ET</span></tt> is treated as <tt class="docutils literal"><span class="pre">EST</span></tt></li>
<li><tt class="docutils literal"><span class="pre">CT</span></tt> is treated as <tt class="docutils literal"><span class="pre">CST</span></tt></li>
<li><tt class="docutils literal"><span class="pre">MT</span></tt> is treated as <tt class="docutils literal"><span class="pre">MST</span></tt></li>
<li><tt class="docutils literal"><span class="pre">PT</span></tt> is treated as <tt class="docutils literal"><span class="pre">PST</span></tt></li>
</ul>
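<p>One way to picture this alias handling is a lookup table applied to the trailing timezone
token before the date is parsed.  The sketch below is an assumption about the general
technique, not the library's own code; <tt class="docutils literal"><span class="pre">ZONE_ALIASES</span></tt> and <tt class="docutils literal"><span class="pre">replace_zone_alias</span></tt> are
hypothetical names, and the parsing is delegated to the standard library.</p>

```python
from email.utils import parsedate_to_datetime

# The five aliases listed above, mapped to the zone each one is treated as.
ZONE_ALIASES = {"AT": "AST", "ET": "EST", "CT": "CST", "MT": "MST", "PT": "PST"}

def replace_zone_alias(date_string):
    """Swap a trailing alias such as 'ET' for its standard-time name."""
    head, separator, zone = date_string.rpartition(" ")
    return head + separator + ZONE_ALIASES.get(zone, zone)

fixed = replace_zone_alias("Mon, 26 Jan 2004 16:31:00 ET")
utc_tuple = tuple(parsedate_to_datetime(fixed).utctimetuple())
```

<p>With <tt class="docutils literal"><span class="pre">ET</span></tt> read as <tt class="docutils literal"><span class="pre">EST</span></tt>, 16:31:00 becomes 21:31:00 UTC, which matches the "bogus RFC 822
(invalid timezone)" row in the table.</p>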
</div>
<div class="section" id="supporting-additional-date-formats">
<h2>Supporting Additional Date Formats<a class="headerlink" href="#supporting-additional-date-formats" title="Permalink to this headline">¶</a></h2>
<p><strong class="program">Universal Feed Parser</strong> supports many different date formats, but
there are probably many more in the wild that are still unsupported.  If you
find such a format, you can support it by registering a handler for it with
<tt class="docutils literal"><span class="pre">registerDateHandler</span></tt>, which takes a single argument: a callback function.  The
callback function should take a single argument, the date string, and return a single
value, a 9-tuple <strong class="program">Python</strong> date in UTC.</p>
<div class="section" id="registering-a-third-party-date-handler">
<h3>Registering a third-party date handler<a class="headerlink" href="#registering-a-third-party-date-handler" title="Permalink to this headline">¶</a></h3>
<div class="highlight-python"><div class="highlight"><pre><span class="kn">import</span> <span class="nn">feedparser</span>
<span class="kn">import</span> <span class="nn">re</span>

<span class="n">_my_date_pattern</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span>
    <span class="s">r&#39;(\d{1,2})/(\d{1,2})/(\d{4}) (\d{1,2}):(\d{2}):(\d{2})&#39;</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">myDateHandler</span><span class="p">(</span><span class="n">aDateString</span><span class="p">):</span>
    <span class="sd">&quot;&quot;&quot;parse a UTC date in MM/DD/YYYY HH:MM:SS format&quot;&quot;&quot;</span>
    <span class="n">month</span><span class="p">,</span> <span class="n">day</span><span class="p">,</span> <span class="n">year</span><span class="p">,</span> <span class="n">hour</span><span class="p">,</span> <span class="n">minute</span><span class="p">,</span> <span class="n">second</span> <span class="o">=</span> \
        <span class="n">_my_date_pattern</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">aDateString</span><span class="p">)</span><span class="o">.</span><span class="n">groups</span><span class="p">()</span>
    <span class="k">return</span> <span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">year</span><span class="p">),</span> <span class="nb">int</span><span class="p">(</span><span class="n">month</span><span class="p">),</span> <span class="nb">int</span><span class="p">(</span><span class="n">day</span><span class="p">),</span> \
        <span class="nb">int</span><span class="p">(</span><span class="n">hour</span><span class="p">),</span> <span class="nb">int</span><span class="p">(</span><span class="n">minute</span><span class="p">),</span> <span class="nb">int</span><span class="p">(</span><span class="n">second</span><span class="p">),</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>

<span class="n">feedparser</span><span class="o">.</span><span class="n">registerDateHandler</span><span class="p">(</span><span class="n">myDateHandler</span><span class="p">)</span>
<span class="n">d</span> <span class="o">=</span> <span class="n">feedparser</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="o">...</span><span class="p">)</span>
</pre></div>
</div>
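<p>The handler above leaves the weekday and day-of-year fields at 0.  As an alternative
sketch, the same MM/DD/YYYY HH:MM:SS format can be parsed with <tt class="docutils literal"><span class="pre">time.strptime</span></tt>, which
computes those fields and raises <tt class="docutils literal"><span class="pre">ValueError</span></tt> when the string does not match.  The
function name <tt class="docutils literal"><span class="pre">strptime_date_handler</span></tt> is hypothetical, and the registration call is left
as a comment so the snippet runs without the library installed.</p>

```python
import time

def strptime_date_handler(date_string):
    """Parse a UTC date in MM/DD/YYYY HH:MM:SS format with time.strptime."""
    # strptime raises ValueError on a mismatch and fills in weekday and day of year.
    return time.strptime(date_string, "%m/%d/%Y %H:%M:%S")

example = strptime_date_handler("12/31/2003 18:14:55")

# Registration would follow the same pattern as the example above:
# feedparser.registerDateHandler(strptime_date_handler)
```

<p>The returned <tt class="docutils literal"><span class="pre">time.struct_time</span></tt> is itself a 9-tuple; its last field (<tt class="docutils literal"><span class="pre">tm_isdst</span></tt>) is -1
because the format string carries no daylight-saving information.</p>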
<p>Your newly-registered date handler will be tried before all the other date
handlers built into <strong class="program">Universal Feed Parser</strong>.  (More specifically, all
date handlers are tried in &#8220;last in, first out&#8221; order; i.e. the last handler to
be registered is the first one tried, and so on in reverse order of
registration.)</p>
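<p>The "last in, first out" order can be pictured as a list in which each new handler is
inserted at the front.  This is an illustrative sketch only; <tt class="docutils literal"><span class="pre">registered_handlers</span></tt> and
<tt class="docutils literal"><span class="pre">register_handler</span></tt> are hypothetical names, not part of the library's API.</p>

```python
# Hypothetical registry for illustration; newest handlers sit at the front.
registered_handlers = []

def register_handler(handler):
    """Put the handler first so it is tried before earlier registrations."""
    registered_handlers.insert(0, handler)

def builtin_handler(date_string):
    return None

def custom_handler(date_string):
    return None

register_handler(builtin_handler)
register_handler(custom_handler)

# Handlers are tried in list order, so the most recent registration comes first.
trial_order = [handler.__name__ for handler in registered_handlers]
```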
<p>If your date handler returns <tt class="docutils literal"><span class="pre">None</span></tt>, or anything other than a
<strong class="program">Python</strong> 9-tuple date, or raises an exception of any kind, the error
will be silently ignored and the other registered date handlers will be tried
in order.  If no date handlers succeed, then the date is not parsed, and the
*_parsed value will not be present in the results dictionary.  The original
date string will still be available in the appropriate element in the results
dictionary.</p>
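<p>The fallback behavior described above can be sketched as a loop that skips any handler
that fails or declines, and adds the parsed key only on success.  All names here
(<tt class="docutils literal"><span class="pre">first_successful_parse</span></tt>, <tt class="docutils literal"><span class="pre">build_entry</span></tt>, the three handlers) are hypothetical, and a plain
dictionary stands in for the real results object.</p>

```python
def first_successful_parse(date_string, handlers):
    """Return the first 9-tuple any handler produces, or None if all fail."""
    for handler in handlers:
        try:
            result = handler(date_string)
        except Exception:
            continue  # a failing handler is skipped, not reported
        if isinstance(result, tuple) and len(result) == 9:
            return result
    return None

def broken_handler(date_string):
    raise ValueError("cannot parse")

def declining_handler(date_string):
    return None

def fixed_handler(date_string):
    if date_string != "new year 2004":
        return None
    return (2004, 1, 1, 0, 0, 0, 3, 1, 0)

handlers = [broken_handler, declining_handler, fixed_handler]

def build_entry(raw_date):
    """Keep the raw string; add the parsed key only on success."""
    entry = {"updated": raw_date}
    parsed = first_successful_parse(raw_date, handlers)
    if parsed is not None:
        entry["updated_parsed"] = parsed
    return entry

good_entry = build_entry("new year 2004")
bad_entry = build_entry("sometime soon")
```

<p>Code that consumes parsed feeds should therefore check for the <tt class="docutils literal"><span class="pre">*_parsed</span></tt> key before
using it, and fall back to the raw string when the key is absent.</p>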
<div class="admonition tip">
<p class="first admonition-title">Tip</p>
<p class="last">If you write a new date handler, you are encouraged (but not required) to
<a class="reference external" href="http://sourceforge.net/projects/feedparser/">submit a patch</a> so it can be
integrated into the next version of <strong class="program">Universal Feed Parser</strong>.</p>
</div>
</div>
</div>
</div>


          </div>
        </div>
      </div>
      <div class="sphinxsidebar">
        <div class="sphinxsidebarwrapper">
  <h3><a href="index.html">Table Of Contents</a></h3>
  <ul>
<li><a class="reference internal" href="#">Date Parsing</a><ul>
<li><a class="reference internal" href="#history-of-date-formats">History of Date Formats</a></li>
<li><a class="reference internal" href="#recognized-date-formats">Recognized Date Formats</a></li>
<li><a class="reference internal" href="#supporting-additional-date-formats">Supporting Additional Date Formats</a><ul>
<li><a class="reference internal" href="#registering-a-third-party-date-handler">Registering a third-party date handler</a></li>
</ul>
</li>
</ul>
</li>
</ul>

  <h4>Previous topic</h4>
  <p class="topless"><a href="advanced.html"
                        title="previous chapter">Advanced Features</a></p>
  <h4>Next topic</h4>
  <p class="topless"><a href="html-sanitization.html"
                        title="next chapter">Sanitization</a></p>
  <h3>This Page</h3>
  <ul class="this-page-menu">
    <li><a href="_sources/date-parsing.txt"
           rel="nofollow">Show Source</a></li>
  </ul>
        </div>
      </div>
      <div class="clearer"></div>
    </div>
    <div class="footer">
        &copy; Copyright 2004-2008 Mark Pilgrim, 2010-2012 Kurt McKee.
      Created using <a href="http://sphinx.pocoo.org/">Sphinx</a> 1.1.3.
    </div>
  </body>
</html>