<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">


<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    
    <title>Character Encoding Detection &mdash; feedparser 5.1.3 documentation</title>
    
    <link rel="stylesheet" href="_static/default.css" type="text/css" />
    <link rel="stylesheet" href="_static/pygments.css" type="text/css" />
    <link rel="stylesheet" href="_static/feedparser.css" type="text/css" />
    
    <script type="text/javascript">
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    '',
        VERSION:     '5.1.3',
        COLLAPSE_INDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true
      };
    </script>
    <script type="text/javascript" src="_static/jquery.js"></script>
    <script type="text/javascript" src="_static/underscore.js"></script>
    <script type="text/javascript" src="_static/doctools.js"></script>
    <link rel="top" title="feedparser 5.1.3 documentation" href="index.html" />
    <link rel="up" title="Advanced Features" href="advanced.html" />
    <link rel="next" title="Bozo Detection" href="bozo.html" />
    <link rel="prev" title="Feed Type and Version Detection" href="version-detection.html" /> 
  </head>
  <body>
    <div class="related">
      <h3>Navigation</h3>
      <ul>
        <li class="right" style="margin-right: 10px">
          <a href="genindex.html" title="General Index"
             accesskey="I">index</a></li>
        <li class="right" >
          <a href="bozo.html" title="Bozo Detection"
             accesskey="N">next</a> |</li>
        <li class="right" >
          <a href="version-detection.html" title="Feed Type and Version Detection"
             accesskey="P">previous</a> |</li>
        <li><a href="index.html">feedparser 5.1.3 documentation</a> &raquo;</li>
          <li><a href="advanced.html" accesskey="U">Advanced Features</a> &raquo;</li> 
      </ul>
    </div>  

    <div class="document">
      <div class="documentwrapper">
        <div class="bodywrapper">
          <div class="body">
            
  <div class="section" id="character-encoding-detection">
<span id="advanced-encoding"></span><h1>Character Encoding Detection<a class="headerlink" href="#character-encoding-detection" title="Permalink to this headline">¶</a></h1>
<div class="admonition tip">
<p class="first admonition-title">Tip</p>
<p class="last">Feeds may be published in any character encoding.  <strong class="program">Python</strong>
supports only a few character encodings by default.  To support the maximum
number of character encodings (and be able to parse the maximum number of
feeds), you should install <tt class="file docutils literal"><span class="pre">cjkcodecs</span></tt> and <tt class="file docutils literal"><span class="pre">iconv_codec</span></tt>.  Both are
available at <a class="reference external" href="http://cjkpython.i18n.org/">http://cjkpython.i18n.org/</a>.</p>
</div>
<p><a class="reference external" href="http://www.ietf.org/rfc/rfc3023.txt">RFC 3023</a> defines the interaction
between <abbr title="Extensible Markup Language">XML</abbr> and <abbr title="Hypertext Transfer Protocol">HTTP</abbr>
as it relates to character encoding.  <abbr title="Extensible Markup Language">XML</abbr>
and <abbr title="Hypertext Transfer Protocol">HTTP</abbr> have different ways of
specifying character encoding and different defaults in case no encoding is
specified, and determining which value takes precedence depends on a variety of
factors.</p>
<div class="section" id="introduction-to-character-encoding">
<h2>Introduction to Character Encoding<a class="headerlink" href="#introduction-to-character-encoding" title="Permalink to this headline">¶</a></h2>
<p>In <abbr title="Extensible Markup Language">XML</abbr>, the character encoding is optional
and may be given in the <abbr title="Extensible Markup Language">XML</abbr> declaration in
the first line of the document, like this:</p>
<div class="highlight-xml"><div class="highlight"><pre><span class="cp">&lt;?xml version=&quot;1.0&quot; encoding=&quot;utf-8&quot;?&gt;</span>
</pre></div>
</div>
<p>If no encoding is given, <abbr title="Extensible Markup Language">XML</abbr> supports the
use of a Byte Order Mark to identify the document as some flavor of UTF-32,
UTF-16, or UTF-8.  <a class="reference external" href="http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info">Section F of the XML specification</a>
outlines the process for determining the character encoding based on unique
properties of the Byte Order Mark in the first two to four bytes of the
document.</p>
<p>If no encoding is specified and no Byte Order Mark is present, <abbr title="Extensible Markup Language">XML</abbr>
defaults to UTF-8.</p>
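<p>As a rough illustration of the Byte Order Mark check described above, the sketch below maps a leading Byte Order Mark to an encoding name. The function name <tt class="docutils literal"><span class="pre">sniff_bom</span></tt> is hypothetical and is not part of <strong class="program">Universal Feed Parser</strong>; it covers only the Byte Order Mark cases, not the other byte patterns that Section F also describes.</p>

```python
import codecs

# Hypothetical helper for illustration only; not feedparser's own code.
# Longer marks are listed first because the UTF-32-LE mark begins with
# the same two bytes as the UTF-16-LE mark.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data):
    """Return the encoding implied by a leading Byte Order Mark, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

<p>A caller that gets <tt class="docutils literal"><span class="pre">None</span></tt> back and finds no encoding in the XML declaration would fall through to the UTF-8 default.</p>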
<p><abbr title="Hypertext Transfer Protocol">HTTP</abbr> uses <abbr>MIME</abbr> to define a method
of specifying the character encoding, as part of the Content-Type <abbr title="Hypertext Transfer Protocol">HTTP</abbr>
header, which looks like this:</p>
<div class="highlight-python"><pre>Content-Type: text/html; charset="utf-8"</pre>
</div>
<p>If no charset is specified, <abbr title="Hypertext Transfer Protocol">HTTP</abbr> defaults
to iso-8859-1, but only for text/* media types. For other media types, the
default encoding is undefined, which is where <abbr title="Request For Comments">RFC</abbr> 3023 comes in.</p>
<p>According to <abbr title="Request For Comments">RFC</abbr> 3023, if the media type given
in the Content-Type <abbr title="Hypertext Transfer Protocol">HTTP</abbr> header is
application/xml, application/xml-dtd, application/xml-external-parsed-entity,
or any application/*+xml media type such as application/atom+xml,
application/rss+xml, or application/rdf+xml, then the encoding is the first of
the following that is available:</p>
<ol class="arabic simple">
<li>the encoding given in the <tt class="docutils literal"><span class="pre">charset</span></tt> parameter of the Content-Type <abbr title="Hypertext Transfer Protocol">HTTP</abbr> header, or</li>
<li>the encoding given in the encoding attribute of the <abbr title="Extensible Markup Language">XML</abbr> declaration within the document, or</li>
<li>utf-8.</li>
</ol>
<p>On the other hand, if the media type given in the Content-Type
<abbr title="Hypertext Transfer Protocol">HTTP</abbr> header is text/xml,
text/xml-external-parsed-entity, or a subtype like text/AnythingAtAll+xml, then
the encoding attribute of the <abbr title="Extensible Markup Language">XML</abbr>
declaration within the document is ignored completely, and the encoding is the
first of the following that is available:</p>
<ol class="arabic simple">
<li>the encoding given in the charset parameter of the Content-Type <abbr title="Hypertext Transfer Protocol">HTTP</abbr> header, or</li>
<li>us-ascii.</li>
</ol>
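<p>The two precedence lists above can be sketched as a single function. This is a minimal illustration under stated assumptions, not the code <strong class="program">Universal Feed Parser</strong> uses: the name <tt class="docutils literal"><span class="pre">rfc3023_encoding</span></tt> and its parameters are hypothetical, the Content-Type value is parsed naively by splitting on semicolons, and the encoding from the <abbr title="Extensible Markup Language">XML</abbr> declaration is assumed to have been extracted already.</p>

```python
# Hypothetical helper for illustration only; not feedparser's own code.
APPLICATION_XML_TYPES = (
    "application/xml",
    "application/xml-dtd",
    "application/xml-external-parsed-entity",
)
TEXT_XML_TYPES = ("text/xml", "text/xml-external-parsed-entity")


def rfc3023_encoding(content_type, xml_encoding=None):
    """Choose an encoding from a Content-Type value and the XML declaration."""
    parts = [part.strip() for part in content_type.split(";")]
    media_type = parts[0].lower()

    # Look for a charset parameter, with or without surrounding quotes.
    charset = None
    for param in parts[1:]:
        name, sep, value = param.partition("=")
        if sep and name.strip().lower() == "charset":
            charset = value.strip().strip("\"'") or None

    is_app_xml = media_type in APPLICATION_XML_TYPES or (
        media_type.startswith("application/") and media_type.endswith("+xml")
    )
    is_text_xml = media_type in TEXT_XML_TYPES or (
        media_type.startswith("text/") and media_type.endswith("+xml")
    )

    if is_app_xml:
        # charset parameter, then the XML declaration, then utf-8
        return charset or xml_encoding or "utf-8"
    if is_text_xml:
        # the XML declaration is ignored for text/* XML media types
        return charset or "us-ascii"
    return None  # not an XML media type; RFC 3023 does not apply
```

<p>For example, <tt class="docutils literal"><span class="pre">rfc3023_encoding("text/xml",</span> <span class="pre">"iso-8859-1")</span></tt> returns <tt class="docutils literal"><span class="pre">"us-ascii"</span></tt>, because the declared encoding is ignored for text/xml.</p>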
</div>
<div class="section" id="handling-incorrectly-declared-encodings">
<h2>Handling Incorrectly-Declared Encodings<a class="headerlink" href="#handling-incorrectly-declared-encodings" title="Permalink to this headline">¶</a></h2>
<p><strong class="program">Universal Feed Parser</strong> initially uses the rules specified in
<abbr title="Request For Comments">RFC</abbr> 3023 to determine the character encoding of
the feed.  If parsing succeeds, then that&#8217;s that.  If parsing fails,
<strong class="program">Universal Feed Parser</strong> sets the <tt class="docutils literal"><span class="pre">bozo</span></tt> bit to <tt class="docutils literal"><span class="pre">1</span></tt> and sets
<tt class="docutils literal"><span class="pre">bozo_exception</span></tt> to <tt class="docutils literal"><span class="pre">feedparser.CharacterEncodingOverride</span></tt>.  Then it tries
to reparse the feed with each of the following character encodings, in order:</p>
<ol class="arabic simple">
<li>the encoding specified in the <abbr title="Extensible Markup Language">XML</abbr> declaration</li>
<li>the encoding sniffed from the first four bytes of the document (as per <a class="reference external" href="http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info">Section F</a>)</li>
<li>the encoding auto-detected by the <a class="reference external" href="http://chardet.feedparser.org/">Universal Encoding Detector</a>, if installed</li>
<li>utf-8</li>
<li>windows-1252</li>
</ol>
<p>If the character encoding cannot be determined, <strong class="program">Universal Feed Parser</strong>
sets the <tt class="docutils literal"><span class="pre">bozo</span></tt> bit to <tt class="docutils literal"><span class="pre">1</span></tt> and sets <tt class="docutils literal"><span class="pre">bozo_exception</span></tt> to
<tt class="docutils literal"><span class="pre">feedparser.CharacterEncodingUnknown</span></tt>.  In this case, parsed values will be
byte strings, not Unicode strings.</p>
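<p>The fallback sequence above amounts to trying a list of candidate encodings until one works. The sketch below shows that pattern in simplified form. The name <tt class="docutils literal"><span class="pre">decode_with_fallbacks</span></tt> is hypothetical, and the sketch only checks that the bytes decode, whereas <strong class="program">Universal Feed Parser</strong> reparses the feed with each candidate.</p>

```python
# Hypothetical helper for illustration only; not feedparser's own code.
def decode_with_fallbacks(data, candidates):
    """Try each candidate encoding in order.

    Returns (text, encoding) for the first candidate that decodes the
    bytes, or (None, None) if every candidate fails.
    """
    for encoding in candidates:
        if not encoding:
            continue  # this step produced no guess, e.g. no XML declaration
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown codec name; try the next one
    return None, None


# Bytes that are valid windows-1252 but not valid utf-8.
raw = "caf\u00e9".encode("windows-1252")
text, used = decode_with_fallbacks(raw, [None, None, None, "utf-8", "windows-1252"])
```

<p>Here the first three candidates stand for steps that yielded no guess, utf-8 fails on the byte for the accented letter, and windows-1252 succeeds.</p>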
</div>
<div class="section" id="handling-incorrectly-declared-media-types">
<h2>Handling Incorrectly-Declared Media Types<a class="headerlink" href="#handling-incorrectly-declared-media-types" title="Permalink to this headline">¶</a></h2>
<p><abbr title="Request For Comments">RFC</abbr> 3023 only applies when the feed is served
over <abbr title="Hypertext Transfer Protocol">HTTP</abbr> with a Content-Type that
declares the feed to be some kind of <abbr title="Extensible Markup Language">XML</abbr>.
However, some web servers are severely misconfigured and serve feeds with a
Content-Type of text/plain, application/octet-stream, or some completely bogus
media type.</p>
<p><strong class="program">Universal Feed Parser</strong> will attempt to parse such feeds, but it will
set the <tt class="docutils literal"><span class="pre">bozo</span></tt> bit to <tt class="docutils literal"><span class="pre">1</span></tt> and set <tt class="docutils literal"><span class="pre">bozo_exception</span></tt> to
<tt class="docutils literal"><span class="pre">feedparser.NonXMLContentType</span></tt>.</p>
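<p>Calling code may want to tell these encoding-related conditions apart from other problems. The sketch below is one possible way to do that, assuming the parse result behaves like a dictionary with <tt class="docutils literal"><span class="pre">bozo</span></tt> and <tt class="docutils literal"><span class="pre">bozo_exception</span></tt> keys. The helper name <tt class="docutils literal"><span class="pre">encoding_warning</span></tt> is hypothetical, and matching on the exception class name is a simplification chosen so that the example runs without <strong class="program">Universal Feed Parser</strong> installed.</p>

```python
# Hypothetical helper for illustration only; not part of feedparser.
# The three names are the exception classes mentioned on this page.
ENCODING_WARNINGS = (
    "CharacterEncodingOverride",
    "CharacterEncodingUnknown",
    "NonXMLContentType",
)


def encoding_warning(result):
    """Return the name of an encoding-related bozo exception, or None."""
    if not result.get("bozo"):
        return None
    name = type(result.get("bozo_exception")).__name__
    return name if name in ENCODING_WARNINGS else None
```

<p>A result whose <tt class="docutils literal"><span class="pre">bozo</span></tt> bit is <tt class="docutils literal"><span class="pre">0</span></tt>, or whose exception is of some other type, yields <tt class="docutils literal"><span class="pre">None</span></tt>.</p>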
<div class="admonition-see-also admonition seealso">
<p class="first admonition-title">See also</p>
<ul class="last simple">
<li><a class="reference external" href="http://www.ietf.org/rfc/rfc3023.txt">RFC 3023</a></li>
<li><a class="reference external" href="http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info">Section F of the XML specification</a></li>
<li><a class="reference external" href="http://www.imc.org/atom-syntax/mail-archive/msg05575.html">On the well-formedness of XML documents served as text/plain</a></li>
<li><a class="reference external" href="http://cjkpython.i18n.org/">CJKCodecs and iconv_codec</a></li>
</ul>
</div>
</div>
</div>


          </div>
        </div>
      </div>
      <div class="sphinxsidebar">
        <div class="sphinxsidebarwrapper">
  <h3><a href="index.html">Table Of Contents</a></h3>
  <ul>
<li><a class="reference internal" href="#">Character Encoding Detection</a><ul>
<li><a class="reference internal" href="#introduction-to-character-encoding">Introduction to Character Encoding</a></li>
<li><a class="reference internal" href="#handling-incorrectly-declared-encodings">Handling Incorrectly-Declared Encodings</a></li>
<li><a class="reference internal" href="#handling-incorrectly-declared-media-types">Handling Incorrectly-Declared Media Types</a></li>
</ul>
</li>
</ul>

  <h4>Previous topic</h4>
  <p class="topless"><a href="version-detection.html"
                        title="previous chapter">Feed Type and Version Detection</a></p>
  <h4>Next topic</h4>
  <p class="topless"><a href="bozo.html"
                        title="next chapter">Bozo Detection</a></p>
  <h3>This Page</h3>
  <ul class="this-page-menu">
    <li><a href="_sources/character-encoding.txt"
           rel="nofollow">Show Source</a></li>
  </ul>
        </div>
      </div>
      <div class="clearer"></div>
    </div>
    <div class="footer">
        &copy; Copyright 2004-2008 Mark Pilgrim, 2010-2012 Kurt McKee.
      Created using <a href="http://sphinx.pocoo.org/">Sphinx</a> 1.1.3.
    </div>
  </body>
</html>