  <div class="section" id="character-encoding-detection">
<span id="advanced-encoding"></span><h1>Character Encoding Detection<a class="headerlink" href="#character-encoding-detection" title="Permalink to this headline">¶</a></h1>
<div class="admonition tip">
<p class="first admonition-title">Tip</p>
<p class="last">Feeds may be published in any character encoding.  <strong class="program">Python</strong>
supports only a few character encodings by default.  To support the maximum
number of character encodings (and be able to parse the maximum number of
feeds), you should install <tt class="file docutils literal"><span class="pre">cjkcodecs</span></tt> and <tt class="file docutils literal"><span class="pre">iconv_codec</span></tt>.  Both are
available at <a class="reference external" href=""></a>.</p>
</div>
<p><a class="reference external" href="">RFC 3023</a> defines the interaction
between <abbr title="Extensible Markup Language">XML</abbr> and <abbr title="Hypertext Transfer Protocol">HTTP</abbr>
as it relates to character encoding.  <abbr title="Extensible Markup Language">XML</abbr>
and <abbr title="Hypertext Transfer Protocol">HTTP</abbr> have different ways of
specifying character encoding and different defaults in case no encoding is
specified, and determining which value takes precedence depends on a variety of
factors.</p>
<div class="section" id="introduction-to-character-encoding">
<h2>Introduction to Character Encoding<a class="headerlink" href="#introduction-to-character-encoding" title="Permalink to this headline">¶</a></h2>
<p>In <abbr title="Extensible Markup Language">XML</abbr>, the character encoding is optional
and may be given in the <abbr title="Extensible Markup Language">XML</abbr> declaration in
the first line of the document, like this:</p>
<div class="highlight-xml"><div class="highlight"><pre><span class="cp">&lt;?xml version=&quot;1.0&quot; encoding=&quot;utf-8&quot;?&gt;</span>
</pre></div>
</div>
<p>If no encoding is given, <abbr title="Extensible Markup Language">XML</abbr> supports the
use of a Byte Order Mark to identify the document as some flavor of UTF-32,
UTF-16, or UTF-8.  <a class="reference external" href="">Section F of the XML specification</a>
outlines the process for determining the character encoding based on unique
properties of the Byte Order Mark in the first two to four bytes of the
document.</p>
<p>If no encoding is specified and no Byte Order Mark is present, <abbr title="Extensible Markup Language">XML</abbr>
defaults to UTF-8.</p>
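<p>The lookup order described above (declared encoding, then Byte Order Mark,
then the UTF-8 default) can be sketched in a few lines of Python.  This is a
simplified illustration, not the code <strong class="program">Universal Feed Parser</strong>
uses: the function names are hypothetical, the declaration is only found in
ASCII-compatible encodings, and the full Section F procedure also covers
documents that carry no Byte Order Mark at all.</p>

```python
import codecs
import re

# Check the four-byte UTF-32 marks first: the UTF-32-LE mark begins with
# the same two bytes as the UTF-16-LE mark.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Matches the encoding attribute of an XML declaration at the start of the data.
_DECLARATION = re.compile(
    rb'^<\?xml[^>]*?\sencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(data):
    """Return the encoding named in the XML declaration, or None."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]  # skip a leading UTF-8 mark
    match = _DECLARATION.match(data)
    return match.group(1).decode("ascii").lower() if match else None


def bom_encoding(data):
    """Return the encoding implied by a Byte Order Mark, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def sniff_xml_encoding(data):
    """Declared encoding, else Byte Order Mark, else the UTF-8 default."""
    return declared_encoding(data) or bom_encoding(data) or "utf-8"
```

<p>Calling <tt class="docutils literal"><span class="pre">sniff_xml_encoding</span></tt> on the raw bytes of a
document gives a best guess that relies on the document alone, before any
<abbr title="Hypertext Transfer Protocol">HTTP</abbr> information is taken into account.</p>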
<p><abbr title="Hypertext Transfer Protocol">HTTP</abbr> uses <abbr>MIME</abbr> to define a method
of specifying the character encoding, as part of the Content-Type <abbr title="Hypertext Transfer Protocol">HTTP</abbr>
header, which looks like this:</p>
<div class="highlight-python"><pre>Content-Type: text/html; charset="utf-8"</pre>
</div>
<p>If no charset is specified, <abbr title="Hypertext Transfer Protocol">HTTP</abbr> defaults
to iso-8859-1, but only for text/* media types. For other media types, the
default encoding is undefined, which is where <abbr title="Request For Comments">RFC</abbr> 3023 comes in.</p>
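<p>As a rough illustration, the media type and charset can be pulled out of a
Content-Type value with the standard library&#8217;s
<tt class="docutils literal"><span class="pre">email.message.Message</span></tt> class.  The helper name
<tt class="docutils literal"><span class="pre">parse_content_type</span></tt> is hypothetical, and the sketch applies
only the plain <abbr title="Hypertext Transfer Protocol">HTTP</abbr> default described
above, not the <abbr title="Extensible Markup Language">XML</abbr>-specific rules
of <abbr title="Request For Comments">RFC</abbr> 3023.</p>

```python
from email.message import Message


def parse_content_type(header_value):
    """Split a Content-Type value into (media_type, charset).

    charset is None when the header gives no charset and the media type
    is not text/*, because the default is then undefined.
    """
    message = Message()
    message["Content-Type"] = header_value
    media_type = message.get_content_type()   # lowercased, parameters removed
    charset = message.get_content_charset()   # lowercased, quotes removed
    if charset is None and media_type.startswith("text/"):
        charset = "iso-8859-1"                # HTTP default for text/* only
    return media_type, charset
```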
<p>According to <abbr title="Request For Comments">RFC</abbr> 3023, if the media type given
in the Content-Type <abbr title="Hypertext Transfer Protocol">HTTP</abbr> header is
application/xml, application/xml-dtd, application/xml-external-parsed-entity,
or any one of the subtypes of application/xml such as application/atom+xml or
application/rss+xml or even application/rdf+xml, then the encoding is</p>
<ol class="arabic simple">
<li>the encoding given in the <tt class="docutils literal"><span class="pre">charset</span></tt> parameter of the Content-Type <abbr title="Hypertext Transfer Protocol">HTTP</abbr> header, or</li>
<li>the encoding given in the encoding attribute of the <abbr title="Extensible Markup Language">XML</abbr> declaration within the document, or</li>
<li>utf-8</li>
</ol>
<p>On the other hand, if the media type given in the Content-Type
<abbr title="Hypertext Transfer Protocol">HTTP</abbr> header is text/xml,
text/xml-external-parsed-entity, or a subtype like text/AnythingAtAll+xml, then
the encoding attribute of the <abbr title="Extensible Markup Language">XML</abbr>
declaration within the document is ignored completely, and the encoding is</p>
<ol class="arabic simple">
<li>the encoding given in the charset parameter of the Content-Type <abbr title="Hypertext Transfer Protocol">HTTP</abbr> header, or</li>
<li>us-ascii</li>
</ol>
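<p>The two precedence rules can be written as one small function.  This is a
minimal sketch under stated assumptions: the name
<tt class="docutils literal"><span class="pre">rfc3023_encoding</span></tt> and its parameters are hypothetical, the
caller is assumed to have already extracted the charset parameter and the
declared encoding, and the final fallbacks (utf-8 for the application/* family,
us-ascii for the text/* family) are the defaults given by
<abbr title="Request For Comments">RFC</abbr> 3023.</p>

```python
APPLICATION_XML_TYPES = {
    "application/xml",
    "application/xml-dtd",
    "application/xml-external-parsed-entity",
}
TEXT_XML_TYPES = {"text/xml", "text/xml-external-parsed-entity"}


def rfc3023_encoding(media_type, http_charset=None, xml_encoding=None):
    """Pick an encoding for an XML media type, or return None for other types.

    http_charset is the charset parameter of the Content-Type header;
    xml_encoding is the encoding attribute of the XML declaration.
    """
    media_type = media_type.lower()
    is_plus_xml = media_type.endswith("+xml")
    if media_type in APPLICATION_XML_TYPES or (
        media_type.startswith("application/") and is_plus_xml
    ):
        # The HTTP header wins, then the XML declaration, then utf-8.
        return http_charset or xml_encoding or "utf-8"
    if media_type in TEXT_XML_TYPES or (
        media_type.startswith("text/") and is_plus_xml
    ):
        # The XML declaration is ignored completely for text/* types.
        return http_charset or "us-ascii"
    return None  # not an XML media type; these rules do not apply
```

<p>Note the practical consequence: a feed served as text/xml with no charset
parameter is treated as us-ascii even if its
<abbr title="Extensible Markup Language">XML</abbr> declaration says utf-8.</p>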
</div>
<div class="section" id="handling-incorrectly-declared-encodings">
<h2>Handling Incorrectly-Declared Encodings<a class="headerlink" href="#handling-incorrectly-declared-encodings" title="Permalink to this headline">¶</a></h2>
<p><strong class="program">Universal Feed Parser</strong> initially uses the rules specified in
<abbr title="Request For Comments">RFC</abbr> 3023 to determine the character encoding of
the feed.  If parsing succeeds, then that&#8217;s that.  If parsing fails,
<strong class="program">Universal Feed Parser</strong> sets the <tt class="docutils literal"><span class="pre">bozo</span></tt> bit to <tt class="docutils literal"><span class="pre">1</span></tt> and sets
<tt class="docutils literal"><span class="pre">bozo_exception</span></tt> to <tt class="docutils literal"><span class="pre">feedparser.CharacterEncodingOverride</span></tt>.  Then it tries
to reparse the feed with the following character encodings:</p>
<ol class="arabic simple">
<li>the encoding specified in the <abbr title="Extensible Markup Language">XML</abbr> declaration</li>
<li>the encoding sniffed from the first four bytes of the document (as per <a class="reference external" href="">Section F</a>)</li>
<li>the encoding auto-detected by the <a class="reference external" href="">Universal Encoding Detector</a>, if installed</li>
</ol>
<p>If the character encoding cannot be determined, <strong class="program">Universal Feed Parser</strong>
sets the <tt class="docutils literal"><span class="pre">bozo</span></tt> bit to <tt class="docutils literal"><span class="pre">1</span></tt> and sets <tt class="docutils literal"><span class="pre">bozo_exception</span></tt> to
<tt class="docutils literal"><span class="pre">feedparser.CharacterEncodingUnknown</span></tt>.  In this case, parsed values will be
byte strings rather than Unicode strings.</p>
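<p>The retry strategy amounts to walking an ordered list of candidate
encodings and keeping the first one that works.  The sketch below illustrates
only that idea: <tt class="docutils literal"><span class="pre">first_decodable</span></tt> is a hypothetical helper,
and it merely decodes the bytes, whereas
<strong class="program">Universal Feed Parser</strong> reparses the whole feed with each
candidate.</p>

```python
def first_decodable(data, candidates):
    """Return (encoding, text) for the first candidate that decodes data.

    Candidates that are None, unknown to Python, or that fail to decode
    the bytes are skipped.  Returns (None, None) if nothing works, which
    corresponds to the "encoding unknown" outcome.
    """
    for name in candidates:
        if not name:
            continue  # e.g. the document had no XML declaration
        try:
            return name, data.decode(name)
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec name, or bytes invalid in this encoding
    return None, None
```

<p>The order of the candidate list matters, because permissive single-byte
encodings will decode almost any input and would mask a better guess if they
were tried first.</p>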
</div>
<div class="section" id="handling-incorrectly-declared-media-types">
<h2>Handling Incorrectly-Declared Media Types<a class="headerlink" href="#handling-incorrectly-declared-media-types" title="Permalink to this headline">¶</a></h2>
<p><abbr title="Request For Comments">RFC</abbr> 3023 only applies when the feed is served
over <abbr title="Hypertext Transfer Protocol">HTTP</abbr> with a Content-Type that
declares the feed to be some kind of <abbr title="Extensible Markup Language">XML</abbr>.
However, some web servers are severely misconfigured and serve feeds with a
Content-Type of text/plain, application/octet-stream, or some completely bogus
media type.</p>
<p><strong class="program">Universal Feed Parser</strong> will attempt to parse such feeds, but it will
set the <tt class="docutils literal"><span class="pre">bozo</span></tt> bit to <tt class="docutils literal"><span class="pre">1</span></tt> and set <tt class="docutils literal"><span class="pre">bozo_exception</span></tt> to
<tt class="docutils literal"><span class="pre">feedparser.NonXMLContentType</span></tt>.</p>
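<p>A caller who wants to know in advance whether a response falls into this
category can test the Content-Type value directly.  The following is a
hypothetical helper, not part of <strong class="program">Universal Feed Parser</strong>;
it assumes the set of <abbr title="Extensible Markup Language">XML</abbr> media types
named earlier in this section.</p>

```python
XML_MEDIA_TYPES = {
    "application/xml",
    "application/xml-dtd",
    "application/xml-external-parsed-entity",
    "text/xml",
    "text/xml-external-parsed-entity",
}


def is_xml_media_type(content_type):
    """Return True if a Content-Type value declares some kind of XML."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in XML_MEDIA_TYPES:
        return True
    # Subtypes such as application/atom+xml or text/AnythingAtAll+xml.
    return media_type.startswith(("application/", "text/")) and media_type.endswith("+xml")
```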
<div class="admonition-see-also admonition seealso">
<p class="first admonition-title">See also</p>
<ul class="last simple">
<li><a class="reference external" href="">RFC 3023</a></li>
<li><a class="reference external" href="">Section F of the XML specification</a></li>
<li><a class="reference external" href="">On the well-formedness of XML documents served as text/plain</a></li>
<li><a class="reference external" href="">CJKCodecs and iconv_codec</a></li>
</ul>
</div>
</div>
</div>

