=====================
Unicode and Encodings
=====================

Since Pygments 0.6, all lexers use unicode strings internally. Because of that
you might encounter the occasional `UnicodeDecodeError` if you pass strings with the
wrong encoding.
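
As a rough illustration of what "wrong encoding" means at the byte level, the
following sketch uses only the Python standard library, not Pygments. It
assumes Python 3, where byte strings are ``bytes`` and unicode strings are
``str``; the sample text is hypothetical.

```python
# Bytes as they might be read from a UTF-8 encoded source file.
raw = u"caf\u00e9".encode("utf-8")

# Decoding with the matching encoding recovers the original text.
assert raw.decode("utf-8") == u"caf\u00e9"

# latin1 maps every byte to a character, so decoding never fails,
# but UTF-8 input comes out garbled instead of raising an error.
garbled = raw.decode("latin1")

# The reverse mismatch does fail: latin1 bytes are not valid UTF-8 here.
error = None
try:
    u"caf\u00e9".encode("latin1").decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
```

A mismatch therefore shows up either as an exception or as silently garbled
text, depending on which encoding is assumed.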

By default, all lexers have their input encoding set to `latin1`.
If you pass a lexer a string object (not unicode), it tries to decode the data
using this encoding.
You can override the encoding using the `encoding` lexer option. If you have the
`chardet`_ library installed and set the encoding to ``chardet``, it will analyse
the text and automatically use the encoding it thinks is the right one:

.. sourcecode:: python

    from pygments.lexers import PythonLexer
    lexer = PythonLexer(encoding='chardet')
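
When the encoding of the input is already known, it can be named explicitly
instead of relying on detection. The sketch below assumes Python 3 and
UTF-8 input; the sample source line is hypothetical.

```python
from pygments.lexers import PythonLexer

# Hypothetical input: bytes from a file known to be UTF-8 encoded.
raw = u'name = "caf\u00e9"\n'.encode('utf-8')

# Tell the lexer which encoding to use when it decodes byte input.
lexer = PythonLexer(encoding='utf-8')

# get_tokens() yields (token type, text) pairs; the text is unicode.
tokens = list(lexer.get_tokens(raw))
decoded = u''.join(value for _, value in tokens)
```

Joining the token values gives back the decoded source text, which is a quick
way to check that the chosen encoding was the right one.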

The most reliable approach is to pass Pygments unicode objects. In that case
the lexer does not have to decode anything, so a wrong encoding cannot garble
the output.
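
A minimal sketch of that approach, assuming Python 3 and input whose encoding
you already know (UTF-8 here; the sample source line is hypothetical):

```python
from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import HtmlFormatter

# Hypothetical input: bytes read from a file known to be UTF-8 encoded.
raw = u'name = "caf\u00e9"\n'.encode('utf-8')

# Decode once, at the boundary, with the encoding known to be correct.
text = raw.decode('utf-8')

# The lexer receives unicode, so its own input encoding is not involved.
# No output encoding is set, so the result is a unicode string as well.
html = highlight(text, PythonLexer(), HtmlFormatter())
```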

The formatters now send Unicode objects to the stream if you don't set the
output encoding. You can set it by passing the formatter an `encoding` option:

.. sourcecode:: python

    from pygments.formatters import HtmlFormatter
    f = HtmlFormatter(encoding='utf-8')

**You will have to set this option if you have non-ASCII characters in the
source and the output stream does not accept Unicode written to it!**
This is the case for all regular files and for terminals.
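
For a regular file, one way to follow this advice is to let the formatter
encode the output and write the resulting bytes in binary mode. This is a
sketch assuming Python 3; the temporary file path and the sample source line
are hypothetical.

```python
import os
import tempfile

from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import HtmlFormatter

# Hypothetical source containing non-ASCII characters.
source = u'greeting = "gr\u00fc\u00df dich"\n'

# With an output encoding set, highlight() returns an encoded byte string.
formatter = HtmlFormatter(encoding='utf-8')
encoded = highlight(source, PythonLexer(), formatter)

# Write the bytes unchanged; binary mode avoids a second encoding step.
path = os.path.join(tempfile.mkdtemp(), 'out.html')
with open(path, 'wb') as out:
    out.write(encoded)
```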

Note: The Terminal formatter tries to be smart: if its output stream has an
`encoding` attribute, and you haven't set the option, it will encode any
Unicode string with this encoding before writing it. This is the case for
`sys.stdout`, for example. The other formatters don't have that behavior.

Another note: If you call Pygments via the command line (`pygmentize`),
encoding is handled differently; see `the command line docs <cmdline.txt>`_.

*New in Pygments 0.7*: the formatters now also accept an `outencoding` option
which will override the `encoding` option if given. This makes it possible to
use a single options dict with lexers and formatters, and still have different
input and output encodings.
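
A sketch of such a shared options dict, assuming Python 3; the particular
encodings and the sample source line are only illustrative.

```python
from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import HtmlFormatter

# One dict for both objects: the lexer reads `encoding`, while the
# formatter prefers `outencoding` over `encoding`.
options = {'encoding': 'latin1', 'outencoding': 'utf-8'}
lexer = PythonLexer(**options)
formatter = HtmlFormatter(**options)

# Hypothetical latin1-encoded input, re-encoded as UTF-8 on output.
raw = u'name = "caf\u00e9"\n'.encode('latin1')
result = highlight(raw, lexer, formatter)
```

Here the input is decoded as latin1 and the highlighted output is encoded as
UTF-8, even though both objects were built from the same dict.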

.. _chardet: http://chardet.feedparser.org/