<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta name="generator" content= "HTML Tidy for Linux/x86 (vers 1 September 2005), see www.w3.org" /> <meta http-equiv="Content-Type" content= "text/html; charset=us-ascii" /> <title>docbook2X: Character set conversion</title> <link rel="stylesheet" href="docbook2X.css" type="text/css" /> <link rev="made" href="mailto:stevecheng@users.sourceforge.net" /> <meta name="generator" content="DocBook XSL Stylesheets V1.68.1" /> <meta name="description" content= "Discussion on reproducing non-ASCII characters in the converted output" /> <link rel="start" href="docbook2X.html" title= "docbook2X: Documentation Table of Contents" /> <link rel="up" href="docbook2X.html" title= "docbook2X: Documentation Table of Contents" /> <link rel="prev" href="sgml2xml-isoent.html" title= "docbook2X: sgml2xml-isoent" /> <link rel="next" href="utf8trans.html" title= "docbook2X: utf8trans" /> </head> <body> <div class="navheader"> <table width="100%" summary="Navigation header"> <tr> <th colspan="3" align="center">Character set conversion</th> </tr> <tr> <td width="20%" align="left"><a accesskey="p" href= "sgml2xml-isoent.html"><< Previous</a> </td> <th width="60%" align="center"> </th> <td width="20%" align="right"> <a accesskey="n" href= "utf8trans.html">Next >></a></td> </tr> </table> <hr /></div> <div class="sect1" lang="en" xml:lang="en"> <div class="titlepage"> <div> <div> <h2 class="title"><a id="charsets" name="charsets"></a>Character set conversion</h2> </div> </div> </div> <a id="id2537977" class="indexterm" name="id2537977"></a><a id= "id2537983" class="indexterm" name="id2537983"></a><a id= "id2537990" class="indexterm" name="id2537990"></a><a id= "id2537997" class="indexterm" name="id2537997"></a><a id= "id2538612" class="indexterm" name="id2538612"></a><a id= "id2538619" class="indexterm" name="id2538619"></a><a id= "id2538625" class="indexterm" name="id2538625"></a><a id= "id2538632" class="indexterm" name="id2538632"></a><a id= "id2538639" class="indexterm" name="id2538639"></a><a id= "id2538649" class="indexterm" name="id2538649"></a><a id= "id2538656" class="indexterm" name="id2538656"></a> <p>When translating XML to legacy ASCII-based formats with poor support for Unicode, such as man pages and Texinfo, there is always the problem that Unicode characters in the source document also have to be translated somehow.</p> <p>A straightforward character set conversion from Unicode does not suffice, because the target character set, usually US-ASCII or ISO Latin-1, do not contain common characters such as dashes and directional quotation marks that are widely used in XML documents. But document formatters (man and Texinfo) allow such characters to be entered by a markup escape: for example, <span class= "markup">\(lq</span> for the left directional quote <code class= "literal">“</code>. And if a markup-level escape is not available, an ASCII transliteration might be used: for example, using the ASCII less-than sign <span class="markup"><</span> for the angle quotation mark <span class="markup">⟨</span>.</p> <p>So the Unicode character problem can be solved in two steps:</p> <div class="orderedlist"> <ol type="1"> <li> <p><a href="utf8trans.html"><span><strong class= "command">utf8trans</strong></span></a>, a program included in docbook2X, maps Unicode characters to markup-level escapes or transliterations.</p> <p>Since there is not necessarily a fixed, official mapping of Unicode characters, <span><strong class= "command">utf8trans</strong></span> can read in user-modifiable character mappings expressed in text files and apply them. (Unlike most character set converters.)</p> <p>In <code class="filename">charmaps/man/roff.charmap</code> and <code class="filename">charmaps/man/texi.charmap</code> are character maps that may be used for man-page and Texinfo conversion. The programs <a href= "db2x_manxml.html"><span><strong class= "command">db2x_manxml</strong></span></a> and <a href= "db2x_texixml.html"><span><strong class= "command">db2x_texixml</strong></span></a> will apply these character maps, or another character map specified by the user, automatically.</p> </li> <li> <p>The rest of the Unicode text is converted to some other character set (encoding). For example, a French document with accented characters (such as <code class="literal">é</code>) might be converted to ISO Latin 1.</p> <p>This step is applied after <span><strong class= "command">utf8trans</strong></span> character mapping, using the <span class="citerefentry"><span class= "refentrytitle"><span><strong class= "command">iconv</strong></span></span></span> encoding conversion tool. Both <a href="db2x_manxml.html"><span><strong class= "command">db2x_manxml</strong></span></a> and <a href= "db2x_texixml.html"><span><strong class= "command">db2x_texixml</strong></span></a> can call <span class= "citerefentry"><span class="refentrytitle"><span><strong class= "command">iconv</strong></span></span></span> automatically when producing their output.</p> </li> </ol> </div> </div> <div class="navfooter"> <hr /> <table width="100%" summary="Navigation footer"> <tr> <td width="40%" align="left"><a accesskey="p" href= "sgml2xml-isoent.html"><< Previous</a> </td> <td width="20%" align="center"> </td> <td width="40%" align="right"> <a accesskey="n" href= "utf8trans.html">Next >></a></td> </tr> <tr> <td width="40%" align="left" valign="top"><span><strong class= "command">sgml2xml-isoent</strong></span> </td> <td width="20%" align="center"><a accesskey="h" href= "docbook2X.html">Table of Contents</a></td> <td width="40%" align="right" valign="top"> <span><strong class="command">utf8trans</strong></span></td> </tr> </table> </div> <p class="footer-homepage"><a href= "http://docbook2x.sourceforge.net/" title= "docbook2X: Home page">docbook2X home page</a></p> </body> </html>