Sophie

Sophie

distrib > Mageia > 1 > i586 > media > core-release > by-pkgid > f0bc842dcf666302badcfd2545f3387c > files > 222

libfreetds0-doc-0.82-12.mga1.i586.rpm

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML
><HEAD
><TITLE
>Unicode Transformation Format: UTF-8</TITLE
><META
NAME="GENERATOR"
CONTENT="Modular DocBook HTML Stylesheet Version 1.79"><LINK
REL="HOME"
TITLE="FreeTDS User Guide"
HREF="index.htm"><LINK
REL="UP"
TITLE="About Unicode, UCS-2, and UTF-8"
HREF="aboutunicode.htm"><LINK
REL="PREVIOUS"
TITLE="Unicode's Pluses and Minuses"
HREF="unicodegoodbad.htm"><LINK
REL="NEXT"
TITLE="Unicode and FreeTDS"
HREF="unicodefreetds.htm"><LINK
REL="STYLESHEET"
TYPE="text/css"
HREF="userguide.css"></HEAD
><BODY
CLASS="SECTION"
BGCOLOR="#FFFFFF"
TEXT="#000000"
LINK="#0000FF"
VLINK="#840084"
ALINK="#0000FF"
><DIV
CLASS="NAVHEADER"
><TABLE
SUMMARY="Header navigation table"
WIDTH="100%"
BORDER="0"
CELLPADDING="0"
CELLSPACING="0"
><TR
><TH
COLSPAN="3"
ALIGN="center"
><SPAN
CLASS="PRODUCTNAME"
>FreeTDS</SPAN
> User Guide: A Guide to Installing, Configuring, and Running <SPAN
CLASS="PRODUCTNAME"
>FreeTDS</SPAN
></TH
></TR
><TR
><TD
WIDTH="10%"
ALIGN="left"
VALIGN="bottom"
><A
HREF="unicodegoodbad.htm"
ACCESSKEY="P"
>Prev</A
></TD
><TD
WIDTH="80%"
ALIGN="center"
VALIGN="bottom"
>Appendix B. About Unicode, UCS-2, and UTF-8</TD
><TD
WIDTH="10%"
ALIGN="right"
VALIGN="bottom"
><A
HREF="unicodefreetds.htm"
ACCESSKEY="N"
>Next</A
></TD
></TR
></TABLE
><HR
ALIGN="LEFT"
WIDTH="100%"></DIV
><DIV
CLASS="SECTION"
><H1
CLASS="SECTION"
><A
NAME="UNICODEUTF"
>Unicode Transformation Format: UTF-8</A
></H1
><P
>The presence of nulls embedded in character data and of byte order issues make straight Unicode i.e., UCS-2 or UCS-4 hard to work with in a heterogeneous environment.  Too many opportunities arise for the data to be truncated or misinterpreted, and too many systems would fail even to transmit such data.  In short, when 16-bit data are thrust into a multi-architecture 8-bit world, it frequently bodes ill for the data.  
			</P
><P
>To answer that problem, to make Unicode transmissible and unambiguous to most machines, several <SPAN
CLASS="QUOTE"
>"transformation formats"</SPAN
> were adopted.  Their goals were generally similar: to create a generally recognized format that would unambiguously and safely convey Unicode information between machines and across the Internet.  To do that, they sought to remove nulls and endianism from the data stream.  The most popular one &mdash; practically the only one used &mdash; is known as UTF-8.  
			</P
><P
>UTF-8 found wide acceptance for many reasons.  UTF-8 represents any Unicode character as a combination of 1-4 bytes.  The number of bytes required depends on the integer value of the Unicode character, and only one byte is used to represent the old <ACRONYM
CLASS="ACRONYM"
>ASCII</ACRONYM
> range (0-127).  UTF-8 does not use zero to represent any part of any character (except for the <ACRONYM
CLASS="ACRONYM"
>ASCII</ACRONYM
> NUL).  In consequence, UTF-8 is efficient with respect to space, has no endianism issues, and embeds no nulls.  UTF-8 strings can be treated as plain old <ACRONYM
CLASS="ACRONYM"
>ASCII</ACRONYM
> strings.  These properties make UTF-8 data relatively easy for systems accustomed to processing <ACRONYM
CLASS="ACRONYM"
>ASCII</ACRONYM
> data.  
			</P
><P
>Here's a small example showing the difference between UCS-2 and UTF-8.  
			<DIV
CLASS="EXAMPLE"
><A
NAME="E.G.HELLO"
></A
><P
><B
>Example B-1. <SPAN
CLASS="QUOTE"
>"HELLO"</SPAN
> in UCS-2 and UTF-8</B
></P
><PRE
CLASS="SCREEN"
>$ <KBD
CLASS="USERINPUT"
>echo HELLO | iconv -f ascii -t UCS-2 | hexdump -C</KBD
>
00000000  00 48 00 45 00 4c 00 4c  00 4f 00 0a              |.H.E.L.L.O..|
0000000c
$ <KBD
CLASS="USERINPUT"
>echo HELLO | iconv -f ascii -t utf-8 | hexdump -C</KBD
>
00000000  48 45 4c 4c 4f 0a                                 |HELLO.|
00000006
$ <KBD
CLASS="USERINPUT"
>echo HELLO | hexdump -C</KBD
>
00000000  48 45 4c 4c 4f 0a                                 |HELLO.|
00000006</PRE
></DIV
>
It is the similarity of the last two outputs that makes UTF-8 so attractive.  It behaves like <ACRONYM
CLASS="ACRONYM"
>ASCII</ACRONYM
> when <ACRONYM
CLASS="ACRONYM"
>ASCII</ACRONYM
>'s all that's needed.  But it lacks <ACRONYM
CLASS="ACRONYM"
>ASCII</ACRONYM
>'s limitations.  			</P
><P
>While UTF-8 solves many technical problems, it doesn't magically transform every <ACRONYM
CLASS="ACRONYM"
>ASCII</ACRONYM
>-assuming system into a Unicode system.  For example, to display Unicode data correctly &mdash; even Unicode data in UTF-8 format &mdash; the system still needs a suitable font.  And it must distinguish the buffer size (and byte count) from the character count.  
			</P
></DIV
><DIV
CLASS="NAVFOOTER"
><HR
ALIGN="LEFT"
WIDTH="100%"><TABLE
SUMMARY="Footer navigation table"
WIDTH="100%"
BORDER="0"
CELLPADDING="0"
CELLSPACING="0"
><TR
><TD
WIDTH="33%"
ALIGN="left"
VALIGN="top"
><A
HREF="unicodegoodbad.htm"
ACCESSKEY="P"
>Prev</A
></TD
><TD
WIDTH="34%"
ALIGN="center"
VALIGN="top"
><A
HREF="index.htm"
ACCESSKEY="H"
>Home</A
></TD
><TD
WIDTH="33%"
ALIGN="right"
VALIGN="top"
><A
HREF="unicodefreetds.htm"
ACCESSKEY="N"
>Next</A
></TD
></TR
><TR
><TD
WIDTH="33%"
ALIGN="left"
VALIGN="top"
>Unicode's Pluses and Minuses</TD
><TD
WIDTH="34%"
ALIGN="center"
VALIGN="top"
><A
HREF="aboutunicode.htm"
ACCESSKEY="U"
>Up</A
></TD
><TD
WIDTH="33%"
ALIGN="right"
VALIGN="top"
>Unicode and FreeTDS</TD
></TR
></TABLE
></DIV
></BODY
></HTML
>