<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> <HTML ><HEAD ><TITLE >Unicode Transformation Format: UTF-8</TITLE ><META NAME="GENERATOR" CONTENT="Modular DocBook HTML Stylesheet Version 1.79"><LINK REL="HOME" TITLE="FreeTDS User Guide" HREF="index.htm"><LINK REL="UP" TITLE="About Unicode, UCS-2, and UTF-8" HREF="aboutunicode.htm"><LINK REL="PREVIOUS" TITLE="Unicode's Pluses and Minuses" HREF="unicodegoodbad.htm"><LINK REL="NEXT" TITLE="Unicode and FreeTDS" HREF="unicodefreetds.htm"><LINK REL="STYLESHEET" TYPE="text/css" HREF="userguide.css"><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8"></HEAD ><BODY CLASS="SECTION" BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#840084" ALINK="#0000FF" ><DIV CLASS="NAVHEADER" ><TABLE SUMMARY="Header navigation table" WIDTH="100%" BORDER="0" CELLPADDING="0" CELLSPACING="0" ><TR ><TH COLSPAN="3" ALIGN="center" ><SPAN CLASS="PRODUCTNAME" >FreeTDS</SPAN > User Guide: A Guide to Installing, Configuring, and Running <SPAN CLASS="PRODUCTNAME" >FreeTDS</SPAN ></TH ></TR ><TR ><TD WIDTH="10%" ALIGN="left" VALIGN="bottom" ><A HREF="unicodegoodbad.htm" ACCESSKEY="P" >Prev</A ></TD ><TD WIDTH="80%" ALIGN="center" VALIGN="bottom" >Appendix C. About Unicode, UCS-2, and UTF-8</TD ><TD WIDTH="10%" ALIGN="right" VALIGN="bottom" ><A HREF="unicodefreetds.htm" ACCESSKEY="N" >Next</A ></TD ></TR ></TABLE ><HR ALIGN="LEFT" WIDTH="100%"></DIV ><DIV CLASS="SECTION" ><H1 CLASS="SECTION" ><A NAME="UNICODEUTF" >Unicode Transformation Format: UTF-8</A ></H1 ><P >The presence of nulls embedded in character data and of byte order issues make straight Unicode i.e., UCS-2 or UCS-4 hard to work with in a heterogeneous environment. Too many opportunities arise for the data to be truncated or misinterpreted, and too many systems would fail even to transmit such data. In short, when 16-bit data are thrust into a multi-architecture 8-bit world, it frequently bodes ill for the data.</P ><P >To answer that problem, to make Unicode transmissible and unambiguous to most machines, several <SPAN CLASS="QUOTE" >"transformation formats"</SPAN > were adopted. Their goals were generally similar: to create a generally recognized format that would unambiguously and safely convey Unicode information between machines and across the Internet. To do that, they sought to remove nulls and endianism from the data stream. The most popular one — practically the only one used — is known as UTF-8.</P ><P >UTF-8 found wide acceptance for many reasons. UTF-8 represents any Unicode character as a combination of 1-4 bytes. The number of bytes required depends on the integer value of the Unicode character, and only one byte is used to represent the old <ACRONYM CLASS="ACRONYM" >ASCII</ACRONYM > range (0-127). UTF-8 does not use zero to represent any part of any character (except for the <ACRONYM CLASS="ACRONYM" >ASCII</ACRONYM > NUL). In consequence, UTF-8 is efficient with respect to space, has no endianism issues, and embeds no nulls. UTF-8 strings can be treated as plain old <ACRONYM CLASS="ACRONYM" >ASCII</ACRONYM > strings. These properties make UTF-8 data relatively easy for systems accustomed to processing <ACRONYM CLASS="ACRONYM" >ASCII</ACRONYM > data.</P ><P >Here's a small example showing the difference between UCS-2 and UTF-8. <DIV CLASS="EXAMPLE" ><A NAME="E.G.HELLO" ></A ><P ><B >Example C-1. <SPAN CLASS="QUOTE" >"HELLO"</SPAN > in UCS-2 and UTF-8</B ></P ><PRE CLASS="SCREEN" > $ <KBD CLASS="USERINPUT" >echo HELLO | iconv -f ascii -t UCS-2 | hexdump -C</KBD > 00000000 00 48 00 45 00 4c 00 4c 00 4f 00 0a |.H.E.L.L.O..| 0000000c $ <KBD CLASS="USERINPUT" >echo HELLO | iconv -f ascii -t utf-8 | hexdump -C</KBD > 00000000 48 45 4c 4c 4f 0a |HELLO.| 00000006 $ <KBD CLASS="USERINPUT" >echo HELLO | hexdump -C</KBD > 00000000 48 45 4c 4c 4f 0a |HELLO.| 00000006</PRE ></DIV > It is the similarity of the last two outputs that makes UTF-8 so attractive. It behaves like <ACRONYM CLASS="ACRONYM" >ASCII</ACRONYM > when <ACRONYM CLASS="ACRONYM" >ASCII</ACRONYM >'s all that's needed. But it lacks <ACRONYM CLASS="ACRONYM" >ASCII</ACRONYM >'s limitations.</P ><P >While UTF-8 solves many technical problems, it doesn't magically transform every <ACRONYM CLASS="ACRONYM" >ASCII</ACRONYM >-assuming system into a Unicode system. For example, to display Unicode data correctly — even Unicode data in UTF-8 format — the system still needs a suitable font. And it must distinguish the buffer size (and byte count) from the character count.</P ></DIV ><DIV CLASS="NAVFOOTER" ><HR ALIGN="LEFT" WIDTH="100%"><TABLE SUMMARY="Footer navigation table" WIDTH="100%" BORDER="0" CELLPADDING="0" CELLSPACING="0" ><TR ><TD WIDTH="33%" ALIGN="left" VALIGN="top" ><A HREF="unicodegoodbad.htm" ACCESSKEY="P" >Prev</A ></TD ><TD WIDTH="34%" ALIGN="center" VALIGN="top" ><A HREF="index.htm" ACCESSKEY="H" >Home</A ></TD ><TD WIDTH="33%" ALIGN="right" VALIGN="top" ><A HREF="unicodefreetds.htm" ACCESSKEY="N" >Next</A ></TD ></TR ><TR ><TD WIDTH="33%" ALIGN="left" VALIGN="top" >Unicode's Pluses and Minuses</TD ><TD WIDTH="34%" ALIGN="center" VALIGN="top" ><A HREF="aboutunicode.htm" ACCESSKEY="U" >Up</A ></TD ><TD WIDTH="33%" ALIGN="right" VALIGN="top" >Unicode and FreeTDS</TD ></TR ></TABLE ></DIV ></BODY ></HTML >