<html> <head> <title>Character Sets</title> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> <link rel="stylesheet" href="theme/style.css" type="text/css"> </head> <body> <table width="100%" border="0" background="theme/bkd2.gif" cellspacing="2"> <tr> <td width="10"> </td> <td width="85%"> <font size="6" face="Verdana, Arial, Helvetica, sans-serif"><b>Character Sets</b></font> </td> <td width="112"><a href="http://spirit.sf.net"><img src="theme/spirit.gif" width="112" height="48" align="right" border="0"></a></td> </tr> </table> <br> <table border="0"> <tr> <td width="10"></td> <td width="30"><a href="../index.html"><img src="theme/u_arr.gif" border="0"></a></td> <td width="30"><a href="loops.html"><img src="theme/l_arr.gif" border="0"></a></td> <td width="20"><a href="confix.html"><img src="theme/r_arr.gif" border="0"></a></td> </tr> </table> <p>The character set <tt>chset</tt> matches a set of characters over a finite range bounded by the limits of its template parameter <tt>CharT</tt>. This class is an optimization of a parser that acts on a set of single characters. The template class is parameterized by the character type <tt>CharT</tt> and can work efficiently with 8, 16 and 32 and even 64 bit characters.</p> <pre><span class=identifier> </span><span class=keyword>template </span><span class=special><</span><span class=keyword>typename </span><span class=identifier>CharT </span><span class=special>= </span><span class=keyword>char</span><span class=special>> </span><span class=keyword>class </span><span class=identifier>chset</span><span class=special>;</span></pre> <p>The <tt>chset</tt> is constructed from literals (e.g. <tt>'x'</tt>), <tt>ch_p</tt> or <tt>chlit<></tt>, <tt>range_p</tt> or <tt>range<></tt>, <tt>anychar_p</tt> and <tt>nothing_p</tt> (see <a href="primitives.html">primitives</a>) or copy-constructed from another <tt>chset</tt>. The <tt>chset</tt> class uses a copy-on-write scheme that enables instances to be passed along easily by value.</p> <table width="80%" border="0" align="center"> <tr> <td class="note_box"><img src="theme/lens.gif" width="15" height="16"> <b>Sparse bit vectors</b><br> <br> In order to accomodate 16/32 and 64 bit characters, the <tt>chset</tt> class statically switches from a <tt>std::bitset</tt> implementation when the character type is not greater than 8 bits, to a sparse bit/boolean set which uses a sorted vector of disjoint ranges (<tt>range_run</tt>). The set is constructed from ranges such that adjacent or overlapping ranges are coalesced.<br> <br> range_runs are very space-economical in situations where there are lots of ranges and a few individual disjoint values. Searching is O(log n) where n is the number of ranges.</td> </tr> </table> <p> Examples:<br> </p> <pre><span class=identifier> </span><span class=identifier>chset</span><span class=special><> </span><span class=identifier>s1</span><span class=special>(</span><span class=literal>'x'</span><span class=special>); </span><span class=identifier>chset</span><span class=special><> </span><span class=identifier>s2</span><span class=special>(</span><span class=identifier>anychar_p </span><span class=special>- </span><span class=identifier>s1</span><span class=special>);</span></pre> <p>Optionally, character sets may also be constructed using a definition string following a syntax that resembles posix style regular expression character sets, except that double quotes delimit the set elements instead of square brackets and there is no special negation <tt>^</tt> character.</p> <pre> <span class=identifier>range </span><span class=special>= </span><span class=identifier>anychar_p </span><span class=special>>> </span><span class=literal>'-' </span><span class=special>>> </span><span class=identifier>anychar_p</span><span class=special>; </span><span class=identifier>set </span><span class=special>= *(</span><span class=identifier>range_p </span><span class=special>| </span><span class=identifier>anychar_p</span><span class=special>);</span></pre> <p>Since we are defining the set using a C string, the usual C/C++ literal string syntax rules apply. Examples:<br> </p> <pre> <span class=identifier>chset</span><span class=special><> </span><span class=identifier>s1</span><span class=special>(</span><span class=string>"a-zA-Z"</span><span class=special>); </span><span class=comment>// alphabetic characters </span><span class=identifier>chset</span><span class=special><> </span><span class=identifier>s2</span><span class=special>(</span><span class=string>"0-9a-fA-F"</span><span class=special>); </span><span class=comment>// hexadecimal characters </span><span class=identifier>chset</span><span class=special><> </span><span class=identifier>s3</span><span class=special>(</span><span class=string>"actgACTG"</span><span class=special>); </span><span class=comment>// DNA identifiers </span><span class=identifier>chset</span><span class=special><> </span><span class=identifier>s4</span><span class=special>(</span><span class=string>"\x7f\x7e"</span><span class=special>); </span><span class=comment>// Hexadecimal 0x7F and 0x7E</span></pre> <p>The standard Spirit set operators apply (see <a href="operators.html">operators</a>) plus an additional character-set-specific inverse (negation <tt>~</tt>) operator:<span class=comment></span></p> <table width="90%" border="0" align="center"> <tr> <td class="table_title" colspan="2">Character set operators</td> </tr> <tr> <td class="table_cells" width="28%"><b>~a</b></td> <td class="table_cells" width="72%">Set inverse</td> </tr> <tr> <td class="table_cells" width="28%"><b>a | b</b></td> <td class="table_cells" width="72%">Set union</td> </tr> <tr> <td class="table_cells" width="28%"><b>a & </b></td> <td class="table_cells" width="72%">Set intersection</td> </tr> <tr> <td class="table_cells" width="28%"><b>a - b</b></td> <td class="table_cells" width="72%">Set difference</td> </tr> <tr> <td class="table_cells" width="28%"><b>a ^ b</b></td> <td class="table_cells" width="72%">Set xor</td> </tr> </table> <p></p> <p></p> <p></p> <p></p> <p></p> <p></p> <p></p> <p></p> <p>where operands a and b are both <tt>chsets</tt> or one of the operand is either a literal character, <tt>ch_p</tt> or <tt>chlit</tt>, <tt>range_p</tt> or <tt>range</tt>, <tt>anychar_p</tt> or <tt>nothing_p</tt>. Special optimized overloads are provided for <tt>anychar_p</tt> and <tt>nothing_p</tt> operands. A <tt>nothing_p</tt> operand is converted to an empty set, while an <tt>anychar_p</tt> operand is converted to a set having elements of the full range of the character type used (e.g. 0-255 for unsigned 8 bit chars).</p> <p>A special case is <tt>~anychar_p</tt> which yields <tt>nothing_p</tt>, but <tt>~nothing_p</tt> is illegal. Inversion of <tt>anychar_p</tt> is asymmetrical, a one-way trip comparable to converting <tt>T*</tt> to a <tt>void*.</tt></p> <table width="90%" border="0" align="center"> <tr> <td class="table_title" colspan="2">Special conversions</td> </tr> <tr> <td class="table_cells" width="28%"><b>chset<CharT>(nothing_p)</b></td> <td class="table_cells" width="72%">empty set</td> </tr> <tr> <td class="table_cells" width="28%"><b>chset<CharT>(anychar_p)</b></td> <td class="table_cells" width="72%">full range of CharT (e.g. 0-255 for unsigned 8 bit chars)</td> </tr> <tr> <td class="table_cells" width="28%"><b>~anychar_p</b></td> <td class="table_cells" width="72%">nothing_p</td> </tr> <tr> <td class="table_cells" width="28%"><b>~nothing_p</b></td> <td class="table_cells" width="72%">illegal</td> </tr> </table> <p></p><table border="0"> <tr> <td width="10"></td> <td width="30"><a href="../index.html"><img src="theme/u_arr.gif" border="0"></a></td> <td width="30"><a href="loops.html"><img src="theme/l_arr.gif" border="0"></a></td> <td width="20"><a href="confix.html"><img src="theme/r_arr.gif" border="0"></a></td> </tr> </table> <br> <hr size="1"> <p class="copyright">Copyright © 1998-2002 Joel de Guzman<br> <br> <font size="2">Permission to copy, use, modify, sell and distribute this document is granted provided this copyright notice appears in all copies. This document is provided "as is" without express or implied warranty, and with no claim as to its suitability for any purpose. </font> </p> </body> </html>