<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> <head> <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"> <title>Bigloo: A ``practical Scheme compiler''. User manual for version 3.2b, June 2009</title> <style type="text/css"> <!-- pre { font-family: monospace } tt { font-family: monospace } code { font-family: monospace } p.flushright { text-align: right } p.flushleft { text-align: left } span.sc { font-variant: small-caps } span.sf { font-family: sans-serif } span.skribetitle { font-family: sans-serif; font-weight: bolder; font-size: x-large; } span.refscreen { } span.refprint { display: none; } --> </style> </head> <body class="chapter" bgcolor="#ffffff"> <table width="100%" class="skribetitle" cellspacing="0" cellpadding="0"><tbody> <tr><td align="center" bgcolor="#8381de"><div class="skribetitle"><strong><big><big><big>13. Bigloo<br/>A ``practical Scheme compiler''<br/>User manual for version 3.2b<br/>June 2009 -- Posix Regular Expressions</big></big></big></strong></div><center> </center> </td></tr></tbody></table> <table cellpadding="3" cellspacing="0" width="100%" class="skribe-margins"><tr> <td align="left" valign="top" class="skribe-left-margin" width="20%" bgcolor="#dedeff"><div class="skribe-left-margin"> <br/><center id='center28159' ><table width="97%" border="1" cellpadding="0" cellspacing="0" style="border-collapse: collapse;" frame="box" rules="none"><tbody> <tr bgcolor="#8381de"><th id="tc28149" align="center" colspan="1"><font color="#ffffff"><strong id='bold28147' >main page</strong></font></th></tr> <tr bgcolor="#ffffff"><td id="tc28156" align="center" colspan="1"><table width="100%" border="0" style="border-collapse: collapse;" frame="void" rules="none"><tbody> <tr><td id="tc28152" align="left" valign="top" colspan="1"><strong id='bold28151' >top:</strong></td><td id="tc28153" align="right" valign="top" colspan="1"><a 
href="bigloo.html#Bigloo-A-``practical-Scheme-compiler''-User-manual-for-version-3.2b-June-2009" class="inbound">Bigloo<br/>A ``practical Scheme compiler''<br/>User manual for version 3.2b<br/>June 2009</a></td></tr> </tbody></table> </td></tr> </tbody></table> </center> <br/><br/><center id='center28169' ><table width="97%" border="1" cellpadding="0" cellspacing="0" style="border-collapse: collapse;" frame="box" rules="none"><tbody> <tr bgcolor="#8381de"><th id="tc28163" align="center" colspan="1"><font color="#ffffff"><strong id='bold28161' >Posix Regular Expressions</strong></font></th></tr> <tr bgcolor="#ffffff"><td id="tc28166" align="center" colspan="1"><table cellspacing="1" cellpadding="1" width="100%" class="toc"> <tbody> <tr><td valign="top" align="left">13.1</td><td colspan="4" width="100%"><a href="bigloo-14.html#Regular-Expressions-Procedures">Regular Expressions Procedures</a></td></tr> <tr><td valign="top" align="left">13.2</td><td colspan="4" width="100%"><a href="bigloo-14.html#Regular-Expressions-Pattern-Language">Regular Expressions Pattern Language</a></td></tr> <tr><td></td><td valign="top" align="left">13.2.1</td><td colspan="3" width="100%"><a href="bigloo-14.html#Basic-assertions">Basic assertions</a></td></tr> <tr><td></td><td valign="top" align="left">13.2.2</td><td colspan="3" width="100%"><a href="bigloo-14.html#Characters-and-character-classes">Characters and character classes</a></td></tr> <tr><td></td><td valign="top" align="left">13.2.3</td><td colspan="3" width="100%"><a href="bigloo-14.html#Some-frequently-used-character-classes">Some frequently used character classes</a></td></tr> <tr><td></td><td valign="top" align="left">13.2.4</td><td colspan="3" width="100%"><a href="bigloo-14.html#POSIX-character-classes">POSIX character classes</a></td></tr> <tr><td></td><td valign="top" align="left">13.2.5</td><td colspan="3" width="100%"><a href="bigloo-14.html#Quantifiers">Quantifiers</a></td></tr> <tr><td></td><td valign="top" 
align="left">13.2.6</td><td colspan="3" width="100%"><a href="bigloo-14.html#Numeric-quantifiers">Numeric quantifiers</a></td></tr> <tr><td></td><td valign="top" align="left">13.2.7</td><td colspan="3" width="100%"><a href="bigloo-14.html#Non-greedy-quantifiers">Non-greedy quantifiers</a></td></tr> <tr><td></td><td valign="top" align="left">13.2.8</td><td colspan="3" width="100%"><a href="bigloo-14.html#Clusters">Clusters</a></td></tr> <tr><td></td><td valign="top" align="left">13.2.9</td><td colspan="3" width="100%"><a href="bigloo-14.html#Backreferences">Backreferences</a></td></tr> <tr><td></td><td valign="top" align="left">13.2.10</td><td colspan="3" width="100%"><a href="bigloo-14.html#Non-capturing-clusters">Non-capturing clusters</a></td></tr> <tr><td></td><td valign="top" align="left">13.2.11</td><td colspan="3" width="100%"><a href="bigloo-14.html#Cloisters">Cloisters</a></td></tr> <tr><td></td><td valign="top" align="left">13.2.12</td><td colspan="3" width="100%"><a href="bigloo-14.html#Alternation">Alternation</a></td></tr> <tr><td></td><td valign="top" align="left">13.2.13</td><td colspan="3" width="100%"><a href="bigloo-14.html#Backtracking">Backtracking</a></td></tr> <tr><td></td><td valign="top" align="left">13.2.14</td><td colspan="3" width="100%"><a href="bigloo-14.html#Disabling-backtracking">Disabling backtracking</a></td></tr> <tr><td></td><td valign="top" align="left">13.2.15</td><td colspan="3" width="100%"><a href="bigloo-14.html#Looking-ahead-and-behind">Looking ahead and behind</a></td></tr> <tr><td></td><td valign="top" align="left">13.2.16</td><td colspan="3" width="100%"><a href="bigloo-14.html#Lookahead">Lookahead</a></td></tr> <tr><td></td><td valign="top" align="left">13.2.17</td><td colspan="3" width="100%"><a href="bigloo-14.html#Lookbehind">Lookbehind</a></td></tr> <tr><td valign="top" align="left">13.3</td><td colspan="4" width="100%"><a href="bigloo-14.html#An-Extended-Example">An Extended Example</a></td></tr> </tbody> </table> 
</td></tr> </tbody></table> </center> </div></td> <td align="left" valign="top" class="skribe-body"><div class="skribe-body"> <a name="Posix-Regular-Expressions" class="mark"></a><a name="g16497" class="mark"></a> This whole section has been written by <strong id='bold16499' >Dorai Sitaram</strong>. 
It consists of the documentation of the <code id='code16500' >pregexp</code> package that may be found at <a href="http://www.ccs.neu.edu/~dorai/pregexp/pregexp.html">http://www.ccs.neu.edu/~dorai/pregexp/pregexp.html</a>.<br/><br/><br/> The regexp notation supported is modeled on Perl's, and includes such powerful directives as numeric and nongreedy quantifiers, capturing and non-capturing clustering, POSIX character classes, selective case- and space-insensitivity, backreferences, alternation, backtrack pruning, positive and negative lookahead and lookbehind, in addition to the more basic directives familiar to all regexp users. A <em id='emph16503' >regexp</em> is a string that describes a pattern. A regexp matcher tries to <em id='emph16504' >match</em> this pattern against (a portion of) another string, which we will call the <em id='emph16505' >text string</em>. The text string is treated as raw text and not as a pattern.<br/><br/>Most of the characters in a regexp pattern are meant to match occurrences of themselves in the text string. Thus, the pattern <code id='code16507' >"abc"</code> matches a string that contains the characters <code id='code16508' >a</code>, <code id='code16509' >b</code>, <code id='code16510' >c</code> in succession.<br/><br/>In the regexp pattern, some characters act as <em id='emph16512' >metacharacters</em>, and some character sequences act as <em id='emph16513' >metasequences</em>. That is, they specify something other than their literal selves. For example, in the pattern <code id='code16514' >"a.c"</code>, the characters <code id='code16515' >a</code> and <code id='code16516' >c</code> do stand for themselves but the <em id='emph16517' >metacharacter</em> <code id='code16518' >.</code> can match <em id='emph16519' >any</em> character (other than newline). 
Therefore, the pattern <code id='code16520' >"a.c"</code> matches an <code id='code16521' >a</code>, followed by <em id='emph16522' >any</em> character, followed by a <code id='code16523' >c</code>. <br/><br/>To match the character <code id='code16525' >.</code> itself, we <em id='emph16526' >escape</em> it, i.e., precede it with a backslash (<code id='code16527' >\</code>). The character sequence <code id='code16528' >\.</code> is thus a <em id='emph16529' >metasequence</em>, since it doesn't match itself but rather just <code id='code16530' >.</code>. So, to match <code id='code16531' >a</code> followed by a literal <code id='code16532' >.</code> followed by <code id='code16533' >c</code>, we use the regexp pattern <code id='code16534' >"a\\.c"</code>.<a href="#footnote-footnote16536"><sup><small>1</small></sup></a> Another example of a metasequence is <code id='code16537' >\t</code>, which is a readable way to represent the tab character.<br/><br/>We will call the string representation of a regexp the <em id='emph16539' >U-regexp</em>, where <em id='emph16540' >U</em> can be taken to mean <em id='emph16541' >Unix-style</em> or <em id='emph16542' >universal</em>, because this notation for regexps is universally familiar. Our implementation uses an intermediate tree-like representation called the <em id='emph16543' >S-regexp</em>, where <em id='emph16544' >S</em> can stand for <em id='emph16545' >Scheme</em>, <em id='emph16546' >symbolic</em>, or <em id='emph16547' >s-expression</em>. S-regexps are more verbose and less readable than U-regexps, but they are much easier for Scheme's recursive procedures to navigate. 
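<br/><br/>As a small illustration of the difference between the metacharacter <code>.</code> and the metasequence <code>\.</code>, the following sketch applies both patterns with <code>pregexp-match</code>, which is described in the next section. The text strings are chosen here only for illustration.<br/><br/><center><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog">;; The metacharacter . matches any character other than newline
(pregexp-match <font color="red">"a.c"</font> <font color="red">"abc"</font>) => (<font color="red">"abc"</font>)
(pregexp-match <font color="red">"a.c"</font> <font color="red">"a.c"</font>) => (<font color="red">"a.c"</font>)

;; The metasequence \. matches only a literal period
(pregexp-match <font color="red">"a\\.c"</font> <font color="red">"abc"</font>) => #f
(pregexp-match <font color="red">"a\\.c"</font> <font color="red">"a.c"</font>) => (<font color="red">"a.c"</font>)
</pre> </td></tr> </tbody></table></center>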
<br/><br/> <!-- Regular Expressions Procedures --> <a name="Regular-Expressions-Procedures"></a> <div class="section-atitle"><table width="100%"><tr><td bgcolor="#dedeff"><h3><font color="black">13.1 Regular Expressions Procedures</font> </h3></td></tr></table> </div><div class="section"> <a name="Regular-Expressions-Procedures" class="mark"></a> Five procedures, <code id='code16549' >pregexp</code>, <code id='code16550' >pregexp-match-positions</code>, <code id='code16551' >pregexp-match</code>, <code id='code16552' >pregexp-replace</code>, and <code id='code16553' >pregexp-replace*</code>, enable compilation and matching of regular expressions; <code>pregexp-split</code> and <code>pregexp-quote</code>, also described in this section, split a text string and quote a literal string.<br/><br/><table cellspacing="0" class="frame" cellpadding="10" border="1" width="100%"><tbody> <tr><td><a name="g16556" class="mark"></a><a name="pregexp" class="mark"></a><table width="100%" style="border-collapse: collapse;" frame="void" rules="none"><tbody> <tr><td id="tc16560" align="left" colspan="1"><strong id='bold16558' >pregexp</strong><em id='it16559' > U-regexp</em></td><td id="tc16561" align="right" colspan="1">bigloo procedure</td></tr> </tbody></table> The procedure <code id='code16564' >pregexp</code> takes a U-regexp, which is a string, and returns an S-regexp, which is a tree. <br/><br/><center id='center16573' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog16571' >(pregexp <font color="red">"c.r"</font>) => (<strong id='bold28181' >:sub</strong> (<strong id='bold28183' >:or</strong> (<strong id='bold28185' >:seq</strong> #\c <strong id='bold28187' >:any</strong> #\r))) </pre> </td></tr> </tbody></table></center> There is rarely any need to look at the S-regexps returned by <code id='code16574' >pregexp</code>. 
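<br/><br/>Since the matching procedures accept either a U-regexp or an S-regexp, the value returned by <code>pregexp</code> can be kept and reused. A minimal sketch, in which the variable name <code>c-any-r</code> is only illustrative:<br/><br/><center><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog">;; Compile the U-regexp once
(define c-any-r (pregexp <font color="red">"c.r"</font>))

;; Reuse the compiled S-regexp for several matches
(pregexp-match c-any-r <font color="red">"car"</font>)  => (<font color="red">"car"</font>)
(pregexp-match c-any-r <font color="red">"cdr"</font>)  => (<font color="red">"cdr"</font>)
(pregexp-match c-any-r <font color="red">"cons"</font>) => #f
</pre> </td></tr> </tbody></table></center>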
</td></tr> </tbody></table><br/><br/><br/><table cellspacing="0" class="frame" cellpadding="10" border="1" width="100%"><tbody> <tr><td><a name="g16579" class="mark"></a><a name="pregexp-match-positions" class="mark"></a><table width="100%" style="border-collapse: collapse;" frame="void" rules="none"><tbody> <tr><td id="tc16583" align="left" colspan="1"><strong id='bold16581' >pregexp-match-positions</strong><em id='it16582' > regexp string</em></td><td id="tc16584" align="right" colspan="1">bigloo procedure</td></tr> </tbody></table> The procedure <code id='code16587' >pregexp-match-positions</code> takes a regexp pattern and a text string, and returns a <em id='emph16588' >match</em> if the pattern <em id='emph16589' >matches</em> the text string. The pattern may be either a U- or an S-regexp. (<code id='code16590' >pregexp-match-positions</code> will internally compile a U-regexp to an S-regexp before proceeding with the matching. If you find yourself calling <code id='code16591' >pregexp-match-positions</code> repeatedly with the same U-regexp, it may be advisable to explicitly convert the latter into an S-regexp once beforehand, using <code id='code16592' >pregexp</code>, to save needless recompilation.)<br/><br/><code id='code16594' >pregexp-match-positions</code> returns <code id='code16595' >#f</code> if the pattern did not match the string; and a list of <em id='emph16596' >index pairs</em> if it did match. E.g.,<br/><br/><center id='center16604' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog16602' >(pregexp-match-positions <font color="red">"brain"</font> <font color="red">"bird"</font>) => #f<br/>(pregexp-match-positions <font color="red">"needle"</font> <font color="red">"hay needle stack"</font>) => ((4 . 10)) </pre> </td></tr> </tbody></table></center> In the second example, the integers 4 and 10 identify the substring that was matched. 
The 4 is the starting (inclusive) index and 10 the ending (exclusive) index of the matching substring.<br/><br/><center id='center16610' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog16608' >(substring <font color="red">"hay needle stack"</font> 4 10) => <font color="red">"needle"</font> </pre> </td></tr> </tbody></table></center> Here, <code id='code16611' >pregexp-match-positions</code>'s return list contains only one index pair, and that pair represents the entire substring matched by the regexp. When we discuss <em id='emph16612' >subpatterns</em> later, we will see how a single match operation can yield a list of <em id='emph16613' >submatches</em>.<br/><br/><code id='code16615' >pregexp-match-positions</code> takes optional third and fourth arguments that specify the indices of the text string within which the matching should take place. <br/><br/><center id='center16621' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog16619' >(pregexp-match-positions <font color="red">"needle"</font> <font color="red">"his hay needle stack -- my hay needle stack -- her hay needle stack"</font> 24 43) => ((31 . 37)) </pre> </td></tr> </tbody></table></center> Note that the returned indices are still reckoned relative to the full text string. 
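<br/><br/>Because the indices refer to the full text string, the optional arguments can be used to walk through a string and collect every match. The following is only a sketch of one way to do this: the names <code>all-match-positions</code> and <code>text</code> are hypothetical and not part of the <code>pregexp</code> package. The search restarts at the end of the previous match, and steps one character further after an empty match so that the loop terminates.<br/><br/><center><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog">;; Hypothetical helper: list the index pair of every match of PAT in STR
(define (all-match-positions pat str)
  ;; compile a U-regexp once, before the loop
  (let ((re (if (string? pat) (pregexp pat) pat))
        (len (string-length str)))
    (let loop ((start 0) (acc '()))
      (let ((m (and (>= len start)
                    (pregexp-match-positions re str start len))))
        (if m
            (let* ((whole (car m))
                   ;; after an empty match, advance by one character
                   (next (if (= (car whole) (cdr whole))
                             (+ (cdr whole) 1)
                             (cdr whole))))
              (loop next (cons whole acc)))
            (reverse acc))))))

(define text <font color="red">"his hay needle stack -- my hay needle stack -- her hay needle stack"</font>)

(all-match-positions <font color="red">"needle"</font> text) => ((8 . 14) (31 . 37) (55 . 61))

;; The pairs index the full text string, so substring recovers each match
(map (lambda (p) (substring text (car p) (cdr p)))
     (all-match-positions <font color="red">"needle"</font> text))
  => (<font color="red">"needle"</font> <font color="red">"needle"</font> <font color="red">"needle"</font>)
</pre> </td></tr> </tbody></table></center>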
</td></tr> </tbody></table><br/> <table cellspacing="0" class="frame" cellpadding="10" border="1" width="100%"><tbody> <tr><td><a name="g16625" class="mark"></a><a name="pregexp-match" class="mark"></a><table width="100%" style="border-collapse: collapse;" frame="void" rules="none"><tbody> <tr><td id="tc16629" align="left" colspan="1"><strong id='bold16627' >pregexp-match</strong><em id='it16628' > regexp string</em></td><td id="tc16630" align="right" colspan="1">bigloo procedure</td></tr> </tbody></table> The procedure <code id='code16633' >pregexp-match</code> is called like <code id='code16634' >pregexp-match-positions</code> but instead of returning index pairs it returns the matching substrings:<br/><br/><center id='center16643' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog16641' >(pregexp-match <font color="red">"brain"</font> <font color="red">"bird"</font>) => #f<br/>(pregexp-match <font color="red">"needle"</font> <font color="red">"hay needle stack"</font>) => (<font color="red">"needle"</font>) </pre> </td></tr> </tbody></table></center> <code id='code16644' >pregexp-match</code> also takes optional third and fourth arguments, with the same meaning as for <code id='code16645' >pregexp-match-positions</code>. </td></tr> </tbody></table><br/> <table cellspacing="0" class="frame" cellpadding="10" border="1" width="100%"><tbody> <tr><td><a name="g16649" class="mark"></a><a name="pregexp-replace" class="mark"></a><table width="100%" style="border-collapse: collapse;" frame="void" rules="none"><tbody> <tr><td id="tc16653" align="left" colspan="1"><strong id='bold16651' >pregexp-replace</strong><em id='it16652' > regexp string1 string2</em></td><td id="tc16654" align="right" colspan="1">bigloo procedure</td></tr> </tbody></table> The procedure <code id='code16657' >pregexp-replace</code> replaces the matched portion of the text string by another string. 
The first argument is the regexp, the second the text string, and the third is the <em id='emph16658' >insert string</em> (string to be inserted).<br/><br/><center id='center16666' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog16664' >(pregexp-replace <font color="red">"te"</font> <font color="red">"liberte"</font> <font color="red">"ty"</font>) => <font color="red">"liberty"</font> </pre> </td></tr> </tbody></table></center> If the pattern doesn't occur in the text string, the returned string is identical (<code id='code16667' >eq?</code>) to the text string. </td></tr> </tbody></table><br/> <table cellspacing="0" class="frame" cellpadding="10" border="1" width="100%"><tbody> <tr><td><a name="g16671" class="mark"></a><a name="pregexp-replace*" class="mark"></a><table width="100%" style="border-collapse: collapse;" frame="void" rules="none"><tbody> <tr><td id="tc16675" align="left" colspan="1"><strong id='bold16673' >pregexp-replace*</strong><em id='it16674' > regexp string1 string2</em></td><td id="tc16676" align="right" colspan="1">bigloo procedure</td></tr> </tbody></table> The procedure <code id='code16679' >pregexp-replace*</code> replaces <em id='emph16680' >all</em> matches in the text <code id='code16682' ><em id='it16681' >string1</em></code> by the insert <code id='code16684' ><em id='it16683' >string2</em></code>:<br/><br/><center id='center16692' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog16690' >(pregexp-replace* <font color="red">"te"</font> <font color="red">"liberte egalite fraternite"</font> <font color="red">"ty"</font>) => <font color="red">"liberty egality fratyrnity"</font> </pre> </td></tr> </tbody></table></center> As with <code id='code16693' >pregexp-replace</code>, if the pattern doesn't occur in the text string, the returned string is identical (<code id='code16694' >eq?</code>) 
to the text string. </td></tr> </tbody></table><br/> <table cellspacing="0" class="frame" cellpadding="10" border="1" width="100%"><tbody> <tr><td><a name="g16698" class="mark"></a><a name="pregexp-split" class="mark"></a><table width="100%" style="border-collapse: collapse;" frame="void" rules="none"><tbody> <tr><td id="tc16702" align="left" colspan="1"><strong id='bold16700' >pregexp-split</strong><em id='it16701' > regexp string</em></td><td id="tc16703" align="right" colspan="1">bigloo procedure</td></tr> </tbody></table> The procedure <code id='code16706' >pregexp-split</code> takes two arguments, a regexp pattern and a text string, and returns a list of substrings of the text string, where the pattern identifies the delimiter separating the substrings.<br/><br/><center id='center16721' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog16719' >(pregexp-split <font color="red">":"</font> <font color="red">"/bin:/usr/bin:/usr/bin/X11:/usr/local/bin"</font>) => (<font color="red">"/bin"</font> <font color="red">"/usr/bin"</font> <font color="red">"/usr/bin/X11"</font> <font color="red">"/usr/local/bin"</font>)<br/><br/>(pregexp-split <font color="red">" "</font> <font color="red">"pea soup"</font>) => (<font color="red">"pea"</font> <font color="red">"soup"</font>) </pre> </td></tr> </tbody></table></center> If the first argument can match an empty string, then the list of all the single-character substrings is returned.<br/><br/><center id='center16738' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog16736' >(pregexp-split <font color="red">""</font> <font color="red">"smithereens"</font>) => (<font color="red">"s"</font> <font color="red">"m"</font> <font color="red">"i"</font> <font color="red">"t"</font> <font color="red">"h"</font> <font color="red">"e"</font> <font color="red">"r"</font> <font 
color="red">"e"</font> <font color="red">"e"</font> <font color="red">"n"</font> <font color="red">"s"</font>) </pre> </td></tr> </tbody></table></center> To identify one-or-more spaces as the delimiter, take care to use the regexp <code id='code16739' >" +"</code>, not <code id='code16740' >" *"</code>.<br/><br/><center id='center16764' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog16762' >(pregexp-split <font color="red">" +"</font> <font color="red">"split pea soup"</font>) => (<font color="red">"split"</font> <font color="red">"pea"</font> <font color="red">"soup"</font>)<br/><br/>(pregexp-split <font color="red">" *"</font> <font color="red">"split pea soup"</font>) => (<font color="red">"s"</font> <font color="red">"p"</font> <font color="red">"l"</font> <font color="red">"i"</font> <font color="red">"t"</font> <font color="red">"p"</font> <font color="red">"e"</font> <font color="red">"a"</font> <font color="red">"s"</font> <font color="red">"o"</font> <font color="red">"u"</font> <font color="red">"p"</font>) </pre> </td></tr> </tbody></table></center> </td></tr> </tbody></table><br/> <table cellspacing="0" class="frame" cellpadding="10" border="1" width="100%"><tbody> <tr><td><a name="g16768" class="mark"></a><a name="pregexp-quote" class="mark"></a><table width="100%" style="border-collapse: collapse;" frame="void" rules="none"><tbody> <tr><td id="tc16772" align="left" colspan="1"><strong id='bold16770' >pregexp-quote</strong><em id='it16771' > string</em></td><td id="tc16773" align="right" colspan="1">bigloo procedure</td></tr> </tbody></table> The procedure <code id='code16776' >pregexp-quote</code> takes an arbitrary <code id='code16778' ><em id='it16777' >string</em></code> and returns a U-regexp (string) that precisely represents it. 
In particular, characters in the input string that could serve as regexp metacharacters are escaped with a backslash, so that they safely match only themselves.<br/><br/><center id='center16787' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog16785' >(pregexp-quote <font color="red">"cons"</font>) => <font color="red">"cons"</font><br/><br/>(pregexp-quote <font color="red">"list?"</font>) => <font color="red">"list\\?"</font> </pre> </td></tr> </tbody></table></center> <code id='code16788' >pregexp-quote</code> is useful when building a composite regexp from a mix of regexp strings and verbatim strings. </td></tr> </tbody></table><br/> </div><br> <!-- Regular Expressions Pattern Language --> <a name="Regular-Expressions-Pattern-Language"></a> <div class="section-atitle"><table width="100%"><tr><td bgcolor="#dedeff"><h3><font color="black">13.2 Regular Expressions Pattern Language</font> </h3></td></tr></table> </div><div class="section"> <a name="The-Regular-Expressions-Pattern-Language" class="mark"></a> Here is a complete description of the regexp pattern language recognized by the <code id='code16791' >pregexp</code> procedures.<br/><br/><!-- Basic assertions --> <a name="Basic-assertions"></a> <div class="subsection-atitle"><table width="100%"><tr><td bgcolor="#ffffff"><h3><font color="#8381de">13.2.1 Basic assertions</font> </h3></td></tr></table> </div><div class="subsection"> <a name="Basic-assertions" class="mark"></a> The <em id='emph16793' >assertions</em> <code id='code16794' >^</code> and <code id='code16795' >$</code> identify the beginning and the end of the text string respectively. They ensure that their adjoining regexps match at one or other end of the text string. 
Examples:<br/><br/><center id='center16801' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog16799' >(pregexp-match-positions <font color="red">"^contact"</font> <font color="red">"first contact"</font>) => #f </pre> </td></tr> </tbody></table></center> The regexp fails to match because <code id='code16802' >contact</code> does not occur at the beginning of the text string.<br/><br/><center id='center16808' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog16806' >(pregexp-match-positions <font color="red">"laugh$"</font> <font color="red">"laugh laugh laugh laugh"</font>) => ((18 . 23)) </pre> </td></tr> </tbody></table></center> The regexp matches the <em id='emph16809' >last</em> <code id='code16810' >laugh</code>. The metasequence <code id='code16811' >\b</code> asserts that a <em id='emph16812' >word boundary</em> exists. <br/><br/><center id='center16818' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog16816' >(pregexp-match-positions <font color="red">"yack\\b"</font> <font color="red">"yackety yack"</font>) => ((8 . 12)) </pre> </td></tr> </tbody></table></center> The <code id='code16819' >yack</code> in <code id='code16820' >yackety</code> doesn't end at a word boundary so it isn't matched. The second <code id='code16821' >yack</code> does and is.<br/><br/>The metasequence <code id='code16823' >\B</code> has the opposite effect to <code id='code16824' >\b</code>. It asserts that a word boundary does not exist.<br/><br/><center id='center16830' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog16828' >(pregexp-match-positions <font color="red">"an\\B"</font> <font color="red">"an analysis"</font>) => ((3 . 
5)) </pre> </td></tr> </tbody></table></center> The <code id='code16831' >an</code> that doesn't end in a word boundaryis matched.<br/><br/></div> <!-- Characters and character classes --> <a name="Characters-and-character-classes"></a> <div class="subsection-atitle"><table width="100%"><tr><td bgcolor="#ffffff"><h3><font color="#8381de">13.2.2 Characters and character classes</font> </h3></td></tr></table> </div><div class="subsection"> <a name="Characters-and-character-classes" class="mark"></a> Typically a character in the regexp matches the same character in the text string. Sometimes it is necessary or convenient to use a regexp metasequence to refer to a single character. Thus, metasequences <code id='code16833' >\n</code>, <code id='code16834' >\r</code>, <code id='code16835' >\t</code>, and <code id='code16836' >\.</code> match the newline, return, tab and period characters respectively.<br/><br/>The <em id='emph16838' >metacharacter</em> period (<code id='code16839' >.</code>) matches <em id='emph16840' >any</em> character other than newline.<br/><br/><center id='center16847' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog16845' >(pregexp-match <font color="red">"p.t"</font> <font color="red">"pet"</font>) => (<font color="red">"pet"</font>) </pre> </td></tr> </tbody></table></center> It also matches <code id='code16848' >pat</code>, <code id='code16849' >pit</code>, <code id='code16850' >pot</code>, <code id='code16851' >put</code>,and <code id='code16852' >p8t</code> but not <code id='code16853' >peat</code> or <code id='code16854' >pfffft</code>.<br/><br/>A <em id='emph16856' >character class</em> matches any one character from a set of characters. 
A typical format for this is the <em id='emph16857' >bracketed character class</em> <code id='code16858' >[</code>...<code id='code16859' >]</code>, which matches any one character from the non-empty sequence of characters enclosed within the brackets.<a href="#footnote-footnote16860"><sup><small>2</small></sup></a> Thus <code id='code16861' >"p[aeiou]t"</code> matches <code id='code16862' >pat</code>, <code id='code16863' >pet</code>, <code id='code16864' >pit</code>, <code id='code16865' >pot</code>, <code id='code16866' >put</code> and nothing else.<br/><br/>Inside the brackets, a hyphen (<code id='code16868' >-</code>) between two characters specifies the ascii range between the characters. Eg, <code id='code16869' >"ta[b-dgn-p]"</code> matches <code id='code16870' >tab</code>, <code id='code16871' >tac</code>, <code id='code16872' >tad</code>, <em id='emph16873' >and</em> <code id='code16874' >tag</code>, <em id='emph16875' >and</em> <code id='code16876' >tan</code>, <code id='code16877' >tao</code>, <code id='code16878' >tap</code>.<br/><br/>An initial caret (<code id='code16880' >^</code>) after the left bracket inverts the set specified by the rest of the contents, ie, it specifies the set of characters <em id='emph16881' >other than</em> those identified in the brackets. Eg, <code id='code16882' >"do[^g]"</code> matches all three-character sequences starting with <code id='code16883' >do</code> except <code id='code16884' >dog</code>.<br/><br/>Note that the metacharacter <code id='code16886' >^</code> inside brackets means something quite different from what it means outside. Most other metacharacters (<code id='code16887' >.</code>, <code id='code16888' >*</code>, <code id='code16889' >+</code>, <code id='code16890' >?</code>, etc) cease to be metacharacters when inside brackets, although you may still escape them for peace of mind. 
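<br/><br/>As a brief sketch of this point (the patterns and text strings here are illustrative assumptions, not examples from the original text), a bracketed <code>+</code>, <code>*</code> or <code>.</code> matches only the literal character:<br/><br/><center><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog">;inside brackets, + and * stand for themselves
(pregexp-match <font color="red">"[+*]"</font> <font color="red">"2+3"</font>) => (<font color="red">"+"</font>)
;a bracketed period is a literal period, not ``any character''
(pregexp-match <font color="red">"a[.]b"</font> <font color="red">"axb"</font>) => #f
(pregexp-match <font color="red">"a[.]b"</font> <font color="red">"a.b"</font>) => (<font color="red">"a.b"</font>)
</pre> </td></tr> </tbody></table></center> 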
<code id='code16891' >-</code> is a metacharacter only when it's inside brackets, and neither the first nor the last character.<br/><br/>Bracketed character classes cannot contain other bracketed character classes (although they contain certain other types of character classes --- see below). Thus a left bracket (<code id='code16893' >[</code>) inside a bracketed character class doesn't have to be a metacharacter; it can stand for itself. Eg, <code id='code16894' >"[a[b]"</code> matches <code id='code16895' >a</code>, <code id='code16896' >[</code>, and <code id='code16897' >b</code>.<br/><br/>Furthermore, since empty bracketed character classes are disallowed, a right bracket (<code id='code16899' >]</code>) immediately occurring after the opening left bracket also doesn't need to be a metacharacter. Eg, <code id='code16900' >"[]ab]"</code> matches <code id='code16901' >]</code>, <code id='code16902' >a</code>, and <code id='code16903' >b</code>.<br/><br/></div> <!-- Some frequently used character classes --> <a name="Some-frequently-used-character-classes"></a> <div class="subsection-atitle"><table width="100%"><tr><td bgcolor="#ffffff"><h3><font color="#8381de">13.2.3 Some frequently used character classes</font> </h3></td></tr></table> </div><div class="subsection"> Some standard character classes can be conveniently represented as metasequences instead of as explicit bracketed expressions. <code id='code16905' >\d</code> matches a digit (<code id='code16906' >[0-9]</code>); <code id='code16907' >\s</code> matches a whitespace character; and <code id='code16908' >\w</code> matches a character that could be part of a ``word''.<a href="#footnote-footnote16910"><sup><small>3</small></sup></a><br/><br/>The upper-case versions of these metasequences stand for the inversions of the corresponding character classes. 
Thus <code id='code16912' >\D</code> matches a non-digit, <code id='code16913' >\S</code> a non-whitespace character, and <code id='code16914' >\W</code> a non-``word'' character.<br/><br/>Remember to include a double backslash when putting these metasequences in a Scheme string:<br/><br/><center id='center16922' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog16920' >(pregexp-match <font color="red">"\\d\\d"</font> <font color="red">"0 dear, 1 have 2 read catch 22 before 9"</font>) => (<font color="red">"22"</font>) </pre> </td></tr> </tbody></table></center> These character classes can be used inside a bracketed expression. Eg, <code id='code16923' >"[a-z\\d]"</code> matches a lower-case letter or a digit.<br/><br/></div> <!-- POSIX character classes --> <a name="POSIX-character-classes"></a> <div class="subsection-atitle"><table width="100%"><tr><td bgcolor="#ffffff"><h3><font color="#8381de">13.2.4 POSIX character classes</font> </h3></td></tr></table> </div><div class="subsection"> A <em id='emph16925' >POSIX character class</em> is a special metasequence of the form <code id='code16926' >[:</code>...<code id='code16927' >:]</code> that can be used only inside a bracketed expression. 
The POSIX classes supported are <br/><br/><center id='center16953' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ccccff"><pre class="prog" id='prog16951' ><code id='code16929' >[:alnum:]</code>   letters and digits
<code id='code16930' >[:alpha:]</code>   letters
<code id='code16931' >[:algor:]</code>   the letters <code id='code16932' >c</code>, <code id='code16933' >h</code>, <code id='code16934' >a</code> and <code id='code16935' >d</code>
<code id='code16936' >[:ascii:]</code>   7-bit ascii characters
<code id='code16937' >[:blank:]</code>   widthful whitespace, ie, space and tab
<code id='code16938' >[:cntrl:]</code>   ``control'' characters, viz, those with code <code id='code16939' >&lt;</code> 32
<code id='code16940' >[:digit:]</code>   digits, same as <code id='code16941' >\d</code>
<code id='code16942' >[:graph:]</code>   characters that use ink
<code id='code16943' >[:lower:]</code>   lower-case letters
<code id='code16944' >[:print:]</code>   ink-users plus widthful whitespace
<code id='code16945' >[:space:]</code>   whitespace, same as <code id='code16946' >\s</code>
<code id='code16947' >[:upper:]</code>   upper-case letters
<code id='code16948' >[:word:]</code>    letters, digits, and underscore, same as <code id='code16949' >\w</code>
<code id='code16950' >[:xdigit:]</code>  hex digits
</pre> </td></tr> </tbody></table></center> For example, the regexp <code id='code16954' >"[[:alpha:]_]"</code> matches a letter or underscore. 
<br/><br/><center id='center16966' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog16964' >(pregexp-match <font color="red">"[[:alpha:]_]"</font> <font color="red">"--x--"</font>) => (<font color="red">"x"</font>)
(pregexp-match <font color="red">"[[:alpha:]_]"</font> <font color="red">"--_--"</font>) => (<font color="red">"_"</font>)
(pregexp-match <font color="red">"[[:alpha:]_]"</font> <font color="red">"--:--"</font>) => #f
</pre> </td></tr> </tbody></table></center> The POSIX class notation is valid <em id='emph16967' >only</em> inside a bracketed expression. For instance, <code id='code16968' >[:alpha:]</code>, when not inside a bracketed expression, will <em id='emph16969' >not</em> be read as the letter class. Rather it is (from previous principles) the character class containing the characters <code id='code16970' >:</code>, <code id='code16971' >a</code>, <code id='code16972' >l</code>, <code id='code16973' >p</code>, <code id='code16974' >h</code>.<br/><br/><center id='center16983' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog16981' >(pregexp-match <font color="red">"[:alpha:]"</font> <font color="red">"--a--"</font>) => (<font color="red">"a"</font>)
(pregexp-match <font color="red">"[:alpha:]"</font> <font color="red">"--_--"</font>) => #f
</pre> </td></tr> </tbody></table></center> By placing a caret (<code id='code16984' >^</code>) immediately after <code id='code16985' >[:</code>, you get the inversion of that POSIX character class. 
Thus, <code id='code16986' >[:^alpha:]</code> is the class containing all characters except the letters.<br/><br/></div> <!-- Quantifiers --> <a name="Quantifiers"></a> <div class="subsection-atitle"><table width="100%"><tr><td bgcolor="#ffffff"><h3><font color="#8381de">13.2.5 Quantifiers</font> </h3></td></tr></table> </div><div class="subsection"> <a name="Quantifiers" class="mark"></a> The <em id='emph16988' >quantifiers</em> <code id='code16989' >*</code>, <code id='code16990' >+</code>, and <code id='code16991' >?</code> match respectively: zero or more, one or more, and zero or one instances of the preceding subpattern.<br/><br/><center id='center17011' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17009' >(pregexp-match-positions <font color="red">"c[ad]*r"</font> <font color="red">"cadaddadddr"</font>) => ((0 . 11))
(pregexp-match-positions <font color="red">"c[ad]*r"</font> <font color="red">"cr"</font>) => ((0 . 2))<br/><br/>(pregexp-match-positions <font color="red">"c[ad]+r"</font> <font color="red">"cadaddadddr"</font>) => ((0 . 11))
(pregexp-match-positions <font color="red">"c[ad]+r"</font> <font color="red">"cr"</font>) => #f<br/><br/>(pregexp-match-positions <font color="red">"c[ad]?r"</font> <font color="red">"cadaddadddr"</font>) => #f
(pregexp-match-positions <font color="red">"c[ad]?r"</font> <font color="red">"cr"</font>) => ((0 . 2))
(pregexp-match-positions <font color="red">"c[ad]?r"</font> <font color="red">"car"</font>) => ((0 . 3))
</pre> </td></tr> </tbody></table></center> </div> <!-- Numeric quantifiers --> <a name="Numeric-quantifiers"></a> <div class="subsection-atitle"><table width="100%"><tr><td bgcolor="#ffffff"><h3><font color="#8381de">13.2.6 Numeric quantifiers</font> </h3></td></tr></table> </div><div class="subsection"> You can use braces to specify much finer-tuned quantification than is possible with <code id='code17012' >*</code>, <code id='code17013' >+</code>, <code id='code17014' >?</code>.<br/><br/>The quantifier <code id='code17016' >{m}</code> matches <em id='emph17017' >exactly</em> <code id='code17018' >m</code> instances of the preceding <em id='emph17019' >subpattern</em>. <code id='code17020' >m</code> must be a nonnegative integer.<br/><br/>The quantifier <code id='code17022' >{m,n}</code> matches at least <code id='code17023' >m</code> and at most <code id='code17024' >n</code> instances. <code id='code17025' >m</code> and <code id='code17026' >n</code> are nonnegative integers with <code id='code17027' >m &lt;= n</code>. You may omit either or both numbers, in which case <code id='code17028' >m</code> defaults to 0 and <code id='code17029' >n</code> to infinity.<br/><br/>It is evident that <code id='code17031' >+</code> and <code id='code17032' >?</code> are abbreviations for <code id='code17033' >{1,}</code> and <code id='code17034' >{0,1}</code> respectively. 
<code id='code17035' >*</code> abbreviates <code id='code17036' >{,}</code>, which is the same as <code id='code17037' >{0,}</code>.<br/><br/><center id='center17047' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17045' >(pregexp-match <font color="red">"[aeiou]{3}"</font> <font color="red">"vacuous"</font>) => (<font color="red">"uou"</font>)
(pregexp-match <font color="red">"[aeiou]{3}"</font> <font color="red">"evolve"</font>) => #f
(pregexp-match <font color="red">"[aeiou]{2,3}"</font> <font color="red">"evolve"</font>) => #f
(pregexp-match <font color="red">"[aeiou]{2,3}"</font> <font color="red">"zeugma"</font>) => (<font color="red">"eu"</font>)
</pre> </td></tr> </tbody></table></center> </div> <!-- Non-greedy quantifiers --> <a name="Non-greedy-quantifiers"></a> <div class="subsection-atitle"><table width="100%"><tr><td bgcolor="#ffffff"><h3><font color="#8381de">13.2.7 Non-greedy quantifiers</font> </h3></td></tr></table> </div><div class="subsection"> The quantifiers described above are <em id='emph17048' >greedy</em>, ie, they match the maximal number of instances that would still lead to an overall match for the full pattern.<br/><br/><center id='center17055' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17053' >(pregexp-match <font color="red">"&lt;.*&gt;"</font> <font color="red">"&lt;tag1&gt; &lt;tag2&gt; &lt;tag3&gt;"</font>) => (<font color="red">"&lt;tag1&gt; &lt;tag2&gt; &lt;tag3&gt;"</font>) </pre> </td></tr> </tbody></table></center> To make these quantifiers <em id='emph17056' >non-greedy</em>, append a <code id='code17057' >?</code> to them. 
Non-greedy quantifiers match the minimal number of instances needed to ensure an overall match.<br/><br/><center id='center17064' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17062' >(pregexp-match <font color="red">"&lt;.*?&gt;"</font> <font color="red">"&lt;tag1&gt; &lt;tag2&gt; &lt;tag3&gt;"</font>) => (<font color="red">"&lt;tag1&gt;"</font>) </pre> </td></tr> </tbody></table></center> The non-greedy quantifiers are respectively: <code id='code17065' >*?</code>, <code id='code17066' >+?</code>, <code id='code17067' >??</code>, <code id='code17068' >{m}?</code>, <code id='code17069' >{m,n}?</code>. Note the two uses of the metacharacter <code id='code17070' >?</code>.<br/><br/></div> <!-- Clusters --> <a name="Clusters"></a> <div class="subsection-atitle"><table width="100%"><tr><td bgcolor="#ffffff"><h3><font color="#8381de">13.2.8 Clusters</font> </h3></td></tr></table> </div><div class="subsection"> <a name="Clusters" class="mark"></a> <em id='emph17072' >Clustering</em>, ie, enclosure within parens <code id='code17073' >(</code>...<code id='code17074' >)</code>, identifies the enclosed <em id='emph17075' >subpattern</em> as a single entity. 
It causes the matcher to <em id='emph17076' >capture</em> the <em id='emph17077' >submatch</em>, or the portion of the string matching the subpattern, in addition to the overall match.<br/><br/><center id='center17087' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17085' >(pregexp-match <font color="red">"([a-z]+) ([0-9]+), ([0-9]+)"</font> <font color="red">"jan 1, 1970"</font>) => (<font color="red">"jan 1, 1970"</font> <font color="red">"jan"</font> <font color="red">"1"</font> <font color="red">"1970"</font>) </pre> </td></tr> </tbody></table></center> Clustering also causes a following quantifier to treat the entire enclosed subpattern as an entity.<br/><br/><center id='center17095' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17093' >(pregexp-match <font color="red">"(poo )*"</font> <font color="red">"poo poo platter"</font>) => (<font color="red">"poo poo "</font> <font color="red">"poo "</font>) </pre> </td></tr> </tbody></table></center> The number of submatches returned is always equal to the number of subpatterns specified in the regexp, even if a particular subpattern happens to match more than one substring or no substring at all.<br/><br/><center id='center17103' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17101' >(pregexp-match <font color="red">"([a-z ]+;)*"</font> <font color="red">"lather; rinse; repeat;"</font>) => (<font color="red">"lather; rinse; repeat;"</font> <font color="red">" repeat;"</font>) </pre> </td></tr> </tbody></table></center> Here the <code id='code17104' >*</code>-quantified subpattern matches three times, but it is the last submatch that is returned.<br/><br/>It is also possible for a quantified subpattern to fail to match, even if the overall pattern matches. 
In such cases, the failing submatch is represented by <code id='code17106' >#f</code>.<br/><br/><center id='center17123' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17121' >(<font color="#6959cf"><strong id='bold28323' >define</strong></font> <font color="#6959cf"><strong id='bold28325' >date-re</strong></font>
  ;match `month year' or `month day, year'.
  ;subpattern matches day, if present
  (pregexp <font color="red">"([a-z]+) +([0-9]+,)? *([0-9]+)"</font>))<br/><br/>(pregexp-match date-re <font color="red">"jan 1, 1970"</font>) => (<font color="red">"jan 1, 1970"</font> <font color="red">"jan"</font> <font color="red">"1,"</font> <font color="red">"1970"</font>)<br/><br/>(pregexp-match date-re <font color="red">"jan 1970"</font>) => (<font color="red">"jan 1970"</font> <font color="red">"jan"</font> #f <font color="red">"1970"</font>) </pre> </td></tr> </tbody></table></center> </div> <!-- Backreferences --> <a name="Backreferences"></a> <div class="subsection-atitle"><table width="100%"><tr><td bgcolor="#ffffff"><h3><font color="#8381de">13.2.9 Backreferences</font> </h3></td></tr></table> </div><div class="subsection"> Submatches can be used in the insert string argument of the procedures <code id='code17124' >pregexp-replace</code> and <code id='code17125' >pregexp-replace*</code>. The insert string can use <code id='code17126' >\n</code> as a <em id='emph17127' >backreference</em> to refer back to the <em id='emph17128' >n</em>th submatch, ie, the substring that matched the <em id='emph17129' >n</em>th subpattern. 
<code id='code17130' >\0</code> refers to the entire match, and it can also be specified as <code id='code17131' >\&</code>.<br/><br/><center id='center17150' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17148' >(pregexp-replace <font color="red">"_(.+?)_"</font> <font color="red">"the _nina_, the _pinta_, and the _santa maria_"</font> <font color="red">"*\\1*"</font>) => <font color="red">"the *nina*, the _pinta_, and the _santa maria_"</font><br/><br/>(pregexp-replace* <font color="red">"_(.+?)_"</font> <font color="red">"the _nina_, the _pinta_, and the _santa maria_"</font> <font color="red">"*\\1*"</font>) => <font color="red">"the *nina*, the *pinta*, and the *santa maria*"</font><br/><br/>;recall: \S stands for non-whitespace character<br/><br/>(pregexp-replace <font color="red">"(\\S+) (\\S+) (\\S+)"</font> <font color="red">"eat to live"</font> <font color="red">"\\3 \\2 \\1"</font>) => <font color="red">"live to eat"</font> </pre> </td></tr> </tbody></table></center> Use <code id='code17151' >\\</code> in the insert string to specify a literal backslash. Also, <code id='code17152' >\$</code> stands for an empty string, and is useful for separating a backreference <code id='code17153' >\n</code> from an immediately following number.<br/><br/>Backreferences can also be used within the regexp pattern to refer back to an already matched subpattern in the pattern. 
<code id='code17155' >\n</code> stands for an exact repeat of the <em id='emph17156' >n</em>th submatch.<a href="#footnote-footnote17158"><sup><small>4</small></sup></a> <br/><br/><center id='center17166' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17164' >(pregexp-match <font color="red">"([a-z]+) and \\1"</font> <font color="red">"billions and billions"</font>) => (<font color="red">"billions and billions"</font> <font color="red">"billions"</font>) </pre> </td></tr> </tbody></table></center> Note that the backreference is not simply a repeat of the previous subpattern. Rather it is a repeat of <em id='emph17167' >the particular substring already matched by the subpattern</em>. <br/><br/>In the above example, the backreference can only match <code id='code17169' >billions</code>. It will not match <code id='code17170' >millions</code>, even though the subpattern it harks back to --- <code id='code17171' >([a-z]+)</code> --- would have had no problem doing so: <br/><br/><center id='center17177' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17175' >(pregexp-match <font color="red">"([a-z]+) and \\1"</font> <font color="red">"billions and millions"</font>) => #f </pre> </td></tr> </tbody></table></center> The following corrects doubled words:<br/><br/><center id='center17185' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17183' >(pregexp-replace* <font color="red">"(\\S+) \\1"</font> <font color="red">"now is the the time for all good men to to come to the aid of of the party"</font> <font color="red">"\\1"</font>) => <font color="red">"now is the time for all good men to come to the aid of the party"</font> </pre> </td></tr> </tbody></table></center> The following marks all immediately repeating patterns in a number string:<br/><br/><center id='center17191' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17189' >(pregexp-replace* <font color="red">"(\\d+)\\1"</font> <font color="red">"123340983242432420980980234"</font> "{\\1,\\1}") => "12{3,3}40983{24,24}3242{098,098}0234" </pre> </td></tr> </tbody></table></center> <br/><br/></div> <!-- Non-capturing clusters --> <a name="Non-capturing-clusters"></a> <div class="subsection-atitle"><table width="100%"><tr><td bgcolor="#ffffff"><h3><font color="#8381de">13.2.10 Non-capturing clusters</font> </h3></td></tr></table> </div><div class="subsection"> It is often required to specify a cluster (typically for quantification) but without triggering the capture of submatch information. Such clusters are called <em id='emph17193' >non-capturing</em>. In such cases, use <code id='code17194' >(?:</code> instead of <code id='code17195' >(</code> as the cluster opener. In the following example, the non-capturing cluster eliminates the ``directory'' portion of a given pathname, and the capturing cluster identifies the basename.<br/><br/><center id='center17203' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17201' >(pregexp-match <font color="red">"^(?:[a-z]*/)*([a-z]+)$"</font> <font color="red">"/usr/local/bin/mzscheme"</font>) => (<font color="red">"/usr/local/bin/mzscheme"</font> <font color="red">"mzscheme"</font>) </pre> </td></tr> </tbody></table></center> </div> <!-- Cloisters --> <a name="Cloisters"></a> <div class="subsection-atitle"><table width="100%"><tr><td bgcolor="#ffffff"><h3><font color="#8381de">13.2.11 Cloisters</font> </h3></td></tr></table> </div><div class="subsection"> The location between the <code id='code17204' >?</code> and the <code id='code17205' >:</code> of a non-capturing cluster is called a <em id='emph17206' >cloister</em>.<a 
href="#footnote-footnote17207"><sup><small>5</small></sup></a> You can put <em id='emph17208' >modifiers</em> there that will cause the enclustered subpattern to be treated specially. The modifier <code id='code17209' >i</code> causes the subpattern to match <em id='emph17210' >case-insensitively</em>:<br/><br/><center id='center17217' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17215' >(pregexp-match <font color="red">"(?i:hearth)"</font> <font color="red">"HeartH"</font>) => (<font color="red">"HeartH"</font>) </pre> </td></tr> </tbody></table></center> The modifier <code id='code17218' >x</code> causes the subpattern to match <em id='emph17219' >space-insensitively</em>, ie, spaces and comments within the subpattern are ignored. Comments are introduced as usual with a semicolon (<code id='code17220' >;</code>) and extend till the end of the line. If you need to include a literal space or semicolon in a space-insensitized subpattern, escape it with a backslash.<br/><br/><center id='center17234' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17232' >(pregexp-match <font color="red">"(?x: a lot)"</font> <font color="red">"alot"</font>) => (<font color="red">"alot"</font>)<br/><br/>(pregexp-match <font color="red">"(?x: a \\ lot)"</font> <font color="red">"a lot"</font>) => (<font color="red">"a lot"</font>)<br/><br/>(pregexp-match "(?x:
   a \\ man \\; \\   ; ignore
   a \\ plan \\; \\  ; me
   a \\ canal        ; completely
   )"
 <font color="red">"a man; a plan; a canal"</font>) => (<font color="red">"a man; a plan; a canal"</font>) </pre> </td></tr> </tbody></table></center> The global variable <code id='code17235' >*pregexp-comment-char*</code> contains the comment character (<code id='code17236' >#\;</code>). 
For Perl-like comments, <br/><br/><center id='center17241' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17239' >(<strong id='bold28376' >set!</strong> *pregexp-comment-char* #\#) </pre> </td></tr> </tbody></table></center> You can put more than one modifier in the cloister.<br/><br/><center id='center17247' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17245' >(pregexp-match "(?ix:
   a \\ man \\; \\   ; ignore
   a \\ plan \\; \\  ; me
   a \\ canal        ; completely
   )"
 <font color="red">"A Man; a Plan; a Canal"</font>) => (<font color="red">"A Man; a Plan; a Canal"</font>) </pre> </td></tr> </tbody></table></center> A minus sign before a modifier inverts its meaning. Thus, you can use <code id='code17248' >-i</code> and <code id='code17249' >-x</code> in a <em id='emph17250' >subcluster</em> to overturn the insensitivities caused by an enclosing cluster.<br/><br/><center id='center17257' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17255' >(pregexp-match <font color="red">"(?i:the (?-i:TeX)book)"</font> <font color="red">"The TeXbook"</font>) => (<font color="red">"The TeXbook"</font>) </pre> </td></tr> </tbody></table></center> This regexp will allow any casing for <code id='code17258' >the</code> and <code id='code17259' >book</code> but insists that <code id='code17260' >TeX</code> not be differently cased.<br/><br/></div> <!-- Alternation --> <a name="Alternation"></a> <div class="subsection-atitle"><table width="100%"><tr><td bgcolor="#ffffff"><h3><font color="#8381de">13.2.12 Alternation</font> </h3></td></tr></table> </div><div class="subsection"> <a name="Alternation" class="mark"></a> You can specify a list of <em id='emph17262' >alternate</em> subpatterns by separating them by <code id='code17263' >|</code>. 
The <code id='code17264' >|</code> separates subpatterns in the nearest enclosing cluster (or in the entire pattern string if there are no enclosing parens). <br/><br/><center id='center17275' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17273' >(pregexp-match <font color="red">"f(ee|i|o|um)"</font> <font color="red">"a small, final fee"</font>) => (<font color="red">"fi"</font> <font color="red">"i"</font>)<br/><br/>(pregexp-replace* <font color="red">"([yi])s(e[sdr]?|ing|ation)"</font> "it is energising to analyse an organisation pulsing with noisy organisms" <font color="red">"\\1z\\2"</font>) => "it is energizing to analyze an organization pulsing with noisy organisms" </pre> </td></tr> </tbody></table></center> Note again that if you wish to use clustering merely to specify a list of alternate subpatterns but do not want the submatch, use <code id='code17276' >(?:</code> instead of <code id='code17277' >(</code>. <br/><br/><center id='center17284' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17282' >(pregexp-match <font color="red">"f(?:ee|i|o|um)"</font> <font color="red">"fun for all"</font>) => (<font color="red">"fo"</font>) </pre> </td></tr> </tbody></table></center> An important thing to note about alternation is that the leftmost matching alternate is picked regardless of its length. 
Thus, if one of the alternates is a prefix of a later alternate, the latter may not have a chance to match.<br/><br/><center id='center17291' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17289' >(pregexp-match <font color="red">"call|call-with-current-continuation"</font> <font color="red">"call-with-current-continuation"</font>) => (<font color="red">"call"</font>) </pre> </td></tr> </tbody></table></center> To allow the longer alternate to have a shot at matching, place it before the shorter one:<br/><br/><center id='center17298' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17296' >(pregexp-match <font color="red">"call-with-current-continuation|call"</font> <font color="red">"call-with-current-continuation"</font>) => (<font color="red">"call-with-current-continuation"</font>) </pre> </td></tr> </tbody></table></center> In any case, an overall match for the entire regexp is always preferred to an overall nonmatch. 
In the following, the longer alternate still wins, because its preferred shorter prefix fails to yield an overall match.<br/><br/><center id='center17305' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17303' >(pregexp-match <font color="red">"(?:call|call-with-current-continuation) constrained"</font> <font color="red">"call-with-current-continuation constrained"</font>) => (<font color="red">"call-with-current-continuation constrained"</font>) </pre> </td></tr> </tbody></table></center> </div> <!-- Backtracking --> <a name="Backtracking"></a> <div class="subsection-atitle"><table width="100%"><tr><td bgcolor="#ffffff"><h3><font color="#8381de">13.2.13 Backtracking</font> </h3></td></tr></table> </div><div class="subsection"> <a name="Backtracking" class="mark"></a> We've already seen that greedy quantifiers match the maximal number of times, but the overriding priority is that the overall match succeed. Consider<br/><br/><center id='center17311' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17309' >(pregexp-match <font color="red">"a*a"</font> <font color="red">"aaaa"</font>) </pre> </td></tr> </tbody></table></center> The regexp consists of two subregexps, <code id='code17312' >a*</code> followed by <code id='code17313' >a</code>. The subregexp <code id='code17314' >a*</code> cannot be allowed to match all four <code id='code17315' >a</code>'s in the text string <code id='code17316' >"aaaa"</code>, even though <code id='code17317' >*</code> is a greedy quantifier. It may match only the first three, leaving the last one for the second subregexp. This ensures that the full regexp matches successfully.<br/><br/>The regexp matcher accomplishes this via a process called <em id='emph17319' >backtracking</em>. 
The matcher tentatively allows the greedy quantifier to match all four <code id='code17320' >a</code>'s, but then when it becomes clear that the overall match is in jeopardy, it <em id='emph17321' >backtracks</em> to a less greedy match of <em id='emph17322' >three</em> <code id='code17323' >a</code>'s. If even this fails, as in the call<br/><br/><center id='center17329' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17327' >(pregexp-match <font color="red">"a*aa"</font> <font color="red">"aaaa"</font>) </pre> </td></tr> </tbody></table></center> the matcher backtracks even further. Overall failure is conceded only when all possible backtracking has been tried with no success. <br/><br/>Backtracking is not restricted to greedy quantifiers. Nongreedy quantifiers match as few instances as possible, and progressively backtrack to more and more instances in order to attain an overall match. There is backtracking in alternation too, as the more rightward alternates are tried when locally successful leftward ones fail to yield an overall match.<br/><br/></div> <!-- Disabling backtracking --> <a name="Disabling-backtracking"></a> <div class="subsection-atitle"><table width="100%"><tr><td bgcolor="#ffffff"><h3><font color="#8381de">13.2.14 Disabling backtracking</font> </h3></td></tr></table> </div><div class="subsection"> Sometimes it is efficient to disable backtracking. For example, we may wish to <em id='emph17332' >commit</em> to a choice, or we know that trying alternatives is fruitless. 
A nonbacktracking regexp is enclosed in <code id='code17333' >(?></code>...<code id='code17334' >)</code>.<br/><br/><center id='center17340' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17338' >(pregexp-match <font color="red">"(?>a+)."</font> <font color="red">"aaaa"</font>) => #f </pre> </td></tr> </tbody></table></center> In this call, the subregexp <code id='code17341' >(?>a+)</code> greedily matches all four <code id='code17342' >a</code>'s, and is denied the opportunity to backpedal. So the overall match fails. The effect of the regexp is therefore to match one or more <code id='code17343' >a</code>'s followed by something that is definitely non-<code id='code17344' >a</code>.<br/><br/></div> <!-- Looking ahead and behind --> <a name="Looking-ahead-and-behind"></a> <div class="subsection-atitle"><table width="100%"><tr><td bgcolor="#ffffff"><h3><font color="#8381de">13.2.15 Looking ahead and behind</font> </h3></td></tr></table> </div><div class="subsection"> <a name="Looking-ahead-and-behind" class="mark"></a> You can have assertions in your pattern that look <em id='emph17346' >ahead</em> or <em id='emph17347' >behind</em> to ensure that a subpattern does or does not occur. These ``look around'' assertions are specified by putting the subpattern checked for in a cluster whose leading characters are: <code id='code17348' >?=</code> (for positive lookahead), <code id='code17349' >?!</code> (negative lookahead), <code id='code17350' >?&lt;=</code> (positive lookbehind), <code id='code17351' >?&lt;!</code> (negative lookbehind). Note that the subpattern in the assertion does not generate a match in the final result. 
It merely allows or disallows the rest of the match.<br/><br/></div> <!-- Lookahead --> <a name="Lookahead"></a> <div class="subsection-atitle"><table width="100%"><tr><td bgcolor="#ffffff"><h3><font color="#8381de">13.2.16 Lookahead</font> </h3></td></tr></table> </div><div class="subsection"> Positive lookahead (<code id='code17353' >?=</code>) peeks ahead to ensure that its subpattern <em id='emph17354' >could</em> match. <br/><br/><center id='center17360' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17358' >(pregexp-match-positions <font color="red">"grey(?=hound)"</font> <font color="red">"i left my grey socks at the greyhound"</font>) => ((28 . 32)) </pre> </td></tr> </tbody></table></center> The regexp <code id='code17361' >"grey(?=hound)"</code> matches <code id='code17362' >grey</code>, but <em id='emph17363' >only</em> if it is followed by <code id='code17364' >hound</code>. Thus, the first <code id='code17365' >grey</code> in the text string is not matched. <br/><br/>Negative lookahead (<code id='code17367' >?!</code>) peeks ahead to ensure that its subpattern could not possibly match. <br/><br/><center id='center17373' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17371' >(pregexp-match-positions <font color="red">"grey(?!hound)"</font> <font color="red">"the gray greyhound ate the grey socks"</font>) => ((27 . 31)) </pre> </td></tr> </tbody></table></center> The regexp <code id='code17374' >"grey(?!hound)"</code> matches <code id='code17375' >grey</code>, but only if it is <em id='emph17376' >not</em> followed by <code id='code17377' >hound</code>. 
Thus the <code id='code17378' >grey</code> just before <code id='code17379' >socks</code> is matched.<br/><br/></div> <!-- Lookbehind --> <a name="Lookbehind"></a> <div class="subsection-atitle"><table width="100%"><tr><td bgcolor="#ffffff"><h3><font color="#8381de">13.2.17 Lookbehind</font> </h3></td></tr></table> </div><div class="subsection"> Positive lookbehind (<code id='code17381' >?&lt;=</code>) checks that its subpattern <em id='emph17382' >could</em> match immediately to the left of the current position in the text string. <br/><br/><center id='center17388' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17386' >(pregexp-match-positions <font color="red">"(?&lt;=grey)hound"</font> <font color="red">"the hound in the picture is not a greyhound"</font>) => ((38 . 43)) </pre> </td></tr> </tbody></table></center> The regexp <code id='code17389' >(?&lt;=grey)hound</code> matches <code id='code17390' >hound</code>, but only if it is preceded by <code id='code17391' >grey</code>. <br/><br/>Negative lookbehind (<code id='code17393' >?&lt;!</code>) checks that its subpattern could not possibly match immediately to the left. <br/><br/><center id='center17399' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17397' >(pregexp-match-positions <font color="red">"(?&lt;!grey)hound"</font> <font color="red">"the greyhound in the picture is not a hound"</font>) => ((38 . 43)) </pre> </td></tr> </tbody></table></center> The regexp <code id='code17400' >(?&lt;!grey)hound</code> matches <code id='code17401' >hound</code>, but only if it is <em id='emph17402' >not</em> preceded by <code id='code17403' >grey</code>.<br/><br/>Lookaheads and lookbehinds can be convenient when they are not confusing. 
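<br/><br/>As a small sketch, a lookbehind and a lookahead can be combined in a single pattern. The name <code>lone-greyhound-re</code>, the pattern, and the text string below are made up for illustration; the example assumes only the <code>pregexp-match</code> and <code>pregexp-match-positions</code> procedures used above. The pattern matches <code>hound</code> only if it is preceded by <code>grey</code> and not followed by <code>s</code>.<br/><br/><center><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog">;; Illustrative pattern (hypothetical name): "hound" preceded by "grey"
;; and not followed by "s". Neither assertion consumes any text.
(define lone-greyhound-re "(?&lt;=grey)hound(?!s)")

;; The first "hound" (in "greyhounds") is rejected by the negative
;; lookahead; the second one, at the end of the string, is accepted.
(pregexp-match-positions lone-greyhound-re "greyhounds and a greyhound")
;; => ((21 . 26))

(pregexp-match lone-greyhound-re "greyhounds and a greyhound")
;; => ("hound")
</pre> </td></tr> </tbody></table></center> Neither <code>grey</code> nor the checked-for <code>s</code> appears in the reported match, since the assertions only allow or disallow the match around them.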
<br/><br/></div> </div><br> <!-- An Extended Example --> <a name="An-Extended-Example"></a> <div class="section-atitle"><table width="100%"><tr><td bgcolor="#dedeff"><h3><font color="black">13.3 An Extended Example</font> </h3></td></tr></table> </div><div class="section"> <a name="An-Extended-Example" class="mark"></a> Here's an extended example from Friedl that covers many of the features described above. The problem is to fashion a regexp that will match any and only IP addresses or <em id='emph17406' >dotted quads</em>, i.e., four numbers separated by three dots, with each number between 0 and 255. We will use the commenting mechanism to build the final regexp with clarity. First, a subregexp <code id='code17407' >n0-255</code> that matches 0 through 255.<br/><br/><center id='center17412' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17410' >(<font color="#6959cf"><strong id='bold28414' >define</strong></font> <font color="#6959cf"><strong id='bold28416' >n0-255</strong></font>
  "(?x:
  \\d          ;  0 through   9
  | \\d\\d     ; 00 through  99
  | [01]\\d\\d ;000 through 199
  | 2[0-4]\\d  ;200 through 249
  | 25[0-5]    ;250 through 255
  )")
</pre> </td></tr> </tbody></table></center> The first two alternates simply get all single- and double-digit numbers. Since 0-padding is allowed, we need to match both 1 and 01. We need to be careful when getting 3-digit numbers, since numbers above 255 must be excluded. 
So we fashion alternates to get 000 through 199, then 200 through 249, and finally 250 through 255.<a href="#footnote-footnote17414"><sup><small>6</small></sup></a><br/><br/>An IP-address is a string that consists of four <code id='code17416' >n0-255</code>s with three dots separating them.<br/><br/><center id='center17426' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17424' >(<font color="#6959cf"><strong id='bold28418' >define</strong></font> <font color="#6959cf"><strong id='bold28420' >ip-re1</strong></font>
  (string-append
    <font color="red">"^"</font>        ;nothing before
    n0-255     ;the first n0-255,
    <font color="red">"(?x:"</font>     ;then the subpattern of
    <font color="red">"\\."</font>      ;a dot followed by
    n0-255     ;an n0-255,
    <font color="red">")"</font>        ;which is
    "{3}"      ;repeated exactly 3 times
    <font color="red">"$"</font>        ;with nothing following
    ))
</pre> </td></tr> </tbody></table></center> Let's try it out.<br/><br/><center id='center17433' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17431' >(pregexp-match ip-re1 <font color="red">"1.2.3.4"</font>) => (<font color="red">"1.2.3.4"</font>)
(pregexp-match ip-re1 <font color="red">"55.155.255.265"</font>) => #f
</pre> </td></tr> </tbody></table></center> which is fine, except that we also have<br/><br/><center id='center17439' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17437' >(pregexp-match ip-re1 <font color="red">"0.00.000.00"</font>) => (<font color="red">"0.00.000.00"</font>) </pre> </td></tr> </tbody></table></center> All-zero sequences are not valid IP addresses! Lookahead to the rescue. Before starting to match <code id='code17440' >ip-re1</code>, we look ahead to ensure we don't have all zeros. 
We could use positive lookahead to ensure there <em id='emph17441' >is</em> a digit other than zero.<br/><br/><center id='center17447' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17445' >(<font color="#6959cf"><strong id='bold28432' >define</strong></font> <font color="#6959cf"><strong id='bold28434' >ip-re</strong></font>
  (string-append
    <font color="red">"(?=.*[1-9])"</font> ;ensure there's a non-0 digit
    ip-re1))
</pre> </td></tr> </tbody></table></center> Or we could use negative lookahead to ensure that what's ahead isn't composed of <em id='emph17448' >only</em> zeros and dots.<br/><br/><center id='center17454' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17452' >(<font color="#6959cf"><strong id='bold28437' >define</strong></font> <font color="#6959cf"><strong id='bold28439' >ip-re</strong></font>
  (string-append
    <font color="red">"(?![0.]*$)"</font> ;not just zeros and dots
                 ;(note: dot is not metachar inside [])
    ip-re1))
</pre> </td></tr> </tbody></table></center> The regexp <code id='code17455' >ip-re</code> will match all and only valid IP addresses.<br/><br/><center id='center17462' ><table cellspacing="0" class="color" cellpadding="0" width="95%"><tbody> <tr><td bgcolor="#ffffcc"><pre class="prog" id='prog17460' >(pregexp-match ip-re <font color="red">"1.2.3.4"</font>) => (<font color="red">"1.2.3.4"</font>)
(pregexp-match ip-re <font color="red">"0.0.0.0"</font>) => #f
</pre> </td></tr> </tbody></table></center> <br/><br/><br/><br/> </div><br> <div class="footnote"><br><br> <hr width='20%' size='2' align='left'> <a name="footnote-footnote16536"><sup><small>1</small></sup></a>: The double backslash is an artifact of Scheme strings, not the regexp pattern itself. When we want a literal backslash inside a Scheme string, we must escape it so that it shows up in the string at all. 
Scheme strings use backslash as the escape character, so we end up with two backslashes --- one Scheme-string backslash to escape the regexp backslash, which then escapes the dot. Another character that would need escaping inside a Scheme string is <code id='code16535' >"</code>. <br> <a name="footnote-footnote16860"><sup><small>2</small></sup></a>: Requiring a bracketed character class to be non-empty is not a limitation, since an empty character class can be more easily represented by an empty string. <br> <a name="footnote-footnote16910"><sup><small>3</small></sup></a>: Following regexp custom, we identify ``word'' characters as <code id='code16909' >[A-Za-z0-9_]</code>, although these are too restrictive for what a Schemer might consider a ``word''. <br> <a name="footnote-footnote17158"><sup><small>4</small></sup></a>: <code id='code17157' >\0</code>, which is useful in an insert string, makes no sense within the regexp pattern, because the entire regexp has not yet matched, so there is nothing to refer back to. <br> <a name="footnote-footnote17207"><sup><small>5</small></sup></a>: A useful, if terminally cute, coinage from the abbots of Perl. <br> <a name="footnote-footnote17414"><sup><small>6</small></sup></a>: Note that <code id='code17413' >n0-255</code> lists prefixes as preferred alternates, something we cautioned against in section <a href="bigloo-14.html#Alternation" class="inbound">Alternation</a>. However, since we intend to anchor this subregexp explicitly to force an overall match, the order of the alternates does not matter. <br> <div></div></td> </tr></table><div class="skribe-ending"> <hr> <p class="ending" id='paragraph28450' ><font size="-1"> This <span class="sc">Html</span> page has been produced by <a href="http://www.inria.fr/mimosa/fp/Skribe" class="http">Skribe</a>. <br/> Last update <em id='it28448' >Tue Jun 2 11:43:27 2009</em>.</font></p></div> </body> </html>