<appendix id="regular-expressions"> <title >Regular Expressions</title> <synopsis > This Appendix contains a brief but hopefully sufficient and covering introduction to the world of <emphasis >regular expressions</emphasis >. It documents regular expressions in the form available within &kate;, which is not compatible with the regular expressions of perl, nor with those of for example <command >grep</command >.</synopsis> <sect1> <title >Introduction</title> <para ><emphasis >Regular Expressions</emphasis > provides us with a way to describe some possible contents of a text string in a way understood by a small piece of software, so that it can investigate if a text matches, and also in the case of advanced applications with the means of saving pieces or the matching text.</para> <para >An example: Say you want to search a text for paragraphs that starts with either of the names <quote >Henrik</quote > or <quote >Pernille</quote > followed by some form of the verb <quote >say</quote >.</para> <para >With a normal search, you would start out searching for the first name, <quote >Henrik</quote > maybe followed by <quote >sa</quote > like this: <userinput >Henrik sa</userinput >, and while looking for matches, you would have to discard those not being the beginning of a paragraph, as well as those in which the word starting with the letters <quote >sa</quote > was not either <quote >says</quote >, <quote >said</quote > or so. And then of cause repeat all of that with the next name...</para> <para >With Regular Expressions, that task could be accomplished with a single search, and with a larger degree of preciseness.</para> <para >To achieve this, Regular Expressions defines rules for expressing in details a generalisation of a string to match. Our example, which we might literally express like this: <quote >A line starting with either <quote >Henrik</quote > or <quote >Pernille</quote > (possibly following up to 4 blanks or tab characters) followed by a whitespace followed by <quote >sa</quote > and then either <quote >ys</quote > or <quote >id</quote ></quote > could be expressed with the following regular expression:</para > <para ><userinput >^[ \t]{0,4}(Henrik|Pernille) sa(ys|id)</userinput ></para> <para >The above example demonstrates all four major concepts of modern Regular Expressions, namely:</para> <itemizedlist > <listitem ><para >Patterns</para ></listitem > <listitem ><para >Assertions</para ></listitem > <listitem ><para >Quantifiers</para ></listitem > <listitem ><para >Back references</para ></listitem > </itemizedlist> <para >The caret (<literal >^</literal >) starting the expression is an assertion, being true only if the following matching string is at the start of a line.</para> <para >The stings <literal >[ \t]</literal > and <literal >(Henrik|Pernille) sa(ys|id)</literal > are patterns. The first one is a <emphasis >character class</emphasis > that matches either a blank or a (horizontal) tab character; the other pattern contains first a subpattern matching either <literal >Henrik</literal > <emphasis >or</emphasis > <literal >Pernille</literal >, then a piece matching the exact string <literal > sa</literal > and finally a subpattern matching either <literal >ys</literal > <emphasis >or</emphasis > <literal >id</literal ></para> <para >The string <literal >{0,4}</literal > is a quantifier saying <quote >anywhere from 0 up to 4 of the previous</quote >.</para> <para >Because regular expression software supporting the concept of <emphasis >back references</emphasis > saves the entire matching part of the string as well as sub-patterns enclosed in parentheses, given some means of access to those references, we could get our hands on either the whole match (when searching a text document in an editor with a regular expression, that is often marked as selected) or either the name found, or the last part of the verb.</para> <para >All together, the expression will match where we wanted it to, and only there.</para> <para >The following sections will describe in details how to construct and use patterns, character classes, assertions, quantifiers and back references, and the final section will give a few useful examples.</para> </sect1> <sect1 id="regex-patterns"> <title >Patterns</title> <para >Patterns consists of literal strings and character classes. Patterns may contain sub-patterns, which are patterns enclosed in parentheses.</para> <sect2> <title >Escaping characters</title> <para >In patterns as well as in character classes, some characters have a special meaning. To literally match any of those characters, they must be marked or <emphasis >escaped</emphasis > to let the regular expression software know that it should interpret such characters in their literal meaning.</para> <para >This is done by prepending the character with a backslash (<literal >\</literal >).</para> <para >The regular expression software will silently ignore escaping a character that does not have any special meaning in the context, so escaping for example a <quote >j</quote > (<userinput >\j</userinput >) is safe. If you are in doubt whether a character could have a special meaning, you can therefore escape it safely.</para> <para ></para> </sect2> <sect2> <title >Character Classes and abbreviations</title> <para >A <emphasis >character class</emphasis > is an expression that matches one of a defined set of characters. In Regular Expressions, character classes are defined by putting the legal characters for the class in square brackets, <literal >[]</literal >, or by using one of the abbreviated classes described below.</para> <para >Simple character classes just contains one or more literal characters, for example <userinput >[abc]</userinput > (matching either of the letters <quote >a</quote >, <quote >b</quote > or <quote >c</quote >) or <userinput >[0123456789]</userinput > (matching any digit).</para> <para >Because letters and digits have a logical order, you can abbreviate those by specifying ranges of them: <userinput >[a-c]</userinput > is equal to <userinput >[abc]</userinput > and <userinput >[0-9]</userinput > is equal to <userinput >[0123456789]</userinput >. Combining these constructs, for example <userinput >[a-fynot1-38]</userinput > is completely legal (the last one would match, of cause, either of <quote >a</quote >,<quote >b</quote >,<quote >c</quote >,<quote >d</quote >, <quote >e</quote >,<quote >f</quote >,<quote >y</quote >,<quote >n</quote >,<quote >o</quote >,<quote >t</quote >, <quote >1</quote >,<quote >2</quote >,<quote >3</quote > or <quote >8</quote >).</para> <para >As capital letters are different characters from their non-capital equivalents, to create a caseless character class matching <quote >a</quote > or <quote >b</quote >, in any case, you need to write it <userinput >[aAbB]</userinput >.</para> <para >It is of cause possible to create a <quote >negative</quote > class matching as <quote >anything but</quote > To do so put a caret (<literal >^</literal >) at the beginning of the class: </para> <para ><userinput >[^abc]</userinput > will match any character <emphasis >but</emphasis > <quote >a</quote >, <quote >b</quote > or <quote >c</quote >.</para> <para >In addition to literal characters, some abbreviations are defined, making life still a bit easier: <variablelist > <varlistentry > <term ><userinput >\a</userinput ></term > <listitem ><para > This matches the <acronym >ASCII</acronym > bell character (BEL, 0x07).</para ></listitem > </varlistentry > <varlistentry > <term ><userinput >\f</userinput ></term > <listitem ><para > This matches the <acronym >ASCII</acronym > form feed character (FF, 0x0C).</para ></listitem > </varlistentry > <varlistentry > <term ><userinput >\n</userinput ></term > <listitem ><para > This matches the <acronym >ASCII</acronym > line feed character (LF, 0x0A, Unix newline).</para ></listitem > </varlistentry > <varlistentry > <term ><userinput >\r</userinput ></term > <listitem ><para > This matches the <acronym >ASCII</acronym > carriage return character (CR, 0x0D).</para ></listitem > </varlistentry > <varlistentry > <term ><userinput >\t</userinput ></term > <listitem ><para > This matches the <acronym >ASCII</acronym > horizontal tab character (HT, 0x09).</para ></listitem > </varlistentry > <varlistentry > <term ><userinput >\v</userinput ></term > <listitem ><para > This matches the <acronym >ASCII</acronym > vertical tab character (VT, 0x0B).</para ></listitem > </varlistentry > <varlistentry > <term ><userinput >\xhhhh</userinput ></term > <listitem ><para > This matches the Unicode character corresponding to the hexadecimal number hhhh (between 0x0000 and 0xFFFF). \0ooo (&ie;, \zero ooo) matches the <acronym >ASCII</acronym >/Latin-1 character corresponding to the octal number ooo (between 0 and 0377).</para ></listitem > </varlistentry > <varlistentry > <term ><userinput >.</userinput > (dot)</term > <listitem ><para > This matches any character (including newline).</para ></listitem > </varlistentry > <varlistentry > <term ><userinput >\d</userinput ></term > <listitem ><para > This matches a digit. Equal to <literal >[0-9]</literal ></para ></listitem > </varlistentry > <varlistentry > <term ><userinput >\D</userinput ></term > <listitem ><para > This matches a non-digit. Equal to <literal >[^0-9]</literal > or <literal >[^\d]</literal ></para ></listitem > </varlistentry > <varlistentry > <term ><userinput >\s</userinput ></term > <listitem ><para > This matches a whitespace character. Practically equal to <literal >[ \t\n\r]</literal ></para ></listitem > </varlistentry > <varlistentry > <term ><userinput >\S</userinput ></term > <listitem ><para > This matches a non-whitespace. Practically equal to <literal >[^ \t\r\n]</literal >, and equal to <literal >[^\s]</literal ></para ></listitem > </varlistentry > <varlistentry > <term ><userinput >\w</userinput ></term > <listitem ><para >Matches any <quote >word character</quote > - in this case any letter or digit. Note that underscore (<literal >_</literal >) is not matched, as is the case with perl regular expressions. Equal to <literal >[a-zA-Z0-9]</literal ></para ></listitem > </varlistentry > <varlistentry > <term ><userinput >\W</userinput ></term > <listitem ><para >Matches any non-word character - anything but letters or numbers. Equal to <literal >[^a-zA-Z0-9]</literal > or <literal >[^\w]</literal ></para ></listitem > </varlistentry > </variablelist > </para> <para >The abbreviated classes can be put inside a custom class, for example to match a word character, a blank or a dot, you could write <userinput >[\w \.]</userinput ></para > <note > <para >The POSIX notation of classes, <userinput >[:<class name>:]</userinput > is currently not supported.</para > </note> <sect3> <title >Characters with special meanings inside character classes</title> <para >The following characters has a special meaning inside the <quote >[]</quote > character class construct, and must be escaped to be literally included in a class:</para> <variablelist > <varlistentry > <term ><userinput >]</userinput ></term > <listitem ><para >Ends the character class. Must be escaped unless it is the very first character in the class (may follow an unescaped caret)</para ></listitem > </varlistentry > <varlistentry > <term ><userinput >^</userinput > (caret)</term > <listitem ><para >Denotes a negative class, if it is the first character. Must be escaped to match literally if it is the first character in the class.</para ></listitem > </varlistentry > <varlistentry > <term ><userinput >-</userinput > (dash)</term > <listitem ><para >Denotes a logical range. Must always be escaped within a character class.</para ></listitem > </varlistentry > <varlistentry > <term ><userinput >\</userinput > (backslash)</term > <listitem ><para >The escape character. Must always be escaped.</para ></listitem > </varlistentry > </variablelist> </sect3> </sect2> <sect2> <title >Alternatives: matching <quote >one of</quote ></title> <para >If you want to match one of a set of alternative patterns, you can separate those with <literal >|</literal > (vertical bar character).</para> <para >For example to find either <quote >John</quote > or <quote >Harry</quote > you would use an expression <userinput >John|Harry</userinput >.</para> </sect2> <sect2> <title >Sub Patterns</title> <para ><emphasis >Sub patterns</emphasis > are patterns enclosed in parentheses, and they have several uses in the world of regular expressions.</para> <sect3> <title >Specifying alternatives</title> <para >You may use a sub pattern to group a set of alternatives within a larger pattern. The alternatives are separated by the character <quote >|</quote > (vertical bar).</para> <para >For example to match either of the words <quote >int</quote >, <quote >float</quote > or <quote >double</quote >, you could use the pattern <userinput >int|float|double</userinput >. If you only want to find one if it is followed by some whitespace and then some letters, put the alternatives inside a subpattern: <userinput >(int|float|double)\s+\w+</userinput >.</para> </sect3> <sect3> <title >Capturing matching text (back references)</title> <para >If you want to use a back reference, use a sub pattern to have the desired part of the pattern remembered.</para> <para >For example, it you want to find two occurrences of the same word separated by a comma and possibly some whitespace, you could write <userinput >(\w+),\s*\1</userinput >. The sub pattern <literal >\w+</literal > would find a chunk of word characters, and the entire expression would match if those were followed by a comma, 0 or more whitespace and then an equal chunk of word characters. (The string <literal >\1</literal > references <emphasis >the first sub pattern enclosed in parentheses</emphasis >)</para> <para >See also <link linkend="backreferences" >Back references</link >.</para> </sect3> <sect3 id="lookahead-assertions"> <title >Lookahead Assertions</title> <para >A lookahead assertion is a sub pattern, starting with either <literal >?=</literal > or <literal >?!</literal >.</para> <para >For example to match the literal string <quote >Bill</quote > but only if not followed by <quote > Gates</quote >, you could use this expression: <userinput >Bill(?! Gates)</userinput >. (This would find <quote >Bill Clinton</quote > as well as <quote >Billy the kid</quote >, but silently ignore the other matches.)</para> <para >Sub patterns used for assertions are not captured.</para> <para >See also <link linkend="assertions" >Assertions</link ></para> </sect3> </sect2> <sect2 id="special-characters-in-patterns"> <title >Characters with a special meaning inside patterns</title> <para >The following characters have meaning inside a pattern, and must be escaped if you want to literally match them: <variablelist > <varlistentry > <term ><userinput >\</userinput > (backslash)</term > <listitem ><para >The escape character.</para ></listitem > </varlistentry > <varlistentry > <term ><userinput >^</userinput > (caret)</term > <listitem ><para >Asserts the beginning of the string.</para ></listitem > </varlistentry > <varlistentry > <term ><userinput >$</userinput ></term > <listitem ><para >Asserts the end of string.</para ></listitem > </varlistentry > <varlistentry > <term ><userinput >()</userinput > (left and right parentheses)</term > <listitem ><para >Denotes sub patterns.</para ></listitem > </varlistentry > <varlistentry > <term ><userinput >{}</userinput > (left and right curly braces)</term > <listitem ><para >Denotes numeric quantifiers.</para ></listitem > </varlistentry > <varlistentry > <term ><userinput >[]</userinput > (left and right square brackets)</term > <listitem ><para >Denotes character classes.</para ></listitem > </varlistentry > <varlistentry > <term ><userinput >|</userinput > (vertical bar)</term > <listitem ><para >logical OR. Separates alternatives.</para ></listitem > </varlistentry > <varlistentry > <term ><userinput >+</userinput > (plus sign)</term > <listitem ><para >Quantifier, 1 or more.</para ></listitem > </varlistentry > <varlistentry > <term ><userinput >*</userinput > (asterisk)</term > <listitem ><para >Quantifier, 0 or more.</para ></listitem > </varlistentry > <varlistentry > <term ><userinput >?</userinput > (question mark)</term > <listitem ><para >An optional character. Can be interpreted as a quantifier, 0 or 1.</para ></listitem > </varlistentry > </variablelist > </para> </sect2> </sect1> <sect1 id="quantifiers"> <title >Quantifiers</title> <para ><emphasis >Quantifiers</emphasis > allows a regular expression to match a specified number or range of numbers of either a character, character class or sub pattern.</para> <para >Quantifiers are enclosed in curly brackets (<literal >{</literal > and <literal >}</literal >) and have the general form <literal >{[minimum-occurrences][,[maximum-occurrences]]}</literal > </para> <para >The usage is best explained by example: <variablelist > <varlistentry > <term ><userinput >{1}</userinput ></term > <listitem ><para >Exactly 1 occurrence</para ></listitem > </varlistentry > <varlistentry > <term ><userinput >{0,1}</userinput ></term > <listitem ><para >Zero or 1 occurrences</para ></listitem > </varlistentry > <varlistentry > <term ><userinput >{,1}</userinput ></term > <listitem ><para >The same, with less work;)</para ></listitem > </varlistentry > <varlistentry > <term ><userinput >{5,10}</userinput ></term > <listitem ><para >At least 5 but maximum 10 occurrences.</para ></listitem > </varlistentry > <varlistentry > <term ><userinput >{5,}</userinput ></term > <listitem ><para >At least 5 occurrences, no maximum.</para ></listitem > </varlistentry > </variablelist > </para> <para >Additionally, there are some abbreviations: <variablelist > <varlistentry > <term ><userinput >*</userinput > (asterisk)</term > <listitem ><para >similar to <literal >{0,}</literal >, find any number of occurrences.</para ></listitem > </varlistentry > <varlistentry > <term ><userinput >+</userinput > (plus sign)</term > <listitem ><para >similar to <literal >{1,}</literal >, at least 1 occurrence.</para ></listitem > </varlistentry > <varlistentry > <term ><userinput >?</userinput > (question mark)</term > <listitem ><para >similar to <literal >{0,1}</literal >, zero or 1 occurrence.</para ></listitem > </varlistentry > </variablelist > </para> <sect2> <title >Greed</title> <para >When using quantifiers with no maximum, regular expressions defaults to match as much of the searched string as possible, commonly known as <emphasis >greedy</emphasis > behaviour.</para> <para >Modern regular expression software provides the means of <quote >turning off greediness</quote >, though in a graphical environment it is up to the interface to provide you with access to this feature. For example a search dialogue providing a regular expression search could have a check box labelled <quote >Minimal matching</quote > as well as it ought to indicate if greediness is the default behaviour.</para> </sect2> <sect2> <title >In context examples</title> <para >Here are a few examples of using quantifiers</para> <variablelist > <varlistentry > <term ><userinput >^\d{4,5}\s</userinput ></term > <listitem ><para >Matches the digits in <quote >1234 go</quote > and <quote >12345 now</quote >, but neither in <quote >567 eleven</quote > nor in <quote >223459 somewhere</quote ></para ></listitem > </varlistentry > <varlistentry > <term ><userinput >\s+</userinput ></term > <listitem ><para >Matches one or more whitespace characters</para ></listitem > </varlistentry > <varlistentry > <term ><userinput >(bla){1,}</userinput ></term > <listitem ><para >Matches all of <quote >blablabla</quote > and the <quote >bla</quote > in <quote >blackbird</quote > or <quote >tabla</quote ></para ></listitem > </varlistentry > <varlistentry > <term ><userinput >/?></userinput ></term > <listitem ><para >Matches <quote >/></quote > in <quote ><closeditem/></quote > as well as <quote >></quote > in <quote ><openitem></quote >.</para ></listitem > </varlistentry > </variablelist> </sect2> </sect1> <sect1 id="assertions"> <title >Assertions</title> <para ><emphasis >Assertions</emphasis > allows a regular expression to match only under certain controlled conditions.</para> <para >An assertion does not need a character to match, it rather investigates the surroundings of a possible match before acknowledging it. For example the <emphasis >word boundary</emphasis > assertion does not try to find a non word character opposite a word one at its position, instead it makes sure that there is not a word character. This means that the assertion can match where there is no character, &ie; at the ends of a searched string.</para> <para >Some assertions actually does have a pattern to match, but the part of the string matching that will not be a part of the result of the match of the full expression.</para> <para >Regular Expressions as documented here supports the following assertions: <variablelist > <varlistentry > <term ><userinput >^</userinput > (caret: beginning of string)</term > <listitem ><para >Matches the beginning of the searched string.</para > <para >The expression <userinput >^Peter</userinput > will match at <quote >Peter</quote > in the string <quote >Peter, hey!</quote > but not in <quote >Hey, Peter!</quote > </para > </listitem > </varlistentry > <varlistentry > <term ><userinput >$</userinput > (end of string)</term > <listitem ><para >Matches the end of the searched string.</para > <para >The expression <userinput >you\?$</userinput > will match at the last you in the string <quote >You didn't do that, did you?</quote > but nowhere in <quote >You didn't do that, right?</quote ></para > </listitem > </varlistentry > <varlistentry > <term ><userinput >\b</userinput > (word boundary)</term > <listitem ><para >Matches if there is a word character at one side and not a word character at the other.</para > <para >This is useful to find word ends, for example both ends to find a whole word. The expression <userinput >\bin\b</userinput > will match at the separate <quote >in</quote > in the string <quote >He came in through the window</quote >, but not at the <quote >in</quote > in <quote >window</quote >.</para ></listitem > </varlistentry > <varlistentry > <term ><userinput >\B</userinput > (non word boundary)</term > <listitem ><para >Matches wherever <quote >\b</quote > does not.</para > <para >That means that it will match for example within words: The expression <userinput >\Bin\B</userinput > will match at in <quote >window</quote > but not in <quote >integer</quote > or <quote >I'm in love</quote >.</para > </listitem > </varlistentry > <varlistentry > <term ><userinput >(?=PATTERN)</userinput > (Positive lookahead)</term > <listitem ><para >A lookahead assertion looks at the part of the string following a possible match. The positive lookahead will prevent the string from matching if the text following the possible match does not match the <emphasis >PATTERN</emphasis > of the assertion, but the text matched by that will not be included in the result.</para > <para >The expression <userinput >handy(?=\w)</userinput > will match at <quote >handy</quote > in <quote >handyman</quote > but not in <quote >That came in handy!</quote ></para > </listitem > </varlistentry > <varlistentry > <term ><userinput >(?!PATTERN)</userinput > (Negative lookahead)</term > <listitem ><para >The negative lookahead prevents a possible match to be acknowledged if the following part of the searched string does match its <emphasis >PATTERN</emphasis >.</para > <para >The expression <userinput >const \w+\b(?!\s*&)</userinput > will match at <quote >const char</quote > in the string <quote >const char* foo</quote > while it can not match <quote >const QString</quote > in <quote >const QString& bar</quote > because the <quote >&</quote > matches the negative lookahead assertion pattern.</para > </listitem > </varlistentry > </variablelist > </para> </sect1> <sect1 id="backreferences"> <title >Back References</title> <para ></para> </sect1> </appendix>