<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> <html> <head> <meta http-equiv="content-type" content="text/html; charset=UTF-8"> <title>Character classes</title> </head> <body><div class="manualnavbar" style="text-align: center;"> <div class="prev" style="text-align: left; float: left;"><a href="regexp.reference.dot.html">Dot</a></div> <div class="next" style="text-align: right; float: right;"><a href="regexp.reference.alternation.html">Alternation</a></div> <div class="up"><a href="reference.pcre.pattern.syntax.html">PCRE regex syntax</a></div> <div class="home"><a href="index.html">PHP Manual</a></div> </div><hr /><div id="regexp.reference.character-classes" class="section"> <h2 class="title">Character classes</h2> <p class="para"> An opening square bracket introduces a character class, terminated by a closing square bracket. A closing square bracket on its own is not special. If a closing square bracket is required as a member of the class, it should be the first data character in the class (after an initial circumflex, if present) or escaped with a backslash. </p> <p class="para"> A character class matches a single character in the subject; the character must be in the set of characters defined by the class, unless the first character in the class is a circumflex, in which case the subject character must not be in the set defined by the class. If a circumflex is actually required as a member of the class, ensure it is not the first character, or escape it with a backslash. </p> <p class="para"> For example, the character class [aeiou] matches any lower case vowel, while [^aeiou] matches any character that is not a lower case vowel. Note that a circumflex is just a convenient notation for specifying the characters which are in the class by enumerating those that are not. 
It is not an assertion: it still consumes a character from the subject string, and fails if the current pointer is at the end of the string. </p> <p class="para"> When case-insensitive (caseless) matching is set, any letters in a class represent both their upper case and lower case versions. For example, a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a case-sensitive (caseful) version would. </p> <p class="para"> The newline character is never treated in any special way in character classes, regardless of the setting of the <a href="reference.pcre.pattern.modifiers.html" class="link">PCRE_DOTALL</a> or <a href="reference.pcre.pattern.modifiers.html" class="link">PCRE_MULTILINE</a> options. A class such as [^a] will always match a newline. </p> <p class="para"> The minus (hyphen) character can be used to specify a range of characters in a character class. For example, [d-m] matches any lower case letter from d to m, inclusive. If a minus character is required in a class, it must be escaped with a backslash or appear in a position where it cannot be interpreted as indicating a range, typically as the first or last character in the class. </p> <p class="para"> It is not possible to have the literal character "]" as the end character of a range. A pattern such as [W-]46] is interpreted as a class of two characters ("W" and "-") followed by a literal string "46]", so it would match "W46]" or "-46]". However, if the "]" is escaped with a backslash it is interpreted as the end of the range, so [W-\]46] is interpreted as a single class containing a range followed by two separate characters. The octal or hexadecimal representation of "]" can also be used to end a range. </p> <p class="para"> Ranges operate in the ASCII collating sequence. They can also be used for characters specified numerically, for example [\000-\037]. 
If a range that includes letters is used when case-insensitive (caseless) matching is set, it matches the letters in either case. For example, [W-c] is equivalent to [][\\^_`wxyzabc], matched case-insensitively, and if character tables for the "fr" locale are in use, [\xc8-\xcb] matches accented E characters in both cases. </p> <p class="para"> The character types \d, \D, \s, \S, \w, and \W may also appear in a character class, and add the characters that they match to the class. For example, [\dABCDEF] matches any hexadecimal digit written with the upper case letters A to F. A circumflex can conveniently be used with the upper case character types to specify a more restricted set of characters than the matching lower case type. For example, the class [^\W_] matches any letter or digit, but not underscore. </p> <p class="para"> All non-alphanumeric characters other than \, -, ^ (at the start) and the terminating ] are non-special in character classes, but it does no harm if they are escaped. The pattern delimiter is always special and must be escaped when it is used inside the pattern. </p> <p class="para"> Perl supports the POSIX notation for character classes. This uses names enclosed by <em>[:</em> and <em>:]</em> within the enclosing square brackets. PCRE also supports this notation. For example, <em>[01[:alpha:]%]</em> matches "0", "1", any alphabetic character, or "%". 
The supported class names are: <table class="doctable table"> <caption><strong>Character classes</strong></caption> <tbody class="tbody"> <tr><td><em>alnum</em></td><td>letters and digits</td></tr> <tr><td><em>alpha</em></td><td>letters</td></tr> <tr><td><em>ascii</em></td><td>character codes 0 - 127</td></tr> <tr><td><em>blank</em></td><td>space or tab only</td></tr> <tr><td><em>cntrl</em></td><td>control characters</td></tr> <tr><td><em>digit</em></td><td>decimal digits (same as \d)</td></tr> <tr><td><em>graph</em></td><td>printing characters, excluding space</td></tr> <tr><td><em>lower</em></td><td>lower case letters</td></tr> <tr><td><em>print</em></td><td>printing characters, including space</td></tr> <tr><td><em>punct</em></td><td>printing characters, excluding letters and digits</td></tr> <tr><td><em>space</em></td><td>white space (not quite the same as \s)</td></tr> <tr><td><em>upper</em></td><td>upper case letters</td></tr> <tr><td><em>word</em></td><td>"word" characters (same as \w)</td></tr> <tr><td><em>xdigit</em></td><td>hexadecimal digits</td></tr> </tbody> </table> The <em>space</em> characters are HT (9), LF (10), VT (11), FF (12), CR (13), and space (32). Notice that this list includes the VT character (code 11). This makes "space" different from <em>\s</em>, which does not include VT (for Perl compatibility) in PCRE releases before 8.34; from 8.34 onwards <em>\s</em> also matches VT, following Perl 5.18. </p> <p class="para"> The name <em>word</em> is a Perl extension, and <em>blank</em> is a GNU extension from Perl 5.8. Another Perl extension is negation, which is indicated by a <em>^</em> character after the colon. For example, <em>[12[:^digit:]]</em> matches "1", "2", or any non-digit. </p> <p class="para"> In UTF-8 mode, characters with values greater than 127 do not match any of the POSIX character classes. 
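<div class="example"> As a rough illustration of the rules described in this section, the following sketch checks several of the example classes with <em>preg_match()</em>. The helper <em>matches()</em> is a hypothetical wrapper introduced only for this example, and the results shown in the comments assume default settings (no locale-specific character tables and no <em>u</em> modifier).
<pre>
&lt;?php
// Hypothetical helper for this sketch (not a PHP built-in):
// returns true when $pattern matches somewhere in $subject.
function matches(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

// A negated class must consume a character, so it fails on an empty subject.
var_dump(matches('/[^aeiou]/', ''));              // bool(false)

// A newline is an ordinary member of a negated class.
var_dump(matches('/[^a]/', "\n"));                // bool(true)

// Caseless matching widens the letters in a class to both cases.
var_dump(matches('/[^aeiou]/', 'A'));             // bool(true)
var_dump(matches('/[^aeiou]/i', 'A'));            // bool(false)

// An unescaped "]" ends the class: [W-] is followed by the literal "46]".
var_dump(matches('/^[W-]46]$/', '-46]'));         // bool(true)

// An escaped "]" can end a range: W-] covers W, X, Y, Z, [, \ and ].
var_dump(matches('/^[W-\]46]$/', 'X'));           // bool(true)
var_dump(matches('/^[W-\]46]$/', 'W46]'));        // bool(false)

// Character types add the characters they match to the class.
var_dump(matches('/^[\dABCDEF]+$/', '1F'));       // bool(true)
var_dump(matches('/^[^\W_]+$/', 'a_1'));          // bool(false)

// POSIX class names, including the negated form.
var_dump(matches('/^[01[:alpha:]%]+$/', 'a0%'));  // bool(true)
var_dump(matches('/^[12[:^digit:]]+$/', '3'));    // bool(false)
</pre>
The patterns are written in single-quoted strings so that backslash sequences such as \d and \] reach the regex engine unchanged. </div>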
</p> </div></body></html>